Skip to content
Open
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
94 changes: 94 additions & 0 deletions weave/trait/telemetry/wdm/telemetry_wdm_trait.proto
Original file line number Diff line number Diff line change
@@ -0,0 +1,94 @@
//
// Copyright (c) 2016 Nest Labs, Inc.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix the copyright header: 2020, Google LLC, Apache license.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// All rights reserved.
//
// This document is the property of Nest. It is considered
// confidential and proprietary information.
//
// This document may not be reproduced or transmitted in any form,
// in whole or in part, without the express written permission of
// Nest.
//

// @file
// This file specifies a WDM Telemetry trait that specifies the
// various telemetry counters that are associated with WDM

syntax = "proto3";

package weave.trait.telemetry.wdm;

import "wdl/wdl_options.proto";
import "google/protobuf/timestamp.proto";

option java_outer_classname = "WeaveInternalTelemetryWdmTrait";
option objc_class_prefix = "SCM";

// @brief
// TelemetryWdmTrait
//
// A telemetry trait for the WDM that specifies various counters
// associated with WDM updates and subscriptions.
//
message TelemetryWdmTrait {
option (wdl.message_type) = TRAIT;
option (wdl.trait) = {
stability: ALPHA,
id: 0x1781,
vendor_id: 0x0000,
version: 1
};
option (wdl.properties) = {
writable: READ_ONLY,
};

// Events associated with WDM updates
message TelemetryWdmUpdateStatsEvent {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to log the update and subscription failures as first class events rather than remembering the last event in a different event? The semantics of most of the field in this trait make quite a lot of sense as properties. As events I would suggest we expose 3 different events: subscription failure (fields: reason), update failure (fields: reason, profile, status_code), and counts (fields: successful_updates, unsuccessful_updates, successful_subscriptions, subscription_attempts). If desired, the last subscription failure and last update failure could also be exposed as properties, so they are available for inspection on connectivity recovery

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think the last failures would better serve as trait properties. That makes sense especially if these events are not uploaded periodically and visibility of the last failures might fade unless preserved as properties.
If we break these into multiple events, then the subscription and update failures could be triggered on the edges while the counter is a periodic one with a relatively long cadence(~20-30 mins)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified to what was discussed offline , update on edge events for failure with properties in case the events are dropped lost due to lack of buffer space.

option (wdl.message_type) = EVENT;
option (wdl.event) = {
id: 1
event_importance: EVENT_IMPORTANCE_DEBUG
};

// This field holds the count of successful updates since the
// last restart.
uint32 successful_updates_count = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these metrics aggregated across different subscriptions?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is accumulated across multiple subscriptions.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this looks like an aggregate of all Updates(presumably unconditional and conditional). It might be good to separate conditional vs unconditional updates also?


// This field holds the count of unsuccessful updates since the
// last restart.
uint32 unsuccessful_updates_count = 2;

// This field holds the count of successful subscriptions since the
// last restart.
uint32 successful_subscriptions_count = 3;

// This field holds the count of subscription attempts since the
// last restart.
uint32 subscription_attempt_count = 4;

// This field holds the timestamp of the last update failure.
google.protobuf.Timestamp last_update_failure_time = 5 [(wdl.param) = {
timestamp_constraints: { signed: true, precision: 0.001, width: 64 }
}];

// This field holds the timestamp of the last subscription failure.
google.protobuf.Timestamp last_subscription_failure_time = 6 [(wdl.param) = {
timestamp_constraints: { signed: true, precision: 0.001, width: 64 }
}];


// This field holds the error code of the last update failure.
int32 last_update_failure_reason = 7;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the last failure update, it would also make some sense to note the address of the subscriber.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


// This field holds the profile ID of the last update failure.
uint32 last_update_failure_profile = 8;

// This field holds the status code of the last update failure.
uint32 last_update_failure_status_code = 9;

// This field holds the error code of the last subscription failure.
int32 last_subscription_failure_reason = 10;

}
}