Delve Telemetry & Monitoring

Delve Engineering
7 min read · Jun 14, 2016


My name is Luka Bozic and I’m a software engineer working on Delve. My job is to enable each Delve stakeholder to easily answer the question: “How’s this Delve feature performing?”

It sounds like a very simple question but it usually hides lots of additional questions, depending on the context. When asked by Program Managers, it usually means “How many people are using this feature? How do they use it? Does it bring us more users?”. When asked by engineers, it’s mostly about “Is my feature working as expected? Are there any bugs?”.

My Orc army is pillaging with 99.99% success and my ring has been used 2 times in the last hour.

Our organization’s mantra is “Be data driven”. And in order to make data driven decisions, you need… well, data. That’s where our telemetry and monitoring pipelines come to the rescue.

High level principles

When designing our telemetry and monitoring architecture and picking the tools, we stick to one of Delve Engineering’s main principles: making it easy to fall into The Pit of Success. We do that by simplifying code instrumentation and data fetching. And we aim to go beyond simplifying: we want to provide as much as possible out of the box for our developers and PMs.

The Pit of Success?

When developing Delve, telemetry is added at the time of feature implementation, as opposed to “when needed”. It can be painful and costly to add missing instrumentation after you’ve already deployed and need insights right now.

I work as part of the Delve Monitoring team. Having a monitoring team, at least in our organization, does not mean that those guys are the only ones responsible for instrumenting the code, creating feature reports, monitoring the feature health, etc… Ownership of monitoring is distributed among feature owners. The monitoring team’s role is to provide feature teams with the infrastructure, tools, and guidance on how to get insights.

Logging

When designing the logging system, first we need to identify the logging needs. There are four main reasons why we need telemetry:

  • Detection: We need to be able to detect problems in our service.
  • Diagnostics: We need to be able to investigate problems.
  • Usage: We need to know how our service is used.
  • Performance: We need to know how fast our service is.

With our Logger, we are logging 3 types of logs:

  • Event: For reporting usage. Only explicit user actions (click, scroll, hover…).
  • Trace: Additional information to help investigate and diagnose issues.
  • Error: Used to report any problems detected.

Each of the log types can contain additional information if needed. The logger also adds some information to all log types by default: mostly context information such as correlationId, version, and timestamps.
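To make this concrete, here is a minimal TypeScript sketch of what such a logger could look like. It is an illustration only: the interface, method names, and the toy console-based implementation are assumptions for this example; only the three log types and the default correlationId/version/timestamp context come from the description above.

```typescript
// Hypothetical sketch of a logger with the three log types described above.
// Only correlationId, version and timestamp come from the article; everything else is assumed.

interface LogContext {
  correlationId: string;
  version: string;
  timestamp: string; // ISO 8601
}

interface Logger {
  logEvent(name: string, data?: Record<string, unknown>): void;    // explicit user actions
  logTrace(message: string, data?: Record<string, unknown>): void; // diagnostic information
  logError(error: Error, data?: Record<string, unknown>): void;    // detected problems
}

// A toy implementation that stamps every entry with the default context.
class ConsoleLogger implements Logger {
  constructor(private version: string, private correlationId: string) {}

  private context(): LogContext {
    return {
      correlationId: this.correlationId,
      version: this.version,
      timestamp: new Date().toISOString(),
    };
  }

  logEvent(name: string, data?: Record<string, unknown>): void {
    console.log({ type: "Event", name, ...this.context(), ...data });
  }

  logTrace(message: string, data?: Record<string, unknown>): void {
    console.log({ type: "Trace", message, ...this.context(), ...data });
  }

  logError(error: Error, data?: Record<string, unknown>): void {
    console.error({ type: "Error", message: error.message, ...this.context(), ...data });
  }
}
```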

Having a high-level log classification like this helps us build generic reports, such as a Top Errors or an Overall Usage report.

As Øystein already mentioned, most of the logging is added by default in our Flux architecture. Every Action is logged by default.

IAction interface
Action example
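The real interface and an example are in the screenshots above. As a rough TypeScript sketch of the idea (the shape of IAction and the dispatcher wiring here are assumptions, reusing the hypothetical Logger interface sketched earlier), every action carries enough metadata for the dispatcher to log it automatically:

```typescript
// Hypothetical sketch of default Action logging in a Flux dispatcher.

interface IAction {
  type: string;                       // action name, used as the event name
  payload?: Record<string, unknown>;  // action data (system metadata only)
}

class Dispatcher {
  constructor(private logger: Logger) {}

  dispatch(action: IAction): void {
    // Every dispatched action is logged as a usage Event by default,
    // so feature owners get instrumentation without writing any logging code.
    this.logger.logEvent(action.type, action.payload);

    // ...forward the action to the registered stores...
  }
}
```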

QoS/Reliability Monitors

QoS (Quality of Service) could be considered the backbone of Delve’s monitoring system. What do we actually mean by QoS? In our context, QoS is a scenario (transaction) based way of monitoring service health, reliability, performance and, implicitly, usage.

The base unit of our QoS is a Scenario. A QoS Scenario is uniquely defined by its name, which is automatically added to the start and end events. Instrumented in code, it usually wraps the piece of code that we want to monitor.

Sample QoS Monitor in C# code

There are 5 QoS event types that we use:

  • Start: This marks the beginning of a scenario.
  • Success: This concludes the successful run of a scenario.
  • ExpectedFailure: The scenario didn’t finish with success (happy path), but the failure is known/expected.
  • UnexpectedFailure: The scenario didn’t run successfully.
  • Unknown: The monitor was disposed, but no end event (success, expected or unexpected failure) was logged.

Every monitor ends on its dispose call. During the dispose, the monitor is evaluated, the duration of the execution is calculated, and only one end event is logged. This is done to ensure that we have an equal number of start and end events. If multiple end events happen during one scenario run, the following priorities are used to choose which event gets logged:

UnexpectedFailure > ExpectedFailure > Success
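A minimal TypeScript sketch of how such a monitor could behave (the class shape and the emit callback are assumptions for illustration; Delve’s actual monitor is the C# one shown above):

```typescript
// Hypothetical QoS monitor sketch: one start event, exactly one end event on dispose,
// chosen by priority UnexpectedFailure > ExpectedFailure > Success, else Unknown.

type QosEnd = "Success" | "ExpectedFailure" | "UnexpectedFailure";

// Priority used when several end events are recorded during one scenario run.
const END_PRIORITY: Record<QosEnd, number> = {
  Success: 0,
  ExpectedFailure: 1,
  UnexpectedFailure: 2,
};

class QosMonitor {
  private endEvent?: QosEnd;
  private readonly startTime = Date.now();

  constructor(
    private scenarioName: string,
    private emit: (eventName: string, data?: Record<string, unknown>) => void
  ) {
    // The scenario name is automatically added to the start event.
    this.emit(`${scenarioName}.Start`);
  }

  markSuccess() { this.record("Success"); }
  markExpectedFailure() { this.record("ExpectedFailure"); }
  markUnexpectedFailure() { this.record("UnexpectedFailure"); }

  private record(e: QosEnd): void {
    // Keep only the highest-priority end event seen so far.
    if (this.endEvent === undefined || END_PRIORITY[e] > END_PRIORITY[this.endEvent]) {
      this.endEvent = e;
    }
  }

  dispose(): void {
    // Exactly one end event is logged, so starts and ends stay balanced.
    // If no end event was recorded, the scenario ends as Unknown.
    const duration = Date.now() - this.startTime;
    this.emit(`${this.scenarioName}.${this.endEvent ?? "Unknown"}`, { duration });
  }
}
```

A caller would create the monitor at the start of a scenario, mark the outcome along the way, and dispose it in a finally block so the end event is always logged.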

QoS Metrics

The QoS logging described above allows us to compute a number of metrics that we use for health and reliability monitoring. For example:

ICE (Ideal Customer Experience) = successes / starts

ACE (Adjusted Customer Experience) = (successes + expected failures) / starts

In addition to the above, QoS also allows us to determine usage by the number of Starts and performance of scenarios by reading the duration property from end events.
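Given aggregated counts for one scenario, the metrics boil down to a couple of ratios. A sketch (the QosCounts shape is an assumption for the example):

```typescript
// Hypothetical sketch: computing ICE and ACE from aggregated QoS event counts.
interface QosCounts {
  starts: number;
  successes: number;
  expectedFailures: number;
}

// Ideal Customer Experience: only clean successes count against starts.
function ice(c: QosCounts): number {
  return c.starts === 0 ? 1 : c.successes / c.starts;
}

// Adjusted Customer Experience: known/expected failures are not held against us.
function ace(c: QosCounts): number {
  return c.starts === 0 ? 1 : (c.successes + c.expectedFailures) / c.starts;
}
```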

QoS Reports

Using our tools, we can see the QoS results for each of our scenarios. We have a report listing all our scenarios and their QoS metrics, so we can easily identify features with reliability issues. We also have a per scenario trend chart so we can see how the scenario behaves over time.

QoS trend chart — Showing reliability trend and usage (number of times the scenario was run) by day

Since we log the duration of every QoS scenario, we get some out-of-the-box performance reports as well.

Performance report showing 75th and 95th percentile for a scenario
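Computing those percentiles from the logged durations is straightforward; a small sketch using the nearest-rank method (the sample data is made up for the example):

```typescript
// Hypothetical sketch: nearest-rank percentiles over logged scenario durations (ms).
function percentile(durations: number[], p: number): number {
  const sorted = [...durations].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

const scenarioDurations = [120, 340, 95, 410, 230]; // example durations in milliseconds
const p75 = percentile(scenarioDurations, 75); // 340
const p95 = percentile(scenarioDurations, 95); // 410
```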

QoS & Flux architecture

One of the best examples of our Pit of Success principle is the way we do network calls in Delve. Every network call is, by default, wrapped within a QoS monitor. This ensures each call is properly instrumented.

Data Layer Promise Executor that wraps all network calls with a QoS Monitor
Network call to Delve middle tier to favorite a document with QoS monitor
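The real executor is shown in the screenshots above. As a simplified TypeScript sketch of the idea, reusing the hypothetical QosMonitor from earlier (the function name and endpoint are assumptions):

```typescript
// Hypothetical sketch: the data layer wraps every network call in a QoS monitor by default,
// so each call gets start/end events and a duration without any extra work by the feature owner.
async function executeRequest<T>(
  scenarioName: string,
  request: () => Promise<T>,
  emit: (eventName: string, data?: Record<string, unknown>) => void
): Promise<T> {
  const monitor = new QosMonitor(scenarioName, emit); // QosMonitor from the earlier sketch
  try {
    const result = await request();
    monitor.markSuccess();
    return result;
  } catch (error) {
    // A fuller executor would classify known error codes as ExpectedFailure instead.
    monitor.markUnexpectedFailure();
    throw error;
  } finally {
    monitor.dispose(); // logs exactly one end event, including the duration
  }
}

// Example (hypothetical endpoint): favoriting a document through the middle tier.
// await executeRequest("FavoriteDocument", () => fetch("/api/favorite", { method: "POST" }), emit);
```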

Alerting in Delve

If things go bad and something starts failing, we sure want to know about it ASAP and mitigate the issue before our support lines heat up! That’s why we rely on our Monitoring and Alerting system.

We use internal tools to create Monitors based on QoS logs. Every Monitor is defined with a QoS scenario name, severity level and thresholds. There are 3 threshold parameters that we use for our monitors:

  • Reliability: Success rate (ACE or ICE).
  • Time Window: How far back are we looking to evaluate the monitor?
  • Minimum sample size: How many scenario runs do we need (at minimum) to raise an alert? We don’t want to raise alerts when reliability looks low but there are only a few samples.

If all three thresholds are violated, we create an incident. Depending on the severity, incidents are either triaged by the teams during normal office hours or an On-Call Engineer (OCE) gets a phone call to mitigate the issue immediately.
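Conceptually, evaluating a monitor boils down to a check like the following sketch (the names are illustrative, not the internal tool’s):

```typescript
// Hypothetical sketch of monitor evaluation: an incident is created only when
// all three thresholds are violated.
interface MonitorDefinition {
  scenarioName: string;
  minReliability: number;    // e.g. 0.99, measured as ACE or ICE
  timeWindowMinutes: number; // how far back we look when evaluating
  minSampleSize: number;     // minimum scenario runs required to trust the signal
}

// runsInWindow and reliability are assumed to be computed over the last timeWindowMinutes.
function shouldCreateIncident(
  def: MonitorDefinition,
  runsInWindow: number,
  reliability: number
): boolean {
  return runsInWindow >= def.minSampleSize && reliability < def.minReliability;
}
```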

Our OCE getting a Sev1 call in the middle of the night

Tools

Our telemetry system is also built on internal tools and is designed to process data along three different paths:

  • Hot path: With a maximum data latency of less than 60 seconds, we can detect issues as soon as they appear. We only use this path to send QoS metrics.
  • Warm path: With about 5 minutes of latency, we use this path to store our logs, as a proxy to other analytics tools, and to forward data to the cold path.
  • Cold path: Used for long-term retention and mostly for historical usage reporting.

In combination with this system, we use an analytics tool code-named Kusto, externally known as Application Insights Analytics. Due to its performance and the ability to get insights about almost anything quickly, Kusto has become our team’s favorite telemetry/analytics tool.

Compliance & Privacy limitations

Probably the greatest challenge when working with telemetry in our team is dealing with the privacy requirements. One of Office 365’s competitive advantages is how we treat your data. As part of Office 365, Delve abides by the same requirements and takes that responsibility seriously, logging only system metadata. Delve never logs any specifics about our users or their content.

A lot of effort goes into making sure we use tools that handle data properly, according to our privacy standards. That is why we don’t use any of the popular third-party analytics tools.

Besides just picking the tools, a lot of attention is paid to how we use our telemetry and monitoring system in order to avoid leaking any customer data.

Beyond Delve

The principles and tools described here go beyond Delve. New services and apps that we are developing here in Oslo follow the same approach.

Let us know what telemetry tools you use! What is your approach to telemetry and monitoring? What are your biggest challenges in this area?
