A Scalable Telemetry Acquisition Strategy: Part 1

Jack Pullikottil
4 min read · Oct 10, 2022

--

This is a two-part post developing the elements of a telemetry acquisition strategy that we’ve been pursuing over several years with multiple partner teams within the company.

Telemetry is an important type of signal for assessing the health of our product portfolio and the extent to which our customers get value from our products. That makes it one of the more important types of data sources that go into shipping a refereed metric (discussed in my earlier post).

As data sources go, it is also a sensitive one, with significant controls and oversight around how it is collected, secured, and used. This is a key concern for our data engineering stack, but a topic for a later time. The focus of today’s conversation is the challenges of sourcing such telemetry in a company like Microsoft.

The product portfolio at Microsoft is huge (see the List of Microsoft software on Wikipedia), and that list does not show SKU variations (which have their own operational customizations), internal platform engineering, work under development, recent acquisitions, research, and so on.

Microsoft also has a lot of diversity when it comes to how it builds software. It has operated many high scale services for decades and spins up new ones literally every day. A considerable fraction of its code base is not service centric (as can be seen from the list). This is the living output of tens of thousands of developers (some of the rows on that list have team sizes in the thousands). The company also maintains tons of business-critical legacy while simultaneously spawning unending innovation in developer tools and libraries (both internal and external). It also has an ever-growing appetite to use (and contribute to) the latest and greatest in open-source technologies.

This diversity is further amplified by the fact that these developer teams operate in different product organizations with different business models — and therefore prioritization. This has a deep impact on how teams approach dev-ops practices. Engineering edicts seldom span large organizations, and almost never endure as organizations morph to reflect changing business goals and operational realities.

This heterogeneity gives the company many competitive advantages — but it also makes for some daunting challenges which may not even exist in other engineering shops where the dev stack looks mostly the same all over. Telemetry is simply one such area that has direct and outsized impact to a central data team like ours. There are multiple ecosystems of tools, libraries and services for instrumentation, collection, event sinks, data lakes, processes and compliance barriers etc.

Here are some examples of how this heterogeneity throws up challenges for telemetry consumers that try to build cross-product metrics (e.g., a user-engagement metric across consumer products):

1. Event Schema

a. Envelope serialization: Can differ depending on the instrumentation stack and collectors used: CSV, JSON (flattened vs. nested), etc.

b. Mandatory attributes: May be missing, malformed, or of the wrong type.

c. Value encoding: A user or device can be identified by different ID types (pseudonymous or anonymous), with different encoding styles, and with or without prefixes identifying the type.

d. Metadata consistency: Events often include metadata attributes which are critical for aggregation. There must be a finite set of values used for “OS Type”. What happens if a new value starts showing up?

2. Semantics: What should go into a “Service” attribute? The container host, the logical name of the service family, the microservice, or the microservice and its deployed instance?

3. Routing & SLA: With multiple telemetry collection and distribution stacks, finding and setting up ingress routes with predictable SLAs must be tackled on a case-by-case basis.

4. Sampling: No consistent sampling strategy across Apps & Products and no single place where the sampling strategy is documented.

5. Correlation Identifiers: A legion of correlation identifiers exists, with different semantics and scopes, that span products and services.

6. Completeness: A variety of deployment-related settings, including compliance-related isolation, can lead to the event feed being incomplete or shunted off to a new data sink.
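To make the schema challenges concrete, here is a minimal sketch (in Python, with hypothetical attribute names and alias tables, not any real Microsoft schema) of what a consumer-side normalization shim ends up doing: flattening nested envelopes, mapping producer-specific attribute names onto a canonical set, and flagging missing mandatory attributes and unexpected metadata values rather than silently dropping events.

```python
# Hypothetical canonical attribute names; a real pipeline would define
# these in a shared schema registry, not inline.
REQUIRED = {"event_name", "timestamp", "user_id"}
KNOWN_OS_TYPES = {"Windows", "macOS", "Linux", "iOS", "Android"}

def flatten(event, prefix=""):
    """Flatten a nested JSON envelope into dot-separated keys."""
    flat = {}
    for key, value in event.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

def normalize(raw, aliases):
    """Map producer-specific attribute names onto canonical ones and
    return the event along with a list of schema problems."""
    flat = flatten(raw)
    event = {aliases.get(k, k): v for k, v in flat.items()}
    problems = sorted(REQUIRED - event.keys())
    os_type = event.get("os_type")
    if os_type is not None and os_type not in KNOWN_OS_TYPES:
        problems.append(f"unrecognized os_type: {os_type}")
    return event, problems

# Two producers emitting the "same" event with different envelopes.
nested = {"event_name": "AppLaunch", "timestamp": "2022-10-10T00:00:00Z",
          "user": {"id": "u-123"}, "os_type": "Windows"}
flat_style = {"EventName": "AppLaunch", "timestamp": "2022-10-10T00:00:00Z",
              "uid": "u-123", "os_type": "OS/2"}

# Per-producer alias tables: the tax every consumer pays.
e1, p1 = normalize(nested, {"user.id": "user_id"})
e2, p2 = normalize(flat_style, {"EventName": "event_name", "uid": "user_id"})
```

Here the nested event normalizes cleanly, while the second is flagged for an unrecognized “OS Type” value (challenge 1d). The real pain is that every consumer maintains its own alias tables and value allow-lists, one per producer.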

All these issues are easily recognized by data teams everywhere. Microsoft simply takes these challenges to a whole new level, at a time when the company is striving towards a data-driven culture focused on understanding the customer experience. Acquiring and leveraging telemetry for refereed metrics imposes a significant tax.

This is also not a new problem, nor one that hasn’t been tackled in different parts of the company at different points. There are great models in the company’s history, and in different orgs, where these challenges were taken on with strong sponsorship from executives.

In my next post, I’ll draw on some of these lessons, identify gaps, and sketch out the elements of a strategy for taking this on. But here’s a preview: we’re not going to make a dent in it by trying to reduce heterogeneity. And we may not have to.
