Metric Definitions & Fragmentation
We’re in election week here in the United States. As I write this, the Senate has been called, but not the House. In keeping with the times, fact checkers continue to have their hands full battling disinformation: 2022 Election Archives — FactCheck.org. As seen in public life and on social media, it is exceedingly hard to have a structured & rational conversation around problems if we don’t have a shared objective representation of reality. The same applies to corporate decision making in professionally run enterprises everywhere.
The primary currency for data-driven decisions in a modern corporation is a metric: a value computed using a definition. A metric's definition means different things to different people.
- To most consumers of metrics, a metric's definition is synonymous with its descriptive name and semantics (e.g., Microsoft M365 Monthly Active Users). It is also closely identified with the source (or owner) that publishes it.
- A select few will equate the definition with the underlying declarative code (typically SQL or similar analogs) and the assets on which that logic is defined. These can vary widely in complexity. At one extreme, we have very simple measures expressed over a single source with a simple aggregation (often seen in the case of operational metrics built over product telemetry). At the other extreme, as often seen in the case of refereed metrics (see here for an earlier post describing this term), the definition can be expressed over assets which themselves have complex lineage, with layers of derivative processing and dozens, if not hundreds, of source signals. More importantly, those assets may themselves reflect more granular definitions negotiated with other stakeholders, including the authoritative owners of the sources.
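As a minimal sketch of the simple end of that spectrum (the table and column names below are invented for illustration, not actual assets), such an operational metric can be little more than a single aggregation over one telemetry source:

```sql
-- Illustrative only: a simple operational metric over a single telemetry source.
-- "Active" is implicitly defined as "emitted at least one usage event in the month".
SELECT
    DATE_TRUNC('month', EventTimestamp) AS MetricMonth,
    COUNT(DISTINCT UserId)              AS MonthlyActiveUsers
FROM ProductUsageEvents
GROUP BY DATE_TRUNC('month', EventTimestamp);
```

Even in this trivial case, deciding what counts as a qualifying usage event is a definitional choice, which is where the next point picks up.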
For instance, in the above example, someone had to decide what an “Active” user means. For something like a Microsoft 365 subscriber, who can use a variety of apps and services consumed on different types of surfaces, these granular definitions are typically negotiated on a per-product, per-surface basis and involve individuals from different functional organizations (finance, sales, etc.) in addition to the product teams themselves. Other examples of such granular decisions (a sketch of how they can surface in code follows this list):
o How do we define whether a user is commercial or consumer?
o If a user has many subscriptions, which one do we designate as primary?
o How do we identify whether a customer is a European customer?
o How do we identify customers who represent charities, governments, educational institutions, etc.?
o How do we identify an inactive customer, i.e. someone who has not used the product in the recent past?
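As a hedged sketch of how a few of these decisions might surface in code (every table, column and threshold below is invented for illustration), a derived user-grain asset could embed them as expressions like these:

```sql
-- Illustrative only: each expression encodes a negotiated business decision, not just code.
SELECT
    u.UserId,
    -- "Active" negotiated as: at least one qualifying usage event in the trailing 28 days
    CASE WHEN MAX(e.EventTimestamp) >= CURRENT_DATE - INTERVAL '28' DAY
         THEN 1 ELSE 0 END                                    AS IsActive,
    -- Commercial vs. consumer negotiated off the offer type of the primary subscription
    -- (which subscription counts as "primary" is itself a negotiated upstream definition)
    CASE WHEN s.OfferCategory IN ('Enterprise', 'Business')
         THEN 'Commercial' ELSE 'Consumer' END                AS CustomerSegment,
    -- "European" negotiated as billing geography, not sign-in location
    CASE WHEN s.BillingCountry IN (SELECT CountryCode FROM EuropeanCountries)
         THEN 1 ELSE 0 END                                    AS IsEuropean
FROM Users u
LEFT JOIN UsageEvents e          ON e.UserId = u.UserId
LEFT JOIN PrimarySubscriptions s ON s.UserId = u.UserId
GROUP BY u.UserId, s.OfferCategory, s.BillingCountry;
```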
So, beyond the complex code spread across layered pipelines and owned by different teams, the definition of a metric also reflects consensus around key pieces of business logic embedded throughout its construction. We’ll soon see why this last bit is important.
This sketch illustrates how metric definitions are much like an iceberg: very little is visible at the surface.
A metric definition fragments when an alternative implementation comes into existence that has a similar name and semantic description and is portrayed as an equivalent.
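As a purely hypothetical illustration (both queries and all names below are invented), here are two implementations that could each plausibly be labeled "Monthly Active Users" yet will rarely return the same value:

```sql
-- Implementation A: calendar-month window over a curated, deduplicated user-grain asset
SELECT COUNT(DISTINCT UserId) AS MonthlyActiveUsers
FROM CuratedUserActivity
WHERE ActivityMonth = '2022-10';

-- Implementation B: trailing-28-day window computed directly over raw telemetry,
-- with a different notion of who counts as a "user"
SELECT COUNT(DISTINCT DeviceAccountId) AS MonthlyActiveUsers
FROM RawProductTelemetry
WHERE EventTimestamp >= DATE '2022-10-31' - INTERVAL '28' DAY;
```

Neither is wrong in isolation; they simply encode different windows, sources and notions of a user, which is precisely what makes reconciling them after the fact so expensive.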
In many cases, this isn't much of an issue when the consumer base for the metric is small and localized (such as members of the same engineering crew). However, when it happens to a highly visible metric and we end up with two different values, or values that don't add up, it can quickly become a charged issue. Presenting inconsistent metric values, especially to a senior audience who base a principal part of their decision making on these numbers, is the equivalent of throwing a flashbang. After the smoke clears, there is going to be a good deal of confusion and anxiety. Key stakeholders will immediately want to know why there are two values, what is different about them and why they cannot be reconciled, and no one in the room will have great answers. Even worse, a reasonable answer is going to take significant investigation. It will require data scientists, program managers, data engineers and, in many cases, product engineers to fully quantify the delta. And once they do, they must find a way to explain it to a room with limited patience and appetite for understanding how the sausage gets made. In the interim, metric consumers are working with two values of the metric, which may drive different decisions. It is a recipe for stress, frustration, friction and an erosion of trust.
Unlike the unfortunate disinformation in our public sphere, the typical root causes of metric fragmentation are bereft of malicious intent and can be distilled to a few key reasons:
1. Lack of cross-organizational awareness. A team may simply not know that a metric it defined happens to semantically overlap with a high-profile metric. In a company like Microsoft, where it is relatively simple to compute and publish dashboards and scorecards, this can easily happen.
2. Difficulty in accessing the definitional code and/or the underlying assets used to compute the metric. In the absence of a quick, integrated way to access this information from the scorecard itself, people are likely to make assumptions about how a metric is calculated and make expedient choices, which often involve using alternative sources and re-engineering the definition.
3. Definitions that are hard to grok and difficult to reuse. Ease of understanding and likelihood of reuse typically go hand in hand. Two common use cases for reusing a metric definition (using the same example) are as follows:
a. Produce a new dimensional cut of the same metric. For example, Microsoft M365 Monthly Active Users by Subscription Type.
b. Use the definition of the metric but filtered to a specific segment. For example, Microsoft M365 Monthly Active Users for users exposed to Teams Preview.
If the implementation of a metric is complex, if the owners of the implementation cannot accommodate (or prioritize) variants of the original, or if the underlying data sources cannot easily be updated to include the additional information, the likely outcome is an alternative implementation that drifts from the original.
A key aspect related to this bullet is that modern data stacks typically use processing engines that support a hybrid of declarative and imperative programming paradigms and languages. This is a quick introduction that highlights the differences: Declarative vs imperative programming: 5 key differences (educative.io). Without getting into the pros and cons of these paradigms, the key takeaway is that it is a lot more tractable to build tooling that can check code for semantic equivalence (i.e. does code X mean the same as code Y) when the definition is expressed in declarative languages (like SQL) rather than imperative ones (like C# and Java). In other words, unless metric definitions live in languages with precise declarative semantics, it becomes difficult to compare two definitions or to programmatically create variants of a definition (as shown above, to produce new custom cuts or to recompute it for a custom segment; see the sketch below).
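To make the two reuse cases above concrete (continuing with invented names rather than any actual assets), a declarative base definition reduces both variants to mechanical rewrites instead of re-engineered copies:

```sql
-- Shared base definition (names invented for this sketch)
CREATE VIEW M365MonthlyActiveUsers AS
SELECT MetricMonth, COUNT(DISTINCT UserId) AS MonthlyActiveUsers
FROM CuratedUserActivity
GROUP BY MetricMonth;

-- (a) New dimensional cut: the same definition with one extra grouping column
SELECT MetricMonth, SubscriptionType, COUNT(DISTINCT UserId) AS MonthlyActiveUsers
FROM CuratedUserActivity
GROUP BY MetricMonth, SubscriptionType;

-- (b) Segment filter: the same definition restricted to an exposed population
SELECT MetricMonth, COUNT(DISTINCT UserId) AS MonthlyActiveUsers
FROM CuratedUserActivity
WHERE UserId IN (SELECT UserId FROM TeamsPreviewExposures)
GROUP BY MetricMonth;
```

Because the variants differ from the base only in a grouping column and a filter, tooling that operates on a declarative representation can generate them, and reason about how they relate to the original, far more easily than it could against imperative code.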
Given that IDEAs is a central data and programmability team, it should be no surprise that metric fragmentation is a primary concern for us. In fact, it was one of the founding reasons why this team came into existence. Our data scientists spend considerable time vetting definitional fragmentation with stakeholders and partner teams. But this is, unsurprisingly, an expensive exercise reserved for the most highly visible metric definitions. To address the issue in a scalable fashion, we have a couple of ongoing investments, which we'll touch upon briefly.
The first is a cross-organizational, collaborative data modeling effort, underpinned by rules enforced by our data engineering stack that protect and promote reuse of definitions at the entity grain (e.g. user, device, tenant). We call this the Unified Data Model (UDM), and a key goal is to corral as much complexity (and imperative code) as possible beneath "blessed" attributes at the entity-grain layer and then to reuse these assets as broadly as possible across teams.
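A hedged sketch of the idea (the asset and attribute names are invented, not actual UDM assets): the messy enrichment lives upstream, and what gets published is a user-grain asset whose attributes carry the agreed definitions.

```sql
-- Illustrative only: a "blessed" user-grain asset published through the UDM.
-- The imperative enrichment that produces these attributes is corralled upstream;
-- downstream consumers see only the agreed entity-grain definitions.
CREATE VIEW UDM_Users AS
SELECT
    UserId,
    IsActive28Day,            -- the negotiated "active" definition, computed once
    CustomerSegment,          -- commercial vs. consumer, computed once
    PrimarySubscriptionType,  -- the agreed "primary subscription" designation
    IsEuropean
FROM EnrichedUserProfile;     -- stand-in for the layered pipelines feeding this asset
```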
The second is our investment in the self-serve, low-code space, including support for metric authoring. The user experience restricts metric definitions to approved entity-grain assets from the UDM. While the experience is predominantly UX driven, definitions are captured in a JSON DSL that is fully declarative and therefore programmatically comparable and rewritable for applicable use cases. The tradeoff is limited expressiveness: the DSL is roughly equivalent to a subset of the SQL SELECT statement. Since most refereed metrics can be defined on top of entity-grain assets, the combination of a declarative representation and a data model that is continuously enriched with entity-grain definitions offers a pragmatic way to address the risk of fragmentation without compromising the needs of the business.
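For intuition only, and without reproducing the actual DSL, the definitions it captures are roughly the shape of the following SQL, restricted to a single approved entity-grain asset (names continue the hypothetical sketch above):

```sql
-- Roughly the expressive ceiling of the authoring experience, rendered as SQL:
-- one approved entity-grain asset, simple filters, groupings and an aggregation.
SELECT
    PrimarySubscriptionType,
    COUNT(DISTINCT UserId) AS MonthlyActiveUsers
FROM UDM_Users
WHERE IsActive28Day = 1
  AND CustomerSegment = 'Commercial'
GROUP BY PrimarySubscriptionType;
```

Constraining definitions to this narrow, declarative shape is what keeps them programmatically comparable and rewritable.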
Finally, we continue to invest in discoverability of definitional code and asset lineage for metric consumers by embedding links from our dashboards and scorecards to our self-serve tools and data catalogs.
Since this is an issue that impacts most large data-driven organizations, I'd love to hear how teams operating in similar domains tackle it.
Until next time. Back to obsessing over who wins the House and celebrating metric owners in public life fighting disinformation everywhere.