Trusting Metrics at Pinterest
Ryan Cooke | Engineering Manager
Pinterest, like many tech companies, relies heavily on data to help inform decisions and power machine learning. This strategy fundamentally depends on the data being accurate. Over the years we’ve worked to improve the processes behind trusting core decision-making data and ensure metrics remain accurate.
How Data Can Go Wrong
For those who haven’t thought about metrics, this may seem like an odd problem. Something like Daily Active Users (DAU) may sound simple to measure, but here are some examples of how such a simple metric may go wrong:
- An extension version of the app could keep auth tokens fresh by auto-logging in users every day, regardless of whether the user actually used the extension.
- A user whose session spans midnight may count as active on both days on one platform, but not on another.
- Non-qualifying activity such as using a browser extension could be counted as DAU.
Despite these cases, logging active users tends to be one of the simpler cases, so I’ll continue to use it as an example.
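The cross-midnight case above comes down to how each platform buckets activity into calendar days. A minimal sketch of one consistent rule (my illustration, not Pinterest's actual implementation): bucket every activity event by its timestamp in a single canonical timezone, so a session spanning midnight yields two active days on every platform.

```python
from datetime import datetime, timezone

def active_days(event_timestamps):
    """Bucket activity events into calendar days, using UTC as the
    single canonical timezone.

    Bucketing each event (rather than each session) removes the
    cross-midnight ambiguity: a session that straddles midnight simply
    produces activity on both days, identically on every platform.
    """
    return {
        datetime.fromtimestamp(ts, tz=timezone.utc).date()
        for ts in event_timestamps
    }

# A session from 23:53 to 00:01 UTC counts as two active days:
spanning = active_days([1700006000, 1700006500])  # Nov 14 and Nov 15, 2023 UTC
```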
To make sure that a metric is accurate, we’ve created a “certification” process. This certification process (shown in Figure 1) has three key steps:
Step 1: Determine the correct behavior
Before starting, it’s essential to have an unambiguous view of what the correct behavior is. We create two documents for every certified metric. The first is a product spec, which gives a clear definition; for an active user, that’s someone who:
- Visits the Pinterest website or native mobile apps
- Pins or repins using an offsite tool such as a Web browser extension, Web Pin it button, Mobile Web Facebook Messenger extension, or the iOS/Android share extension
This definition gives the driving core for thinking about the metric, but the doc continues on to outline edge cases and describe the correct behavior in a variety of cases (e.g. “opens Pinterest, but leaves before UI is visible” should not count as an active user).
The product spec is supplemented by a technical spec, written more for engineers. It pins down the exact method of logging the event and its metadata, and documents exactly how each platform achieves a consistent implementation.
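To make the idea of a technical spec concrete, here is a hypothetical sketch of the kind of event schema such a spec might pin down. The field names are illustrative assumptions, not Pinterest's actual logging format; the point is that every platform emits the same event shape.

```python
from dataclasses import dataclass

# Hypothetical schema sketch -- field names are illustrative only.
# A technical spec would fix these exactly so iOS, Android, and Web
# all log an identical event.
@dataclass(frozen=True)
class ActiveUserEvent:
    user_id: int
    platform: str        # e.g. "ios", "android", "web"
    surface: str         # e.g. "app", "share_extension"
    event_time_ms: int   # client timestamp, UTC epoch milliseconds
```

Freezing the dataclass mirrors the spec's intent: once logged, an event's fields are facts, not something downstream code should mutate.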
These steps may seem obvious, but every metric we’ve audited has had definitional issues. The product spec and technical spec ensure that everybody means the exact same thing when thinking about a metric. These specs can also provide a framework for thinking through future use cases. I recommend that the consumers of the data create the definition, and that the rest of the specs be written around it. It’s reasonable for engineers to offer feedback on what can be measured with a simple solution. The simpler the logging solution, the less likely it is to have issues.
Step 2: Implement the metric/fix existing issues
The next major step is to make the metric accurate. The first layer of this can be done easily by the client engineers just following the definition and implementing accordingly. In practice, there will be edge cases and oddities that are unreasonable to expect a client engineer to anticipate.
To find edge cases, we create data checker criteria that help prove the data matches our expectations. These generally will send an alert if the data doesn’t pass a specific “check.” For the DAU example, one of our more successful checker types looks for “unexplained DAU.” This looks for users that have the active user event but lack other activity we would expect, such as seeing a pin. We rarely aim for 100% perfection on these checks, just enough to catch if something is odd. The iteration between finding issues via data and shipping fixes tends to determine how long a certification takes to complete.
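The "unexplained DAU" check can be sketched roughly as follows. This is my own simplified, in-memory illustration of the idea, assuming a hypothetical alert threshold; in practice a check like this would run as a scheduled query against the event warehouse.

```python
def unexplained_dau(active_user_ids, supporting_event_user_ids, threshold=0.01):
    """Flag when too many 'active' users show no supporting activity.

    active_user_ids: users who logged the active-user event that day.
    supporting_event_user_ids: users with expected activity (e.g. saw a pin).
    threshold: illustrative tolerance -- the text notes these checks aim to
    catch oddities, not 100% perfection.

    Returns (should_alert, set_of_unexplained_users).
    """
    unexplained = active_user_ids - supporting_event_user_ids
    rate = len(unexplained) / max(len(active_user_ids), 1)
    return rate > threshold, unexplained

# Example: user 4 was logged as active but never saw a pin.
alert, users = unexplained_dau({1, 2, 3, 4}, {1, 2, 3})
```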
Additionally, we can prove a metric is accurate through UI tests. For DAU, the simplest case would be to open the app, log in, and confirm the user is logged as a DAU. Unlike the data checks, these allow us to find issues before a change ships. We generally decide which UI tests to write based on the cases outlined in the product spec and the volume of events each case accounts for.
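The simplest DAU test described above might look something like this sketch. The app stand-in here is a hypothetical in-memory stub of my own; a real test would drive the actual app through the platform's UI automation framework and inspect the logging pipeline.

```python
# Hypothetical UI-test sketch: drive a (stubbed) app session and assert
# that the active-user event was logged. FakeApp is an illustrative
# stand-in, not a real Pinterest test harness.
class FakeApp:
    def __init__(self):
        self.logged_events = []

    def open_and_log_in(self, user_id):
        # Per the product spec, reaching a visible, logged-in UI is what
        # qualifies as "active" -- so this is where the event fires.
        self.logged_events.append(("active_user", user_id))

def test_login_logs_active_user():
    app = FakeApp()
    app.open_and_log_in(user_id=42)
    assert ("active_user", 42) in app.logged_events
```

Because a test like this runs in CI, a change that stops the event from firing fails before release rather than showing up later as a mysterious dip in the metric.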
Step 3: Keep the metric accurate
After the above steps, you should have a metric that is accurate. But if you want the metric to continue to be accurate, you need to do a bit more. The UI tests and data checkers built in the last step are run recurrently to prove the metric remains accurate. This means that changes failing the UI tests can’t be released, and data checker failures will lead to investigations. It’s key to keep in mind that uncovering why a metric moved in production is much more difficult and potentially costly than catching the issue before release.
Lastly, on a recurring basis (quarterly or annually, depending on the metric), we do health checks on the metric. This includes reviewing the checkers to make sure they are still running effectively and working with product teams to ensure that there are not new use cases that need to be considered.
The above three steps outline our metric certification process. This process is only possible thanks to significant cross-functional work by engineers on every platform (iOS, Web, Android), data engineers, and tireless TPMs. We only recommend it for metrics that need to be as close to perfect as possible.