Building a culture around metrics and anomaly detection

Kenny Bastani
Apache Pinot Developer Blog
Jul 27, 2020 · 5 min read

Anomaly detection is a very broad term. Usually it means that you want to see whether things are running as usual. This can range from your business metrics down to the lowest levels of how your systems run. Anomaly detection is an entire process. It’s not just an out-of-the-box tool that measures time series data. Similar to DevOps, anomaly detection is a culture of different roles engaging in a process that combines tooling with human analysis.

“Are our expectations wrong, or has the world around us changed?” says Alexander Pucher, an expert in modern anomaly detection at large internet companies. I recently had a chance to interview Alexander on an episode of The Little Tech podcast about a unified, comprehensive process for reporting and data analysis.

Anomaly detection is not just a single level of insights. As you go down the hierarchy of events and metrics, different parts of an organization are interested in different insights. Eventually, you arrive at the need for something that does anomaly detection on real-time data.

Anomaly detection is a part of a bigger process. For example, let’s say I have an organization and there’s a definition of business as usual. Then suddenly, a problem comes along. Whether or not that problem is being monitored, you get a smell of smoke coming from either customers or users. This is the first step of knowing that something is off, and it is oftentimes the slowest part of the whole problem resolution process. We can consider this “smell of smoke” to be the first step of anomaly detection, and it can be costly without the right culture and tools to be aware of early indicators that lead to problems.

Alexander is a researcher and open source developer who helped create a tool called ThirdEye for anomaly detection at LinkedIn. ThirdEye is part of the Apache Pinot ecosystem of projects, both of which grew out of early lessons learned at LinkedIn.

Investigating anomalies using the ThirdEye tool

Alexander says that “you need an additional tool that helps you understand whether a change in time series data is actually meaningful.”

ThirdEye is a platform that lets you integrate your metrics (quantitative information) with events (knowledge, or qualitative information) and combine the two so you can distinguish between meaningless anomalies and those that matter.
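To make this concrete, here is a minimal sketch of the idea, not ThirdEye’s actual detection logic: a simple trailing-window z-score detector flags statistical anomalies, and a set of known events (a holiday, a deploy, a marketing campaign) is then used to suppress the anomalies that already have a qualitative explanation. All function names and data here are hypothetical.

```python
from statistics import mean, stdev

def zscore_anomalies(series, window=7, threshold=3.0):
    """Flag points that deviate strongly from the trailing window's mean."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # A point is anomalous if it sits more than `threshold`
        # standard deviations away from recent history.
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

def meaningful_anomalies(series, events, window=7, threshold=3.0):
    """Drop anomalies already explained by a known qualitative event."""
    explained = set(events)  # indices with a known explanation
    return [i for i in zscore_anomalies(series, window, threshold)
            if i not in explained]

# Hypothetical daily signup counts: spikes on day 7 and day 9.
daily_signups = [100, 102, 99, 101, 103, 100, 98, 250, 101, 400]
known_events = {7}  # e.g., a marketing campaign launched on day 7
print(meaningful_anomalies(daily_signups, known_events))  # [9]
```

The statistical detector alone would flag both spikes; combining it with event knowledge leaves only the unexplained one for a human to investigate, which is the division of labor the quote above describes.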

As a business, overall, you need to make sure that you’re making the progress you’re expected to make. The business starts with an approximate expectation of where things are going. The insights that go along with these metrics are observed by business folks at many different levels. When those folks find something in the metrics that has diverged from these expectations, you’ll get questions about why these anomalies happened.

“At LinkedIn, our data analysts or a dedicated ops team would be the ones that have to answer these questions,” said Alexander. “The answers are often not satisfying or clear enough to be worth implementing a change to avoid the issue.”

The goal of much of Alexander’s work at LinkedIn was discovering the kind of answers that could be automated versus the ones that required creative exploration. Spending less time on repetitive analysis for the things that can be automated allows engineers to stay focused on creating differentiated value for a business.

Alexander goes on to say “whenever you look at data, it’s extremely important to know or try to understand the process that generated the data that you’re looking at.”

If you take this process of interpreting data in a business sense, one thing that Alexander has learned, perhaps the most critical thing, is “to understand whether or not an anomaly actually has an impact on the business”.

Alexander points out that domain expertise is an enormously important part of how different groups and roles understand the meaning of a metric as well as an anomaly. “You have to include the human element of observing and interpreting the meaning of that event. It’s a collaborative process of a machine and multiple humans to figure out what is going on. If we can keep the process online and find the root cause early, it’s much less stressful for everyone” says Alexander.

When it comes to the COVID-19 pandemic that has continually surprised U.S. politicians, scientists, and the public since the first infection was reported in early 2020, time series data and charts have largely dominated the conversation around public policy. The news media has focused much of its scrutiny on these charts, and they have become politicized to justify narratives around reporting and public policy.

“What does a case actually mean? The cases are defined differently for every U.S. state” says Alexander, “usually there is one entity that controls the decision making for what a page view is. Many different parts of the business have different definitions for what an entity is.”

Here, Alexander harkens back to some of the fundamental ideas behind domain-driven design, a process and culture for how a business contextualizes the meaning of the domain entities used in its software and APIs.

From my conversation with Alexander, there are some key takeaways worth mentioning. The organizational process and methodology around designing metrics for analysis seems poorly understood. In software engineering, we have methodologies, such as DDD or DevOps, that help developers understand how to collaborate with the business when developing software. When it comes to measuring and analyzing data, those organizational practices are left to experimentation and self-guided research. Perhaps what we need to improve analytics across the board is a broader approach to designing, collecting, and reporting metrics.

The entire conversation with Alexander can be listened to on The Little Tech Podcast.

To chat with Alexander and other members of the Apache Pinot community, please join the conversation on Slack.


Passionate technology evangelist and open source software advocate. International speaker & author of O’Reilly’s Cloud Native Java.