Data Quality: Timeseries Anomaly Detection at Scale with ThirdEye
“Is this number correct?” If you work with analytics, you hear this question a lot. Data quality is hard.
AB Tasty provides solutions to create online experiments and optimize user experiences. Based on the figures we provide, our clients make changes that impact millions of end users. Without data quality assurance, our clients could end up making catastrophic business decisions.
At AB Tasty, we do hear “Is this number correct?” a lot.
Over the last few years, we realized we were spending too much time on data quality questions. A client would run an experiment, wait a few weeks, then look at the results. If in doubt, they would compare the numbers with another analytics tool. If still in doubt, they would open a support ticket. After a few checks, the ticket would end up in the data team’s hands. A data problem could go unnoticed for weeks. Debugging was also unnecessarily difficult: finding the root cause of a problem gets harder as the problem gets older.
We understood we had to catch data quality issues proactively.
This blog describes our journey to build new monitoring tools and the solutions we found to improve data quality operations.
Architecture
The figure below shows a simplified architecture of our data collection system. Our clients send events, we transform these events, then write them to a data warehouse. Our clients look at their experiment results in a reporting UI.
Background
We use standard DevOps monitoring tools to monitor this pipeline: request rate, network usage, message counts, throughput, etc.
We can detect when the data does not flow correctly, but we don’t have a fine-grained view of the data. With 4,000 clients sending events, a single client going down cannot be spotted in these DevOps metrics. Likewise, one client could have all of its transactions converted from dollars to euros and the DevOps metrics wouldn’t notice.
In effect, we are monitoring the 3Vs: volume, velocity, and variety, but we are blind to the additional 2Vs: veracity and value.
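To make that blind spot concrete, here is a toy sketch (all numbers are made up) of why a single client going dark barely moves an aggregate metric, while being obvious in a per-client breakdown:

```python
# Toy illustration with made-up numbers: one client out of 4,000 stops
# sending events. The aggregate request rate barely moves, while the
# per-client view drops to zero for that client.

clients = {f"client_{i}": 1_000 for i in range(4_000)}  # events per hour, per client

total_before = sum(clients.values())
clients["client_42"] = 0  # this client's tag integration breaks
total_after = sum(clients.values())

aggregate_drop = 1 - total_after / total_before
print(f"Aggregate drop: {aggregate_drop:.3%}")  # 0.025%: invisible in DevOps metrics
print("client_42 drop: 100%")                   # obvious in a per-client view
```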
Before digging into solutions, we needed to understand how incidents could surface. We identified 3 stakeholders that could break the system:
- The client. Tag/SDK integration can break. For instance, a client’s contractor can mistakenly remove the integration on Android. Such incidents are limited to a single client. Often, the problem is specific to an analytics dimension: device type, subdomain, payment method, etc.
- AB Tasty. Bugs happen. Such incidents affect a subgroup of clients. Often, the problem is specific to a business dimension: product package, feature, etc.
- Other internet players. Browsers, addons, providers… The web ecosystem is constantly evolving. Such incidents affect all clients, but tiny sub-populations. For instance, some cookies can get broken by a new browser version. The impacted population grows as the browser gets updated.
Wanderings
We started by building anomaly detection into the streaming pipeline. The system performed aggregations, breakdowns by dimension, and used simple rules to detect anomalies.
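As a rough illustration of what such a simple rule looked like (this is only a sketch; the window handling and the threshold are hypothetical, not our production values):

```python
from collections import defaultdict

DROP_THRESHOLD = 0.5  # flag a client if its event count falls by more than 50%

previous_window = defaultdict(int)  # client_id -> event count in the previous window

def check_window(current_window: dict[str, int]) -> list[str]:
    """Compare the current window to the previous one and return anomalous clients."""
    anomalies = []
    for client_id, prev_count in previous_window.items():
        curr_count = current_window.get(client_id, 0)
        if prev_count > 0 and curr_count < prev_count * (1 - DROP_THRESHOLD):
            anomalies.append(client_id)
    # The baseline has to be kept in memory for every client (and every dimension).
    previous_window.clear()
    previous_window.update(current_window)
    return anomalies
```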
Streaming anomaly detection sounds cool, but we realized it has major drawbacks. When a streaming anomaly was raised, we had to check its impact in the data warehouse. A data engineer would manually write and run SQL queries, then plug the results into a visualization tool. If no problem was found in the data warehouse, it was hard to tell whether the anomaly was real, caused by a bug in the detection system, or missed in the manual SQL exploration.
Long historical baselines and lots of dimensions did not scale well: the streaming system had to hold a lot of state in memory.
Also, the streaming design made the data aggregations and the anomaly detection tightly coupled: adding new rules was hard.
With these learnings, we set our expectations for a successful data quality platform:
- Separate data collection and anomaly detection responsibilities.
- Check quality where the data is ultimately consumed: in the data warehouse.
- Provide a framework to create and update detection rules easily.
- Allow arbitrary detection frequencies (every minute, every hour, every day, etc.).
- Provide a UI to jump from an anomaly to root cause analysis easily.
- Connect the system with our operational tools (Jira, Slack).
Meet ThirdEye
Before building this system ourselves, we checked for commercial and open-source solutions. Eighteen months is a lifetime in the data world: in early 2020, the data quality tooling space was almost non-existent. We decided to try ThirdEye, an open-source data quality platform created at LinkedIn and used on top of the Apache Pinot datastore.
ThirdEye actually ticked all our boxes: it is described as a platform for realtime monitoring of time series and interactive root-cause analysis. LinkedIn has three great articles presenting ThirdEye (see the resources at the end), so we won’t present it in detail. Below are a few screenshots so you can get an idea of the features:
Integrating ThirdEye
To connect ThirdEye to our data, we first had to implement a BigQuery connector. We contributed it back to the open-source project. We then identified 3 metrics in BigQuery that could help detect incidents:
We implemented detection rules for these 3 metrics. There are a lot of nitty-gritty details involved in achieving good anomaly detection performance. Think business type, transient errors, holidays, sales, lockdowns: false positives are everywhere! This topic is heavy enough that we will discuss it in another article.
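To give an idea of the kind of per-client metric these rules run on, here is a hedged sketch of how such a metric could be computed from BigQuery (the table, columns, and the metric itself are illustrative assumptions, not our actual schema):

```python
from google.cloud import bigquery

# Hypothetical per-client metric: hourly event counts over the last week.
# `analytics.events`, `client_id`, and `event_timestamp` are illustrative names.
bq = bigquery.Client()

sql = """
    SELECT
      client_id,
      TIMESTAMP_TRUNC(event_timestamp, HOUR) AS ts,
      COUNT(*) AS event_count
    FROM `analytics.events`
    WHERE event_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    GROUP BY client_id, ts
    ORDER BY client_id, ts
"""

for row in bq.query(sql).result():
    print(row.client_id, row.ts, row.event_count)
```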
With 3 detection rules for each of our 4,000 clients, creating and updating config YAMLs in the UI was not an option. We created a small batch job to generate and update the rules, sketched below. The job is plugged into our business database, and we customize detection rules based on business information such as SLA, product package, and industry.
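Here is a minimal sketch of that generation pattern. The field names are illustrative only and do not reproduce ThirdEye’s actual detection YAML schema, and the business attributes are hypothetical:

```python
import yaml

def build_detection_config(client_info: dict) -> dict:
    """Build one detection rule config for one client and one metric.

    Field names are illustrative; the real job emits ThirdEye's YAML schema.
    """
    # Stricter sensitivity for clients with a premium package or a tight SLA.
    sensitivity = "high" if client_info["package"] == "premium" else "medium"
    return {
        "name": f"event_count_drop_{client_info['id']}",
        "metric": "event_count",
        "dimension_filter": {"client_id": client_info["id"]},
        "sensitivity": sensitivity,
        "cron": "0 0 * * * ?",  # run every hour
    }

def generate_configs(clients: list[dict]) -> str:
    # One YAML document per client, to be pushed to ThirdEye.
    return yaml.dump_all(build_detection_config(c) for c in clients)
```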
Alerts are sent to Jira and forwarded to a dedicated Slack channel. Data engineers check the anomalies, and forward them to support teams.
Notice that alerts are decoupled from anomaly detection. This means the system can detect anomalies without raising an alert. We did not think of this at first, but it proved very useful. We can test detection rules in production without being spammed on Slack. We also avoid unnecessary alerts on special days such as Christmas and Black Friday.
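Conceptually, this is just a check between the detection output and the notification layer. A minimal sketch of the idea (the muted dates and the test flag are illustrative; this is not ThirdEye’s actual mechanism):

```python
from datetime import date

# Anomalies are always recorded; notifications are skipped for rules under test
# and on days where traffic is expected to be unusual.
MUTED_DAYS = {date(2021, 11, 26), date(2021, 12, 25)}  # Black Friday, Christmas

def should_notify(anomaly_day: date, rule_in_test: bool) -> bool:
    if rule_in_test:               # rule is being validated in production
        return False
    if anomaly_day in MUTED_DAYS:  # special days: detect, but stay quiet
        return False
    return True
```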
Performance
With 3 metrics for each of our 4,000 clients, ThirdEye runs 12,000 detection rules every hour. One detection job corresponds to one or two queries.
We made sure the jobs were evenly spread over the hour, so we have a constant load of about 4.3 queries per second, which is very reasonable for BigQuery.
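The back-of-the-envelope arithmetic behind that load figure (the 1.3 queries-per-job average is an assumption chosen to be consistent with one or two queries per job):

```python
clients = 4_000
metrics_per_client = 3
jobs_per_hour = clients * metrics_per_client   # 12,000 detection jobs every hour

queries_per_job = 1.3                          # assumed average, between one and two
queries_per_second = jobs_per_hour * queries_per_job / 3600

print(round(queries_per_second, 1))            # ~4.3 queries per second
```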
8,000 of these rules use ML models. BigQuery does most of the data work, but the timeseries ML models and threshold logic are computed in the workers. Workers are auto-scaled on Kubernetes, and we observe that 4 pods with 8 GB of memory and 1.4 vCPU each are generally enough.
Overall, we found the system to be quite cheap, and easy to maintain.
At the beginning, anomaly detection raised too many false positives. This was putting the project at risk: too many alerts means nobody checks the alerts. After hours of fine-tuning, we now have between 1 and 4 alerts per day and a true-positive ratio of 85%. We do miss some incidents (false negatives), but we found that the current settings already have a great positive impact on operations without burdening the data engineers.
Here are some examples of operational successes:
- In early April 2021, a new data privacy law came into effect in France. Dozens of clients ended up breaking their tag/SDK integration. We were able to catch all these incidents and reach out to the clients in less than one day.
- We detected a problem for a new client that could have put their proof-of-concept period at risk. We ensured client satisfaction and ultimately won the client.
- We discovered that some clients were regularly disabling our tools for SEO purposes. We reached out to them and fixed our SEO impact.
Conclusion
We identified the root causes of data incidents and the holes in our monitoring. We understood the drawbacks of streaming anomaly detection. With ThirdEye, we built a system that monitors 12,000 metrics on an hourly basis. Today, we can detect and understand incidents involving data loss, identity loss, and data incorrectness.
We are still exploring ThirdEye’s capabilities. Some projects we are working on:
- fine-grained monitoring along dimensions: country, device, etc.
- meta-learning: give feedback to ThirdEye to automatically fine-tune alerts
- investigation reports: open the platform to other teams and share results
In a future article, we will explain how we built efficient anomaly detection rules in ThirdEye. Stay tuned!
Resources:
Blogs from LinkedIn:
- Alexander Pucher, General introduction of ThirdEye capabilities, 2019
- Xiaohui Sun, Smart alerts in ThirdEye, 2019
- Yen-Jung Chang, Analyzing anomalies with ThirdEye, 2020
All images by the author.