How Are the Tech Giants Ensuring Data Quality with ML?

Overview of data quality techniques used by Netflix, Uber, and Airbnb

Jyoti Dhiman
Geek Culture


This is the second post in the data quality series. You can find the first post here, where I wrote about the role of machine learning in data quality.

Coming to the topic now: I have done some study on how data quality is ensured at the data-driven industry giants. My list is not exhaustive, but it is a good start toward a practical understanding of the data quality mechanisms at different organizations. The case studies focus on how these companies use machine learning to ensure data quality, so please read on to find out more.

UBER

Uber is one of my favorite products. Since I don't know how to drive (well, I tried learning, but highways, man), it has been a boon for me and given me a feeling of independence like nothing else.

Uber operates at a scale of around 14M trips per day. At this scale, data quality is fun to build but scary too. So how does Uber do it?

Uber has a mechanism called the Data Quality Monitor (DQM), which analyzes historical patterns and generates alerts for anomalies.

For anomaly detection, DQM uses principal component analysis and makes a one-step-ahead forecast using the Holt-Winters time series model. In simple words, if the observed value does not match the forecast built from the historical trend, the data is considered anomalous. The projection is done twice, once including the latest data point and once without it, which amplifies any drastic change in the latest data point.
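To make that concrete, here is a minimal sketch of the Holt-Winters half of the idea, using statsmodels: fit on the history without the latest point, forecast one step ahead, and flag the latest point if it strays too far from the forecast. The metric, the synthetic data, and the 3-sigma cutoff are my own illustration, not Uber's actual DQM code.

import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Eight weeks of a hypothetical daily metric with weekly seasonality.
rng = np.random.default_rng(42)
days = np.arange(8 * 7)
series = 100 + 20 * np.sin(2 * np.pi * days / 7) + rng.normal(0, 3, days.size)

# Fit on everything except the latest point, then forecast one step ahead.
model = ExponentialSmoothing(
    series[:-1], trend="add", seasonal="add", seasonal_periods=7
).fit()
predicted = model.forecast(1)[0]
observed = series[-1]

# Flag the latest point if it strays far from the forecast (3-sigma rule).
residual_std = np.std(series[:-1] - model.fittedvalues)
status = "anomaly" if abs(observed - predicted) > 3 * residual_std else "ok"
print(f"{status}: observed {observed:.1f}, expected {predicted:.1f}")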

Also, my favorite part: DQM takes care not to overwhelm data owners with continuous alerts. It assigns a score to how bad each anomaly is and prioritizes alerts accordingly, so owners see the worst problems first instead of a flood of notifications.
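As a toy illustration of that prioritization idea (my own sketch, not Uber's scoring logic), you could score each alert by how many standard deviations the observed value strays from the forecast and only surface the worst offenders:

# Hypothetical alert records: (dataset, observed, expected, residual std).
alerts = [
    ("trips", 80, 100, 5),
    ("fares", 101, 100, 5),
    ("signups", 70, 100, 5),
]

# Score = deviation in standard deviations; keep only the top offenders.
def prioritize(alerts, top_n=2):
    scored = [(abs(obs - exp) / std, name) for name, obs, exp, std in alerts]
    return sorted(scored, reverse=True)[:top_n]

for score, name in prioritize(alerts):
    print(f"{name}: severity {score:.1f} sigma")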

NETFLIX

Well, well, well. Netflix needs no introduction; I have binged so many series on Netflix that I have lost count. Again, one of my favorite products: the UX, the streaming quality, the recommendations (well, it could do a little work there), but still a big fan!

Moving on from the rant: how does Netflix use machine learning for data quality?

Netflix uses ML to track server health. It is not exactly a data quality use case, but the same principles can be applied to data quality too (at least, that's what I think).

The most common way to identify anomalous or bad data is to use thresholds in a rule-based system. But do any of us really know what a good threshold is?

A threshold can be ever-changing and never quite accurate, and even if you find a good one, what about the data that sits just below it, so it never generates an alert but is still bad?

The same principle applies to system health: a bad system will generate an alert when a metric crosses its threshold, but what about the system that sits just below the threshold and is just as bad, but not bad enough? (Get it? :P)

So how does Netflix identify these bad systems?

Using the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm.

With DBSCAN, Netflix assesses system health by identifying servers that are not performing like their peers, with 93% accuracy. That's a very good number for a statistical solution. So take notes; maybe this algorithm can help you out with some DQ use cases based on the same principles.
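Here is a tiny sketch of the idea with scikit-learn's DBSCAN. The servers, metrics, and the eps/min_samples values are invented for illustration; the point is that anything DBSCAN cannot place in a dense cluster gets labeled -1, i.e., an outlier.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Rows are servers, columns are health metrics: [latency_ms, error_rate_%].
metrics = np.array([
    [110, 0.20], [115, 0.30], [108, 0.25], [112, 0.22],  # the healthy fleet
    [400, 4.00],                                          # the odd one out
])

# DBSCAN clusters dense groups and labels everything else -1 (an outlier).
scaled = StandardScaler().fit_transform(metrics)
labels = DBSCAN(eps=0.8, min_samples=3).fit_predict(scaled)
print("suspect servers:", np.where(labels == -1)[0])  # -> [4]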

Anomaly detection at Netflix

For use cases like payment and signup anomalies, Netflix has a mechanism called RAD (Robust Anomaly Detection). Netflix has open-sourced RAD as part of Surus.

Netflix RAD initially tried approaches based on moving averages, deviations, time series, and regression, but the team found these were not robust enough for high-cardinality data, and settled on Robust Principal Component Analysis (RPCA) to detect anomalies.

If you are interested in the details: RPCA decomposes a matrix into a low-rank representation, random noise, and a set of outliers by repeatedly computing the SVD and applying "thresholds" to the singular values and the error at each iteration. For more information, please refer to the original paper by Candès et al. (2009) (check the implementation here).
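Here is a bare-bones sketch of that loop in NumPy, one naive way to alternate singular-value thresholding (for the low-rank part) with soft thresholding (for the sparse outliers). The threshold tau is a tuning knob I picked for the toy data; this is the textbook idea, not Surus's implementation.

import numpy as np

def rpca(M, tau=5.0, n_iter=100):
    """Split M into a low-rank part L and sparse outliers S (M ~ L + S)."""
    lam = 1.0 / np.sqrt(max(M.shape))  # sparsity weight from Candes et al.
    S = np.zeros_like(M)
    for _ in range(n_iter):
        # Threshold singular values of M - S: keep only strong structure.
        U, sig, Vt = np.linalg.svd(M - S, full_matrices=False)
        L = (U * np.maximum(sig - tau, 0)) @ Vt
        # Soft-threshold the remaining error: gross errors land in S.
        S = np.sign(M - L) * np.maximum(np.abs(M - L) - lam * tau, 0)
    return L, S

# Toy demo: a rank-1 matrix with one gross outlier injected.
rng = np.random.default_rng(0)
M = np.outer(rng.normal(size=20), rng.normal(size=20))
M[3, 7] += 10.0
L, S = rpca(M)
print("outlier at:", np.unravel_index(np.argmax(np.abs(S)), S.shape))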

AIRBNB

I am a big fan of Airbnb's blogs; they are very informative and fun to read. Do check them out if you haven't yet. Also, I am currently reading "The Airbnb Story", a very interesting read on how it all started.

Now moving on to DQ at Airbnb: Airbnb started its Data Quality initiative in 2019. For DQ checks, Airbnb has a super cool process called Midas certification, which defines the gold standard for data. Basically, if a dataset has this certificate, it is pure gold, because it has passed an extensive process of documentation, reviews, validation, and verification.

The checks verify that a dataset is accurate (via automated testing against historical data), follows best practices, is a single source of truth, and has thorough documentation.

In addition to the Midas certification described above, Airbnb also has anomaly detection for payment data, implemented via the Fast Fourier Transform (FFT).
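To give a flavor of the technique (this is the general FFT recipe, not Airbnb's actual pipeline; the data and thresholds are made up): keep only the dominant frequencies of the series as a "normal" baseline, and flag points whose residual against that baseline is large.

import numpy as np

rng = np.random.default_rng(1)
n = 7 * 24  # a week of hypothetical hourly payment counts
t = np.arange(n)
series = 100 + 15 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 2, n)
series[100] += 40  # inject an anomaly

# Keep only the strongest frequency components; zero out the rest.
spectrum = np.fft.rfft(series)
cutoff = np.sort(np.abs(spectrum))[-5]  # keep the top 5 bins
spectrum[np.abs(spectrum) < cutoff] = 0
baseline = np.fft.irfft(spectrum, n=n)

# Large residuals against the reconstructed baseline are anomalies.
residual = series - baseline
anomalies = np.where(np.abs(residual) > 4 * residual.std())[0]
print("anomalous hours:", anomalies)  # should include hour 100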

Okay, I am remembering engineering math classes now. I guess that's all for this article. I hope this was not too much information in one go; I have also linked the algorithms for a better understanding.

The idea of this article was to provide some practical examples of machine learning being used for data quality, because no matter how much reading we do, nothing beats knowing how these techniques are actually applied.

Hope this helps!
JD
