Data metrology at BlaBlaCar

Our tool to monitor data consistency

Thomas Tygreat
BlaBlaCar

--

Why do we need a tool?

When you are a data engineer, one of the most important topics is the health of your data. You must provide clean and reliable data to your consumers.

When the data is missing, decision makers are blind. Even worse: when the data is inconsistent, wrong decisions will be made!

There are many things that can lead to “dirty” data:

  • The data source is not ready yet
  • Something has changed in production
  • An external API has been modified
  • A technical error occurred

Most of the time, you know that something happened, because errors are visible in logs:

An example of an error visible in Airflow

But sometimes the error is silent, and you are not aware of it unless you actually check the data.

That’s why we decided to implement a metrology tool that runs automatically and lets us detect issues at a glance, every day.

How does it work?

We defined a list of metrics that we wanted to monitor. Some of them are functional: the number of signups, the activity on the platform. Others are technical, such as the number of rows per table.
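
To give an idea of what a metric looks like, here is a hypothetical sketch of a functional metric defined as a SQL view; the table and column names are assumptions, not our actual schema:

# Hypothetical sketch: the daily signups metric defined as a SQL view.
SIGNUPS_METRIC_VIEW = """
CREATE VIEW metric_signups AS
SELECT CAST(created_at AS DATE) AS day, COUNT(*) AS value
FROM users
GROUP BY CAST(created_at AS DATE)
"""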

We started to record these metrics every day. The next step was to detect weird values, such as suspicious drops or peaks. Our first idea was to compare the values with the global average.

The purple line represents the global average
The delta chart — Difference between daily values and the average
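
As a minimal sketch, this first approach boils down to a few lines of Python (here history is a hypothetical list of (date, value) pairs for one metric):

def deltas_vs_global_average(history):
    # history: a list of (date, value) pairs for one metric
    average = sum(value for _, value in history) / len(history)
    # delta chart: difference between each daily value and the global average
    return [(day, value - average) for day, value in history]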

The last step was to define the “acceptable” range. Statistical outliers are considered suspicious and need to be checked.

Range of acceptable delta
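
One simple way to define such a range, sketched below, is to flag the deltas that fall outside the mean plus or minus k standard deviations; the factor k is an assumed tuning parameter, not a value from our production setup:

import statistics

def suspicious_days(deltas, k=3.0):
    # deltas: a list of (date, delta) pairs; k is an assumed tuning factor
    values = [delta for _, delta in deltas]
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    # statistical outliers: deltas outside the "acceptable" range
    return [(day, delta) for day, delta in deltas
            if abs(delta - mean) > k * stdev]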

This setup worked well at the beginning, but we quickly faced an issue. Because activity varies during the week, there’s a huge gap between the global average and the daily values.

Some metrics vary a lot over the week

It’s almost impossible to tell the difference between normal variation and a technical issue. We needed something smarter than comparing with the global average, so we decided to compute the average over the same day of the week, to take the cyclic activity into account.

The blue chart is the average value, computed for each day of the week
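
In code, this comes down to bucketing the values by day of week before averaging. A minimal sketch, with the same hypothetical (date, value) history as above:

from collections import defaultdict

def weekday_averages(history):
    # group the values by day of week: 0 = Monday, ..., 6 = Sunday
    buckets = defaultdict(list)
    for day, value in history:
        buckets[day.weekday()].append(value)
    return {wd: sum(vals) / len(vals) for wd, vals in buckets.items()}

def weekday_deltas(history):
    # compare each day with the average of the same day of the week
    averages = weekday_averages(history)
    return [(day, value - averages[day.weekday()]) for day, value in history]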

This makes it easy to spot abnormal values.

The drops are clearly visible

We also implemented an hour-based system for the tracker data, which needs to be monitored more precisely.

Here the average is computed for each hour, on the same day of the week
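
The hourly variant is the same idea at a finer grain, bucketing by day of week and hour; again a minimal sketch, with hypothetical (datetime, value) pairs:

from collections import defaultdict

def hourly_averages(history):
    # group by (day of week, hour): Monday 9 a.m. is compared
    # with other Monday mornings
    buckets = defaultdict(list)
    for ts, value in history:
        buckets[(ts.weekday(), ts.hour)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}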

Technical implementation

We wanted to keep it simple and efficient. We started with a generic table to store the history of all metrics. The average and delta are pre-computed.

Records of metric values every day
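
As an illustration, the generic table could look like the following DDL; the column names and types are assumptions:

# Hypothetical DDL: one row per metric per day, with the average
# and delta pre-computed when the row is loaded.
METRIC_HISTORY_DDL = """
CREATE TABLE metric_history (
    metric_name TEXT    NOT NULL,
    day         DATE    NOT NULL,
    value       NUMERIC NOT NULL,
    average     NUMERIC,          -- average for the same day of the week
    delta       NUMERIC,          -- value - average
    PRIMARY KEY (metric_name, day)
)
"""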

The calculation of the average and delta is always the same, whatever the metric. For this reason, we decided to write a query with a dynamic parameter. This allows us to compute all the metrics within a single Python loop.

This parametrized query computes the averages of all the metrics
The views have the same generic structure
The rules to compute each metric are set in the view definitions
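
Here is a minimal sketch of that pattern. The SQL dialect (PostgreSQL-style) and all the names are assumptions; the point is that every metric view exposes the same (day, value) shape, so one templated query serves them all:

# Templated query: the metric name is the only moving part.
AVERAGE_AND_DELTA_QUERY = """
INSERT INTO metric_history (metric_name, day, value, average, delta)
SELECT
    '{metric}',
    day,
    value,
    AVG(value) OVER (PARTITION BY EXTRACT(DOW FROM day)),
    value - AVG(value) OVER (PARTITION BY EXTRACT(DOW FROM day))
FROM metric_{metric}
"""

METRICS = ["signups", "bookings", "published_trips"]  # hypothetical list

def compute_all_metrics(cursor):
    # the metric names come from our own list, so formatting them
    # into the query is safe here
    for metric in METRICS:
        cursor.execute(AVERAGE_AND_DELTA_QUERY.format(metric=metric))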

The dashboard

We used Tableau to build a dashboard on top of this, and we review it every morning. The first tab is designed to check the status of all metrics.

A quick view of the metrics that need special attention

The other tabs display the details for each metric.

The calendar view
The chart view

Monitoring the volumetry

Last but not least, we needed to follow the database volumetry on a weekly basis. We wanted two things:

  • Two levels of detail: schema and table
  • Two metrics: the total size, and the size evolution compared to the previous week

The size evolution can be computed easily with the SQL “lag” function
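
For illustration, such a query could look like the sketch below, assuming a weekly snapshot table named table_size_history; all names are hypothetical:

# LAG looks one week back within each (schema, table) partition,
# which gives the size evolution compared to the previous week.
VOLUMETRY_QUERY = """
SELECT
    snapshot_week,
    schema_name,
    table_name,
    size_bytes,
    size_bytes - LAG(size_bytes) OVER (
        PARTITION BY schema_name, table_name
        ORDER BY snapshot_week
    ) AS weekly_evolution
FROM table_size_history
"""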

The dashboard is split into two parts:

  • The top panel shows the evolution, making big variations easy to spot.
  • The bottom panel shows the total size.

Split by schema
Dropdown list to select the level of detail
Split by table

Conclusion

The metrology tool helped us to improve data quality at BlaBlaCar. It’s been designed according to our needs:

  • All the metrics are refreshed automatically
  • Adding new metrics is very simple, thanks to the generic structure
  • Issues are highlighted by computing the delta between daily values and the average for the same day of the week

After implementing this tool, we noticed a global improvement in data reliability. We anticipate issues earlier, we communicate them to analysts, and we fix them faster.

The next step would be to fire alerts when there’s an issue, so that we can be even more reactive.
