Achieving High-Quality Data

& Successfully Making Mistakes at the Same Time.

Retargetly Innovation
Retargetly
5 min read · May 15, 2019


To ensure that our data actually delivers great value, it has to go through several steps, such as cleaning, processing and analysis, to shape it the way we need. And to accomplish each step successfully, we need tools that confirm everything is working correctly and alert us when something isn't right.

In this particular case, we were struggling to find a tool that would alert us when there was a failure in one of the very first processes: data intake.

At this stage it is very important that everything flows normally, so that we have excellent raw material to work with later.

We were in need of a tool that could alert us every time there was a significant drop in data intake so we could act on it.

This process could previously be done manually, but as the company grew, so did the volumes of data we handle. Tasks became more complex and the need to automate some tools became pressing. Leo, one of our data scientists, got down to work to find a solution that would fit our use case.

It’s important for us to share this process and make it public. Not only do we receive useful feedback from articles like this one, but just as we draw on the experiments of other companies and teams, we hope they can be nourished by ours as well.

The Process

The one that could have been

As a first approach to this issue, we built an alarm that ran every two hours, comparing the values from the previous 8 hours with the values from the same hour and same weekday in previous weeks, and alerting when any partner had an atypical value more than 3 times in those 8 hours. An atypical value meant a z-score below a threshold established from the mean and standard deviation of previous historical data. This alarm was not accurate enough and did not filter out false positives well, showing irrelevant alerts too frequently.
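A rough sketch of that first approach is below. The variable names, the toy data and the z-score cutoff of -3 are illustrative assumptions, not the exact production values.

```python
import numpy as np

def is_atypical(value, history, z_cutoff=-3.0):
    """A drop is atypical when its z-score against the historical
    mean and standard deviation falls below the cutoff."""
    mean, std = np.mean(history), np.std(history)
    if std == 0:
        return False
    return (value - mean) / std < z_cutoff

# Toy data: same-hour/same-weekday counts from previous weeks,
# and the partner's last 8 hourly counts.
history = np.array([980, 1010, 995, 1002, 990, 1005])
recent_8_hours = [1001, 250, 240, 998, 230, 245, 990, 260]

atypical_hours = sum(is_atypical(v, history) for v in recent_8_hours)
alert = atypical_hours > 3  # alert when more than 3 of the 8 hours are atypical
print(alert)
```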

The chosen one

Given that the first approach was anything but perfect, we decided to try another one. We analysed 90 days of historical data for our problem.

The following figure is an example:

Figure 1. Batch data ingestion from partner 130, which had an undesired ingestion drop.

At first glance, almost every curve has a behaviour resembling an electrical signal, more specifically an alternating current with a lot of noise. Given this seasonality¹ we identified, we decided to treat the data as a time series.

This analogy we drew between data traffic and current also gave us the idea that we should smooth the curves, as if we were rectifying the current, in order to reduce noise and excessive skewness towards large absolute values in the curves.

Thus we applied two transformations to the curves:

1) We decided to convert each curve to a natural logarithmic scale, reducing the variability in the range of values and smoothing local maxima and minima. This also makes it easier to visualize the very different ranges that appear over time.

Figure 2. Log Scale for Batch data ingestion from partner 130.

As an example of the advantage of the log scale (figure 2), one can clearly spot the drop in ingestion that started sometime around the first half of February.
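A minimal sketch of this first transformation, assuming the ingestion volumes live in a pandas Series (the random stand-in data and the +1 guard against zero-volume days are our own assumptions):

```python
import numpy as np
import pandas as pd

# Stand-in for 90 days of daily ingestion counts for one partner.
rng = pd.date_range("2019-01-01", periods=90, freq="D")
daily_counts = pd.Series(np.random.randint(5_000, 50_000, size=90), index=rng)

# Natural log scale; the +1 guards against log(0) on zero-volume days.
log_counts = np.log(daily_counts + 1)
```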

2) We smoothed the curves further by calculating a simple moving average² for the series. This tackled the issue of false positives very well.

The red line in the next figure is the SMA. We calculated the rolling mean with a window of the two previous days, which proved to be an acceptable value.

Figure 3. Simple Moving Average and Log Scale for Batch Data Ingestion from partner 130.
Figure 4 (zoom of figure 3). Here you can see the falling slope clearly.
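Continuing from the log-scaled series in the previous snippet, the smoothing step reduces to a single rolling mean:

```python
# Two-day simple moving average over the log-scaled series;
# min_periods=1 keeps the first day instead of producing a NaN.
smoothed = log_counts.rolling(window=2, min_periods=1).mean()
```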

The resulting smoothed curves helped us identify tendencies in the data, in our case changes in data volume that meant undesired drops. These changes can be interpreted as discrete slopes, and we handled them as such.

In this manner we calculated the slopes between consecutive days, then between the first and last of a 7-day span, and then over a 24-day gap, in order to choose the most suitable time window.
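One way to compare those candidate windows, building on the `smoothed` series above and expressing every slope as change per day so the gaps are comparable (the helper name is ours):

```python
def gap_slope(series, gap_days):
    """Slope between each point and the point `gap_days` earlier, per day."""
    return (series - series.shift(gap_days)) / gap_days

slopes_1d = gap_slope(smoothed, 1)    # between consecutive days
slopes_7d = gap_slope(smoothed, 7)    # first vs. last of a 7-day span
slopes_24d = gap_slope(smoothed, 24)  # 24-day gap
```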

Then we plotted a histogram illustrating the frequencies of each slope occurrence, like the one in figure 5.

Figure 5. Frequencies of slopes from one day to another, from 90 days of historical data for partner 130.

As we are interested in pronounced falling slopes, we only considered negative slopes and then established a threshold from the data. Any slope below this threshold would be considered an anomaly, as it would not be a frequent value for the partner concerned, and therefore a case to alert on. The threshold was set at the 5% quantile.
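The thresholding step then looks roughly like the sketch below, using the day-over-day slopes from the previous snippet; taking the quantile over only the negative slopes is our reading of the description above.

```python
# Keep only the falling (negative) slopes and take their 5% quantile
# as this partner's alert threshold.
falling = slopes_1d[slopes_1d < 0]
threshold = falling.quantile(0.05)

# A new slope below the threshold is an infrequent, pronounced drop,
# so it gets flagged as an anomaly.
is_anomaly = slopes_1d.iloc[-1] < threshold
```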

Conclusions

After the preliminary analysis, we proceeded to build daily alarms for different types of data.

  • We built a daily alarm that gathers data from the previous 35 days and alerts when last-day slopes are identified as an anomaly. The thresholds are recalculated each time (see the sketch below).
  • We also added a threshold of a minimum volume of data expected from each partner.
  • We also established a safeguard threshold for all partners, using the thresholds obtained from 90 days of data.

We only download data from partners that are considered relevant to us, specifically a list of partners with a minimum of 10 appearances and a minimum mean of hits per day over 90 days of data. This list is updated periodically.
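Putting the pieces together, the daily check could look like the sketch below. `load_daily_counts` is a hypothetical helper standing in for however the ingestion counts are actually fetched, and the way the recalculated and safeguard thresholds are combined is our interpretation of the bullets above, not the production code.

```python
import numpy as np
import pandas as pd

def load_daily_counts(partner_id: str, days: int) -> pd.Series:
    """Hypothetical loader; returns random stand-in data here."""
    rng = pd.date_range(end=pd.Timestamp.today().normalize(), periods=days, freq="D")
    return pd.Series(np.random.randint(5_000, 50_000, size=days), index=rng)

def should_alert(partner_id: str, min_volume: int, safeguard_threshold: float) -> bool:
    counts = load_daily_counts(partner_id, days=35)                 # previous 35 days
    smoothed = np.log(counts + 1).rolling(2, min_periods=1).mean()  # log scale + 2-day SMA
    slopes = smoothed.diff()                                        # day-over-day slopes

    # Recalculate the partner-specific threshold each run.
    falling = slopes[slopes < 0]
    threshold = falling.quantile(0.05) if not falling.empty else safeguard_threshold

    # slope < max(a, b) is equivalent to "below either threshold".
    steep_drop = slopes.iloc[-1] < max(threshold, safeguard_threshold)
    low_volume = counts.iloc[-1] < min_volume  # minimum expected volume per partner

    return steep_drop or low_volume

print(should_alert("partner_130", min_volume=1_000, safeguard_threshold=-0.5))
```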

1 — Seasonality in a time series is a regular pattern of changes that repeats over S time periods, where S defines the number of time periods until the pattern repeats again.

2 — Simple Moving Average definition, here

What’s ahead for us

Approaches we would also like to try in the future and would love to get feedback on:

  • K-means clustering to detect anomalies.
  • Fourier approximations to detect anomalies.
  • Local Outlier Factor and Isolation Forest for outlier detection.

Special thanks to Leo Albina, one of our Data Science team members, who conducted the experiment and executed it successfully.

Also thanks to the Developers and Data Scientists who provided advice and support:

Mathias Longo (Chief Data Scientist), Leonardo Lucianna (Data Scientist), Federico Nieves (CTO), Diego Stallo (VP of Tech), Francisco Rangel (Backend).
