Machine Learning from P.O.C to Production — Part 2

Stable ML Models from a Data Engineer's point of view

Florian DEZE
CodeShake
6 min read · Nov 23, 2022


source: https://airbyte.com/blog/etl-framework-vs-etl-script

In part 1, we introduced the concepts, roles and vocabulary needed to understand the basics of an ML project.

Performance in ML projects comes from two essential parts: the features and the model. So why does a Data Engineer share responsibility for it? The main problem in production is potential instability, either in the features at computation time or in the data sources feeding the data pipelines. In this article we will see why this instability exists, how to prevent it, and what the responsibilities of a Data Engineer are.

A Data Engineer should always secure their pipelines' proper execution. You can apply four levels of security and quality to these executions:

Level 1: Orchestration, monitoring and retry

This is the basic level when building pipelines. Your pipeline should not be executed by a simple sequential script, but by a proper orchestration tool, which will:

  • Be triggered by a scheduler or by events
  • Orchestrate tasks: monitor the status of each task, wait between tasks
  • Parallelise tasks when necessary to optimise execution time
  • Present the execution status of each step
  • Enable relaunching/retrying a task or the whole pipeline
  • Log task statuses (success, failure, retry, etc.)

This orchestration tool will be combined with monitoring and alerting applications to ingest logs and create alerts or performance metrics.
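
As a hedged illustration, a minimal pipeline at this level could look like the sketch below, assuming Apache Airflow 2.4+ as the orchestrator; the DAG name, task names, schedule and retry values are placeholders, and the task bodies are left empty.

```python
# Minimal orchestration sketch, assuming Apache Airflow 2.4+; names, schedule
# and retry values are illustrative placeholders, not a reference implementation.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    ...  # read from the source systems


def transform(**context):
    ...  # compute/aggregate the feature tables


def publish(**context):
    ...  # expose the final result table


with DAG(
    dag_id="daily_feature_pipeline",          # hypothetical pipeline name
    schedule="5 3 * * *",                     # triggered by the scheduler at 03:05 daily
    start_date=datetime(2022, 1, 1),
    catchup=False,
    default_args={
        "retries": 3,                         # retry policy for each task
        "retry_delay": timedelta(minutes=10),
        "email_on_failure": True,             # hook for alerting on failure
    },
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_publish = PythonOperator(task_id="publish", python_callable=publish)

    # Task dependencies: the orchestrator monitors each status, waits between
    # tasks, logs success/failure/retry, and lets you relaunch any task.
    t_extract >> t_transform >> t_publish
```

The orchestrator's task logs are then the natural input for the monitoring and alerting applications mentioned above.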

Level 2: Reproducibility and re-writing history

When your pipeline does not execute correctly, you will need to relaunch it. But this can be trickier than just clicking a button.

Let’s illustrate it with a use case:

You build a pipeline that runs once a day. This pipeline executes SQL queries with dynamic injection of time parameters (like “WHERE date_col > current_date() - 31”). The results are appended to a table partitioned on insertion time. One day, your pipeline fails for some reason (network or service interruption).

Problem: You need to relaunch the pipeline, but you couldn't do it on the day of the failure (you missed the alert, the fix took too long, etc.). You want to relaunch it the next day, but since the SQL queries are parameterised with the launch day, you will not get the same result as if it had run yesterday. The data won't be correctly partitioned in your database either.

Solution: Some orchestrators offer functions or predefined variables that let you simulate a launch on a previous day, hour, etc. Be aware of this feature when choosing your orchestration solution.
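
Sticking with the Airflow assumption from the sketch above, the idea looks roughly like this: the query is parameterised with the run's logical date (the templated `{{ ds }}` variable) instead of the wall-clock date, so relaunching yesterday's failed run today reproduces yesterday's result and partition. The table, column and SQL dialect are illustrative.

```python
# Reproducibility sketch, still assuming Airflow; `source_table`, `date_col`
# and the date arithmetic are illustrative and may need adapting to your dialect.
from airflow.operators.python import PythonOperator

EXTRACT_SQL = """
    SELECT *
    FROM source_table
    WHERE date_col > DATE '{{ ds }}' - 31   -- logical run date, not current_date()
"""


def run_extract(sql, partition_date, **context):
    # Execute `sql` against the warehouse and append the result into the
    # partition identified by `partition_date` (the logical date), rather
    # than partitioning on insertion time.
    ...


# (declared inside the DAG from the previous sketch)
extract_task = PythonOperator(
    task_id="extract",
    python_callable=run_extract,
    # op_kwargs is a templated field, so both values are rendered per run;
    # on a relaunch, {{ ds }} still resolves to the original run date.
    op_kwargs={"sql": EXTRACT_SQL, "partition_date": "{{ ds }}"},
)
```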

Special note: Systems that depend on your results might not have worked properly during the day of the failure. To be able to analyse problems across your whole product, you need to trace the success or failure of your pipeline and publish this information. You can do it in a database table, by adding technical columns like ‘record_creation_date’ and ‘update_date’. You can also create another table recording success or failure for each table managed by your pipeline, at table or column level.
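
One hedged way to publish that trace is a small technical table, as in the sketch below; the table name, column names and status values are assumptions for the example, and `execute` stands for whatever warehouse client you use.

```python
# Status-tracing sketch; table/column names and status values are illustrative,
# and the SQL types and functions may need adapting to your warehouse.
CREATE_STATUS_TABLE = """
CREATE TABLE IF NOT EXISTS pipeline_run_status (
    pipeline_name        STRING,
    run_date             DATE,       -- logical run date of the pipeline
    managed_table        STRING,     -- table (or table.column) refreshed by this run
    status               STRING,     -- e.g. 'success', 'failure', 'partial_update'
    record_creation_date TIMESTAMP,
    update_date          TIMESTAMP
)
"""

INSERT_STATUS = """
INSERT INTO pipeline_run_status
VALUES (%s, %s, %s, %s, current_timestamp(), current_timestamp())
"""


def publish_run_status(execute, pipeline_name, run_date, managed_table, status):
    """Record the run outcome so downstream systems can check whether
    today's data can be trusted before using it."""
    execute(CREATE_STATUS_TABLE)
    execute(INSERT_STATUS, (pipeline_name, run_date, managed_table, status))
```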

Level 3: Quality tests during execution

A pipeline is composed of N steps. If your final step(s) expose the result, you might want to publish it only when you are sure of the quality of the data. By adding quality tests between your steps and conditioning your pipeline orchestration on the success or failure of each test, you can avoid publishing error-prone data.

Here are some categories and examples of tests (a code sketch follows the lists):

Accuracy: Is the information reliable?

  • Count the number of lines before and after joins between tables
  • Check the range of values (e.g. within [0, 1], in [category_1, category_2], etc.)
  • Check the coherence between two or more fields (e.g. start_date < end_date)

Consistency: Does information stored in one place match relevant data stored elsewhere?

  • Compare information between different sources (for the same keys, the same fields should hold the same values)

Integrity: Can you join data from different sources correctly?

  • Check that the number of distinct keys equals the size of your table (if it is supposed to be deduplicated)
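
As a sketch of what these checks might look like in code (here with pandas, but any SQL or dataframe engine works; the table names, column names and value ranges are illustrative assumptions):

```python
# Quality-test sketch with pandas; column names and the [0, 1] range are
# illustrative assumptions, not the article's actual data.
import pandas as pd


def check_accuracy(before_join: pd.DataFrame, after_join: pd.DataFrame) -> None:
    # Accuracy: the join should not drop or duplicate lines.
    assert len(after_join) == len(before_join), "join changed the number of lines"
    # Accuracy: range of values.
    assert after_join["score"].between(0, 1).all(), "score outside [0, 1]"
    # Accuracy: coherence between fields.
    assert (after_join["start_date"] < after_join["end_date"]).all(), "start_date >= end_date"


def check_consistency(source_a: pd.DataFrame, source_b: pd.DataFrame) -> None:
    # Consistency: the same keys should carry the same values in both sources.
    merged = source_a.merge(source_b, on="customer_id", suffixes=("_a", "_b"))
    assert (merged["country_a"] == merged["country_b"]).all(), "sources disagree"


def check_integrity(table: pd.DataFrame) -> None:
    # Integrity: number of unique keys == size of the (deduplicated) table.
    assert table["customer_id"].nunique() == len(table), "duplicate keys found"
```

Each check can run as its own task; if one fails, the orchestrator stops the pipeline before the publishing step.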

Level 4: Quality tests of external pipelines

Quality depends on data sources and previous pipelines. Even if you can't interfere with their processes, you need to take into account the possibility of failure in the source systems. There are three situations to be aware of:

Instability introduced by systems

Your project will begin with a data pipeline that aggregates multiple data sources, each of them depending on other product(s) / application(s).

Eventually, those products or your data pipeline will encounter bugs, interruptions or late deliveries, which will result in degraded features (inconsistent, outdated, etc.).

Let’s see an example: most of the data you use is updated between 1:00 am and 3:00 am. You want to start your data pipeline at 3:05 am daily in order to refresh your database with the latest information for the next feature updates. But if you look at the update history of those data sources, you might see some tables updated at 4:00 am, or even 4:00 pm, on random days of the year. On those days your pipeline will update only part of the data, and it may even create inconsistencies later if some features are aggregations of fields coming from both updated and outdated data sources.

That information is valuable for the data scientist when the model is defined (since the historical data used to train/validate/test the model is, by definition, up to date, whereas in production the same data may arrive late). You can create the following metrics to follow this instability over time (a computation sketch follows the list):

  • Mean Time to Late Publish (MTLP): mean delay duration, computed over late updates only.
  • Late Publish Ratio: number of late updates divided by the total number of updates.
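
A minimal sketch of these two metrics, assuming you log the expected and observed publication time of each source update (the log format itself is an assumption):

```python
# Lateness-metrics sketch; the (expected, actual) timestamp pairs are assumed
# to come from your own update log of the source tables.
from datetime import datetime, timedelta
from typing import List, Tuple


def lateness_metrics(updates: List[Tuple[datetime, datetime]]) -> dict:
    delays = [actual - expected for expected, actual in updates if actual > expected]
    return {
        # Mean Time to Late Publish: mean delay, computed over late updates only.
        "mtlp": sum(delays, timedelta()) / len(delays) if delays else timedelta(0),
        # Late Publish Ratio: late updates divided by all observed updates.
        "late_publish_ratio": len(delays) / len(updates) if updates else 0.0,
    }
```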

Instability introduced by history recovery

When the data scientist pulls historical data from the data warehouse, that data might have been updated since its first insertion: sometimes to correct errors, sometimes for business reasons tied to the rules behind the product. In that case, the corrected historical data carries more accurate information than recent data, and far more than the last real-time or batch data you will feed into your ML model. This mismatch will degrade performance in production.

Solutions:

  • Enable traceability of those changes if possible. Sometimes, when data is published, you can find technical dates like the record creation date and the update date. You can count the number of times the two dates differ (but this only tells you about mismatches at the table level, not for a particular field).
  • Create metrics to evaluate the quality of the system and monitor over time.
  • Based on the features created, check how important those features are for your model. The more important the feature, the more demanding you need to be on your metrics.

Metrics you can use to evaluate the proportion of re-writing (a computation sketch follows the list):

  • Mean Time to Correct (MTC): mean duration before information is corrected/updated
  • Correction Ratio by Line: number of lines corrected divided by the total number of lines
  • Correction Ratio by Day: number of days with at least one correction divided by the total number of days
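
If the source exposes the technical columns mentioned above (record creation date and update date), these metrics can be approximated with a single query; the sketch below uses BigQuery-flavoured SQL and an illustrative `source_table`, so adapt the functions to your warehouse.

```python
# Re-writing metrics sketch (BigQuery-flavoured SQL); `source_table` and the
# technical column names are assumptions based on the columns discussed above.
CORRECTION_METRICS_SQL = """
SELECT
    -- Mean Time to Correct: mean delay before a line is updated (corrected lines only)
    AVG(IF(update_date > record_creation_date,
           TIMESTAMP_DIFF(update_date, record_creation_date, HOUR),
           NULL))                                              AS mtc_hours,
    -- Correction Ratio by Line: corrected lines / total lines
    COUNTIF(update_date > record_creation_date) / COUNT(*)     AS correction_ratio_by_line,
    -- Correction Ratio by Day: days with at least one correction / days observed
    COUNT(DISTINCT IF(update_date > record_creation_date, DATE(update_date), NULL))
        / COUNT(DISTINCT DATE(record_creation_date))           AS correction_ratio_by_day
FROM source_table
"""
```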

Instability introduced by structures

Most ML projects consume data you don't own. The producers are the ones who define its structure, availability and life cycle, which means your table, or fields inside your table, can disappear, and structures (table schemas) can change drastically.

Prevention should happen at the company level, by introducing a rule or a system that prevents drastic schema changes, applies alerting and enforces a transition period so consumers can adapt. A simple schema guard on the consumer side also helps, as sketched below.
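
The guard compares the live schema of an external table against the contract your pipeline expects before running; the sketch below assumes you can fetch the live schema as a column-to-type mapping, and all names and types are illustrative.

```python
# Schema-guard sketch; the expected contract and column names are illustrative,
# and fetching `live_schema` from your warehouse's metadata is left abstract.
EXPECTED_SCHEMA = {
    "customer_id": "STRING",
    "order_date": "DATE",
    "amount": "FLOAT64",
}


def breaking_schema_changes(live_schema: dict, expected: dict = EXPECTED_SCHEMA) -> list:
    """Return the list of breaking changes (missing columns or changed types)."""
    problems = []
    for column, expected_type in expected.items():
        if column not in live_schema:
            problems.append(f"missing column: {column}")
        elif live_schema[column] != expected_type:
            problems.append(
                f"type changed for {column}: {expected_type} -> {live_schema[column]}"
            )
    return problems


# If this returns anything, alert (and warn the producer team) and stop the
# pipeline instead of silently producing degraded features.
```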

To conclude, if you want a stable project, you need to implement the following:

For pipeline execution:

✔️ I define/execute tests on the pipeline data (accuracy, consistency, etc.)

✔️ I define/execute tests on my pipeline code (unit tests, integration tests)

✔️ I pin precise versions of the tools I use

✔️ I have a retry policy for the pipeline

✔️ I monitor my pipelines and have an alerting system on them

✔️ I have scalable pipelines

✔️ I trace all statuses/events of my pipeline (success, late delivery, failure, etc.)

✔️ I trace all statuses of my data sources (updated, partially updated, not updated, etc.)

✔️ I trace all statuses of my raw data results (updated, partially updated, not updated, etc.)

For data quality:

✔️ I measure the quality of my pipeline (delays, partial updates, etc.)

✔️ I measure the quality of my data sources (value ranges, ratio of nulls, etc.)

✔️ I measure the quality of my raw tables (value ranges, ratio of nulls, etc.)

✔️ I review errors and quality problems at fixed intervals and try to improve my process (or the external processes of other teams)

The next part will introduce the role of a data scientist in the stability of an ML project.
