Watch Out: Monitoring in Data Science Operations

Carsten Arndt · TUI Tech Blog · May 26, 2022 · 4 min read

Comprehensive monitoring is essential for the long-term success of Data Science projects. It is particularly important for detecting issues such as defects in the data pipeline or silent failures, since these can have a major impact in the long run. Below, we describe these potential sources of failure and outline a few strategies we use at TUI to deal with them.

Silent Failures in Data Science Projects

Silent failures are failures that go undetected by the exception reporting system. For example, a misconfiguration may prevent a job scheduler from even starting a job; since the job has not technically failed but merely never ran, the scheduler's alerting won't flag it. There are also situations in which errors are not reported to the scheduler, for example when an exception is caught and swallowed, so the process exits with a zero exit code. The scheduler interprets this graceful exit as a success.
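A minimal sketch of the second case (all names are hypothetical): a retraining script catches its own exception, so the process still exits with code 0 and most schedulers record the run as successful.

```python
import sys
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("retraining_job")


def retrain_model():
    # Imagine this raises because an upstream table is missing.
    raise RuntimeError("training data not found")


def main():
    try:
        retrain_model()
    except Exception as exc:
        # The error is logged but swallowed: the process still exits
        # with code 0, so the scheduler records the run as a success.
        log.error("Retraining failed: %s", exc)


if __name__ == "__main__":
    main()
    sys.exit(0)  # graceful exit despite the failure
```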

In data science projects, silent failures can occur during scheduled model retraining, or during an update of the data needed for retraining. If the previous version of the model or data simply remains in use, the failure of the process may go unnoticed for a long time.

These kinds of problems can of course also occur in non-Data Science software systems such as data warehousing. But there, failures have an immediate effect on dependent components like dashboards, which data warehouse consumers rely on in their daily work. A graceful failure that leaves a dashboard out of date is therefore noticed quickly.

Data Pipeline Issues

Defects in the data pipeline can lead to particularly challenging problems. For example, retraining a model might require a large amount of data, and the available volume can be affected by both intended and unintended changes in the underlying data delivery processes.

In other cases, a breakdown can occur because one of the components of the data pipeline fails, for example due to the unavailability of a technical resource or because unexpected data drift creates problems for one of the process steps. If these failures are not detected, or are not communicated downstream, they can have a major impact on the quality of the underlying model.

Strategies For Successful Monitoring

For monitoring to be useful, it has to be properly adapted to the process we are trying to monitor. It is useful to distinguish between scheduled jobs and services that must be operational 24/7, such as prediction APIs.

Monitoring Services

Services provided through APIs can be monitored by calling the APIs at regular intervals. In these cases it is usually not sufficient to check whether the endpoint is available; the content of the response must also be validated to ensure that it satisfies specified conditions.

For example, we can set up a monitor to ensure that the prediction lies within acceptable limits derived from an assumed distribution, and have it send an alert whenever the prediction falls outside the permitted interval, indicating irregularities in the predictions.
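As a sketch of such a synthetic check (the endpoint URL, payload, and bounds below are hypothetical), a small script can call the prediction API, verify that it responds, and test that the returned prediction lies inside the permitted interval:

```python
import requests

# Hypothetical endpoint and bounds; in practice these come from the
# service configuration and an assumed prediction distribution.
ENDPOINT = "https://example.com/api/predict"
LOWER, UPPER = 0.0, 1.0
TIMEOUT_SECONDS = 2.0


def check_prediction_endpoint() -> list[str]:
    """Return a list of problems found by one synthetic call."""
    problems = []
    try:
        response = requests.post(
            ENDPOINT,
            json={"customer_id": "synthetic-check"},
            timeout=TIMEOUT_SECONDS,
        )
    except requests.RequestException as exc:
        return [f"endpoint unreachable: {exc}"]

    if response.status_code != 200:
        problems.append(f"unexpected status code {response.status_code}")
        return problems

    prediction = response.json().get("prediction")
    if prediction is None:
        problems.append("response contains no prediction")
    elif not (LOWER <= prediction <= UPPER):
        problems.append(f"prediction {prediction} outside [{LOWER}, {UPPER}]")
    return problems


if __name__ == "__main__":
    for problem in check_prediction_endpoint():
        print(f"ALERT: {problem}")  # in practice: push to the alerting system
```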

[Figure: per-minute monitoring of services such as API calls, checking both availability and latency; irregularities are visible, for example on 12 February.]

Monitoring Jobs

The monitoring of scheduled jobs comes with its own distinct challenges. While services can be monitored by making synthetic calls purely for monitoring purposes, scheduled jobs are executed with a given frequency (for example daily), so their logs can only be gathered and analyzed during or after an execution. Moreover, while failures during the execution are usually recognized by the scheduling tools, it is harder to detect that a job never even started.

One solution is to associate a very specific log message with the successful execution of the job, such as the line "All retraining steps succeeded." The log monitoring tool can then be configured to look for that exact message within a given time interval and to send an alert if it does not arrive. The monitoring can be further refined to watch not only the successful execution of the whole job but also to check for irregularities in individual process steps.
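A minimal sketch of the monitoring side of this idea, assuming the log records have already been fetched from the log backend as (timestamp, message) pairs; the message text and the 24-hour interval are examples:

```python
from datetime import datetime, timedelta, timezone

SUCCESS_MESSAGE = "All retraining steps succeeded."
EXPECTED_INTERVAL = timedelta(hours=24)  # the job is scheduled daily


def heartbeat_missing(log_records, now=None):
    """Return True if no success message arrived within the expected interval.

    `log_records` is an iterable of (timestamp, message) tuples; how they are
    retrieved from the log backend is left out of this sketch.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - EXPECTED_INTERVAL
    return not any(
        ts >= cutoff and SUCCESS_MESSAGE in msg for ts, msg in log_records
    )


if __name__ == "__main__":
    records = [
        (datetime.now(timezone.utc) - timedelta(hours=30), SUCCESS_MESSAGE),
    ]
    if heartbeat_missing(records):
        print("ALERT: retraining job did not report success in the last 24h")
```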

Monitoring Data Pipelines

In a similar way, log messages can be used to check the health of the data pipelines. For example, the logs can be enriched with information such as summary statistics or the volume of the data processed during a particular execution step. Monitors can then be configured to raise alerts if the data volume falls below a given threshold or if data drift is detected.
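One way to do this (a sketch; the step name, threshold, and chosen metrics are illustrative) is to emit a structured log line with row counts and simple statistics at the end of each pipeline step, which a log monitor can parse and compare against thresholds:

```python
import json
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("feature_pipeline")

MIN_EXPECTED_ROWS = 100_000  # hypothetical volume threshold


def log_step_metrics(step_name: str, df: pd.DataFrame) -> None:
    """Emit a structured log line with the volume and summary statistics of a step."""
    metrics = {
        "step": step_name,
        "row_count": len(df),
        "null_fraction": float(df.isna().mean().mean()),
        "numeric_means": df.mean(numeric_only=True).round(3).to_dict(),
    }
    log.info("pipeline_metrics %s", json.dumps(metrics))

    # A monitor parsing these log lines could raise the same alert;
    # here we simply check the threshold in-process as well.
    if metrics["row_count"] < MIN_EXPECTED_ROWS:
        log.warning(
            "pipeline_alert low data volume in step %s: %d rows",
            step_name,
            metrics["row_count"],
        )


if __name__ == "__main__":
    sample = pd.DataFrame({"price": [10.0, 12.5, None], "clicks": [3, 5, 7]})
    log_step_metrics("join_bookings", sample)
```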

Anticipatory Monitoring

Monitoring can be used not only to detect critical failures, but also for anticipating future challenges. For example, the graph below shows the volume of data ingested for a feature engineering job. Note the increase of data volume over time, indicating a potential need to allocate greater resources in the future and update various thresholds for alerts.

[Figure: data volume ingested by a feature engineering job that runs once a day.]
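A simple way to turn such a volume history into an early warning (a sketch with hypothetical numbers) is to fit a linear trend to the daily volumes and project it forward against a known capacity limit:

```python
import numpy as np

CAPACITY_ROWS = 5_000_000  # hypothetical resource limit
HORIZON_DAYS = 90          # how far ahead to project


def projected_volume(daily_volumes, horizon_days=HORIZON_DAYS):
    """Fit a linear trend to daily data volumes and project it forward."""
    days = np.arange(len(daily_volumes))
    slope, intercept = np.polyfit(days, daily_volumes, deg=1)
    return slope * (len(daily_volumes) + horizon_days) + intercept


if __name__ == "__main__":
    # Hypothetical history: roughly 1% growth per day around 1M rows.
    history = 1_000_000 * (1 + 0.01 * np.arange(180)) + np.random.normal(0, 20_000, 180)
    forecast = projected_volume(history)
    if forecast > CAPACITY_ROWS:
        print(f"ALERT: projected daily volume {forecast:,.0f} exceeds capacity")
    else:
        print(f"Projected daily volume in {HORIZON_DAYS} days: {forecast:,.0f} rows")
```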

Conclusion

Monitoring has become a central tool in our daily routine. It enables us to quickly detect problems in production. It also allows us to create various performance metrics that help us to prepare for future developments.

These capabilities help us improve performance and make the machine learning process more resilient.
