Reporting an ML job's failure due to a hard crash, or: how to tell people you have died while working

Sander Land · Cognite · Jun 19, 2020

At Cognite we routinely need to deploy and scale computationally intensive machine learning models for data contextualization, such as entity matching and image recognition. Users submit their jobs through one of our APIs, and then we put the job in a message queue and wait for one of our workers to pick it up.

Our workers are written in Python, since it's the most commonly used programming language in this domain. They are deployed and scaled in the cloud using Kubernetes, which also restarts them when they crash, so you would think the models would be guaranteed to run robustly. However, it turns out there are a few pitfalls to running models in a scalable and robust way. This post walks through some of them and will hopefully help you avoid them.

First approach

Our API server takes requests for machine learning model fitting and predictions. We upload the data to cloud storage and then use pub/sub to notify our workers that a new job is waiting for them. This ensures that the API server is quick to respond and can scale independently of the heavier ML tasks.
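Sketched with Google Cloud Pub/Sub as the message bus (an assumption for this illustration, with made-up project and topic names), the hand-off from the API server amounts to publishing a small message that points at the uploaded data:

```python
import json
from google.cloud import pubsub_v1  # assumption: Google Cloud Pub/Sub as the message bus

publisher = pubsub_v1.PublisherClient()
# Hypothetical project and topic names, for illustration only.
topic_path = publisher.topic_path("my-project", "ml-jobs")

def notify_workers(job_id: str, data_url: str) -> None:
    """Tell the workers that job data is waiting for them in cloud storage."""
    payload = json.dumps({"job_id": job_id, "data_url": data_url}).encode("utf-8")
    # publish() returns a future and sends in the background, so the API call stays fast.
    publisher.publish(topic_path, payload)
```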

A worker node holds a fixed machine learning model and processes messages one at a time: parse the job, run the model, report success or failure, and acknowledge the message.
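In rough outline, and with a hypothetical update_job_status helper standing in for our internal status-reporting API call (the message here is assumed to be a Pub/Sub-style message with .data and .ack()):

```python
import json

def update_job_status(job_id: str, status: str, error: str = "") -> None:
    """Hypothetical stand-in for the API call that records a job's status."""
    print(f"job {job_id}: {status}" + (f" ({error})" if error else ""))

def handle_message(message, model) -> None:
    """Process one queued job with the worker's fixed machine learning model."""
    job = json.loads(message.data)
    update_job_status(job["job_id"], "Running")
    try:
        model.run(job)                            # the actual fit/predict work
        update_job_status(job["job_id"], "Completed")
    except Exception as exc:                      # ordinary failures are caught and reported
        update_job_status(job["job_id"], "Failed", error=str(exc))
    finally:
        message.ack()                             # acknowledge so the message is not redelivered
```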

The problem

Sometimes a user submits an unreasonably large job, or a recent update of an algorithm turns out to be a bit more resource-intensive than expected. This will cause the worker to run out of memory, and the OOM Killer in Linux will kill the process. We now have what is effectively an infinite loop:

  • Receive a pub/sub message with a job.
  • Run the job, which kills the entire worker, so we never get an exception and never get to report the failure.
  • Kubernetes restarts the worker, which may then receive the very same message again.

Our resulting memory usage graph looks like shark-infested waters:

Memory use showing a worker pod repeatedly crashing and restarting

Our solution

To prevent the job from taking down the entire worker pod, we run each job in a separate process. Now the children die alone while their parent lives!
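A minimal sketch of the idea, using the standard multiprocessing module rather than our actual worker code: the job runs in a child process, so an out-of-memory kill takes down only that child, and the parent can still report the failure.

```python
import multiprocessing as mp

def run_job(job: dict) -> int:
    """Stand-in for the memory-hungry model code."""
    return sum(range(job["n"]))

def _child(queue, job):
    queue.put(run_job(job))

def run_job_in_child_process(job: dict) -> int:
    """Run one job in a child process; a hard crash then only kills the child."""
    queue = mp.Queue()
    proc = mp.Process(target=_child, args=(queue, job))
    proc.start()
    proc.join()                     # wait for the job to finish or the child to die
    if proc.exitcode != 0:
        # The child was killed (the OOM killer gives exit code -9); the parent
        # survives and can mark the job as failed.
        raise RuntimeError(f"job crashed with exit code {proc.exitcode}")
    return queue.get()              # fine for small results; big ones need more care

if __name__ == "__main__":
    print(run_job_in_child_process({"n": 1000}))
```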

Another problem

However, Python's standard process pools do not handle a worker dying abruptly very well.
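Here is a minimal illustration with concurrent.futures, standing in for the original example: once one worker process dies abruptly, the pending job fails with BrokenProcessPool and the whole pool is marked broken, so every later submission fails too.

```python
import os
from concurrent.futures import ProcessPoolExecutor

def crash() -> None:
    os._exit(1)     # simulate the process being killed, e.g. by the OOM killer

def fine() -> int:
    return 42

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=1) as pool:
        try:
            pool.submit(crash).result()
        except Exception as exc:
            print("first job:", type(exc).__name__)    # BrokenProcessPool
        try:
            print(pool.submit(fine).result())          # never runs...
        except Exception as exc:
            print("second job:", type(exc).__name__)   # ...the pool stays broken
```

multiprocessing.Pool fares no better: the killed worker is replaced, but the result for its task never arrives, so a plain .get() blocks forever.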

Our solution, part 2

To prevent this, we wrote our own process pool library, which restarts worker processes when they die and surfaces the crash as an exception on the job that was running. Now we can update the job status to failed, and everything works.

In addition, since our workers often load complex machine learning models, we prefer the worker to be a class by default, rather than a function, so it’s easier to pre-load the model in its initializer.
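As an illustration of that pattern (hypothetical code, not the library's actual interface), the worker class loads the model once in its initializer, and every job then reuses it:

```python
import pickle

class PredictionWorker:
    """Hypothetical worker class: the expensive set-up happens once per process."""

    def __init__(self, model_path: str) -> None:
        with open(model_path, "rb") as f:        # load the (possibly large) model once
            self.model = pickle.load(f)

    def __call__(self, job: dict):
        return self.model.predict(job["data"])  # every job reuses the pre-loaded model
```

A plain function would have to reload the model for every job or stash it in a global; a class keeps the one-time set-up in one obvious place.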

The package is open-source and available here.

A final problem

Like the standard library's pools, we use multiprocessing.Queue to communicate between processes.

multiprocessing.Queue uses pickle to serialize objects, but it turns out the pickling happens in a background feeder thread. That makes it impossible for the caller of put() to catch serialization errors such as exceeding the maximum recursion depth: put() returns as if everything were fine, and the item simply never arrives on the other side.
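A small reproduction of the problem (this is standard multiprocessing behaviour, nothing specific to our workers):

```python
import multiprocessing as mp
import queue

def deeply_nested(depth: int = 10_000) -> list:
    x: list = []
    for _ in range(depth):
        x = [x]                     # pickling this later exceeds the recursion limit
    return x

if __name__ == "__main__":
    q = mp.Queue()
    q.put(deeply_nested())          # returns happily: pickling happens later,
                                    # in a background feeder thread
    try:
        q.get(timeout=2)
    except queue.Empty:
        # The RecursionError only appears as a traceback printed by the feeder
        # thread; the caller of put() never sees an exception, and the item is lost.
        print("item never arrived")
```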

We have reported this issue at https://bugs.python.org/issue40195

Our solution, part 3

Our workaround for this final issue is simple: pickle the arguments ourselves, catch any exceptions, and only put the already-serialized bytes on the queue.
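A sketch of the idea, not the library's exact code:

```python
import pickle

def safe_put(q, obj) -> None:
    """Pickle in the calling thread, so serialization errors can actually be caught."""
    try:
        payload = pickle.dumps(obj)
    except Exception as exc:        # e.g. RecursionError on deeply nested objects
        raise RuntimeError(f"could not serialize job payload: {exc!r}") from exc
    q.put(payload)                  # plain bytes always serialize cleanly

# On the receiving side: obj = pickle.loads(q.get())
```

Because the pickling now happens in the caller's thread, the failure surfaces immediately and can be reported as a failed job like any other exception.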

Conclusion

Rolling your own robust, scalable backend for large machine learning tasks is challenging, but not impossible. We hope this article helps you avoid some of the pitfalls we encountered along the way.

What challenges have you run into, and how did you solve them? Let us know in the comments.
