Why you can’t scale your models without scaling your data: the less-talked-about side of MLOps

For our deep learning models, a data-first approach to ML infrastructure bridged the gap between Data Engineering and ML

Mikhail Tsirlin
motive-eng
8 min read · Jun 29, 2021


At KeepTruckin, we have several state-of-the-art deep learning models deployed to our dashcams. These models are used to detect lane lines, other vehicles on the road, and even driver distraction (cell phone use, smoking, etc.). Read more about how we serve these models at scale here.

Deep learning models deployed to our dashcams are used to detect lane lines, other vehicles on the road, and even driver distraction

As an intern on the Data & ML Platform team at KeepTruckin, I was tasked with designing a system that addresses issues regarding the training data for the models. Here’s how (and why) it was done.

For our models to be of any use, they must be trained on tons of data, and a lot has to happen before the training phase. For starters, the underlying infrastructure should handle sourcing the data, then preparing and storing it in a format suitable for training. In addition, ML engineers must be able to iterate on their models with the latest data and track their experiments along the way.

A lack of proper automation on this front can result in complex manual processes, often involving intermediate output stored in files and progress tracked in spreadsheets. These so-called pipeline jungles are unsustainable and should be avoided at all costs. The underlying ML infrastructure must also allow for flexibility and reproducibility of experiments; we discuss this more later in the post. To learn more about the challenges that arise in production ML systems, check out Google's 2015 research paper Hidden Technical Debt in Machine Learning Systems (the paper that helped give rise to the concept of MLOps).

Sourcing training data

This part of the process can be company-specific and surprisingly complex. At a high level, here’s what the process might look like for KeepTruckin’s vision models:

What our source training data cycle would look like without automation

As you can see, this cycle makes for a complex and repetitive process, with information to track along the way. Doing everything manually can take days. Again, this may vary greatly by use case, but if you ever find yourself repeating the same tasks over and over again, and having to keep track of things in spreadsheets, it may be time to redesign your data collection system.

And that’s what we did! First, we automated the process described above, turning raw dashcam videos into usable training data. We then designed a system to store, process, and update that training data after it’s been sourced, annotated, and uploaded to the cloud.

I have my training data; now what?

Suppose our training data is stored in the cloud, say in S3. We now want to keep it in a more structured, queryable format. We also want to track the training, test, and validation sets that ML engineers use for training jobs (more on this later). Lastly, we want to continuously update this training data store as new training data arrives. At KeepTruckin, we built a training data store backed by a Postgres database; here’s how it works:

Our data store holds three entities: examples, data sets, and snapshots (a rough schema sketch follows the list below).

The three entities we store in our data store
  • An example corresponds to one piece of training data: we store the location of the frame in S3, the location of its annotation in S3, and some tags associated with the frame (these may include time of day, road conditions, camera type, etc.).
  • A data set is a logical grouping of examples such that a given model only uses training data from one data set. This may be something like road_facing_data or driver_facing_data.
  • A snapshot is an immutable train/test/validation split of examples used for training jobs. A snapshot can be created in arbitrary ways; that is, you can query the examples in your snapshot in any way you like.
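As a rough illustration, here is how these three entities might be modeled with SQLAlchemy (1.4+) on top of Postgres. The table and column names are illustrative, not our exact schema.

```python
from sqlalchemy import JSON, Column, DateTime, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Dataset(Base):
    __tablename__ = "datasets"
    id = Column(Integer, primary_key=True)
    name = Column(String, unique=True, nullable=False)  # e.g. "road_facing_data"

class Example(Base):
    __tablename__ = "examples"
    id = Column(Integer, primary_key=True)
    # Assigned later, when the example is added to its data set.
    dataset_id = Column(Integer, ForeignKey("datasets.id"))
    frame_s3_path = Column(String, nullable=False)       # location of the frame in S3
    annotation_s3_path = Column(String, nullable=False)  # location of its annotation in S3
    tags = Column(JSON)                                  # time of day, road conditions, camera type, ...
    created_at = Column(DateTime, nullable=False)        # used to pick the latest re-annotation

class Snapshot(Base):
    __tablename__ = "snapshots"
    id = Column(String, primary_key=True)
    dataset_id = Column(Integer, ForeignKey("datasets.id"), nullable=False)
    description = Column(String)
    # Each split is a text file of example IDs stored in S3.
    train_split_s3_path = Column(String)
    test_split_s3_path = Column(String)
    validation_split_s3_path = Column(String)
    created_at = Column(DateTime, nullable=False)
```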

More on snapshots

The defining property of snapshots is immutability: once you create a snapshot, its contents never change. This means that adding new training examples to a data set does not affect any existing snapshots. Likewise, if at any point we re-label an image, we don’t modify the existing example record; we add a new one instead. As a result, when querying examples for a snapshot, we have to make sure to always take the latest one.

Why is this important? Because of reproducibility! Immutable snapshots guarantee that an ML researcher can reproduce training experiments at any time. It also means that a researcher can train several different models on the same snapshot, at any point in time, and be certain that each was trained on exactly the same data.
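In practice, “always take the latest one” can be as simple as keeping, per frame, the example record with the most recent creation time. A minimal sketch, using the field names from the illustrative schema above:

```python
def latest_examples(examples):
    """Keep only the most recent example per frame, so a re-labeled frame
    supersedes its older annotation without mutating any existing record."""
    latest = {}
    for ex in examples:
        seen = latest.get(ex.frame_s3_path)
        if seen is None or ex.created_at > seen.created_at:
            latest[ex.frame_s3_path] = ex
    return list(latest.values())
```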

How do we get the latest training data?

Overview of the training data pipeline flow

To let our ML researchers re-train their models with the latest training data as it comes in, we built an API on top of Airflow that lets ML engineers easily build data pipelines for their training data. To construct the pipelines, we borrowed the Extract-Transform-Load (ETL) pattern from data engineering. Here’s how this process works (with a code sketch after the steps):

  1. We extract the locations of frames, the locations of annotations, and the metadata for each example. This is easy because we tracked our training data at every step of the sourcing process.
  2. This data is transformed into example objects to be stored in our training data store.
  3. We load the examples into our data store.
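As a rough illustration, reusing the illustrative Example model from the schema sketch above (fetch_sourced_examples and insert_examples are hypothetical stand-ins for our internal helpers), an ETL run might look like this:

```python
def run_example_etl(start_time, end_time):
    # Extract: frame locations, annotation locations, and metadata recorded
    # during the sourcing step, limited to the given time window.
    raw_records = fetch_sourced_examples(start_time, end_time)  # hypothetical helper

    # Transform: build Example objects for the training data store.
    examples = [
        Example(
            frame_s3_path=r["frame_s3_path"],
            annotation_s3_path=r["annotation_s3_path"],
            tags=r.get("tags", {}),
            created_at=r["created_at"],
        )
        for r in raw_records
    ]

    # Load: write the examples into the Postgres-backed training data store.
    insert_examples(examples)  # hypothetical helper
```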

Next, another action occurs within the data store itself: we assign the new examples to their corresponding data set.

Lastly, another ETL process creates a snapshot (again, sketched in code after the steps):

  1. We query the examples we want to include in the snapshot. This is the extract step.
  2. Several transformations follow: first, we deduplicate the output of this query. If we encounter any re-annotated examples, we only take the latest one. Next, we randomly split the examples according to a train/test/validation split ratio. We also create an ID and a custom description of the snapshot, depending on how it was created.
  3. For each split, we upload a text file containing the IDs of the examples in the split to S3, and we store the path to this file along with all the other snapshot metadata. This is arguably the load step.
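Putting these steps together, snapshot creation might look roughly like the following; query_examples, upload_id_file, and insert_snapshot are hypothetical stand-ins for internal helpers, and latest_examples is the dedup helper sketched earlier:

```python
import random
import uuid

def create_snapshot(dataset_name, start_time, end_time, ratios=(0.8, 0.1, 0.1)):
    # Extract: query the examples we want to include in the snapshot.
    examples = query_examples(dataset_name, start_time, end_time)  # hypothetical helper

    # Transform: deduplicate (latest annotation wins), then split randomly
    # according to the train/test/validation ratios.
    examples = latest_examples(examples)  # dedup helper sketched earlier
    random.shuffle(examples)
    n_train = int(ratios[0] * len(examples))
    n_test = int(ratios[1] * len(examples))
    splits = {
        "train": examples[:n_train],
        "test": examples[n_train:n_train + n_test],
        "validation": examples[n_train + n_test:],
    }

    # Load: upload a text file of example IDs per split to S3, then record
    # the snapshot's ID, description, and split file paths in the data store.
    snapshot_id = str(uuid.uuid4())
    split_paths = {
        name: upload_id_file(snapshot_id, name, [ex.id for ex in split])  # hypothetical
        for name, split in splits.items()
    }
    insert_snapshot(snapshot_id, dataset_name, split_paths,               # hypothetical
                    description=f"{dataset_name} from {start_time} to {end_time}")
```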

To summarize, our pipeline fetches examples from S3, adds them to the dataset, and creates one or more snapshots.

Now, the interesting part:

How we built it

Apache Airflow as our task scheduler, with our tasks running in Kubernetes pods

We used Apache Airflow as a task scheduler, with tasks running in Kubernetes pods.

Recall that this pipeline has three parts: the first fetches examples, the next refreshes the data set, and the last creates a snapshot. Each corresponds to a Docker image, named run_example, run_dataset, and run_snapshot, respectively.

We then construct an Airflow DAG file, a Python file that tells Airflow what to do. For each component, we define a Kubernetes Pod Operator: an operator provided by Airflow that spins up a Kubernetes pod. We tell the operator to do the following inside the pod:

  • Pull the corresponding Docker image.
  • Run the image with the given arguments (start_time, end_time, and component-specific arguments such as the name of the component).

In this DAG file we specify the order of execution: first run_example, then run_dataset, and lastly, run_snapshot.

The order of execution in our DAG file
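A trimmed-down DAG file along these lines might look like the sketch below. The image names, namespace, and template macros are illustrative, and the import path assumes Airflow 2 with the Kubernetes provider installed; exact names vary across Airflow versions.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="training_data_pipeline",
    schedule_interval="@daily",
    start_date=datetime(2021, 6, 1),
    catchup=False,
) as dag:

    def pod_task(task_id, image):
        # Each component runs its Docker image in its own Kubernetes pod and
        # receives the time window it should operate on.
        return KubernetesPodOperator(
            task_id=task_id,
            name=task_id,
            namespace="ml-pipelines",  # illustrative namespace
            image=image,
            arguments=[
                "--start-time", "{{ execution_date }}",
                "--end-time", "{{ next_execution_date }}",
            ],
            get_logs=True,
        )

    run_example = pod_task("run_example", "run_example:latest")
    run_dataset = pod_task("run_dataset", "run_dataset:latest")
    run_snapshot = pod_task("run_snapshot", "run_snapshot:latest")

    # Order of execution: examples first, then the data set refresh, then the snapshot.
    run_example >> run_dataset >> run_snapshot
```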

Note that this is all abstracted away from the user: we let the user declare each example, data set, or snapshot task as a simple instance of a Python object. Each of these objects has an .update() method, which takes the relevant parameters and executes the required task; this is essentially what runs inside the Docker images. The user then declares a TrainingDataPipeline object, with parameters such as the cron schedule and the corresponding example, data set, and snapshot objects as dependent tasks. The pipeline object has a generate_dag_file() method which (you guessed it) does all the hard work behind the scenes and generates the DAG file for Airflow to pick up.
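From the user’s point of view, declaring a pipeline might look roughly like this. TrainingDataPipeline and generate_dag_file() are the pieces described above; the task class names and parameters are hypothetical, since the exact API is internal:

```python
# Hypothetical usage of the internal API described above; the task class names
# and their parameters are illustrative.
example_task = ExampleTask(name="road_facing_examples")
dataset_task = DatasetTask(name="road_facing_data")
snapshot_task = SnapshotTask(name="road_facing_snapshot", split_ratios=(0.8, 0.1, 0.1))

pipeline = TrainingDataPipeline(
    name="road_facing_training_data",
    schedule="0 2 * * *",  # cron schedule: run nightly at 02:00
    tasks=[example_task, dataset_task, snapshot_task],
)

# Writes out the Airflow DAG file that wires the three tasks together in order:
# examples -> data set -> snapshot.
pipeline.generate_dag_file()
```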

On incrementality and reproducibility

Because we don’t want to perform any more computation than is necessary, we can make our pipelines incremental. For instance, we can configure the pipeline to only fetch examples from the past 30 days, and only create snapshots with those new examples (or create a snapshot from all training data up to the current date).

But what if we configure the pipeline this way, trigger a run, and then run the pipeline again with the exact same input parameters 10 days later? Any new training data would be fetched, added to our data set, and end up in a snapshot containing a different set of examples than the previous run’s. This is a problem: we triggered the same pipeline with the same inputs at two different times and got two different outputs. For the sake of reproducibility, we enforce idempotence in our pipelines: each component works over data within a given start and end time, modifies only data within that range, and deduplicates on each run.
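A sketch of what the windowing plus deduplication might look like for the example component; the helper functions are hypothetical stand-ins for our internal data-store operations:

```python
def run_example_component(start_time, end_time):
    # Work only over the given [start_time, end_time) window, so a re-run with
    # the same parameters sees the same source data.
    records = fetch_sourced_examples(start_time, end_time)        # hypothetical helper

    # Deduplicate against rows a previous run already loaded for this window,
    # so repeated runs never insert the same example twice.
    already_loaded = {
        e.frame_s3_path
        for e in query_loaded_examples(start_time, end_time)      # hypothetical helper
    }
    fresh = [r for r in records if r["frame_s3_path"] not in already_loaded]

    insert_examples([build_example(r) for r in fresh])            # hypothetical helpers
```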

Conclusion

While much of ML infrastructure is concerned with training, monitoring, and serving models, before any of that happens, we need training data. We also need to store this data in a structured, queryable way, and we need to be able to provide train/test/validation splits that are reproducible. Reproducibility comes up a lot. The purpose of this post was to highlight the importance of managing ever-growing training data sets for production ML models, and to describe how we at KeepTruckin built a system that addresses these issues. By borrowing important concepts from data engineering, we were able to build robust infrastructure that saves days of repetitive and tedious manual work. It’s no secret that training data is important in ML, but so is the way you handle it.

Come join us!

Check out our latest KeepTruckin opportunities on our Careers page and visit our Before You Apply page to learn more about our rad engineering team.
