From Batch to Stream, Bringing New School MLOps to Old School Analytics @ CarMax

Chase D Greco
CarMax Engineering Blog
Jul 25, 2024

The machine learning engineering practice at CarMax has expanded greatly since its beginnings in 2018. In that time we’ve more than tripled the number of engineers working in this space and expanded our use cases to several areas of the enterprise. From our humble beginnings exporting a single model from a data scientist’s laptop to powering several iconic customer experiences, we have grown and evolved our engineering practice. Along the way, we overcame several engineering challenges, delivered big wins as a team, and ultimately enabled a better car buying experience for our customers.

In this blog post, we will provide an overview of how our engineering practice has evolved, some of the challenges we have overcome, and how we are excited to continue to grow in the future!

A Laptop, a Model, and a Dream

Analytics, and later data science, have been established at CarMax for quite some time. Since the early days of CarMax, we have used analytics to drive many of our decision-making processes, most notably how we price our vehicles. Eventually, however, our practice evolved to the point where we wanted customers to engage with the output of our models directly through our website, in the form of a recommendation system. In the early prototype, a data scientist would run the recommendation model on their laptop and email the results to the front-end team every evening so they could update vehicle rankings on our website the next day! Clearly this was not a very sustainable system, but by measuring customer engagement with this early prototype we learned that there was a large business opportunity if we could make the system more scalable and robust. Thus the dream of embedding machine learning directly in our products began.

Notebooks, Pipelines, and Python, Oh My!

Given this initial prototype, the challenge was set to turn it into a regularly running and consumable pipeline. This gave rise to the first ML engineering pattern, the Notebook Pipeline:

Figure 1: A Simple Notebook Pipeline

The notebook pipeline has a lot of advantages: the code is written in Python within a Jupyter notebook (we used Databricks as our notebook environment), which means very little, if any, translation of the data scientists’ code needs to be done to productionize the pipeline. These notebooks are then strung together in a “pipeline,” with each notebook reading from and writing to a Delta table and the final result saved to an output Delta table to be consumed by a downstream team. To schedule and manage the pipeline, a workflow orchestration solution such as Azure Data Factory can be leveraged to run the notebooks at set intervals.
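To make this concrete, here is a minimal sketch of what a single stage of such a notebook pipeline might look like. The table names and scoring logic are purely illustrative placeholders, not our actual code:

```python
# Illustrative sketch of one notebook "stage" in a notebook pipeline.
# Table names and the transformation are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

# In a Databricks notebook a SparkSession is provided automatically;
# getOrCreate() simply reuses it.
spark = SparkSession.builder.getOrCreate()

# Read the Delta table written by the previous notebook in the pipeline
vehicles = spark.read.table("pipeline.vehicle_features")

# Apply this stage's transformation (a trivial scoring placeholder)
scored = vehicles.withColumn(
    "recommendation_score",
    F.col("page_views") * 0.5 - F.col("days_on_lot") * 1.0,
)

# Write the result to a Delta table for the next notebook (or a downstream team)
scored.write.format("delta").mode("overwrite").saveAsTable("pipeline.vehicle_scores")
```

Each notebook in the chain follows the same read-transform-write shape, which is what makes it so quick to turn a data scientist’s notebook into a production stage.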

This approach allowed us to rapidly develop a more sustainable pattern to get the recommendations model off of an individual laptop and hosted in the cloud, a big win! However, as the complexity of our pipelines began to grow, this simple approach started to show some weaknesses.

Beyond the Batch, Bringing in “Real-Time Data”

As the requirements for our pipelines continued to evolve, we increasingly wanted to incorporate more “real-time” data about our customers to provide a more personalized experience. Unfortunately, the “batched” nature of our pipelines meant we could not execute our batches fast enough to incorporate this up-to-date information. This gave rise to our second major pattern, the Online/Offline Pipeline:

Figure 2: An Online/Offline Pipeline

As the name implies, the Online/Offline Pipeline is broken into two parts:

The Offline portion of the pipeline is essentially the Notebook Pipeline, which is responsible for executing batch inference of the model.

The Online portion of the pipeline is where things begin to differ. It is responsible for ingesting the real-time source of data, sending it to the offline portion of the pipeline, and incorporating that data into post-processing of the model predictions before sending them to the downstream consumer. In building this portion of our pipelines we took our first steps into stream processing, using technologies such as Azure Event Hubs and Azure Functions. We expanded our data storage options to NoSQL with Cosmos DB, and for the first time ever, we exposed model predictions behind an API rather than just exporting them to a table!
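To illustrate the shape of the online portion, here is a hedged sketch of a stream-ingestion component that consumes customer events from Event Hubs and persists them to Cosmos DB for use in post-processing. The connection settings, names, and event schema are hypothetical, and in practice this logic would typically live in an Azure Function bound to the Event Hub rather than a standalone consumer:

```python
# Illustrative sketch of the "online" ingestion path: consume customer events
# from Event Hubs and persist them to Cosmos DB. All names and the event schema
# are hypothetical placeholders.
import json
import os

from azure.cosmos import CosmosClient
from azure.eventhub import EventHubConsumerClient

cosmos = CosmosClient(os.environ["COSMOS_URL"], credential=os.environ["COSMOS_KEY"])
container = (
    cosmos.get_database_client("personalization")
    .get_container_client("customer_events")
)

def on_event(partition_context, event):
    # Each event is assumed to be a JSON document describing a customer interaction
    doc = json.loads(event.body_as_str())
    container.upsert_item(doc)  # assumes the document carries an "id" field
    partition_context.update_checkpoint(event)

consumer = EventHubConsumerClient.from_connection_string(
    os.environ["EVENTHUB_CONN"],
    consumer_group="$Default",
    eventhub_name="customer-events",
)

with consumer:
    # "-1" starts reading from the beginning of each partition
    consumer.receive(on_event=on_event, starting_position="-1")
```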

This pattern continued to expand and evolve as we integrated ML beyond recommendations to other parts of the business, creating capabilities in Marketing, Media Management, and Customer Service. However, yet again, as the complexity of the system increased, this pattern began to have some drawbacks, leading us to the pattern we use today.

Microservices, the Pickup Truck of MLOps

The online/offline pattern served us well as we expanded our footprint, but had a number of drawbacks:

• Multiple deployment stacks & technologies — Our engineers had to maintain “offline” infrastructure in the form of notebooks and “online” infrastructure in the form of function apps. This made testing our pipelines harder and added the complexity of managing multiple deployments with inter-dependencies on one another.

• Duplicated functionality — Many of our model pipelines leveraged the same sources of data; however, each pipeline had its own data ingestion logic, and there was no easy way to share logic or components between pipelines, resulting in a lot of duplicated code and effort on the engineering side.

• Brittle environments — As a direct result of the first two points, these pipelines became very brittle, and it was difficult to diagnose the cause of failures in the system and push fixes out to each individual pipeline.

As a result of these challenges, in 2022 we re-evaluated our approach and adopted the paradigm we utilize today, Microservices.

Figure 3: A Simple ML Microservice Architecture

As illustrated above, this approach looks similar to the “Online” portion of the Online/Offline pipeline but has a few key differences:

• Separation of responsibilities — We’ve now split the data ingestion function and the model inference function into two separate, independently deployable microservices.

• Unified technology stack — Since there is no longer an “offline” portion of the pipeline, there is no dependency on Jupyter notebooks to orchestrate pipeline components, nor on the infrastructure to support them, which is a large reduction in the footprint of our technology stack.

• Reusable components — Due to the modularity of the architecture, several components can be re-used for multiple different business use cases, including entire microservices!

• Models hosted as endpoints — Instead of embedding model artifacts directly within the pipeline, we host them as their own separate endpoints within Azure ML (see the sketch below).
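As a rough illustration of the last point, here is a minimal sketch of a model-inference microservice that delegates scoring to a model hosted as an Azure ML online endpoint. The service route, payload shape, and environment variables are hypothetical placeholders, not our actual service:

```python
# Illustrative sketch of a model-inference microservice that calls a model
# hosted as an Azure ML online endpoint. Names and payload shape are hypothetical.
import os

import requests
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# e.g. https://<endpoint>.<region>.inference.ml.azure.com/score
SCORING_URI = os.environ["AZUREML_SCORING_URI"]
API_KEY = os.environ["AZUREML_API_KEY"]

class RecommendationRequest(BaseModel):
    customer_id: str
    recent_vehicle_ids: list[str]

@app.post("/recommendations")
def recommend(request: RecommendationRequest):
    # Forward the request to the Azure ML endpoint; the model artifact itself is
    # no longer embedded in this service.
    response = requests.post(
        SCORING_URI,
        json=request.model_dump(),
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()
```

Because the model lives behind its own endpoint, it can be retrained and redeployed without touching the microservice that calls it.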

This paradigm shift has been a dramatic accelerator for ML engineering at CarMax, reducing delivery times for new capabilities from being measured in quarters to being measured in weeks! Importantly, the ability to have data source services feed multiple different model inference services has meant that for new projects, if data already exists in a source service, we do not have to re-implement the data sourcing logic, saving even more engineering time and resources.

The Path Forward

The journey of MLOps at CarMax has been an exciting one. From humble beginnings deploying a single model off a laptop to large-scale enterprise microservice architectures, we are just getting started in making buying or selling a vehicle at CarMax the best possible experience it can be. Building off the strong foundation we have assembled over the past two years, we are excited to tackle adding even more advanced capabilities into the system as machine learning continues to become a core component of our business.

If solving these types of challenges sounds exciting, we encourage you to apply to join us!


Chase D Greco
CarMax Engineering Blog

Principal machine learning engineer at CarMax with 5+ years of industry experience and a love of exploring all things AI & ML and "making things real".