ML in Production @ CARS24 (Part 1)

CARS24 DS / ML
CARS24 Data Science Blog
May 17, 2022

ML Engineering, or MLOps, is one of the least understood yet most important domains in the data science ecosystem. For a long time, many of the models that were developed never made it into the real world, and when they did, they often behaved wrong!

Getting a model into the real world involves more than just building it, and this is where most companies fail, according to some staggering statistics such as this.

3Ds of DS Solution

Out of the 3Ds required in any DS solution, the third D (Deployment) is the one most data practitioners ignore. MLOps is a set of tools and practices that help ensure the development and deployment of ML solutions is orderly, repeatable and scalable.

Bridging the gap between building an ML model and deploying it in practice is still a challenging task, and this is the gap our data platform is being built to close.

In this series of blog posts, we will highlight the problems with workflows that lack MLOps, and how we are solving them at CARS24 with respect to the deployment and monitoring of ML workflows.

Static files

Traditional ML workflows store static data (make, model, variant info, etc.) as CSV/JSON files, and model features as pickle files. These files are then kept in the same repository as the code. This is not ideal for several reasons:

  • Binary file formats like pickle are not git friendly
  • Everything gets loaded into memory
  • Datatypes matter, and formats like CSV do not preserve them reliably

Having these files in the code repository creates the following problems:

  • Frequent code builds triggered by a change in a data file (with no change in the code itself)
  • Increased cold-start time during scaling
  • Increased docker image size

Ideally, we don’t need to load the entire file: at runtime we only need a few rows, depending on the input. Given the variety of data that is required, approaches should be less opinionated and should allow the data to be (see the sketch after this list):

  • Looked up quickly
  • Kept outside the code, so that it can be served independently
  • Queried
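
For instance, here is a minimal sketch of such a lookup, with sqlite3 standing in for whatever external, query-able store is used (a database, feature store or key-value cache). The table and columns are hypothetical:

```python
# Fetch only the rows needed per request from a query-able store kept
# outside the code repository, instead of loading a whole CSV/pickle.
import sqlite3

def get_variant_info(db_path: str, make: str, model: str) -> list[tuple]:
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(
            "SELECT variant, fuel_type, transmission "
            "FROM variant_info WHERE make = ? AND model = ?",
            (make, model),
        )
        # Only the few matching rows come into memory, rather than the
        # whole file being loaded at service start-up.
        return cur.fetchall()
    finally:
        conn.close()

# Usage: rows = get_variant_info("catalog.db", "Maruti", "Swift")
```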

Model/Data Versioning

Unlike software development workflows, ML workflows also include model objects and data files, which need versioning and have a change cycle different from that of the code. Traditional ML workflows don’t have anything like Git for such files. As a result, model and data file versioning ends up looking like _1, _1_final, _1_final_v1 and so on. And since there is no versioning strategy, all the historical, unused files persist in the same code git repository, which is bad practice and also inflates the docker image size.

Another big issue is that there is no record of the correlation between code, data and model files. A person would either put such records in a git commit message or maintain a separate file (which is a pain!). This also poses a problem when the project owner leaves and someone else has to pick the project up!

Ideally, these files should have versioning of their own, outside the code repository (a model/data repository), and this versioning should be linked to the code versioning so that the correlation between changes is maintained.
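
As an illustration, here is a minimal sketch of storing a model artifact outside the code repository and linking it back to the code version via the git commit hash. The bucket, paths and file names are hypothetical, and dedicated tools (data version control systems, model registries) do this far more completely:

```python
# Version a model artifact outside the code repo and record which git
# commit produced it, so code <-> model correlation is never lost.
import json
import subprocess
from datetime import datetime, timezone

import boto3

BUCKET = "my-model-store"        # hypothetical S3 bucket
MODEL_FILE = "price_model.pkl"   # hypothetical trained artifact

# The git commit the model was trained from.
commit = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
version = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
key = f"price-model/{version}/{MODEL_FILE}"

s3 = boto3.client("s3")
s3.upload_file(MODEL_FILE, BUCKET, key)

# A small manifest ties the model version to the code version; it can live
# in the model store itself or in a metadata database.
manifest = {"model_key": key, "git_commit": commit, "created_at": version}
s3.put_object(
    Bucket=BUCKET,
    Key=f"price-model/{version}/manifest.json",
    Body=json.dumps(manifest).encode(),
)
```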

Choice of Architecture

An inconvenient truth is that how an ML solution scales is also constrained by the limitations of the underlying language. Python, the de-facto standard in DS (as of 2022), lets us write simple scripts, but at the cost of being effectively single-threaded.

In a traditional sense, if a workflow is divided into its components,

Horizontal scaling of a typical solution

it will handle ~8 req/sec (one request takes 1 + 50 + 2 + 50 + 20 = 123 ms) and each deployment will consume around 1 CPU. To increase throughput, we can either scale these monolithic deployments horizontally, or we can re-architect them into something like the following:

Here, each request consumes ~2 CPUs, but since all the components run separately, we can process requests asynchronously and handle around 80 req/s, bounded by the slowest component.
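
To make the idea concrete, here is a rough sketch of the request path in the re-architected version, with each component behind its own service. The service URLs are hypothetical and this is only one possible wiring (asyncio + aiohttp); in practice much of this orchestration is handled by the serving platform:

```python
# The stages are still sequential per request, but awaiting each call
# releases the event loop, so many requests can be in flight at once and
# throughput is bounded by the slowest stage rather than the full sum.
import asyncio
import aiohttp

PREPROCESS_URL = "http://preprocess-svc/transform"   # hypothetical
MODEL_URL = "http://model-svc/predict"               # hypothetical
POSTPROCESS_URL = "http://postprocess-svc/format"    # hypothetical

async def predict(session: aiohttp.ClientSession, payload: dict) -> dict:
    async with session.post(PREPROCESS_URL, json=payload) as r:
        features = await r.json()
    async with session.post(MODEL_URL, json=features) as r:
        prediction = await r.json()
    async with session.post(POSTPROCESS_URL, json=prediction) as r:
        return await r.json()

async def main(payloads: list[dict]) -> list[dict]:
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(predict(session, p) for p in payloads))

# asyncio.run(main([{"make": "Maruti", "model": "Swift"}]))
```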

So, to achieve this and many other custom architectures, we are leveraging Kubernetes via KServe, and a lot more. All of this allows us to scale to N and back to ZERO, optimising cost and performance with relative ease.

Inference Code

Sadly, most inference code is a copy of the training code: the steps to pre-process, predict and post-process are more or less the same. Training code uses libraries like pandas, numpy, etc. Using these in production is not ideal, as they are built to process large chunks of data in a single operation. For operations on a single row (or a few rows), standard Python operators are the more efficient route. As seen below, on small data a numpy operation is much slower than the equivalent plain Python operation. The same holds for dictionaries, dataframes, etc.

Time comparison between numpy and list for a big dataset
Time comparison between numpy and list for a small dataset
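
The comparison is easy to reproduce. Here is a minimal timeit sketch using sum as the example operation (exact numbers will vary by machine):

```python
import timeit
import numpy as np

small = list(range(10))           # a "few rows" of data
big = list(range(1_000_000))      # a large batch

small_np = np.array(small)
big_np = np.array(big)

# On a handful of values, plain Python wins: numpy pays a fixed per-call
# overhead that dominates when there is hardly any data to vectorise over.
print("small, sum()  :", timeit.timeit(lambda: sum(small), number=100_000))
print("small, np.sum :", timeit.timeit(lambda: np.sum(small_np), number=100_000))

# On a large batch the vectorised numpy version wins comfortably.
print("big,   sum()  :", timeit.timeit(lambda: sum(big), number=100))
print("big,   np.sum :", timeit.timeit(lambda: np.sum(big_np), number=100))
```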

Monitoring

Model performance on real-world data is often different from performance in the training phase (if you haven’t faced this issue, then congratulations on building the perfect model :) ). Traditional workflows don’t have practices in place to monitor this drift (which, in turn, would inform re-training). There is also no monitoring of the resource utilisation of the service, which is crucial for cost optimisation and for user experience (mainly API response time). Traditional logging is done through log files, which need separate storage and frequent backups to retain history.
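
As a simple illustration of what such monitoring could look like, here is a sketch that flags drift in a single feature with a two-sample Kolmogorov–Smirnov test. This is just one of many possible drift metrics, not necessarily the exact stack we use:

```python
# Compare the production distribution of a feature against the training
# distribution; a significant difference is a signal to review/re-train.
import numpy as np
from scipy.stats import ks_2samp

def drift_alert(train_values: np.ndarray, prod_values: np.ndarray,
                p_threshold: float = 0.01) -> bool:
    """Return True if the production distribution differs significantly."""
    stat, p_value = ks_2samp(train_values, prod_values)
    return p_value < p_threshold

# Example with synthetic data: production values shifted relative to training.
rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=5_000)
prod = rng.normal(loc=0.5, scale=1.0, size=5_000)   # drifted
print(drift_alert(train, prod))                     # True -> trigger a review
```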

Conclusion

In this post we have seen which practices are followed by most traditional ML workflows and what their shortcomings are. This is not a comprehensive list of all the problems, but of those which we have faced and are solving at CARS24.

In future posts we will explain how we have solved each of these issues. The MLOps field is still evolving, and so will these solutions, or maybe they will change completely as we learn more :)

Authors: Swapnesh Khare, Senior ML Engineer @ CARS24; Rajesh Dhanda, ML Engineer; Abhay Kansal, Staff Data Scientist @ CARS24
