ML in Production @ CARS24 (Part 2)

Swapnesh Khare
CARS24 Data Science Blog
Jun 24, 2022 · 4 min read

In our previous post, we gave a brief introduction to MLOps and a few shortcomings of ignoring MLOps practices in production.

This blog post focuses on how those practices are followed at CARS24 and their impact on our workflows.

Static Files

Static files in our workflows are broadly of two types: data files (CSV, JSON, Pickle) and model files. We have moved both out of Git repositories into their respective stores.

Data Files
These files mainly contain model features, static data (make, model, variant info), etc. Such information is needed either once or with every request. Loading an entire file into memory when only a few rows are needed per request is not ideal.

We load these files into a Feature Store instead, for which we use Feast; a minimal retrieval sketch follows the diagram below. Feast serves the following use cases:

  • Low latency feature retrieval (≤10ms)
  • Consistency between training/serving data
  • Maintaining a history of data
Feature Store Workflow
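
As a rough illustration of the per-request retrieval this enables, the sketch below pulls a handful of features from Feast's online store. It assumes a Feast repo is already set up; the repo path, the feature view (vehicle_features), the feature names, and the entity key (vehicle_id) are placeholders, not our actual schema.

```python
from feast import FeatureStore

# Point Feast at the feature repository (path is illustrative).
store = FeatureStore(repo_path="feature_repo/")

# Fetch only the rows needed for this request from the online store,
# instead of loading an entire CSV/Pickle into memory.
features = store.get_online_features(
    features=[
        "vehicle_features:make",
        "vehicle_features:model",
        "vehicle_features:variant",
    ],
    entity_rows=[{"vehicle_id": "ABC1234"}],
).to_dict()
```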

Model Files
We have moved the model files out of code repos into a Data Store managed by DVC (explained below). These are fetched and loaded once during initialization. If a new model is uploaded to the Data Store, we simply call a method (load()) in our deployment to re-fetch and re-load it.
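
A minimal sketch of what such a deployment-side load() could look like, assuming a pickled model tracked by DVC; the class name, repository URL, and artifact path are placeholders:

```python
import pickle

import dvc.api


class PricePredictor:
    """Illustrative only; not our actual service class."""

    def __init__(self):
        self.model = None
        self.load()

    def load(self):
        # Pull the DVC-tracked model artifact from the remote Data Store (GCS)
        # and load it into memory. Calling load() again after a new model is
        # pushed re-fetches and re-loads it without rebuilding the Docker image.
        with dvc.api.open(
            "models/price_model.pkl",               # placeholder path
            repo="https://github.com/org/ml-repo",  # placeholder repo
            mode="rb",
        ) as f:
            self.model = pickle.load(f)
```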

These two practices give us the following advantages:

  • No code build is needed when such files change
  • Reduced memory requirement
  • Considerably smaller Docker images
  • Reduced cold-start time during scaling

Model/Data Versioning

Just as code is versioned through Git, a versioning mechanism for model/data files is also needed. These files now persist in the Data Store (GCS buckets), as explained above.

The versioning tool we use is DVC. It is similar to Git, but for model/data files. It gives us the following advantages (a small usage sketch follows the diagram below):

  • DVC attaches to every Git commit to track files, which solves the problem of ‘what goes with what’
  • DVC enables us to move model/data files to remote storage
  • DVC has Git-like commands, so it’s easy to get used to
Git+DVC flow
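
For instance, because each Git revision carries the matching .dvc pointer files, the DVC Python API can resolve or read the exact artifact that shipped with that revision; the artifact path and tag below are placeholders:

```python
import dvc.api

# A Git revision (branch, tag, or commit SHA) pins the exact model/data
# version committed alongside the code: the "what goes with what" problem.
current_url = dvc.api.get_url("models/price_model.pkl", rev="main")
older_url = dvc.api.get_url("models/price_model.pkl", rev="v1.2.0")  # placeholder tag

# Read the artifact exactly as it existed at the older revision.
with dvc.api.open("models/price_model.pkl", rev="v1.2.0", mode="rb") as f:
    old_model_bytes = f.read()
```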

Inference Service

We deploy our ML workflows on GKE as Inference Services. These Inference Services are created using KServe, an open-source tool that makes model serving easy. KServe minimises deployment effort while providing options to tweak settings depending on the user’s expertise.

A few points on why we chose KServe:

  • KServe standardises the code structure for any Inference Service (a minimal sketch follows the diagram below)
  • We don’t have to worry about scaling; Knative, which integrates with KServe, takes care of that. If needed, we can easily tweak the autoscaling parameters
  • Option to fetch models from remote storage (the Data Store)
  • Easy integration with external predictors like Triton Inference Server
  • Scale-to-zero functionality via Knative
  • Canary deployments

Here we also have control over the machine type; compute-optimised machines, for example, can be used on GKE depending on the use case. One GKE cluster can run multiple Inference Services (or multiple instances of a service) on a single node to maximise resource utilisation. A queue-proxy runs alongside each Inference Service; it queues incoming requests and triggers scaling (depending on the scaling parameters), which prevents request failures.

KServe Inference Service Structure
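
To make the standardised code structure concrete, here is a minimal custom predictor sketch using recent versions of the kserve Python SDK. The class name, model name, and the dummy model are placeholders; KServe and Knative handle the serving, routing, and autoscaling around this class.

```python
from typing import Dict

from kserve import Model, ModelServer


class PricingModel(Model):
    """Illustrative KServe custom predictor, not our production code."""

    def __init__(self, name: str):
        super().__init__(name)
        self.model = None
        self.load()

    def load(self):
        # In the real service this fetches the model from remote storage
        # (the Data Store); a trivial stand-in is used here.
        self.model = lambda x: 0.0
        self.ready = True

    def predict(self, payload: Dict, headers: Dict[str, str] = None) -> Dict:
        instances = payload["instances"]
        return {"predictions": [self.model(x) for x in instances]}


if __name__ == "__main__":
    ModelServer().start([PricingModel("pricing-model")])
```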

Monitoring

Monitoring of ML deployments is an essential and often ignored practice. Not only do we need to monitor the model’s performance, but also that of the service as a whole (which helps optimise code and cost). We monitor every Inference Service as follows (a small logging sketch follows the diagram below):

  • All internal logging is stored in GCP Logs. These logs can be automatically archived to GCS if needed.
  • Alerting is set up on the logs; in case of an error, an email is triggered to the stakeholders
  • Further monitoring is done on the KServe side through Prometheus and Grafana, which give better insights into the service’s performance
Sample ML Workflow Architecture
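
As one simple pattern (not necessarily the exact one we use), emitting structured JSON log lines from the service makes it straightforward for GCP log-based alerts to filter on severity; the field names and values here are illustrative.

```python
import json
import sys
import time


def log(severity: str, message: str, **fields):
    """Emit one structured JSON log line to stdout.

    On GKE, the Cloud Logging agent parses JSON lines and maps the
    "severity" and "message" fields, so log-based alerts (like the
    error emails mentioned above) can filter on them directly.
    """
    record = {"severity": severity, "message": message, "timestamp": time.time(), **fields}
    print(json.dumps(record), file=sys.stdout, flush=True)


# Example usage inside a prediction handler:
log("INFO", "prediction served", model_version="v1.2.0", latency_ms=8.3)
log("ERROR", "feature lookup failed", entity_id="ABC1234")
```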

Conclusion

The points listed here don’t cover even 50% of what can be done with the right ML system design. Our plan for the MLOps practice at CARS24 was to first solve the problems around deployment and monitoring for scale and reliability, which are the most essential parts as the company grows.

In the next post, we will discuss a few enhancements we made to these Inference Services and their impact.

Authors: Swapnesh Khare (Senior ML Engineer @ CARS24), Rajesh Dhanda (ML Engineer @ CARS24), Abhay Kansal (Staff Data Scientist @ CARS24)
