Scaling Machine Learning at Careem

Ahmed Kamal
Tech @ Careem
8 min read · Mar 14, 2019

Motivation

At Careem, we believe that our mission is to simplify people’s lives in the region by delivering experiences that wow while solving their problems.

By uplifting communities, supporting infrastructure, and solving local problems, we strive to be the enabler that improves the lives of millions of people. We understand that only with the help of data and AI-driven tech can we help our region leapfrog its core problems and steer itself towards innovation and creativity.

Every day, our platform solves challenging problems that affect the lives of our users across 120+ cities. These problems start with helping our customers get a safe ride from point A to point B, but they don’t stop there: they extend to helping people get food and deliveries on time, as well as affordable and accessible means of transportation.

Each of these problems requires a local, optimised solution, which creates a strong need for A.I. and data to build efficient solutions. When we take into account the number of services and the variety of domains we are tackling, it becomes very clear that Careem needs thousands of production-level machine learning models to power every part of our platform.

With this in mind, a scalable and flexible machine learning platform that facilitates development and enables expeditious deployment of ML services is a must. To achieve this vision, we started with a gap analysis to understand what was preventing our tech from fully leveraging the power of A.I. Then we came up with a list of requirements that any proposed solution needs to address:

1- Take ML models from the ideation phase to production integration in hours.

2- Experiment with, train, and serve different model types using different libraries and frameworks in a scalable and cost-efficient way.

3- Leverage our different offline and real-time big data sources while training and serving ML models.

Finally, the solution should open the door for our mission as a team: democratising machine learning usage by different teams across the Careem organisation.

Due to the lack of an end-to-end solution that fulfills these needs and integrates smoothly with our different systems, we decided to build our in-house machine learning platform. Thanks to it, our data scientists and engineers have developed and served different ML models over the last year, improving users’ experiences and optimising our Marketplace engine, and affecting the everyday lives of millions of our customers and captains.

In this post, we introduce our platform by walking you through the ML development workflow of our Estimated Time of Arrival (ETA) prediction project.

Problem Definition

Customers rely on our ETA service to know how long it will take a Captain to reach their location once booked. To calculate an upfront ETA for a booking, our ETA service sends captain coordinates, booking coordinates, and other useful contextual information about the request. A machine learning API on top of our platform then receives the request, predicts the ETA, and returns it to the consumer service. A sample request looks like this:

    "user_id": 1,
"cct_id": 122,
"osm_distance": 400.9,
"osm_eta": 47,
"captain_long": 55.1486643,
"captain_lat": 25.0911563,
"booking_long": 54.1506833,
"booking_lat": 25
...

Now, let’s deep dive into the ML development cycle using our in-house platform.

ML Development Cycle

To develop a machine learning service, we pass through five steps:

  • Dataset Generation
  • Feature Engineering
  • Model Selection and Training
  • Model Evaluation
  • Model Serving

Generating a dataset

Let’s start with dataset generation, a crucial step in the development of any machine learning model. With access to our various data sources, a dataset preparation job can easily be created using SQL or Spark APIs through our interface.

The generated dataset then goes into our datasets repository in S3, labelled with all the necessary attributes (project name, generation time, target dimension, etc.).

A dimension defines the granularity of the model rollout: whether it targets a country, a city, a car type, or an age group.

Scalable Dataset Generation on top of Apache Spark

By running all of our data processing on top of Spark, we ensure the scalability of our different data generation pipelines, and we strike the right balance between the flexibility of Spark APIs for complex use cases and the ease of use of SQL.
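To make this concrete, here is a minimal sketch of what such a dataset generation job could look like in PySpark. The table names, columns, and S3 paths are hypothetical placeholders rather than our actual schema:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("eta_dataset_generation").getOrCreate()

    # Join bookings with captain telemetry and routing estimates to build
    # training rows. Table and column names are hypothetical placeholders.
    dataset = spark.sql("""
        SELECT b.booking_id,
               b.cct_id,
               b.booking_lat, b.booking_long,
               c.captain_lat, c.captain_long,
               r.osm_distance, r.osm_eta,
               b.actual_eta AS label
        FROM bookings b
        JOIN captain_locations c ON b.captain_id = c.captain_id
        JOIN routing_estimates r ON b.booking_id = r.booking_id
        WHERE b.city = 'cairo'
    """)

    # Write the dataset to the S3 datasets repository, labelled with the
    # project name, target dimension, and generation time described above.
    dataset.write.parquet(
        "s3://ml-datasets/eta_prediction/dimension=city/value=cairo/2019-03-01/"
    )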

Currently, we are also investing in building our feature store to shorten the time needed for feature extraction, as well as in advancing our AutoML capabilities.

Training a model

When it comes to training ML models, there are a few considerations we try to take care of.

a- Empowering Data Scientists/ML Engineers to use different ML libraries easily.

Our target as a platform is to enable everyone to use the right tool for the right problem, not to limit them to a specific library. We believe the solution could lie in using scikit-learn, CatBoost, TensorFlow, MLlib, etc., and sometimes it might combine algorithms from several of these libraries.

Trainer Design

We allow the usage of different ML libraries: through the Trainer package, ML training algorithms can easily be created using our integrated libraries and APIs.

Different metrics and plots can be reported, and other training metadata is captured and stored automatically.
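The Trainer package itself is internal, but a training entry point in this spirit might look roughly like the following sketch, with scikit-learn standing in as one of the supported libraries. The function shape and the manual metric reporting are illustrative assumptions:

    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    def train(dataset_path):
        """Train an ETA regressor; in the real platform, metrics, plots,
        and training metadata would be captured and stored automatically."""
        df = pd.read_parquet(dataset_path)
        X_train, X_val, y_train, y_val = train_test_split(
            df.drop(columns=["label"]), df["label"],
            test_size=0.2, random_state=42,
        )

        model = GradientBoostingRegressor(n_estimators=200)
        model.fit(X_train, y_train)

        # Report a validation metric; the Trainer would persist this
        # alongside plots and other run metadata.
        mae = mean_absolute_error(y_val, model.predict(X_val))
        print("validation MAE: %.2f seconds" % mae)
        return model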

b- Enable scalable training effortlessly

Currently, most of our use cases rely on offline training, where the user specifies how many resources are needed for the training session and a container starts running on automatically-provisioned infrastructure. Once training is triggered, the user gets access to all the logs of their running job. When it finishes, the model deployment package is created, versioned, and stored on S3. We treat training jobs as production pipelines, so we ensure they have tight monitoring and alerting.

Our trainer integrates smoothly with Airflow through our pipelines creator module, allowing users to get pipelines seamlessly for their new projects.
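The pipelines creator module is internal, but the pipeline it generates could resemble a standard Airflow DAG like the sketch below; the task callables and schedule are hypothetical stand-ins:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def generate_dataset():
        ...  # run the Spark dataset generation job

    def train_model():
        ...  # run the Trainer and publish the model package to S3

    with DAG(
        dag_id="eta_training_pipeline",
        start_date=datetime(2019, 3, 1),
        schedule_interval="@weekly",  # retrain regularly to follow changing patterns
    ) as dag:
        dataset_task = PythonOperator(
            task_id="generate_dataset", python_callable=generate_dataset,
        )
        training_task = PythonOperator(
            task_id="train_model", python_callable=train_model,
        )
        dataset_task >> training_task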

Evaluating the model

Once the model is trained, the user can use the platform’s UI to view different metrics over different runs, as well as dozens of interactive and static ML visualisations.

At the same time, all the captured metadata and job details can be easily explored, ensuring the reproducibility of results when needed.

We allow users to set a benchmark dataset (yeah, it is an MNIST-like dataset 😉) that can be used to compare different models’ performance against each other. This also opens the door for our data scientists to collaborate on the same project, if needed, raising the bar on the quality of our solutions through a Kaggle-like submission workflow.

In addition, model drift detection and evaluation is an important area we are taking into consideration. Given the fast-moving nature of our business, it is a no-brainer that a lot of our models have to be checked and updated regularly to ensure they continuously adapt to changing patterns.
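There are many ways to implement such checks. As one simple example (not necessarily what we run in production), the distribution of recent predictions can be compared against a training-time baseline with a two-sample Kolmogorov-Smirnov test:

    import numpy as np
    from scipy.stats import ks_2samp

    def predictions_drifted(baseline, recent, p_threshold=0.01):
        """Flag drift when recent predictions no longer look like the
        baseline distribution captured at training time."""
        _statistic, p_value = ks_2samp(baseline, recent)
        return p_value < p_threshold

    # Synthetic illustration: live ETAs have shifted upwards by a minute.
    baseline = np.random.normal(loc=300, scale=60, size=10000)
    recent = np.random.normal(loc=360, scale=60, size=10000)
    if predictions_drifted(baseline, recent):
        print("ETA prediction drift detected; consider retraining")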

Serving the model

Once a model is successfully trained and evaluated, it is the perfect time to get it into production. Generally, there are two serving techniques adopted in the industry, and the choice between them mostly depends on the use case.

Real-time Serving

We allow our engineers to build and deploy a scalable, production-ready API in a few minutes to serve all the models trained for the specified problem. For our ETA model, we need one API to serve different models based on the received request. This means the model we serve for Cairo will be different from the one served for cities in emerging markets.


Each API has access to our A/B testing framework (Galileo), guaranteeing easy config-based A/B testing between different models across different dimensions (for ETA, cities for example). For a lot of our use cases, pre-aggregated or real-time features may be needed, so all of our APIs have access to a key-value store where different features are ingested by different data pipelines.

They also all follow the same contract, ensuring smooth integration with our other production services. Additionally, the stateless design of our API ensures automatic horizontal scaling when needed.
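Our serving layer is internal, but a stripped-down sketch of the routing idea described above (pick a model by dimension, enrich the request with features from the key-value store, predict) might look like this in Flask. The model paths, feature columns, and in-memory feature store are hypothetical stand-ins:

    import pickle

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    # Hypothetical setup: one model per serving dimension (here, city),
    # plus an in-memory stand-in for the key-value feature store.
    MODELS = {city: pickle.load(open("models/%s.pkl" % city, "rb"))
              for city in ("cairo", "dubai")}
    FEATURE_STORE = {122: {"avg_eta_last_hour": 52.0}}
    FEATURE_COLUMNS = ["osm_distance", "osm_eta", "captain_lat",
                       "captain_long", "booking_lat", "booking_long"]

    @app.route("/predict/eta", methods=["POST"])
    def predict_eta():
        payload = request.get_json()

        # Route to the model trained for this request's dimension.
        model = MODELS[payload["city"]]

        # Enrich the request with features ingested by the data pipelines.
        extra = FEATURE_STORE.get(payload["cct_id"], {"avg_eta_last_hour": 0.0})
        row = [payload[c] for c in FEATURE_COLUMNS] + [extra["avg_eta_last_hour"]]

        # Predictions would also be logged and streamed for monitoring.
        return jsonify({"eta": float(model.predict([row])[0])})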

Thanks to our automated deployment service (we call it “one-click deployment”), we can refresh our models or any part of the deployment content in no time, ensuring the reliability and stability of the production service. Tight integration with our centralised monitoring, alerting, and logging infra is also ensured, and all of the model’s predictions are logged and streamed to facilitate real-time model performance monitoring.

Putting everything together, we are able to cut development and deployment time from days to a few minutes, even after including the effort needed for continuous model updates.

Batch Serving

Sometimes, all that is needed to leverage AI is to store predictions in a data store and let the consumer service look them up on demand. This is mostly useful for complex use cases with high latency, or when real-time inference isn’t a necessity for the ML solution.

We rely on batch prediction for model testing and real-world simulation, as well as for pre-computing predictions for complex algorithms.
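As a hedged sketch of this pattern on Spark (the paths, feature columns, and model file are hypothetical), an offline job can score the latest dataset and write the predictions somewhere the consumer service can look them up:

    import pickle

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.appName("eta_batch_predictions").getOrCreate()

    # Broadcast the trained model so every executor can score rows locally.
    model = spark.sparkContext.broadcast(pickle.load(open("models/cairo.pkl", "rb")))

    @pandas_udf(DoubleType(), PandasUDFType.SCALAR)
    def predict(osm_distance, osm_eta):
        features = pd.concat([osm_distance, osm_eta], axis=1)
        return pd.Series(model.value.predict(features))

    scored = (spark.read.parquet("s3://ml-datasets/eta_prediction/latest/")
                   .withColumn("predicted_eta", predict("osm_distance", "osm_eta")))

    # Persist predictions where the consumer service can read them on demand.
    scored.write.parquet("s3://ml-predictions/eta_prediction/latest/")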

Conclusion

Our investment in building an in-house ML platform comes from our belief that high ML adoption relies heavily on democratising access to it for everyone. It paves our way towards full AutoML capabilities and sets the base for other ML-based platforms that target specialised domains like time-series forecasting or anomaly detection.

By building an ML platform, we are taking one more step towards accelerating the impact of machine learning by increasing the number of teams and services that can take advantage of it. And by saving a significant amount of the effort and time needed to train and deploy models, we multiply the output of our ML engineers and data scientists.

Interested in joining our journey? We are looking for rockstars to help us shape the future of the region! https://www.careem.com/en-ae/careers/
