Enigma — ML Platform at MFine

Rohit Damodaran · Published in mfine-technology · Nov 23, 2022

1. Introduction

MFine is at the forefront of providing quality, on-demand healthcare services across India. As an AI-driven healthcare platform, MFine powers a majority of its services with ML systems. The Data Science team has delivered, and continues to work on, a variety of AI projects spanning text, image, audio, and signal data.

As we kept adding more ML models to our portfolio, it became challenging to track all our experiments and to monitor and maintain the models we had deployed. It thus became essential to have a platform that supports the end-to-end machine learning lifecycle, right from data ingestion to model tracking, deployment, and monitoring.

The term MLOps is defined as “the extension of the DevOps methodology to include Machine Learning and Data Science assets as first-class citizens within the DevOps ecology” (Source: MLOps SIG).

MLOps, like DevOps, emerges from the understanding that developing an ML model should be separated from the process that delivers it. MLOps inherits many principles from DevOps, so the two disciplines share clear similarities: both sets of best practices focus on process automation and continuous development to maximize speed, efficiency, and productivity.

1.1 The Rise of MLOps

MLOps (Machine Learning Operations) is a core function of Machine Learning Engineering, primarily focused on reducing the time and cost of developing ML models, streamlining the process of taking these models to production, and then maintaining and monitoring them.

MLOps Lifecycle (Source: Databricks)

By adopting an MLOps approach, data scientists and machine learning engineers can collaborate and increase the pace of model development and production, by implementing continuous integration and deployment (CI/CD) practices with proper monitoring, validation, and governance of ML models.

1.2 Why do we need MLOps?

Productionizing machine learning is difficult. The machine learning lifecycle consists of many complex components such as data ingestion, data preparation, model training, model tuning, model deployment, model monitoring, explainability, and much more. It also requires collaboration and hand-offs across teams, from Data Engineering to Data Science to ML Engineering. Naturally, it requires stringent operational rigor to keep all these processes synchronous and working in tandem. MLOps encompasses the experimentation, iteration, and continuous improvement of the machine learning lifecycle.

2. Enigma

Enigma is an in-house ML platform at MFine, coupled with the AWS ecosystem, primarily built to quickly prepare data, train and track ML models, and deploy them.

2.1 Platform Overview

The figure below shows the different components of the platform. AWS recommends using separate accounts depending on the workload. At MFine, the training pipeline (which triggers training jobs on high-compute CPU/GPU resources) is set up in the Data Science account, while the deployment pipeline is set up in the Production account.

  • JupyterHub — Multi-user, SSO-enabled version of Jupyter Notebook/Jupyter Lab.
  • AWS RDS — PostgreSQL database to store the metadata, model versions, and evaluation metrics.
  • AWS S3 — Dataset, models, and artifacts registry.
  • AWS ECR — Store and access Docker images.
  • AWS Batch — An AWS-managed service to run Batch Jobs.
  • Jenkins (CI/CD) — Used for triggering training and deployment pipelines.
  • Git — Version control.
Overview of the platform components

2.2 Training Pipeline

Training jobs can either run in the notebook (for training small models) or be submitted to AWS Batch by specifying the compute resources (CPU, GPU, and memory) that would be required. Batch jobs run as Docker containers, using pre-built or custom Docker images to train the models.

Running a training job in Enigma

Once a training job is submitted, Enigma handles all the heavy lifting of shipping the code, building the Docker image, and running the job in AWS Batch.
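Under the hood, submitting a containerized training job to AWS Batch boils down to a call like the one below. This is a minimal boto3 sketch for illustration only; the queue, job definition, image, and resource values are placeholders, and Enigma's own helpers wrap these details.

```python
# Minimal sketch: submitting a containerized training job to AWS Batch with boto3.
# Queue, job-definition, and resource values are illustrative placeholders,
# not Enigma's actual configuration.
import boto3

batch = boto3.client("batch", region_name="ap-south-1")

response = batch.submit_job(
    jobName="train-symptom-classifier-v1",
    jobQueue="ds-gpu-spot-queue",        # Batch queue backed by Spot GPU instances
    jobDefinition="enigma-training:3",   # points to the training Docker image in ECR
    containerOverrides={
        "command": ["python", "train.py", "--epochs", "20"],
        "resourceRequirements": [
            {"type": "VCPU", "value": "8"},
            {"type": "MEMORY", "value": "32768"},  # MiB
            {"type": "GPU", "value": "1"},
        ],
    },
)
print("Submitted Batch job:", response["jobId"])
```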

Enigma is capable of using Spot Instances for training which can help in saving up nearly 50–90% on GPU bills.

Metaflow is a popular orchestration tool used to execute Data Science workflows. The training is executed as a Directed Acyclic Graph (DAG) as shown below.

Training Flow (DAG)

Data Layer: Data preparation and preprocessing take place at this layer. Users would typically pull data from S3 or another source here. The prepared data is then passed on to the Model Layer.

Model Layer: Model training and validation are done at this layer. Preprocessed data from the data layer is fed to the model for training. The models are versioned and stored in S3.

Evaluation Layer: The trained model(s) and test dataset are used for evaluation. The model evaluation metrics are stored in the database (RDS).
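As a rough illustration of this DAG, here is a minimal Metaflow flow with the same three layers. The step bodies are placeholders rather than Enigma's actual preprocessing, training, and evaluation code.

```python
# Minimal sketch of a Metaflow training DAG mirroring the three layers above.
# The step bodies are placeholders, not Enigma's implementation.
from metaflow import FlowSpec, step


class TrainingFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.data_layer)

    @step
    def data_layer(self):
        # Pull raw data (e.g. from S3) and preprocess it.
        self.dataset = [[0.1, 0.2], [0.3, 0.4]]  # placeholder
        self.next(self.model_layer)

    @step
    def model_layer(self):
        # Train and version the model; artifacts would be persisted to S3.
        self.model = {"weights": [0.5, 0.5]}     # placeholder
        self.next(self.evaluation_layer)

    @step
    def evaluation_layer(self):
        # Evaluate on a held-out set; metrics would be written to RDS.
        self.metrics = {"accuracy": 0.92}        # placeholder
        self.next(self.end)

    @step
    def end(self):
        print("Flow finished:", self.metrics)


if __name__ == "__main__":
    TrainingFlow()
```

Running `python training_flow.py run` executes the steps in order and lets Metaflow track the artifacts produced at each step.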

2.3 Deployment Pipeline

After training the model(s), the next step is to deploy the best-performing one. A Jenkins CI/CD pipeline is set up to pull the model from S3, build the Docker image, and deploy it to the Development, QA, and Production environments.
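For a sense of the steps the pipeline performs, here is a simplified Python sketch of the same flow: pull the model artifact from S3, then build and push the serving image. The bucket, key, and image names are hypothetical, and in practice these steps run inside Jenkins rather than as a standalone script.

```python
# Rough sketch of the deployment steps, expressed as a Python script.
# Bucket, key, and image names are illustrative; the real pipeline runs in Jenkins.
import subprocess
import boto3

BUCKET = "enigma-model-registry"                 # hypothetical bucket name
MODEL_KEY = "symptom-classifier/v3/model.pkl"    # hypothetical model artifact
IMAGE = "123456789012.dkr.ecr.ap-south-1.amazonaws.com/symptom-classifier:v3"

# 1. Pull the best-performing model from S3 into the Docker build context.
boto3.client("s3").download_file(BUCKET, MODEL_KEY, "serving/model.pkl")

# 2. Build and push the serving image.
subprocess.run(["docker", "build", "-t", IMAGE, "serving/"], check=True)
subprocess.run(["docker", "push", IMAGE], check=True)

# 3. Hand off to the environment-specific rollout (Dev/QA/Prod).
print(f"Pushed {IMAGE}; ready for deployment rollout.")
```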

2.4 Core Features

Data Accessibility

Enigma provides helper functions to easily access data from S3. To understand more about the Data Platform at MFine, you may want to give this blog a read.
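As an illustration of the idea (the helper name below is hypothetical, not Enigma's actual API), such a helper can be a thin wrapper over boto3 and pandas:

```python
# Hypothetical S3 data-access helper; Enigma's real helper names and signatures may differ.
import io
import boto3
import pandas as pd


def read_csv_from_s3(bucket: str, key: str) -> pd.DataFrame:
    # Fetch the object and load it straight into a DataFrame.
    obj = boto3.client("s3").get_object(Bucket=bucket, Key=key)
    return pd.read_csv(io.BytesIO(obj["Body"].read()))


df = read_csv_from_s3("enigma-datasets", "lab-reports/2022-11/train.csv")  # placeholder paths
print(df.shape)
```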

Experiment Tracking

Every training job is tracked and versioned. Model-related metadata, evaluation metrics, and the version are stored in the database while the model files and artifacts are stored in S3. Experiment tracking enables the Data Scientist to easily compare and reproduce results.

Sometimes, even a small change in your hyperparameters can have a huge impact on performance.
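A minimal sketch of what gets recorded per run is shown below; the S3 layout, table, and column names are assumptions for illustration, not Enigma's actual schema.

```python
# Illustrative experiment-tracking sketch: metrics and metadata go to Postgres (RDS),
# the model artifact goes to S3. Table and column names are assumptions.
import json
import boto3
import psycopg2

run_id = "symptom-classifier-2022-11-23-001"
params = {"lr": 1e-3, "epochs": 20}
metrics = {"accuracy": 0.92, "f1": 0.89}

# Store the model artifact in S3, keyed by run id.
boto3.client("s3").upload_file("model.pkl", "enigma-model-registry", f"runs/{run_id}/model.pkl")

# Record the run in the metadata database.
conn = psycopg2.connect(host="enigma-rds.example.internal", dbname="enigma",
                        user="ml", password="...")  # placeholder credentials
with conn, conn.cursor() as cur:
    cur.execute(
        "INSERT INTO experiment_runs (run_id, params, metrics) VALUES (%s, %s, %s)",
        (run_id, json.dumps(params), json.dumps(metrics)),
    )
```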

Dataset Versioning

Dataset versioning can help Data Scientists track and reuse datasets. This would also enable them to skip data preprocessing/feature engineering for future experiments.
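One simple way to implement this (an illustration, not necessarily Enigma's approach) is to key each processed dataset in S3 by a content hash, so any experiment can pin and reuse an exact version:

```python
# Illustrative dataset versioning: address each processed dataset by a content hash in S3.
import hashlib
import boto3


def publish_dataset_version(local_path: str, bucket: str, name: str) -> str:
    with open(local_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()[:12]
    key = f"datasets/{name}/{digest}/data.parquet"
    boto3.client("s3").upload_file(local_path, bucket, key)
    return key  # store this key alongside the experiment metadata


version_key = publish_dataset_version("train.parquet", "enigma-datasets", "lab-reports")
print("Pinned dataset version:", version_key)
```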

Slack Notifications and Logging

Training status notifications and custom messages can be sent to Slack to get regular and on-the-go updates on the progress of the training. The complete logs are published in AWS CloudWatch.

Slack Notifications
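Posting such updates typically amounts to a single call to a Slack incoming webhook; the sketch below uses a placeholder webhook URL.

```python
# Minimal sketch of sending a training-status update to Slack via an incoming webhook.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder


def notify(message: str) -> None:
    # Slack incoming webhooks accept a simple JSON payload with a "text" field.
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)


notify("Training job `symptom-classifier-v3` finished: accuracy=0.92")
```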

How Does Enigma Use Spot Instances?

A Spot Instance is an instance that uses spare EC2 capacity that is available for less than the On-Demand price. Because Spot Instances enable you to request unused EC2 instances at steep discounts, you can lower your Amazon EC2 costs significantly. The hourly price for a Spot Instance is called a Spot price.

The On-Demand price for a p2.xlarge GPU instance is $0.9 per hour whereas the spot price for the same instance is $0.27 per hour!

Amazon EC2 terminates, stops, or hibernates your Spot Instance when Amazon EC2 needs the capacity back or the Spot price exceeds the maximum price for your request. Amazon EC2 provides a Spot Instance interruption notice, which gives the instance a two-minute warning before it is interrupted.

This two-minute warning is used to take a “snapshot” of the training state, i.e., uploading the checkpointed model(s) to S3 and updating the metadata in the DB. A new Batch job is then created and the training resumes.

Spot Instance Termination Handling Mechanism
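The sketch below shows the general shape of such a watcher: poll the EC2 instance metadata endpoint for the interruption notice and, when it appears, checkpoint to S3 before the instance is reclaimed. It assumes IMDSv1 is reachable from the container and simplifies the checkpointing and metadata updates.

```python
# Sketch of a Spot interruption watcher: poll instance metadata and, on the
# two-minute notice, checkpoint the training state to S3 before the instance goes away.
import time
import boto3
import requests

INTERRUPTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def checkpoint_and_exit():
    # Upload the latest checkpoint (paths are placeholders) and exit;
    # a new Batch job would resume training from this snapshot.
    boto3.client("s3").upload_file("checkpoint.pt", "enigma-model-registry",
                                   "runs/active/checkpoint.pt")
    raise SystemExit("Spot interruption: checkpoint saved, exiting for resubmission.")


while True:
    resp = requests.get(INTERRUPTION_URL, timeout=2)
    if resp.status_code == 200:  # 404 means no interruption notice has been issued yet
        checkpoint_and_exit()
    time.sleep(5)
```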

Model Monitoring (WIP)

The fundamental assumption while training/deploying any ML model is that the data that was used for training mimics the real world. However, every machine learning model degrades over time. When the model receives data that it has not seen in training, the performance can degrade significantly. Model monitoring helps you to identify performance-related issues effectively.

By monitoring models, you can track the distributions of the key model features and of the model predictions. If the data distribution shifts significantly from what was seen in the past, you can trigger an alert and make the necessary updates to the model.

Deploying a model to production is just the beginning of the lifecycle of a machine learning model!

The service health and infrastructure (CPU, memory, and Disk IO) are monitored using Prometheus (collects and stores its metrics as time series data) and Grafana (data visualization), with alerts being sent to Slack.
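For the service-level metrics, a model service can expose counters and histograms that Prometheus scrapes and Grafana charts; the metric names below are examples only.

```python
# Illustrative sketch: exposing request metrics from a model service for Prometheus to scrape.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("model_predictions_total", "Total prediction requests served")
LATENCY = Histogram("model_prediction_latency_seconds", "Prediction latency in seconds")

start_http_server(9100)  # Prometheus scrapes http://<host>:9100/metrics

while True:
    with LATENCY.time():
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for real inference work
    PREDICTIONS.inc()
```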

We are currently developing a drift detection system that would take a sample of the production data and apply drift detection techniques to identify if the models in production have degraded or not.
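As a sketch of the kind of check such a system can run, the example below compares a production sample of a single feature against its training distribution using a two-sample Kolmogorov-Smirnov test; the data and threshold are illustrative.

```python
# Simple drift check on one feature: compare training vs. production samples with a KS test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)  # training-time distribution
prod_feature = rng.normal(loc=0.4, scale=1.0, size=1_000)   # shifted production sample

stat, p_value = ks_2samp(train_feature, prod_feature)
if p_value < 0.01:
    print(f"Drift detected (KS statistic={stat:.3f}); consider alerting or retraining.")
else:
    print("No significant drift detected.")
```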

A detailed blog on Model Monitoring to follow!

Final Thoughts

An ML Platform for a company becomes crucial as more and more data science projects are added to its portfolio. All major cloud providers have their own data science platform offerings. Good open-source-based options exist too. Different options may work best for different companies, depending on their machine learning use cases, the maturity of the team, whether they are in the data center or in the cloud, and what cloud provider they’ve selected.

With that being said, MLOps is far from mature. It is a newly coined industry term and is rapidly evolving, but so are we! We constantly add new features to the platform to ensure that the data science process runs smoothly. Stay tuned for more blogs on the same!

Please do follow if you’re curious to learn more about the tech we use at MFine.
