What’s helping BRIDGEi2i scale Machine Learning?

A sneak peek into our in-house ML-Ops Solution

Shivalika Singh
BRIDGEi2i
9 min read · May 13, 2021


The motivation behind MLCore — Our MLOps Platform

Machine learning is all the rage today and is becoming a part of every business out there. While its capabilities are impressive, all good things come at a cost. If machine learning systems aren’t structured thoughtfully from the beginning, they can easily turn into an ugly mess when put into production. This is why many companies struggle to manage these complex systems and end up employing huge teams of data scientists and engineers to keep them up and running.

Image source: Data Center Talk

Companies think that putting together a big team of experts will do all the magic. But just scaling up the team size is not the solution to scaling up machine learning. Many presume that it’s only logical to have a large team of data scientists and engineers; the more, the merrier. It’s assumed that data scientists will develop the models, and engineers will take care of deployment. But this is where the problems start because it separates the engineers from the data scientists. While they are expected to operate as a team, their functions don’t overlap, so they remain isolated from one another until something collapses in production.

It is the inherent difference between the responsibilities of data scientists and engineers that makes it so hard to manage machine learning in production.

Data scientists aren’t trained to care about production concerns like scaling; their hearts lie in experimenting with data. Similarly, engineers might not want to experiment enough with machine learning or its dynamic nature. So no matter how many of them we put together, it doesn’t help solve the machine learning production crisis. It only creates more friction and endless collaboration issues. You can’t tighten a nut by throwing more spanners at it.

“When working in teams, a common ground needs to be agreed on for efficient collaboration.”

Quantum Black’s main motivation behind designing Kedro

The solution to this crisis requires the creation of a platform. This platform should be designed to enforce production standards from the get-go. Such a platform should allow data scientists to iterate over their models very quickly while also addressing engineers’ pain points such as versioning, code quality, reproducibility, etc. Such a platform eventually becomes the common language that data scientists and engineers can speak in — thereby eliminating any collaboration issues.

In fact, any scientific discipline requires a controlled environment for experimentation to flourish. This allows scientists to quickly test out their hypotheses against a variety of data. The field of machine learning is no exception to this rule. Data scientists should have the liberty to experiment without having to worry about tracking the results in Google sheets. This is why it’s essential to move data scientists away from Jupyter notebooks and bring them onto a platform where they may play with the data to their heart’s content.

A platform also becomes a centralized place for the deployment of machine learning models, much like browsers are for web apps. This simplifies debugging issues with machine learning pipelines since there’s clear traceability from training data to deployed model. Browsers also abstract away the details of which hardware/OS combination the application would run on for web developers. A platform provides a similar abstraction for data scientists and saves them from worrying about hardware constraints or wasting time installing packages. All of this helps in reducing the length of the machine learning life cycle drastically.

With a big chunk of the machine learning workflow automated, the platform reduces the need for a large team for single projects. The platform, therefore, frees up resources; large teams can split into smaller teams and manage more projects. This gives businesses the liberty to achieve more with the same number of people.

This is how machine learning can be scaled, empowering small teams to run big-scale ML projects.

Teams shouldn’t have to worry about the challenges of managing the increased amount of data, models, or infrastructure that comes with large-scale projects. The platform takes care of it all, making it the best means of achieving scale in the ML domain.

The need for a platform becomes much more apparent when we compare machine learning software with traditional software. Traditional software comes with all kinds of tools and frameworks that help simplify the software development life cycle. When it comes to machine learning, however, useful tools are scarce.

“If you want testing for traditional Python application development, you can find at least 20 tools within 2 minutes of googling. If you want testing for machine learning models, there’s none.”

What I learned from looking at 200 machine learning tools by Chip Huyen
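Until dedicated ML testing tools mature, a basic model-validation check can still be expressed with plain assertions in a generic test runner. A hedged, self-contained sketch (the model, data, and threshold below are illustrative stand-ins, not any real project's code):

```python
# A minimal model-validation test written without any dedicated ML
# testing tool: fail the pipeline if held-out accuracy drops below a floor.

def validate_model(predict, test_rows, min_accuracy=0.8):
    """Compute accuracy on held-out rows and assert it meets the floor."""
    correct = sum(1 for features, label in test_rows if predict(features) == label)
    accuracy = correct / len(test_rows)
    assert accuracy >= min_accuracy, f"accuracy {accuracy:.2f} below {min_accuracy}"
    return accuracy

# Stand-in "model": predicts 1 whenever the first feature is positive.
toy_model = lambda features: int(features[0] > 0)

held_out = [([1.0], 1), ([2.0], 1), ([-1.0], 0), ([0.5], 1), ([-3.0], 0)]
print(validate_model(toy_model, held_out))  # prints 1.0
```

Checks like this are crude compared to what traditional software enjoys, which is exactly the gap the quote above points to.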

Figure 1: Gaps in tooling — Traditional software v/s ML software (Image source: CS329 slides (Stanford))

While the last 3 to 4 years have seen an explosion in the number of tools and platforms for operationalizing machine learning, there is still a lot of work to be done in this space.

Enter MLCore

Seeing our project teams struggle with the challenges around scalability, and noting the gaps in readily available tooling for ML, inspired us to develop our own solution to these problems.

MLCore is BRIDGEi2i’s internal platform that acts as the integrated backbone for any machine learning project, including our AI accelerators. It enables our data scientists and engineers to easily build and deploy solutions at scale and has become the de-facto system for machine learning at BRIDGEi2i.

Figure 2: MLCore powering our AI accelerators (Image credits: Kishore Kumar N, Sekhar Rangam)

MLCore: Bridging the gap between training and deployment

By now it’s clear that the approach of data scientists developing models and handing them over to engineers for deployment cannot produce scalable machine learning pipelines. It isolates the engineers from the data scientists, creating communication gaps within the team. Moreover, it assumes that the output of the training exercise, i.e. the model artifact, is separate from the process used to generate that artifact.

If training happens in isolation from the deployment strategy, that is never going to translate well in production scenarios — leading to inconsistencies, silent failures, and eventually failed model deployments.

Why ML should be written as pipelines from the get-go by Hamza Tahir

With MLCore, we bridged the gap between model training and deployment and created an automated workflow for the whole process. This significantly shortens the machine learning life cycle. It also makes moving models to production as simple as the click of a button.
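One way to picture this bridging is to write training itself as a pipeline whose lineage travels with the model. The sketch below is a hypothetical miniature, not MLCore's actual API: each stage registers its output, and the whole run gets a fingerprint so that deployment can refer to exactly the artifacts that training produced.

```python
# Hypothetical sketch: training written as a pipeline, so the model
# artifact is never separated from the process that produced it.
import hashlib
import json

class Pipeline:
    def __init__(self, name):
        self.name = name
        self.steps = []      # ordered (step_name, function) pairs
        self.artifacts = {}  # step_name -> output, kept for traceability

    def step(self, fn):
        """Register a function as the next pipeline stage."""
        self.steps.append((fn.__name__, fn))
        return fn

    def run(self, data):
        out = data
        for name, fn in self.steps:
            out = fn(out)
            self.artifacts[name] = out
        # Fingerprint the run so training and deployment can point
        # at exactly the same lineage.
        blob = json.dumps({k: str(v) for k, v in self.artifacts.items()},
                          sort_keys=True)
        self.run_id = hashlib.sha256(blob.encode()).hexdigest()[:12]
        return out

pipe = Pipeline("ticket-forecast")

@pipe.step
def preprocess(raw):
    return [x / max(raw) for x in raw]

@pipe.step
def train(features):
    # Stand-in "model": just the mean of the scaled features.
    return {"mean": sum(features) / len(features)}

model = pipe.run([10, 20, 30, 40])
```

In a real system the stages would be data loaders and trainers and the fingerprint would live in a tracking store, but the principle is the same: the artifact and its provenance move as one unit.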

Challenges we overcame to build MLCore

Building a platform like MLCore is not straightforward; it comes with a big set of challenges. However, the good news is that most principles that apply to building traditional software platforms can be applied to machine learning platforms as well. It’s just that machine learning has a lot more moving pieces. We have to think about data being available, data moving around, model training and so on.

What finally helped us was to take a closer look at the machine learning workflow.

Figure 3: Machine learning workflow

Some of the challenges we identified are listed below:

  1. Streamline Experimentation: The dynamic nature of data makes machine learning an iterative process. Data Scientists have to explore different algorithms, try out a bunch of frameworks and tweak various hyperparameters before arriving at the desired model. While developing MLCore, a major challenge was to streamline experimentation by making it simpler, faster and easy to track.
  2. Reproducible Pipelines: Another challenge while developing MLCore was the reproducibility of results. This meant taking care of tracking code, models as well as the data used to produce the models.
  3. Testing: A machine learning system involves thinking not only about traditional software testing concepts such as unit tests, but also about model validation.
  4. Faster Deployment Cycles: In the ML domain, data changes frequently. We wanted MLCore to have a pipeline that would support quick retraining and deployment cycles.
  5. Monitoring: As data keeps changing, the performance of deployed models degrades. Monitoring helps initiate retraining using new data. Hence, it is crucial to monitor both the predictions of a deployed model and the statistics of the data used to build that model.
  6. Choice of Frameworks: After identifying problems we wanted to address via MLCore, our next challenge was choosing a coherent set of tools that would make up the platform. These are the tools that we chose to go with:
  • Git for versioning of code
  • Feast as our feature store
  • MLflow for tracking of experiments
  • DVC for data and model versioning
  • Docker and Kubernetes for deployment
  • GoCD for designing training and scoring pipelines
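To make the monitoring challenge (point 5 above) concrete, here is a deliberately tiny illustration of the idea, not MLCore's actual implementation: capture summary statistics of the training data, then flag retraining when incoming data drifts too far from that profile.

```python
# Toy drift check: compare the statistics a model was trained on
# against live data and flag retraining when they diverge.
from statistics import mean, stdev

def profile(data):
    """Summary statistics captured at training time."""
    return {"mean": mean(data), "stdev": stdev(data)}

def needs_retraining(train_profile, live_data, threshold=2.0):
    """Flag drift when the live mean moves more than `threshold`
    training standard deviations away from the training mean."""
    shift = abs(mean(live_data) - train_profile["mean"])
    return shift > threshold * train_profile["stdev"]

train_stats = profile([100, 110, 90, 105, 95])       # ticket counts at training time
print(needs_retraining(train_stats, [102, 98, 107]))   # similar data -> False
print(needs_retraining(train_stats, [300, 320, 310]))  # large shift -> True
```

Production monitoring tracks many more signals (prediction distributions, feature-level drift, data quality), but the trigger-retraining-on-drift loop is the core of it.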

MLCore: The Value It Adds

MLCore was designed with the objective of transforming the way we do machine learning at BRIDGEi2i. The simplest way to understand the value it brings to the table would be through an example use case.

Let’s say a cab service company is operating its business in ten different states across India. They would like to forecast the number of customer complaint tickets that would be generated for any given state a month in advance based on historical data.

For the state of Karnataka, our team of data scientists has decided to construct a pipeline that involves building three different models.

Figure 4: ML pipeline for ticket forecasting of Karnataka

Now, let’s try to understand some of the ways in which MLCore simplifies the design and management of machine learning pipelines such as the one shown in figure (4):

  • Track and Version Models: A pipeline with three models is not hard to track. The complexity arises when we design similar pipelines for the remaining nine states where the client is operating. Then we’ll have to track roughly 30–50 models, or even more, at the same time. Before MLCore, all these models would usually sit in the Google Drive folders of data scientists. With MLCore, it doesn’t matter whether the pipelines for a project produce 50 or even 100 models; the platform automatically tracks and versions all model files.
  • Automatic Backup: In figure (4) above, each block may be a custom script sitting in the local systems of data scientists. If they happen to lose the code, then the experiment would have to be re-conducted from scratch. Such scenarios are completely avoided with the help of MLCore as everything is backed up.
  • Tracks Intermediate Outputs: Each of the blocks in the above figure may have intermediate outputs associated with it, which need to be tracked. All of this is usually handled manually by the data scientists on their local systems. With MLCore, our data scientists don’t have to worry about which output was produced at which stage of the pipeline, since the platform does all the tracking for them.
  • Creates Traceability: Each model in figure (4) may be developed by a different data scientist. This means that different components of the pipeline are distributed across local systems of different data scientists. Hence, there’s no traceability of code or data used to construct the above pipeline. MLCore creates clear traceability among different components of a machine learning pipeline.
  • Configurable Pipelines: Earlier, if our data scientists wanted to update a pipeline like the one shown above by adding or removing models, they wouldn’t be able to do so without several hours of manual activity. But with MLCore, all they have to do is modify the template of the pipeline, and the rest is handled by the platform on its own.
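A configurable pipeline template along these lines might look something like the following sketch. The schema and names are hypothetical illustrations, not MLCore's real template format; the point is that adding or removing a model becomes a one-line change to a declarative template rather than hours of manual rewiring.

```python
# Hypothetical pipeline template: the platform reads a declarative
# description and builds the pipeline stages from it.
TEMPLATE = {
    "state": "Karnataka",
    "models": [
        {"name": "ticket_volume",  "algo": "arima"},
        {"name": "seasonality",    "algo": "prophet"},
        {"name": "residual_boost", "algo": "xgboost"},
    ],
}

def build_pipeline(template):
    """Turn a declarative template into an ordered list of stage names."""
    stages = [f"load_data[{template['state']}]"]
    for m in template["models"]:
        stages.append(f"train[{m['name']}:{m['algo']}]")
    stages.append("publish_forecast")
    return stages

print(build_pipeline(TEMPLATE))
```

Dropping one model from `TEMPLATE["models"]`, or copying the template with `"state": "Maharashtra"`, yields a new pipeline with no change to the orchestration code.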

Conclusion

Ever since its inception, MLCore has been truly transformational for us. It has helped us achieve scalability along with reproducibility and explainability of outcomes. It has become the common ground for collaboration for our data scientists and engineers. We are now able to run big projects using small teams because MLCore acts as the base for all machine learning projects.

MLCore has standardized the major components of the machine learning workflow. It is enabling our data scientists to focus on building models without worrying about the nuances of infrastructure and deployment. It has also significantly shortened the model development and deployment cycles.

The scope of MLCore is not limited to solving structured-data problems; we are expanding its capabilities to process unstructured data as well. This will transform the way we do CV and NLP projects at BRIDGEi2i.

As shown in figure (2), MLCore acts as the base for our AI accelerators including Recommender, Forecasting, etc. For more information on our accelerators, please check out our website.
