MLOps in three parts: a data scientist’s perspective

Toby Coleman · Published in bytehub-ai · Feb 3, 2021

What is MLOps, and why should data scientists care? In short, MLOps will help data scientists be more productive, and allow more of their work to be turned into valuable data products. In this article I’m going to explore how MLOps will change access to data, model development, and final deployment, and why the MLOps software industry should focus on building easy-to-use products that fit into data scientists’ existing workflows.

Do today’s ML software systems resemble the earliest cars in terms of complexity and reliability? Photo by Pascal Bernardon on Unsplash.

Background: machine learning projects today

As with many new technologies, machine learning has gone through a period of huge growth, both as it has become easier to use and as more people and organisations have realised the value it brings. A consequence of this rapid expansion is that many projects built around data and ML have been created in a haphazard way. So far this has worked, because the benefits gained from using ML more than compensate for the difficulties involved in successfully adopting the technology.

However, if ML is to be adopted more widely, we will need to find ways to make using it more efficient and reliable. Think of ML today as being like cars one hundred years ago: expensive, requiring a lot of tinkering to make it work, and prone to reliability problems.

Two complaints people often have are: (1) to do machine-learning we need highly trained data scientists, who are in short supply; and (2) once our data scientists have built a machine-learning system, turning it into a usable, reliable software product is slow and expensive.

Data scientists are often at the centre of any effort to adopt ML, and so solving these challenges requires that they adopt different tools and approaches to ML development.

Part 1: Access to data

Too often, accessing data is the most time-consuming part of a data scientist's job: it involves pulling data from myriad sources in different formats, then carrying out feature engineering to convert it into a form that ML models can use. When done in an ad-hoc way, this results in a mess of code that is time-consuming to write, difficult to maintain, and hard to explain to the engineer tasked with deploying a model to a live, production system.

A promising solution to this problem is a feature store. At first glance this looks very similar to a database or data warehouse containing the raw source data for all of your ML models. However, it takes the concept a step further by also storing all of the transformations needed to convert raw data into ML model inputs. This has three key benefits:

  1. Data preparation code is now neatly organised, versioned and sharable between different data teams;
  2. Model training scripts are now simpler, faster to write, and easier to understand; and
  3. The same feature store can serve ML features to model training scripts and to production systems, so it becomes much more straightforward to deploy a model into a live product.
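The core idea can be sketched in a few lines of plain Python. This is a toy illustration only, not the API of any particular feature store (the class and method names here are invented for the example): a registry holds both raw data and named transformations, so training scripts and production systems request features by name and run identical feature-engineering code.

```python
import pandas as pd

class ToyFeatureStore:
    """Minimal sketch of a feature store: raw data plus registered
    transformations, served to consumers by feature name."""

    def __init__(self):
        self._raw = {}        # name -> raw DataFrame
        self._features = {}   # name -> (source name, transform function)

    def add_raw(self, name, df):
        self._raw[name] = df

    def register_feature(self, name, source, transform):
        # The transformation itself is stored and versioned alongside
        # the data, so training and production stay consistent.
        self._features[name] = (source, transform)

    def get(self, name):
        source, transform = self._features[name]
        return transform(self._raw[source])

store = ToyFeatureStore()
store.add_raw("sales", pd.DataFrame({"revenue": [100.0, 110.0, 95.0]}))
store.register_feature(
    "revenue_pct_change", "sales",
    lambda df: df["revenue"].pct_change(),
)
print(store.get("revenue_pct_change").tolist())
```

A real feature store adds versioning, time-travel queries, and low-latency serving on top of this basic pattern, but the key design choice is the same: feature definitions live in one shared place rather than being copied into every training script.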

Part 2: Keeping track

ML model training is by nature an iterative process: a data scientist will start with some intuition around which approach might work for a specific business problem, but will then need to experiment to discover how to implement and fine-tune a suitable system. Unlike with traditional software, it is difficult to know how well a given ML system will work until it has been tested, tweaked and iteratively improved.

Keeping track of these iterations is important to ensure:

  1. Repeatability — once you’ve found the steps required to create a great model, you need to make sure you don’t forget them; and
  2. Traceability — with a record of the different modelling approaches that you tried, you will have evidence of what worked and what didn’t.

Managing ML experimentation in this way means that data scientists spend less time going down blind alleys, and have a simple way to identify and benchmark the best models.
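In practice, tools such as MLflow or Weights & Biases handle this, but the underlying idea is simple enough to sketch in plain Python (a toy tracker with invented names, not a real library's API): record the parameters and metrics of every run, then query the log to find the best model.

```python
import time

class ExperimentLog:
    """Toy experiment tracker: records parameters and metrics for each
    training run, giving repeatability and traceability."""

    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics):
        self.runs.append({
            "timestamp": time.time(),
            "params": params,
            "metrics": metrics,
        })

    def best_run(self, metric):
        # Benchmarking: return the run with the highest value of a metric.
        return max(self.runs, key=lambda r: r["metrics"][metric])

log = ExperimentLog()
log.log_run({"model": "ridge", "alpha": 1.0}, {"r2": 0.71})
log.log_run({"model": "ridge", "alpha": 0.1}, {"r2": 0.74})
log.log_run({"model": "gbm", "n_trees": 200}, {"r2": 0.82})

best = log.best_run("r2")
print(best["params"])
```

Even this minimal version captures the two benefits above: the logged parameters make a winning run repeatable, and the full history shows which approaches were tried and discarded.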

Part 3: Deployment

How should a model be turned into a usable end-product, and how should it be maintained and updated as required? Without the right tools this can be slow, expensive and error-prone, requiring software engineers to carefully pick through unfamiliar ML code handed over by a data scientist.

Instead, we need ways to deploy ML models quickly and easily, while ensuring that:

  1. The production system gets fed up-to-date versions of the same data/features used by the data scientist when training; and
  2. Models can be versioned and automatically tested for characteristics like accuracy and fairness, so that only good quality ML makes it into the hands of the end-user.
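The second point, automated quality checks, can be sketched as a simple gate: before a candidate model version is promoted, its evaluation metrics must clear minimum thresholds. The metric names and numbers below are hypothetical examples, not a real deployment pipeline.

```python
def quality_gate(metrics, thresholds):
    """Check a candidate model's evaluation metrics against minimum
    thresholds; return (passed, list of failing metric names)."""
    failures = [
        name for name, minimum in thresholds.items()
        if metrics.get(name, float("-inf")) < minimum
    ]
    return (len(failures) == 0, failures)

# Hypothetical evaluation results for a candidate model version:
# it is accurate enough, but its fairness score falls short.
passed, failures = quality_gate(
    metrics={"accuracy": 0.91, "fairness": 0.88},
    thresholds={"accuracy": 0.90, "fairness": 0.92},
)
print(passed, failures)  # False ['fairness']
```

In a CI/CD pipeline, a failing gate would block the deployment step, so only models that meet the agreed quality bar reach end-users.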

Summary: MLOps today

The good news is that solving each of these problems is getting easier day by day, with an explosion of new startups and ML-related software tools. The next stage is for data scientists to adopt these tools and approaches more widely, but this requires that MLOps software becomes more usable and familiar. Tools like Feast and Kubeflow are powerful solutions to many of the challenges outlined above, but they are often intimidating to set up and beyond the reach of smaller data science teams.

At ByteHub.ai we are building an easy-to-use feature store that helps data scientists become more productive. Contact us if you’d like to learn more.


Toby Coleman
bytehub-ai

Data Scientist and ML Engineer. Interested in time-series modelling and forecasting problems. Current project: https://github.com/bytehub-ai/bytehub/