Introducing the series: “What is MLOps?” — Part 1

Mikiko Bazeley
Ml Ops by Mikiko Bazeley
9 min read · Aug 22, 2022

Hey everyone, in this series I’ll be diving into the “What” and “Why” of MLOps.

This is Part 1 of a multi-part series; the full list of parts appears in the Series List at the end of this post.

Links will be updated as the sections are published. Sections tagged with “Substack” will be available in their entirety at my Substack for free.

Introduction

When (A Lack Of) MLOps Ruined Your Week 💚⭐️

You’ve started attending in-person conferences and events again. There are a bunch more stalls than the last time you attended, and they’re all billing themselves as end-to-end ML solutions (“from notebook to web app!”).

And while notebooks are awesome, there’s been growing unease in the back of your brain about all the custom-built models being deployed at work, oftentimes by engineers who wonder how a scrub like you ever got your job, given you don’t even Docker.

And it seems like every time the product team asks for a new model or feature, it’s a huge scramble to set up the pipeline and build the model from scratch.

This, however, is a Monday Problem: models still get deployed, even if they get pushed just past the wire and ruin family dinners (for the DevOps folks).

The Product Team ➡️ The Data Science Team ➡️ The Engineers

After the weekend you’re back at work, catching up with the other data scientists and ML engineers and talking about some of the newest papers out of NeurIPS.

Then a series of events heralds an incredibly eventful week:

  • Monday (morning): The new Chief Product Officer has released their eagerly awaited list of major initiatives, including: cutting spend on model training, ensuring models are compliant with new legal requirements, and improving ROI on data science projects.
  • Monday (afternoon): The Director of Data Science schedules a last-minute team meeting to preview a new accountability process (triggered by the CPO’s list of announcements).
  • Tuesday (morning): Layoffs are announced, and the company offers a more attractive severance package to those who volunteer. All the tenured engineering staff (specifically, the ones who caught the models thrown over the fence and were responsible for saving those projects from the scrap bin) have decided to hop off early. This week. All together. 😳 Folks who seem more engineering-minded, including some from your team, are asked to step up until the chaos dies down. All the Monday announcements and initiatives still apply, just… with a lot fewer people. But you figure it’ll sort of work out, and you get invited to a bunch of Slack channels the engineers said were used for notifications about stuff failing.
  • Tuesday (afternoon): Slack gets flooded with notifications about failed pods, even though the Airflow logs seem to indicate the data pipelines kicked off successfully? Support is also getting flooded with customer complaints that some of the APIs are returning nonsense results or taking forever to return predictions.
(Image: the real-time emotional state of the remaining engineers as everything suddenly breaks.)
  • Wednesday: As part of the announcements on Monday, every business partner starts pinging your team asking for dashboards and status updates on models in production. Because there is currently no way to keep track of all the models in production (other than posting an @here in various product and data science channels) beyond the current ELK stack, the current sprint’s tickets and backlog now need to be re-prioritized. There is still chaos left over from Monday and Tuesday, and some of the departing engineers are in a less-than-generous mood when it comes to knowledge transfer.
I mean I can’t say I blame them.
  • Thursday: As part of tracking down models (both in production and in development) and manually creating a Google spreadsheet listing model owners, model types, and the current status of development and deployment, everyone realizes that of all the known models, only 60–70% actually made it to production, and of those supposedly in production:
  • ➡️ 20% have basically been forgotten and are stale (because they belonged to data scientists that left the company)
  • ➡️ 40% have developed problems around scalability and resiliency
  • ➡️ 20% are in okay shape
  • ➡️ 20% actually aren’t in production and are still in pre-deployment but the data scientists got pulled off those projects to get put on new product features.
  • Friday: The icing on the cake: the company finds out about a lawsuit alleging malicious use of its platform involving racism and bias. There’s a general rush internally to investigate whether this is the case. The problem? No one’s actually sure the platform HASN’T been used in a detrimental manner, because of:
  • ➡️ the lack of documentation and lineage around the datasets used in model training,
  • ➡️ the neglected models that have kept running (and were generally forgotten about), and
  • ➡️ the siloing between the various engineering teams, which means there’s no real way to connect the users receiving online predictions back to the offline training sets, or to explain why the experiments were designed a particular way.
  • Saturday and Sunday: Work work work work werk. There goes your week and welcome to Monday. You still have 3–6 more months of this to go before everything looks up!

Motivation

The scenario I illustrated above, while fictitious, isn’t a complete caricature.

Some of those same situations have in fact happened to well-meaning (and not well-meaning) teams and individuals.

Heck, if you ask most MLOps engineers when they decided to make the switch from Data Scientist or DevOps Engineer (or Software Engineer), it was the week they were asked to put a model into production and found they had neither the infrastructure nor the people to do so. And when they did get it done, it was a painful, messy process ripe for automation.

Then came another 20 models.

We’re past the hype of data science and now in the land of companies shouting “Show Me The Money” when it comes to their data science teams and initiatives. We’re also in the kingdom of individuals, maybe like yourself, wanting to leverage the promises of AI and ML for their own business ideas as full-stack application developers.

My goal is to provide an opinionated, working understanding of what MLOps is (and isn’t) that will last for at least 25% of a Silicon Valley news cycle, and ideally to touch on principles and themes independent of any particular tech stack. The tech stack is still firming up and maturing, and is worth its own set of posts, videos, and tutorials.

Objectives

In this series we’ll attempt the following:

  • Provide a working definition of MLOps;
  • Differentiate between the practice, tooling, and the role;
  • Show how MLOps impacts the ML lifecycle;
  • Highlight the differences between traditional software and ML software and how those differences relate to the concerns of MLOps.

My Hot and Lukewarm Takes

And because it’s tradition in all my posts to drop some hot-takes up front, here’s a good list:

  • MLOps is about creating a system, platform, or internal set of tooling that lets many models be developed and deployed by Data Scientists or ML Engineers;
  • MLOps is not the same function as ML Engineering, which I find to largely be about getting individual ML pipelines or projects up and running (oftentimes using the tools or systems that Engineering or MLOps has established);
  • Traditional software concerns and practices still apply but they need to be adapted to the quirks of ML products;
  • Designing a complicated system for the sake of having more boxes on a system design drawing is about ego, rather than need;
  • User experience of internal tooling or platforms is a critically undervalued and under-discussed component of adopting MLOps practices and tooling, and consistently pointing the finger at Data Scientists for not being able to code is lazy and uninspired thinking.

I said what I said.

Who Should Read This 💚⭐️

You should read this series if:

  • You’re a software engineer exploring making the switch to MLOps but aren’t really sure what it is (especially the ML part) because you’ve never actually developed and deployed a model;
  • You’re a data scientist who is struggling to get your models into production because you’re missing the software engineering context;
  • You’re a product manager or technical leader new to managing or working with Data Science projects and you’re confused by some of the challenges with production models and are also starting to think about investments into MLOps.

This series is especially great if you’re not looking for another code demo or tutorial and instead you want the “Why” of MLOps.

⚠️Warning ⚠️ If you’re already a working or practicing MLOps professional, this blog series is going to be boring and repetitive. But if you decide to read it anyway, please feel free to comment or provide any feedback that would be relevant for newcomers (whom this series is geared towards).

Series List

This is Part 1 of the multi-part series that will include:

  • Part 2: Defining MLOps as Simply As Possible
  • Part 3: Why ML Ops Matters: Taking the ‘Oh Sh*t’ out of MLOops
  • Part 4: Goals of MLOps as Themes (Substack)
  • Part 5: Software and ML Before MLOps (Substack)
  • Part 6: The Challenges of Scaling ML As Software (Substack)

Links will be updated as the sections are published. Sections tagged with “Substack” will be available in their entirety at my Substack for free.

About Me

My name is Mikiko, and at the time of writing I work as a Sr. MLOps Engineer at Mailchimp.

Before pivoting to MLOps, I worked as a Data Scientist, Data Analyst, and Growth Hacker at various companies in the Bay Area.

If you’re interested in my various career leaps and breakthroughs, check out these series:

  • 👩🏻‍💻 Miki’s 🔥Hot-Takes🔥 on MLE Interviews: Types of Roles & Interview Prep — Part 1 & Part 2 — Where I talk about getting my role (which was initially ML Engineering, before the team converted to MLOps)
  • ✂️Breaking into Data Science- From Hair Salon to Data Scientist 🔍 — Part 1, Part 2, Part 3, Part 4 — Where I go into exhaustive detail about preparing to get a job as a data scientist.
Wat a #poser.

If you’re interested in learning more about MLOps, production ML, and distributed systems + cloud development, I also publish:

And I take support and patronage in the form of coffees ☕!


👩🏻‍💻 MLOps Engineer & Leader 🏋🏻‍♀️ More info at bio.link/mikikobazeley ☕️ First of Her Name 🐉 Maker 🎨 CrossFit Enthusiast