7 Learnings From the AWS/Sequoia Whitepaper “MLOps: Emerging Trends in Data, Code, and Infrastructure”

Piotr Niedzwiedz
Published in neptune-ai
15 min read · Sep 15, 2022

I came across the “MLOps: Emerging Trends in Data, Code, and Infrastructure” whitepaper (from AWS/Sequoia) a few weeks ago, and when I read it, I thought, “okay, this makes sense”.

It resonated with me and, in many places, described well what I had been thinking for some time. It’s also very much in line with how we think about building and using ML tools at Neptune.

I figured I’d share the most important learnings with you. I’m curious to know what others think about these points and would be happy to talk more about them.

So here we go.

1. Building infrastructure is not the goal, pushing models to production is

(Yes, I decided to start with something that obvious.)

You want models in production.

But you don’t want to build infrastructure.

So you go for an end-to-end cloud platform (like SageMaker).

One of the big problems with end-to-end platforms?

You want to customize components of the stack to your use case.

But business folks don’t want you to work on activities that are not “core business problems”.

Things that don’t build long-term differentiation.

So what do you do?

As Vin Sharma from AWS explains in “MLOps: Emerging Trends in Data, Code, and Infrastructure”:

“Today, customizing different components of the ML lifecycle can be highly manual, with practitioners primarily taking a DIY approach.

Businesses prefer some degree of abstraction from the underlying functions and hardware in each step of the modeling process. They don’t have or want to invest in the in-house resources to custom-build core capabilities like data labeling, experiment tracking, model monitoring, etc.”

Let’s face it.

You will not push the business forward long-term by hacking around a pipeline orchestration component or a model deployment solution. Or an experiment tracker, for that matter.

It may be a cool project, but from a long-term business perspective, it doesn’t hold water.

The solution, according to Vin, not me (ok, I see it too :)), is:

“We see organizations opting for best-of-breed solutions from focused vendors to gain additional control over their modeling workflow without needing to be too concerned with what’s going on ‘under-the-hood.’”

So basically, it’s what Jacopo Tagliabue has been saying for a long time when explaining ML/MLOps at a reasonable scale:

“to be ML productive at a reasonable scale, you should invest your time in your core problems (whatever that might be) and buy everything else.”

If it is not your core business, and there are solutions that do what you want (or you expect them to be able to do it soonish) -> don’t build it yourself.

I know it is hard to resist because “you could build a v1 over the weekend”.

You probably could.

But then you’d need to fix, maintain, document, update, improve, backup, etc.

It’s your time we are talking about here.

This is what the business is paying the big bucks for.

Not the monthly SaaS subscriptions or enterprise contracts.

Use your time wisely.

If that means building an experiment tracker -> great.

But more often than not, outsourcing it to a vendor is the right choice for the business… and you.

2. People want to be able to customize their ML infrastructure

Let’s start from the same point once again.

You want models in production.

But you don’t want to build infrastructure.

So you go for an end-to-end cloud platform (like Sagemaker).

But you have special requirements and need to customize your MLOps stack.

So you end up swapping components of your end-to-end platform and combining vendors, open-source, homegrown tools, and cloud providers’ offerings.

Want an example?

Here you go:

  • Brainly used the end-to-end SageMaker platform. They were happy with it.
  • They started experimenting heavily with CV models.
  • They realized they needed a deeper solution for the experiment tracking component of the stack.
  • They swapped SageMaker Experiments for a best-in-class point solution, Neptune.
  • They are happily using both SageMaker and Neptune.

If the end-to-end platform is designed to be open, it is not “either or”. It is “both and”.

Here is the full story if you’re interested.
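
To make the “both and” idea concrete, here is a minimal sketch of what that kind of swap can look like inside a training script that runs as a SageMaker job. To be clear, this is my illustration, not Brainly’s actual code: it assumes the neptune client (1.x), a NEPTUNE_API_TOKEN available in the job environment, and a made-up project name.

# Minimal sketch: log an experiment to Neptune from inside a SageMaker training job.
# Assumptions: `pip install neptune` (client 1.x), NEPTUNE_API_TOKEN set in the job
# environment, and a hypothetical project name "my-workspace/cv-experiments".
import neptune
run = neptune.init_run(project="my-workspace/cv-experiments")
run["parameters"] = {"lr": 1e-3, "batch_size": 64, "epochs": 3}
for epoch in range(3):
    train_loss = 1.0 / (epoch + 1)  # placeholder for your real training loop
    run["train/loss"].append(train_loss)
run.stop()

The rest of the workflow (data channels, instance types, deployment) stays on SageMaker; only the tracking destination changes.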

As Vin Sharma explains in “MLOps: Emerging Trends in Data, Code, and Infrastructure”:

“For any organization that wants to generate value from domain-specific business applications or curated data sets, the underlying ML platform and the effort of managing its infrastructure are just a means to an end and potentially represent a heavy burden with no differentiation.

Cloud ML platforms provide managed services that are easy to use at a lower cost for such organizations.

However, these end-to-end services do not meet the needs of many customers with more specialized requirements.

Thus, they chose to implement an ML platform composed of best-of-breed products from vendors, open source, homegrown code, and native AWS services.

Ultimately, these organizations want fine-grained control over the full stack and willingly take on the additional complexity and cost of operating a DIY.

They hope to deliver unique value based on their ML platform software, underlying infrastructure, and ML engineering expertise.”

As your needs evolve, so does your toolstack.

Just understand what you need today.

Be honest with yourself.

Don’t build everything because “it was in that awesome FANG blog post diagram” and “you will need it in the future”.

The secret key is the interoperability of the components

That is the difference between success and frustration when building an MLOps platform.

In the classic debate of building in-house vs. buying best-of-breed, the answer (when you cut the fluff out) is almost always both.

In “MLOps: Emerging Trends in Data, Code, and Infrastructure”, Vin Sharma from AWS explains why it “typically ends up as a bit of both.”

To build in-house, you need to:

  • find/hire/retain great (and expensive) people
  • differentiate what you are building from what is available in the ecosystem (otherwise, why build?)

“Inevitably, the resident CTO arrives at a fruit bowl of cherries picked from open-source projects, vendor-supplied proprietary tools, and services from a cloud provider.” — Vin Sharma

But when you combine tools from the MLOps ecosystem (open-source or vendor-supplied), you have new problems.

You need to design for interoperability of modules that solve particular problems.

“If not thoughtfully constructed, developers may find that these platform architectures fit better in slideware than in software.

Ultimately, the interoperability between the modules becomes the secret key to MLOps success or frustration.” — Vin Sharma

When you add to this that most tools overlap in what they do and try to grow wider, not deeper, you get yourself into the nice hot mess that the MLOps tooling ecosystem is today.

One of the solutions to this is to use platforms that have swappable components.
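
To show what “swappable” can look like in practice, here is a minimal, tool-agnostic sketch (mine, not the whitepaper’s): the training code depends only on a narrow tracking interface, so the backend behind it can be replaced without touching the pipeline.

# Tool-agnostic sketch of a swappable component: training code depends only on a
# narrow Tracker protocol, so the backend (console, Neptune, MLflow, homegrown)
# can be replaced without changing the pipeline itself.
from typing import Protocol

class Tracker(Protocol):
    def log_params(self, params: dict) -> None: ...
    def log_metric(self, name: str, value: float) -> None: ...

class ConsoleTracker:
    def log_params(self, params: dict) -> None:
        print("params:", params)
    def log_metric(self, name: str, value: float) -> None:
        print(f"{name}={value:.4f}")

def train(tracker: Tracker) -> None:
    tracker.log_params({"lr": 1e-3, "epochs": 2})
    for epoch in range(2):
        tracker.log_metric("train/loss", 1.0 / (epoch + 1))  # placeholder metric

train(ConsoleTracker())  # swap in any backend that implements the same two methods

A thin adapter per vendor is then all it takes to plug a best-of-breed tool into the same slot.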

And then, there’s modularity…

As Vin Sharma from AWS explains in “MLOps: Emerging Trends in Data, Code, and Infrastructure”:

“Top organizations are adopting a DevOps-like process for Machine Learning called MLOps to create a durable and competitive advantage.

These systems continue to evolve at a breakneck speed where the most essential components are obvious: modularity and interoperability.”

“expect to see modular ML systems win by integrating well into the rest of the stack.”

Modular components + do one thing well + integrate well

Sounds dreamy.

But if you look at DevOps 10–15 years ago, which is probably where MLOps is today, this is the reality.

As Sonya Huang from Sequoia explains in “MLOps: Emerging Trends in Data, Code, and Infrastructure”:

“Modularity is key. Production ML systems serve models that are as “core” to an enterprise as it gets.

Having the most important models locked into a proprietary vendor that doesn’t integrate well with other solutions simply does not work for ambitious ML organizations.

Each component needs to perform its role well — be that a feature store or monitoring solution.

As such, we should expect to see modular ML systems win by integrating well into the rest of the stack.”

This is exactly how we look at our place in the MLOps ecosystem at Neptune.

But,

“modularity doesn’t preclude bundling”

You can still be a modular component of the stack and do a few things.

You just need to do each of those things well and let people choose only the parts they want.

In “MLOps: Emerging Trends in Data, Code, and Infrastructure”, Sonya Huang from Sequoia explains:

“That said, modularity doesn’t preclude bundling if you can do multiple things well.

We have seen companies serve multiple parts of the ML stack very well.

Hugging Face is an example of a company that has excelled at multiple parts of the ML stack — from finding models, to training, to deployment.

But what’s key is that a customer has the option to integrate Hugging Face with their own stack if they choose, or just use it for a single component.

That requires Hugging Face to stay on its toes and win on product in every sub-category where it competes.”

Tools can be bundled, but they don’t have to be coupled.

Strongly coupled tools can be a problem, as you need to use multiple tools from one provider.

They work well together, but they lock you into an opinionated and limiting way of doing things.

Bundled means you can pick and choose what you need.

Bundles will always be somewhat coupled as they just work best together.

But it’s a flavor of lock-in that feels positive.

AWS SageMaker is an interesting case of a tool that is bundled and somewhat coupled, but definitely open.

We see teams (like Brainly) swap the experiment tracking component to a best-in-class tool like Neptune.

We’ve heard of teams swapping the model monitoring component for a point solution like WhyLabs.

Yet many folks feel that SageMaker is a very strongly coupled end-to-end platform.

Interested to see how this changes over time.

How do we as an MLOps tooling ecosystem support that?

Let’s talk modularity.

In a recent talk between Stefan Krawczyk and Chip Huyen, Chip points out that modular MLOps tools are great, but we don’t even have clearly defined categories.

And so many tools overlap so heavily that it is hard to understand where categories start and end.

Look at THE MLOps paper, “Machine Learning Operations (MLOps): Overview, Definition, and Architecture” by Dominik Kreuzberger, Niklas Kühl, and Sebastian Hirschl.

Look at that awesome diagram.

[Diagram from the paper: End-to-end MLOps architecture and workflow with functional components and roles]

Take almost any component.

Find a tool that does just that.

Yeah, hard.

Modularity is hard when everyone wants to be an end-to-end MLOps platform.

And interoperability?

Ok, we are making some strides:

  • tools like ZenML for an agnostic pipeline abstraction,
  • growing popularity of the model registry concept (sketched below),
  • open architectures of end-to-end platforms like SageMaker.
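
Since the model registry concept comes up in that list, here is a deliberately naive, in-memory sketch of the idea (my illustration, not any particular vendor’s API): register model versions with metadata, promote them through stages, and let downstream services resolve “the current production model” instead of hard-coding artifact paths.

# Naive, in-memory sketch of the model registry concept (Python 3.10+), not a real
# vendor API: versions are registered, promoted through stages, and consumers look
# up "the latest production version" instead of hard-coding artifact paths.
from dataclasses import dataclass, field

@dataclass
class ModelVersion:
    name: str
    version: int
    artifact_uri: str
    stage: str = "staging"

@dataclass
class ModelRegistry:
    _versions: list[ModelVersion] = field(default_factory=list)

    def register(self, name: str, artifact_uri: str) -> ModelVersion:
        version = sum(v.name == name for v in self._versions) + 1
        mv = ModelVersion(name, version, artifact_uri)
        self._versions.append(mv)
        return mv

    def promote(self, name: str, version: int, stage: str) -> None:
        for v in self._versions:
            if v.name == name and v.version == version:
                v.stage = stage

    def latest(self, name: str, stage: str = "production") -> ModelVersion | None:
        candidates = [v for v in self._versions if v.name == name and v.stage == stage]
        return max(candidates, key=lambda v: v.version, default=None)

registry = ModelRegistry()
registry.register("churn-model", "s3://my-bucket/churn/v1/model.pkl")  # made-up path
registry.promote("churn-model", 1, "production")
print(registry.latest("churn-model"))

Real registries add persistence, permissions, and lineage on top, but the surface area stays roughly this small, which is exactly what makes the component easy to standardize and swap.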

So things are getting better.

But we would move so much faster if we tried to be more modular and integrated rather than cover larger parts of the MLOps landscape.

3. “there is no single ‘all-in-one’ software development platform”

Exactly.

Yet, in MLOps today (which people say is like DevOps 10+ years ago, we’ll get to it in a sec), tooling companies are building in the direction of end-to-end platforms.

Why? Is MLOps simpler than DevOps?

Seems like ML dev is more complex than software dev, not less.

+ if anything, it is an extension of DevOps.

Sonya Huang from Sequoia, in “MLOps: Emerging Trends in Data, Code, and Infrastructure”, shares:

“look at how the software development stack evolved, which I believe is an apt analogy for how the machine learning landscape will develop.

As vendors like Atlassian and Datadog, and GitHub grew, they took on more and more adjacent functionality.

But there is no single “all-in-one” software development platform — customers still want to pick and choose a mix of vendors that are best suited for each purpose — it’s simply too large of a scope for a single vendor to deliver well.

I believe a similar dynamic will play out in machine learning.

Ambitious vendors will extend their functionality beyond their initial wedge — the ML serving companies may get into ML monitoring, for example — but we should see a handful of winners.

The best companies will strike a balance.

They execute incredibly well on their wedge, have the ambition to extend into adjacencies, and they aren’t overly naive about the scope of their project.”

So yeah, sure, it makes sense to bundle some things together and go into adjacent tooling categories.

We did that with experiment tracking and model registry, as people kept requesting those functionalities together.

But, end-to-end platforms in MLOps seem very unlikely to make sense long-term.

Unless you build it as an end-to-end bundle of loosely coupled tools.

Perhaps that could work.

But, you should know that…

4. Even AWS SageMaker doesn’t believe in one end-to-end MLOps platform for everyone

Vin Sharma says that teams “want the flexibility to tweak all the components of their model and model-building workflow”.

Yes, exactly. And it is a good thing for everyone.

Various ML use cases -> need for flexible components

In “MLOps: Emerging Trends in Data, Code, and Infrastructure”, he says:

“Through the advent of more sophisticated AI use-cases and more advanced data scientists, enterprises do not just want integrated black-box systems anymore.

They will want the flexibility to tweak all the components of their model and model-building workflow to produce the most optimal analyses and systems for their specific business needs.

At AWS, we recognize teams need this flexibility. Amazon Sagemaker is intentionally designed to be both modular and flexible to support teams who wish to work with open-source frameworks.

We support both end-to-end SageMaker and DIY workflows.”

And many ML teams do.

In the long term, the flexibility of components (or lack of it) is what kills ML platforms.

And AWS understands this: it builds SageMaker to be open, giving people a solid starting point they can customize.

5. Some stack components are just not worth building

“customers view certain products across the MLOps pipeline as being easier to use, providing deeper features or functionality, and perhaps, even more, cost-effective.”

Word.

We see that a lot: teams are happy to adopt a mature point solution for something that just doesn’t make sense, financially or otherwise, to build in-house.

As Tim Porter and Jon Turow from Madrona Venture Group share in “MLOps: Emerging Trends in Data, Code, and Infrastructure”:

“even more organizations […] will want to take a composable approach of choosing best-of-breed products from a variety of vendors, selectively use open-source software, build on hyperscale infrastructure, and combine these with their own code that best addresses their business and application needs.

Sometimes this is because their data or customers are in multiple places, and their application needs to run on multiple clouds and even on-prem.

Almost always, it’s because customers view certain products across the MLOps pipeline as being easier to use, providing deeper features or functionality, and perhaps, even more, cost-effective.”

The tricky part today is understanding which components are both mature enough and obvious enough to outsource.

Experiment tracking seems to be a good candidate.

  • Very boring ML-wise.
  • A lot of backend work to make this scalable and reliable.
  • A few solid options have been on the market for some time now.

Which other components do you think are good candidates?

6. So… MLOps startups going end-to-end may not be the best direction

In theory, one company could solve everything.

But in practice, it is impossible to provide a good enough dev experience across the board.

Is the bad dev experience in MLOps, in general, the reason why so many companies were/are building MLOps infra themselves?

Or why do vendors feel they can go wide across many components and do them all well enough (or at least not badly)?

Or is it just ignorance?

As Tim Porter and Jon Turow from Madrona Venture Group share in “MLOps: Emerging Trends in Data, Code, and Infrastructure”:

“A trend in MLOps has been that companies who become successful in an MLOps subcategory seem to evolve or expand into offering end-to-end solutions.

Sometimes we question, however, whether this is in response to customer pull or just broader startup imperialism and ambition.

We have seen some of both — if a vendor is providing a strong solution in the “front end” of the ML pipeline (say, labeling, training or model creation), customers might also want deployment and management that can close the loop back to training.

However, in many more cases, we see customers want a modular system where they can knit together best-of-breed solutions that best fit their environments and business needs.”

So why does that happen?

We’d say it is probably a combination of ignorance and customers asking for it.

You don’t know how hard building each component is until you start getting deep into that component. By then, you are overinvested in “making it work”.

And you build something 2x worse than a best-in-class point solution that someone is focusing 100% of their time on.

From the developers’ perspective, it is just “not great” when compared to that point solution.

So you want to make it better. But you have that and ten other components to improve.

Either way, long-term, it is probably unsustainable.

Anyway.

In the end, we all, as MLOps tooling companies, will just be an extension of the DevOps toolstack.

Yes, it’s time to talk about that now.

7. MLOps is like DevOps in the 1990s and early 2000s

Best practices are not clear.

Where one toolstack component ends and the other starts is not clear.

The reasons for people wearing tracksuits and flared trousers are not clear.

As Sonya Huang from Sequoia shares in “MLOps: Emerging Trends in Data, Code, and Infrastructure”:

“We believe the ML ecosystem resembles the software ecosystem in the 1990s and early 2000s, before DevOps as a discipline took over and the software development stack professionalized and coalesced around systems like GitHub, Atlassian, and Datadog.

ML is now having its “DevOps” moment, where technology companies realize they need to operationalize their ML stacks so that their Data and ML teams can develop with maximum impact.”

Exciting times to be in MLOps both on the user/practitioner and tool provider side.

For those who remember, how was DevOps 20 years ago?

Were there a lot of end-to-end platforms back then?

More tools than you can count solving almost the same thing?

Also, if MLOps is like DevOps 10 years ago, what do you think MLOps will look like in 10 years?

Will MLOps just be a part of DevOps?

Ultimately, we are helping teams operate software. A special type of software called ML.

So speaking long-term, would it ever make sense to have both DevOps and MLOps as separate teams/stacks?

In “MLOps: Emerging Trends in Data, Code, and Infrastructure”, Tim Porter and Jon Turow from Madrona Venture Group say:

“An interesting related question is whether MLOps continues to exist as its own category, or whether it converges simply into DevOps, as nearly every application becomes a data-centric intelligent application.

Madrona’s belief is these two spaces will begin to largely converge, but certain unique customer needs specific to MLOps will continue to persist.”

What do you think?

Will MLOps just converge to DevOps?

Or maybe it has always been DevOps, and we just made it special :)

Who is going to be the Atlassian of MLOps?

Sure, today, MLOps is nascent. We don’t have standards, lines between categories of tools are not clear, etc.

But at some point, we will get to maturity.

We’ll have more evident user segments, nicely defined needs, best practices, etc.

Will there be a single platform that does all the modular components of MLOps then?

Or will Atlassian just buy modular MLOps point solutions to have a bigger bundle of products?

Ryan Cunningham from AI Fund explains his point of view in “MLOps: Emerging Trends in Data, Code, and Infrastructure”:

“No single company can own all the modular systems right now, because everyone is still actively developing the best frameworks to conduct these operations.

The MLOps field is nascent, which is exciting, because the more people contribute to these open source dialogues, the faster we will converge on standards for ML development, deployment, and management.

We are still several years away from a single dominant company emerging as the central hub for MLOps product to exist within.

In the long run, as MLOps standards formalize, we expect larger players will begin to acquire adjacent products and scale horizontally.

From there, we may begin to see complete, Atlassian-like platforms take shape.”

By the time we get to MLOps maturity, it may be clear that MLOps is just part of DevOps.

That ML software delivery is just software delivery. Or not.

I guess we just have to wait and see.

So who do you think will be the Atlassian of MLOps in 10 years?

I occasionally share my thoughts on the ML and MLOps landscape on my LinkedIn profile. Feel free to follow me there if that’s an interesting topic for you. Also, reach out if you’d like to chat about it.
