The Guide to Operational AI: Part 1 — An Introduction to Operational AI
Is MLOps Dead?
Are you dissatisfied with the complexity of current MLOps systems? Is your company struggling to get value out of its ML investments? Does your data science team struggle to keep up with the demands of your business? In this article, we’ll discuss what comes next: Operational AI.
What is Operational AI?
Let’s start with a brief, high-level look at the development of artificial intelligence over time. Along the way, we’ll pin down common terminology in the field so that we can build up a robust definition of operational AI. These terms are often conflated or misused, so here’s a quick refresher for anyone who needs it.
What is artificial intelligence?
Artificial intelligence is a discipline in computer science that aims to replicate human decision-making using computer systems, or “machines”.
There are many ways in which a machine could display human-level intelligence, such as machine learning, statistical modeling, or even complicated rule-based “decision” systems. The important distinction is that the system is able to make decisions without human input at the time of decision-making (and, hopefully, that these decisions are comparable or superior in quality to what a human would produce; perhaps that’s the bar for good artificial intelligence). A machine doesn’t need to pass the Turing test to qualify as AI, but it does need to be more independent than the Mechanical Turk.
The idea of artificial intelligence isn’t new. We can find early precursors as far back as ancient Greece, in the myths of Talos and Galatea, and it’s almost certain that such ideas circulated for quite some time before they were committed to folklore. Whether the automatons, living statues, and the like were powered by artificial intelligence or simply “magic” is a bit beside the point; let us remember the wisdom of Arthur C. Clarke: “Any sufficiently advanced technology is indistinguishable from magic.”
Millennia would pass before mankind had the tools to build the first systems that actually resembled artificial intelligence. The invention of the computer provided a natural platform for artificial intelligence, as these new machines could execute computations at a much larger scale and far greater speed than a single human. In 1951, Christopher Strachey and Dietrich Prinz wrote the first programs to play checkers and chess, respectively, on the Ferranti Mark 1; this is often cited as the first application of AI. At the same time, significant research was underway in the field of machine learning.
What is machine learning?
Machine learning is a subfield of artificial intelligence that focuses on making decisions (or predictions) using existing data without explicitly telling the system how to make those decisions.
There are many different techniques in machine learning, but it’s common to use past data (i.e., observations) to establish a relationship (a machine learning model) between the decision (or prediction) being made and the prior data. When new data is observed, the machine learning model applies this relationship to make a new decision (or prediction). The old adage that “the past is a good predictor of the present” is essentially the dogma of machine learning.
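To make this concrete, here’s a minimal sketch of the idea: learn a relationship from past observations via ordinary least squares, then apply it to a new observation. The data is invented purely for illustration.

```python
def fit_linear(xs, ys):
    """Learn the relationship y = a*x + b from past observations
    using ordinary least squares (closed form for one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

# Past observations (toy data): nobody told the system the rule;
# it establishes the relationship from the data alone.
past_x = [1, 2, 3, 4, 5]
past_y = [2.1, 3.9, 6.2, 8.0, 9.9]

a, b = fit_linear(past_x, past_y)

# A new observation arrives: apply the learned relationship
# to make a prediction.
prediction = a * 6 + b
```

The same pattern, fit on history and apply to new data, underlies far fancier models; only the form of the learned relationship changes.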
Research into machine learning boomed in the second half of the 20th century, and most methods we use today were invented in this period: k-nearest neighbors (1951), k-means clustering (1956), support vector machines (1963), random forests (1995), and so on. Heck, even neural nets have their origins in the 1940s, and linear regression is over 200 years old. Despite all this great theory being available, the need for and application of this work was generally nonexistent outside of select organizations, so machine learning and its related fields sat in relative obscurity until the end of the century. The advent of the PC, the Internet, and the explosion of data thereafter sparked newfound interest in the area, going as far as creating a new data discipline in the early part of the 21st century: data science.
What is data science?
Data science is a discipline within data analytics that focuses on extracting predictive insights from data.
Typically, data science is tasked with applying techniques in artificial intelligence like machine learning and statistical modeling in order to glean predictive and prescriptive insights, which makes it a specialized subfield of data analytics. The data analytics team at large is more generally tasked with extracting insights from data, and “data analysts” are usually more focused on descriptive analytics. (Note: If you’re offended that I consider data science as a subset of data analytics, think of it less in terms of skillsets and more in terms of organizational hierarchy. Additionally, I’ll offer my deepest apologies.)
Around the turn of the century, more and more companies were finding themselves with increasingly large collections of data. The savvy among us realized that this data had great value. We could apply methods in artificial intelligence to this data to better make decisions about sales, marketing, product development, business operations, etc. The sky is the limit! The main issue at the time, however, was that the knowledge of these techniques was walled off in academia. Applying research papers to a company’s raw data is quite difficult (let alone reading said papers!), so the most adventurous companies began courting these AI/ML experts to leave academia and enter the corporate world, effectively creating the role of the data scientist.
What is a data scientist?
A data scientist is one who does data science!
Although the responsibilities of a data scientist (and machine learning engineer) can vary a lot from company to company, I wish to make a generalization here that will help us better contextualize operational AI: data scientists are typically tasked with solving a data science use case, i.e., generating predictive or prescriptive insights for the business, but they are not responsible for the long-term maintenance of those insights.
Data scientists entered the corporate world with much gusto and fanfare. The data scientist was named the sexiest job of the 21st century, as well as one of the highest-paying and best jobs, and even the president decided to hire a chief data scientist. Life was great… until right after a data scientist finished tinkering with their first ML model and someone in IT inquired how it was supposed to run “in production.” Production was never part of the equation back in the lab; everything was theory and idealized, clean data. Maintaining production ML models proved challenging and outside the data scientist’s expertise, thereby sparking the creation of yet another role: the machine learning engineer.
What is a machine learning engineer?
A machine learning engineer is an engineer who is responsible for building and maintaining products and processes that leverage ML.
This effectively carves the “long-term maintenance of predictive insights” responsibility out of the data scientist role and makes it its own discipline. Yes, things can get complicated enough that a new role is required (it’s not uncommon to hear from data science veterans that “this is the only thing that is hard about data science”). It’s one of the newest roles on the “data team” and the least well-defined. As a result, experiences can differ wildly across companies, but there’s opportunity in the ambiguity.
In practice, ML engineers sit somewhere between data scientists and software engineers. Over the past decade, machine learning engineers have been plugging away at building and maintaining production processes for AI/ML workloads. There are a lot of nuances in properly deploying and maintaining production models; ML engineers get to worry about all sorts of fun things: tracking experiments, models, and data versions; detecting model and data drift; refreshing stale models; analyzing new models and comparing them to existing ones; understanding feature importance, bias, and other topics in explainability (XAI); maintaining containers, Python environments, and Kubernetes clusters; and so on and so forth.
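To give a taste of one chore from that list, here’s a heavily simplified sketch of a data-drift check: compare a feature’s live values against the distribution seen at training time. The data and the threshold are illustrative assumptions, not a production recipe.

```python
import statistics

def drift_score(train_values, live_values):
    """Shift of the live mean, measured in training standard deviations.
    A naive stand-in for real drift detectors (e.g. KS tests, PSI)."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    return abs(statistics.mean(live_values) - mu) / sigma

# Feature values seen at training time (toy data).
train = [10.0, 11.2, 9.8, 10.5, 10.1, 9.9, 10.7]

live_ok = [10.2, 10.4, 9.7]        # looks like the training data
live_drifted = [14.8, 15.3, 15.1]  # the distribution has shifted

ok_score = drift_score(train, live_ok)        # small: no alert
bad_score = drift_score(train, live_drifted)  # large: retrain the model?
```

Multiply this by every feature, every model, and every deployment, and it becomes clear why maintaining production ML is a full-time discipline.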
The job of the ML engineer can get complicated very quickly, and it’s easy to see why it’s desirable to peel this off from the role of data scientist, whom we wish to keep focused on solving problems. Over time, ML engineers have created a lot of tools to help them maintain all these different processes, and this collection has become known as ML Operations, or MLOps.
What is machine learning operations (MLOps)?
MLOps is a subset of applied AI that focuses on deploying and maintaining predictive models in production.
MLOps tends to be the most difficult part of the ML workflow (the part organizations and data teams struggle with most) and the most technically involved, meaning it’s rare to find people who do this well, if at all. MLOps has also infamously produced a huge amount of tooling sprawl: it is often necessary to chain multiple tools from the MLOps toolbox to complete a single use case.
More tools mean more complexity, and the production ML/AI workflow ends up burdensome to manage. So much so that many AI projects simply fail and never reach a meaningful production stage. However, where there is compounding complexity, there is also an opportunity for dramatic simplification of the stack. High-tech companies (Uber, Facebook, Google, Netflix, Tesla, etc.) at the tip of the proverbial spear of AI/ML adoption realized this in the second half of the 2010s and began building systems that re-envision the AI workflow as one that is actually productive and can be used to build and maintain workflows in a short amount of time. This leads us straight to operational AI.
What is operational AI?
Simply put, operational AI enables end-to-end execution of production AI use cases.
The main goals of operational AI can be summarized as follows:
- Make it easy to build and maintain production AI. I can count on one hand the number of people who have said their company had a system that made operationalizing AI easy. We’re over a decade into this rodeo that we call data science, so if the old methods aren’t working, it’s time to look into something new. A wise person once said that the definition of insanity is doing the same thing over and over while expecting different results.
- Allow many different user personas to execute AI workflows. I’ve never liked the term or idea of “citizen data scientist”, but that doesn’t mean we shouldn’t try to enable as many people as possible to execute ML workflows. What if a data analyst or analytics engineer could easily apply a workflow built by a data scientist to their own data? Some data scientists think this is crazy and that everything must be built by hand. What if people could drive cars without being professional drivers? That might just be transformative to life as we know it. Being able to run an ML workflow doesn’t make you a (citizen) data scientist; it just makes your organization better.
- Easily scale the work of data scientists. I routinely talk to data scientists who are brilliant but drowning in the upkeep of their own and others’ models in production. Per our definitions above, data scientists are builders and should be focused on building and executing new use cases. For a variety of reasons, the handoff between the data scientist and the ML engineer is often fumbled, and data scientists are usually left paying the bill. Operational AI gives them a route to scaling their own work more easily.
- Translate software engineering best practices into the world of data science. The gap between DS and ML engineers is also widened due to the lack of hardened best practices for data scientists. Their workflow is much more development and research-oriented, and they often work outside the rigor of something like an SDLC. ML engineers know that without this rigor, production processes are just time bombs waiting to explode. They need shared systems that enforce best practices across the workflow.
Operational AI is a logical and necessary successor to MLOps. We’ll spend the rest of this series diving more into the details of operational AI and the impact it can have on your organization, but I first wish to spend some time highlighting the differences between operational AI and MLOps.
To the uninitiated, MLOps and operational AI probably seem like the same thing, but operational AI takes a fundamentally different approach to workflow construction. A key tenet of operational AI is that the system is declarative: a declarative system lets users specify what to build instead of how to build it. To put this in the context of an ML use case, the what could be “a 90-day sales forecast,” whereas the how could consist of hundreds or thousands of lines of code. One of these interfaces should look easier than the other.
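Here’s a toy sketch of the contrast. The spec format and the `run` “engine” are hypothetical, invented for this illustration, but they show the shape of the idea: the user declares the what, and the system owns the how.

```python
# The *what*: a declarative spec. The user states the outcome they
# need, nothing about feature engineering, model selection, or serving.
# (This spec format is made up for illustration.)
spec = {
    "use_case": "sales_forecast",
    "target": "daily_sales",
    "horizon_days": 90,
}

def run(spec, history):
    """A stand-in engine hiding the *how*. A real operational AI
    system would do feature engineering, training, evaluation, and
    deployment here; this sketch just extrapolates the mean so the
    example stays tiny and runnable."""
    baseline = sum(history) / len(history)
    return [baseline] * spec["horizon_days"]

# The user supplies data and a spec; the imperative machinery
# (hundreds or thousands of lines in practice) stays out of sight.
forecast = run(spec, history=[120.0, 130.0, 125.0])
```

The point isn’t the (deliberately silly) model inside `run`; it’s that everything inside `run` can be improved, swapped, and hardened without the user’s spec ever changing.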
Declarative workflows turn ease of use and reproducibility into first-class capabilities any data professional can leverage. Every ML toolchain I’ve seen has some imperative component in it. These components need to be explicitly instructed on how to handle various scenarios. Every imperative component adds new choices to be made, which adds complexity and introduces ways for the workflow to fail, bugs to be introduced, and disagreements to break out between data scientists and machine learning engineers. Even when a single MLOps component does conform to a declarative mindset, without the entire workflow being declarative end-to-end, it does not compare to the experience of operational AI. Somewhere the workflow breaks down and exposes the underlying complexity to the end user.
Every MLOps system I’ve seen is a complex arrangement of specialized tools standing in for a more elegant chain that has not yet been realized. Operational AI, by contrast, begins with an end-to-end design. The MLOps ecosystem has devolved into a swamp of point solutions, and it is extremely complex and harrowing to solve an AI use case end to end without an operational system. Large companies that must leverage AI typically do so by employing dozens or hundreds of ML engineers to maintain their various use cases (and no, you’re not Google). Headcount scales directly with the number of use cases, which is never a good thing. This is exactly why high-tech companies have begun leveraging operational AI systems internally: they drastically cut down the complexity while keeping scale under control.
There are entire communities dedicated to discussing, evaluating, and troubleshooting MLOps tools. Few report that MLOps adoption is going well at their company, and fewer still seem to understand what’s going on in the ecosystem. We could wallpaper a room with all the different MLOps ecosystem diagrams that exist, each with dozens upon dozens of technologies and non-overlapping categories, with little consistency in product placement between diagrams (if the experts can’t agree on what the tools do, what hope is there for the rest of us?). If this is “right,” I’d rather be wrong. I understand the appeal of a flexible workflow, but using an end-to-end tool doesn’t mean we have to sacrifice flexibility.
The last major contrast that I’d like to make is that operational AI is very likely to embrace a data-centric approach. In a previous article, I discussed the evolution of ML platforms over the years, from code-centric to model-centric, and now to the newly emerging data-centric. The majority of MLOps tools land squarely in the model-centric or code-centric approach to AI, which we know are difficult to carry through an operational workflow.
Fundamentally, the data-centric approach couples closely with an end-to-end workflow. Indeed, it’s difficult to break the end-to-end focus and still retain a data-centric mantra, as many of the intermediate stages in the ML pipeline are inherently focused on non-data artifacts. It’s only through abstraction that we can truly take advantage of the benefits of data-centricity. The data-centric approach also synergizes nicely with the declarative nature of operational AI, as it enables non-expert users to build technically sound AI workflows while focusing entirely on the input data sets.
In Part 2, we’ll deep dive into Operational AI and provide a design framework to build upon.
Bonus: Companies that are building and utilizing operational AI systems have created yet another role on the data team: the machine learning platform engineer, i.e., one who builds and maintains an operational AI system. At the time of writing, there appear to be fewer than 200 people on LinkedIn claiming this title, compared with over 40,000 who are machine learning engineers.