Machine Learning Pipelines | Part 1
Unpacking the Value of Machine Learning Pipelines
When Henry Ford’s company built its first moving assembly line in 1913 to produce its legendary Model T, it cut the time it took to build each car from 12 to 3 hours. This drastically reduced costs, allowing the Model T to become the first affordable automobile in history. It also made mass production possible: soon, roads were flooded with Model Ts.
Since the production process was now a clear sequence of well-defined steps (in other words, a pipeline), it became possible to automate some of these steps, saving even more time and money. Today, cars are mostly built by machines. But it’s not just about time and money: for many repetitive tasks, a machine will produce much more consistent results than humans, making the final product more predictable and reliable.
On the flip side, setting up an assembly line can be a long and costly process. And it’s not ideal if you want to produce small quantities or highly customized products. Ford famously said, “Any customer can have a car painted any color that he wants, so long as it is black.”
The history of car manufacturing has repeated itself in the software industry over the last couple of decades: every significant piece of software nowadays is typically built, tested, and deployed using automation tools such as Jenkins or Travis. However, the Model T metaphor isn’t sufficient anymore. The software doesn’t just get deployed and forgotten; it must be monitored, maintained, and updated regularly. Software pipelines now look more like dynamic loops than static production lines. It’s crucial to be able to quickly update the software (or the pipeline itself) without ever breaking it. And software is much more customizable than the Model T ever was: software can be painted any color (e.g., try counting the number of MS Office variants that exist).
Unfortunately, “classical” automation tools are not well suited to handle a full machine learning pipeline. Indeed, an ML model is not a regular piece of software. For one, a large part of its behavior is driven by the data it trains on. Therefore, the training data itself must be treated as code (e.g., versioned). This is quite a tricky problem because new data pops up every day (often in large quantities), usually evolves and drifts over time, often includes private data, and must be labeled before you can feed it to supervised learning algorithms.
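To make the “data as code” idea concrete, here is a minimal sketch in plain Python (all paths and names are hypothetical) that fingerprints a dataset with a content hash and records it alongside the model’s metadata, so every trained model can be traced back to the exact data snapshot it was trained on. Dedicated tools such as DVC or the metadata tracking in TFX handle this far more thoroughly; this only illustrates the idea.

```python
import hashlib
import json
from pathlib import Path

def dataset_fingerprint(data_dir: str) -> str:
    """Hash every file in a dataset directory to get a reproducible version id."""
    digest = hashlib.sha256()
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            digest.update(path.name.encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()

def record_training_run(data_dir: str, model_name: str, out_file: str = "run_metadata.json"):
    """Store the data version next to the model name so the run is traceable."""
    metadata = {
        "model": model_name,
        "data_version": dataset_fingerprint(data_dir),
    }
    Path(out_file).write_text(json.dumps(metadata, indent=2))
    return metadata

# Hypothetical usage: pin the exact data snapshot used to train this model.
# record_training_run("data/train", model_name="churn-classifier-v1")
```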
Second, the behavior of a model is often quite opaque: it may pass all the tests on some data but fail entirely on others. So you must ensure that your tests cover all the data domains on which your model will be used in production. In particular, you must make sure that it doesn’t discriminate against a subset of your users.
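One common way to catch this is slice-based evaluation: rather than relying on a single aggregate metric, the pipeline computes the metric separately for each segment of users (by region, age group, device, and so on) and fails if any slice drops below a threshold. Below is a minimal sketch in plain Python; the field names and threshold are hypothetical.

```python
from collections import defaultdict

def accuracy_per_slice(examples, predictions, slice_key):
    """Compute accuracy separately for each value of `slice_key` (e.g., "region")."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for example, predicted in zip(examples, predictions):
        group = example[slice_key]
        total[group] += 1
        correct[group] += int(predicted == example["label"])
    return {group: correct[group] / total[group] for group in total}

def check_slices(per_slice_accuracy, minimum=0.85):
    """Fail the pipeline if any user segment performs below the threshold."""
    failing = {g: acc for g, acc in per_slice_accuracy.items() if acc < minimum}
    if failing:
        raise ValueError(f"Model underperforms on slices: {failing}")

# Toy usage: the aggregate accuracy looks fine, but the "south" slice fails.
examples = [
    {"region": "north", "label": 1},
    {"region": "north", "label": 0},
    {"region": "south", "label": 1},
]
predictions = [1, 0, 0]
try:
    check_slices(accuracy_per_slice(examples, predictions, slice_key="region"))
except ValueError as err:
    print(err)  # Model underperforms on slices: {'south': 0.0}
```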
For these (and other) reasons, data scientists and software engineers first started building and training ML models manually, “in their garage,” so to speak, and many of them still do. But new automation tools have been developed in the past few years that tackle the challenges of ML pipelines, such as TensorFlow Extended (TFX) and Kubeflow. More and more organizations are starting to use these tools to create ML pipelines that automate most (or all) of the steps involved in building and training ML models. The benefits of this automation are mostly the same as for the car industry: save time and money; build better, more reliable, and safer models; and spend more time on useful tasks rather than copying data around or staring at learning curves.
However, building an ML pipeline is not trivial.
During the last few years, the developments in the field of machine learning have been astonishing. With the broad availability of graphics processing units (GPUs) and the rise of new deep learning architectures such as Transformers (for example, BERT) and Generative Adversarial Networks (GANs, for example, deep convolutional GANs), the number of AI projects and startups has skyrocketed. Organizations are increasingly applying the latest machine learning concepts to all kinds of business problems. In this rush toward the most performant machine learning solutions, however, a few things have received less attention. Data scientists and machine learning engineers lack good sources of information about the concepts and tools needed to accelerate, reuse, manage, and deploy their work. What is needed is the standardization of machine learning pipelines.
Machine learning pipelines implement and formalize processes to accelerate, reuse, manage, and deploy machine learning models. Software engineering went through the same changes a decade or so ago with the introduction of continuous integration (CI) and continuous deployment (CD). Back in the day, testing and deploying a web app was a lengthy process that required close collaboration between a DevOps engineer and a software developer; today, thanks to a handful of tools and concepts, an app can be tested and deployed reliably in a matter of minutes. Data scientists and machine learning engineers can learn a lot about workflows from software engineering.
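To show what “formalizing the process” can look like, here is a toy sketch in plain Python that expresses a pipeline as an explicit, ordered list of steps sharing state, much as a CI job runs stages in sequence. All step names and the placeholder “model” are made up for illustration; orchestrators such as TFX or Kubeflow Pipelines add scheduling, caching, artifact tracking, and deployment on top of this basic idea.

```python
def ingest(state):
    # Toy dataset: pairs of (feature, label).
    state["data"] = [(x, x % 2) for x in range(100)]
    return state

def validate(state):
    # Data validation step: every label must be 0 or 1.
    assert all(label in (0, 1) for _, label in state["data"])
    return state

def train(state):
    # Placeholder "model": simply predicts the majority class of the training data.
    labels = [label for _, label in state["data"]]
    state["model"] = max(set(labels), key=labels.count)
    return state

def evaluate(state):
    hits = sum(label == state["model"] for _, label in state["data"])
    state["accuracy"] = hits / len(state["data"])
    return state

def run_pipeline(steps):
    """Run each step in order, passing shared state along, like stages in a CI job."""
    state = {}
    for step in steps:
        state = step(state)
    return state

result = run_pipeline([ingest, validate, train, evaluate])
print(result["accuracy"])  # 0.5 for this toy data
```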
I invite you to look out for Part 2 of this series, where we will continue unpacking the value of machine learning pipelines and their contribution to the technological and industrial revolution.