“Move fast, think even faster” is the ultimate goal of data science.
We all want our AI models to make predictions faster and better than we can ourselves. Even more, we want to develop those models at lightspeed. But the reality is often very different. Running experiments, comparing results, deploying models, and monitoring them is not fast or efficient. It’s slow, tedious, and time-consuming.
If we want to speed up the time it takes to get a model from idea to inference, we’ve got to get better at performing more experiments faster. The more experiments we can perform, the more we know.
Jupyter notebooks have become so popular in data science because we can prototype and communicate an idea in minutes. We can move fast and think even faster. But moving fast always comes with a big downside. It’s like riding a bicycle. You have to gain some momentum just to stay up (think proving an initial concept), but once you’re moving, the faster you go, the riskier things get.
I can’t count the number of times I’ve been in a situation where the results in the notebook weren’t reproducible, we couldn’t find the version of the dataset that was used, or where we had no clue what the package dependencies were when the model was produced. When all of these components are changing at the same time, it’s like riding a bicycle without brakes and we have to either risk a terrible crash or slow down to a crawl to keep things from getting out of control.
What if we could add brakes to our machine learning development? We need something reliable that will let us go fast, but not get out of control. Stop when we need to turn around, slow down when we need to, and still get to where we need to go safely every time.
So we know which tools increase agility — notebooks, ML libraries, deployment tools, etc. — but what are the brakes? It’s operations. Operations are the processes that make anything standard and repeatable. And in machine learning, that means MLOps.
There are dozens of ways to introduce MLOps to your model development process. You could focus on testing or automation or versioning, but what’s really going to boost your repeatability without sacrificing too much agility?
What moves us beyond the “Fred Flintstone approach to braking” in MLOps? In a previous article, Versioning and Labeling — Better Together, we combined two tools to add data versioning to data labeling. In this article, we’ll look at the model development stage and integrate the experiment monitoring of ClearML (previously Allegro Trains) with the powerful data versioning and pipeline capabilities of Pachyderm.
ClearML gives you a machine learning experiment management and monitoring platform. It allows you to easily compare experiments, visualize the model results, log training information, and even perform hyperparameter searches. It’s a simple and open source solution to the experiment tracking problem. I’ve used it many times to compare my experiments by overlaying training curves and comparing memory usage for different mini-batch sizes in deep learning models.
Pachyderm delivers a robust data science and processing platform, combining highly customizable pipelines with data versioning and data lineage. It tracks all your files and model artifacts in a git-like immutable file system, making it an incredibly powerful tool for building data-driven workflows.
The unique combination of pipelines with versioned data allows you to automatically train your models when your data changes, version your models, and have a full lineage of how a model was produced.
Because it is built on top of Kubernetes, Pachyderm inherits the scalability and parallelization powers of the premier data orchestration engine for the cloud. We’ll connect ClearML to Pachyderm using Pachyderm secrets. These mirror the Kubernetes approach to secrets, creating a safe and secure way to pass credential information to a pipeline job when it executes.
Combining ClearML with Pachyderm
ClearML allows you to train jobs anywhere, and through a configuration, log the training information to a ClearML server. This is really useful if you are running jobs in multiple environments, yet you still want to be able to compare the experiments. In our case, we securely configure a Pachyderm pipeline to log our model training to ClearML.
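To make this concrete, here is a minimal sketch of what a training script run by such a pipeline could look like. The project name, data path, and `run_one_epoch` helper are illustrative assumptions, not code from the official example; we also assume the `CLEARML_API_*` environment variables are populated from a Pachyderm secret at runtime.

```python
# Sketch of a training script a Pachyderm pipeline could run.
# Assumptions: the `clearml` package is installed in the pipeline image,
# and CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY / CLEARML_API_HOST
# are injected from a Pachyderm secret. Names are illustrative.
import os

DATA_DIR = "/pfs/data"  # Pachyderm mounts input repos under /pfs/<repo>


def run_one_epoch(path: str) -> float:
    # Placeholder: substitute your actual model training step here.
    return 0.0


def train(data_dir: str = DATA_DIR) -> None:
    # Imported lazily so the module can be loaded without a ClearML server.
    from clearml import Task

    # With no clearml.conf in the container, Task.init reads credentials
    # from the CLEARML_API_* environment variables.
    task = Task.init(project_name="pachyderm-demo", task_name="train")
    logger = task.get_logger()

    # Train on whatever versioned data Pachyderm mounted for this job,
    # logging a scalar per epoch so runs can be compared in the ClearML UI.
    for epoch, filename in enumerate(sorted(os.listdir(data_dir))):
        loss = run_one_epoch(os.path.join(data_dir, filename))
        logger.report_scalar(title="loss", series="train",
                             value=loss, iteration=epoch)
```

Inside the pipeline container, this script would simply be invoked with `python3 train.py`; every run then shows up as an experiment in ClearML, regardless of which Pachyderm commit triggered it.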
Pachyderm pipelines use Docker containers to execute stages of our transformations on Kubernetes. When we enable our pipeline to communicate with an external service, we create access keys via a Pachyderm secret (similar to a Kubernetes secret) containing our ClearML access credentials to get them connected securely.
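As a sketch of how the two pieces wire together, a Pachyderm pipeline spec can map keys from a secret into environment variables in the `transform` section. The repo, image, and secret names below are illustrative assumptions, not taken from the official example:

```json
{
  "pipeline": {"name": "train-model"},
  "input": {"pfs": {"repo": "data", "glob": "/"}},
  "transform": {
    "image": "my-registry/clearml-train:latest",
    "cmd": ["python3", "/workdir/train.py"],
    "secrets": [
      {"name": "clearml-secret", "env_var": "CLEARML_API_ACCESS_KEY", "key": "access_key"},
      {"name": "clearml-secret", "env_var": "CLEARML_API_SECRET_KEY", "key": "secret_key"}
    ]
  }
}
```

The credentials never appear in the pipeline spec or the container image; they are resolved from the secret only when the job executes.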
Why would we want to do this? One of the most powerful features of Pachyderm is data-driven pipelines. Most ML pipelines today mirror the old paradigm handed down from the traditional software development world: the logic is hand-coded and the data is secondary. You write all the code for a login application manually and only touch data to grab a password or username. But in the machine learning world, data is central. The model learns its own rules and logic from the data itself, which means we need to invert the entire paradigm. That’s what data-driven pipelines give us. You don’t want to write a while loop to constantly check whether your data has changed. You want the data to signal that for you. If our data changes, we don’t need to remember to rerun our experiment manually; Pachyderm will do that for us automatically. This enables teams to develop dynamic datasets that continue to update, while also allowing them to compare and contrast their experiments and track the history of their models.
Run It Yourself
A full example and walkthrough of this integration can be found in this GitHub repo.
The setup for this example requires a running Pachyderm cluster (I’m using Pachyderm Hub) and a ClearML server (I’m using the ClearML Hosted version). You can start a free version of each for this example.
Once we have ClearML running, we need to create our ClearML credentials in the UI. We’ll copy these credentials into a
secrets.json file that will be used to create our Pachyderm secret. This will keep our credentials secure by mapping them into the container when the pipeline starts, rather than building the configuration into the container.
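Pachyderm secrets use the Kubernetes Secret manifest format, so a hypothetical secrets.json might look like the following (the secret name and key names here are placeholders; substitute the credentials copied from your own ClearML UI):

```json
{
  "apiVersion": "v1",
  "kind": "Secret",
  "metadata": {"name": "clearml-secret"},
  "type": "Opaque",
  "stringData": {
    "access_key": "<your ClearML access key>",
    "secret_key": "<your ClearML secret key>"
  }
}
```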
Next, we’ll create the Pachyderm secret by running:
pachctl create secret -f secrets.json
Finally, we can create a repository and deploy our pipeline, which will log its results to our ClearML server.
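Assuming illustrative repo and pipeline file names (check the linked GitHub repo for the exact ones), the commands look roughly like:

pachctl create repo data
pachctl create pipeline -f pipeline.json
pachctl put file data@master:/train.csv -f train.csv

The put file commit is what triggers the pipeline: Pachyderm detects the new data, runs the job, and the training results appear as a new experiment in ClearML.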
Experimentation speed and reliable reproducibility are essential to any agile machine learning team. Individually, many machine learning tools have improved the speed of development, but a dependable stack of complementary tools that work together seamlessly is what will make large-scale data science teams successful in the coming years. Today’s and tomorrow’s data science teams need reproducibility and speed in a single platform.
Combining Pachyderm with ClearML delivers the core mechanisms you need to build rock-solid stability into your development process while maintaining the agility you need for rapid experimentation. That moves us one step closer to “moving fast and thinking even faster.”