Automating Machine Learning / Deep Learning Pipelines

Ahmedabdullah · Published in Red Buffer · Mar 16, 2022 · 11 min read

Hey there, fellow ML Engineer! If you’ve been following my previous posts, I’m assuming your ML journey is going great so far and you’re now facing the same problem as 90% of other ML Engineers. Most of us already know how cool CI/CD is (we’ve probably all seen some CircleCI ads on YouTube xD). Every time I saw those ads, or saw development teams getting the most out of CI/CD pipelines, I wondered how great it would be if Machine Learning Engineers could utilize a concept like that too.

In this article, we’ll go through in detail how you can do exactly that: automate your Machine Learning pipelines. Enough with my rant; let’s get started with what CI/CD is. If you already know, feel free to skip to the next section.

CI and CD stand for continuous integration and continuous delivery/continuous deployment. In very simple terms, CI is a modern software development practice in which incremental code changes are made frequently and reliably. Automated build-and-test steps triggered by CI ensure that code changes being merged into the repository are reliable. The code is then delivered quickly and seamlessly as part of the CD process.

Understanding CI/CD

The main components of CI/CD are as follows:

  • Build — the stage at which the app is compiled.
  • Test — the stage at which changes are tested.
  • Release — the stage where the code becomes part of the repo.
  • Deploy — the stage where the code is actually deployed.

By now, you are clear on what CI/CD is but you might be wondering about a few other things:
1) How can we utilize this concept in Machine Learning?
2) Is there a tool that we can use to implement this?
3) How easy is it to use, and how can we do it?
4) What are the scenarios where this will prove to be useful for us?

Let’s discuss all of the above questions in detail.

How can we utilize this concept in Machine Learning?

Starting with the first question: we can absolutely utilize this concept for our daily machine learning tasks. In Machine Learning and Deep Learning, we have techniques such as A/B testing available while developing a model to help decide between modeling approaches. One scenario: if all of our training pipelines are in place, we can simply change the model function and automate the training process. If the training accuracy falls below that of the previous approach, we don’t move forward with the pull request and try a new approach instead. If it performs better, we approve the pipeline and send it for deployment on our instance. Hence all the steps of CI/CD are reflected in the Machine Learning domain too.

Is there a tool that we can use to implement this?

Yes! We can achieve everything we discussed above with the help of GitHub. There’s an open-source tool built especially for Machine Learning: CML, or Continuous Machine Learning, from the team behind DVC. We can use it to manage ML experiments and track who trained ML models or modified data, and when. Instead of pushing data and models to your Git repo, you codify them with DVC. This works not only on GitHub but also on GitLab. To find out more about CML, you can visit https://cml.dev/.

How easy is it to use, and how can we do it?

To answer this question and show you how easily one can achieve this, I’ll walk through a working demo below. For this particular example, I’ve taken a simple regression task: we take a dataset of motor cars and regress the miles-per-gallon (MPG) value for each row. It’s a pretty standard regression task, but the goal here is to show you how to get started.

Getting Started

Finally, let’s get started with some hands-on experimentation to see how we can achieve this.

So here I have this demo repository with a simple regression pipeline. Currently, the repository has just one main branch. The files are:

  • constants.py: all the constants related to the project, including the list of features we use for regression. Think of it as the config file.
  • helperfunctions.py: all the functions that support the main pipeline.
  • model.py: the definition of the model we’re using.
  • run.py: the main file that runs the entire pipeline.

Feel free to visit the repository (https://github.com/Ahmed4221/CICD-Test) to learn more in detail.
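
To make the pipeline concrete, here’s a minimal sketch of what a run.py-style script could look like. This is my own illustrative stand-in, not the repo’s actual code: the feature list would normally be imported from constants.py, and the script writes metrics.txt and model_loss.png, which the CML steps later in this article will pick up.

    # A minimal, hypothetical run.py sketch (the real repo's code differs).
    # Trains a regression model on the Auto MPG data, then writes
    # metrics.txt and model_loss.png for CML to report on later.
    import pandas as pd
    import matplotlib
    matplotlib.use("Agg")  # headless backend for CI runners
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    # Normally imported from constants.py
    TARGET = "MPG"
    FEATURES_TO_USE = ["Cylinders", "Displacement", "Horsepower", "Weight"]

    df = pd.read_csv("auto-mpg.csv").dropna()
    X, y = df[FEATURES_TO_USE], df[TARGET]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    model = LinearRegression().fit(X_train, y_train)
    preds = model.predict(X_test)
    mse = mean_squared_error(y_test, preds)

    # Metrics file that the workflow will post on the pull request
    with open("metrics.txt", "w") as f:
        f.write(f"Test MSE: {mse:.3f}\n")

    # A simple predicted-vs-actual plot as a stand-in for the loss plot
    plt.scatter(y_test, preds)
    plt.xlabel("Actual MPG")
    plt.ylabel("Predicted MPG")
    plt.savefig("model_loss.png")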

Let’s create a separate branch from the main branch to get a feel for working on a project with a team. I’m going to name my branch FirstTry xD. We all know how to create a branch by now, don’t we? While we’re at it, let’s also create a pull request from this branch against the main branch.

Now that we have a branch, all we need to do to incorporate the CI/CD or CML flow is add a cml.yaml file, so let’s do that.

The path for cml.yaml is always going to be the same in your repo: the workflow lives under .github/workflows, and the file will be named cml.yaml.

We can create the file directly from GitHub: go to Add file, then type out the path where the file should be created. Navigate to the path mentioned above and create the file there. (By the way, I’ve done this before, so .github/workflows already exists in my repo; GitHub will create the folders for you as you type out the path.)

Once we have created the YAML file, what I always do is pick up the YAML template from the CML repository (https://github.com/iterative/cml).

As we can see on that page, there’s already a template available that we can modify as per our needs, so we just copy all of it and paste it into our YAML file, leaving out the parts we don’t need.

I’ve given it the name CI-ML-Regression-example. I’ve set a trigger on push, so as soon as code gets pushed to this branch, the pipeline will run. runs-on decides which OS image your code will run on.
Below run is the part where we decide which files will run, and in which sequence, when code gets pushed to this branch.

Each time this runs, GitHub allocates a runner, dockerizes your code, and runs the files in the sequence you define under run. For that, we need a requirements.txt, so that when our Docker container gets created it knows which libraries to install.

For our use case, let’s suppose we are doing feature engineering and want the pipeline to run as soon as we add or remove features and push the code. The features will be changed in the constants.py file, and the file that trains the model is run.py. So I’ve put python run.py right after the pip install -r requirements.txt line.
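
Since the original screenshot doesn’t reproduce here, this is roughly what such a cml.yaml looks like, reconstructed from the public CML template. The step name Training the Model matches the tab we’ll look for in the logs later, but the exact contents of the original file may differ slightly:

    name: CI-ML-Regression-example
    on: [push]
    jobs:
      run:
        runs-on: ubuntu-latest
        # The CML Docker image ships with Python and the cml-* tools preinstalled
        container: docker://dvcorg/cml-py3:latest
        steps:
          - uses: actions/checkout@v2
          - name: Training the Model
            env:
              repo_token: ${{ secrets.GITHUB_TOKEN }}
            run: |
              pip install -r requirements.txt
              python run.py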

As soon as we save this file and commit the changes, navigate to the Actions tab above. Here you can see that the update you just made actually started a job, and it’s running. You’ll see something like below:

(Instead of 30 runs, you’ll have just 1.) This is the pipeline execution that just happened because you pushed a change to your branch. Let’s click on it to find more details.

This will show you how long your current pipeline took to finish all the tasks you put under run in the cml.yaml file. Click on it again and we can see all the steps that were performed, from making a container to running run.py.

This has all the logs of the entire pipeline. If we want to see the results of our model training, we can simply click on the Training the Model tab to view the logs.

These are the verbose outputs from my current run. Did you see that? Just by adding a simple YAML file, your whole ML pipeline gets automated, and all you have to do is push the code. Still not impressed?

Alright, let’s do another one. You might be thinking that going to Actions and then digging through the logs is pretty inconvenient, right? Let’s change that by adding just 3 lines. So now your YAML file will look something like this:
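
The screenshot isn’t reproduced here, but based on the standard CML template, the additions amount to appending the following to the end of the run block (cml-send-comment is the CML command that posts a markdown report as a comment; the original file may differ slightly):

    run: |
      pip install -r requirements.txt
      python run.py

      # New: collect the metrics and publish them as a PR comment
      echo "## Model Metrics" >> report.md
      cat metrics.txt >> report.md
      cml-send-comment report.md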

Notice that the last three lines (lines 17 to 20 in my file) are the ones that I added. What they do: my run writes the model performance into metrics.txt; these lines take the text from there, copy it into the report.md file, and publish it. Publish it where, you ask?

Well, open the pull request you created earlier and you’ll see. Switch over to the Pull requests tab and open the PR for this branch.

Click on the PR; in my case, it was the one created for the YAML file.
Voila! Now the whole team can see the impact of the changes in your branch on model performance, just by visiting the PR that you created.

It shows us the results of our current model run and shares them with the team over our PR. Not only that, it’ll keep track of all the changes you commit, so you have a record of the performance across all of your experimentation. But let me show you something really cool here. Wouldn’t it be even cooler if we could also share the visual performance of our model or approach with the team over the same pull request?

What I’ve done now is save a model_loss.png image each time my model runs, so we can see the results visually. In the cml.yaml file I just add these 2 lines, and my cml.yaml now looks like this:
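
Again reconstructing from the standard CML commands (the original screenshot may differ slightly), the two added lines, the echo header and the cml-publish call, sit just before the existing cml-send-comment; cml-publish uploads the image and emits a markdown link that we append to the report:

      # New: attach the loss plot to the same report
      echo "## Model Loss" >> report.md
      cml-publish model_loss.png --md >> report.md
      cml-send-comment report.md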

What this will do is also publish the loss image along with the performance numbers on the PR. Let’s wait for the job to finish its run and head over to the PR tab again.

By simply adding 2 lines, we can show our model performance visually and share it with the team. Not only that, we can also keep track of all the experimentation we did. Again, all we had to do was push our code.

Oh my god, wow xD

P.S.

You also get this beautiful email from GitHub on each run with all the metrics that were shown in the PR.

Alright, enough with the fancy stuff. Say you’re an ML engineer and, in the same project, you find that a few features correlate more strongly with your target variable than the others. Let’s get the new feature set in place and see the results. Currently, in our constants.py file, for our target variable MPG, we are using the following FEATURES_TO_USE for our regression model:

    FEATURES_TO_USE = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration', 'Model Year', 'Europe', 'Japan', 'USA']

Suppose we want to try it with only three features and see how that performs. All we do is remove the others and commit our changes; our constants.py now looks like this:
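
The exact three features chosen in the original screenshot aren’t recoverable here, so the pick below is hypothetical; the point is simply that the list gets trimmed down:

    # Hypothetical reduced feature set (target plus three features)
    FEATURES_TO_USE = ['MPG', 'Horsepower', 'Weight', 'Acceleration']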

As soon as we push the changes from our local machine to our branch, the pipeline runs automatically, and we can compare the results with our previous runs in the PR.

Turns out the loss is higher than last time, so choosing fewer features was a bad move xD.

Again, this is just a naive example to give you the gist of CML. There are many other scenarios where we can apply this.

What are the scenarios where this will prove to be useful for us?

  1. One scenario where I find this useful is feature engineering. We can derive new features and experiment with them to see how they perform, while keeping track of the performance. We don’t have to worry about the rest of the pipeline; we just focus on feature engineering with some baseline model.
  2. Another scenario is when you are working on an ML project with a team: you can work on your own models in your own branch, and so can your peers, and then you can all decide which one is best and go for a merge.
  3. Another scenario, and the most common one in Machine Learning and Deep Learning problems, is data labeling. If you have a pre-decided approach, you can simply create a pipeline on top of it; every time new data gets pushed, you can see whether it increased or decreased your model’s performance, and essentially get rid of any freshly labeled data batch that makes your model perform worse.
  4. I personally find it helpful when mentoring a junior, since I can keep track of the experimentation they are doing and how the results turn out.

Common Questions

  1. We don’t push the data into the repo, so doesn’t scenario 3 fail?
    Yes, and for that we have DVC, or data version control; my next article will be about that.
  2. We use GPUs for our deep models. Does GitHub provide them for free?
    No, GitHub doesn’t, but it allows you to connect your own cloud GPU instance to your pipeline, so every time it runs, the pipeline gets executed on your own GPU.
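
As a sketch of how that hookup works (based on the CML docs, with placeholder region, instance type, and label values; your cloud credentials go into the repo secrets): a first job launches a self-hosted runner on your cloud account with the cml-runner command, and the training job then targets that runner.

    jobs:
      deploy-runner:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v2
          - uses: iterative/setup-cml@v1
          - name: Deploy GPU runner on AWS
            env:
              REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
              AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
              AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
            run: |
              # Provision a cloud GPU instance and register it as a runner
              cml-runner \
                --cloud aws \
                --cloud-region us-west \
                --cloud-type g4dn.xlarge \
                --labels cml-gpu
      train:
        needs: deploy-runner
        runs-on: [self-hosted, cml-gpu]
        steps:
          - uses: actions/checkout@v2
          - name: Train on the GPU runner
            run: |
              pip install -r requirements.txt
              python run.py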

Conclusion

CML is a great tool to use in your daily or company projects. It helps you track things and keeps them running smoothly. Not only that, it’s simple and takes almost no effort. You can also add a deployment stage to your pipelines, but we’ll go through that in the next articles. Feel free to visit the repository and experiment with it at the link below.
https://github.com/Ahmed4221/CICD-Test/pulls

Other links where you can learn more about CML:
https://github.com/iterative/cml
https://cml.dev/
