My experience doing a Reproducible Performance Comparison of Machine Learning Libraries
TL;DR: I will show you the results of reproducing the performance comparison of two machine learning libraries, xLearn and liblinear, using Popper. You are welcome to take a look at the code here.
In this post, I’ll describe one of the projects I worked on as part of my internship in the 2020 CROSS Undergraduate Summer Research Experience program and my experience while doing it.
This project reproduces the performance comparison of two machine learning libraries as a workflow, using a tool called Popper, in a way that is significantly easier for other people to follow. The objective is to help the community understand how workflows can be used in machine learning environments.
First of all, let me get started by explaining what Popper is. Popper is a tool for defining and executing container-native workflows in Docker, as well as other container engines. You are able to define a workflow in a YAML file, and then execute it with a single command. Cool, right?
Let’s get a bit of context first
In case you find yourself in the same position I was in a month ago (clueless about how machine learning works), don’t worry, I’ve got you.
Machine learning usually relies on two types of algorithms. Supervised learning algorithms train a model on a known data set that contains both the inputs and the desired outputs, so that the model can predict outputs for new inputs. Unsupervised learning algorithms receive no desired outputs and instead have to study the input data to identify patterns in it.
In this project we will be focusing on supervised learning, which uses classification and regression techniques to build predictive models. There are many such techniques, including Naïve Bayes, Support Vector Machines, Logistic Regression, and many more.
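To make the idea of supervised learning a bit more concrete, here is a minimal sketch of logistic regression trained with plain gradient descent. It is not code from xLearn or liblinear; the toy data, the function names, and the learning rate are all made up for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(xs, ys, lr=0.5, epochs=2000):
    """Fit a weight w and bias b for one-feature inputs by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the average log loss with respect to w and b.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict(w: float, b: float, x: float) -> int:
    """Return class 1 when the predicted probability is at least 0.5."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0


# Hypothetical labeled data: small inputs are class 0, large inputs are class 1.
train_x = [0.0, 0.5, 1.0, 1.5, 2.5, 3.0, 3.5, 4.0]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = train_logistic_regression(train_x, train_y)
print(predict(w, b, 0.2), predict(w, b, 3.8))
```

The model only ever sees the labeled examples, yet it can classify inputs it was not trained on, which is the "predict future outputs" part of supervised learning.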
Getting started
Going back a little to when I was just beginning, I didn’t know much about workflows, Popper, or anything open source related. Basically, I came in knowing almost nothing, so I would find myself nervous at times. Luckily for me, this community has been nothing but helpful, and they are always willing to help me with even the most trivial things, for which I am very grateful. You are welcome to join the Slack to be part of this as well.
So, I spent a considerable amount of time getting familiar with the syntax, functionality, and purpose of container-native workflows, as well as some machine learning concepts, in order to fully comprehend what the libraries were doing. This is what a Popper workflow file looks like:
steps:
- uses: docker://alpine:3.9
  args: ["ls", "-la"]

- uses: docker://alpine:3.11
  args: ["echo", "second step"]

options:
  env:
    FOO: BAR
  secrets:
  - TOP_SECRET
What is happening here? Popper takes each step, initializes a container for it from the image given in uses, and then runs that container with the given arguments. It essentially automates container-related tasks so that others don’t have to guess what was done and instead can see and re-run them easily. If you would like to know more about the workflow syntax you can visit here.
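One way to picture what happens for each step is to translate it into the docker run command you would otherwise type by hand. The sketch below is only a conceptual model and not how Popper is implemented (Popper also takes care of things like sharing the workspace with the container); the step_to_docker_command helper is hypothetical.

```python
import shlex

# The same workflow as above, written as a Python dictionary.
workflow = {
    "steps": [
        {"uses": "docker://alpine:3.9", "args": ["ls", "-la"]},
        {"uses": "docker://alpine:3.11", "args": ["echo", "second step"]},
    ],
    "options": {"env": {"FOO": "BAR"}},
}


def step_to_docker_command(step, env):
    """Build a roughly equivalent `docker run` command for one workflow step."""
    prefix = "docker://"
    uses = step["uses"]
    image = uses[len(prefix):] if uses.startswith(prefix) else uses
    command = ["docker", "run", "--rm"]
    for name, value in env.items():
        command += ["-e", f"{name}={value}"]  # pass environment variables
    return command + [image] + list(step.get("args", []))


for step in workflow["steps"]:
    cmd = step_to_docker_command(step, workflow["options"]["env"])
    print(shlex.join(cmd))
```

Seeing the steps spelled out like this shows why a workflow file is convenient: the image, arguments, and environment live in one place instead of in someone's shell history.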
Once I felt comfortable enough with Docker, Popper, workflows and machine learning stuff I jumped into the project. To accomplish this I chose a library called xLearn and focused on one of their demo examples, Higgs classification. According to the UCI website, where this data was obtained, this is a classification problem to distinguish between a signal process which produces Higgs bosons and a background process which does not.
The model used in this example is a generalized linear model (GLM). Since the charts in the xLearn repository compare its linear models against liblinear, that’s the library we compare against as well. Due to size issues, they are only able to use a subset of the data set in the demo, but with a Popper workflow, we can go further.
Description
Below I will describe each of the steps that the workflow performs:
- The workflow first builds the image that is in charge of preparing everything by installing the libraries and dependencies necessary for this to work smoothly.
- It then downloads and verifies the complete Higgs data set from the UCI web site, which was produced using Monte Carlo simulations and has 11 million entries.
- Then the workflow prepares the data by generating a version of it in SVM format so that liblinear can load it.
- The next step is to run the benchmark, which compares both libraries by running the following tasks five times for each of them: loading the data (using Pandas and an SVM format loader), training the linear model, and finally predicting.
- The final step creates the chart that shows the results, including error bars, with a Python script using Seaborn and Matplotlib.
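To make the data preparation and benchmark steps above more concrete, here is a small sketch using only the Python standard library. It is not the code from the workflow: the csv_line_to_svm and time_repeated helpers are hypothetical, and I am assuming the CSV has the label in the first column followed by the features.

```python
import statistics
import time


def csv_line_to_svm(line: str) -> str:
    """Convert 'label,f1,f2,...' into 'label 1:f1 2:f2 ...' (1-based indices)."""
    label, *features = line.strip().split(",")
    pairs = [f"{i}:{value.strip()}" for i, value in enumerate(features, start=1)]
    # Labels may be written as floats (e.g. 1.0), so normalize them to integers.
    return " ".join([str(int(float(label)))] + pairs)


def time_repeated(task, repeats: int = 5):
    """Run task() `repeats` times; return (mean, standard deviation) in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)


print(csv_line_to_svm("1.0,0.5,2.25"))

# Stand-in task; in a real benchmark this would load, train, or predict.
mean_s, std_s = time_repeated(lambda: sum(range(100_000)))
print(f"mean={mean_s:.6f}s std={std_s:.6f}s")
```

Repeating each task several times and keeping both the mean and the standard deviation is what makes it possible to draw error bars on the final chart.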
Usage
As I mentioned before, there are very few steps you need to follow to see this workflow do its magic. The first step is to clone the xLearn repository:
git clone https://github.com/aksnzhy/xlearn.git
Popper works with Docker so you’ll need to install it in case you haven’t already. If you need information about the installation process you can click here:
The next step is to install Popper, which can be done by running this simple command:
curl -sSfL https://raw.githubusercontent.com/getpopper/popper/master/install.sh | sh
Finally, we are now able to run the workflow:
cd xlearn/
popper run -f demo/classification/higgs/popper/wf.yml
There is also a way to run a single step of the workflow in case you don’t want to run the whole thing each time. You only have to add the name of the step at the end, as in the following example:
popper run -f demo/classification/higgs/popper/wf.yml prepare-data
And when we are having problems with a step, there is also an easy way to debug the workflow: opening an interactive shell instead of having to update the YAML file and invoke popper run again.
popper sh -f demo/classification/higgs/popper/wf.yml prepare-data
The example above opens a shell inside the container where other things can be done. More information on this matter can be found here.
And that’s it!
Conclusion
The goal of this project is to serve as an illustration of how workflows can be useful. I think it is quite important to make development tools such as Popper available to everyone interested in machine learning, so that people won’t be put off by all the steps required when they try to reproduce something by themselves. With a tool like this, the steps are minimal.
I really couldn’t have imagined how much a person can gain by participating in and contributing to projects like this. This experience has given me a better perspective on how things work in open source projects, as well as on being part of something that many people are involved in. The projects that I had worked on before were either individual or with my classmates, so this taught me several things, not only in the technical field but also about working with other people, asking questions, and learning how to communicate my doubts.
It is intimidating at first to surround yourself with people who know a lot, but once the fear is gone and you start learning from them, it is really something to value. After spending some time with the other interns and our mentor in the standup meetings, where everybody talks about their complications and tasks to be done, I realized how approachable they are. Like me, they are also learning how to do new things, and sometimes problems arise and they get stuck, which is totally normal.
So, thanks a lot to my mentor in this project, Ivo Jimenez, to the Center for Research in Open Source Software (CROSS) and also to the others in the crew, you guys rock!