Snap ML — Speed Up Model Training

Will Roberts
IBM Data Science in Practice
Mar 4, 2021
Photo by Marc-Olivier Jodoin on Unsplash

Training machine learning (ML) models can take a long time depending on your dataset and available hardware, and that keeps you from experimenting quickly. That can be a real problem for a data scientist on a deadline, and even without one it’s a pain to sit and wait for results. Snap ML is an exciting library that helps address that pain. It is designed as a drop-in replacement for scikit-learn, which makes it particularly easy to use. Snap ML accelerates the training and inference of some of the most popular ML models (State of Data Science and Machine Learning 2020 | Kaggle) and blends in seamlessly with scikit-learn operators for data pre-processing and feature engineering, using a familiar scikit-learn API.

Python is the programming language of choice for many data scientists because of its wide range of libraries and strong support from its vast community of developers. It’s a particularly powerful language for open-source data stacks, but its design favors ease of use over raw execution speed.

Python’s popularity and ease of use inspired IBM Research to help data scientists working in a Python stack by creating Snap ML, which they designed for speed. Snap ML is a free-to-use software library that you can install right now to shorten training and inference time for your ML models compared with the typical performance of scikit-learn, the generally well-loved standard among ML APIs.

pip install snapml

The notebook below is an example of how to use Snap ML to shorten your training time on a local CPU. Later posts will show other examples of using Snap ML to improve performance at inference time, and will compare the performance of Snap ML on CPU vs GPU.

Random Forest Credit Card Fraud Class
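As a rough stand-in for that kind of workflow, here is a minimal sketch that times random forest training with scikit-learn and, if it is installed, with Snap ML. It is not the notebook's code: the synthetic imbalanced dataset, the hyperparameter values, and the helper name fit_and_time are all assumptions chosen for illustration, and the timings you see will depend on your data and hardware.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier as SklearnRF
from sklearn.model_selection import train_test_split

try:
    # Snap ML exposes an estimator with the same name as scikit-learn's.
    from snapml import RandomForestClassifier as SnapRF
except ImportError:
    SnapRF = None  # Snap ML not installed; only scikit-learn is timed.

# Synthetic, imbalanced data standing in for a fraud dataset (an assumption,
# not the data used in the notebook).
X, y = make_classification(
    n_samples=5000, n_features=20, weights=[0.95, 0.05], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)


def fit_and_time(model):
    """Fit the model and return (training seconds, test accuracy)."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    accuracy = float(np.mean(model.predict(X_test) == y_test))
    return elapsed, accuracy


# Illustrative hyperparameters, passed unchanged to both libraries.
params = dict(n_estimators=50, max_depth=8, n_jobs=4, random_state=42)

results = {"scikit-learn": fit_and_time(SklearnRF(**params))}
if SnapRF is not None:
    results["snapml"] = fit_and_time(SnapRF(**params))

for name, (seconds, accuracy) in results.items():
    print(f"{name}: trained in {seconds:.3f}s, test accuracy {accuracy:.3f}")
```

The only difference between the two runs is the import, which is what makes a side-by-side timing like this cheap to try on your own data.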

As you can see, the library intentionally mirrors the design of scikit-learn and fits into the same workflow. That should make it easy to adopt for data scientists who need to cut their training time (and inference time, in later posts) and shorten their model development lifecycle.
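To illustrate the shared workflow, here is a minimal sketch that places a classifier after a scikit-learn preprocessing step in a standard Pipeline. It assumes that Snap ML's LogisticRegression can be used as the final pipeline step with its default settings, and it falls back to scikit-learn's estimator when Snap ML is not installed. The synthetic data and variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

try:
    # Assumed to behave as a scikit-learn style estimator (fit / predict).
    from snapml import LogisticRegression
    backend = "snapml"
except ImportError:
    from sklearn.linear_model import LogisticRegression
    backend = "scikit-learn"

# Easy synthetic binary problem, used only to exercise the pipeline.
X, y = make_classification(
    n_samples=2000, n_features=10, n_clusters_per_class=1,
    class_sep=2.0, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# scikit-learn preprocessing followed by the chosen classifier.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipeline.fit(X_train, y_train)

pipeline_accuracy = float(np.mean(pipeline.predict(X_test) == y_test))
print(f"{backend} pipeline test accuracy: {pipeline_accuracy:.3f}")
```

Because the rest of the pipeline stays the same, switching backends comes down to changing which module the estimator is imported from.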

The library is distributed for free, but it is currently not open source because IBM uses it in our products. It’s a great way to obtain a large increase in productivity and shorter-running workloads in products like IBM Watson Studio, IBM Watson Machine Learning, IBM Cloud Pak for Data, and our IBM Watson Machine Learning Accelerator. As an example, when you use AutoAI with Watson Studio to automatically generate ML pipelines and Jupyter Notebooks, part of the reason they execute so quickly is that Snap ML is embedded in our products. In an enterprise scenario for building AI solutions, a group of data scientists could schedule and accelerate their ML workloads with Snap ML on a GPU grid with Watson Machine Learning Accelerator.

Snap ML is ready for you to use right now at no cost. You can install it from PyPI and see the difference in the time it takes to train your first model. If you’re curious about how to use it most efficiently, reach out on the IBM Data Science Community, or sign up for a Watson Studio with AutoAI trial today.

Thanks to Haris Pozidis and Kelvin Lui for their example notebooks and edits. Thanks to Andreea Anghel for her contributions. Credit to Jana Thompson for edits.