How to Speed up Model Training with Snapml

Train a Machine Learning Model in Less Time

Published in

Geek Culture

6 min read · Apr 21, 2022

Machine learning has a huge impact on solving business problems in a variety of industries, including health, finance, and transportation. You can collect the large amounts of data created every day and train a machine learning model for specific tasks like product recommendations and sentiment analysis.

To build an effective machine learning model, it is recommended that you train on a large dataset and run several experiments. This brings its own difficulties, chiefly the long time it can take to train the model to the desired level of performance.

In this article, you will learn how to speed up the training of a machine learning model with the snapml Python package.

Let’s get started! 🚀

What is Snapml?

Snapml is a Python package developed by IBM to provide high-speed training of machine learning models in both CPU and GPU computing environments. Snapml can help you do the following tasks in your machine learning project:

Train and re-train on new data online.
Perform large-scale hyperparameter tuning.
Make accurate decisions and predictions.
Train models on all available data with fewer resources.
Handle big data efficiently.

It also supports different types of machine learning models like:

Generalized Linear Models (e.g., Linear Regression).
Tree-based models (e.g., Decision Trees and Random Forests).
Gradient Boosting models (e.g., Boosting Machine).

When training models in cloud environments, Snapml can help you reduce costs by shortening the time the training process takes.

screenshot from https://www.zurich.ibm.com/snapml/

How to Install Snapml

To install snapml, run the following command in your terminal.

pip install snapml

Note: Snapml currently supports Python 3.7, 3.8 and 3.9. For macOS users, it currently supports the Intel (x86_64) architecture.
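If you want to check your environment before installing, a minimal sketch like the one below can help. It only encodes the limits listed in the note above, which may be outdated for newer snapml releases, and the helper name is_supported is made up for this illustration.

```python
import platform
import sys

# Supported versions copied from the note above (assumption: they may have
# changed in newer snapml releases).
SUPPORTED_PYTHONS = {(3, 7), (3, 8), (3, 9)}


def is_supported(version, system, machine):
    """Return True if the interpreter and platform match the note above."""
    if tuple(version[:2]) not in SUPPORTED_PYTHONS:
        return False
    # On macOS, only Intel (x86_64) machines are listed as supported.
    if system == "Darwin" and machine != "x86_64":
        return False
    return True


ok = is_supported(sys.version_info, platform.system(), platform.machine())
print("snapml supported here (per the note):", ok)
```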

Train ML Model without Snapml

In this part, you will first train the machine learning model on a large dataset with a machine learning algorithm from the scikit-learn library and evaluate the total time used to train the model on the dataset.

Import Packages

The first step is to import the Python packages we are going to use to load the dataset, prepare it, and train a machine learning model.

# import libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import time
import warnings

warnings.filterwarnings("ignore")
np.random.seed(123)

Load Dataset

We will use a Bank Loan Status dataset to train a model that can classify the future loan status.

Download the dataset from kaggle.com.

To load the dataset, use the read_csv function from the pandas library.

# load data
Bank_Dataset = pd.read_csv("../data/credit_train.csv")

Check the first few rows of the dataset.

# show the first five rows
Bank_Dataset.head()

The dataset has a lot of features showing details about the loan acquired by each customer.

Let’s check the shape of the dataset in order to identify its size.

# show shape
Bank_Dataset.shape

(100514, 19)

The Bank Loan Status dataset has more than 100,000 rows of data and 19 columns. This dataset is large enough to evaluate the time difference when training a model with and without Snapml.

Prepare the Dataset

You need to prepare the dataset by removing features that are not required, handling missing values and transforming all features into numerical values.

(a) Remove Features

In this step, you will remove both Loan ID and the Customer ID.

#remove ID columns
Bank_Dataset.drop(["Loan ID", "Customer ID"], axis=1, inplace=True)

We are now left with 16 features and the target column (“Loan Status”).

(b) Handling Missing Values

Usually, a dataset can have missing values that you need to handle before training the machine learning model. Here is the code to check the number of missing values in each column in your dataset.

#check missing values
Bank_Dataset.isnull().sum()

Features with the total number of missing values

Our dataset has missing values in all features including the target column (“Loan Status”).

The code block below first fills missing values in the categorical columns with the most frequent value of each column. It then fills missing values in the numerical columns with the mean of each column.

# fill missing values for categorical features
Bank_Dataset["Loan Status"].fillna("Fully Paid", inplace=True)
Bank_Dataset["Term"].fillna("Short Term", inplace=True)
Bank_Dataset["Years in current job"].fillna("10+ years", inplace=True)
Bank_Dataset["Home Ownership"].fillna("Home Mortgage", inplace=True)
Bank_Dataset["Purpose"].fillna("Debt Consolidation", inplace=True)

# fill missing values for numerical (floating-point) features
numerical_columns = list(
    Bank_Dataset.select_dtypes(include=['floating']).columns)
for column in numerical_columns:
    Bank_Dataset[column].fillna(Bank_Dataset[column].mean(), inplace=True)
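The fill values above are hard-coded from a manual look at the data. As a rough sketch, you could also compute them: the helper below fills text columns with their most frequent value and numeric columns with their mean. The name fill_missing and the toy rows are made up for illustration and are not part of the loan dataset.

```python
import pandas as pd


def fill_missing(df):
    """Return a copy: mode-filled text columns, mean-filled numeric columns."""
    filled = df.copy()
    for column in filled.columns:
        if pd.api.types.is_numeric_dtype(filled[column]):
            filled[column] = filled[column].fillna(filled[column].mean())
        else:
            # mode() can return several values; take the first one.
            # Note: this fails on a column that is entirely missing.
            most_frequent = filled[column].mode().iloc[0]
            filled[column] = filled[column].fillna(most_frequent)
    return filled


# Toy data, made up for illustration only
toy = pd.DataFrame({
    "Term": ["Short Term", None, "Short Term", "Long Term"],
    "Annual Income": [1000.0, 3000.0, None, 2000.0],
})
filled_toy = fill_missing(toy)
print(filled_toy)
```

Assigning the result of fillna back to the column, instead of using inplace=True on a selected column, also avoids the chained-assignment warnings that recent pandas versions raise.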

(c) Transform the Dataset

After handling the missing values in the dataset, you need to transform the dataset into numerical values.

The first step in the transformation is to use the LabelEncoder class from the scikit-learn library to encode the two binary categorical columns (Term and Loan Status).

# preprocess binary categorical columns
le = LabelEncoder()
binary_columns = ["Loan Status", "Term"]
for column in binary_columns:
    Bank_Dataset[column] = le.fit_transform(Bank_Dataset[column])
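Reusing a single encoder across columns works for training, but after the loop the encoder only remembers the last column it was fitted on. If you later need to map predictions back to the original labels, one option is to keep one encoder per column, as in this small sketch. The toy rows and the encoders dictionary are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "Loan Status": ["Fully Paid", "Charged Off", "Fully Paid"],
    "Term": ["Short Term", "Long Term", "Short Term"],
})

# Keep one fitted encoder per column so the mapping is not lost
encoders = {}
for column in ["Loan Status", "Term"]:
    encoders[column] = LabelEncoder()
    toy[column] = encoders[column].fit_transform(toy[column])

# classes_ is sorted, so code 0 is the alphabetically first label
print(encoders["Loan Status"].classes_)

# Map the integer codes back to the original labels
original = encoders["Loan Status"].inverse_transform(toy["Loan Status"])
print(list(original))
```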

Then transform the multi-class categorical columns by using the get_dummies function from the pandas library. This function will one-hot encode the following columns in the dataset.

Home Ownership
Purpose
Years in current job

# preprocess multi-class categorical columns
Bank_Dataset = pd.get_dummies(
    Bank_Dataset,
    columns=["Home Ownership", "Purpose", "Years in current job"])
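To see what get_dummies does, here is a small sketch on made-up data: each category of the chosen column becomes its own indicator column, while the other columns are kept as they are. The toy values below are for illustration only.

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "Home Ownership": ["Rent", "Home Mortgage", "Rent"],
    "Monthly Debt": [500.0, 700.0, 300.0],
})

encoded = pd.get_dummies(toy, columns=["Home Ownership"])

# New columns are named <original column>_<category>
print(list(encoded.columns))
print(encoded)
```

This is also why the number of columns grows after this step: a column with many categories, such as Purpose, adds one new column per category.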

(d) Split Features and Targets

Split the dataset into its feature columns and target column.

# split data into target and features
target = Bank_Dataset["Loan Status"].values
features = Bank_Dataset.drop("Loan Status", axis=1)

(e) Scaling the Features

The transformed features have different ranges of values. You need to normalize all features to the range 0 to 1 by using the MinMaxScaler class from scikit-learn.

# scaling the dataset
scaler = MinMaxScaler()
features = scaler.fit_transform(features)
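Min-max scaling to the range 0 to 1 computes (x - min) / (max - min) for each column. The sketch below writes that formula out with NumPy on made-up numbers, only to show the arithmetic; it is not a replacement for MinMaxScaler, and the function name min_max_scale is hypothetical.

```python
import numpy as np


def min_max_scale(X):
    """Scale each column of X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Avoid division by zero: constant columns are mapped to 0
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range


# Toy data, made up for illustration only
toy = np.array([[1.0, 200.0],
                [2.0, 400.0],
                [3.0, 300.0]])
scaled = min_max_scale(toy)
print(scaled)
```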

Train a Machine Learning Model

To train the model without snapml, you need to instantiate RandomForestClassifier from the scikit-learn library by using the following code.

#create a classifier
sklearn_classifier = RandomForestClassifier()

Finally, train the RandomForestClassifier on the transformed dataset with 3-fold cross-validation and measure how long the training takes.

# training classifier
start_time = time.time()
scores = cross_val_score(sklearn_classifier, features, target, cv=3)
print("Training Time: {}".format(time.time() - start_time))
print("Scores: {}".format(scores))

Training Time: 55.80186605453491
Scores: [0.82122071 0.8188927 0.82106614]

In summary, the model's accuracy is around 82% and the training time is 55.80 seconds (almost 1 minute).
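Since the same measurement is repeated for every classifier you want to compare, the timing pattern can be wrapped in a small helper. The sketch below is one way to do it: the name timed_cross_val is made up, time.perf_counter is swapped in for time.time because it is intended for measuring intervals, and a small synthetic dataset stands in for the loan data so that the example runs quickly.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def timed_cross_val(estimator, X, y, cv=3):
    """Return (elapsed_seconds, scores) for one cross-validation run."""
    start = time.perf_counter()
    scores = cross_val_score(estimator, X, y, cv=cv)
    return time.perf_counter() - start, scores


# Small synthetic dataset so the sketch runs quickly; not the loan data
X, y = make_classification(n_samples=300, n_features=10, random_state=123)

elapsed, scores = timed_cross_val(
    RandomForestClassifier(n_estimators=20, random_state=123), X, y)
print("Training Time: {:.2f}s".format(elapsed))
print("Mean accuracy: {:.3f}".format(scores.mean()))
```

Any estimator that follows the scikit-learn interface can be passed to the helper in the same way.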

Let’s see how we can speed up the model training by using Snapml.

Train ML Model with Snapml

The first step is to import the supervised algorithm called RandomForestClassifier from the snapml package.

# import RandomForestClassifier from snapml
from snapml import RandomForestClassifier

Then instantiate the classifier.

snapml_classifier = RandomForestClassifier()

Finally, train the classifier from snapml and measure the training time in the same way.

start_time = time.time()
scores = cross_val_score(snapml_classifier, features, target, cv=3)
print("Training Time with snapml: {}".format(time.time() - start_time))
print("Scores: {}".format(scores))

Training Time with snapml: 14.459826469421387
Scores: [0.81065513 0.80946127 0.8109181 ]

As you can see, the training time when using snapml is 14.46 seconds, which is almost 4 times faster than training the same model with the scikit-learn library, while accuracy stays at around 81%.

Snapml has a lot of potential and can save you time and money when training on a large dataset in a cloud environment.
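The comparison can be worked out directly from the two runs reported above. This short sketch only does arithmetic on the printed times and scores; the numbers come from a single run each, so expect them to vary on your machine.

```python
from statistics import mean

# Values copied from the two outputs shown in this article
sklearn_seconds = 55.80186605453491
snapml_seconds = 14.459826469421387
sklearn_scores = [0.82122071, 0.8188927, 0.82106614]
snapml_scores = [0.81065513, 0.80946127, 0.8109181]

speedup = sklearn_seconds / snapml_seconds
time_saved = 1 - snapml_seconds / sklearn_seconds
accuracy_gap = mean(sklearn_scores) - mean(snapml_scores)

print("Speedup: {:.2f}x".format(speedup))
print("Time saved: {:.0%}".format(time_saved))
print("Mean accuracy gap: {:.3f}".format(accuracy_gap))
```

In this run the speedup comes with a drop of roughly one percentage point in mean accuracy, which is worth keeping in mind when you compare the two libraries on your own data.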

Snapml supports other classification models that you can try on your own dataset, such as:

Logistic Regression
Decision Trees
Support Vector Machine
Boosting Machine
Batched Tree Ensembles

Conclusion

In this article, you have learned some of the challenges of training a machine learning model on a large dataset and how you can use snapml to speed up the training process.

As I previously stated, snapml will save you not only time but also money if you are training your model in the cloud environment. The library will give you an opportunity to execute various machine learning experiments without having to worry about running out of time.

Please share this post with others if you learned something new or enjoyed reading it. Until then, stay tuned for the next post!

You can also find me on Twitter @Davis_McDavid.
