Scale Scikit-learn models at Google AI Platform
A guide on how to train and deploy scikit-learn models at scale on Google Cloud
As a data scientist, your first interactions with data probably happen locally, on a sample, in a Jupyter notebook or on Google Colab. You might do some exploratory data analysis with Matplotlib, Bokeh, Folium, NumPy, or Pandas, and then comes the time for ML modeling. However, sometimes you don't have enough data for deep learning models, or they simply aren't the right tool for the job.
Given that, you'll probably use Scikit-learn (a.k.a. sklearn), a Python library that offers a plethora of data science and machine learning algorithms behind a very user-friendly interface. Sklearn covers algorithms for:
- Data preprocessing (e.g. standardization, imputation, categorization and binning),
- Unsupervised learning models (e.g. KMeans, DBSCAN, Hierarchical clustering, and Gaussian Mixture Model),
- Supervised learning models (e.g. Ridge, Lasso, Elastic Net, Logistic Regression, K-Nearest Neighbors, Support Vector Machines, Multi-Layer Perceptron, Decision Trees and Ensemble)
- Dimensionality reduction (e.g. Principal Component Analysis, Kernel PCA, Linear Discriminant Analysis and TSNE)
- Model Selection (e.g. Cross-validation, Grid-search)
- Model evaluation metrics (e.g. Accuracy, F1-Score, Confusion Matrix, ROC Curve, Mean Squared Error, Silhouette Score and Cosine Similarity)
Any sklearn algorithm accepts either a Pandas DataFrame or a NumPy array as input and performs its transformations through the `fit` and `transform` methods. If your data doesn't fit into memory and/or your pipeline is computationally expensive, you can boost your code by moving it to the GPU with RAPIDS or parallelizing it, Spark-style, with Dask.
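The fit/transform contract is the same for every sklearn transformer. A minimal, self-contained illustration with a built-in scaler (the data here is made up):

```python
# Any sklearn transformer exposes fit/transform and accepts either
# a Pandas DataFrame or a NumPy array.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [22, 35, 58], "income": [1800.0, 3200.0, 5400.0]})

scaler = StandardScaler()
# fit() learns each column's mean/std; transform() applies them.
scaled = scaler.fit_transform(df)

print(scaled.shape)  # (3, 2)
```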
Nevertheless, you are still limited by the computational power of your local machine. Moreover, once you have a trained model, how do you serve it? You could expose it through a REST API with Flask, for instance. But where will it be deployed? On an on-premises infrastructure? Does it have a load balancer? How easily can you scale out new workers if your service starts receiving lots of requests?
Google AI Platform
All these concerns can be set aside with little adjustments to your code by bringing it to Google AI Platform. In a nutshell, it is a managed service from Google focused on machine learning training and deployment at scale. You can run ML code built on top of TensorFlow, Scikit-learn, and XGBoost on CPUs, GPUs, and TPUs.
As an example, let's use the Black Friday challenge provided by Analytics Vidhya. The challenge is to forecast the amount customers will spend at a retail company, given their purchase history and some demographic information. The predictions might then be used to create personalized offers.
The dataset contains the denormalized purchase history, where each row contains the following columns:
For evaluation, the challenge uses Root Mean Squared Error (RMSE) as the metric to be optimized.
Similar to, but simpler than, Kubeflow Pipelines, sklearn provides a way to design an end-to-end pipeline for a machine learning model. As the documentation puts it, a Pipeline will "sequentially apply a list of transforms and a final estimator."
What you need to do is code each transformation as a class implementing the `fit` and `transform` methods. The rule of thumb is to code a single transformation per class, which gives you code reusability.
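A minimal sketch of that pattern (the class and its column-dropping behavior are illustrative, not from the challenge code). Inheriting from `BaseEstimator` and `TransformerMixin` gives you `get_params`/`set_params` and `fit_transform` for free, so the class drops straight into a `Pipeline`:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnDropper(BaseEstimator, TransformerMixin):
    """One transformation per class: drop a fixed list of columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn for this transformation.
        return self

    def transform(self, X):
        return X.drop(columns=self.columns)

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
out = ColumnDropper(columns=["b"]).fit_transform(df)
print(list(out.columns))  # ['a']
```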
In this challenge, classical ML dataset issues can be found, such as missing values, customer ID on test set but not on the train set, little historical data for a single customer ID, categorical variables coded as a string and skewed distributions. To overcome them, let's code a transformation to fix each one of these issues.
The first transformer, `BlackFridayPreprocess`, does no processing in its `fit` method, since it only transforms data. During training, data is fed in as a Pandas DataFrame, but during serving the request encodes the data as Unicode strings inside a NumPy array. So it is decoded, transformed into a dictionary, and then into a DataFrame. Afterward, all null values are imputed.
The second transformer, `BlackFridayIdTransfomer`, is responsible for mapping every ID whose frequency falls below a given threshold (supplied as a dictionary) to the string `<UNKNOWN>`. This is done to prevent IDs with little historical data, both customer IDs and product IDs, from being fed into the model and making it perform poorly. This was a design decision of ours, and you are free to do it differently.
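The thresholding idea could be sketched like this (class and column names are assumptions, not the article's actual implementation):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class IdThresholder(BaseEstimator, TransformerMixin):
    """Replace IDs rarer than a per-column threshold with '<UNKNOWN>'."""

    def __init__(self, thresholds):
        # e.g. {"User_ID": 5, "Product_ID": 10} -- hypothetical column names
        self.thresholds = thresholds

    def fit(self, X, y=None):
        # Learn, per column, which IDs are frequent enough to keep.
        self.keep_ = {}
        for col, thr in self.thresholds.items():
            counts = X[col].value_counts()
            self.keep_[col] = set(counts[counts >= thr].index)
        return self

    def transform(self, X):
        X = X.copy()
        for col, keep in self.keep_.items():
            X[col] = X[col].where(X[col].isin(keep), "<UNKNOWN>")
        return X

df = pd.DataFrame({"User_ID": ["u1", "u1", "u2"]})
out = IdThresholder({"User_ID": 2}).fit_transform(df)
print(list(out["User_ID"]))  # ['u1', 'u1', '<UNKNOWN>']
```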
The third transformer, `BlackFridayLabelEncoder`, is responsible for converting all categorical features into a numerical representation using `LabelEncoder`. For example:
Car -> 0
Bicycle -> 1
Airplane -> 2
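In practice, note that `LabelEncoder` assigns codes by the sorted order of the classes, so the concrete integers can differ from the illustrative mapping above:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["Car", "Bicycle", "Airplane"])

# Codes follow alphabetical order of the classes.
mapping = {cls: int(code) for cls, code in zip(le.classes_, le.transform(le.classes_))}
print(mapping)  # {'Airplane': 0, 'Bicycle': 1, 'Car': 2}
```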
A Random Forest was chosen as the estimator, and thus no standardization was performed on the numerical features, since tree-based models are robust to feature scaling. However, if you want to run a linear-style model, you should build an additional transformer to scale the numerical features. Sklearn already provides a set of scalers, such as StandardScaler (z-norm), MinMaxScaler, MaxAbsScaler, RobustScaler, and others.
Finally, the pipeline can be defined, from preprocessing to the estimator.
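The custom transformers described above would occupy the transform slots of the `Pipeline`; since their code isn't reproduced here, this runnable sketch uses a `SimpleImputer` as a stand-in to show the pipeline's shape, with a `RandomForestRegressor` as the final estimator (matching the Random Forest choice, and a regressor because the target is a continuous purchase amount):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor

# Transforms run first, in order; the final step must be an estimator.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value=0)),
    ("estimator", RandomForestRegressor(n_estimators=10, random_state=0)),
])

# Tiny synthetic data, just to show the fit/predict flow.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)

pipeline.fit(X, y)       # runs every transform, then fits the forest
preds = pipeline.predict(X)
```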
To train the model at scale on AI Platform, you must run the command below. You may ask: "But where is the scale part of the thing?" The answer is simple: the `scale-tier` parameter. The example below uses the `BASIC` tier, which maps to an `n1-standard-4` machine. There are other predefined machine types, such as `BASIC_TPU`. For the full list of available tiers, refer to the official documentation. You can also define a custom infrastructure, with an arbitrary amount of CPUs and RAM per worker node.
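A hedged sketch of such a submission (job name, bucket, package path, and runtime versions below are placeholders; adapt them to your project):

```shell
# Submit a training job; --scale-tier is where the "scale" comes in.
gcloud ai-platform jobs submit training black_friday_training_1 \
  --job-dir gs://YOUR_BUCKET/black_friday \
  --package-path ./trainer \
  --module-name trainer.task \
  --region us-central1 \
  --runtime-version 1.15 \
  --python-version 3.7 \
  --scale-tier BASIC
```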
With the command above, the resources defined by the `scale-tier` parameter are provisioned, the code runs, and finally the resources are deallocated.
For more details about gcloud CLI, read the official documentation.
Deploy & Predictions
First, the solution must be built and uploaded to Google Cloud Storage.
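For sklearn, AI Platform expects the exported artifact to be named `model.joblib` (or `model.pkl`) at the root of a Cloud Storage path. A minimal sketch with a stand-in model (the bucket name is a placeholder):

```python
# Export the trained estimator/pipeline in the format AI Platform expects.
import joblib
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0])
joblib.dump(model, "model.joblib")

# Then upload it to Cloud Storage, e.g.:
#   gsutil cp model.joblib gs://YOUR_BUCKET/black_friday/model/
```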
Afterward, a model must be created on AI Platform, though it doesn't refer to the trained artifact yet: an AI Platform model can hold multiple versions, and once it is created, a version must be linked to it.
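The model and version resources can be created from the CLI along these lines (names, region, and versions are placeholders):

```shell
# Create the model resource (a container for versions).
gcloud ai-platform models create black_friday --regions us-central1

# Link a version to it, pointing at the Cloud Storage artifact.
gcloud ai-platform versions create v1 \
  --model black_friday \
  --origin gs://YOUR_BUCKET/black_friday/model/ \
  --framework scikit-learn \
  --runtime-version 1.15 \
  --python-version 3.7
```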
Finally, the model is exposed through a REST API, and you can consume it through the Google API Discovery Service.
When you run the script below, the response would look like this:
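A sketch of such a script using the Google API Python client (project and model names are placeholders; running it requires GCP credentials and the deployed version):

```python
# Call the deployed model via the Google API Discovery Service.
import googleapiclient.discovery

service = googleapiclient.discovery.build("ml", "v1")
name = "projects/YOUR_PROJECT/models/black_friday/versions/v1"

instances = [...]  # input rows, formatted as the deployed pipeline expects

response = service.projects().predict(
    name=name,
    body={"instances": instances},
).execute()

print(response["predictions"])
```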
INFO:__main__:[16305.043599803419, 11757.55457118036, 6301.0986163614, 1832.9853921568624, 2503.677290016506]
I hope that by the end of this post you have learned how easy it is to train and deploy a Scikit-learn pipeline model with Google AI Platform. This is very useful when you are a data scientist without strong data engineering skills and need to scale your first sketches with few adjustments. Also, remember that you get the advantage of not worrying about infrastructure management.