Deploy a Dask Cluster on Kubernetes using Minikube on WSL — Part 2

Exploring Dask Machine Learning on Kubernetes

Nirajkanth Ravichandran
ADL Blog
5 min read · Jun 28, 2024

In recent years, machine learning has seen tremendous growth, enabling data scientists and researchers to tackle complex problems more efficiently. However, as datasets and models become larger, the need for scalable and distributed computing arises. Dask, a flexible parallel computing library, and Kubernetes, a container orchestration platform, offer a potent combination for tackling such challenges. In this article, we’ll explore how to harness the capabilities of Dask and Kubernetes to train a Support Vector Regressor (SVR) model on a popular house pricing dataset.

The code I employed for the experiment is available in this notebook.

You can find the step-by-step guide for setting up a local, single-node Kubernetes environment in my preceding article. Let’s assume that you’ve already got your Kubernetes cluster up and running, for the rest of this article.

To kickstart our journey, let's create a Dask cluster on Kubernetes using KubeCluster from the dask-kubernetes operator, as shown in the following code snippet.

from dask_kubernetes.operator import KubeCluster

extra_pip_packages = """scikit-learn==1.3.0
dask-ml==2023.3.24"""

# Spin up a Dask cluster on Kubernetes with two workers
cluster = KubeCluster(
    name="daskmlcluster",
    image="ghcr.io/dask/dask:2023.7.0-py3.10",
    n_workers=2,
    env={"EXTRA_PIP_PACKAGES": extra_pip_packages},
    resources={"requests": {"memory": "0.5Gi"}, "limits": {"memory": "1.5Gi"}},
)
cluster

from dask.distributed import Client

# Connect a Dask client to the cluster
client = Client(cluster)
client
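Once the scheduler and workers are up, the same cluster object can be resized or shut down from your session. Here is a minimal sketch, assuming the cluster and client objects created above:

# Scale to a fixed number of workers, or let the cluster adapt to the workload
cluster.scale(3)
cluster.adapt(minimum=1, maximum=4)

# Release the Kubernetes resources when you are done
# client.close()
# cluster.close()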

Note: The scikit-learn library isn't pre-installed in the Dask Docker image, so it must be installed as an extra package via pip or conda, as described in the Dask documentation. Passing EXTRA_PIP_PACKAGES to the workers, as in the snippet above, installs these packages when each container starts.

Dataset and Preprocessing

For this demonstration, we’ll use a well-known house pricing dataset. Our goal is to predict house prices based on various features. To prepare the data, we’ll employ the scikit-learn library. Let’s go through the data preparation step by step.

  • Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.svm import SVR
import seaborn as sns
  • Read the dataset
df = pd.read_csv('/mnt/d/codes/Dask/Housing.csv')
df.head()
  • Explore each column (a quick exploration sketch follows this list)
  • Extract all categorical columns
# Extract the categorical (object-dtype) columns
cat_cols = [col for col in df.columns if df[col].dtype == 'object']
cat_cols
  • Convert categorical columns to numerical columns using LabelEncoder
df_encoded = df.copy()

# Encode each categorical column as integer labels
le = LabelEncoder()
for col in cat_cols:
    df_encoded[col] = le.fit_transform(df_encoded[col])
df_encoded.head()
  • Split the dataset
x = df_encoded.drop(['price'], axis=1)
y = df_encoded['price']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3)
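As noted in the exploration step above, it is worth taking a quick look at the columns before encoding. A minimal sketch, using only the df and cat_cols objects defined above:

# Schema, missing values, and summary statistics
df.info()
df.describe()

# Value distribution of each categorical column
for col in cat_cols:
    print(df[col].value_counts())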

Hyperparameter Tuning

Hyperparameter tuning plays a vital role in optimizing the performance of machine learning models. We’ll use scikit-learn’s grid search to explore different combinations of hyperparameters for our SVR model. This step ensures that our model performs at its best.

Let’s define the parameter grid as follows.

# Define the parameter grid for hyperparameter tuning
param_grid = {
    'C': [0.1, 1, 3, 10],
    'epsilon': [0.001, 0.01, 0.1, 0.2],
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid']
}
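This grid contains 4 × 4 × 4 = 64 parameter combinations, so with 5-fold cross-validation GridSearchCV will fit the SVR 320 times, an embarrassingly parallel workload that is well suited to being spread across the Dask workers.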

Joblib with KubeCluster

Source: https://ml.dask.org/joblib.html

Here’s where the magic of Dask truly shines. By utilizing the power of Joblib in conjunction with Kubernetes, we parallelize the training process even further. Joblib, a popular library for lightweight pipelining in Python, integrates seamlessly with Dask to distribute tasks across the cluster. This union of technologies maximizes efficiency and accelerates the model training procedure.

import joblib

svr = SVR()
grid_search = GridSearchCV(svr, param_grid, cv=5, scoring='neg_mean_squared_error')

# Fit the grid search to the data, using the Dask KubeCluster as the joblib backend
with joblib.parallel_backend('dask'):
    grid_search.fit(x_train, y_train)

While the above snippet is running, you can open the Dask dashboard, as shown in the following figure, and observe how tasks are distributed across the workers and threads.
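If you are not sure where the dashboard is served, the client object can tell you. A minimal sketch:

# Print the URL of the Dask dashboard exposed by the scheduler
print(client.dashboard_link)

On Minikube you may need to port-forward the scheduler service (or use minikube service) to reach this URL from your browser.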

Model Prediction

Let’s perform model prediction on the test data and evaluate the model using mean squared error and r2_score.

Code

# Get the best parameters and best estimator
best_params = grid_search.best_params_
best_svr = grid_search.best_estimator_

# Make predictions on the test set
y_pred = best_svr.predict(x_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean squared error: {mse}")
print(f"R-squared score: {r2}")

Output

As you can see, there is room to improve the r2_score further through normalization and a few other optimization techniques. Nevertheless, the primary focus of this article is running machine learning on a Dask Kubernetes cluster, so we will stop here rather than diving into evaluation-metric improvements.
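That said, here is a minimal sketch of what such a normalization step could look like, using a scikit-learn Pipeline with a StandardScaler in front of the SVR. This is only an illustrative assumption, not something evaluated in this article:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale the features before fitting the SVR (a sketch; the hyperparameters
# would ideally be re-tuned with the same grid search for a fair comparison)
scaled_svr = make_pipeline(StandardScaler(), SVR(**best_params))
scaled_svr.fit(x_train, y_train)
print(r2_score(y_test, scaled_svr.predict(x_test)))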

Conclusion

In this article, we’ve explored how to leverage the power of Dask and Kubernetes for distributed machine learning. By combining Dask’s parallelism and dynamic task scheduling with Kubernetes’ container orchestration, we can efficiently train a model such as an SVR on a large dataset. This approach not only reduces training time but also lets us work with datasets that would be impractical to handle on a single machine.

Visit Axiata Digital Labs to find out more about our products and services.

References

1. https://ml.dask.org/joblib.html

2. https://kubernetes.dask.org/en/latest/

3. https://scikit-learn.org/stable/modules/grid_search.html

4. https://ml.dask.org/
