When Should You Use PySpark Over Scikit-Learn?
Study of the scalability of a decision tree in the context of Big Data: PySpark vs Scikit-Learn.
PySpark is known for using the MapReduce paradigm, distributing the training across the machines of a cluster, whereas Scikit-Learn trains locally on a single machine.
Have you ever wondered at what point you should switch from Scikit-Learn to PySpark? If so, perfect! Today we're going to find out.
If the specs of the computer on which you are training your model do not change, the only factor that impacts training time is the size of the dataset.
Therefore, we will compare PySpark’s implementation of a decision tree and Scikit-learn’s implementation by varying the size of the dataset.
To do this, we will start from an initial dataset, duplicate it several times and compare the training time for each of the libraries depending on the dataset size.
This experimental process is inspired by a paper from Vasile Purdilă and Ștefan-Gheorghe Pentiuc on a scalable parallel decision tree algorithm built with MapReduce (see References at the end).
The data
For this experiment, I decided to use the Adult Dataset, also known as the “Census Income” dataset, provided by the UCI Machine Learning Repository website. If you're looking for a source of datasets other than Kaggle, this is it!
This dataset’s objective is to predict a person’s income class based on several characteristics such as their age, country of origin, occupation, number of hours worked per week, etc.
Here is a sample of the data:
As you can see, the dataset has not been cleaned yet: a number of missing values remain.
Having worked with this dataset before, I knew that it contained missing data and, in particular, that missing values were represented by the string ‘ ?’, with a leading white space.
Here is how I avoided having to modify the dataset after loading it:
df = spark.read.csv(r'Z:\Datasets\adult.data',
                    header=False,                  # the file has no header row
                    ignoreLeadingWhiteSpace=True,  # strips the blank before ' ?'
                    nullValue='?',                 # '?' is then read as null
                    schema=schema)                 # explicit StructType, so inferSchema is not needed
With PySpark, you can indicate whether leading white spaces should be ignored and which string denotes null values in your dataset. This is worth keeping in mind when working with huge datasets, as it saves us a second cleaning pass over the entire data.
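For a quick local sanity check, the same cleaning can be sketched in pandas, where `skipinitialspace` plays the role of `ignoreLeadingWhiteSpace` and `na_values` that of `nullValue` (column names here are illustrative, not the full Adult schema):

```python
import io

import pandas as pd

# Minimal pandas sketch of the same cleaning: skipinitialspace strips the
# blank after each comma, so ' ?' becomes '?', which na_values then maps
# to NaN. Column names are illustrative.
raw = io.StringIO("39, State-gov, Bachelors\n50, ?, Masters\n")
df = pd.read_csv(raw, header=None,
                 names=["age", "workclass", "education"],
                 skipinitialspace=True, na_values="?")
print(df["workclass"].isna().sum())  # -> 1
```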
Visualization
As we are curious to see how those values are distributed, we can visualize them using a missingno matrix (msno.matrix). It is a way to get a quick visual summary of the sparsity of the dataset.
You can see above, in white, the missing data in each feature. We notice that only 3 features have missing data, two of which are closely related (workclass and occupation).
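The same summary can be obtained numerically with a per-column count of missing values; here is a minimal sketch on toy rows standing in for the Adult dataset (column names are illustrative):

```python
import numpy as np
import pandas as pd

# Numeric version of what the missingno matrix shows: how many values
# are missing in each feature. Toy data standing in for the real dataset.
df = pd.DataFrame({"age": [39, 50, 38],
                   "workclass": ["State-gov", np.nan, "Private"],
                   "occupation": ["Clerical", np.nan, "Sales"],
                   "native_country": [np.nan, "United-States", "Cuba"]})
print(df.isna().sum().to_dict())
```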
For the sake of simplicity and also because here we are not concerned about the precision of the model, I decided to drop the rows containing those values.
In a project where the predictive quality of the model is important, we would have had to think of possible ways to find the missing values.
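That cleanup step can be sketched as follows with toy data; in PySpark, the equivalent call would be df.na.drop():

```python
import numpy as np
import pandas as pd

# Drop every row that contains at least one missing value.
# Toy data standing in for the Adult dataset.
df = pd.DataFrame({"workclass": ["State-gov", np.nan, "Private"],
                   "occupation": ["Clerical", np.nan, "Sales"],
                   "age": [39, 50, 38]})
clean = df.dropna()
print(len(clean))  # -> 2
```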
Overloading scenarios
As explained in the introduction, to identify the point from which PySpark becomes necessary, I created different overloading scenarios.
I duplicated the original dataset until I obtained one of a decent size (3.6 GB); this is what we call our scaling factor of 1. Then, by duplicating it further, we reach quite large volumes of data, up to about 36 GB (a scaling factor of 10).
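The duplication itself can be sketched like this (toy sizes; the real base file is 3.6 GB). With PySpark, the same effect would come from unioning the DataFrame with itself (df.union(df)):

```python
import pandas as pd

def scale_dataset(df: pd.DataFrame, factor: int) -> pd.DataFrame:
    # Stack `factor` copies of the base dataset (factor 1 = the base file).
    return pd.concat([df] * factor, ignore_index=True)

base = pd.DataFrame({"age": [39, 50, 38]})
print(len(scale_dataset(base, 10)))  # -> 30
```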
It looks something like this.
I don’t know about you, but my poor laptop cannot handle that much data… So I ran this experiment in the cloud, on a Droplet from Digital Ocean with a 32-core dedicated CPU and 64 GB of RAM.
Results
I trained the models using Scikit-Learn and PySpark implementations of the decision tree and here are the results.
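A minimal sketch of the Scikit-Learn side of the timing harness, on synthetic data (the PySpark side, not shown, would time pyspark.ml.classification.DecisionTreeClassifier inside the same loop):

```python
import time

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for one overloading scenario: the label is a simple
# function of the first feature, so an unpruned tree fits it perfectly.
rng = np.random.default_rng(0)
X = rng.random((1000, 5))
y = (X[:, 0] > 0.5).astype(int)

start = time.perf_counter()
model = DecisionTreeClassifier().fit(X, y)
train_time = time.perf_counter() - start
print(f"training accuracy={model.score(X, y):.2f}, time={train_time:.4f}s")
```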
As we can see on the graph above, PySpark becomes more efficient than Scikit-Learn as soon as the dataset crosses the 12 GB threshold.
For the nerdier among us, here are the detailed numerical results.
Another observation we can make from the data is that regardless of the size of the dataset, PySpark makes predictions in almost instantaneous time, whereas Scikit-learn takes much longer.
Conclusion
And that’s it! Now you have a clearer idea of where the big data frontier lies, that is, the point at which it becomes worthwhile to distribute model training.
Here is what we can conclude from this short experiment.
- PySpark lives up to its reputation for efficiency in the context of Big Data.
- In my experimental conditions, using PySpark instead of Scikit-Learn paid off from 12 GB of data onward.
- These results relate only to the decision tree model.
Want to connect?
- 📖 Follow me on Medium
- 💌 Subscribe to get an email whenever I publish
- ☕ Fancy a coffee in Paris to discuss AI?
- 🤓 Connect with me on LinkedIn
References
[1] V. Purdilă and Ș.-G. Pentiuc, MR-Tree: A Scalable MapReduce Algorithm for Building Decision Trees (2014), Semantic Scholar.
[2] R. Kohavi and B. Becker, Adult Data Set (1994), UCI Machine Learning Repository.