SMOTE implementation in PySpark

hwangdb
Aug 3, 2020 · 2 min read


SMOTE is probably the most common method of oversampling an imbalanced dataset. It introduces randomness by creating synthetic minority instances, which makes it more robust than simple resampling. This blog post is a summary of my implementation of SMOTE in PySpark.

Having searched through Stack Overflow and GitHub, I found three issues for my use case (a Python project):

  1. It is rare to see a pure Spark implementation of SMOTE (one that does not use sklearn's or pandas' nearest-neighbour classes), which means the data must be collected from Spark into pandas at some point. This conversion is not desirable.
  2. There are some pure Spark implementations in Scala, but I have not yet seen any PySpark-based SMOTE module. One Scala-based SMOTE is here.
  3. Some existing solutions only deal with continuous data, but I need to oversample categorical columns as well.

Therefore I decided to write my own PySpark SMOTE module.

Let’s first review the algorithm in brief:

Basically, we repeatedly find the k-nearest neighbours of each minority instance and introduce some randomness to create synthetic minority-class instances.
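For reference, the interpolation at the heart of SMOTE can be written in a few lines of plain Python. This is only a minimal sketch of the formula with my own names, not code from the module itself:

```python
import numpy as np

def synthesize(x, neighbour, rng=np.random):
    # SMOTE interpolation: move a random fraction of the way
    # from the original minority point towards one of its neighbours.
    gap = rng.uniform(0.0, 1.0)
    return x + gap * (neighbour - x)

# e.g. synthesize(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
```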

Illustration of the SMOTE algorithm (picture source: Internet)

Here are my implementation details:

The steps are clear, and I have two main functions in the following snippet. pre_smote_df_process takes in a Spark DataFrame and preprocesses it by string-indexing the categorical columns and assembling the numerical features into a vector. Then, in the main SMOTE function, we first split the original DataFrame in two and keep only the part containing the minority class, because only existing minority instances are used to find nearest neighbours.
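A minimal sketch of what such a preprocessing function could look like is below; the function name follows the post, but the exact signature, column names, and the choice to index the label column are my assumptions:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler

def pre_smote_df_process(df, num_cols, cat_cols, target_col):
    # String-index the categorical columns (and the label), then
    # assemble the numerical columns into a single "features" vector.
    indexers = [
        StringIndexer(inputCol=c, outputCol=c + "_index", handleInvalid="keep")
        for c in cat_cols + [target_col]
    ]
    assembler = VectorAssembler(inputCols=num_cols, outputCol="features")
    return Pipeline(stages=indexers + [assembler]).fit(df).transform(df)
```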

We use Spark's Locality Sensitive Hashing module to find the nearest neighbours. A config object stores the relevant parameters, such as k, the number of nearest neighbours for each minority instance. In the function you can see a self-join, which pairs each row with its nearest rows in terms of Euclidean distance. Note that finding nearest neighbours only considers the numerical features. After we have randomly chosen a neighbour, we take the vector difference, multiply it by a random factor between 0 and 1, and add it to the original row vector. At this point we have created a synthetic instance for each original minority instance. We then unpack the columns so that the synthetic DataFrame has the same format as the original input DataFrame. This completes the augmentation of the minority-class instances.
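Below is a condensed sketch of this numerical part. It assumes the minority DataFrame already has a unique id column and an assembled features vector; the config object from the post is replaced by plain keyword arguments, and all names are mine rather than the module's:

```python
import random
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def smote_numeric(minority_df, k=5, bucket_length=2.0, seed=42):
    # Fit Spark's Euclidean-distance LSH on the minority instances.
    lsh = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                      bucketLength=bucket_length, seed=seed)
    model = lsh.fit(minority_df)

    # Self-join: pair every minority row with its approximate neighbours,
    # dropping the trivial match of a row with itself.
    pairs = (model.approxSimilarityJoin(minority_df, minority_df,
                                        float("inf"), distCol="dist")
                  .filter(F.col("datasetA.id") != F.col("datasetB.id")))

    # Keep the k nearest neighbours per row, then sample one of them at random.
    by_dist = Window.partitionBy(F.col("datasetA.id")).orderBy("dist")
    by_rand = Window.partitionBy(F.col("datasetA.id")).orderBy("rnd")
    chosen = (pairs.withColumn("rank", F.row_number().over(by_dist))
                   .filter(F.col("rank") <= k)
                   .withColumn("rnd", F.rand(seed))
                   .withColumn("pick", F.row_number().over(by_rand))
                   .filter(F.col("pick") == 1))

    # Synthetic vector = original + gap * (neighbour - original), gap in [0, 1).
    @F.udf(VectorUDT())
    def interpolate(orig, neigh):
        gap = random.random()
        return Vectors.dense([o + gap * (n - o)
                              for o, n in zip(orig.toArray(), neigh.toArray())])

    return chosen.select(
        interpolate(F.col("datasetA.features"),
                    F.col("datasetB.features")).alias("features"))
```

The synthetic rows returned by this sketch only carry the interpolated feature vector; in the full module the remaining columns would be carried along and unpacked back into the original schema.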

For categorical features, after choosing one random neighbour, we keep either the neighbour's or the original row's categorical values.
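One way to implement this step, assuming the datasetA/datasetB structs produced by the similarity join above, string-indexed categorical columns, and a random choice between the two rows (again a sketch, not the module's exact code):

```python
from pyspark.sql import functions as F

def choose_categoricals(chosen_pairs, cat_cols):
    # For each categorical column, keep either the original row's value
    # or the neighbour's value (here the choice is made at random).
    out = chosen_pairs
    for c in cat_cols:
        out = out.withColumn(
            c + "_index",
            F.when(F.rand() < 0.5, F.col("datasetA." + c + "_index"))
             .otherwise(F.col("datasetB." + c + "_index")))
    return out
```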

PySpark implementation of SMOTE

Takeaway points from writing this module:

  1. Finding nearest neighbours for vectorised data using locality-sensitive hashing.
  2. Using UDFs for vector operations in Spark.
  3. Schema manipulation for complex (nested) dataframes (see the sketch after this list).
  4. SMOTE on categorical columns.
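As an illustration of the schema-manipulation point, the assembled features vector can be unpacked back into its original numerical columns so that synthetic rows match the input schema. This sketch assumes Spark 3.0+ for vector_to_array; on older versions a UDF would be needed:

```python
from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as F

def unpack_features(df, num_cols):
    # Convert the vector column to an array, then split it back
    # into one column per original numerical feature.
    arr = vector_to_array(F.col("features"))
    for i, c in enumerate(num_cols):
        df = df.withColumn(c, arr[i])
    return df.drop("features")
```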
