Synthetic Minority Over-sampling Technique (SMOTE) from Scratch

Breya Walker
Aug 28, 2019


Dealing with imbalanced classes in data can be a hassle, especially when it comes to training and testing your machine learning algorithm. Imbalanced data can make the testing accuracy look deceptively high: overall accuracy is dominated by the majority class, which washes out the inaccurate predictions on the minority class. Precision and recall will be extremely low for the minority class of interest (Saito & Rehmsmeier, 2015) because minority observations are insufficiently represented during training. Fortunately, there are several ways to combat the effects of imbalanced data.

An example of imbalanced class data: the majority class, A, accounts for 83% of the observations, while the minority class, B, accounts for 17%.

Combating the Problem

When imbalanced classes are present in your data set, you have to find a way to address the issue. There is a plethora of options to choose from, most of which fall into three main categories: Majority Under-Sampling, Minority Over-Sampling, and SMOTE. Minority Over-Sampling and Majority Under-Sampling are techniques that randomly sample the minority class with replacement and the majority class without replacement, respectively. Then there is SMOTE, which we will focus on for the remainder of this article. As described in Applied Predictive Modeling (Kuhn & Johnson, 2013), SMOTE is a sampling technique that increases the number of minority observations: a data point from the minority class is selected at random and its K-nearest neighbors (KNNs) are determined. The new synthetic data point is a random combination of the predictors of the randomly selected data point and its neighbors.

Ways to Implement SMOTE

There is the well-known Python library imblearn, which contains a SMOTE function (more information regarding it can be located here). You supply X_train and Y_train as arguments, and the function returns synthetically up-sampled X_train and Y_train values.
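For completeness, here is a minimal sketch of the library route, using the X_train and Y_train from your own pipeline (this assumes imbalanced-learn is installed; fit_resample is its current resampling API):

from imblearn.over_sampling import SMOTE

# Over-sample the minority class until the classes are balanced
sm = SMOTE(k_neighbors=5, random_state=42)
X_resampled, Y_resampled = sm.fit_resample(X_train, Y_train)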

If you do not have the liberty to install libraries such as imblearn into your development environment for whatever reason (e.g., info-security restrictions), then keep reading. I have found myself in this very situation: not being able to install libraries that were not included in my native development environment.

I needed to create synthetic minority observations, so creating SMOTE functionality from scratch was called for. HERE IT IS!

SMOTE From Scratch

Import the following libraries into your development environment. These are just a handful of recommended data science libraries you should have installed; additional libraries can be located here. If these are not installed on your machine, follow the instructions located here.

Import the required modules
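A minimal set of imports that supports everything below: numpy and pandas for the data handling, plus scikit-learn's NearestNeighbors, which the nearest_neighbour function in the next step assumes is available.

import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors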

Next, we will create a function that finds the k = 5 nearest neighbors of the nᵗʰ X_train data point. We will pass the X_train dataframe as an argument into the nearest_neighbour function. What is most important is to return the k indices of the nearest neighbors, which will be used during a later step.

Nearest_Neighbour function defined
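A sketch of what such a function can look like, assuming scikit-learn's NearestNeighbors; the choice of algorithm='ball_tree' is an assumption, and any of scikit-learn's search algorithms would work here:

def nearest_neighbour(X, k=5):
    # Fit k + 1 neighbors because the closest match to each point is itself
    nbs = NearestNeighbors(n_neighbors=k + 1, algorithm='ball_tree').fit(X)
    distances, indices = nbs.kneighbors(X)
    # Keep only the k true neighbor indices (drop the self-match in column 0)
    return indices[:, 1:]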

Now we will create the SMOTE function. Recall that the new synthetic data point has to be a random combination of the predictors of the randomly selected data point and its neighbors. Our function will utilize the nearest_neighbour function built above, iterate over each mᵗʰ row of the minority data, randomly select one of the mᵗʰ row's neighbors for each jᵗʰ column, store that neighbor's jᵗʰ value in a variable named newt, and append each random selection into the mᵗʰ row of a new matrix.

SMOTE function defined
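A sketch following that description; the column-wise random sampling and the newt variable are reconstructed from the prose above, so treat this as one reasonable reading rather than the only implementation:

def SMOTE_100(X):
    # X: numpy array of (normalized) minority class observations
    indices = nearest_neighbour(X, k=5)
    n_rows, n_cols = X.shape
    synthetic = np.empty((n_rows, n_cols))
    for m in range(n_rows):
        for j in range(n_cols):
            # Take column j's value from a randomly chosen neighbor of row m
            newt = X[np.random.choice(indices[m]), j]
            synthetic[m, j] = newt
    # 100% SMOTE: one synthetic observation per original minority observation
    return synthetic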

Now let’s put our functions to use. First we have to obtain the unique categorical levels that exist in our target label and get a count of each unique category.

# 1. Getting the number of minority class instances in the training set
unique, counts = np.unique(Y_train, return_counts=True)

Create a minority_shape variable that contains the number of minority class observations. This is obtained by zipping the unique categories with their associated counts into a dictionary and selecting the count stored under the minority label, 1.

minority_shape = dict(zip(unique, counts))[1]

Store only the minority class observations from X_train in a list and convert it to an array called x1. Note that the X_train data has been normalized for the use of KNN. If you have not normalized your training data, please refer to here and here. I will be writing an article related to the normalization of training and timestamp data VERY soon.

# 2. Storing the minority class instances separately
x1 = [X_train.iloc[i] for i, v in enumerate(Y_train) if v == 1.0]
x1 = np.array(x1)

Now we are ready to use our SMOTE_100 function. Once we run it, we can keep our synthetic instances and concatenate them with the original X_train dataframe to create X_TrainSMOTE.

# 3. Applying 100% SMOTE
sampled_instances = SMOTE_100(x1)
# Keeping the artificial instances and original instances together
X_TrainSMOTE = np.concatenate((X_train,sampled_instances), axis = 0)

Finally, we will create our new instances of Y by creating a vector of 1s, y_sampled_instances, with length equal to the size of the minority class. We then concatenate y_sampled_instances with the original Y_train to create y_TrainSMOTE.

y_sampled_instances = np.ones(minority_shape)
y_TrainSMOTE = np.concatenate((Y_train, y_sampled_instances), axis=0)
# X_TrainSMOTE and y_TrainSMOTE are the training set features and labels, respectively

Putting it all together

The final script is below, for those who simply want to copy and paste the code without reading :)

Final code snippet for SMOTE functionality
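A runnable end-to-end sketch assembling the pieces above, under the same assumptions as before: X_train is a dataframe of normalized features, Y_train holds binary labels with the minority class coded as 1.0, and both come from your own pipeline.

import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def nearest_neighbour(X, k=5):
    # Fit k + 1 neighbors because the closest match to each point is itself
    nbs = NearestNeighbors(n_neighbors=k + 1, algorithm='ball_tree').fit(X)
    distances, indices = nbs.kneighbors(X)
    return indices[:, 1:]  # drop the self-match in column 0

def SMOTE_100(X):
    # One synthetic observation per minority observation (100% over-sampling)
    indices = nearest_neighbour(X, k=5)
    n_rows, n_cols = X.shape
    synthetic = np.empty((n_rows, n_cols))
    for m in range(n_rows):
        for j in range(n_cols):
            # Take column j's value from a randomly chosen neighbor of row m
            synthetic[m, j] = X[np.random.choice(indices[m]), j]
    return synthetic

# 1. Getting the number of minority class instances in the training set
unique, counts = np.unique(Y_train, return_counts=True)
minority_shape = dict(zip(unique, counts))[1]

# 2. Storing the minority class instances separately
x1 = np.array([X_train.iloc[i] for i, v in enumerate(Y_train) if v == 1.0])

# 3. Applying 100% SMOTE
sampled_instances = SMOTE_100(x1)

# Keeping the artificial instances and original instances together
X_TrainSMOTE = np.concatenate((X_train, sampled_instances), axis=0)
y_sampled_instances = np.ones(minority_shape)
y_TrainSMOTE = np.concatenate((Y_train, y_sampled_instances), axis=0)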

Final Takeaway

There are a number of options available to combat the imbalanced data problem that exists in the real world, one of which is SMOTE. If you find yourself in a restricted development environment where you do not have the liberty to install non-native libraries, then creating SMOTE from scratch is a great alternative.

Happy SMOTING!!!

Breya Walker-McGlown

Principal Data Scientist

HeySoftware!, LLC

Memphis, TN

Heysoftwaresolutions.com

REFERENCES

Saito, T. & Rehmsmeier, M. (2015). The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLoS ONE 10(3). Retrieved from https://doi.org/10.1371/journal.pone.0118432

Kuhn, M. & Johnson, K. (2013). Remedies for severe class imbalance. In Applied Predictive Modeling (pp. 427–428). Springer, New York.
