SMOTE technique for imbalanced data of 3 classes in R programming: Cardiotocography Data Set

Phuong Del Rosario
Published in Analytics Vidhya
3 min read · Feb 28, 2021

In this article, we will use SMOTE to balance the 3 classes in the cardiotocography data set. Please read the article below to understand more about the dataset.

1. Creating TRAINSET/TESTSET data

TESTSET and TRAINSET were created from SDATA. A standardized data set was generated for each of the classes N, S, and P; these are named SCL1, SCL2, and SCL3 respectively.

For each standardized class data set (SCL1, SCL2, and SCL3), 20% of the rows were randomly chosen and combined into the full TESTSET, and the remaining 80% of each class were combined into the full TRAINSET. The full TESTSET has 425 rows and 21 features, while the full TRAINSET has 1701 rows and 21 features.
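The per-class 80/20 split described above can be sketched in base R as follows. This is only an illustration, not the original code: the helper names make_class() and split_class(), the label column NSP, the seed, and the random feature values are all assumptions. The class sizes 1655, 295, and 176 are the N, S, and P counts published for the UCI cardiotocography data, and are used here as stand-ins for SCL1, SCL2, and SCL3.

```r
set.seed(42)  # arbitrary seed so the random split is reproducible

# Toy stand-in for one standardized class data set: n rows, 21 numeric
# features, plus a label column (column name NSP is an assumption).
make_class <- function(n, label) {
  x <- as.data.frame(matrix(rnorm(n * 21), nrow = n))
  x$NSP <- label
  x
}
SCL1 <- make_class(1655, "N")
SCL2 <- make_class(295, "S")
SCL3 <- make_class(176, "P")

# Randomly hold out test_frac of the rows of one class; keep the rest for training.
split_class <- function(df, test_frac = 0.2) {
  test_idx <- sample(nrow(df), size = round(test_frac * nrow(df)))
  list(train = df[-test_idx, ], test = df[test_idx, ])
}

# Split each class separately, then stack the pieces into the full sets.
parts <- lapply(list(SCL1, SCL2, SCL3), split_class)
TRAINSET <- do.call(rbind, lapply(parts, `[[`, "train"))
TESTSET <- do.call(rbind, lapply(parts, `[[`, "test"))

# Class sizes in each set; the imbalance between N, S, and P is visible here.
print(table(TRAINSET$NSP))
print(table(TESTSET$NSP))
```

Splitting each class on its own, rather than sampling 20% of the pooled data, keeps the class proportions the same in TRAINSET and TESTSET.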

The table below shows the size of each class in TRAINSET and TESTSET. Looking at the class sizes in Table 7, we see that there is an imbalance between classes. Thus, balancing is needed for TRAINSET and TESTSET.

2. Balancing the classes in TRAIN data set

The classes are imbalanced in both the TRAIN and TEST sets. Imbalanced classes can bias the predictive model and hurt its accuracy, so the next step is to balance the classes. Here, we balance the classes in the TRAIN data set only, in order to prevent overfitting in our performance results.

SMOTE stands for Synthetic Minority Oversampling Technique, which creates new synthetic cases based on existing cases of the minority class. The new cases are not just copies of existing minority cases. Instead, for each target case in the minority class, SMOTE finds its k nearest neighbors in feature space and combines the features of the target case with the features of those neighbors. This method is used to oversample the two minority classes, the suspect (S) class and the pathologic (P) class, in TRAINSET and TESTSET.
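To make the idea concrete, here is a minimal base R sketch of the interpolation step. It is a simplified illustration written for this article, not the code inside any package: the function name smote_sketch(), its arguments, and the toy data are assumptions. Each synthetic case is a random point on the line segment between a minority case and one of its k nearest minority neighbors.

```r
# Simplified SMOTE-style interpolation (illustrative sketch only).
# X: numeric matrix of minority-class cases; k: number of nearest neighbors;
# n_per_case: number of synthetic cases generated from each original case.
smote_sketch <- function(X, k = 3, n_per_case = 1) {
  X <- as.matrix(X)
  stopifnot(k < nrow(X))
  d <- as.matrix(dist(X))  # pairwise Euclidean distances
  diag(d) <- Inf           # a case is never its own neighbor
  synth <- vector("list", nrow(X))
  for (i in seq_len(nrow(X))) {
    nn <- order(d[i, ])[seq_len(k)]                          # k nearest neighbors of case i
    picked <- nn[sample.int(k, n_per_case, replace = TRUE)]  # one random neighbor per new case
    gap <- runif(n_per_case)                                 # position along the segment, in (0, 1)
    base <- matrix(X[i, ], nrow = n_per_case, ncol = ncol(X), byrow = TRUE)
    synth[[i]] <- base + gap * (X[picked, , drop = FALSE] - base)
  }
  do.call(rbind, synth)
}

set.seed(7)  # arbitrary seed for reproducibility
minority <- matrix(rnorm(40), ncol = 2)  # 20 toy minority cases with 2 features
new_cases <- smote_sketch(minority, k = 3, n_per_case = 3)
print(dim(new_cases))
```

Because every new case lies between two existing minority cases, the synthetic data stays inside the region already occupied by the minority class instead of duplicating rows exactly.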

In R, the SMOTE() function of the "smotefamily" package was used to generate the new observations for the S class and the P class. The two main parameters of the function are K and dup_size. K is the number of nearest neighbors chosen for each target case in the minority class, and dup_size controls how many times the minority class is replicated with synthetic cases. For the S class, K and dup_size are 3 and 3, while for the P class they are 4 and 4. The final size of each class and the sizes of the full newTRAIN and full newTEST data sets after balancing are shown in Table 8: Balanced newTRAIN and newTEST. The full newTRAIN and full newTEST data were created by joining all 3 classes after balancing.
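A call with these parameter values might look like the sketch below. The article does not show its code, so the workflow here is an assumption: each minority class is paired with the N class and passed to SMOTE() separately, because SMOTE() oversamples the smallest class in the target vector. The toy data, the class sizes 200, 40, and 25, the label column NSP, and the helper oversample() are hypothetical. The block runs only if the smotefamily package is installed.

```r
if (requireNamespace("smotefamily", quietly = TRUE)) {
  set.seed(1)  # arbitrary seed for reproducibility

  # Toy imbalanced training data: 5 numeric features and 3 classes (sizes are made up).
  toy <- data.frame(matrix(rnorm(265 * 5), ncol = 5))
  toy$NSP <- rep(c("N", "S", "P"), times = c(200, 40, 25))

  # Oversample one minority class against the N class and return only
  # the original plus synthetic rows of that minority class.
  oversample <- function(df, minority, K, dup_size) {
    pair <- df[df$NSP %in% c("N", minority), ]
    res <- smotefamily::SMOTE(pair[, 1:5], pair$NSP, K = K, dup_size = dup_size)
    res$data[res$data$class == minority, ]  # SMOTE() stores the label in a column named "class"
  }

  new_S <- oversample(toy, "S", K = 3, dup_size = 3)  # values used for the S class
  new_P <- oversample(toy, "P", K = 4, dup_size = 4)  # values used for the P class

  # Join the untouched N class with the two oversampled minority classes.
  n_rows <- toy[toy$NSP == "N", ]
  names(n_rows)[names(n_rows) == "NSP"] <- "class"
  newTRAIN <- rbind(n_rows, new_S, new_P)

  print(table(newTRAIN$class))
}
```

With a nonzero dup_size, each minority case yields dup_size synthetic cases, so a minority class of size n should end up with n * (1 + dup_size) rows after oversampling.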

Phuong Del Rosario

I am passionate about data, and love beauty! M.S. student in Statistics and Data Science.