Class Imbalance: ROSE and Random Walk Oversampling (RWO)

Let’s formally define the class imbalance problem and then intuitively derive solutions for it!

Essam Wisam
Towards Data Science

--

In this series of stories, we explain various resampling techniques used to deal with class imbalance; in particular, those initially implemented in the Imbalance.jl Julia package, which include naive random oversampling, random oversampling examples (ROSE), random walk oversampling (RWO), the synthetic minority oversampling technique (SMOTE), SMOTE-Nominal, SMOTE-Nominal Continuous, and many undersampling techniques. For this story, we assume familiarity with the class imbalance problem, as formally explained earlier in the series, and explain two interesting algorithms that help solve it: ROSE and RWO.

Table of Contents

Random Oversampling Examples (ROSE)
Random Walk Oversampling (RWO)

Random Oversampling Examples (ROSE)

We know that any extra data we collect would follow the probability distribution of the underlying population of the minority class, so how about approximating this probability distribution and then sampling from it to simulate collecting real examples? This is exactly what the random oversampling examples (ROSE) algorithm does.

It follows that ROSE tries to estimate the probability distribution P(x|y=k) for each class k and then draws the needed N_k samples from it. One well-known way to estimate such a density is kernel density estimation (KDE), which you can derive or intuit starting from cruder estimators such as histogram analysis. The following describes KDE:

Given: data points x₁, x₂, …, x_N
Wanted: An estimate of P(x)
Operation: Choose a kernel function K(x) and then estimate P(x) as

P(x) ≈ (1/N) * Σᵢ K(x − xᵢ)

Typically, we want to be able to control the scale of the kernel function (i.e., squeeze it or spread it) as that can improve the estimate of P(x), so in a more general sense we have

P(x) ≈ (1/(N·h)) * Σᵢ K((x − xᵢ)/h)

where h is called the bandwidth (in d dimensions the normalizing factor becomes N·hᵈ).

In essence, what KDE does is place the kernel function on top of each data point and then sum and normalize them all so that the result integrates to 1.

Applying KDE to Estimate Distribution by Drleft on Wikimedia Commons (License)

The choice of the kernel function itself is a hyperparameter, albeit one that has been shown not to be very significant as long as it satisfies basic properties such as smoothness and symmetry. A simple Gaussian

K(x) = (1/(σ√(2π))) * exp(−x²/(2σ²))

with σ as the scale is a common choice, and it is the one ROSE uses for its KDE. The standard deviation in the normal distribution formula already acts as h; hence, h is not written separately.
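To make the operation concrete, here is a minimal Julia sketch of a Gaussian KDE. The function names and the choice of a single scalar bandwidth h are illustrative assumptions, not the Imbalance.jl implementation:

```julia
using Statistics

# Multivariate Gaussian kernel with identity covariance (illustrative choice)
gaussian_kernel(u) = exp(-sum(abs2, u) / 2) / (2π)^(length(u) / 2)

# Estimate P(x) at a query point x by averaging a kernel of scale h
# placed on every data point; X has one data point per row.
function kde(x::AbstractVector, X::AbstractMatrix; h::Real = 1.0)
    N, d = size(X)
    return sum(gaussian_kernel((x .- X[i, :]) ./ h) for i in 1:N) / (N * h^d)
end
```

Summing the kernels and dividing by N·hᵈ is exactly the “place, sum, and normalize” operation described above.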

Once this distribution has been estimated for a class k (yielding an estimate of P(x|y=k)), ROSE draws the needed N_k points from it by repeating the following:

  • Choose a point randomly
  • Place the Gaussian on it
  • Sample one point from the Gaussian

This is just like random oversampling, except that after choosing a point randomly, ROSE places a Gaussian on it and samples from that Gaussian to generate the new point instead of simply repeating the chosen point.

For this, ROSE sets the bandwidth h (or, more generally, the smoothing matrix in higher dimensions, which is the covariance matrix parameter of the normal distribution) using a rule of thumb due to Silverman so that the mean integrated squared error is minimized. In particular,

H = (4/((d+2)·N))^(1/(d+4)) * D_σ

where D_σ is a diagonal matrix of the standard deviations of each of the features, d is the number of features, and N is the number of points in the class.

In the Imbalance.jl package, this is multiplied by another constant s to allow optional control of the amount of smoothing. For s=1 the bandwidth is left as in the paper, and for s=0 ROSE becomes equivalent to naive random oversampling. The following animation, produced using the package, illustrates the effect of increasing s on the generated points.

Figure by author. Synthetic points shown as diamonds.

Notice that as we increase s, the synthetic points generated from each randomly chosen original point land farther and farther away from it.
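Putting the pieces together, the following is a minimal sketch of ROSE-style sampling in Julia. It is not the Imbalance.jl API; the function name, the row-per-point layout, and the keyword arguments are assumptions made for illustration:

```julia
using Random, Statistics

# Generate n_new synthetic points for one class whose points are the rows of X.
# Each synthetic point is drawn from a Gaussian centred on a randomly chosen
# existing point, with per-feature bandwidths given by Silverman's rule of
# thumb scaled by s (s = 0 reduces to naive random oversampling).
function rose_sample(X::AbstractMatrix, n_new::Integer; s::Real = 1.0, rng = Random.default_rng())
    N, d = size(X)
    σ = vec(std(X; dims = 1))                        # per-feature standard deviations (D_σ)
    h = s .* σ .* (4 / ((d + 2) * N))^(1 / (d + 4))  # Silverman bandwidth per feature
    X_new = Matrix{Float64}(undef, n_new, d)
    for i in 1:n_new
        x = X[rand(rng, 1:N), :]                     # choose an existing point at random
        X_new[i, :] = x .+ h .* randn(rng, d)        # sample the Gaussian placed on it
    end
    return X_new
end
```

Setting s = 0 zeroes out the Gaussian step, so the loop simply repeats existing points, which matches the claim above that ROSE then reduces to random oversampling.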

Random Walk Oversampling (RWO)

When the central limit theorem applies, one can show that with 95% probability a mean estimate x̄ computed over n examples satisfies x̄ = µ ± 1.96*σ/√n. Suppose the minority class has a small σ (e.g., σ=10) and we have collected n=1000 examples for it; then x̄ = µ ± 0.62 with 95% probability, or in other words, if n is large enough then pretty much x̄ ≈ µ. A similar argument can be used to show that the estimated standard deviation S_x will be very close to σ, at least when the data is normal.

This motivates generating synthetic data in such a way that the estimates x̄ and S_x are preserved, which simulates collecting new data. It can be shown that if the features of the data belonging to some class are independent, where x̄ is the mean of all points in the class and S_x is their standard deviation (both of which are vectors), and new points for that class are generated by applying

x_new = x + (S_x/√n) * r

where x is some point belonging to the class, r is a vector of samples from a standard normal distribution, and the multiplication is element-wise (when this is applied to every point k times, k·n new examples are introduced),

then asymptotically these points do not change the original x̄ and S_x. This is easy to see: when the number of generated points N_k = k·n is very large, sum(x_new) ≈ k·sum(x) because the r terms have zero mean, so sum(r) ≈ 0. The total sum over the old and new examples is then (k+1)·n·x̄, and dividing by the (k+1)·n points gives back x̄, so the mean is preserved. It can likewise be shown that the standard deviation is preserved as well.

Animation by the author. Diamonds correspond to synthetically generated points.

The implementation shown in the animation works for any oversampling ratio by randomly choosing the points to perturb from the class instead of looping over all of them.
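As a sketch of that idea (again with hypothetical names and a row-per-point matrix, not the Imbalance.jl API), RWO-style sampling can be written in a few lines of Julia:

```julia
using Random, Statistics

# Generate n_new synthetic points for one class whose points are the rows of X.
# Each synthetic point is a randomly chosen existing point shifted by a
# standard-normal step scaled per feature by S_x / sqrt(n), so the class mean
# and standard deviation are asymptotically preserved.
function rwo_sample(X::AbstractMatrix, n_new::Integer; rng = Random.default_rng())
    n, d = size(X)
    S = vec(std(X; dims = 1))             # per-feature standard deviations (S_x)
    X_new = Matrix{Float64}(undef, n_new, d)
    for i in 1:n_new
        x = X[rand(rng, 1:n), :]          # choose an existing point at random
        X_new[i, :] = x .+ (S ./ sqrt(n)) .* randn(rng, d)   # one random-walk step
    end
    return X_new
end
```

Comparing `mean(X; dims = 1)` and `std(X; dims = 1)` before and after appending the synthetic rows shows that both remain essentially unchanged once enough points are generated.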

Figure by the author

The method is called random walk oversampling because, as the plot suggests, the new points look as if they were generated by taking random walks from existing ones.

I hope this story has shed more light on the class imbalance problem in machine learning and how it can be solved. In the next story, we will consider the algorithms presented in the SMOTE paper.

References:
[1] G. Menardi and N. Torelli, “Training and assessing classification rules with imbalanced data,” Data Mining and Knowledge Discovery, 28(1), pp. 92–122, 2014.

[2] H. Zhang and M. Li, “RWO-Sampling: A random walk over-sampling approach to imbalanced data classification,” Information Fusion, 25, pp. 4–20, 2014.
