SMOTE — Synthetic Minority Over-sampling Technique
A brief explanation of a pretty useful technique.
If you have ever worked on a classification task, there is a good chance you have run into imbalanced data. Data is imbalanced when one (or more) of the classes you are trying to label has many more instances than the others. For example, in fraud detection, most transactions are not fraudulent, so a typical transaction data set will have many legitimate transactions and only a few fraudulent ones. Unless you do something to balance the data, many classifiers will simply label nearly every transaction as legitimate, because that alone already gives them a very high accuracy.
To balance the data, you have several options. The first is simply to gather more data. While this is preferable, it is often not possible. In that case, you can try resampling the data, either by under-sampling your majority class (the non-fraud transactions in the example above) or by over-sampling your minority class (the fraudulent transactions). Over-sampling means either sampling members of the minority class with replacement, which only duplicates existing points, or creating synthetic members. SMOTE (Synthetic Minority Over-sampling Technique) takes the second approach: it builds each synthetic member by interpolating between the features of an existing minority sample and those of a nearby minority sample.
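To make the two plain resampling options concrete, here is a minimal sketch in pure Python. The function names and the toy `(amount, hour)` rows are made up for illustration; they do not come from any real data set or library.

```python
import random

def undersample(majority, target_size, rng):
    """Keep a random subset of the majority class, without duplicates."""
    return rng.sample(majority, target_size)

def oversample_with_replacement(minority, target_size, rng):
    """Keep every minority row, then add random copies up to target_size."""
    n_extra = max(0, target_size - len(minority))
    return list(minority) + [rng.choice(minority) for _ in range(n_extra)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical toy rows of (amount, hour of day); not real transaction data.
legit = [(float(10 + i), i % 24) for i in range(1000)]
fraud = [(120.0, 2), (980.5, 3), (45.0, 23), (300.0, 1), (75.25, 4)]

fewer_legit = undersample(legit, len(fraud), rng)
more_fraud = oversample_with_replacement(fraud, len(legit), rng)

print(len(fewer_legit), len(more_fraud))  # 5 1000
print(len(set(more_fraud)))               # 5: only the original rows, repeated
```

Under-sampling throws away most of the legitimate transactions, while over-sampling with replacement adds no new information: the 1,000 fraud rows are still just the same five points.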
To understand how this method works, imagine you are trying to classify two groups of people based on where in a room they are standing. People with green shirts tend to stand in the south part of the room, but there are a lot of them (let’s say 1,000; it’s a big room), so they fill the room from the south wall all the way to the middle. People with yellow shirts tend to stand in the middle and towards the northwest corner of the room, but there are only a few of them (let’s say 5). The regions where these two groups prefer to stand overlap, so a couple of the yellow-shirted folks have green-shirted neighbors around them.
Now if we train our classifier on this data, it may simply tell us everyone is wearing a green shirt, and it will be right 99.5% of the time (1,000 out of 1,005 people). Not useful. So we try sampling with replacement until there are 1,000 people in the yellow shirt group. Sampling with replacement makes it seem like there are five positions where yellow-shirted people stand and nowhere else (and also like each of the five original people has 199 people sitting on their shoulders). A classifier trained on this data is likely to overfit to those five spots.
SMOTE solves this problem in the following way: take the first yellow-shirted person and calculate the distance between them and their nearest yellow-shirted neighbor; call it x. Pick a random number between 0 and 1; call it a. Put a new yellow-shirted person on the straight line between the two, at a distance ax from the original person. (In the original SMOTE algorithm, the neighbor is picked at random from the k nearest minority neighbors rather than always being the single nearest one.) Repeat this, cycling through all the yellow shirts, until you have 1,000 yellow-shirted people standing in the area where they like to stand, but not only in the original 5 spots.
Now you have a balanced set with a more realistic sample of where yellow-shirted people would actually be standing. Pretty cool, right? :)
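The yellow shirt procedure can be sketched in a few lines of pure Python. This is a simplified illustration, not a reference implementation: `smote_sketch` and the five room coordinates are hypothetical, and the default `k=1` follows the "nearest neighbor" description above, whereas the original algorithm chooses among several nearest neighbors.

```python
import math
import random

def smote_sketch(minority, n_new, k=1, seed=0):
    """Create n_new synthetic points by interpolating between minority points.

    Assumes at least two minority points. k=1 always uses the single nearest
    neighbor; a larger k picks at random among the k nearest.
    """
    rng = random.Random(seed)
    synthetic = []
    for i in range(n_new):
        # Cycle through the original minority points in order.
        base_index = i % len(minority)
        base = minority[base_index]
        # Rank the other minority points by Euclidean distance from base.
        others = [p for j, p in enumerate(minority) if j != base_index]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        a = rng.random()  # random fraction in [0, 1)
        # New point on the segment from base towards neighbor, at fraction a.
        synthetic.append(tuple(b + a * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Hypothetical (east, north) positions of the 5 yellow shirts in the room.
yellow = [(2.0, 8.0), (3.0, 9.0), (4.0, 6.0), (1.0, 7.5), (5.0, 5.5)]
balanced_yellow = yellow + smote_sketch(yellow, n_new=1000 - len(yellow))

print(len(balanced_yellow))  # 1000
```

Because every new point lies on a segment between two original yellow shirts, the synthetic crowd stays inside the region the originals already occupy. For real projects, a maintained implementation such as the `SMOTE` class in the imbalanced-learn package is likely a better choice than this sketch.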