Divide & Learn [Machine Learning] [Data Mining]

A common problem we face is improving the accuracy of a model on a given data set. It is hard because, in practice, we often try to do everything with a single model, which is not optimal in some scenarios. For example, assume there is a categorical feature called “gender” alongside some other features. When we train on the whole data set at once, this categorical feature might have a huge impact on the variance of the other variables, which can result in an unstable model, because a single model cannot capture men’s behavior and women’s behavior with one approach.

So, to address this kind of problem, it is sometimes better to treat these categories separately and check the accuracy. Divide the data set by gender and remove the gender feature from both subsets. Then apply different models to the men’s and women’s data separately; each subset may suit a different model. If the men’s data has a mostly linear relationship, a linear model will fit that scenario well. If the women’s data does not show a linear relationship, a decision-tree-style algorithm might work better for that case. Even the same model with different hyperparameters could make a difference.
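The split-and-train step above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the data is synthetic, and the column names (`gender`, `age`, `spend`), model choices, and hyperparameters are all assumptions made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, purely for illustration: a roughly linear target
# for one group and a non-linear one for the other.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "gender": rng.choice(["M", "F"], size=n),
    "age": rng.uniform(18, 65, size=n),
})
df["spend"] = np.where(
    df["gender"] == "M",
    2.0 * df["age"],                     # linear pattern
    np.abs(df["age"] - 40.0) * 3.0,      # non-linear pattern
) + rng.normal(0, 1, size=n)

models = {}
for gender, group in df.groupby("gender"):
    # Drop the splitting feature before training each sub-model.
    X = group[["age"]]
    y = group["spend"]
    # Pick a model family per subgroup (an illustrative choice,
    # following the linear-vs-tree example from the text).
    model = (LinearRegression() if gender == "M"
             else DecisionTreeRegressor(max_depth=4))
    model.fit(X, y)
    models[gender] = model
```

The key point is that each sub-model only ever sees rows from its own group, so it no longer has to average over both behaviors.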

Finally, we have to make predictions. At prediction time, we check the feature we used to separate the data set, route each data point to the relevant model, and take that model’s prediction.
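That routing step might look like the sketch below. It assumes the sub-models live in a dict keyed by the splitting feature’s value; the `DummyRegressor` stand-ins and the `predict_routed` helper are hypothetical names used only to make the example self-contained.

```python
import pandas as pd
from sklearn.dummy import DummyRegressor

# Stand-in fitted sub-models, keyed by group value (in practice these
# would be the models trained on each subset).
models = {
    "M": DummyRegressor(strategy="constant", constant=1.0).fit([[0]], [1.0]),
    "F": DummyRegressor(strategy="constant", constant=2.0).fit([[0]], [2.0]),
}

def predict_routed(models, df_new, split_col, feature_cols):
    """Dispatch each row to the sub-model trained on its group."""
    preds = pd.Series(index=df_new.index, dtype=float)
    for key, group in df_new.groupby(split_col):
        # The splitting column itself is not fed to the model.
        preds.loc[group.index] = models[key].predict(group[list(feature_cols)])
    return preds

new = pd.DataFrame({"gender": ["F", "M", "F"], "age": [30, 45, 52]})
preds = predict_routed(models, new, "gender", ["age"])
```

Grouping the new rows first, instead of looping row by row, keeps the dispatch to one `predict` call per sub-model.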

Cons

This approach can become a disadvantage when there is a class imbalance. If we have much more men’s data than women’s data, training a model on the women’s data will be a problem. So it is better to have a fair amount of data for both models, sufficiently balanced between the groups.
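A quick sanity check before splitting can catch this. The snippet below is a sketch: the data and the imbalance threshold are arbitrary choices for illustration, not values the article prescribes.

```python
import pandas as pd

# Hypothetical splitting feature: 180 men vs. 20 women.
gender = pd.Series(["M"] * 180 + ["F"] * 20)

counts = gender.value_counts()
ratio = counts.min() / counts.max()
if ratio < 0.25:  # arbitrary threshold chosen for this example
    # With groups this lopsided, the smaller sub-model may be
    # starved of training data; reconsider splitting here.
    print(f"Imbalanced groups: {counts.to_dict()} (ratio {ratio:.2f})")
```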

We might also have picked the wrong feature to split on, one that does not have a strong impact on the output. In that case we have only increased the complexity of our implementation.

So, test it. Do it only if it is required… :)