How to find whether train data and test data come from the same distribution

Praveen Kotha
2 min read · Aug 9, 2018


Sometimes a model performs really badly on the test data and we don't know why.

One possible reason is that the test data and the train data do not follow the same distribution.

If we have only 1 or 2 features, we can plot the data and compare the two distributions by eye.

But how do we check this when we have more than 2 features, as in most real-world data?

Follow this step-by-step procedure.

Procedure:

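  • Divide your dataset (D) into train (Dtrain) and test (Dtest) sets.
  • Replace all the labels of Dtrain with 1 (positive), which results in Dtrain¹, and all the labels of Dtest with 0 (negative), which results in Dtest¹.
  • Combine Dtrain¹ and Dtest¹ to get a new dataset D¹.
  • Shuffle D¹ and split it into a new train part and a held-out part, so that both parts contain points with label 1 and points with label 0. Train a binary classifier on the new train part and measure its accuracy on the held-out part. (A classifier trained on Dtrain¹ alone would only ever see label 1 and could learn nothing.)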
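
The procedure above can be sketched in a few lines of Python. This is only a minimal sketch, assuming scikit-learn and NumPy are available. The random forest, the 5-fold cross-validation (used here instead of a single split of D¹) and the helper name `domain_classifier_accuracy` are my own illustrative choices, not part of the procedure itself, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def domain_classifier_accuracy(X_train, X_test, seed=0):
    """Cross-validated accuracy of a classifier that separates train rows from test rows."""
    # Build D¹: stack the features and label the origin (1 = train row, 0 = test row).
    X = np.vstack([X_train, X_test])
    origin = np.concatenate([
        np.ones(len(X_train), dtype=int),
        np.zeros(len(X_test), dtype=int),
    ])
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    # For a classifier, cv=5 uses stratified folds, so every fold sees both labels.
    scores = cross_val_score(clf, X, origin, cv=5, scoring="accuracy")
    return scores.mean()


# Synthetic illustration: 5 features, 500 rows per set.
rng = np.random.default_rng(0)
same = domain_classifier_accuracy(
    rng.normal(0, 1, (500, 5)), rng.normal(0, 1, (500, 5))
)
shifted = domain_classifier_accuracy(
    rng.normal(0, 1, (500, 5)), rng.normal(2, 1, (500, 5))
)
print(f"same distribution:    {same:.2f}")
print(f"shifted distribution: {shifted:.2f}")
```

When both sets are drawn from the same distribution the accuracy should stay near 0.5, and when the test set is shifted it should be much higher.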

Case 1:

If the accuracy is low, i.e. close to what random guessing would give (about 50% when D¹ contains as many 1s as 0s), we can conclude that the train and test distributions are similar.

Let me elaborate briefly. When will the accuracy be low in this case? When the classifier predicts label 1 for many test points (actual label 0), and label 0 for many train points. If the model predicts label 1 for a test point, that point looks just like train data to the model.

Case 2:

If the accuracy is medium, i.e. clearly better than random guessing but far from perfect, we can conclude that the train and test distributions are not very similar.

Case 3:

If the accuracy is high, i.e. the classifier can almost always tell a train point from a test point, we can conclude that the train and test distributions are not similar.
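
One thing to watch when reading the three cases: "low" only means something relative to the share of the larger group. With an 80/20 split, a classifier that always predicts 1 already gets 80% accuracy. The small helper below is a sketch of how the cases could be encoded. The function name and the two cut-offs (`similar_below`, `different_above`) are arbitrary assumptions for illustration, not established thresholds.

```python
def interpret_accuracy(accuracy, n_train, n_test,
                       similar_below=0.1, different_above=0.6):
    """Map the accuracy of the train-vs-test classifier to one of the three cases."""
    # Baseline: accuracy of always predicting the larger group.
    baseline = max(n_train, n_test) / (n_train + n_test)
    # Fraction of the gap between the baseline and a perfect score that was closed.
    lift = (accuracy - baseline) / (1.0 - baseline)
    if lift <= similar_below:
        return "similar"           # Case 1
    if lift < different_above:
        return "not very similar"  # Case 2
    return "not similar"           # Case 3


print(interpret_accuracy(0.81, n_train=800, n_test=200))
print(interpret_accuracy(0.70, n_train=500, n_test=500))
print(interpret_accuracy(0.95, n_train=500, n_test=500))
```

Note that 81% accuracy on an 80/20 split falls under Case 1, because it is barely above the 80% baseline.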

Solution for this problem:

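So what can we do to improve our performance on the test data?

1. Drop the drifting features, i.e. the features whose distribution differs between train and test.

2. Reweight the training points with importance weights obtained through density ratio estimation, so that training points that look like test data count more.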
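
For the first option, one possible way to spot drifting features is to repeat the same train-vs-test trick on one feature at a time. The sketch below assumes scikit-learn and NumPy. The shallow decision tree, the ROC AUC score and the 0.7 threshold are illustrative assumptions, and `drifting_features` is a hypothetical helper name.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def drifting_features(X_train, X_test, threshold=0.7):
    """Return the column indices that, on their own, separate train rows from test rows."""
    X = np.vstack([X_train, X_test])
    origin = np.concatenate([
        np.ones(len(X_train), dtype=int),   # 1 = train row
        np.zeros(len(X_test), dtype=int),   # 0 = test row
    ])
    drifting = []
    for j in range(X.shape[1]):
        clf = DecisionTreeClassifier(max_depth=3, random_state=0)
        # Use only column j; an AUC near 0.5 means the column carries no origin signal.
        auc = cross_val_score(clf, X[:, [j]], origin, cv=5, scoring="roc_auc").mean()
        if auc > threshold:
            drifting.append(j)
    return drifting


# Synthetic illustration: three features, only column 1 is shifted in the test set.
rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, (500, 3))
X_test = rng.normal(0, 1, (500, 3))
X_test[:, 1] += 2.0

drifted = drifting_features(X_train, X_test)
X_train_kept = np.delete(X_train, drifted, axis=1)
X_test_kept = np.delete(X_test, drifted, axis=1)
print("drifting columns:", drifted)
```

Dropping a feature also throws away whatever it tells us about the target, so it is worth checking the model's validation score before and after.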

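For the second option, the same train-vs-test classifier can be reused to estimate the density ratio. By Bayes' rule, p_test(x) / p_train(x) = [P(test | x) / P(train | x)] · [n_train / n_test]. The sketch below is one simple way to do this, assuming scikit-learn and NumPy. Logistic regression is my choice for illustration, `importance_weights` is a hypothetical helper name, and dedicated density ratio estimators exist that are not shown here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def importance_weights(X_train, X_test):
    """Estimate w(x) = p_test(x) / p_train(x) for every training row."""
    X = np.vstack([X_train, X_test])
    origin = np.concatenate([
        np.ones(len(X_train), dtype=int),   # 1 = train row
        np.zeros(len(X_test), dtype=int),   # 0 = test row
    ])
    clf = LogisticRegression(max_iter=1000).fit(X, origin)
    # Columns of predict_proba follow clf.classes_ == [0, 1], so column 1 is P(train | x).
    p_train = np.clip(clf.predict_proba(X_train)[:, 1], 1e-6, 1 - 1e-6)
    # Odds of "test" vs "train", corrected for the sizes of the two sets.
    return (1.0 - p_train) / p_train * (len(X_train) / len(X_test))


# Synthetic illustration: the test set is shifted to the right by 1.
rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, (1000, 1))
X_test = rng.normal(1.0, 1.0, (1000, 1))

w = importance_weights(X_train, X_test)
# Training rows with larger x look more like test rows, so they get larger weights.
print("weight of smallest x:", w[np.argmin(X_train[:, 0])])
print("weight of largest x: ", w[np.argmax(X_train[:, 0])])
```

Many scikit-learn estimators accept such weights through the `sample_weight` argument of `fit`, for example `model.fit(X_train, y_train, sample_weight=w)`.
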
If you have any questions/suggestions, feel free to comment.

You can connect with me through LinkedIn.
