How to solve mismatch in train and test set after categorical encoding?

Jun 7, 2018

When I first started implementing machine learning models, I ran into a problem: after one-hot encoding, the training and test sets had different numbers of columns. That is to be expected, since a category that appears in one set may be missing from the other, but it meant that a model trained on the training set could not predict on the test set because the columns did not match.
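To make the problem concrete, here is a minimal sketch of how the mismatch arises. The tiny `train` and `test` frames and the `city` column are made up purely for illustration:

```python
import pandas as pd

# Hypothetical toy data: "LA" appears only in train, "SF" only in test
train = pd.DataFrame({"city": ["NY", "LA", "NY"]})
test = pd.DataFrame({"city": ["SF", "NY"]})

# One-hot encode each set on its own
train_enc = pd.get_dummies(train, columns=["city"])
test_enc = pd.get_dummies(test, columns=["city"])

print(list(train_enc.columns))  # ['city_LA', 'city_NY']
print(list(test_enc.columns))   # ['city_NY', 'city_SF']
```

Each set only gets columns for the categories it actually contains, so the two encoded frames no longer line up.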

After a bit of research and some experience, I arrived at the following three solutions to this problem:

>>>. Pandas `factorize`

Instead of pandas get_dummies you can use pd.factorize. Syntactically the two are similar, but the primary difference is that get_dummies returns one column per category, whereas pd.factorize returns a single column of integer codes. (pd.factorize actually returns a tuple of the codes and the unique values, which is why the example below takes element [0].)

Example:

df['column_name'] = pd.factorize(df['column_name'])[0]

Personally, this is one of my preferred approaches, especially when a column has too many categories for my machine to handle.
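One caveat worth keeping in mind: pd.factorize assigns codes in order of first appearance, so factorizing train and test independently can give the same label different codes. The sketch below is one possible way to keep the codes consistent by reusing the unique values learned from the training set. The frames, the `color` column, and the `color_code` column are hypothetical names chosen for illustration:

```python
import pandas as pd

# Hypothetical toy data: "purple" never appears in the training set
train = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
test = pd.DataFrame({"color": ["green", "purple", "red"]})

# factorize returns (codes, uniques); codes follow order of first appearance
codes, uniques = pd.factorize(train["color"])
train["color_code"] = codes

# Reuse the training uniques so each label keeps the same code in test;
# labels that were not seen in training are mapped to -1
test["color_code"] = uniques.get_indexer(test["color"])

print(train["color_code"].tolist())  # [0, 1, 2, 1]
print(test["color_code"].tolist())   # [1, -1, 0]
```

How to treat the -1 rows (drop them, or map them to a dedicated "unknown" bucket) is a modelling decision that this sketch leaves open.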

Want to know the diff among pd.factorize, pd.get_dummies, sklearn.preprocessing.LableEncoder and OneHotEncoder


>>>. Combining the two datasets and then doing the encoding on the combined dataset

If you really want to use get_dummies or any encoder other than pd.factorize (sometimes this is necessary, since the choice of encoding can affect efficiency and accuracy), first make sure that both the train and test datasets contain the same columns and, more importantly, in the same order.

Then do the following.

Make one new column in both the train and test data and assign it 1 for the train set and 0 for the test set.

Then concatenate these two datasets into one new dataset.

One very important thing to note here is that we haven’t specified the axis while concatenating, which means we are combining along the rows. This stacks the test set below the train set, with the ‘train’ column acting as the demarcation (all rows with 1 belong to the train set and those with 0 to the test set).

Now apply the encoding you need to the required column and save the result in a new dataset.

Concatenate that new dataset with the combined dataset, this time along the columns.

Now it’s time to separate the two datasets again, and we are done: both the train and test sets now contain the same number of columns.
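The steps above can be sketched roughly as follows. The toy frames and the `city` and `age` columns are assumptions made for illustration, and `ignore_index=True` and `reset_index` are just tidy-ups I added, not requirements of the approach:

```python
import pandas as pd

# Hypothetical toy data with a category ("SF") that only appears in test
train = pd.DataFrame({"city": ["NY", "LA", "NY"], "age": [25, 32, 47]})
test = pd.DataFrame({"city": ["SF", "NY"], "age": [51, 38]})

# Marker column: 1 for train rows, 0 for test rows
train["train"] = 1
test["train"] = 0

# No axis given, so the frames are stacked along the rows
combined = pd.concat([train, test], ignore_index=True)

# Encode the categorical column on the combined data
dummies = pd.get_dummies(combined["city"], prefix="city")

# Attach the encoded columns side by side (axis=1) and drop the original
combined = pd.concat([combined.drop(columns="city"), dummies], axis=1)

# Split back using the marker column, then discard the marker
train_enc = combined[combined["train"] == 1].drop(columns="train")
test_enc = (
    combined[combined["train"] == 0]
    .drop(columns="train")
    .reset_index(drop=True)
)

print(list(train_enc.columns))  # ['age', 'city_LA', 'city_NY', 'city_SF']
print(list(test_enc.columns))   # ['age', 'city_LA', 'city_NY', 'city_SF']
```

Both frames end up with identical columns in the same order, so a model fitted on `train_enc` can be applied to `test_enc`.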

>>>Implementing PCA

Another method is to encode both sets separately and then apply PCA to reduce the test set to the same number of columns as the training set.

Personally, I would discourage using PCA because it is rather cumbersome, and dropping components with PCA discards some information, which may affect the model.

Please do comment if you have any questions or if anything is unclear.

Thank you,😀