5 Dataframe Tricks in Pandas

Salman Hossain
2 min read · Apr 15, 2022

1. Using replace to encode a binary feature instead of one-hot encoding

Before you reach for one-hot encoding or get_dummies on a categorical feature, check whether it is a binary feature such as gender or a yes/no column. If it is, you can simply use replace() to encode it.
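As a minimal sketch of the idea, assuming a hypothetical yes/no column named "subscribed" (the column name and values are invented for illustration):

```python
import pandas as pd

# Hypothetical data: the column name and its values are made up for this example.
# The strings are stored with object dtype, the classic pandas default for text.
df = pd.DataFrame({"subscribed": pd.Series(["yes", "no", "no", "yes"], dtype="object")})

# Map each of the two categories to an integer, then make the dtype explicit.
df["subscribed_encoded"] = df["subscribed"].replace({"yes": 1, "no": 0}).astype(int)

print(df)
```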

Encoding categorical data with replace

2. Checking for imbalanced columns

It is unlikely that your data set has a similar number of samples for each category, so here is a quick way to check: unique() lists the categories in a column, and value_counts() counts the rows in each one.
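A small sketch of such a check, assuming a hypothetical column named "label" (the data is invented for illustration):

```python
import pandas as pd

# Hypothetical data with one dominant category.
df = pd.DataFrame({"label": ["a", "a", "a", "b", "a", "c"]})

# Distinct categories found in the column.
categories = df["label"].unique()

# Number of rows per category, most frequent first.
counts = df["label"].value_counts()

# The same information as a fraction of all rows.
shares = df["label"].value_counts(normalize=True)

print(categories)
print(counts)
print(shares)
```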

Counting the number of rows for each unique category in a column

3. Using sample to randomly select rows, useful when you have imbalanced data sets

sample() in pandas doesn’t create artificial samples like SMOTE does; it takes random samples from the original data set. You can also use the optional parameter replace to allow or disallow drawing the same row more than once; it is disabled by default. Depending on the context of your specific data set, this might be okay.
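One way this could look, as a sketch rather than the only approach: downsample every class to the size of the smallest one, or oversample a minority class with replace=True. The column names and data are hypothetical.

```python
import pandas as pd

# Hypothetical imbalanced data: 7 rows of class "a", 3 rows of class "b".
df = pd.DataFrame({
    "feature": range(10),
    "label": ["a"] * 7 + ["b"] * 3,
})

# Size of the smallest class.
n_min = int(df["label"].value_counts().min())

# Downsample: draw n_min rows from each class without replacement.
balanced = df.groupby("label").sample(n=n_min, random_state=0)

# Oversample: draw 7 rows from the 3 "b" rows, so replace=True is required.
upsampled_b = df[df["label"] == "b"].sample(n=7, replace=True, random_state=0)

print(balanced["label"].value_counts())
print(len(upsampled_b))
```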

Using sample to get the same number of samples for each unique category of a feature. This can help when the data set is class imbalanced.

4. Stratification in your train/test split

Another imbalanced data set trick is to make sure that each class label appears in the same proportion in the training and test sets, as a skewed split can affect training. This is something I didn’t know about until recently. Learn more about it here: https://scikit-learn.org/stable/modules/cross_validation.html#stratification
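A minimal sketch of a stratified split, assuming scikit-learn's train_test_split with its stratify parameter and a hypothetical data set with an 80/20 class ratio:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 16 rows of class "a" and 4 rows of class "b".
df = pd.DataFrame({
    "feature": range(20),
    "label": ["a"] * 16 + ["b"] * 4,
})

X = df[["feature"]]
y = df["label"]

# stratify=y keeps the class proportions of y in both the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

print(y_train.value_counts())
print(y_test.value_counts())
```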

5. Encoding categorical values that aren’t binary

When you encounter categorical values that aren’t simply binary, you will need to create dummy variables so that your model can be trained. This can be done using pd.get_dummies().
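For illustration, a short sketch with a hypothetical three-valued "color" column; dtype=int is passed here only so that the dummy columns print as 0/1:

```python
import pandas as pd

# Hypothetical data: a categorical column with three distinct values.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One new 0/1 column per category; the original "color" column is replaced.
dummies = pd.get_dummies(df, columns=["color"], dtype=int)

print(dummies)
```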
