5 Dataframe Tricks in Pandas
1. Using replace() to encode a binary feature instead of one-hot encoding
Before you go ahead and use one-hot encoding or get_dummies() on a categorical feature, check whether it’s a binary feature like gender or a yes/no feature. If it is, you can just use replace() to map the two categories to 0 and 1.
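A minimal sketch of the idea, assuming a toy DataFrame whose column names and values are made up for illustration. The trailing astype(int) is my own addition so the result is an integer column regardless of how a given pandas version handles the dtype after replace().

```python
import pandas as pd

# Hypothetical toy data; column names and categories are illustrative only.
df = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],
    "subscribed": ["yes", "no", "yes", "yes"],
})

# Map each of the two categories to 0/1 with an explicit dictionary,
# then cast so the column ends up with an integer dtype.
df["gender"] = df["gender"].replace({"male": 0, "female": 1}).astype(int)
df["subscribed"] = df["subscribed"].replace({"no": 0, "yes": 1}).astype(int)

print(df)
```

Spelling out the mapping in a dictionary keeps it obvious which category became 1, which helps later when you read model coefficients.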
2. Checking for imbalanced columns
It is unlikely that your data set will have a similar number of samples in each category, so it is worth checking. unique() lists the distinct values in a column, and value_counts() shows how many rows fall into each one.
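A quick sketch of such a check, assuming a hypothetical label column. unique() only tells you which categories exist, so value_counts() is used here to get the actual counts and shares.

```python
import pandas as pd

# Hypothetical labels; 6 "ham" rows versus 2 "spam" rows.
df = pd.DataFrame({
    "label": ["spam", "ham", "ham", "ham", "ham", "spam", "ham", "ham"],
})

categories = df["label"].unique()                  # distinct values only
counts = df["label"].value_counts()                # number of rows per category
shares = df["label"].value_counts(normalize=True)  # fraction of rows per category

print(categories)
print(counts)
print(shares)
```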
3. Using sample() to randomly select rows, useful when you have imbalanced data sets
sample() in pandas doesn’t create artificial samples like SMOTE does; it draws random rows from the original data set. The optional replace parameter controls whether the same row can be drawn more than once, and it is disabled by default. Depending on the context of your specific data set, repeating rows this way might be okay.
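Here is a rough sketch of both directions, assuming a hypothetical two-class data set: shrinking the majority class without replacement, and growing the minority class with replace=True. The column names and the random_state value are illustrative choices, not part of any fixed recipe.

```python
import pandas as pd

# Hypothetical data: 8 rows of class "a" and 2 rows of class "b".
df = pd.DataFrame({"feature": range(10), "label": ["a"] * 8 + ["b"] * 2})

majority = df[df["label"] == "a"]
minority = df[df["label"] == "b"]

# Undersample: draw as many majority rows as there are minority rows.
# replace defaults to False, so no row is picked twice.
majority_down = majority.sample(n=len(minority), random_state=0)
balanced_down = pd.concat([majority_down, minority])

# Oversample: draw minority rows with replacement until the classes match.
# The same original rows are repeated; nothing new is synthesized.
minority_up = minority.sample(n=len(majority), replace=True, random_state=0)
balanced_up = pd.concat([majority, minority_up])

print(balanced_down["label"].value_counts())
print(balanced_up["label"].value_counts())
```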
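4. Stratification in your train/test split
Another trick for imbalanced data sets is to stratify the split, so that the training and test sets each keep the same proportion of every class label as the full data set. Without it, a rare class can end up under-represented in one of the splits, which can affect training. Something that I didn’t know about until recently. Learn more about it here: https://scikit-learn.org/stable/modules/cross_validation.html#stratification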
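One way to do this, sketched under the assumption that you split with scikit-learn's train_test_split() and pass the labels to its stratify argument. The data, column names, and test_size below are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 16 rows of class 0 and 4 rows of class 1 (20% positives).
df = pd.DataFrame({"feature": range(20), "label": [0] * 16 + [1] * 4})
X = df[["feature"]]
y = df["label"]

# stratify=y asks the splitter to keep the class proportions of y
# in both the training and the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```

Comparing the two printed distributions is a cheap sanity check that the split kept the class balance.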
5. Encoding categorical values that aren’t binary
When you encounter categorical values that aren’t simply binary, you will need to create dummy variables so that your model can be trained. This can be done using pd.get_dummies().
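A small sketch of what that looks like, assuming a hypothetical color column with three categories. Passing dtype=int is an optional choice here so the dummy columns hold 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical data with one non-binary categorical column.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": [1, 2, 3, 2],
})

# Each category in "color" becomes its own 0/1 column named color_<value>;
# columns not listed in `columns` are passed through unchanged.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)

print(encoded)
```

If one of the dummy columns is redundant for your model, get_dummies() also accepts drop_first=True to leave out the first category.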