The Ultimate Python Package to Pre-Process Data for Machine Learning

NirBarazida
Oct 6, 2020 · 7 min read

How many times have you received a raw dataset and performed the same actions to pre-process it? Copied and pasted code from different projects and re-used it? For this reason, the ‘NBprocessing’ Python package was created. It provides many methods, under a variety of classes, that enable the user to explore the dataset, pre-process it, and finally plot insights.

This blog post covers the importance of preprocessing, package installation, the package's libraries and utilities, and selected usage examples.

To see the full usage documentation, click here

What is the importance of preprocessing?

Exploring and pre-processing our dataset is probably the most important step in building an efficient Machine Learning model. Raw data contains noise, missing values, inconsistent representation of features, and many more issues. To achieve an accurate model that helps solve the business problem and performs high-precision forecasting, one will have to manipulate this raw data.

There are several steps to take. First, understand the business problem at hand and the purpose of the model. Second, explore the dataset (EDA): the distribution and correlation of features, missing values, etc. Next, process the dataset: take care of missing values and outliers, handle imbalanced features, etc. Finally, plot the insights you have derived.

“Big data isn’t about bits, it’s about talent”, Douglas Merrill

What differentiates a brilliant Data Scientist from a good one is the ability to process the dataset based on the given business problem and the features at hand. Never underestimate the importance of EDA and pre-processing. The data is the foundation of the model, and the model's performance depends heavily on the data that you provide.

To read more about EDA and pre-processing data, I highly recommend this blog post: Data Preprocessing Concepts, written by Pranjal Pandey.

Package installation

NBprocessing is a Python package for tabular data that is loaded into a pandas DataFrame. This is a basic and important point to understand before using the package, because it will not work with other tools.

pip install NBprocessing

Package libraries and utilities

Import:

Package Libraries

NBcategorical - contains functions that are relevant to categorical features.

NBcontinuous - contains functions that are relevant to continuous features.

NBgeneral - contains general functions.

NBplot - contains plotting functions.

Selected usage examples

The examples below are shown using this car rental dataset that contains the following features:

Sample of the dataset

As you can see, only features with missing values are shown in the table.
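To make the idea concrete, here is a minimal sketch of how such a missing-values table could be built with plain pandas. It is not the NBprocessing API: the function name `missing_values_table`, the sample data, and the `fuel` and `year` columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def missing_values_table(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize missing values, listing only columns that have any.

    Hypothetical helper written for this post, not part of NBprocessing.
    """
    counts = df.isna().sum()
    table = pd.DataFrame(
        {"missing": counts, "percent": 100 * counts / len(df)}
    )
    # Keep only the features that actually have missing values.
    table = table[table["missing"] > 0]
    return table.sort_values("percent", ascending=False)


# Illustrative sample; the values and most column names are made up.
cars = pd.DataFrame(
    {
        "km_driven": [45000, 120000, np.nan, 30000],
        "fuel": ["Petrol", None, None, "Diesel"],
        "year": [2014, 2010, 2017, 2012],
    }
)
print(missing_values_table(cars))
```

Because `year` has no missing values, it does not appear in the resulting table.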

On the right side, we can see the share of each category when the missing values are counted as part of the split. On the left side, the split is shown without the missing values. The result we would like to achieve is to keep the left-side ratio while having no missing values.

As you can see from the above, we successfully filled the missing values and kept the category ratio.
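One way to get this behaviour with plain pandas is to draw each replacement value at random from the categories that were observed, weighted by how often each one occurs. The sketch below illustrates that idea only. It may not be how NBprocessing does it, and the function name `fill_na_keep_ratio` and the `fuel` data are assumptions.

```python
import numpy as np
import pandas as pd


def fill_na_keep_ratio(series: pd.Series, seed: int = 0) -> pd.Series:
    """Fill missing categories by sampling from the observed distribution.

    Hypothetical helper written for this post, not part of NBprocessing.
    """
    rng = np.random.default_rng(seed)
    # value_counts ignores missing values by default, so these are the
    # category ratios "without the missing value portion".
    ratios = series.value_counts(normalize=True)
    missing = series.isna()
    filled = series.copy()
    filled.loc[missing] = rng.choice(
        ratios.index.to_numpy(), size=int(missing.sum()), p=ratios.to_numpy()
    )
    return filled


# Illustrative data: 6 Petrol, 2 Diesel, 2 missing.
fuel = pd.Series(["Petrol"] * 6 + ["Diesel"] * 2 + [None] * 2)
fuel_filled = fill_na_keep_ratio(fuel)
print(fuel_filled.value_counts(normalize=True))
```

With only a few missing values the ratio after filling is approximate, since the replacements are random draws. It converges to the observed ratio as the number of filled values grows.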

The “Hammer” method

When data is normally distributed, 99.7% of it lies within 3 standard deviations of the mean. Thus, by removing the remaining 0.3% we drop the outliers and keep most of the data. However, features are not always normally distributed, and many have a long tail on only one side. This function therefore lets the user choose what percentage of the data to drop at the top and at the bottom.

‘km_driven’ feature distribution before dropping outliers

As you can see from the above, the ‘km_driven’ feature has a long upper tail and no lower tail. Thus, we only have outliers at the top boundary.

‘km_driven’ feature distribution after dropping outliers
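The “Hammer” idea of dropping a chosen percentage from the top and bottom of a feature can be sketched with pandas quantiles. This is an illustration under assumptions, not the package's implementation: the function name `drop_outliers_by_percent`, its parameters, and the sample data are hypothetical.

```python
import pandas as pd


def drop_outliers_by_percent(
    df: pd.DataFrame,
    column: str,
    bottom_percent: float = 0.0,
    top_percent: float = 0.0,
) -> pd.DataFrame:
    """Drop rows in the lowest/highest given percent of `column`.

    Hypothetical helper written for this post, not part of NBprocessing.
    Rows where `column` is missing are dropped as well.
    """
    lower = df[column].quantile(bottom_percent / 100)
    upper = df[column].quantile(1 - top_percent / 100)
    # between() is inclusive on both ends by default.
    return df[df[column].between(lower, upper)]


# Illustrative data: 'km_driven' takes the values 1..100.
cars = pd.DataFrame({"km_driven": range(1, 101)})

# The feature only has outliers at the top, so only trim the top 5%.
trimmed = drop_outliers_by_percent(cars, "km_driven", top_percent=5)
print(len(cars), len(trimmed))
```

Leaving `bottom_percent` at zero keeps the lower end intact, which matches the case above where the outliers are only at the top boundary.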

The “Tweezers” method

In this case, we would like to drop outliers by giving the function the top and bottom value limits.

First, the user provides the top and bottom boundaries using a dictionary with the feature names as keys. Then, the function checks how much data falls outside the boundaries and would be dropped. If it’s a reasonable amount, the user can use the boundaries to drop the outliers.

Only 1.32% of the data will be lost by this action, so we will proceed.
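The two-step “Tweezers” flow (check how much data would be lost, then drop it) can be sketched in plain pandas as follows. The dictionary layout `{feature: (bottom, top)}`, the function names, and the sample data are assumptions for illustration. The package's actual signatures may differ.

```python
import pandas as pd


def within_bounds(df: pd.DataFrame, bounds: dict) -> pd.Series:
    """Boolean mask of rows whose values lie inside every given boundary."""
    mask = pd.Series(True, index=df.index)
    for column, (bottom, top) in bounds.items():
        mask &= df[column].between(bottom, top)
    return mask


def percent_outside(df: pd.DataFrame, bounds: dict) -> float:
    """Percentage of rows that would be dropped for these boundaries."""
    return float(100 * (~within_bounds(df, bounds)).mean())


def drop_outside(df: pd.DataFrame, bounds: dict) -> pd.DataFrame:
    """Keep only the rows that lie inside every boundary."""
    return df[within_bounds(df, bounds)]


# Illustrative data and boundaries (hypothetical values).
cars = pd.DataFrame({"km_driven": [10, 20, 30, 40, 1000]})
bounds = {"km_driven": (0, 100)}

lost = percent_outside(cars, bounds)
print(f"{lost:.2f}% of the rows would be dropped")
cleaned = drop_outside(cars, bounds)
```

Separating the check from the drop lets the user inspect the loss first and adjust the boundaries before any rows are removed.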

Summary

NBprocessing is a very powerful package that has a lot to offer to the Data Science ecosystem. It provides tools to explore, process, plot, and model tabular data without copying and pasting code from one project to another. It has many more functions, which are shown in this Jupyter Notebook.

If you would like to become a contributor and help grow and promote this package — please don’t hesitate to DM me here, on Github or LinkedIn.

I would like to thank Israel Tech Challenge (ITC) for providing me the platform to create this package, and Shir Meir Lador for the inspiration.

The Startup

Get smarter at building your thing. Join The Startup’s +729K followers.


NirBarazida

Written by

Data Scientist at DAGsHub leading the advocacy and outreach activity worldwide. We are building the next GitHub for Data Science.
