The Ultimate Python Package to Pre-Process Data for Machine Learning
How many times have you received a raw dataset and performed the same actions to pre-process it? Copied and pasted code from different projects and re-used it? For this purpose, the ‘NBprocessing’ Python package was created. It provides many methods, organized into a variety of classes, that enable the user to explore a dataset, pre-process it, and finally plot insights.

This blog-post will go over the following subjects:
- What is the importance of preprocessing?
- Package installation
- Package libraries and utilities
- Selected usage examples
To see the full usage documentation, click here
What is the importance of preprocessing?

Exploring and pre-processing our dataset is probably the most important step in building an efficient Machine Learning model. Raw data contains noise, missing values, inconsistent representation of features, and many more issues. To achieve an accurate model that helps solve the business problem and performs high-precision forecasting, one will have to manipulate this raw data.
There are several steps to take. First, understand the business problem at hand and the purpose of the model. Second, explore the dataset (EDA): the distribution and correlation of features, missing values, etc. Next, process the dataset: take care of missing values and outliers, handle imbalanced features, etc. Last, plot the insights deduced.
“Big data isn’t about bits, it’s about talent” , Douglas Merrill
What differentiates a brilliant Data Scientist from a good one is the ability to process the dataset based on the given business problem and the features at hand. Never underestimate the importance of EDA and pre-processing the data. The data is the foundation of the model, and the model’s performance depends heavily on the data that you provide.
To read more about EDA and Pre-processing data I highly recommend reading this blog post: Data Preprocessing Concepts written by Pranjal Pandey
Package installation

NBprocessing is a Python package for tabular data loaded into a pandas DataFrame. This is a basic but important concept to understand before using the package, because it will not work with other data structures.
- Installation:
Run from your command line prompt
pip install NBprocessing
- Package dependencies:
All package dependencies will be installed/updated while installing NBprocessing. The package dependencies are: Pandas, Numpy, Matplotlib, Seaborn, Plotly, Scikit-Learn.
Package libraries and utilities
Import:
from NBprocessing import NBcategorical
from NBprocessing import NBcontinuous
from NBprocessing import NBplot
from NBprocessing import NBgeneral
Package Libraries
NBcategorical - contains functions that are relevant to categorical features:
remove_categories(database, column_name, categories_to_drop)
fill_na_by_ratio(database, column_name)
combine_categories(database, column_name, category_name="other", threshold=0.01)
categories_not_in_common(train, test, column_name)
category_ratio(database, columns_to_check=None, num_categories=5)
label_encoder_features(database, features_to_encode)
OHE(database, features_list=None)
NBcontinuous - contains functions that are relevant to continuous features:
remove_outliers_by_boundaries(database, column_name, bot_qu, top_qu)
fill_na_timedate(database, column_name)
get_num_outliers_by_value(database, filter_dict_up, filter_dict_down)
remove_outliers_by_value(database, filter_dict_up, filter_dict_down)
NBgeneral - contains general functions:
missing_values(database)
split_and_check(database, column_name, test_size=0.3)
NBplot - contains plots functions:
plot_missing_value_heatmap(database)
plot_corr_heat_map(database)
count_plot(database, column_list=None)
distribution_plot(database, column_list=None)
world_map_plot(database, locations_column, feature, title=None, color_bar_title=None)
Selected usage examples
To see the full usage documentation, click here
The examples below are shown using this car rental dataset that contains the following features:
- year: Year of the car when it was bought
- selling_price: Price at which the car is being sold
- km_driven: Number of Kilometres the car is driven
- fuel: Fuel type of car (petrol / diesel / CNG / LPG / electric)
- seller_type: Tells if a Seller is Individual or a Dealer
- transmission: Gear transmission of the car (Automatic/Manual)
- owner: Number of previous owners of the car.

- Missing Values:
When a feature has many missing values, it may become irrelevant to the model because it carries very little information. For this purpose, ‘NBprocessing’ provides a helpful function that shows the number and percentage of missing values per feature in a table.

As can be seen, only features with missing values were plotted in the table.
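The package’s `missing_values(database)` produces this table. As a rough sketch of what such a summary computes (the helper name below is my own, not the package’s), plain pandas is enough:

```python
import pandas as pd

def missing_values_table(df):
    """Count and percentage of missing values per feature,
    keeping only features that actually have missing values."""
    counts = df.isna().sum()
    counts = counts[counts > 0]  # drop fully populated features
    percent = (counts / len(df) * 100).round(2)
    return pd.DataFrame({"missing": counts, "percent": percent})

# Toy slice of a car-rental-style dataset: 'year' is complete, 'fuel' is not.
cars = pd.DataFrame({
    "year": [2015, 2017, 2018, 2019],
    "fuel": ["petrol", None, "diesel", None],
})
table = missing_values_table(cars)
print(table)
```

Only `fuel` appears in the output, with 2 missing values (50% of the rows).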
- Fill missing values by categories ratio:
One of the hardest parts of dealing with missing values is deciding what to do when they appear in a categorical feature. This function takes the ratio of the categories, excluding the missing values, and fills the missing values while preserving that ratio. For example, the ‘fuel’ feature in our dataset:

On the right side, we can see the portion of each category when the missing values are part of the division. On the left side, the division excludes the missing values. The result we would like to achieve is to keep the left-hand ratio without having any missing values.
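The package exposes this as `fill_na_by_ratio(database, column_name)`. One plausible way to implement the idea, sketched here with my own helper rather than the package’s actual code, is to sample replacement categories from the non-missing distribution:

```python
import numpy as np
import pandas as pd

def fill_na_by_ratio_sketch(df, column, seed=0):
    """Fill NaNs in a categorical column by sampling from the
    existing (non-missing) category distribution, so the
    original category ratios are preserved."""
    ratios = df[column].value_counts(normalize=True)  # ignores NaNs
    mask = df[column].isna()
    rng = np.random.default_rng(seed)
    df.loc[mask, column] = rng.choice(ratios.index, size=mask.sum(), p=ratios.to_numpy())
    return df

# 2/3 petrol, 1/3 diesel among known values; 3 rows are missing.
cars = pd.DataFrame({"fuel": ["petrol"] * 6 + ["diesel"] * 3 + [None] * 3})
cars = fill_na_by_ratio_sketch(cars, "fuel")
print(cars["fuel"].value_counts(normalize=True))
```

After filling, no NaNs remain and every value is one of the original categories.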


- Category Ratio:
This function enables the user to better understand the distribution of categories in each feature. It plots a table with all the input features and the portion of every category in each feature. To highlight imbalanced features, it colors categories with a portion of 90% or higher in red. If no features are selected, it plots all the dataset’s features.
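The package’s `category_ratio(database, columns_to_check=None, num_categories=5)` builds that table; a minimal sketch of the underlying computation (without the red-coloring, which is a display concern) might look like this:

```python
import pandas as pd

def category_ratio_sketch(df, columns_to_check=None, num_categories=5):
    """Portion of the top categories per feature. A portion of
    0.9 or more flags an imbalanced feature (the package shows
    such categories in red)."""
    columns = columns_to_check or list(df.select_dtypes("object"))
    return {
        col: df[col].value_counts(normalize=True).head(num_categories).round(3).to_dict()
        for col in columns
    }

# 'transmission' is heavily imbalanced (95% Manual); 'fuel' is not.
cars = pd.DataFrame({
    "transmission": ["Manual"] * 19 + ["Automatic"],
    "fuel": ["petrol"] * 12 + ["diesel"] * 8,
})
ratios = category_ratio_sketch(cars)
print(ratios)
```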

- Outliers:
Outliers are noisy points in the data that are usually far from the feature mean. ‘NBprocessing’ provides two helpful functions that enable the user to drop outliers after being informed how much data will be lost by doing so.
The “Hammer” method
When data is normally distributed, 99.7% of it lies within 3 standard deviations of the mean. Thus, by removing the remaining 0.3% we drop the outliers while keeping most of the data. However, features are not always normally distributed, and many have only a long upper or lower tail. Thus, this function lets the user choose what percentage of data to drop at the top and bottom boundaries.


As can be seen above, the ‘km_driven’ feature has a long upper tail and no lower tail, so we only have outliers at the top boundary.
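The package exposes this as `remove_outliers_by_boundaries(database, column_name, bot_qu, top_qu)`. A minimal sketch of the quantile-based idea, assuming the boundaries are given as quantiles (my own helper, not the package’s code):

```python
import pandas as pd

def remove_outliers_by_boundaries_sketch(df, column, bot_qu, top_qu):
    """Keep rows whose value lies within the [bot_qu, top_qu]
    quantile boundaries, reporting how much data is dropped."""
    low, high = df[column].quantile([bot_qu, top_qu])
    mask = df[column].between(low, high)
    print(f"{100 * (1 - mask.mean()):.2f}% of the rows will be dropped")
    return df[mask]

# 100 ordinary readings plus one extreme 'km_driven' outlier.
cars = pd.DataFrame({"km_driven": list(range(10_000, 110_000, 1_000)) + [900_000]})
trimmed = remove_outliers_by_boundaries_sketch(cars, "km_driven", 0.0, 0.99)
print(trimmed["km_driven"].max())
```

Since this feature only has a long upper tail, the bottom quantile is left at 0.0 and only the top 1% is trimmed, which removes the single extreme value.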

The “Tweezers” method
In this case, we would like to drop outliers by giving the function the top and bottom value limits.

First, the user provides the top and bottom boundaries using a dictionary with the feature names as keys. The function then checks how much data falls outside the boundaries and would be dropped. If it is a reasonable amount, the user can apply the boundaries to drop the outliers.

Only 1.32% of the data will be lost by this action, so we proceed.

Summary
NBprocessing is a powerful package with a lot to offer the Data Science ecosystem. It provides tools to explore, process, plot, and model tabular data without copying and pasting code from one project to another. It has many more functions, which are shown in this Jupyter Notebook.
If you would like to become a contributor and help grow and promote this package — please don’t hesitate to DM me here, on Github or LinkedIn.
I would like to thank Israel Tech Challenge (ITC) for providing me the platform to create this package, and Shir Meir Lador for the inspiration.
