Data Preprocessing with Python

Muktha Sai Ajay · Published in DataSeries · Sep 29, 2020

Making data understandable

Since high-quality data leads to better predictions, data preprocessing has become a fundamental step in data science and machine learning. In this article we’ll talk about why preprocessing matters and walk through the different approaches in sequence.

What Is Data Preprocessing?

It is a technique that transforms raw data into an understandable format. Real-world (raw) data is almost always incomplete, and such data cannot be sent through models as it would cause errors. That is why we need to preprocess data before sending it through a model.

Here are the steps I have followed:

  1. Import libraries
  2. Read the dataset
  3. Split the dataset into independent and dependent variables
  4. Handle missing values
  5. Handle categorical values
  6. Standardization / feature scaling

Step 1: Import Libraries

The first step is usually importing the libraries that will be needed in the program. A library is essentially a collection of modules that can be called and used. Here we will be using:

Pandas: used for data manipulation and data analysis.
NumPy: the fundamental package for scientific computing with Python.
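A minimal sketch of the imports used throughout this article:

```python
# pandas for data manipulation and analysis
import pandas as pd

# NumPy for numerical operations on arrays
import numpy as np
```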

Step 2: Import the Dataset

Most datasets come in .csv (comma-separated values) format. It’s important to keep the dataset in the same folder as your program and read it using the read_csv method from the pandas library.
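A minimal sketch, assuming the file is named Data.csv (a placeholder name) and sits in the same folder as the script:

```python
import pandas as pd

# Read the CSV file into a DataFrame; "Data.csv" is a hypothetical filename
dataset = pd.read_csv("Data.csv")

# Peek at the first few rows to confirm the file loaded correctly
print(dataset.head())
```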

Step 3: Split the data into independent and dependent features

We will create a matrix of features in our dataset by splitting it into an independent variable (X) and a dependent variable (Y). To read the columns, we will use pandas’ iloc, which takes two parameters: [row selection, column selection].

Passing : for the rows selects all rows in the data. For the columns we pass :-1, which selects all the columns except the last one.
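Continuing the sketch above, and assuming the dependent variable is the last column of the dataset, the split might look like this:

```python
# All rows, every column except the last -> independent features
X = dataset.iloc[:, :-1].values

# All rows, only the last column -> dependent variable
Y = dataset.iloc[:, -1].values
```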


Step 4: Handling Missing Values

Sometimes we find that some data is missing from the dataset. Missing values need to be handled carefully because they reduce the quality of our performance metrics and predictions. No model can handle NULL or NaN values on its own, so we need to deal with them. First, we check whether we have null values in our dataset at all, using the isnull() method.
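For example, to count the missing values per column:

```python
# Number of missing values in each column
print(dataset.isnull().sum())

# True if the dataset contains any missing value at all
print(dataset.isnull().values.any())
```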

Handling missing values is one of the greatest challenges analysts face, because choosing the right method is what produces robust data models. Let us look at different ways of imputing the missing values.

Deleting Rows

This is the most commonly used method. We either delete the rows that contain null values, or drop an entire column if more than 60% of its values are missing. A column should only be dropped this way when it has little or no significance for the model’s predictions.
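A sketch of both variants with pandas:

```python
# Drop every row that contains at least one null value
dataset_no_na_rows = dataset.dropna(axis=0)

# Keep only the columns where at most 60% of the values are missing
dataset_trimmed = dataset.loc[:, dataset.isnull().mean() <= 0.6]
```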

Replacing With Mean/Median/Mode

This method can be applied to features that consist of numerical data. We calculate the mean, median, or mode of the feature and replace the missing values with it. This method gives better results than removing rows and columns.
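One way to do this is with scikit-learn’s SimpleImputer. The sketch below assumes the numerical features sit in columns 1 and 2 of X (a placeholder choice) and uses the mean strategy; "median" or "most_frequent" work the same way:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Replace NaNs in the numeric columns (assumed to be columns 1-2)
# with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])
```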

Step 5: Handling Categorical Data

Sometimes our data is in text form, with categories expressed as text labels. It is complicated for machines to understand and process text, since models are based on mathematical equations and calculations. Therefore we need to encode the text into numbers.

To make this happen we import a class called LabelEncoder from scikit-learn’s preprocessing module. We create an object of that class, which we will call labelencoder_X, and its fit_transform method does the encoding for us.
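A sketch of the encoding, assuming the categorical column sits at index 0 of X:

```python
from sklearn.preprocessing import LabelEncoder

# Replace the text categories in column 0 with integer codes
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
```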

Now the text has been replaced by numbers. But what if there are more than two categories? If we keep assigning integers to different categories, it leads to confusion: suppose we have four categories and assign the first category 0 and the last category 3. Since 1 is greater than 0 and 3 is greater than 1, the model’s equations would treat category 3 as having higher priority than category 0, even though the categories carry no such order. To resolve this problem we use dummy variables, creating n columns for n categories, with the help of OneHotEncoder.

We will import another class called OneHotEncoder from scikit-learn and create an object of it. In older versions of scikit-learn the constructor took a categorical_features parameter holding the index of the column to encode; in current versions the column is selected with ColumnTransformer instead (see the sketch after the next paragraph). In both cases fit_transform() performs the one-hot encoding.

ColumnTransformer allows different columns of the input to be transformed separately, and the features generated are concatenated to form a single feature space. It is useful for heterogeneous data transformations.
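A sketch combining the two classes, again assuming the categorical column is at index 0:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode column 0 and pass the remaining columns through unchanged;
# the concatenated result forms a single feature matrix
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)
X = ct.fit_transform(X)
```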

Step 6: Feature Scaling

It is used to standardize the range of the independent variables. Feature scaling limits the range of variables so that they can be compared on a common scale.

Why is it necessary?

Most machine learning models are based on Euclidean distance, √((x2 − x1)² + (y2 − y1)²). If one feature has a far greater range of values than another, the squared difference of the smaller-range feature is dwarfed by the larger one and is treated almost as if it does not exist. We do not want that to happen. That is why it is necessary to transform all our variables onto the same scale. There are two common ways to do this.

Normalization

With the help of normalization, we scale the feature values into the range between 0.0 and 1.0, typically as x' = (x − x_min) / (x_max − x_min).
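The article does not show code for this step, but a minimal sketch with scikit-learn’s MinMaxScaler would be:

```python
from sklearn.preprocessing import MinMaxScaler

# Rescale each feature to [0, 1]: x' = (x - x_min) / (x_max - x_min)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
```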

Standardization

It scales features to have a mean of zero and a standard deviation of one: x' = (x − μ) / σ.

We need to import StandardScaler from scikit-learn’s preprocessing module and create an object of that class.

It’s time to fit and transform our X_train set. When we apply StandardScaler to the training and test sets, we fit and transform only the training set; for the test set we only transform, without fitting, so that the scaling statistics come solely from the training data. This transforms all the values to a standardized scale.
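A sketch, assuming the data has already been split into X_train and X_test:

```python
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

# Fit on the training set (learns each feature's mean and std), then transform it
X_train = sc.fit_transform(X_train)

# Only transform the test set with the statistics learned from training,
# so no information leaks from the test data
X_test = sc.transform(X_test)
```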

Thank you for reading my article. I will be happy to hear your opinions. Follow me on Medium to get updated on my latest articles. You can also connect with me on LinkedIn and Twitter. Check out my blogs on Machine Learning and Deep Learning.
