Part-1: Data preparation made easy with Python!!

Neha Kushwaha · Published in Analytics Vidhya · Feb 17, 2020 · 6 min read

Are you aware of how much time a data scientist spends on data preparation? Any guesses yet? You will be surprised to know that a data scientist spends about 80% of their time preparing data, and 60% of that goes into cleaning the data itself.
This is where you will be spending most of your time, so it is an important, non-negotiable step before the data is ready to be fed to machine learning models. Let's see how we can do it. Why should you read this article? By the time you finish reading, you will know:

  • How to deal with raw data using the pandas and NumPy libraries in Python.
  • What numerical and categorical variables are, and how to handle them.
  • How to find duplicate values and handle missing values and outliers.
  • How to scale the data, and why it is important, with its visual impact.

Step 1: Load the dataset and store it in a DataFrame

We will cover an example of reading .csv and .xlsx files into a DataFrame. You can download a sample dataset here to see the code below in action.

Code to read and load the csv or excel files
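The loading step can be sketched as follows. To keep the snippet self-contained, it first writes a tiny sample CSV to disk; the file and column names are made up for illustration:

```python
import pandas as pd

# Create a tiny sample CSV so this sketch is self-contained
# (file name and columns are hypothetical)
csv_text = "city,gender,age,income,illness\nDelhi,M,34,50000,No\nMumbai,F,29,62000,Yes\n"
with open("sample.csv", "w") as f:
    f.write(csv_text)

# Read the file into a DataFrame
df = pd.read_csv("sample.csv")
print(df.head())   # first rows
print(df.shape)    # (2, 5)

# An .xlsx file loads the same way (the openpyxl package must be installed):
# df = pd.read_excel("sample.xlsx")
```

`pd.read_csv` and `pd.read_excel` share most of their useful parameters, such as `sep`, `header` and `usecols`.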

Step 2: Handling missing data

Missing values are a common occurrence in data, and if they are not handled in the training data set, they can reduce model performance or lead to a biased model, resulting in wrong predictions or classifications. A missing value can signify a number of different things in your data.

Perhaps the data was not available, was not applicable, or the event simply did not happen. It could be that the person who entered the data did not know the right value, or missed filling it in. Data mining methods vary in the way they treat missing values, so there should be a strategy for treating them. Let's see how we can do it.

  1. Remove the missing data

#Method 1: List-wise deletion is the process of removing every record (row) that contains a missing value. Although it is a simple process, its disadvantage is a reduction in the power of the model as the sample size decreases.

Remove all the rows with missing data
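A minimal sketch of list-wise deletion with pandas, on a small made-up DataFrame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":    [34, np.nan, 29, 41],
    "income": [50000, 62000, np.nan, 58000],
})

# List-wise deletion: drop every row that has at least one missing value
df_clean = df.dropna()
print(df_clean.shape[0])  # 2 of the 4 rows survive
```

Note how quickly the sample shrinks: two missing cells in different rows cost us half the data.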

#Method 2: Pair-wise deletion is the process of excluding only the specific variables with missing values from an analysis, while continuing to analyze all other variables; the variables chosen will vary from analysis to analysis based on missingness. One disadvantage of this method is that it uses a different sample size for each variable.

The grey column is excluded for data analysis

In the above example, with pair-wise deletion we would compute correlations only between city, income and illness, ignoring correlations involving gender and age; with list-wise deletion, the rows containing missing values would have been removed first, and the analysis could then be carried out on all the features. Let's see it with some pandas code as well.

#Method 3: Retain the Data through imputation

The imputation overcomes the problem of removal of missing records and produces a complete dataset that can be used for machine learning.


#Method 4: Mean, Median and Mode imputation

Imputation is a way to fill in the missing values with estimated ones. The objective is to use relationships identified in the valid values of the data set to help estimate the missing values. For numeric data, mean or median imputation is one of the most frequently used methods, while for categorical data the mode is preferred. To know when to go for the mean, median or mode, you can check my descriptive statistics page here.

Fill missing data with mean/median/mode using below code.
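A minimal sketch of all three strategies, on a made-up DataFrame with one numeric and one categorical column:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":  [25, 30, np.nan, 35, np.nan],
    "city": ["Delhi", "Mumbai", None, "Delhi", "Delhi"],
})

# Numeric column: fill with the mean (use .median() instead for skewed data)
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical column: fill with the mode (the most frequent value)
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```

`mode()` returns a Series (there can be ties), which is why we pick element `[0]`.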

#Method 5: Forward filling

Also commonly known as Last Observation Carried Forward (LOCF), this is the process of replacing a missing value with the last observed record. It is a widely used imputation method for time series data. This method is advantageous because it is easy to communicate, but it is based on the assumption that the outcome remains unchanged by the missing data, which is often unlikely.

Code can be referred at : https://gist.github.com/nkushwah/a942c30d9f48db86e6db0bee8a676b75
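Alongside the linked gist, here is a minimal, self-contained LOCF sketch on a toy time series (dates and values are made up for illustration):

```python
import pandas as pd
import numpy as np

s = pd.Series([10.0, np.nan, np.nan, 14.0, np.nan],
              index=pd.date_range("2020-01-01", periods=5))

# LOCF: each missing value takes the last observed value before it
filled = s.ffill()
print(filled.tolist())  # [10.0, 10.0, 10.0, 14.0, 14.0]
```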

#Method 6: Backward filling

As the name suggests, it is the exact opposite of forward filling and is also commonly known as Next Observation Carried Backward (NOCB). It takes the first observation after the missing value and carries it backward.

For backward filling, simply switch the fill method to "bfill" in the forward-fill example. That won't be difficult, I hope.
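For completeness, a tiny NOCB sketch on a made-up series:

```python
import pandas as pd
import numpy as np

s = pd.Series([np.nan, 12.0, np.nan, np.nan, 15.0])

# NOCB: each gap is filled with the next observed value after it
print(s.bfill().tolist())  # [12.0, 12.0, 15.0, 15.0, 15.0]
```

Note that `ffill` leaves a leading gap unfilled, while `bfill` leaves a trailing gap unfilled.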

#Method 7: Linear Interpolation

Interpolation is a mathematical method that fits a function to the data and uses this function to estimate the missing values. In simple words, it uses a rule to fill the missing values. The simplest type is linear interpolation, which averages the value before the missing data and the value after. Of course, the data could follow a fairly complex pattern, and linear interpolation may not be enough. There are several different types of interpolation; in pandas alone we have options such as 'linear', 'time', 'index', 'values', 'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'polynomial', 'spline', 'piecewise_polynomial' and many more.
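A minimal sketch of linear interpolation on a made-up series:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 9.0])

# Linear interpolation draws a straight line between the known neighbours
result = s.interpolate(method="linear").tolist()
print(result)  # [1.0, 2.0, 3.0, 5.0, 7.0, 9.0]

# Higher-order methods need an order argument and SciPy installed, e.g.:
# s.interpolate(method="polynomial", order=2)
```

Note how the two gaps between 3.0 and 9.0 are filled with evenly spaced steps (5.0 and 7.0), not a simple mean.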

These are a few methods you can follow to fill in your missing data, but I highly recommend going through this article to find more interesting and accurate ways to replace your missing values.

Step 3: Check for duplicate values

Another reason your model's performance might suffer is duplicated data, which can bias the data and corrupt the results. Make sure you take care of such data as well.

Remove duplicate data
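Detecting and removing duplicates can be sketched like this, on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Asha", "Meena"],
    "age":  [28, 34, 28, 40],
})

# Flag fully duplicated rows, then drop them (the first occurrence is kept)
print(df.duplicated().sum())               # 1 duplicate row
df = df.drop_duplicates().reset_index(drop=True)
print(df.shape)                            # (3, 2)
```

`drop_duplicates` also accepts a `subset` argument if only certain columns should define a duplicate.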

Step 4: Separating categorical and numerical data.

Image source: http://survivestatistics.com/variables/

In general you can process the whole dataset using steps 1–3. But after this, numeric data and categorical data need different kinds of treatment because of their nature. We will discover how and why shortly; keep reading.
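The split itself is a one-liner with pandas' `select_dtypes`; a sketch on a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "city":   ["Delhi", "Mumbai", "Pune"],
    "gender": ["M", "F", "F"],
    "age":    [34, 29, 41],
    "income": [50000.0, 62000.0, 58000.0],
})

# Split by dtype: numeric columns on one side, everything else on the other
numeric_df = df.select_dtypes(include="number")
categorical_df = df.select_dtypes(exclude="number")

print(list(numeric_df.columns))      # ['age', 'income']
print(list(categorical_df.columns))  # ['city', 'gender']
```

One caveat: a categorical variable stored as numbers (e.g. a zip code) will land on the numeric side, so always sanity-check the result.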

Coming up next in part-2

  • Handling numerical data with scaling, outlier removal and more methods.
  • Handling categorical data with 8 different encoding techniques, such as label, one-hot, target and many more.

This is not the end; we will meet again with advanced EDA preparation steps in an upcoming article. Till then, make yourself familiar with your dataset using this one. Stay tuned for Part-2. I hope you enjoyed this, and any suggestions are most welcome. Happy learning till then!
