Data Cleaning for Beginners

What is data cleaning and why is it important?

Hazel Donaldson
Analytics Vidhya
5 min read · Dec 2, 2019


Photo by Markus Spiske on Unsplash

When starting a new project, it is important to use a clear framework so that you have a clear objective and can find the most efficient solution to a given problem. As you work on more projects, it becomes easier to see which data science process fits your personal preferences and the way you work. Currently, I am most comfortable using the OSEMN model. OSEMN stands for Obtain, Scrub, Explore, Model and iNterpret.

The first part of every data science project is to define its goal and understand the project's main objective. Once this is clear, you can start gathering data related to the project, which can be sourced organically or from external databases. As you gather data, you will commonly find that it is not perfect and needs to be cleaned or preprocessed. This step, the Scrubbing (or cleaning) step, is one of the most important steps in the OSEMN model. The goal of this post is to explain what this step is, why it is important, and which strategies you can use to properly clean or scrub data.

What is Data Scrubbing?

Data Scrubbing is the process of identifying the incorrect, incomplete, inaccurate, irrelevant or missing parts of the data and then modifying, replacing or deleting them as needed. This step is a fundamental element of basic data science.

Data is highly valuable for analytics and machine learning, but real-world data almost always contains incomplete, inconsistent or missing values. Cleaning data goes a long way toward improving a model's performance. Let's look at an example that highlights the importance of data cleaning.

King County dataset

The main objective of this project was to accurately predict housing prices in King County, Washington (kaggle link). This dataset contains information about house sale prices in King County between May 2014 and May 2015. Before we can move on to creating a model, we first need to look at our dataset and understand what is in it.

We can do this with the help of Python's Pandas library, which is used primarily for data manipulation and analysis, including reading .csv files. As you can see below, the pandas library is imported, then used to read the relevant .csv file. Finally, the first 5 houses can be viewed using the .head() function.

import pandas as pd

df = pd.read_csv('kc_house_data.csv')
df.head()  # view first 5 houses in dataset
Figure 1: First five houses in the data set

The first step of data cleaning is to remove information that is not relevant to the main objective: predicting housing prices. In this case, the "id" and "date" columns are not relevant to the objective, so they were dropped. We can use pandas.DataFrame.drop to drop specific rows and columns.

df.drop(["id", "date"], axis=1, inplace=True)

Next, you can move on to dealing with missing data. Handling missing data is very important because if you leave missing values as they are, they can skew your analysis, so we must deal with them before moving forward. Depending on the situation, there are three strategies for dealing with missing values in a dataset, each sketched in code after the list:

  1. Leave as is
  2. Drop them
  3. Fill in missing values with median or mean
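To make these strategies concrete, here is a minimal sketch of each in pandas, using the view column from this dataset as the example:

# Strategy 1: leave as is (many pandas operations skip NaNs automatically)
df.view.mean()

# Strategy 2: drop the rows that are missing a value in this column
df_dropped = df.dropna(subset=['view'])

# Strategy 3: fill missing values with the column's median (or mean)
df['view'] = df['view'].fillna(df['view'].median())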

To confirm whether a dataset has any missing values, we can use .isna() to detect them and chain .sum() to display how many missing values there are in each column.

df.isna().sum()
Figure 2: Null values in the dataset

From the output above, we can see that there are three columns with missing values: waterfront, view and yr_renovated. Now, we have to examine each of these columns to deal with its missing values.

Waterfront

With the help of the .unique() function on a column, we can see what values are in that column.

df.waterfront.unique()

With the help of the .unique() function, you can see that there are three unique values in this column: 1 (waterfront view), 0 (no waterfront view) and NaN (not a number). Because waterfront views do affect housing prices, the column could not be dropped. Instead, the data frame was split into two: houses with waterfront views (1) and houses without waterfront views (0). This way, the houses without waterfront information (NaNs) are excluded from the dataset. Finally, the two subsets are combined using pandas.concat().

df_waterfront = df[df.waterfront == 1]
df_nowaterfront = df[df.waterfront == 0]
new_df = pd.concat([df_waterfront, df_nowaterfront])
new_df.head()
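To confirm that the rows with missing waterfront information are gone, you can re-check the column:

new_df.waterfront.isna().sum()  # should now be 0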

View

For the view column, there were only 63 missing values, so I decided to fill them with the column's mean value using .fillna().

df['view'] = df['view'].fillna(df['view'].mean())

Yr_renovated

For the yr_renovated column, the missing values were filled with the median of the years in which homes were renovated.

df['yr_renovated'] = df['yr_renovated'].fillna(df['yr_renovated'].median())

In order to confirm that all missing values have been dealt with, we can call .isna().sum() again.
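Each column should now report zero missing values:

df.isna().sum()  # every count should now be 0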

Outliers

Next, possible outliers can be dealt with. But what is an outlier? According to Wikipedia, an outlier is a data point that differs significantly from other observations. Outliers can be created by errors in experiments, incorrect data entry or variability in measurements. Using pandas.DataFrame.describe(), which shows summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for each numerical column, it became obvious that there was a clear outlier in the bedrooms column.

df.describe()

As we can see, the max value for bedrooms is 33, and upon closer inspection, the house in question had 33 bedrooms with only one floor and 1.75 bathrooms. Since this seemed impossible, this specific row was dropped, as sketched below.
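One minimal way to drop that row, assuming we simply filter on the implausible bedroom count:

df = df[df.bedrooms != 33]  # drop the implausible 33-bedroom listing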

The last step was checking that each column has the correct data type. In Figure 1, we can see the data type of each column highlighted. In that examination, we can see that the sqft_basement column is an object instead of being numerical. Using the .value_counts() function, we can see that there are over 400 entries containing a question mark.

df.sqft_basement.value_counts()

The question marks were replaced with NaN, the column was converted to a numerical type using pandas.DataFrame.astype(), and the remaining missing values were then filled with the column's median using .fillna():

df['sqft_basement'] = df['sqft_basement'].replace('?', float('nan')).astype(float)
df['sqft_basement'] = df['sqft_basement'].fillna(df['sqft_basement'].median())
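To verify the conversion, you can check the column's data type, which should now be numerical:

df.sqft_basement.dtype  # float64 expected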

Conclusion

Finally, the dataset is clean and we can move on to exploring it! Data cleaning is very important for making our analytics error-free. In the real world, many of the most interesting datasets are filled with missing or incorrect information, and a small error can ruin a model's performance. So, make sure your data is always clean.
