Rudimentary Data Cleaning Techniques in Python Using King County, WA Housing Dataset

Patricio Contreras
Published in The Startup · Jan 19, 2021 · 11 min read

As an aspiring data scientist, I often find myself getting really invested in the new material I’m learning and getting excited that I’m inching closer and closer to the popular and revolutionary field of machine learning.

And while I don’t plan on contributing at all to the creation of a highly intelligent, autonomous killer cyborg, the idea of developing and training complex models to make our everyday lives easier is very appealing. However, as advanced and sophisticated as these models may be, they’re pretty much considered garbage if one crucial step is not taken…

DATA CLEANING

In order to produce a trustworthy and accurate model, data cleaning needs to be performed beforehand!

Background

This post will show you some data cleaning techniques applied to the King County Housing dataset. This dataset contains information on house sales in King County, WA between May 2014 and May 2015 and can be found here. Cleaning this dataset will be extremely helpful when creating a regression model for predicting house prices:

1. The Data

Below is a preview of the data:

# importing Python libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df = pd.read_csv("data/kc_house_data.csv")
df.head()
The dataset has 21,597 rows and 21 columns

In order to understand what each column in our data frame represents, a data dictionary is displayed below:

TARGET/DEPENDENT VARIABLE:

price — price of each home sold

PREDICTORS/INDEPENDENT VARIABLES:

  • id — unique identifier for a house
  • date — date of the home sale
  • bedrooms — number of bedrooms
  • bathrooms — number of bathrooms
  • sqft_living — square footage of the house’s interior living space
  • sqft_lot — square footage of the land space
  • floors — number of floors
  • waterfront — does the house have a view to the waterfront?
  • view — an index from 0 to 4 of how good the view of the property was
  • condition — an index from 1 to 5 on the condition of the house
  • grade — an index from 1 to 13, where 1–3 falls short of minimum building standards, 7 has an average level of construction and design, and 11–13 have a high quality level of construction and design
  • sqft_above — square feet above ground
  • sqft_basement — square feet below ground
  • yr_built— the year the house was initially built
  • yr_renovated — the year of the house’s last renovation (0 if never renovated)
  • zipcode — ZIP code of the house
  • lat — latitude coordinate
  • long — longitude coordinate
  • sqft_living15 — average size of interior housing living space for the closest 15 houses, in square feet
  • sqft_lot15 — average size of land lot for the closest 15 houses, in square feet

It is also important to know the data type of each column to get a sense of its nature (continuous vs. categorical):

df.dtypes

The output above confirms most of the expected data types, such as bedrooms and condition being discrete integers. However, some columns have unexpected data types (date — object, sqft_basement — object, yr_renovated — float64).
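
To see why pandas inferred these types, a quick peek at the raw values of those three columns helps (just a sketch, reusing the df loaded above):

# peek at the raw values of the unexpectedly-typed columns
df[["date", "sqft_basement", "yr_renovated"]].head()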

1.a — Descriptive Statistics of the Data

A good rule of thumb is to calculate descriptive statistics for our variables to get to know our data even better. Even though most of these will be seen in visualisations later on, here they are in table format:

df.describe(include = "all")

The output above shows statistics we probably expected, such as an average house having 2 bathrooms or the most recent house being built in 2015. However, there is cause for concern with some variables (the max number of bedrooms in a house is 33? The earliest renovation was done in year 0?).
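
Before moving on, it can be worth pulling up those suspicious records directly. A minimal sketch (nothing here changes the data yet, it’s just a sanity check):

# inspect the 33-bedroom listing
print(df[df.bedrooms == df.bedrooms.max()])
# count how many houses report a renovation year of 0
print("Houses with yr_renovated == 0:", (df.yr_renovated == 0).sum())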

1.b — Check for Missing Values

Missing values are oftentimes problematic and should always be checked when reviewing a dataset:

# percentage of missing values per column
(df.isna().sum() / len(df)) * 100

Unfortunately, waterfront, view, and yr_renovated have missing values. If we decide to use these variables in the predictive model, we’ll have to do something about their missing values.

1.c — Check for Duplicates

Duplicates are also problematic and are far more subtle than missing values. Given that a single house could’ve been sold at different points in time, we’ll define a duplicate as any row that has the same `id` and `date` as any other row in the dataset.

# check if the number of duplicates is 0
df.duplicated(subset = ["id", "date"]).sum() == 0

This is great news! Nothing to worry about here.
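
That said, houses that share an id but were sold on different dates are still worth a look, since they represent legitimate repeat sales rather than duplicates. A quick sketch:

# houses that appear more than once (same id, different sale dates)
repeat_sales = df[df.duplicated(subset = ["id"], keep = False)]
print("Rows belonging to repeat sales:", len(repeat_sales))
repeat_sales.sort_values(["id", "date"]).head()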

1.d — Check for Extreme or “Irrelevant” Values

Extreme and irrelevant values are also worth checking since they may affect our model down the line. Extreme values could be outliers or another way to express missing data (9999 is often one of them). They are easily spotted since they fall at the very beginning or end of the respective distribution. Irrelevant values are a cause for concern since they shouldn’t even be included in the data (e.g., “?” in a numeric column, 42 in a string column, etc.). These are more difficult to detect since they may be hidden in a large pool of data.
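
As a quick, optional check for the 9999-style placeholder mentioned above (assuming no column legitimately uses that value), we can scan every numeric column at once:

# does any numeric column contain the common 9999 placeholder?
(df.select_dtypes(include = np.number) == 9999).any()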

An easy way to check for extreme values is to display the distribution of the data and check the endpoints. Doing this also allows us to gain a better understanding of the nature of each variable. This is done below:

# histogram of each numerical feature
df.hist(bins = "auto", figsize = (20,20));
Distributions of all numeric variables

The output shown above mostly helps us understand the shape, skewness, and modality of each distribution. Several features such as price, yr_built, and sqft_living are skewed and unimodal. These distributions also shine a light on which variables could be considered categorical and which are definitely continuous (discrete bars vs. semi-smooth shapes).
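
To put a number on that skewness rather than eyeballing it, pandas can compute it directly (a small sketch; values near 0 are roughly symmetric, large positive values are heavily right-skewed):

# skewness of each numeric column, most skewed first
df.select_dtypes(include = np.number).skew().sort_values(ascending = False)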

To truly visualise potential outliers (extreme values), a boxplot is more appropriate:

# only visualising boxplots of non-string variables
not_object = df.select_dtypes(exclude = "object").columns
fig, axes = plt.subplots(nrows = 5, ncols = 4, figsize = (20,20))
# counter
index = 0
for row in range(len(axes)):
    for col in range(len(axes[row])):
        if index == len(not_object):
            break
        ax = axes[row][col]
        var = df[[not_object[index]]]
        var.boxplot(fontsize = 15, ax = ax)
        index += 1

Most of the boxplots displayed above show possible outliers. While some boxplots flag genuine anomalies (there’s a house with 33 bedrooms!), others report “outliers” that actually make logical sense (houses renovated in the year 2000 or a house having a condition rating of 1).
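
If we want a rough count of how many points each boxplot is flagging, the usual 1.5 × IQR whisker rule can be applied by hand. A sketch, reusing the not_object columns from above:

# count points outside the 1.5 * IQR whiskers for each numeric column
q1 = df[not_object].quantile(0.25)
q3 = df[not_object].quantile(0.75)
iqr = q3 - q1
outliers = (df[not_object] < q1 - 1.5 * iqr) | (df[not_object] > q3 + 1.5 * iqr)
outliers.sum().sort_values(ascending = False)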

Checking for irrelevant values is somewhat more complicated. What we’ll do first is check whether all the “numeric” columns (ints or floats) can be coerced into an int data type. If all these columns truly hold numeric values, there shouldn’t be any errors while doing this check:

# list containing all irrelevant values
incorrect_nums = []
# not_object is the list of all numeric columns
for col in not_object:
    for val in df[col].dropna():
        try:
            int(val)
        except (ValueError, TypeError):
            # append val that couldn't be coerced to int
            incorrect_nums.append(val)
print("There is/are", len(incorrect_nums), "erroneous value(s) in our numeric columns.")

That’s good news! This is a strong indication that there are no erroneous values in our numeric columns.

The check for non-numeric columns is easier since there are only 2 of them (date and sqft_basement). We’ll try to coerce date from a string to a datetime object and sqft_basement into a float (their appropriate data types):

incorrect_dates = []
for val in df.date:
    try:
        pd.to_datetime(val)
    except (ValueError, TypeError):
        incorrect_dates.append(val)

incorrect_sqfts = []
for val in df.sqft_basement:
    try:
        float(val)
    except (ValueError, TypeError):
        incorrect_sqfts.append(val)

print("There is/are", len(incorrect_dates), "erroneous value(s) in the date column.")
print("There is/are", len(incorrect_sqfts), "erroneous value(s) in the sqft_basement column.")

Uh oh, there seem to be irrelevant values in the sqft_basement column. Let’s check out what they are:

# unique irrelevant values
set(incorrect_sqfts)

The sqft_basement column has one irrelevant/incorrect value: “?” appears 454 times (2.1% of the rows).
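
Just to double-check those numbers before deciding what to do with them (a quick sketch):

# frequency of "?" in sqft_basement
n_question = (df.sqft_basement == "?").sum()
print(n_question, "occurrences ({}% of rows)".format(round(n_question / len(df) * 100, 2)))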

2. Data Preparation

Now that we’ve explored the data and gained a solid understanding of it, it’s time to prepare it for the regression model.

2.a — Fixing Missing Values

As mentioned in section 1.b, waterfront, view, and yr_renovated are the only columns with missing values. Section 1.d’s distributions suggest waterfront is a binary variable (0, 1), view is a discrete variable with most values equal to 0, and yr_renovated also has most of its values equal to 0. To verify this, we’ll output the most common values for each variable along with their percent frequencies:

# iterate over each column and print most common values
for col in ["waterfront", "view", "yr_renovated"]:
    print(col)
    print(df[col].value_counts(normalize = True).sort_values(ascending = False).head())
    print("------------------")

The output above clearly shows that 0 is the most common value in all three columns! In order to retain as much data as possible, we’ll fill the missing values of each column by randomly choosing one of its observed values, weighted by how often each occurs. For example, a missing value in the waterfront column will be replaced by 0 with a 99% chance and by 1 with a 1% chance:

def replace_missing(val, probs):
    if np.isnan(val):
        return np.random.choice(probs.index, p = probs)
    else:
        return val

for col in ["waterfront", "view", "yr_renovated"]:
    # get weights of unique values
    unique_p = df[col].value_counts(normalize = True)
    # apply function above
    df[col] = df[col].apply(replace_missing, args = (unique_p,))
    print("The number of missing values in {} is:".format(col), df[col].isna().sum())
    print("--------------------------------------")

print("Missing values per column:")
# last check to see if there are missing values
df.isna().sum()

Great! No more missing values, and we retained all the data we started with!
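
One caveat: because the imputation draws random values, the exact fills will differ slightly from run to run. If reproducibility matters, a seed can be set before running the loop above (the value 42 is arbitrary):

# optional: seed NumPy's random generator before the imputation loop
np.random.seed(42)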

2.b — Fix Extreme and “Irrelevant” Values

It is clear from section 1.d’s boxplots that most variables have outliers. Therefore, knowing when to remove them and when to leave them in as “reasonable” data points can be quite hard. In some cases, log-transforming the variable could fix the problem. In other cases, we might not want to log-transform the variable at all and would therefore need to remove as many outliers as we can. In short, this is a pretty subjective part and really depends on the nature of the variable being dealt with.

For this section, I mostly refer to the boxplots and histograms in 1.d. If I think a variable could be log-transformed later on, I only get rid of the most “extreme” outliers (since log-transforming the variable will probably take care of the remaining ones). If I think a variable shouldn’t be log-transformed (in order to preserve its interpretability), I get rid of most of the outliers:

# removing records that exceed "extreme" values:
# log-transforming price could fix remaining outliers
df = df[df.price < 5000000]
# probably don't want to log-transform this variable.
# try to keep most "reasonable" data
df = df[df.bedrooms < 6]
df = df[df.bathrooms < 4]
# log-transforming variable could fix remaining outliers
df = df[df.sqft_living < 10000]
df = df[df.sqft_lot < 800000]
df = df[df.sqft_above < 7000]
# can't log-transform variable
# try to get rid of most outliers
df = df[df.long < -121.8]
# probably good idea to log-transform these variables
df = df[df.sqft_living15 < 5000]
df = df[df.sqft_lot15 < 300000]
pct_diff = ((21597 - df.shape[0]) / 21597) * 100
print("We now have", df.shape[0], "rows ({}% less than at the beginning)".format(round(pct_diff, 2)))

Now that we got rid of those humongous outliers, we have to make a decision about sqft_basement. Given that “?” occurs only 2.1% of the time, we can replace it with the most common value (the mode):

# replacing ? with the mode
mode = df.sqft_basement.value_counts().sort_values(ascending = False).index[0]
# regex = False so "?" is treated as a literal string, not a regular expression
df.sqft_basement = df.sqft_basement.str.replace("?", mode, regex = False)
# check
print("The number of times '?' occurs in sqft_basement is:", (df.sqft_basement == "?").sum())

Great! We’re now clear of all missing values, outliers, and irrelevant values.

2.c — Convert Columns to Correct Data Types

Recall the data type of each column:

# displaying data types per column again
df.dtypes

While some make sense (e.g., price = float, bedrooms = int, etc.), there are other columns that need to be converted to a different data type. We’ll do that in this step:

# converting columns to appropriate data type!
df.date = pd.to_datetime(df.date)
# I've decided to use the ceiling function for variables that
# are discrete by nature but are floats
df.bathrooms = np.ceil(df.bathrooms).astype(int)
df.floors = np.ceil(df.floors).astype(int)
df.waterfront = df.waterfront.astype(int)
df.view = df.view.astype(int)
# cannot convert float string to int directly
df.sqft_basement = df.sqft_basement.astype(float).astype(int)
df.yr_renovated = df.yr_renovated.astype(int)
df.dtypes

We finally have all columns with their right data type! Now to apply some transformations…

2.d — Data Transformations

Recall how some distributions in section 1.d (e.g., price, yr_built, and sqft_living) were skewed. In order to improve the performance of our model later on, it’s generally a good idea to transform these variables to resemble normal distributions. One way of doing this is by log-transforming the variable:

log_vars = ["log_sqft_above", "log_sqft_living", "log_sqft_lot", 
"log_sqft_living15", "log_sqft_lot15"]
# log-transforming chosen variables
df[log_vars] = np.log(df[["sqft_above", "sqft_living", "sqft_lot",
"sqft_living15", "sqft_lot15"]])
# dropping un-transformed variables
df.drop(["sqft_above", "sqft_living", "sqft_lot", "sqft_living15",
"sqft_lot15"], axis = 1, inplace = True)
df[log_vars].hist(bins = "auto", figsize = (20,20));

As seen by the output above, the distributions of our predictor variables have now changed to resemble more of a normal distribution (symmetrical, unimodal, “bell-shaped”). This is good news as it’ll help our model performance later on.
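
As a quick numeric confirmation, the skewness of the transformed columns should now sit much closer to 0 (a sketch):

# skewness of the log-transformed columns (closer to 0 = more symmetric)
df[log_vars].skew()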

We have now cleaned the data of all missing values, checked for duplicates, handled extreme values (outliers), and converted each column to its correct data type.
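
If the regression model will live in a separate notebook, it may be worth saving the cleaned data frame at this point (the file name below is just a suggestion):

# persist the cleaned dataset for the modelling step
df.to_csv("data/kc_house_data_clean.csv", index = False)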

Conclusion

If you’re thinking of becoming a data scientist and, like me, are passionate about creating new models and learning the complexities of machine learning, I strongly recommend you master data cleaning. It’s a vital step that will greatly influence whether your model is successful or not. Anyone can create a highly complex or bare-bones model and get predictions out of it. In the end, it’s the data scientist who cleans the data and feeds relevant data to the model that’ll receive all the applause and cheers.
