Data Cleaning in Python — Data Science With Craigslist Data Part 3
Hello again, Medium. This is the third installment of a six-part series that walks through the entire data analytics process for some 20K-plus Craigslist vehicle ads from 20 cities. In the previous section, I covered the actual web scraping process using Python and BeautifulSoup.
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. If the data is incorrect, outcomes and algorithms become unreliable, even though they may appear to be correct. In most cases, having domain knowledge helps you avoid under-cleaning or over-cleaning the data.
In this section, we’ll go over the data cleaning procedure in Python. The Jupyter Notebook is available on my GitHub if you would like to follow along.
We’ll use the following workflow, albeit loosely.
- Load raw data from the web scraping exercise
- Create new columns as needed
- Drop unnecessary columns
- Check for erroneous data and clean
- Check for outliers and clean
- Fix incorrect data types
- Save clean data frame to csv
Ok, let’s get started. First we import all libraries that we will need into our Jupyter notebook environment. Please see the code below for the imports.
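The original notebook shows the imports as a screenshot; a minimal sketch of the imports used throughout this walkthrough would look something like this:

```python
import re                        # regular expressions, used later to de-duplicate words
import numpy as np               # numerical helpers
import pandas as pd              # data frames
import matplotlib.pyplot as plt  # histograms and box plots
```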
Next, we load and preview the data we obtained from our web scraping session. My data is stored as a CSV on Google Drive and I am running a notebook on colab.research.google.com, Google’s platform that lets you run notebooks in the cloud. To read the data, you just run from google.colab import drive and then drive.mount() to mount your Google Drive and read from it.
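A rough sketch of that loading step, assuming the scraped CSV lives somewhere on Drive (the file path below is just a placeholder):

```python
from google.colab import drive
import pandas as pd

# Mount Google Drive inside the Colab runtime
drive.mount('/content/drive')

# Placeholder path -- point this at wherever your scraped CSV actually lives
df = pd.read_csv('/content/drive/My Drive/craigslist_vehicles.csv')
df.head()
```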
Here is the snapshot of the imported data.
The first thing that jumps out is the date time column. We’ll want to create weekdays, which will come in handy when we analyze the data. While we’re at it, we can create a separate date column as well as a time column. Not entirely necessary, but useful when you are looking at the data from these two dimensions separately. To create the two new columns, we’re going to use the string split function, but first we have to convert the data type to a string.
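A minimal sketch of that split, assuming the scraped column is called 'datetime' (adjust the name to whatever your frame actually uses):

```python
# Work off a string copy of the timestamp so we can split it on the space
dt_str = df['datetime'].astype(str)

# Everything before the space is the date, everything after is the time of day
df['date'] = dt_str.str.split(' ').str[0]
df['time'] = dt_str.str.split(' ').str[1]
```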
The resulting columns look as shown below.
Next, we’ll create both numeric and categorical weekdays. The numeric version comes in handy when computing correlations with other columns. Once we have the numerical weekdays, we can create a dictionary of day-number to day-name pairs and map the numerical labels to the day names.
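A sketch of the weekday columns, again assuming the column names used above:

```python
# Make sure the timestamp and date columns are genuine datetimes
df['datetime'] = pd.to_datetime(df['datetime'])
df['date'] = pd.to_datetime(df['date'])

# Numeric weekday: Monday = 0 through Sunday = 6
df['weekday'] = df['date'].dt.dayofweek

# Map the day numbers to day names for the categorical version
day_names = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday',
             4: 'Friday', 5: 'Saturday', 6: 'Sunday'}
df['weekday_name'] = df['weekday'].map(day_names)
```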
All the resulting new columns derived from the date time column are shown below.
The next column that needs our attention is the year make model column, which contains all three attributes in the same string. We’ll need to split this into three separate columns. Here is what the year make model column looks like.
Immediately, we see another issue that will cause problems down the road. The year sometimes occurs twice in the string, so splitting on, say, the first space would create bad data for the second and third columns. We need to remove the extra year wherever it occurs. To do this, we use a regular expression that detects repeated words and removes the duplicates in the year make model column.
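One way to do this, assuming the combined column is literally named 'year make model':

```python
import re

# Collapse immediately repeated words, e.g. '2012 2012 honda civic' -> '2012 honda civic'
def remove_repeated_words(text):
    return re.sub(r'\b(\w+)( \1\b)+', r'\1', str(text))

df['year make model'] = df['year make model'].apply(remove_repeated_words)
```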
The resulting column after removing duplicates looks as shown below. You can see the duplicate year from row 9 above is now shown once below.
We can now split the year make model column and grab the first three words for year, make and model columns respectively.
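A sketch of the split:

```python
# Split on whitespace and keep the first three tokens as year, make and model
tokens = df['year make model'].str.split(' ', expand=True)
df['year'] = tokens[0]
df['make'] = tokens[1]
df['model'] = tokens[2]
```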
Here are the resulting new columns from the code above.
As a final check, we’ll create a histogram using matplotlib.pyplot for all year occurrences in the data to see if our cleaning worked as intended.
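Something along these lines, coercing the split-out year strings to numbers just for plotting:

```python
import matplotlib.pyplot as plt

# Coerce the year strings to numbers; anything unparseable becomes NaN and is dropped
years = pd.to_numeric(df['year'], errors='coerce').dropna()

plt.hist(years, bins=50)
plt.xlabel('year')
plt.ylabel('count')
plt.show()
```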
It would have worked perfectly were it not for a few erroneous years that show up in the histogram, so we have to do some further cleaning of the year column. There are many ways to do this, but I created a helper column that stores the length of each year value. A proper year is four characters long, so we’ll find out which year values don’t conform to this length by examining the values stored in the new helper column, called length.
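The helper column is a couple of lines of pandas:

```python
# Helper column: number of characters in each year value
df['length'] = df['year'].astype(str).str.len()

# Anything other than a length of 4 points at a malformed year
df['length'].value_counts()
```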
We see that there are three bad rows, with year lengths of 4, 5 and 9. We’ll just drop these rows from the data frame altogether, and since we’re done with the helper column we should drop it at this stage too; a sketch of that clean-up is below. With those rows gone, re-running the histogram no longer flags any bad year formats.
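A sketch of the clean-up, keeping only the rows whose year is exactly four characters long:

```python
# Keep only the well-formed years, then discard the helper column
df = df[df['length'] == 4]
df = df.drop(columns=['length'])
```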
Next, we look at the data types for each column and change them as needed. We just inspect the data frame’s dtypes attribute to retrieve the data types.
The date time and date columns have the right data type of datetime, and lat, long, odometer and price have float and int data types, so these don’t need any changing. The categorical columns, however, are stored as the object type, which would work, but we should tell pandas to convert them to the categorical data type, which is more memory efficient when working with pandas.
We should also drop any columns we no longer need at this stage, like the year make model column.
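A sketch of both steps, converting every remaining object column in one sweep (the 'year make model' column name is again an assumption about the frame):

```python
# Convert the remaining object columns to pandas' memory-efficient categorical dtype
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].astype('category')

# Drop the combined column now that year, make and model live in their own columns
df = df.drop(columns=['year make model'])

df.dtypes
```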
The resulting data types are shown below.
The next thing is to examine the data frame to see how many null values we have in each column.
The table below shows the null count for each column and, right beside it, the percentage of the data it represents.
We see that the VIN number column has the most null values, at 85% of all vehicles, because most people don’t include it in their Craigslist ad. The other columns with a lot of nulls are size, type, drive, paint color, cylinders and condition respectively. In some data cleaning cases we might have to drop these rows altogether, but here we’ll keep them because we’d lose more than we’d gain. For example, it’s best to retain the price and odometer information of cars without a VIN number because the price data is more important. This is where familiarity with the domain kicks in. The key thing here is that the most important columns have almost no null values.
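A table like that can be built in a couple of lines:

```python
# Null count per column, plus the share of all rows it represents
nulls = df.isnull().sum().to_frame('null_count')
nulls['null_pct'] = (nulls['null_count'] / len(df) * 100).round(1)
nulls.sort_values('null_pct', ascending=False)
```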
We’ll now get the basic descriptive statistics of the numerical columns to get an idea of how the data leans. We use the pandas describe method for this.
While lat, long and pID are shown here, we can ignore them as they are more or less just labels. We see that the odometer column has a maximum value of 10,000,000 and price a maximum of $123 million, which obviously does not make sense, so we’ll want to see the data distribution and find the outliers. The weekday column shows a maximum of 6, the last day of the week, and a minimum of 0, the first day of the week. This makes total sense.
So let’s create a boxplot of both price and odometer to find the outliers in the data and remove them. Outliers tend to heavily distort measures of central tendency.
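Which is simply:

```python
# Summary statistics (count, mean, std, min, quartiles, max) for the numeric columns
df.describe()
```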
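A sketch of the box plots:

```python
# Box plots make the extreme price and odometer values easy to spot
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].boxplot(df['price'].dropna())
axes[0].set_title('price')
axes[1].boxplot(df['odometer'].dropna())
axes[1].set_title('odometer')
plt.show()
```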
Let’s try another type of plot, a histogram.
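For example, the raw price distribution:

```python
# The raw price histogram is dominated by a handful of enormous values
plt.hist(df['price'].dropna(), bins=50)
plt.xlabel('price')
plt.ylabel('count')
plt.show()
```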
We’re going to remove these outliers using a quantile filter that eliminates the top 1% and bottom 1% of the data. Pandas has the elegant pandas.DataFrame.quantile method to help us handle this.
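A sketch of the filter applied to the price column (the odometer column gets the same treatment further down):

```python
# Keep only the rows that fall between the 1st and 99th price percentiles
low = df['price'].quantile(0.01)
high = df['price'].quantile(0.99)
df = df[(df['price'] >= low) & (df['price'] <= high)]
```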
After removing the outliers we can now take a look at the new price distribution using a histogram.
The cleaned up price column still has a right skew but is reflective of the real world. We use the same steps to clean up the odometer column.
Removing the outliers resulted in some lost data, so let’s examine how much. We started with 24,427 rows of data and ended up with 23,442, almost 1,000 records fewer, or about 4% of the original data. Not a bad compromise for good data, as shown below.
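A quick way to check the damage (the starting row count here is the raw figure quoted above):

```python
raw_rows = 24427              # rows straight out of the scraper
clean_rows = len(df)          # rows left after all the cleaning above
pct_lost = (raw_rows - clean_rows) / raw_rows * 100
print(f'{clean_rows} rows remain; {pct_lost:.1f}% of the data was dropped')
```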
Finally, we have a dataset that we can analyze with confidence. For now we’ll just save it to CSV, but in the next blog post we’ll do a proper data exploration. Just remember that you may have to reassign the right data types when this clean data is loaded back from CSV.
In conclusion, I know we can find more ways to clean the data at hand, but most of that will depend on what insights we want to generate from it. Thanks for following along, and please do add your comments or questions below.
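Saving is a one-liner (the file name is just a placeholder):

```python
# Persist the cleaned frame for the exploration post that follows
df.to_csv('craigslist_vehicles_clean.csv', index=False)
```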