Equipping your EDA toolbox with Python.
Exploratory Data Analysis (EDA) is the backbone of building an accurate model that predicts what you need predicted in the real world. Sure, you can throw powerful regression or classification models such as Random Forest, AdaBoost or Support Vector Machines at your data, tune them with a grid search, and end up with a great train score but a poor test score, which means the model is most likely overfit. If your dataset is small enough (and clean enough), you might get away with little to no EDA and still score well on both your training and test data. Once we’re out of the classroom and in the cold, cruel world, however, the datasets we work with will more often than not be full of null values, poorly labelled columns, columns with the wrong data type (I’ll dig into that later) and outliers that will throw off your model.
This is where having a robust set of tools in your EDA toolbox will be invaluable. I find EDA can be fun if you know what you’re doing and downright frustrating if you don’t. I’m here to help make it fun for you.
This is what we will cover in this blog:
- .head(), .shape, .info(), .describe(), value counts
- Looking for nulls
- Mapping
- Masks
- Dummies
- Finding correlations
- Plotting data
Explore the data
Let’s look at the data. For this blog I will be using the left hand dataframe. If you’re unfamiliar with this dataframe, it asks random people a series of questions in order to predict whether or not each person is left handed. The questions are not listed in the dataframe, so you will have to use the dictionary as a reference (link). Finding the dictionary is a very important step in keeping your features straight and determining which features you want to keep or discard.
We’ll load in the data and preview the dataframe with .head().
The first thing you’ll notice is that the columns look wonky. Don’t panic. The file is tab-separated rather than comma-separated, so this can be corrected by telling pandas that the separator is a TAB character, written \t. While we’re at it, let’s max out the number of displayed columns so they are all visible. This is very useful when looking at dataframes with a lot of columns.
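Here is a minimal sketch of that fix, assuming the file is tab-separated. The sample text below is made up to stand in for the real file (the actual file name isn’t given here), so swap the StringIO object for your own file path.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: a few tab-separated rows.
raw = "Q1\tQ2\tage\thand\n4\t1\t23\t1\n2\t5\t31\t2\n3\t3\t45\t3\n"

# Without sep="\t", pandas looks for commas and crams everything into one column.
wonky = pd.read_csv(io.StringIO(raw))

# Telling pandas the separator is a TAB gives one column per field.
df = pd.read_csv(io.StringIO(raw), sep="\t")

# Show every column instead of truncating wide dataframes with "...".
pd.set_option("display.max_columns", None)

print(wonky.shape)
print(df.head())
```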
Boom!
Moving on, let’s check the .shape, .describe(), .info() and count null values.
The .describe() method gives summary statistics (count, mean, standard deviation, min, quartiles and max) for each of your numeric columns.
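As a rough sketch, these four checks look something like this. The tiny dataframe is invented for illustration (including the missing age), so the numbers mean nothing.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration; one age is deliberately missing.
df = pd.DataFrame(
    {"Q1": [4, 2, 3, 5], "age": [23, 31, np.nan, 45], "hand": [1, 2, 1, 3]}
)

print(df.shape)        # (number of rows, number of columns)
df.info()              # dtype and non-null count for every column
print(df.describe())   # summary statistics for the numeric columns

# isnull() marks missing cells as True; sum() counts them per column.
null_counts = df.isnull().sum()
print(null_counts)
```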
Let’s look into the dictionary:
Looking at the dictionary of questions the first thing I notice is Q27 and Q43 are duplicates. Let’s drop one.
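Dropping a column is a one-liner. In this sketch I’m assuming we keep Q27 and drop Q43 (either would do), and the values are made up.

```python
import pandas as pd

# Toy data: Q27 and Q43 stand in for the two duplicate questions.
df = pd.DataFrame({"Q27": [1, 4], "Q43": [1, 4], "hand": [1, 2]})

# drop() returns a new dataframe, so assign it back to df.
df = df.drop(columns=["Q43"])

print(df.columns.tolist())
```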
Mapping
Ok, let’s inspect the columns. These columns are a little tricky. Object variables can’t go into our model, so for the columns we want to keep, we’ll have to convert them to integers. Let’s look at the “hand” column. We’re for sure going to keep this column because it’s going to be our “y” in our model. Remember: we’re trying to predict left handedness. We want our “y” to tell us if a person is left handed. 1 = yes, 0 = no.
This is a great time to use the built-in .map() method:
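A minimal sketch of the mapping, assuming “hand” is coded 1 = right, 2 = left, 3 = both. That coding is my assumption, so check it against the dictionary before trusting it.

```python
import pandas as pd

# Toy data; the coding below is an assumption, verify it in the data dictionary.
df = pd.DataFrame({"hand": [1, 2, 3, 1, 2]})

# Assumed coding: 1 = right, 2 = left, 3 = both.
# Lefties become 1, everyone else becomes 0.
df["y"] = df["hand"].map({1: 0, 2: 1, 3: 0})

print(df["y"].value_counts())
```

One thing to watch: any value that isn’t a key in the dictionary comes back as NaN, so it’s worth counting nulls again after mapping.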
Additionally here is a quick bar plot:
As you can see we have put ambidextrous and right handed people in the blue column and lefties in red.
There are some columns that look ripe for dummying. If you’re unfamiliar with dummy columns, check out this excellent and simple blog post by Rowan Langford. We will dummy out education, gender, orientation, race and religion. However, they are stored as integers, and by default pd.get_dummies only converts object or categorical columns, so the dummy column names are also far easier to read when the values are labels. So, let’s map them into objects.
Now let’s dummy the columns:
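Here is a sketch of both steps on a single column. The gender labels are hypothetical placeholders; the real codes live in the dictionary.

```python
import pandas as pd

# Toy data with an integer-coded categorical column.
df = pd.DataFrame({"gender": [1, 2, 1, 3], "age": [23, 31, 45, 52]})

# Hypothetical labels; look up the real codes in the data dictionary.
gender_labels = {1: "male", 2: "female", 3: "other"}
df["gender"] = df["gender"].map(gender_labels)

# One new 0/1 column per label; the original gender column is removed.
df = pd.get_dummies(df, columns=["gender"])

print(df.columns.tolist())
```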
Great. Now let’s take a look at our dataframe with .columns:
Masks and Sort By
Ok, now let’s go through our columns and root out the pesky outliers that can throw off our model. We’ll use the sort_values method to check the age range of our dataframe. This will sort the age column from oldest to youngest:
And just as I thought. We’ve got someone who’s older than Moses and a couple of people who would be in the Guinness Book of World Records. Let’s get them out of there. To do this we’ll use a mask. Masks are very handy in EDA; they are basically filters you can customize.
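Something along these lines, with invented ages standing in for the real column:

```python
import pandas as pd

# Toy ages, made up for illustration, with two impossible values mixed in.
df = pd.DataFrame({"age": [23, 409, 31, 123, 18]})

# ascending=False puts the largest (oldest) values at the top.
oldest_first = df.sort_values(by="age", ascending=False)

print(oldest_first.head())
```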
Let’s walk through the syntax:
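A minimal sketch, using plain boolean indexing (a Series of True/False values that picks which rows to keep). The ages are made up, and the cutoff of 100 is my own assumption; choose whatever limit makes sense for your data.

```python
import pandas as pd

# Toy data, made up for illustration.
df = pd.DataFrame({"age": [23, 409, 31, 123, 18], "hand": [1, 2, 1, 3, 2]})

# Step 1: the comparison returns True/False for every row. This is the mask.
age_mask = df["age"] < 100   # cutoff of 100 is an assumption

# Step 2: passing the mask inside [] keeps only the rows where it is True.
df = df[age_mask]

print(df)
```

You can combine conditions with & (and) and | (or), as long as each condition is wrapped in parentheses.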
Nice! Now let’s plot our data:
Finding correlations
Remember our problem statement is to make a model that can predict whether a person is left handed based on the questions and their background.
A great way to get a feel for the linear relationships between your variables is with a correlation matrix. We can see the correlation between all the numeric variables in our dataset by using the .corr() method. It can help you decide which columns are worth investigating further (though with a lot of variables, the matrix can be a bit overwhelming).
This is a fairly large dataframe with 74 columns, so let’s split up our features:
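One way to do the split, sketched on random toy data: chop the feature list into small groups and run .corr() on each group together with the target. The column names, the “y” target and the group size of 3 are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Random toy data standing in for the question columns and the target.
rng = np.random.default_rng(42)
df = pd.DataFrame(
    rng.integers(1, 6, size=(50, 6)), columns=[f"Q{i}" for i in range(1, 7)]
)
df["y"] = rng.integers(0, 2, size=50)

# Every column except the target is a feature.
features = [col for col in df.columns if col != "y"]

# Slice the feature list into groups of 3 (group size is an assumption).
chunk_size = 3
chunks = [features[i:i + chunk_size] for i in range(0, len(features), chunk_size)]

# A small correlation matrix per group, always including the target.
for chunk in chunks:
    corr = df[chunk + ["y"]].corr()
    print(corr.round(2))
```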
By taking all the features and splitting them up we can get a much more readable correlation matrix.
Plotting our data
It can be difficult to spot the values that stand out simply by staring at our correlation matrix. To help get around this issue, let’s pass our correlation matrix to Seaborn’s heatmap() function.
Setting annot=True displays the numerical correlation from the matrix in each cell of our heatmap, giving us a better understanding of each feature’s relationship to hand.
Alright, that about does it. From here you can pick and choose the features that will best fit your model. Remember to keep exploring your data and filtering outliers, because your data WILL lie to you. Part of our job as data scientists is to find the truth. Hopefully this blog will assist you on your quest.
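For reference, a heatmap like this can be sketched as follows. The data is random toy data, and the color map and figure size are just my picks, not anything the dataset requires.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs in a script
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Random toy data standing in for a few questions and the target.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 6, size=(50, 3)), columns=["Q1", "Q2", "Q3"])
df["y"] = rng.integers(0, 2, size=50)

corr = df.corr()

fig, ax = plt.subplots(figsize=(6, 5))
# annot=True writes each correlation value inside its cell;
# vmin/vmax pin the color scale to the full -1 to 1 range.
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation matrix")
fig.tight_layout()
# In a notebook the figure renders on its own; in a script, save it with fig.savefig().
```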