Exploratory Data Analysis (EDA) with Pandas
I wanted to write a blog post on how to quickly perform high-level EDA on a dataset. This blog is targeted at beginners looking to explore and get their feet wet with data.
I will be going over:
- What EDA is.
- How to read in a CSV.
- How to perform the initial steps of EDA.
What is Exploratory Data Analysis and why should I be doing it?
EDA can serve many purposes, but a few stand out: preparing your data for modeling, spotting trends, and answering initial questions. A high-level overview of the data tells us whether we are going down the right path.
How to read in a CSV:
A CSV is a file that stores tabular data in plain text. For this example we will load a CSV into a DataFrame to begin exploration. Please view the code below to learn how to read in a CSV. Note that I downloaded my CSV from Kaggle.com. The dataset is called Where it Pays to Attend College, and you can get it at https://www.kaggle.com/wsj/college-salaries.
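With the real Kaggle file on disk, reading it is a one-liner with pd.read_csv. The filename below is my assumption for the CSV inside the Kaggle download, and so that this sketch runs anywhere, it actually reads two sample rows (modeled on the real data) from a string instead:

```python
import io
import pandas as pd

# With the Kaggle file on disk you would simply run:
#     degrees = pd.read_csv('degrees-that-pay-back.csv')
# (the filename is an assumption; use whichever CSV you downloaded).
# So that this snippet runs anywhere, it reads a two-row sample from a string:
sample = io.StringIO(
    'Undergraduate Major,Starting Median Salary,Mid-Career Median Salary\n'
    'Accounting,"$46,000.00","$77,100.00"\n'
    'Aerospace Engineering,"$57,700.00","$101,000.00"\n'
)
degrees = pd.read_csv(sample)
print(degrees.shape)  # (2, 3)
```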
Just assign the DataFrame to any variable name that would be easy for you and others to follow. For this example I used the variable name degrees, since this DataFrame is about the salaries you can earn from the type of degree you have.
How to perform the initial steps of EDA:
You can go about doing these initial steps in any order, but these are the steps I wanted to take.
1:
- Let's view what columns are within this dataset
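With the data loaded, degrees.columns alone does the job. The sketch below hard-codes the 8 column names as they appear in the Kaggle file (verify them against your own download) so it runs without the CSV:

```python
import pandas as pd

# Stand-in for the loaded CSV: an empty frame with the dataset's 8 columns.
degrees = pd.DataFrame(columns=[
    'Undergraduate Major',
    'Starting Median Salary',
    'Mid-Career Median Salary',
    'Percent change from Starting to Mid-Career Salary',
    'Mid-Career 10th Percentile Salary',
    'Mid-Career 25th Percentile Salary',
    'Mid-Career 75th Percentile Salary',
    'Mid-Career 90th Percentile Salary',
])
print(degrees.columns)        # an Index of the 8 names above
print(len(degrees.columns))   # 8
```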
You can see that we have 8 columns here; however, it would be annoying to keep typing those long column names when we want to work with the data. Let's change the column names to make them easier to work with.
2:
- Renaming columns
degrees.columns =
['major','Start Med Sal', 'Mid Career Med Sal', 'Start-Mid Career Sal Change', '10%tile Sal','25%tile Sal','75%tile Sal','90%tile Sal']
The names of the columns have now changed to those listed above.
3:
- Let's view the shape of this data
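A sketch with two stand-in rows and three of the renamed columns (the full frame is larger, of course):

```python
import pandas as pd

# Two illustrative rows standing in for the loaded, renamed DataFrame.
degrees = pd.DataFrame({
    'major': ['Accounting', 'Aerospace Engineering'],
    'Start Med Sal': ['$46,000.00', '$57,700.00'],
    'Mid Career Med Sal': ['$77,100.00', '$101,000.00'],
})
print(degrees.shape)  # (2, 3) -> (rows, columns)
```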
The number on the left-hand side is the number of rows and the number on the right is the number of columns, as we counted before.
4:
- Checking the first 5 rows of the data
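For example, with six stand-in rows (the salary values are placeholders, not the dataset's real medians):

```python
import pandas as pd

# Six illustrative rows standing in for the full DataFrame.
degrees = pd.DataFrame({
    'major': ['Accounting', 'Aerospace Engineering', 'Agriculture',
              'Anthropology', 'Architecture', 'Art History'],
    'Start Med Sal': ['$46,000.00', '$57,700.00', '$42,600.00',
                      '$36,800.00', '$41,600.00', '$35,800.00'],
})
print(degrees.head())    # first 5 rows
print(degrees.head(2))   # first 2 rows
print(degrees.tail(2))   # last 2 rows
```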
Calling the head() function on the DataFrame (df) you want to inspect shows its first 5 rows. You can of course see more or fewer rows by putting the number you want inside the parentheses, such as degrees.head(10) to see the first 10 rows. You can also substitute .tail() to see the last rows.
5:
- Let's check what data types the columns in the degrees DataFrame are.
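Sketched with one stand-in row and three of the columns (the percent-change value is a placeholder number):

```python
import pandas as pd

# One illustrative row: two string columns plus the percent-change column,
# which the CSV stores as a plain number.
degrees = pd.DataFrame({
    'major': ['Accounting'],
    'Start Med Sal': ['$46,000.00'],
    'Start-Mid Career Sal Change': [67.6],
})
print(degrees.dtypes)
# major                           object
# Start Med Sal                   object
# Start-Mid Career Sal Change    float64
```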
We can see here that all but one of the columns are listed as datatype object. What does this mean? At a high level, data types are a way of storing data so that it can be used for different purposes. For our arithmetic and graphing purposes, we want to change these datatypes to floats or integers.
6:
- Let's change the data types
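Here is the attempt, with two placeholder salary strings, so you can see the error for yourself:

```python
import pandas as pd

degrees = pd.DataFrame({'Start Med Sal': ['$46,000.00', '$57,700.00']})
try:
    degrees['Start Med Sal'] = degrees['Start Med Sal'].astype(float)
except ValueError as err:
    # Prints something like: could not convert string to float: '$46,000.00'
    print(err)
```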
Chaining “.astype()” onto a column converts it to whatever dtype you pass in. However, in this case we get an error, because a float cannot contain a dollar sign or a comma. So we need to turn $64,000.00 into 64000.00 before converting to a float. Let's make a function that can do this for us.
7:
- Making a function to take out characters and switch the dtype
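A minimal version of such a function (the name clean and the parameter name are just choices; any names work):

```python
def clean(value):
    """Strip the '$' and ',' characters and convert the result to a float."""
    return float(value.replace('$', '').replace(',', ''))

print(clean('$64,000.00'))  # 64000.0
```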
So what is happening here? We are defining a function called clean(). You can call this function whatever you see fit; I decided to call it clean because it will clean up the data, enabling us to switch dtypes. You can read this tutorial, https://www.tutorialspoint.com/python/python_functions.htm, for more information on how to write a function. Functions are very useful, so I recommend looking into them.
Let's apply this function to our DataFrame by looping through the columns.
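A runnable sketch of that loop, with two stand-in rows; the clean() function from the previous step is repeated so the snippet stands on its own:

```python
import pandas as pd

def clean(value):
    # Same helper as in the previous step.
    return float(value.replace('$', '').replace(',', ''))

# Two illustrative rows standing in for the full DataFrame.
degrees = pd.DataFrame({
    'major': ['Accounting', 'Aerospace Engineering'],
    'Start Med Sal': ['$46,000.00', '$57,700.00'],
    'Mid Career Med Sal': ['$77,100.00', '$101,000.00'],
})

# Every column except 'major' holds salary strings in this sample.
salary_columns = degrees.columns[1:]
for x in salary_columns:
    # apply() runs clean() on each value; reassigning makes the change permanent.
    degrees[x] = degrees[x].apply(clean)

print(degrees.dtypes)  # the salary columns are now float64
```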
Here we are creating a new variable that houses the columns of our DataFrame, so that we can loop through it and apply our function to each column one at a time. If you do not know what a for-loop is, it runs an assigned command repeatedly over each item in the collection you ask it to loop through. So in this case, the for-loop goes through every column in salary_columns and applies the clean function. I then make the change permanent by reassigning the result back to degrees[x], i.e. degrees['Start Med Sal'] and so forth.
8:
- Let's see what the descriptions of the data look like now
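For example, on three stand-in rows with the salaries already converted to floats:

```python
import pandas as pd

# Three illustrative rows; the salary column is already numeric.
degrees = pd.DataFrame({
    'major': ['Accounting', 'Aerospace Engineering', 'Anthropology'],
    'Start Med Sal': [46000.0, 57700.0, 36800.0],
})
# Only numeric columns appear: count, mean, std, min, quartiles and max.
print(degrees.describe())
```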
The describe() function shows summary statistics of the data, such as the minimum and maximum, visible on the left-hand side. If you run it on columns that are not all integers or floats, those columns will not be displayed. However, since we took out the commas and dollar signs and switched the data type, we are now able to see these statistics. It can be very useful to look at them early on to see if there are any outliers in the data.
9:
- Counting null values
degrees.isnull().sum()
So in this instance there do not seem to be any null values. However, this is a topic worth discussing. If there were null values (empty/NaN), it may make sense to drop those rows, as long as over 95% of your data remains observed. There are other methods of imputing null values, such as filling with the mean or median or using regression imputation, but these can shrink the variance and distort the shape of the data. The best methods of imputing missing data are beyond the scope of this blog, but I highly suggest doing some more research on this topic down the line.
- Deleting null values
degrees = degrees.dropna()
This is the code you would run if there were any null values. You would then want to check the shape to make sure the rows actually dropped. I would also like to note that there are times you may want to run analysis on or graph a column and a type error stops you: an object column cannot be graphed or averaged. Someone could have manually typed in NA or NaN, which then gets read as a string. These are really annoying issues, so just be conscious that some missing values can cause them, and be on the lookout.
CONCLUSION
There are many steps one can take in EDA, but I decided to get you started with just a few. A lot of people think data scientists are constantly modeling and predicting cool things, but much of our time is actually spent cleaning data. Visualizing the data is also very useful, but I thought I would save that for another blog post. I hope this helps a little bit; good luck and have fun!