Getting Started With Exploratory Data Analysis
I remember the first time I heard that term. The way it was said made me understand that I was supposed to have known it already. Of course I played along… yeah, sure, EDA… no big deal! But it sure was a big deal. I eventually got an idea of what it meant, but I was still pretty much in the dark. I had questions: how far do I go? How do I go? What am I doing? When should I stop? How should I stop?
To explore means to investigate systematically; analysis means decomposing something into components in order to study it. The two are closely related. So basically, you're trying to break down and investigate data.
But then, let's go formal:
According to Wikipedia, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. The process of Exploratory Data Analysis involves checking the dataset for anomalies and finding patterns within it.
So back to not being formal again:
How far to go?… Answer: All the way, if you have the time and the luxury of extracting as many discoveries as you can for your project or client.
How to go?… Answer: Orderly and systematically.
What am I doing?… Answer: Looking for clues and patterns in the dataset, focused on solving the problem for which you selected or extracted the data.
When should I stop?… Answer: When you have found patterns that can help you solve your problem, or when you can't find clues or useful patterns anymore.
How should I stop?… Answer: With summaries of your findings, clearly stated and, of course, in an orderly manner.
Some EDAs come before other processes, such as prediction, so you'll want to gather all the information you can to help with your next task, process or decision.
The process of Exploratory Data Analysis (EDA) varies with the dataset being used and the person performing it, although certain general observations should hold for anyone working on the same dataset.
When performing EDA, you basically look for any clue (this could be in the structure of the dataset, in the patterns it forms, in the reasons behind them if there are any, or basically anything), pick it up and run with it. Then repeat with as many clues as you can get. Remember the "orderly and systematic process"?… That's why you try to see where one clue leads you before you pick up another clue to investigate.
To better explain the process, I will use the US Unemployment Rate dataset from the US Department of Labor's Bureau of Labor Statistics to get you STARTED. Then I'll let you continue. The dataset covers the years 1990 to 2016.
In order to perform EDA, it is important to have background knowledge of what the dataset is about. For example, someone who doesn't understand what the unemployment rate is would simply be operating blindly with this dataset.
I performed the entire process on Kaggle, mainly because of the size of the dataset and the availability of an accelerator (GPU) and memory. You should check Kaggle out if you haven't. It is such a wonderful space for everything from datasets to data resources and tools.
Understanding the Basic Structure of the Dataset:
I needed an idea of what the dataset looked like, so I viewed the first five and last five rows:
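In pandas, that first look might go like this. Note that `df` below is a tiny made-up stand-in frame (it has the same five columns as the real dataset, but invented values); in the actual notebook you would load the CSV with `pd.read_csv`:

```python
import pandas as pd

# Tiny made-up stand-in for the real dataset (same five columns,
# invented values); on Kaggle you would do pd.read_csv(...) instead.
df = pd.DataFrame({
    "Year":   [1990, 1992, 2000, 2010, 2016],
    "Month":  ["January", "March", "June", "August", "December"],
    "State":  ["Mississippi", "Colorado", "Nebraska", "Arizona", "Texas"],
    "County": ["Newton", "San Juan", "Lancaster", "Maricopa", "Loving"],
    "Rate":   [6.4, 58.4, 3.1, 9.3, 0.0],
})

print(df.head())  # first five rows
print(df.tail())  # last five rows
```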
As soon as I saw the first five rows, I had to ask myself whether the Year, Month and State columns only held repeated values (that is, '2015', 'February' and 'Mississippi'). So I quickly checked the last five rows, and seeing different values there, I dropped that question.
Next, I checked the data types and the shape (number of rows and columns).
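These two checks are one-liners in pandas; again I'll sketch them on a miniature stand-in frame rather than the real 885,548-row one:

```python
import pandas as pd

# Miniature stand-in frame; the real dataset has 885,548 rows
df = pd.DataFrame({
    "Year":   [1990, 1992, 2000, 2010, 2016],
    "Month":  ["January", "March", "June", "August", "December"],
    "State":  ["Mississippi", "Colorado", "Nebraska", "Arizona", "Texas"],
    "County": ["Newton", "San Juan", "Lancaster", "Maricopa", "Loving"],
    "Rate":   [6.4, 58.4, 3.1, 9.3, 0.0],
})

print(df.dtypes)  # Year: int64; Month/State/County: object; Rate: float64
print(df.shape)   # (rows, columns) -- (885548, 5) on the real data
```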
So I discovered that the Year column contained integers; the Month, State and County columns were object types (strings in this case); and the Rate column was float (having decimals). I also discovered that the dataset was quite large (885,548 rows and 5 columns), as expected, considering that it covers the United States.
Next, I checked whether this large dataset had missing values. Lucky me, I discovered it didn't; as a matter of fact, every row was present with the appropriate data. The image below just shows that nothing was missing.
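The missing-value check is typically a null count per column, something like this (stand-in frame again):

```python
import pandas as pd

# Stand-in frame; the real dataset turned out to have no missing values
df = pd.DataFrame({
    "Year":   [1990, 1992, 2000, 2010, 2016],
    "Month":  ["January", "March", "June", "August", "December"],
    "State":  ["Mississippi", "Colorado", "Nebraska", "Arizona", "Texas"],
    "County": ["Newton", "San Juan", "Lancaster", "Maricopa", "Loving"],
    "Rate":   [6.4, 58.4, 3.1, 9.3, 0.0],
})

missing = df.isnull().sum()  # nulls per column
print(missing)
```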
I then thought of viewing some statistical details of the dataset, since some columns were made up of numbers. This is what I discovered:
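That summary table comes from `describe()`, which covers the numeric columns (Year and Rate here). A sketch on the stand-in frame:

```python
import pandas as pd

# Stand-in frame; numeric columns are Year and Rate, as in the real data
df = pd.DataFrame({
    "Year":   [1990, 1992, 2000, 2010, 2016],
    "Month":  ["January", "March", "June", "August", "December"],
    "State":  ["Mississippi", "Colorado", "Nebraska", "Arizona", "Texas"],
    "County": ["Newton", "San Juan", "Lancaster", "Maricopa", "Loving"],
    "Rate":   [6.4, 58.4, 3.1, 9.3, 0.0],
})

stats = df.describe()  # count, mean, std, min, quartiles, max
print(stats)
```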
I found that the minimum (smallest) value in the Year column was 1990 (ignore the trailing zeros) and the maximum was 2016. From this I could tell that the latest year represented was 2016 and the earliest was 1990… just as I expected. It is important to note that no dataset should be blindly trusted, because anything could have happened to it between the point of collection and the point of download.

The table above also shows that the highest unemployment rate was 58.4%; pretty high, and definitely an outlier (a value far off from the other values in the same column), especially when you consider the third quartile (7.7%). But we'll look at that a bit later. More information from the table! The minimum Rate was 0.00%; a zero unemployment rate! Questions are starting to pop up… I mean… because a zero unemployment rate isn't quite a good thing. How do I know? I researched zero unemployment rates! The mean unemployment rate was recorded as 6.18%.
Taking a Closer Look at the Content of the Dataset
My attention went to the State column next. First of all, are all states in the US represented? And if yes, prove it! Show me! Because I could imagine the unemployment rate varying from state to state and coming together (as a mean) to give the overall unemployment rate for the entire country.
My findings? Only 47 states, with Alaska, Florida and Georgia absent. Why? I'm not sure… I mean, Florida is the 7th most populous state in the US, Georgia the 8th most populous, and Alaska covers the largest area in the US, and all three existed throughout the 1990–2016 period. Could these states have greatly influenced the overall unemployment rate for the US? Did the people who gathered the dataset lose some of it? Is my dataset source invalid? Or is there any other hypothesis you might have?
Let's move on, because we can't answer these questions yet. Next is understanding how these states are represented by the number of rows they occupy in the entire dataset, keeping in mind that we also have a County column:
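The row count per state is a `value_counts` on the State column. A sketch, again on a made-up miniature of the data (note the repeated states, which is what makes the count interesting):

```python
import pandas as pd

# Miniature stand-in; states repeat across counties, as in the real data
df = pd.DataFrame({
    "State":  ["Mississippi", "Mississippi", "Mississippi",
               "Colorado", "Colorado", "Nebraska"],
    "County": ["Newton", "Panola", "Monroe", "San Juan", "Mesa", "Lancaster"],
    "Rate":   [6.4, 7.1, 6.9, 58.4, 5.8, 3.1],
})

rows_per_state = df["State"].value_counts()  # rows occupied per state
print(rows_per_state)
```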
So, it is clear that they occupy different numbers of rows. Why would this be so? Remember the County column? Yep! A different number of counties per state makes the number of rows occupied vary.
Don’t forget visuals
This first visual gives an easy view of the number of rows occupied per state. It is such a large graph; otherwise, the y-axis should be a bit more detailed, to give the viewer a better estimate of the number of rows occupied per state.
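If you want to reproduce a bar chart like that yourself, a minimal matplotlib sketch (on stand-in data, rendered off-screen to a file) could look like this:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in frame; the real chart has 47 state bars
df = pd.DataFrame({
    "State": ["Mississippi"] * 3 + ["Colorado"] * 2 + ["Nebraska"],
    "Rate":  [6.4, 7.1, 6.9, 58.4, 5.8, 3.1],
})

counts = df["State"].value_counts()
ax = counts.plot(kind="bar")          # one bar per state
ax.set_ylabel("Number of rows")
ax.set_title("Rows occupied per state")
plt.tight_layout()
plt.savefig("rows_per_state.png")
```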
Finding out the number of counties and the names of the counties:
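Counting the distinct counties is another one-liner; sketched here on stand-in data (the comment notes the real figure from the dataset):

```python
import pandas as pd

# Stand-in frame; the real dataset lists 1,752 distinct counties
df = pd.DataFrame({
    "State":  ["Mississippi", "Mississippi", "Colorado", "Colorado", "Nebraska"],
    "County": ["Newton", "Panola", "San Juan", "Mesa", "Lancaster"],
})

n_counties = df["County"].nunique()          # number of distinct counties
county_names = sorted(df["County"].unique())  # their names
print(n_counties)
print(county_names)
```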
1,752 counties are represented, and that is a lot to display. Another issue is that the counties are not evenly represented. Aha! Something we suspected in the State column but couldn't conclude there because of the varying number of counties per state. But since every month of the year should appear for every year, and each county should be repeated the same number of times per month, for each year, over the same number of years, we now know something is truly wrong with this dataset. Would this mean that the average unemployment rate for the entire country won't be quite accurate, and won't represent all states and counties well? Yes, definitely!
Next, picking up another clue: that large unemployment rate we saw in the statistical summary. Could it have been an error, since some outliers are said to be errors? Let's look at a boxplot:
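A boxplot of the Rate column takes only a few lines; here on a stand-in series with a long right tail, mimicking the real distribution:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in rates; the real column has a long tail of values above ~15%
rates = pd.Series([3.1, 4.2, 5.0, 5.8, 6.4, 6.9, 7.7, 9.3, 21.0, 58.4],
                  name="Rate")

plt.boxplot(rates)  # whiskers + fliers expose the high outliers
plt.ylabel("Unemployment rate (%)")
plt.title("Distribution of Rate")
plt.savefig("rate_boxplot.png")
```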
It turns out there are a lot of outliers. Outliers are sometimes said to be errors, but from the graph above I would say these aren't, given how many of them there are. It seems a lot of counties experienced this high rate of unemployment as well; it's just that, relative to the entire dataset, they are few. Below are displays of these high unemployment rates.
I'm curious to get a definite percentage figure for these outliers. They sure don't make our dataset pretty, and they raise a lot of confusion about whether they're truly outliers. But if I can confirm that there aren't many of them, and that they aren't just caused by a big gap in our dataset, then I'll gladly move along and pay them no more attention.
So, using the information from the boxplot, I extracted the number of these outliers and their percentage of the entire dataset, by counting the rows with an unemployment rate greater than 15%:
The number of rows with an unemployment rate greater than 15% is 13,868, and that is just 1.57% of our entire dataset. That's it, I'm done with the outlier thing. Oh, one more thing first:
A little more detail about the largest unemployment rate in the entire dataset:
The highest unemployment rate recorded is 58.4%, in 1992, in San Juan County, Colorado. This value is seriously high and poses lots of questions, like what actually happened that led to it. The only activity around that period that may have caused it is the after-effect of the recession that ran for 8 months, from July 1990 to March 1991, because there are no records of natural disasters, pandemics or anything of that nature! I won't give much detail on the other high unemployment rates, because I observed that they come from different years. No particular pattern. I mean, take a look:
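Pulling out that single record is an `idxmax` lookup; sketched here on a few stand-in rows that include the real San Juan maximum:

```python
import pandas as pd

# Stand-in rows; the maximum row mirrors the real San Juan County record
df = pd.DataFrame({
    "Year":   [1992, 1992, 2010],
    "State":  ["Colorado", "Mississippi", "Arizona"],
    "County": ["San Juan", "Newton", "Maricopa"],
    "Rate":   [58.4, 6.4, 9.3],
})

worst = df.loc[df["Rate"].idxmax()]  # row with the highest Rate
print(worst)
```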
But then, how about the states? Is there any pattern?
Oops! Sure! 39 states had unemployment rates above 15% in several counties, while 8 states managed to keep the rates in all their counties below 15% from 1990 to 2016. Those states are Delaware, Wyoming, Kansas, Connecticut, New Hampshire, Vermont, Nebraska and Rhode Island. Amazing!
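One way to find those 8 states is to take each state's maximum rate and keep the ones below the cutoff; a sketch on stand-in data:

```python
import pandas as pd

# Stand-in rows; on the real data 8 of the 47 states stay under 15%
df = pd.DataFrame({
    "State": ["Colorado", "Colorado", "Nebraska", "Delaware", "Arizona"],
    "Rate":  [58.4, 5.8, 3.1, 4.0, 21.0],
})

max_by_state = df.groupby("State")["Rate"].max()      # each state's worst rate
always_below_15 = sorted(max_by_state[max_by_state < 15].index)
print(always_below_15)
```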
So, I'd just move on to some other clue.
Zero Unemployment Rate:
Two major counties in Texas kept a 0% unemployment rate through a three-year span. This makes the dataset questionable, as a lot of people would argue that a zero unemployment rate is practically impossible. But then, if data don't lie, I'd smell problems coming in the following years, because a zero unemployment rate means every person in the workforce of that county was employed; it would then be difficult to scale up a business, hire more staff to grow it, or sack people who aren't performing up to the company's or organisation's standard. But that wasn't the case in the following years, so I would assume these rates are errors; they probably just weren't entered. My thoughts!
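Those zero-rate rows can be isolated with a simple equality filter; here on stand-in data (the county names below are hypothetical, not the actual Texas counties):

```python
import pandas as pd

# Stand-in rows; "Loving" is a hypothetical placeholder county here,
# not necessarily one of the two Texas counties the dataset flagged
df = pd.DataFrame({
    "Year":   [1990, 1991, 1992, 1990],
    "State":  ["Texas", "Texas", "Texas", "Nebraska"],
    "County": ["Loving", "Loving", "Loving", "Lancaster"],
    "Rate":   [0.0, 0.0, 0.0, 3.1],
})

zeros = df[df["Rate"] == 0.0]  # suspicious zero-unemployment rows
print(zeros[["Year", "State", "County"]])
```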
Observing Grouped Data
I cut my trail on individual state and county patterns and grouped the dataset instead, first per state, then per year:
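Both groupings are `groupby` + `mean` operations; a sketch on stand-in rows chosen so the averages echo the real findings quoted below (Nebraska 3.1%, Arizona 9.3%):

```python
import pandas as pd

# Stand-in rows; averages contrived to match the real Nebraska/Arizona figures
df = pd.DataFrame({
    "Year":  [1990, 1990, 2000, 2010],
    "State": ["Nebraska", "Arizona", "Nebraska", "Arizona"],
    "Rate":  [2.9, 8.9, 3.3, 9.7],
})

by_state = df.groupby("State")["Rate"].mean()  # average rate per state
by_year = df.groupby("Year")["Rate"].mean()    # average rate per year

print(by_state.idxmin(), by_state.min())  # lowest-average state
print(by_state.idxmax(), by_state.max())  # highest-average state
print(by_year)
```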
The state with the lowest average unemployment rate (over 1990 to 2016) is Nebraska (3.1%), and the state with the highest average unemployment rate is Arizona (9.3%).
The year with the lowest average unemployment rate was 2000 (4.32%). This suggests the US only started to feel the effects of the early-2000s recession after the year 2000, although other countries began experiencing it in 2000 itself. The highest average unemployment rate was in 2010 (9.19%). I would attribute this to the Great Recession, which took place between December 2007 and June 2009; in most recessions, unemployment keeps growing for at least a year after the recession ends.
So we can go on and on investigating (I'll let you do that). Did you pick up any trail to follow? You could investigate how the unemployment rate trended in the state with the highest average unemployment rate, or the one with the lowest. You could also investigate the pattern of unemployment rates (from January to December) for each year, or across months. There are so many patterns to be found. But don't forget to visualize.
You could check out my notebook for this Unemployment Rate Dataset HERE
So yeah, EDA,… maybe not such a big deal!