Pandas and EDA

For the last two weeks or so I have focused a lot of my time on figuring out Pandas syntax and working through the generic and very broad steps of EDA. Here is an example of working through a dataset and some snippets of code I used. I'll need to do another post with more tips; this one got long very quickly.

EDA stands for Exploratory Data Analysis. The first step in any data science project is getting to know your data. It's a bit like moving to a new city: you get a feel for the neighborhood, see what shops are around, and learn how it feels at night. You probably read the local newspaper or something similar. The goal is to build a basic idea of what you've gotten yourself into and to have a foundation to relate other information to.

EDA is kinda like that. There is a loosely structured process that changes depending on what type of data you have, but the high-level steps and goal are the same: explore what features you have, look at the data types and conversions you'll have to do, and decide what to do with nulls, NaNs, and other out-of-place data. Once you've cleaned the data and converted it to numeric features, you'll probably have a few ideas of where you'll want to explore next! Pairplots, correlation matrices, histograms, you name it!

For this walkthrough I'm using the fivethirtyeight drug-use-by-age.csv dataset, which can be found here: https://fivethirtyeight.com/datalab/how-baby-boomers-get-high/

The Process:

  1. Look at the data. Import it into a Jupyter notebook with pandas and get an idea of the shape, data types, and overall feel.

import pandas as pd
drug = pd.read_csv('drug-use-by-age.csv')
drug.head()

top 5 rows of our data frame, the … hides middle columns

drug.info()

column names, counts, data types for our data frame

Okay! So just by looking at these two pieces of information we know we've got what looks like four types of columns: age, n, drug-use, and drug-frequency. From the top 5 rows it looks like all the columns should be numeric, right? But from our drug.info() we can see this is not the case: objects are not numeric. Let's see why.

2. I like to take a look at the unique values in a column. I can imagine this could be an overwhelming result on larger datasets, but until I get to them, here's how I've dealt with it:

drug['age'].sort_values().unique()

unique values from the age column

Well, now it's obvious why the age column isn't numeric: it has values that are ranges denoted by '-'. Off the top of my head I can think of three ways we could deal with it.

First, we could leave it the way it is. This might affect our ability to graph drug use over age, especially if we want the age axis to be evenly spread. Second, we could duplicate each age-range row for every age in the range. For example: for the row '30-34', create rows 30, 31, 32, 33, and 34, then delete the original '30-34' row. This duplicates values in our data, so we'd have to think about the consequences of that method. Third, we could just create a row taking the median (midpoint) of the range. This wouldn't duplicate our data but would keep a consistent spread across the age axis.
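As a quick sketch of that third option (the helper name here is mine, not from the original code):

```python
def age_midpoint(age):
    """Turn an age label into one number: '22-23' -> 22.5, '65+' -> 65.0, '21' -> 21.0."""
    if '-' in age:
        low, high = age.split('-')
        return (int(low) + int(high)) / 2
    # strip a trailing '+' from open-ended groups like '65+'
    return float(age.rstrip('+'))
```

Applied with something like drug['age'].apply(age_midpoint), this gives one numeric age per row without duplicating any data.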

At this point, I'm thinking it'd be pretty cool to graph each drug's use over age to see if there are any trends, and which drugs are most popular at each age.

But let's deal with this later and take a look at why the other columns are object types too. This time I'll use groupby(column) to take a look at all values in a column, and size() to get the counts per value. This is helpful when you expect there could be many '-' entries in a column, and it gives you an idea of the scale of the change you'd make if/when you decide to replace '-' with another value.

drug.groupby('cocaine-frequency').size()

unique values and counts for cocaine-frequency

It has a '-' as well, but there is only one. That's good to know!

After doing the same for all the columns, we can tell that only '-' values are present and there are very few of them. Since we have so few rows, I'd rather not drop any from our data entirely. Instead, let's fill them in with zeros. Then we'll be able to convert all columns to numeric.

3. Remove the '-'s and convert the columns to numeric. We can do this with a simple loop. First, we'll loop through each column. But we only want to affect the columns that are objects, so we can say: if the column type is object, then replace the '-' with a zero. One caveat is that we still don't know what we're going to do with 'age', so let's skip it for now.

cols = drug.columns
for col in cols:
    if col != 'age':
        if drug[col].dtype == 'object':
            drug[col] = drug[col].str.replace('-', '0')
            drug[col] = pd.to_numeric(drug[col])

Let's call drug.info() again and see what we've got!

column names, counts, data types for our data frame, this time they’re all float64

4. I originally decided to go with the second option for the 'age' column. I then tried the third option as well, but liked the second one more visually. I imagine there must be an easier way to pull this off, but here's my code for doing the job.

code to duplicate each age-range row for every age in the range

The basic idea: split the age field into a low and a high value. Then, for the range of those values, build a small data frame that has the duplicate rows and concatenate it to the original dataframe. Once there, re-index it (because the duplicated rows carry over the copied index values), then loop through the age values again: if they have '-' or '+', remove them. Convert the column to numeric, and clean up the extra columns created by the re-indexing.
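A minimal sketch of that idea, using a toy frame in place of the real dataset (the column and values below are made up for illustration):

```python
import pandas as pd

# toy stand-in for the cleaned dataframe; values are made up
drug = pd.DataFrame({'age': ['21', '22-23', '65+'],
                     'alcohol-use': [83.2, 84.2, 73.3]})

expanded = []
for _, row in drug.iterrows():
    age = row['age']
    if '-' in age:
        # '22-23' -> one copy of the row for each year in the range
        low, high = age.split('-')
        years = range(int(low), int(high) + 1)
    else:
        # '65+' -> 65, '21' -> 21: a single row either way
        years = [int(age.rstrip('+'))]
    for year in years:
        new_row = row.copy()
        new_row['age'] = year
        expanded.append(new_row)

# rebuild the frame with a fresh index so the copied rows don't share index values
drug_age = pd.DataFrame(expanded).reset_index(drop=True)
```

After this, drug_age has one row per year rather than one per age group, which is what produces the 'stepped' look in the plots.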

5. With that done, we just need to subset the columns into drug-use and drug-frequency lists. We start with an empty list, loop through the column names, and append only the ones that contain the string 'use'. After I graphed these I realized I'd need a 'hard-use' list as well, so I built that here too.

# group drugs
use_list = []
for col in cols:
    if 'use' in col:
        use_list.append(col)

freq_list = []
for col in cols:
    if 'frequency' in col:
        freq_list.append(col)

hard_use = []
for col in use_list:
    hard_use.append(col)

hard_use.remove('alcohol-use')
hard_use.remove('marijuana-use')

hard_freq = []
for col in freq_list:
    hard_freq.append(col)

hard_freq.remove('alcohol-frequency')
hard_freq.remove('marijuana-frequency')
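The same grouping can be written more compactly with list comprehensions; a sketch with a hand-typed column list (the real code would use drug.columns):

```python
# a few of the dataset's column names, hand-typed for illustration
cols = ['age', 'n', 'alcohol-use', 'marijuana-use', 'heroin-use',
        'alcohol-frequency', 'marijuana-frequency', 'heroin-frequency']

use_list = [c for c in cols if 'use' in c]
freq_list = [c for c in cols if 'frequency' in c]

# "hard" drugs: everything except alcohol and marijuana
hard_use = [c for c in use_list if c not in ('alcohol-use', 'marijuana-use')]
hard_freq = [c for c in freq_list if c not in ('alcohol-frequency', 'marijuana-frequency')]
```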

6. We can finally graph drug use over age; let's hope it's all we wished for! As one would expect, alcohol and marijuana are both far more popular than most other drugs. Let's remove them and zoom in on the other drugs.

The 'stepped' slope is due to the duplicated rows, but it's much easier to read than having a single point in the middle of each range. In addition, it doesn't interpolate a lower drug use at the high end of the 50-65 age group.

drug_age.plot(x='age', y=use_list, title='All Drug Usage', figsize=(10,10))

all drug use over time

drug_age.plot(x='age', y=hard_use, title='Hard Drug Usage (no Alc/Mj)', figsize=(10,10))

hard drug use over time

I thought it'd also be interesting to take a look at the top 5 most popular drugs per age group. By shuffling the data around with a transpose, setting the index, and using a clever way to get the 5 maximum values per column, we get the result below!

# make a new dataframe and set the columns to the age groups
drug_use_b = drug_use_b.set_index('age').T

# create a dictionary and take the top 5 most popular drugs per age group
top5b = {}
for i in drug_use_b.columns:
    top5b[i] = drug_use_b[i].sort_values(ascending=False).head()

# fill the NaNs so it's easier to view
top5b = pd.DataFrame(top5b)
top5b.fillna('', inplace=True)
top5b

top 5 most popular drugs based on drug-use per age group
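As a side note, pandas has Series.nlargest, which combines the sort and head into one call; a sketch on made-up numbers:

```python
import pandas as pd

# made-up use rates (rows are drugs, columns are age groups)
df = pd.DataFrame({'12': [7.2, 1.1, 0.1],
                   '13': [18.5, 3.4, 0.6]},
                  index=['alcohol', 'marijuana', 'cocaine'])

# top 2 drugs per age group, blanks where a drug didn't make a column's top 2
top2 = pd.DataFrame({col: df[col].nlargest(2) for col in df.columns}).fillna('')
```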

As you can tell, there is a lot more we could do. That's the fun and frustration in data science. It's like art. Ever wonder how Picasso decided he was finished with one piece and ready to move on to the next? I bet (and hope) there's an intuition that builds over time: a feeling that there's more to find, or that you've exhausted your options. There are conclusions to draw from all this, but I'll leave that to you!

What do you see in the graph I’ve created? Would you create more trends? How else would you subset the drugs? What would you do with the frequency data?

Until next time..