Pandas, Plotting, Pizzazz, Brazil?

A Brazillion Ways to Explore Data

Caleb Neale
Jun 14 · 10 min read

To those newly initiated to the world of data science, the word pandas might still conjure up images of cute, fuzzy, friendly animals, seemingly built for nothing but being cute on camera and wasting otherwise productive hours browsing Google Images of bears eating leaves. But don’t worry! That’ll soon change. We’ll take you from a passive panda perceiver to a full-on Pandas professional, in just a few short minutes.

This article will assume some basic knowledge of Python, and a general idea of what the Pandas library is will be helpful, but not necessary. I’ll go over some basic concepts to get an overview and general understanding, but the focus of this article will be on applying matplotlib, pyplot, seaborn, and pandas to exploratory data analysis of a messy(ish) dataset. As we go, I’ll suggest explorations you can try on your own for practice.

Some of the questions I’ll answer here:

  • What is Pandas?
  • When is Pandas used?
  • How do you clean a dataset using Pandas?
  • How can visualizations aid in exploratory data analysis (EDA)?
  • What does exploratory data analysis look like?

What is Pandas?

Pandas is one of the premier packages for managing and cleaning data in the Python data science space. It allows for the neat containerization of data into Pandas objects called dataframes, is compatible with NumPy, the widely used Python numerical computing package, and can be combined with many other common data tools, such as SQL databases. Pandas also contains many functions for cleaning, manipulating, indexing, and performing operations on data in ways that are optimized and significantly faster than standard Python loops, which especially comes in handy when working with the very large datasets we can sometimes encounter.
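To make the speed point concrete, here’s a toy sketch (made-up data, not from this article’s dataset) comparing a pure-Python loop with the equivalent vectorized operation:

import numpy as np
import pandas as pd

# a million random values in a Series
s = pd.Series(np.random.rand(1_000_000))

# pure-Python loop: visits each element one at a time
loop_total = sum(v * 2 for v in s)

# vectorized: one call into optimized C code
vec_total = (s * 2).sum()

On most machines the vectorized version runs orders of magnitude faster; try %timeit in a notebook to see for yourself.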

At its most basic, a Pandas dataframe is a two-dimensional organization of data with rows and columns. Though rows and columns are referred to in many ways and many contexts throughout the data science world, I’ll try to stick to either rows and columns or datapoints and features. A dataframe column can be individually selected as a Series, the other common Pandas datatype. The Series and the dataframe are the foundation of the Pandas library, and the documentation does a good job explaining these datatypes, so you can read more here if you’re unfamiliar.
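If you’re new to these two datatypes, here’s a minimal sketch (with made-up values) of a dataframe and a column selected as a Series:

import pandas as pd

# a tiny dataframe: rows are datapoints, columns are features
df = pd.DataFrame({
    "CITY": ["A", "B", "C"],
    "POP": [1200, 3400, 560],
})

# selecting one column returns a Series
pop = df["POP"]
print(type(pop))   # <class 'pandas.core.series.Series'>
print(pop.mean())  # operations work directly on the Series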

When is Pandas Used?

This is an easy one. Are you working with data in Python? Use Pandas.

There are cases when a dataset is simply too large for a local runtime and additional strategies must be employed, though after a certain point (limited by either your patience or your machine) a switch to a tool designed for larger datasets will be required regardless.
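One common such strategy, sketched here with a hypothetical file path, is reading a CSV in chunks so that only a slice of it is in memory at a time:

import pandas as pd

# process a large CSV chunk by chunk instead of loading it all at once
total_rows = 0
for chunk in pd.read_csv("huge_dataset.csv", chunksize=100_000):
    total_rows += len(chunk)  # aggregate per chunk, keeping memory use flat
print(total_rows)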

Pandas is used when working with CSVs, data scraped from the web, datasets from Kaggle or other sources, or pretty much any other time you have data which takes the form of datapoints with multiple features.

How do you clean a dataset using Pandas?

The short answer is it depends. There are myriad strategies that can be employed, and often you’ll have to look up examples specific to your situation, such as dealing with categorical variables, strings, etc.

That’s not a very useful answer though. In an attempt to write an actually helpful article, I’ll highly recommend visiting Kaggle and simply noting the strategies experts and masters make use of, and the situations in which they’re used, building a data science cheatsheet of sorts where you can note basic tasks. I’ll link a rough one I made a while back here.

A helpful process can be writing down the end goals you want, figuring out a few substeps to get there, and then searching Stack Overflow or the Pandas documentation for the implementation. Data cleaning is much more about understanding the mindset of manipulating data into useful forms than it is about memorizing an algorithm.
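To give a flavor of what those substeps often look like in practice, here’s a small sketch (on a made-up dataframe) of a few common cleaning moves: string cleanup, missing values, and categorical encoding:

import pandas as pd

# made-up messy data
df = pd.DataFrame({
    "price": ["R$10", "R$25", None],
    "category": ["food", "tech", "food"],
})

# strings: strip a currency prefix and convert to a numeric type
df["price"] = df["price"].str.replace("R$", "", regex=False).astype(float)

# missing values: fill (or drop) depending on context
df["price"] = df["price"].fillna(df["price"].median())

# categorical variables: one-hot encode for modeling
df = pd.get_dummies(df, columns=["category"])
print(df)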

What Does EDA Look Like?

I’ll go through an example here and post the Kaggle kernel so you can take an even closer look, directly at the code. I highly recommend following along on your own, or in the notebook, looking up documentation as you go.

We’ll be working with the Brazilian Cities dataset from Kaggle, a fairly large dataset with a TON of features. Let’s see if we can employ Pandas and some creative visualizations to clean this up.

As a disclaimer, Seaborn is built on top of Matplotlib and its pyplot interface, hence their mention in the intro, but I prefer Seaborn’s functionality and style, so Matplotlib will mostly appear here just to set up figures.

# import pandas and numpy, as well as seaborn and pyplot for visualizations
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# import the os package to list Kaggle's input files
import os
print(os.listdir("../input"))
# read in the data and create a dataframe
df = pd.read_csv("../input/brazilian-cities/BRAZIL_CITIES.csv", sep=";", decimal=",")
# view a random sample of the dataframe
df.sample(25)

Good stuff! Let’s also look at the shape to get a sense of the magnitude of the set we’re looking at.

# get dataframe shape
df.shape

Out:

(5576, 81)

Whenever we do any kind of complicated work, it can be helpful to keep a deep copy of our dataframe on hand so that we can easily revert if and when we mess up our primary dataframe.

# create a deep copy of our dataframe
df_copy = df.copy(True)

Now that we’ve done this, let’s get to work!

This dataset has a fairly large number of features, which might make it hard to explore. Since we’re not looking to create a model, we don’t have to perform feature selection, so we can simply select a subset of columns we’d like to explore. Feel free to make a different list than mine!

columns = ['CITY', 'STATE', 'CAPITAL', 'IBGE_RES_POP', 'IBGE_RES_POP_BRAS',
           'IBGE_RES_POP_ESTR', 'IBGE_DU', 'IBGE_DU_URBAN', 'IBGE_DU_RURAL',
           'IBGE_POP', 'IBGE_1', 'IBGE_1-4', 'IBGE_5-9', 'IBGE_10-14',
           'IBGE_15-59', 'IBGE_60+', 'IBGE_PLANTED_AREA', 'IDHM', 'LONG',
           'LAT', 'ALT', 'ESTIMATED_POP', 'GDP_CAPITA', 'Cars',
           'Motorcycles', 'UBER', 'MAC', 'WAL-MART', 'BEDS']

# create reduced dataframe (as a copy, so later edits don't touch df) and check shape
r_df = df[columns].copy()
r_df.shape

Out:

(5576, 29)

Awesome! Much more manageable now. A really helpful tool for initial exploration is the Seaborn pairplot function. It graphs every variable against every other variable in one method call! Let’s see if it can help us here.

# create a seaborn pairplot
pp = sns.pairplot(r_df)
Seaborn Pairplot of the Data

Wow. Umm, okay. That’s huge, a little overwhelming, and not particularly helpful. Imagine if we had left all the features in! A good next step, especially if the pairplot is unhelpful, is trying a correlation matrix just to see if there are any linear relationships among variables.

corr = r_df.corr()

# I prefer one-sided matrices, so create a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# set up figure
f, ax = plt.subplots(figsize=(15, 15))

# diverging palette so positive and negative correlations read differently
cmap = sns.diverging_palette(220, 20, as_cmap=True)

sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
Correlation matrix of all features

The UBER column is completely blank in the matrix. Let’s look directly at the data to see what’s going on.

r_df.UBER

Out:

0       1.0
1       NaN
2       1.0
3       1.0
4       1.0
5       1.0
6       NaN
7       1.0
8       1.0
9       1.0
        ...
5571    NaN
5572    NaN
5573    NaN
5574    NaN
5575    NaN
Name: UBER, Length: 5576, dtype: float64

Lots of NaNs and no zeros; a possible error in the set.

# there are a lot of nans, possibly in place of zeros, let's check
df_copy.UBER.value_counts()

Out:

1.0    125

Yeah, no zeros in the whole column, and only 125 values out of over 5,000 rows. Let’s replace the NaNs with zeros and try again.

# fillna is a cleaner way to swap NaNs for zeros than an inplace replace
r_df['UBER'] = r_df['UBER'].fillna(0)
r_df.UBER.value_counts()

Out:

0.0    5451
1.0     125

Success! Let’s try the correlation matrix again. Running the same code, we get:

Corrected correlation matrix

Now let’s start looking at the relationships!

# let's investigate the strongest correlation first
sns.set_style('dark')
sns.scatterplot(x=r_df.IDHM, y=r_df.LAT)
Relationship of Latitude and IDHM (Human Development Index)

This shows what we expected from the correlation matrix, but it doesn’t really supply a lot of meaning to most people; without an idea of the latitudes of Brazil or a better idea of what IDHM (a human development index) is, these are just points. Let’s contextualize them within the geography of the country using latitude and longitude.

# map of lat and long with IDHM determining size
f, ax = plt.subplots(figsize=(8, 8))
sns.scatterplot(x=r_df.LONG, y=r_df.LAT, size=r_df.IDHM)
Attempt #2 at contextualizing IDHM trends

Here we can see a rough outline of the nation of Brazil. Overlaying this on an actual map might be even more helpful, but we’re going to skip that here. We can see a bit of a trend, but with so many points it’s hard to distinguish the sizes. Let’s try adding a color encoding.

# it's hard to see any trends here, let's add color to get a better idea
f, ax = plt.subplots(figsize=(8, 8))
sns.scatterplot(x=r_df.LONG, y=r_df.LAT, size=r_df.IDHM, hue=r_df.IDHM)
Attempt #3 at contextualizing IDHM trends, much better!

Fantastic! Here we can clearly see a trend of higher IDHM towards the center and south of the country. In your own exploration, maybe you can find characteristics of these parts of the country which may cause this.

Let’s see if we can identify the state capital cities in this plot.

# let's see if we can spot any capitals in there
f, ax = plt.subplots(figsize=(8, 8))
markers = {0:'o', 1:'s'}
sns.scatterplot(x=r_df.LONG, y=r_df.LAT, size=r_df.IDHM, hue=r_df.IDHM,
                style=r_df.CAPITAL, markers=markers)
Attempt #4 at contextualizing IDHM trends, with markers added for capital cities

Here we’re facing the same problem as in attempt #2. We can’t see the additionally encoded data! Let’s try an overlay and see if that’s clearer.

f, ax = plt.subplots(figsize=(8, 8))
sns.scatterplot(x=r_df.LONG, y=r_df.LAT, size=r_df.IDHM, hue=r_df.IDHM)
sns.scatterplot(x=r_df[r_df.CAPITAL==1].LONG, y=r_df[r_df.CAPITAL==1].LAT, s=100)
Attempt #5 at contextualizing IDHM trends, with an overlay added for capital cities

Great. Now we can see the capital cities as well, and they very clearly trend towards the east of the country, towards the coast. Not surprising, yet very interesting to see all the same. Let’s see if we can give GDP per capita a similar treatment, reusing most of our code.

f, ax = plt.subplots(figsize=(8, 8))
sns.scatterplot(x=r_df.LONG, y=r_df.LAT, size=r_df.GDP_CAPITA, hue=r_df.GDP_CAPITA)
GDP per capita encoded with latitude and longitude

Hmm. It looks like there aren’t enough color bins to show all the trends in GDP here, likely meaning the data has a large spread and/or skew. Let’s investigate.

# let's take a look at the distribution, after taking care of nans
f, ax = plt.subplots(figsize=(12, 8))
gdp = r_df.GDP_CAPITA.dropna()
sns.distplot(gdp)
Distribution of GDP per Capita
# it looks like gdp is heavily right skewed with a massive tail. 
# it seems likely that those massive outliers are errors, and could be removed in some cases
gdp.describe()

Out:

count      5573.000000
mean      21129.767244
std       20327.836119
min        3190.570000
25%        9061.720000
50%       15879.960000
75%       26156.990000
max      314637.690000

A huge tail and a significant skew, which will make this data much more difficult to encode with color. If we wanted to, we could remove the outliers or put the data on a different scale, such as a log scale. We’ll mostly skip that here, but a quick sketch of the log option follows.
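If you want to try the log route yourself, here’s a minimal sketch reusing the gdp Series and the imports from above (log1p is log(1 + x), so zeros wouldn’t break the transform):

# GDP per capita on a log scale to tame the skew
f, ax = plt.subplots(figsize=(12, 8))
sns.distplot(np.log1p(gdp))

With that aside, let’s investigate one last variable. Uber! Let’s see how they’re doing.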

f, ax = plt.subplots(figsize=(12, 8))
sns.countplot(r_df['UBER'])
Countplot of Uber in Brazilian Cities

We can see here that the vast majority of Brazilian cities don’t have Uber. Let’s see where they are.

f, ax = plt.subplots(figsize=(8, 8))
sns.scatterplot(x=r_df[r_df.UBER==0].LONG, y=r_df[r_df.UBER==0].LAT)
sns.scatterplot(x=r_df[r_df.UBER==1].LONG, y=r_df[r_df.UBER==1].LAT)
Distribution of cities, with the presence of Uber encoded in orange

We can see similar trends to IDHM here, with clustering by the coast and in the south of the country. Try checking out IDHM and Uber together on your own; one starting point is sketched below.
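For instance, a simple first step (reusing r_df from above) is a box plot of IDHM split by the presence of Uber:

# IDHM distributions for cities with and without Uber
f, ax = plt.subplots(figsize=(8, 6))
sns.boxplot(x=r_df['UBER'], y=r_df['IDHM'])

Now let’s look at Uber’s relationship with cars.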

f, ax = plt.subplots(figsize=(16, 12))
sns.boxplot(y=r_df['Cars'], x=r_df['UBER'])
Box plot of the distribution of the number of cars in Brazilian cities, separated by the presence of Uber.

Oof. That’s not pretty or useful. Let’s remove those giant outliers and get a better look.

# filter to cities with fewer than 100,000 cars to tame the outliers
ubers = r_df[r_df.Cars < 100000].UBER
car_vals = r_df[r_df.Cars < 100000].Cars
sns.boxplot(x=ubers, y=car_vals)
Box plot of the distribution of the number of cars (Cars < 100,000) in Brazilian cities, separated by the presence of Uber.

This is a really interesting distribution. The minimum bound for cities with Uber is above the maximum bound for cities without. Try investigating other variables related to cars on your own to see why; my guess would be something to do with population or GDP. Also, the presence of outliers among the non-Uber cities suggests a large skew/tail. Let’s take a look.

f, ax = plt.subplots(figsize=(16, 12))
sns.distplot(r_df[(r_df.Cars < 100000) & (r_df.UBER==0)].Cars)
sns.distplot(r_df[(r_df.Cars < 100000) & (r_df.UBER==1)].Cars, bins=20)

As expected! The majority of cities have either no cars or very few. Good work!

What we learned:

  • What is Pandas? A very useful Python package for data cleaning and manipulation.
  • When is Pandas used? Pretty much anytime you need to work with data in Python!
  • How do you clean a dataset using Pandas? With a curious mind, and lots of searching pandas documentation, paired with helpful examples.
  • How can visualizations aid in exploratory data analysis (EDA)? Visualizations are KEY in exploratory data analysis. This is easier shown than explained; check out the example in the article.
  • What does exploratory data analysis look like? See above!

Want to explore more? Try these:

  • Explore population data (a starter sketch follows this list)
  • See how population varies with other supplied categorical values
  • Look at how GDP per capita varies with the presence of other industries
  • Add back in all variables/subset with different variables and create more correlation matrices to explore additional trends
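As a nudge on that first suggestion, here’s a minimal sketch reusing r_df and the imports from the walkthrough (population, like GDP, is likely heavily skewed, so a log transform helps):

# population distribution on a log scale
f, ax = plt.subplots(figsize=(12, 8))
sns.distplot(np.log1p(r_df['ESTIMATED_POP'].dropna()))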

Thanks for reading!
