# Pandas, Plotting, Pizzaz, Brazil?

## A Brazillion Ways to Explore Data

To those newly initiated to the world of data science, the word pandas might still conjure up images of cute, fuzzy, friendly animals, seemingly built for nothing but being cute on camera and wasting otherwise productive hours browsing Google Images of bears eating leaves. But don’t worry! That’ll soon change. We’ll take you from a passive panda perceiver to a full-on Pandas professional in just a few short minutes.

This article assumes some basic knowledge of Python; a general idea of what the Pandas library is will be helpful, but isn’t necessary. I’ll go over some introductory concepts for a general understanding, but the focus of this article is the application of Matplotlib, Seaborn, and Pandas in exploratory data analysis of a messy(ish) dataset. As we go, I’ll suggest explorations you can try on your own for practice.

Some of the questions I’ll answer here:

- What is Pandas?
- When is Pandas used?
- How do you clean a dataset using Pandas?
- How can visualizations aid in exploratory data analysis (EDA)?
- What does exploratory data analysis look like?

# What is Pandas?

Pandas is one of the premier packages for managing and cleaning data in the Python data science space. It allows for the neat containerization of data into Pandas objects called dataframes, is built to interoperate with the widely used numerical computing package NumPy, and can read from and write to many common data sources, including SQL databases. Pandas also contains many functions for cleaning, manipulating, indexing, and performing operations on data, implemented in a way that is optimized and significantly faster than standard Python loops, which especially comes in handy when working with the very large datasets we can sometimes encounter.
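To make that speed claim concrete, here’s a minimal sketch (toy data, nothing to do with the dataset below) contrasting a vectorized Pandas operation with its pure-Python equivalent:

```python
import numpy as np
import pandas as pd

# a million-row Series of toy values
s = pd.Series(np.arange(1_000_000))

# vectorized: one expression, executed in optimized C under the hood
doubled = s * 2

# the pure-Python equivalent does the same work element by element
# and is typically orders of magnitude slower:
# doubled_slow = pd.Series([x * 2 for x in s])

print(doubled.iloc[-1])  # 1999998
```

Same result either way; the vectorized form just lets NumPy do the looping for you.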

At its most basic, a Pandas dataframe is a two-dimensional organization of data with rows and columns. Though rows and columns are referred to in many ways and many contexts throughout the data science world, I’ll try to stick to either rows and columns or datapoints and features. Dataframe columns can be individually selected as a Series, the other common Pandas datatype. The Series and DataFrame are the foundation of the Pandas library, and the documentation does a good job explaining these datatypes, so you can read more here if you’re unfamiliar.
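For the unfamiliar, a tiny sketch (with made-up numbers) of the DataFrame/Series relationship described above:

```python
import pandas as pd

# a small DataFrame: each row is a datapoint, each column a feature
df = pd.DataFrame({
    "city": ["Sao Paulo", "Rio de Janeiro", "Salvador"],
    "population": [12_300_000, 6_700_000, 2_900_000],
})

# selecting a single column returns a Series
pop = df["population"]

print(type(df).__name__)   # DataFrame
print(type(pop).__name__)  # Series
print(pop.max())           # 12300000
```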

# When is Pandas Used?

This is an easy one. Are you working with data in Python? Use Pandas.

There are cases when a dataset is simply too large for a local runtime, and additional strategies must be employed, such as processing the data in chunks. After a certain point (limited either by your patience or your machine), a switch to a tool designed for larger datasets will be required regardless.
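One such strategy is `read_csv`’s `chunksize` parameter, which yields the file piece by piece so you can aggregate as you go. A minimal sketch, using an in-memory CSV as a stand-in for a file too big to load at once:

```python
import io
import pandas as pd

# in-memory CSV standing in for a large file on disk
csv_data = "city;value\nA;1\nB;2\nC;3\nD;4\nE;5\n"

# read two rows at a time; only one chunk is in memory at any moment
total = 0
for chunk in pd.read_csv(io.StringIO(csv_data), sep=";", chunksize=2):
    total += chunk["value"].sum()

print(total)  # 15
```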

Pandas is used when working with CSVs, data scraped from the web, datasets from Kaggle or other sources, or pretty much any other time you have data which takes the form of datapoints with multiple features.

# How do you clean a dataset using Pandas?

The short answer is it depends. There are a myriad of strategies that can be employed, and often you’ll have to look up examples specific to your situation, such as dealing with categorical variables, strings, etc.
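To make that concrete, here’s a small sketch (toy data, hypothetical column names) of a few common cleaning moves for strings and categoricals:

```python
import pandas as pd

# messy toy data: inconsistent strings and a numeric column with junk
df = pd.DataFrame({
    "state": [" sp", "RJ ", "sp", "BA"],
    "gdp": ["1.5", "2.0", "bad", "3.1"],
})

# normalize the strings: strip whitespace, uppercase
df["state"] = df["state"].str.strip().str.upper()

# coerce numeric strings, turning junk into NaN, then fill with the median
df["gdp"] = pd.to_numeric(df["gdp"], errors="coerce")
df["gdp"] = df["gdp"].fillna(df["gdp"].median())

# store the repeated labels as a categorical dtype
df["state"] = df["state"].astype("category")

print(df["gdp"].tolist())  # [1.5, 2.0, 2.0, 3.1]
```

Whether the median fill (or any fill at all) is appropriate depends entirely on your data, which is exactly the "it depends" above.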

That’s not a very useful answer though. In an attempt to write an actually helpful article, I’ll highly recommend visiting Kaggle and simply noting the strategies experts and masters make use of, and the situations in which they’re used, building a data science cheatsheet of sorts where you can note basic tasks. I’ll link a rough one I made a while back here.

A helpful process can actually be writing down end goals you want, attempting to figure out a few substeps to get there, and then searching Stack Overflow or Pandas documentation for the implementation. Data cleaning is much more about understanding the mindset of how to manipulate data into useful forms than it is about memorizing an algorithm.

# What Does EDA Look Like?

I’ll go through an example here, and post the Kaggle Kernel to take an even closer look, directly at the code. I highly recommend following along on your own, or in the notebook, looking up documentation as you go along.

This is a fairly large dataset with a TON of features. Let’s see if we can employ Pandas and some creative visualizations to clean this up.

As a disclaimer, Seaborn is built on top of Matplotlib (and its pyplot interface), hence their mention in the intro, but I prefer Seaborn’s functionality and style, so Matplotlib will mostly appear here only for figure setup.

```python
# import pandas and numpy, as well as seaborn and matplotlib for visualizations
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# import the os package for Kaggle
import os
print(os.listdir("../input"))

# read in the data and create a dataframe
df = pd.read_csv("../input/brazilian-cities/BRAZIL_CITIES.csv", sep=";", decimal=",")

# view a random sample of the dataframe
df.sample(25)
```

Good stuff! Let’s also look at the shape to get a sense of the magnitude of the set we’re looking at.

```python
# get dataframe shape
df.shape
```

Out:

```
(5576, 81)
```

Whenever we do any type of complicated work, it can be helpful to have a deep copy of our dataframe on hand so that we can more easily revert if and when we mess up our primary dataframe.

```python
# create a deep copy of our dataframe
df_copy = df.copy(True)
```

Now that we’ve done this, let’s get to work!

This dataset has a fairly large number of features, which might make it hard to explore. Since we’re not looking to create a model, we don’t *have* to perform feature selection, so we can simply select a subset of columns we’d like to explore. Feel free to make a different list than my own!

```python
columns = ['CITY', 'STATE', 'CAPITAL', 'IBGE_RES_POP', 'IBGE_RES_POP_BRAS',
           'IBGE_RES_POP_ESTR', 'IBGE_DU', 'IBGE_DU_URBAN', 'IBGE_DU_RURAL',
           'IBGE_POP', 'IBGE_1', 'IBGE_1-4', 'IBGE_5-9', 'IBGE_10-14',
           'IBGE_15-59', 'IBGE_60+', 'IBGE_PLANTED_AREA', 'IDHM', 'LONG',
           'LAT', 'ALT', 'ESTIMATED_POP', 'GDP_CAPITA', 'Cars', 'Motorcycles',
           'UBER', 'MAC', 'WAL-MART', 'BEDS']

# create reduced dataframe and check shape
r_df = df[columns]
r_df.shape
```

Out:

```
(5576, 29)
```

Awesome! Much more manageable now. A really helpful tool for initial exploration is the Seaborn `pairplot` function. It graphs every variable against every other variable in a single call! Let’s see if it can help us here.

```python
# create a seaborn pairplot
pp = sns.pairplot(r_df)
```

Wow. Umm, okay. That’s huge, a little overwhelming, and not particularly helpful. Imagine if we had left all the features in! A good next step, especially when the pairplot proves unhelpful, is trying a correlation matrix, just to see if there’s any linear relationship among variables.

```python
# compute pairwise correlations
corr = r_df.corr()

# I prefer one-sided matrices, so create a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# set up figure
f, ax = plt.subplots(figsize=(15, 15))
cmap = sns.diverging_palette(220, 20, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
```

The `UBER` column is completely blank in the matrix; let’s look directly at the data to see what’s going on.

```python
r_df.UBER
```

Out:

```
0       1.0
1       NaN
2       1.0
3       1.0
4       1.0
       ...
5572    NaN
5573    NaN
5574    NaN
5575    NaN
```

Lots of NaNs and no zeros: a possible error in the set.

```python
# there are a lot of nans, possibly in place of zeros, let's check
df_copy.UBER.value_counts()
```

Out:

```
1.0    125
```

Yeah, no zeros in the whole column, and only 125 values out of over 5,000 rows. Let’s replace NaNs with zeros and try again.

```python
# replace NaNs with zeros; assigning back avoids pandas' chained-assignment warning
r_df['UBER'] = r_df['UBER'].replace({np.nan: 0})
r_df.UBER.value_counts()
```

Out:

```
0.0    5451
1.0     125
```

Success! Let’s try the correlation matrix again. Running the same code, we get:

Now let’s start looking at the relationships!

```python
# let's investigate the strongest correlation first
sns.set_style('dark')
sns.scatterplot(x=r_df.IDHM, y=r_df.LAT)
```

This shows what we expected from the correlation matrix, but it doesn’t supply much meaning to most people: without an idea of the latitudes of Brazil or a better idea of what IDHM (the Municipal Human Development Index) is, they’re kinda just meaningless points. Let’s contextualize within the geography of the country using latitude and longitude.

```python
# map of lat and long with IDHM determining size
f, ax = plt.subplots(figsize=(8, 8))
sns.scatterplot(x=r_df.LONG, y=r_df.LAT, size=r_df.IDHM)
```

Here we can see a rough outline of the nation of Brazil. Overlaying this on a map might be even more helpful, but we’re going to skip that here. We can see a bit of a trend, but with so many points it’s hard to distinguish the sizes. Let’s try adding a color encoding.

```python
# it's hard to see any trends here, let's add color to get a better idea
f, ax = plt.subplots(figsize=(8, 8))
sns.scatterplot(x=r_df.LONG, y=r_df.LAT, size=r_df.IDHM, hue=r_df.IDHM)
```

Fantastic! Here we can clearly see a trend of higher IDHM towards the center and south of the country. In your own exploration, maybe you can find characteristics of those regions that might explain this.

Let’s see if we can identify the state capital cities in this plot.

```python
# let's see if we can spot any capitals in there
f, ax = plt.subplots(figsize=(8, 8))
markers = {0: 'o', 1: 's'}
sns.scatterplot(x=r_df.LONG, y=r_df.LAT, size=r_df.IDHM, hue=r_df.IDHM,
                style=r_df.CAPITAL, markers=markers)
```

Here we’re facing the same problem as in attempt #2. We can’t see the additionally encoded data! Let’s try an overlay and see if that’s clearer.

```python
# plot all cities, then overlay the capitals as larger points
f, ax = plt.subplots(figsize=(8, 8))
sns.scatterplot(x=r_df.LONG, y=r_df.LAT, size=r_df.IDHM, hue=r_df.IDHM)
sns.scatterplot(x=r_df[r_df.CAPITAL == 1].LONG, y=r_df[r_df.CAPITAL == 1].LAT, s=100)
```

Great. Now we can see the capital cities as well, and we can see that they very strongly trend towards the east of the country, towards the coast. Not surprising, yet very interesting to see all the same. Let’s see if we can give GDP per capita a similar treatment, kind of reusing our code.

```python
# same treatment for GDP per capita
f, ax = plt.subplots(figsize=(8, 8))
sns.scatterplot(x=r_df.LONG, y=r_df.LAT, size=r_df.GDP_CAPITA, hue=r_df.GDP_CAPITA)
```

Hmm. It looks like there aren’t enough color bins to show all the trends in GDP here, likely meaning the data has a large spread and/or skew. Let’s investigate.

```python
# let's take a look at the distribution, after taking care of nans
f, ax = plt.subplots(figsize=(12, 8))
gdp = r_df.GDP_CAPITA.dropna()
sns.distplot(gdp)

# it looks like gdp is heavily right skewed with a massive tail;
# it seems likely that those massive outliers are errors, and could be removed in some cases
gdp.describe()
```

Out:

```
count      5573.000000
mean      21129.767244
std       20327.836119
min        3190.570000
25%        9061.720000
50%       15879.960000
75%       26156.990000
max      314637.690000
```

A huge tail, and a significant skew. This will make the data much more difficult to encode with color. If we wanted to, we could remove outliers or try a different scale on the data, potentially log, but we’ll skip that for now and investigate one last variable. Uber! Let’s see how they’re doing.
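For the curious, the log idea looks something like this. A minimal sketch with made-up skewed values (not the real GDP column):

```python
import numpy as np
import pandas as pd

# toy right-skewed values standing in for GDP per capita
gdp = pd.Series([3000.0, 9000.0, 16000.0, 26000.0, 300000.0])

# log1p compresses the long right tail while preserving order
log_gdp = np.log1p(gdp)

# the max/min ratio shrinks dramatically, making color encoding feasible
print(gdp.max() / gdp.min())                    # 100.0
print(round(log_gdp.max() / log_gdp.min(), 2))  # 1.58
```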

```python
# count cities with and without Uber
f, ax = plt.subplots(figsize=(12, 8))
sns.countplot(r_df['UBER'])
```

We can see here that the *vast* majority of Brazilian cities *don’t* have Uber. Let’s see where they are.

```python
# plot non-Uber cities, then overlay Uber cities
f, ax = plt.subplots(figsize=(8, 8))
sns.scatterplot(x=r_df[r_df.UBER == 0].LONG, y=r_df[r_df.UBER == 0].LAT)
sns.scatterplot(x=r_df[r_df.UBER == 1].LONG, y=r_df[r_df.UBER == 1].LAT)
```

We can see similar trends to IDHM here, with clustering by the coast and the south of the country. Try checking out IDHM and Uber on your own! Let’s look at Uber’s relationship with cars.

```python
# boxplot of car counts, split by Uber presence
f, ax = plt.subplots(figsize=(16, 12))
sns.boxplot(y=r_df['Cars'], x=r_df['UBER'])
```

Oof. That’s not pretty or useful. Let’s remove those giant outliers and get a better look.

```python
# drop the extreme car counts, then re-plot
ubers, car_vals = r_df[r_df.Cars < 100000].UBER, r_df[r_df.Cars < 100000].Cars
sns.boxplot(x=ubers, y=car_vals)
```

This is a really interesting distribution. The lower whisker for cities with Uber sits above the upper whisker for cities without. Try investigating other variables related to cars on your own to see why this is; my guess would be something to do with population or GDP. Also, the presence of outliers among the non-Uber cities suggests a large skew/tail. Let’s take a look.

```python
# overlay the car-count distributions for non-Uber and Uber cities
f, ax = plt.subplots(figsize=(16, 12))
sns.distplot(r_df[(r_df.Cars < 100000) & (r_df.UBER == 0)].Cars)
sns.distplot(r_df[(r_df.Cars < 100000) & (r_df.UBER == 1)].Cars, bins=20)
```

As expected! The majority of cities have either no cars or very few. Good work!

# What we learned:

**What is Pandas?** A very useful Python package for data cleaning and manipulation.

**When is Pandas used?** Pretty much anytime you need to work with data in Python!

**How do you clean a dataset using Pandas?** With a curious mind, lots of searching of the Pandas documentation, and helpful examples.

**How can visualizations aid in exploratory data analysis (EDA)?** Visualizations are KEY in exploratory data analysis. This is easier shown than explained; check out the example in the article.

**What does exploratory data analysis look like?** See above!

# Want to explore more? Try these:

- Explore population data
- See how population varies with other supplied categorical values
- Look at how GDP per capita varies with the presence of other industries
- Add back in all variables/subset with different variables and create more correlation matrices to explore additional trends
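As a starting point for the second bullet, here’s a `groupby` sketch on toy data (column names borrowed from the dataset, numbers made up):

```python
import pandas as pd

# toy stand-in for the cities dataframe
df = pd.DataFrame({
    "STATE": ["SP", "SP", "RJ", "BA"],
    "ESTIMATED_POP": [12_000_000, 700_000, 6_700_000, 2_900_000],
})

# total estimated population per state, largest first
pop_by_state = (
    df.groupby("STATE")["ESTIMATED_POP"]
      .sum()
      .sort_values(ascending=False)
)
print(pop_by_state.index[0], pop_by_state.iloc[0])  # SP 12700000
```

Swap in other columns from the real dataframe (`IBGE_DU_URBAN`, `BEDS`, etc.) to explore other categorical splits.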

Thanks for reading!