Analyze open data sets using pandas in a Python notebook
Open data is freely available, which means you can modify, store, and use it without any restrictions. Governments, academic institutions, and publicly focused agencies are the most common providers of open data. They typically share things like environmental, economic, census, and health data sets. You can learn more about open data from The Open Data Institute or from wikipedia.
Two great places to start browsing are data.gov and data.gov.uk where you can find all sorts of data sets. Other good sources are the World Bank, the FAO, eurostat and the bureau for labor statistics. If you’re interested in a specific country or region, just do a quick Google search, and you’ll likely uncover other sources as well.
Open data can be a powerful analysis tool, especially when you connect multiple data sets to derive new insights. This tutorial features a notebook that helps you get started with analysis using pandas. Pandas is one of my favorite data analysis packages. It’s very flexible and includes tools that make it easy to load, index, classify, and group data.
In this tutorial, you will learn how to work with a
DataFrame in 2 basic steps:
- Load data from open data sets into a Python notebook in Data Science Experience.
- Work with a Python notebook on Data Science Experience (join data frames, clean, check, and analyze the data using simple statistical tools).
Data & analytics on Data Science Experience
Data Science Experience features a selection of open data sets that you can download and use any way you want. It’s easy to get an account, start a notebook, and grab some data:
- Sign in to Data Science Experience (or sign up for a free trial).
- Open the sample notebook called Analyze open data sets with pandas DataFrames . To open the sample notebook, click here (or type its name in the Search field on the home page of Data Science Experience and select the card for the notebook), then click the button on the top of the preview page that opens. Select a project and Spark service and click Create Notebook. The sample notebook opens for you to work with.
- Find the first data set and get its access key URL.
- From the Data Science Experience home page, search for “life expectancy”.
- Click the card with the title Life expectancy at birth by country in total years.
- Click the Manage Access Keys button.
- Click Request a New Access Key.
- Copy the access key URL, and click Close. You’ll use this link in a minute to load data into the Python notebook.
Tip: If you don’t want to run the commands yourself, you can also just open the notebook in your browser and follow along: https://apsportal.ibm.com/exchange/public/entry/view/47ed96c50374ccd15f93ef262c1af63b
Load data into a DataFrame
Paste the access key URL you copied from the Life Expectancy data set into the following code (replacing the <LINK-TO-DATA> string). Then run the following code to load the data in a data frame. This code keeps 3 columns and renames them.
import pandas as pd
import numpy as np
# life expectancy at birth in years
life = pd.read_csv("<LINK-TO-DATA>",usecols=['Country or Area','Year','Value'])
life.columns = ['country','year','life']
Life expectancy figures might be more meaningful if we combine them with other open data sets from Data Science Experience. Let’s start by loading the data set Total Population by country. To do so, find the data set on the DSX home page, request an access key for it, and replace <LINK-TO-DATA> with your access key URL in the following code. Then run the code.
population = pd.read_csv("<LINK-TO-DATA>",usecols=['Country or Area', 'Year','Value'])
population.columns = ['country', 'year','population']
print "Nr of countries in life:", np.size(np.unique(life['country']))
print "Nr of countries in population:", np.size(np.unique(population['country']))
Nr of countries in life: 246
Nr of countries in population: 277
Joining data frames
These two data sets don’t fit together perfectly. For instance, one lists more countries than the other. When we join the two data frames we’re sure to introduce nulls or NaNs into the new data frame. We’ll use the pandas
merge function to handle this problem. This function includes many options. In the following code,
how='outer' makes sure we keep all data from
on=['country','year'] specifies which columns to perform the merge on.
df = pd.merge(life, population, how='outer', sort=True,
We can add more data to the data frame in a similar way. For each data set in the following list, find the data set on the DSX home page, request an access key URL, and copy the the URL into the code (again replacing the <LINK-TO-DATA> string with the corresponding access key URL):
- Population below national poverty line, total, percentage
- Primary school completion rate % of relevant age group by country
- Total employment, by economic activity (Thousands)
- Births attended by skilled health staff (% of total) by country
- Measles immunization % children 12–23 months by country
# poverty (%)
poverty = pd.read_csv("<LINK-TO-DATA>",usecols=['Country or Area', 'Year','Value'])
poverty.columns = ['country', 'year','poverty']
df = pd.merge(df, poverty, how='outer', sort=True, on=['country','year'])
# school completion (%)
school = pd.read_csv("<LINK-TO-DATA>",usecols=['Country or Area', 'Year','Value'])
school.columns = ['country', 'year','school']
df = pd.merge(df, school, how='outer', sort=True, on=['country','year'])
employmentin = pd.read_csv("<LINK-TO-DATA>",usecols=['Country or Area','Year','Value','Sex','Subclassification'])
employment = employmentin.loc[(employmentin.Sex=='Total men and women') & (employmentin.Subclassification=='Total.')]
employment = employment.drop('Sex', 1)
employment = employment.drop('Subclassification', 1) employment.columns = ['country', 'year','employment']
df = pd.merge(df, employment, how='outer', sort=True, on=['country','year'])
# births attended by skilled staff (%)
births = pd.read_csv("<LINK-TO-DATA>",usecols=['Country or Area', 'Year','Value'])
births.columns = ['country', 'year','births']
df = pd.merge(df, births, how='outer', sort=True, on=['country','year'])
# measles immunization (%)
measles = pd.read_csv("<LINK-TO-DATA>",usecols=['Country or Area', 'Year','Value'])
measles.columns = ['country', 'year','measles']
df = pd.merge(df, measles, how='outer', sort=True, on=['country','year'])
The resulting table looks kind of strange, as it contains incorrect values, like numbers in the
country column and text in the
year column. You can manually remove these errors from the data frame. Also, we can now create a multi-index with country and year.
df2 = df2.set_index(['country','year'])
If you are curious about other variables, you can keep adding data sets from Data Science Experience to this data frame. Be aware that not all data is equally formatted and might need some clean-up before you add it. Use the code samples you just read about, and make sure you keep checking results with a quick look at each of your tables when you load or change them with commands like
Check the data
You can run a first check of the data with
describe(), which calculates some basic statistics for each of the columns in the dataframe. It gives you the number of values (count), the mean, the standard deviation (std), the min and max, and some percentiles.
At this point, we have enough sample data to work with. Let’s start by finding the correlation between different variables. First we’ll create a scatter plot, and relate the values for two variables of each row. In our code, we also customize the look by defining the font and figure size and colors of the points with matplotlib.
import matplotlib.pyplot as plt
fig, axes=plt.subplots(nrows=1, ncols=2)
df2.plot(kind='scatter', x='life', y='population', ax=axes, color='Blue');
df2.plot(kind='scatter', x='life', y='school', ax=axes, color='Red');
The figure on the left shows that increased life expectancy leads to higher population. The figure on the right shows that the life expectancy increases with the percentage of school completion. But the percentage ranges from 0 to 200, which is odd for a percentage. You can remove the outliers by keeping the values within a specified range
Even better, would be to check where these values in the original data came from. In some cases, a range like this could indicate an error in your code somewhere. In this case, the values are correct, see the description of the school completion data.
We don’t have data for all the exact same years. So we’ll group by country (be aware that we lose some information by doing so). Also because variables are percentages, we’ll convert our employment figures to percent. Probably, we no longer need the
population column, so let's drop it. Then we create scatter plots from the data frame using
scatter_matrix, which creates plots for all variables and also adds a histogram for each.
from pandas.tools.plotting import scatter_matrix
# group by country
grouped = df2.groupby(level=0)
dfgroup = grouped.mean()
# employment in % of total population
scatter_matrix(dfgroup,figsize=(12, 12), diagonal='kde')
You can see that the data is now in a pretty good state. There are no large outliers. We can even start to see some relationships: life expectancy increases with schooling, employment, safe births, and measles vaccination. You are deriving insights from the data and can now build a statistical model — for instance, have a look at an ordinary least squares regression (OLS) from StatsModels.
In this tutorial, you learned how to use open data from Data Science Experience in a Python notebook. You saw how to load, clean and explore data using pandas. As you can see from this example, data analysis entails lots of trial and error. This experimentation can be challenging, but is also a lot of fun!
Originally published at datascience.ibm.com on August 30, 2016.