Hands-on Python pandas.DataFrame
A tutorial introduction to Python pandas.DataFrame
One of the first concepts we learn as Python programmers is about built-in data structures, such as the list
and the dictionary
.
Once you understand these basic structures, it’s time to add pandas.DataFrame
to your set of Python programming essentials.
Why pandas.DataFrame?
Because pandas.DataFrame
is one of the most used Python data structures, and for good reason. It saves time and anguish, and if you’re going to be reading or collaborating or borrowing other programmers’ code, you are absolutely going to see DataFrames
sooner than later.
About this tutorial
A great way to learn is by doing.
This tutorial includes exercises designed to gently introduce you to the world of pandas.DataFrame
.
The tutorial uses a subset of data from the popular Titanic data set (if you’re a student of data science, you’ll be seeing the Titanic data set again).
Feel free to use your own data instead, simply save a spreadsheet of your choice to CSV format. Otherwise, …
Download the Titanic CSV file into your Python 3 programming environment:
1. Download titanic_names.csv.zip
2. Unzip titanic_names.csv.zip into your Python 3 environment
pandas and DataFrame at a glance
pandas
is a powerful Python package widely used for data analysis. It simplifies tasks for loading, analyzing and manipulating data that would otherwise require way too many lines of Python code.
DataFrame
is a data structure (just like list
and dictionary
are data structures), included in the pandas
package.
According to the Pandas website:
“DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input …
Example view of a DataFrame
As you can see, the DataFrame
structure is much like a spreadsheet or database table:
Let’s get started with the tutorial …
Import data into a DataFrame
Pandas has many reader functions that import an external data source (like a CSV or HTML text file) and return a DataFrame
.
Use read_csv()
to load this tutorial’s Titanic data
# import pandas - the pd alias for pandas is a widely used standard
import pandas as pdtitanic_df = pd.read_csv("titanic_names.csv")
A successful read_csv()
has no output. Let’s see what it created …
View a DataFrame
3 functions & methods for creating a birds-eye view of a DataFrame
are: shape
, head()
, and describe()
.
How many rows and columns?
# display row and column countstitanic_df.shape
891 rows and 6 columns.
What does the data look like?
titanic_df.head()
head()
displays the first 5 records. To see fewer or more records, pass an integer, example: head(10)
.
Notice: the left side of the DataFrame table lists sequential integers 0, 1, …, 4. This is the ‘index’, or ‘row label’. We’ll discuss indexes a little later on in this tutorial.
Are there statistics on the data?
titanic_df.describe()
describe()
outputs a summary of descriptive statistics. Notices the results are a DataFrame
itself.
In this example, the count row contains the count of non-null values in each numeric column of the titanic_df. The column Age has 714 non-null values, which means the rest of the 891 records have Age = null (no value). As an analyst, you may be concerned with null values.
Click here for more descriptive statistics and computation methods, such as:
count()
,max()
,min()
,mean()
,std()
,sum()
,dtypes()
Click here for more views of the underlying structure and meta data.
Now that you have an idea of what a DataFrame
looks like, let’s make some from scratch …
Create a DataFrame
manually
Syntax
pandas.DataFrame(data, index, columns, dtype, copy)
Create a simple DataFrame
Define the first parameter, data
, as a simple list [‘A’, ‘B’, ‘C’]. Accept defaults for all other parameters and see what happens:
pd.DataFrame(['A','B','C'])
Notice the indexes (or, row labels), on the left. They default to a range sequence starting with 0.
Create a DataFrame with a list array
Practice using parameters by passing both index and column labels, and force the data to be of type float
:
pd.DataFrame(
[
[1,2,3],
[4,5,6],
[7,8,9],
[98,99,100]
],
index=['Row 1','Row 2','Row 3','Row 4'],
columns=['Col 1','Col 2','Col 3'],
dtype=float
)
Nice.
Create a DataFrame with a dictionary of lists
pd.DataFrame (
{'PassengerId': [1,2,3],
'Survived': [0,1,1],
'Pclass': [3,1,3],
'Name': ['Braund, Mr. Owen Harris',
'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
'Heikkinen, Miss. Laina'],
'Sex': ['male','female','female'],
'Age': [22.0, 38.0, 26.0]
}
)
Perfect.
Copy a data structure to a DataFrame
Another way of creating a DataFrame
is to copy from an existing data structure such as: a list of dictionaries, narrays/lists, Series
(a pandas structure), and a list of Series.
Copy from list of dictionaries
Use the method from_dict()
.
titanic_dict = [
{'PassengerId': 1, 'Survived': 0, 'Pclass': 3,
'Name': 'Braund, Mr. Owen Harris',
'Sex': 'male', 'Age': 22.0},
{'PassengerId': 2, 'Survived': 1, 'Pclass': 1,
'Name': 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
'Sex': 'female', 'Age': 38.0},
{'PassengerId': 3, 'Survived': 1, 'Pclass': 3,
'Name': 'Heikkinen, Miss. Laina',
'Sex': 'female', 'Age': 26.0}
]# Copy list of dictionaries into a DataFrame
small_titanic_df = pd.DataFrame.from_dict(titanic_dict)
small_titanic_df
For details on
from_dict()
,from_records()
and other methods for copying data to and from aDataFrame
, click here.
Select specific columns
There are three methods of selecting columns from a DataFrame
, and they all return the same results:
- dot notation
dataframe.column_name
- square braces with the name of the column as a string
dataframe[‘column_name’]
- numeric indexing and the
iloc
selectordataframe.iloc[:, <column_number>]
(include head()
method to limit rows returned)
titanic_df.Age.head() titanic_df['Age'].head() titanic_df.iloc[:,5].head()
I believe the square brackets version is the cleanest and easiest to read. Also, less error-prone than iloc
if columns are re-ordered, added or removed from the DataFrame
later. iloc
’s slicing syntax is handy, however, when selecting a series of columns.
Select multiple columns
Include multiple column names as a list.
titanic_df[['Name','Age']].head()
Select specific rows
Select the second set of 5 rows by slicing inside of brackets.
titanic_df[5:10]
Notice how the index (left side) is the index for each specific record in the DataFrame
, not for the selection result’s table row.
These two commands are the same, each returning the first 5 rows of a DataFrame
→ titanic_df.head(5)
= titanic_df[0:5]
.
Select specific columns and rows at the same time
Columns get listed first, then rows are sliced.
titanic_df[['Name','Age']][5:10]
Filter records based on conditions
Who on the Titanic was older than 65?
titanic_df[titanic_df['Age'] > 65]
How many people survived? (0 = no, 1 = yes)
titanic_df['Survived'].value_counts()
What was the survival rate, in percentages?
titanic_df['Survived'].value_counts(normalize=True) * 100
What percentage of seniors survived?
titanic_df[ titanic_df["Age"] > 65 ] \
["Survived"].value_counts(normalize=True) * 100
Not a good survival rate folks over 65 years old.
Sort by column values
Who were the 10 youngest?
titanic_df.sort_values('Age').head(10)
How about the 10 oldest?
titanic_df.sort_values('Age', ascending=False).head(10)
For more on slicing and selecting data, click here.
Group by columns
How many survived and died, grouped by females and males?
Use the df.groupby() method.
titanic_df.groupby(['Sex', 'Survived'])['PassengerId'].count()
Change data
Update ‘Survived’, converting a Boolean to a String.
Remember, with Booleans, 0=False and 1=True.
Use the pandas.DataFrame.replace()
method. Create a boolean dictionary
(boolean_d) first.
boolean_d = {True: 'Survived', False: 'Died'}titanic_df['Survived'] = titanic_df['Survived'].replace(boolean_d)titanic_df.head()
Indexing
By default, a DataFrame
’s rows are indexed with sequential integers starting with 0. What if we want to index on a column that already exists, instead? For example, on the ‘PassengerId’ column?
There are a couple ways to set an index: convert an existing column to an index, or set the index when the DataFrame
is built.
Convert column to index
Use set_index()
.
set_index_titanic_df = titanic_df.set_index('PassengerId')set_index_titanic_df.head()
Index DataFrame upon creation
Now let’s re-import our test data to create titanic_df using read_csv
again, this time explicitly assigning an index instead of accepting the default range index. Use the index_col
parameter.
indexed_titanic_df = \
pd.read_csv("titanic_names.csv",index_col='PassengerId')indexed_titanic_df.head()
Indexes are not columns
An important thing to note here: now that ‘PassengerId’ is an index, it is no longer a column in the DataFrame
. This is demonstrated with the shape
function, which shows there are now 5 columns instead of previous 6:
indexed_titanic_df.shape
Also, an index is not referenced like a column. For example, if you try to reference ‘PassengerId’ as you would a column, you’ll get an error:
indexed_titanic_df['PassengerId']
Convert index to a column
Change your mind about indexing that column? No worries. Just set it back using reset_index()
:
un_indexed_titanic_df = indexed_titanic_df.reset_index()un_indexed_titanic_df.head()
Conclusion (with a teaser)
This tutorial reveals only the very tip of the pandas.DataFrame
iceberg.
We haven’t even touched on doing math, using apply
and applymap
to apply functions across a DataFrame
, pivot
ing tables, multi-indexing, using crosstab
to validate a hypothesis, merging, iterating over rows, and so much more.
I encourage you to keep exploring the pandas.DataFrame
, and the pandas
library in general.
Before you go, here’s one last teaser …
Plotting pandas.DataFrame
One of the best things about pandas
and theDataFrame
structure is how easy it makes doing data analysis, especially when coupled with the matplotlib
library. matplotlib
is a whole new subject, so here’s just a simple example to wet your appetite.
Try it
import pandas as pd
import matplotlib.pyplot as plttitanic_df = pd.read_csv("titanic_names.csv")# Draw a histogram of passenger agesage_hist = titanic_df['Age'].hist()
age_hist.set(xlabel='Passenger Age', ylabel='Count')plt.show()
Happy Python-ing!!!
Resources
- Original Titanic data source
- Handy Pandas cheat sheet (by pandas.pydata.org)
- Pandas official documentation on DataFrames
- Pandas documentation on plotting and data visualization
- Jupyter Notebook with all the exercises in this Tutorial