Hands-on Python pandas.DataFrame

A tutorial introduction to Python pandas.DataFrame

Published in

Women Data Greenhorns

9 min readAug 2, 2018

Image Courtesy : Combination built with pandas logo owned by pandas.pydata.org and waving pandas on Amazon.com

One of the first concepts we learn as Python programmers is about built-in data structures, such as the list and the dictionary.

Once you understand these basic structures, it’s time to add pandas.DataFrame to your set of Python programming essentials.

Why pandas.DataFrame?

Because pandas.DataFrame is one of the most used Python data structures, and for good reason. It saves time and anguish, and if you’re going to be reading or collaborating or borrowing other programmers’ code, you are absolutely going to see DataFrames sooner than later.

About this tutorial

A great way to learn is by doing.

This tutorial includes exercises designed to gently introduce you to the world of pandas.DataFrame.

The tutorial uses a subset of data from the popular Titanic data set (if you’re a student of data science, you’ll be seeing the Titanic data set again).

Feel free to use your own data instead, simply save a spreadsheet of your choice to CSV format. Otherwise, …

Download the Titanic CSV file into your Python 3 programming environment:

1. Download titanic_names.csv.zip
2. Unzip titanic_names.csv.zip into your Python 3 environment

pandas and DataFrame at a glance

pandas is a powerful Python package widely used for data analysis. It simplifies tasks for loading, analyzing and manipulating data that would otherwise require way too many lines of Python code.

DataFrame is a data structure (just like list and dictionary are data structures), included in the pandas package.

According to the Pandas website:

“DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input …

Example view of a DataFrame

As you can see, the DataFrame structure is much like a spreadsheet or database table:

pandas.dataframe populated with Titanic data

Let’s get started with the tutorial …

Import data into a DataFrame

Pandas has many reader functions that import an external data source (like a CSV or HTML text file) and return a DataFrame.

Use `read_csv()` to load this tutorial’s Titanic data

# import pandas - the pd alias for pandas is a widely used standard
import pandas as pdtitanic_df = pd.read_csv("titanic_names.csv")

A successful read_csv() has no output. Let’s see what it created …

View a DataFrame

3 functions & methods for creating a birds-eye view of a DataFrameare: shape, head(), and describe().

How many rows and columns?

# display row and column countstitanic_df.shape

891 rows and 6 columns.

What does the data look like?

titanic_df.head()

head() displays the first 5 records. To see fewer or more records, pass an integer, example: head(10).

Notice: the left side of the DataFrame table lists sequential integers 0, 1, …, 4. This is the ‘index’, or ‘row label’. We’ll discuss indexes a little later on in this tutorial.

Are there statistics on the data?

titanic_df.describe()

describe() outputs a summary of descriptive statistics. Notices the results are a DataFrame itself.

In this example, the count row contains the count of non-null values in each numeric column of the titanic_df. The column Age has 714 non-null values, which means the rest of the 891 records have Age = null (no value). As an analyst, you may be concerned with null values.

Click here for more descriptive statistics and computation methods, such as: count(), max(), min(), mean(), std(), sum(), dtypes()
Click here for more views of the underlying structure and meta data.

Now that you have an idea of what a DataFrame looks like, let’s make some from scratch …

Create a `DataFrame` manually

Syntax

pandas.DataFrame(data, index, columns, dtype, copy)

Create a simple DataFrame

Define the first parameter, data, as a simple list [‘A’, ‘B’, ‘C’]. Accept defaults for all other parameters and see what happens:

pd.DataFrame(['A','B','C'])

Notice the indexes (or, row labels), on the left. They default to a range sequence starting with 0.

Create a DataFrame with a list array

Practice using parameters by passing both index and column labels, and force the data to be of type float:

pd.DataFrame(
    [
        [1,2,3],
        [4,5,6],
        [7,8,9],
        [98,99,100]
    ], 
    index=['Row 1','Row 2','Row 3','Row 4'],
    columns=['Col 1','Col 2','Col 3'],
    dtype=float
)

Nice.

Create a DataFrame with a dictionary of lists

pd.DataFrame (
    {'PassengerId': [1,2,3], 
     'Survived': [0,1,1], 
     'Pclass': [3,1,3], 
     'Name': ['Braund, Mr. Owen Harris', 
              'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
              'Heikkinen, Miss. Laina'],
     'Sex': ['male','female','female'],
     'Age': [22.0, 38.0, 26.0]
    }
)

Perfect.

Copy a data structure to a DataFrame

Another way of creating a DataFrame is to copy from an existing data structure such as: a list of dictionaries, narrays/lists, Series (a pandas structure), and a list of Series.

Copy from list of dictionaries

Use the method from_dict().

titanic_dict = [
    {'PassengerId': 1, 'Survived': 0, 'Pclass': 3, 
     'Name': 'Braund, Mr. Owen Harris', 
     'Sex': 'male', 'Age': 22.0}, 
    {'PassengerId': 2, 'Survived': 1, 'Pclass': 1, 
     'Name': 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)', 
     'Sex': 'female', 'Age': 38.0}, 
    {'PassengerId': 3, 'Survived': 1, 'Pclass': 3, 
     'Name': 'Heikkinen, Miss. Laina', 
     'Sex': 'female', 'Age': 26.0}
    ]# Copy list of dictionaries into a DataFrame
small_titanic_df = pd.DataFrame.from_dict(titanic_dict)
small_titanic_df

For details on from_dict(), from_records() and other methods for copying data to and from a DataFrame, click here.

Select specific columns

There are three methods of selecting columns from a DataFrame, and they all return the same results:

dot notation
dataframe.column_name
square braces with the name of the column as a string dataframe[‘column_name’]
numeric indexing and the iloc selector
dataframe.iloc[:, <column_number>]

(include head() method to limit rows returned)

titanic_df.Age.head() titanic_df['Age'].head() titanic_df.iloc[:,5].head()

I believe the square brackets version is the cleanest and easiest to read. Also, less error-prone than iloc if columns are re-ordered, added or removed from the DataFrame later. iloc’s slicing syntax is handy, however, when selecting a series of columns.

Select multiple columns

Include multiple column names as a list.

titanic_df[['Name','Age']].head()

Select specific rows

Select the second set of 5 rows by slicing inside of brackets.

titanic_df[5:10]

Notice how the index (left side) is the index for each specific record in the DataFrame, not for the selection result’s table row.

These two commands are the same, each returning the first 5 rows of a DataFrame → titanic_df.head(5) = titanic_df[0:5].

Select specific columns and rows at the same time

Columns get listed first, then rows are sliced.

titanic_df[['Name','Age']][5:10]

Filter records based on conditions

Who on the Titanic was older than 65?

titanic_df[titanic_df['Age'] > 65]

How many people survived? (0 = no, 1 = yes)

titanic_df['Survived'].value_counts()

What was the survival rate, in percentages?

titanic_df['Survived'].value_counts(normalize=True) * 100

What percentage of seniors survived?

titanic_df[ titanic_df["Age"] > 65 ] \
   ["Survived"].value_counts(normalize=True) * 100

Not a good survival rate folks over 65 years old.

Sort by column values

Who were the 10 youngest?

titanic_df.sort_values('Age').head(10)

How about the 10 oldest?

titanic_df.sort_values('Age', ascending=False).head(10)

For more on slicing and selecting data, click here.

Group by columns

How many survived and died, grouped by females and males?

Use the df.groupby() method.

titanic_df.groupby(['Sex', 'Survived'])['PassengerId'].count()

Change data

Update ‘Survived’, converting a Boolean to a String.

Remember, with Booleans, 0=False and 1=True.

Use the pandas.DataFrame.replace() method. Create a boolean dictionary (boolean_d) first.

boolean_d = {True: 'Survived', False: 'Died'}titanic_df['Survived'] = titanic_df['Survived'].replace(boolean_d)titanic_df.head()

Indexing

By default, a DataFrame’s rows are indexed with sequential integers starting with 0. What if we want to index on a column that already exists, instead? For example, on the ‘PassengerId’ column?

There are a couple ways to set an index: convert an existing column to an index, or set the index when the DataFrame is built.

Convert column to index

Use set_index().

set_index_titanic_df = titanic_df.set_index('PassengerId')set_index_titanic_df.head()

Index DataFrame upon creation

Now let’s re-import our test data to create titanic_df using read_csv again, this time explicitly assigning an index instead of accepting the default range index. Use the index_col parameter.

indexed_titanic_df = \
    pd.read_csv("titanic_names.csv",index_col='PassengerId')indexed_titanic_df.head()

Indexes are not columns

An important thing to note here: now that ‘PassengerId’ is an index, it is no longer a column in the DataFrame. This is demonstrated with the shape function, which shows there are now 5 columns instead of previous 6:

indexed_titanic_df.shape

Also, an index is not referenced like a column. For example, if you try to reference ‘PassengerId’ as you would a column, you’ll get an error:

indexed_titanic_df['PassengerId']

Convert index to a column

Change your mind about indexing that column? No worries. Just set it back using reset_index():

un_indexed_titanic_df = indexed_titanic_df.reset_index()un_indexed_titanic_df.head()

Conclusion (with a teaser)

This tutorial reveals only the very tip of the pandas.DataFrame iceberg.

We haven’t even touched on doing math, using apply and applymap to apply functions across a DataFrame, pivoting tables, multi-indexing, using crosstab to validate a hypothesis, merging, iterating over rows, and so much more.

I encourage you to keep exploring the pandas.DataFrame, and the pandas library in general.

Before you go, here’s one last teaser …

Plotting pandas.DataFrame

One of the best things about pandas and theDataFrame structure is how easy it makes doing data analysis, especially when coupled with the matplotlib library. matplotlib is a whole new subject, so here’s just a simple example to wet your appetite.

Try it

import pandas as pd
import matplotlib.pyplot as plttitanic_df = pd.read_csv("titanic_names.csv")# Draw a histogram of passenger agesage_hist = titanic_df['Age'].hist()
age_hist.set(xlabel='Passenger Age', ylabel='Count')plt.show()

Happy Python-ing!!!

Resources

Original Titanic data source
Handy Pandas cheat sheet (by pandas.pydata.org)
Pandas official documentation on DataFrames
Pandas documentation on plotting and data visualization
Jupyter Notebook with all the exercises in this Tutorial