# A Beginner's Guide to Pandas

## Your first step into data analysis.

Since 2010, when pandas was first open-sourced, it has matured into a beautiful and extensive library for data analysis. It is often used in conjunction with computational and statistical libraries such as NumPy, scikit-learn, and Matplotlib.

In this article, I’ll walk you through the most common functions in pandas so you’re ready to do some exploratory data analysis.

pandas has two data structures that come up again and again: the Series and the DataFrame.

# Series

Simply put, a Series is a one-dimensional array of elements with an added feature: an explicit index that addresses each element. To create a Series, you pass an array-like object (a list, a dictionary, or even a tuple). You can also pass an explicit array for the index. When indexing a Series, think of it as a hashmap, associative array, or dictionary.

```python
>>> import pandas as pd
>>> series = pd.Series([1, 2, 3, 4, 5, 6, 7])
>>> series
0    1
1    2
2    3
3    4
4    5
5    6
6    7
dtype: int64
>>> series.index
RangeIndex(start=0, stop=7, step=1)
>>> # or pass an explicit index
>>> series2 = pd.Series([1, 2, 3, 4, 5, 6, 7], index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
>>> series2
a    1
b    2
c    3
d    4
e    5
f    6
g    7
dtype: int64
```
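Because a Series behaves like a dictionary, you can look elements up by label as well as by position. A minimal sketch of both access styles, reusing the `series2` example above:

```python
import pandas as pd

series2 = pd.Series([1, 2, 3, 4, 5, 6, 7],
                    index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])

# label-based access, like a dictionary lookup
print(series2['a'])        # 1

# position-based access with .iloc
print(series2.iloc[0])     # 1

# membership tests check the index, just like dict keys
print('a' in series2)      # True
```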

# DataFrame

pandas is designed for working with tabular or heterogeneous data. This is achieved through the DataFrame data structure. It is like a table — it contains rows and columns, both of which can be indexed. There are numerous ways to create a DataFrame, the most common of which is to pass a dictionary. Let's create one.

```python
>>> diction = {'name': ['Alex', 'Bob', 'Charlie', 'Jack', 'Melissa'],
...            'age': [20, 18, 19, 20, 19],
...            'score': [109, 108, 99, 120, 115]}
>>> dataframe1 = pd.DataFrame(diction)
```
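Once the DataFrame exists, columns can be pulled out as Series, and rows can be selected by label or position. A quick sketch using the `diction` frame above:

```python
import pandas as pd

diction = {'name': ['Alex', 'Bob', 'Charlie', 'Jack', 'Melissa'],
           'age': [20, 18, 19, 20, 19],
           'score': [109, 108, 99, 120, 115]}
dataframe1 = pd.DataFrame(diction)

# a single column comes back as a Series
ages = dataframe1['age']

# .iloc selects rows by position; .loc selects by index label
first_row = dataframe1.iloc[0]
print(first_row['name'])   # Alex
```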

To illustrate the true power of the pandas library, we will now start working with a small dataset from Kaggle: a record of students' performance on exams and how it is affected by economic, personal, and social factors. Download the dataset from Kaggle.

## Importing a dataset

To import a dataset in CSV format, we use pandas' `read_csv` function.

```python
>>> df = pd.read_csv('StudentsPerformance.csv')
```
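Besides a file path, `read_csv` also accepts any file-like object, which is handy for quick experiments when no file is on disk. A minimal sketch using an in-memory buffer (the column names here are made up for illustration):

```python
import io
import pandas as pd

csv_text = "gender,math score\nfemale,72\nmale,69\n"
df_demo = pd.read_csv(io.StringIO(csv_text))
print(df_demo.shape)   # (2, 2)
```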

## Examining the data

Often when we're working with a dataset, we are interested in knowing which columns it contains and what each column describes. This is easy with pandas' `head()` and `info()` functions.

```python
>>> df.head()
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
gender                         1000 non-null object
race/ethnicity                 1000 non-null object
parental level of education    1000 non-null object
lunch                          1000 non-null object
test preparation course        1000 non-null object
math score                     1000 non-null int64
reading score                  1000 non-null int64
writing score                  1000 non-null int64
dtypes: int64(3), object(5)
memory usage: 62.6+ KB
```

## Checking for Null values

Datasets often contain null values, which can become a nuisance during machine learning. We can check for null values using the `info()` function, or call `isnull()` and sum over each column to count the nulls column by column.

```python
>>> df.isnull().sum()
gender                         0
race/ethnicity                 0
parental level of education    0
lunch                          0
test preparation course        0
math score                     0
reading score                  0
writing score                  0
dtype: int64
```
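This dataset happens to be clean, but when nulls do appear you typically either drop or fill them. A hedged sketch on a tiny synthetic frame (the values here are invented for illustration):

```python
import pandas as pd
import numpy as np

df_nulls = pd.DataFrame({'math score': [70, np.nan, 85],
                         'reading score': [72, 88, np.nan]})

# one null in each column
print(df_nulls.isnull().sum())

# drop any row containing a null...
dropped = df_nulls.dropna()

# ...or fill nulls with a column statistic instead
filled = df_nulls.fillna(df_nulls.mean())
```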

## Describing the data

Another very useful function is `describe()`. It gives a five-number summary of the data, along with some other useful statistics, for interpreting the range in which our data lies.

```python
>>> df.describe()
```
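To see what `describe()` reports, here is a minimal sketch on a single invented column: the result is itself a DataFrame indexed by statistic names, so individual values can be looked up with `.loc`.

```python
import pandas as pd

scores = pd.DataFrame({'math score': [40, 60, 80, 100]})
summary = scores.describe()

# rows include count, mean, std, min, 25%, 50%, 75%, max
print(summary.loc['mean', 'math score'])   # 70.0
print(summary.loc['50%', 'math score'])    # 70.0 (the median)
```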

## Asking questions from data

The main reason for exploring a dataset is to be able to answer questions. For example, by looking at the dataset one might want to know, "How many students passed the Math exam?" Or, "How many students scored a percentage above 80?" Answering these questions is easy with pandas.

```python
>>> # how many students passed the math exam?
>>> passing_score = 40
>>> math_stats = df['math score'] >= passing_score
>>> print(math_stats.sum())
960
>>> # how many students scored a percentage above 80?
>>> total_marks = df['math score'] + df['reading score'] + df['writing score']
>>> total = 3
>>> total_percentage = total_marks / total
>>> above_80 = total_percentage >= 80
>>> print(total_percentage[above_80].count())
198
```
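Boolean masks like these can also be combined with `&` (and) and `|` (or) to answer compound questions. A sketch on a tiny synthetic stand-in for the Kaggle dataset (the scores are invented; the threshold matches the one above):

```python
import pandas as pd

df = pd.DataFrame({'math score': [45, 90, 30, 75],
                   'reading score': [50, 85, 40, 80],
                   'writing score': [55, 95, 35, 70]})

passing_score = 40

# parentheses around each comparison are required when combining masks
passed_both = ((df['math score'] >= passing_score) &
               (df['reading score'] >= passing_score))
print(passed_both.sum())   # 3
```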

## Sorting data by percentages

Now that we've calculated the percentages for all the students, we may wish to sort the data according to their rank. pandas lets you sort by any column you choose. For this, let's add another column named 'percentage' to our dataset and sort by it.

```python
>>> df['percentage'] = total_percentage
>>> df.sort_values(by='percentage', ascending=False)
```
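One detail worth noting: `sort_values` returns a new, sorted DataFrame rather than modifying the original in place, so the result must be assigned back (or `inplace=True` passed) to keep the ordering. A small sketch on invented data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alex', 'Bob', 'Charlie'],
                   'percentage': [72.5, 91.0, 84.3]})

# assign the sorted copy back to keep the ranking
df_ranked = df.sort_values(by='percentage', ascending=False)
print(df_ranked.iloc[0]['name'])   # Bob
```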

## Counting by values

Sometimes we may be interested in analyzing a specific column from our dataset. For instance, the column 'parental level of education' contains six unique values. We may wish to count these values, and pandas' `value_counts()` function makes this very simple.

```python
>>> df['parental level of education'].unique()
array(["bachelor's degree", 'some college', "master's degree",
       "associate's degree", 'high school', 'some high school'],
      dtype=object)
>>> df['parental level of education'].value_counts()
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
master's degree        59
Name: parental level of education, dtype: int64
```
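`value_counts()` can also report proportions instead of raw counts via its `normalize` parameter. A minimal sketch on an invented Series:

```python
import pandas as pd

education = pd.Series(['high school', 'some college', 'high school',
                       "bachelor's degree", 'high school'])

counts = education.value_counts()
print(counts['high school'])            # 3

# normalize=True returns fractions that sum to 1
shares = education.value_counts(normalize=True)
print(shares['high school'])            # 0.6
```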

# Wrapping up

Here I’ve discussed just a few handy functions that are repeatedly used during exploratory data analysis. For a complete guide to using pandas, the best source is the documentation itself!