Published in Analytics Vidhya

A Beginner's Guide to Pandas

Your first step to get started with data analysis.


Simply put, a Series is a one-dimensional array of elements with an added feature: an explicit index that addresses each element. To create a Series, you pass an array-like object (a list, a dictionary, or even a tuple), and you can optionally pass an explicit array for the index. When indexing a Series, think of it as a hashmap, associative array, or dictionary data structure.

>>> import pandas as pd
>>> series = pd.Series([1,2,3,4,5,6,7])
>>> series
0 1
1 2
2 3
3 4
4 5
5 6
6 7
dtype: int64
>>> series.index
RangeIndex(start=0, stop=7, step=1)
>>> #or pass an explicit index
>>> series2 = pd.Series([1,2,3,4,5,6,7], index=['a','b','c','d','e','f','g'])
>>> series2
a 1
b 2
c 3
d 4
e 5
f 6
g 7
dtype: int64
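Since a Series behaves like a dictionary, its explicit labels can be used directly for lookups. A minimal sketch (the values here are just illustrative):

```python
import pandas as pd

# A Series maps index labels to values, much like a dictionary.
series2 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

print(series2['a'])                  # look up a single element by label
print(series2[['a', 'c']].tolist())  # look up several labels at once
```

Passing a list of labels returns a new Series containing just those elements.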


pandas is designed for working with tabular or heterogeneous data, and this is achieved through the DataFrame data structure. A DataFrame is like a table: it contains rows and columns, both of which can be indexed. There are numerous ways to create a DataFrame, the most common of which is to pass a dictionary. Let's create one.

>>> diction = {'name': ['Alex', 'Bob', 'Charlie', 'Jack', 'Melissa'],
...            'age': [20, 18, 19, 20, 19],
...            'score': [109, 108, 99, 120, 115]}
>>> dataframe1 = pd.DataFrame(diction)
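Each dictionary key becomes a column of the resulting DataFrame, and selecting a column gives back a Series. A small sketch, reusing the dictionary above:

```python
import pandas as pd

# Each key of the dictionary becomes a column name.
diction = {'name': ['Alex', 'Bob', 'Charlie', 'Jack', 'Melissa'],
           'age': [20, 18, 19, 20, 19],
           'score': [109, 108, 99, 120, 115]}
dataframe1 = pd.DataFrame(diction)

print(list(dataframe1.columns))   # column names come from the keys
print(dataframe1['score'].max())  # selecting a column returns a Series
```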

Importing a dataset

To import a dataset in CSV format, we use pandas' read_csv() function.

>>> df = pd.read_csv('StudentsPerformance.csv')

Examining the data

Often when we're working with a dataset, we want to know which columns it contains and what each column holds. This is easy with pandas' head() and info() functions: head() shows the first few rows, while info() summarizes the columns, their types, and their non-null counts.

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
gender 1000 non-null object
race/ethnicity 1000 non-null object
parental level of education 1000 non-null object
lunch 1000 non-null object
test preparation course 1000 non-null object
math score 1000 non-null int64
reading score 1000 non-null int64
writing score 1000 non-null int64
dtypes: int64(3), object(5)
memory usage: 62.6+ KB
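As a quick illustration on a throwaway frame (the data here is made up), head() returns the first five rows by default, and an explicit n overrides that:

```python
import pandas as pd

# head() previews the first rows of a DataFrame.
df_demo = pd.DataFrame({'x': range(10)})

print(len(df_demo.head()))   # defaults to the first 5 rows
print(len(df_demo.head(3)))  # or pass n explicitly
```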

Checking for Null values

Datasets often contain null values, which can become a nuisance to work with during machine learning. We can check for them with the info() function, or sum the isnull() mask to count the nulls column by column.

>>> df.isnull().sum()
gender 0
race/ethnicity 0
parental level of education 0
lunch 0
test preparation course 0
math score 0
reading score 0
writing score 0
dtype: int64

Describing the data

Another very useful function is describe(). It gives a five-number summary of the data, along with other useful statistics, for interpreting the range in which our data lies.
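For example, on a small made-up column of scores, describe() reports the count, mean, standard deviation, minimum, quartiles, and maximum for each numeric column:

```python
import pandas as pd

# describe() summarizes each numeric column.
scores = pd.DataFrame({'math score': [40, 55, 70, 85, 100]})
summary = scores.describe()

print(summary.loc['mean', 'math score'])  # average of the column
print(summary.loc['50%', 'math score'])   # the median (second quartile)
```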


Asking questions from data

The main reason for exploring a dataset is to be able to answer questions about it. For example, one might want to know, "How many students passed the math exam?" or, "How many students scored a percentage above 80?" Answering these questions is easy with pandas.

>>> # how many students passed the math exam?
>>> passing_score = 40
>>> passed_math = df['math score'] >= passing_score
>>> print(passed_math.sum())
>>> # how many students scored a percentage above 80?
>>> total_marks = df['math score'] + df['reading score'] + df['writing score']
>>> num_subjects = 3
>>> total_percentage = total_marks / num_subjects
>>> above_80 = total_percentage >= 80
>>> print(total_percentage[above_80].count())
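The same boolean-mask pattern can be checked on a tiny frame with invented scores (the data below is made up for illustration, not taken from StudentsPerformance.csv):

```python
import pandas as pd

# Made-up scores to illustrate boolean masking.
df_toy = pd.DataFrame({'math score':    [35, 60, 90],
                       'reading score': [50, 70, 95],
                       'writing score': [45, 65, 85]})

passed_math = df_toy['math score'] >= 40   # boolean mask, one entry per row
print(passed_math.sum())                   # True counts as 1 when summed

percentage = (df_toy['math score'] + df_toy['reading score']
              + df_toy['writing score']) / 3
print((percentage >= 80).sum())            # students at 80% or above
```

Summing a boolean Series is the idiomatic way to count how many rows satisfy a condition.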

Sorting data by percentages

Now that we've calculated the percentages for all the students, we may wish to sort the data by rank. pandas lets you sort by any column you choose. For this, let's add another column named 'percentage' to our dataset and sort by it.

>>> df['percentage'] = total_percentage
>>> df.sort_values(by='percentage', ascending=False)
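Note that sort_values() returns a sorted copy rather than sorting in place, so assign the result if you want to keep it. A sketch on made-up data:

```python
import pandas as pd

# sort_values() returns a new, sorted DataFrame.
df_rank = pd.DataFrame({'name': ['Alex', 'Bob', 'Cara'],
                        'percentage': [72.0, 91.5, 80.0]})

ranked = df_rank.sort_values(by='percentage', ascending=False)
print(ranked['name'].tolist())  # highest percentage first
```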

Counting by values

Sometimes we may be interested in analyzing a specific column of our dataset. For instance, the column 'parental level of education' contains six unique values. We may wish to count how often each occurs, and pandas' value_counts() function makes this very simple.

>>> df['parental level of education'].unique()
array(["bachelor's degree", 'some college', "master's degree",
       "associate's degree", 'high school', 'some high school'],
      dtype=object)
>>> df['parental level of education'].value_counts()
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
master's degree        59
Name: parental level of education, dtype: int64

Wrapping up

Here I’ve discussed just a few handy functions that are repeatedly used during exploratory data analysis. For a complete guide to using pandas, the best source is the documentation itself!



