12 Pandas Methods To Master: A Complete Roadmap To Be A Data Scientist

Datasthan
8 min read · Aug 14, 2022


Last week I wrote a blog on Python for data science: Python For Data Science: A Complete Roadmap To Be A Data Scientist.

It seems like many people liked it. Well, then there is no point in stopping now, is there?

And that’s why this is gonna be the second part of the blog series “A Complete Roadmap To Be A Data Scientist”.

Let’s get started!

So you’ve learned Python, what’s next?

The obvious answer is Pandas. But…

Why Pandas?

A one-word answer is data. See, in the real world the raw data we get is so messy that it is practically impossible to work with. And here’s where Pandas comes into play. First of all, it provides a structure for the data so that it becomes much easier to handle. And then it provides functions to clean, merge, analyze, visualize… basically any step you want to perform on your data, Pandas will be there for you.

Now that it’s clear why Pandas, let’s import pandas as pd in your life.

Today we’ll be talking about the 12 most used methods in Pandas:

read_csv():

This function is like the gateway to data science, because without it we cannot read the data, and if we don’t have data, what’s the point of learning all of this? Now, besides reading the data, there are a few parameters of this function we need to learn about.

# opening a local csv file
df = pd.read_csv('census.csv')
df.head()

It is used to open a .csv file that is already stored on your local computer.

The head() method is used to show a small part of the data; by default it shows the first 5 rows. But we can customize it by passing another number, like df.head(10).

# opening a csv file from a URL
import requests
from io import StringIO

url = "https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0"}
req = requests.get(url, headers=headers)
data = StringIO(req.text)

pd.read_csv(data)

Whenever there is a .csv file uploaded on a website, we can directly fetch and use that using this piece of code.

Two libraries, along with Pandas, are used here: requests and StringIO (from Python’s io module).

The requests library is used to send a request to the server where the data is stored. There are different requests methods, here we’re using the get method. More on requests, here.

Here the “url” is the page url, where the data is stored.

The “headers” dictionary is sent to the server along with the request, so that the server treats our request as one coming from a regular browser and not from a bot.

# opening a local .tsv file
pd.read_csv('movie_titles_metadata.tsv', sep='\t',
            names=['sno','name','release_year','rating','votes','genres'])

First of all, .tsv means tab-separated values. So just like .csv, there are other delimited file formats as well, and .tsv is one of them.

Now, read_csv has a parameter called sep [ meaning separator ], and by default its value is ‘,’, which is valid for a .csv file but not for a .tsv file.

That’s why, while handling a .tsv file, we can use this parameter and set it as sep='\t' [ \t means tab ], so pandas will use a tab as the separator to read that file.

Also, the “names” parameter is used to name the columns: whenever it is hard for pandas to figure out which row to use as the header, or when there are no column names in our data at all, we can pass them with this parameter.

# setting a column as the index
pd.read_csv('aug_train.csv', index_col='enrollee_id')

The “index_col” parameter is used to tell pandas that we want this column as our index, and pandas will do that for us.

# using only particular columns
pd.read_csv('aug_train.csv',
            usecols=['enrollee_id','gender','education_level'])

There will be situations where we have a dataset with a large number of columns, and to complete our work we don’t really need all of them.

That’s why we use the “usecols” parameter and just pass the list of columns we want to use.

# encoding parameter
pd.read_csv('zomato.csv', encoding='latin-1')

Usually, when we work on a dataset, it opens without a fuss and we can get on with our work.

And then along comes some file that throws something like this at you: “UnicodeDecodeError”.

Why does this happen? Because the encoding of that file is different from UTF-8. UTF-8 is the common encoding used on a daily basis, and in read_csv the “encoding” parameter is set to UTF-8 by default.

Now, when we have a file with a different encoding, we can just use this “encoding” parameter and let pandas know which encoding it uses.

Otherwise, we can open the file in Notepad, save it with a different encoding, and then open it normally in pandas.

# skipping bad rows
pd.read_csv('BX-Books.csv', sep=';',
            encoding='latin-1', error_bad_lines=False)

There will be times when your data has some bad rows, meaning a row might contain more values than it is supposed to.

This is where this parameter comes in. If we set error_bad_lines=False, pandas will automatically skip those rows.
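
One caveat: in recent pandas versions error_bad_lines is deprecated (and removed in pandas 2.0) in favour of the on_bad_lines parameter. A minimal sketch of the newer form, assuming pandas 1.3 or later:

# newer pandas: skip malformed rows with on_bad_lines instead
pd.read_csv('BX-Books.csv', sep=';',
            encoding='latin-1', on_bad_lines='skip')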

# loading a huge dataset in chunks
dfs = pd.read_csv('aug_train.csv', chunksize=5000)

for chunk in dfs:
    print(chunk.shape)

Sometimes the dataset is so huge that it becomes way too much for our RAM to hold. So if we use this “chunksize” parameter, pandas will read the data in chunks, and it won’t eat up our RAM.

Just one thing to remember here: if we read the data in chunks, we need a loop to apply any further functions to each chunk.
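
For example, a small sketch of reading only a couple of columns chunk by chunk and then stitching the pieces back into one dataframe (the column names are the ones from the usecols example above):

# read in chunks, then combine the pieces into a single dataframe
chunks = pd.read_csv('aug_train.csv', chunksize=5000,
                     usecols=['enrollee_id', 'gender'])
df = pd.concat(chunks, ignore_index=True)
print(df.shape)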

read_json():

This is the function we use in pandas to read a json file.

# reading a json file
df = pd.read_json('data.json')
print(df.to_string())

The to_string() method is used to print the entire dataframe.
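
When the JSON is nested rather than a flat table, pandas also offers pd.json_normalize(), which flattens nested records into columns. A small sketch with made-up data:

# flattening nested json records into a dataframe
records = [
    {"name": "A", "address": {"city": "Delhi", "zip": "110001"}},
    {"name": "B", "address": {"city": "Mumbai", "zip": "400001"}},
]
df = pd.json_normalize(records)
print(df)   # columns: name, address.city, address.zip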

isnull():

Let’s say we imported our dataset and now we want to start working on it. But here’s the thing: always, always, always check whether there are any missing values or duplicate values in your data before you start working on it.

# checking for missing values
df.isnull().sum()

The isnull() method checks whether there are any null values ( or NaN values ) in your data, and then the sum() method ( an aggregate function; we’ll talk about it later in this blog ) just sums up the result for each column.
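
On bigger datasets it can be more readable to look at the share of missing values per column rather than the raw counts. A small sketch:

# percentage of missing values in each column, largest first
missing_pct = df.isnull().mean() * 100
print(missing_pct.sort_values(ascending=False))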

dropna():

This is the next method we should be using in Pandas.

# removing the missing values
df = df.dropna()

The dropna() method removes all the missing values from our dataframe. If we use it, every row that has a missing value will be dropped.
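
dropna() also takes a couple of handy parameters, such as subset to only consider certain columns and thresh to keep rows with at least a minimum number of non-null values. A hedged sketch (the column name is just a placeholder):

# drop a row only when the 'gender' column is missing
df = df.dropna(subset=['gender'])

# keep only rows that have at least 3 non-null values
df = df.dropna(thresh=3)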

fillna():

Sometimes we just cannot drop some rows, because there might be some important data we still need.

# replacing the null values
df.fillna(0, inplace=True)

The fillna() method helps us replace those null values with a particular placeholder that we pass to the method. Here we passed 0 as the placeholder.

Also, the “inplace=True” parameter updates the original dataframe in place when we apply the fillna() method.
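
Instead of a flat 0, a common choice for a numeric column is to fill with a statistic like the column’s mean or median. A minimal sketch, assuming a numeric column called 'votes' (the name is only illustrative):

# fill missing values in a numeric column with that column's mean
df['votes'] = df['votes'].fillna(df['votes'].mean())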

More on fixing null values, here.

to_datetime():

If our data has a date column, and within that column there are some NaN values, we can use this.

# replacing the null values in a datetime column
df['Date'] = pd.to_datetime(df['Date'])

What it does is, it changes those NaN values to NaT values, which stands for Not a Time.

Also, most of the time when we get our data, the date columns come as strings and not in a datetime format.

So if we use this method, the entire column is converted to datetime and we can freely use any other datetime functions on that column.
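
For instance, once the column is a proper datetime, the .dt accessor exposes the usual date parts. A small sketch:

# extract date parts after the conversion
df['year'] = df['Date'].dt.year
df['month'] = df['Date'].dt.month_name()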

duplicated():

As a data guy, we should be aware of duplicate rows in our data.

# checking duplicate values
df.duplicated().sum()

This method helps us check whether there are any duplicate rows in our dataframe, and again we chain the sum() method to get the total number of duplicate rows.

drop_duplicates():

Now, if we have duplicate rows in our data, we use this.

# dropping duplicate values
df.drop_duplicates(inplace=True)

By using this method, we drop all the duplicate rows from our data, and we also update the dataframe at the same time using that “inplace” parameter.
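
Both duplicated() and drop_duplicates() accept a subset parameter to define duplicates on specific columns only, plus keep to decide which copy survives. A hedged sketch (the column name is a placeholder borrowed from the earlier examples):

# rows count as duplicates when 'enrollee_id' repeats; keep the last occurrence
df = df.drop_duplicates(subset=['enrollee_id'], keep='last')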

merge():

Let’s say you’re working on some problem and the data you got is divided into 2 or more datasets, and you find that the solution you’re after can only be reached if the datasets are merged together. That’s why we use merge().

# merging 2 dataframes
df1 = pd.read_csv('data1.csv')
df2 = pd.read_csv('data2.csv')

newdf = df1.merge(df2, how='right')

The idea here is simple. We’ll merge these two dataframes on the basis of the right join.

Now what is a right join? Well, there are a total of 5 join types, and it is one of them: left join, right join, inner join, outer join and cross join.

What we’re doing here is using the “how” parameter to tell pandas to merge these 2 dataframes on the basis of a right join.
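
In practice we usually also tell pandas which column(s) to join on with the on parameter (or left_on / right_on when the key has different names in the two dataframes). A hedged sketch with placeholder column names:

# join on a shared key column, keeping every row from the left dataframe
newdf = df1.merge(df2, on='enrollee_id', how='left')

# when the key column is named differently in each dataframe
newdf = df1.merge(df2, left_on='enrollee_id', right_on='id', how='inner')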

More on merge and join, here.

corr():

When we’re working with a dataset and we need to find the correlation between two or more variables ( aka columns ), we use corr().

# finding correlation between variables
df = pd.read_csv('data.csv')

df.corr(method='pearson')

So what it does is it finds the correlation among the columns, and the “method” parameter is where we tell pandas which method to use to find the correlation.

There are a total of 3 methods: pearson, kendall and spearman. We pick one based on our needs.
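
Two small extras worth knowing: we can compute the correlation between just two columns directly, and in recent pandas versions (1.5+) we can pass numeric_only=True so non-numeric columns don’t get in the way. A hedged sketch with placeholder column names:

# correlation between two specific columns
df['votes'].corr(df['rating'])

# correlation matrix over the numeric columns only
df.corr(method='pearson', numeric_only=True)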

More on corr, here.

groupby():

Many times while working with data, we need to group the data on the basis of some variables and then get some insights from it, and in pandas we can do that with groupby().

# grouping data
df.groupby('teams').count()

Let’s say the ‘teams’ column has multiple team names. Now we want to see for each team what their performance is, so that we can find out which is the best team.

This is where we can use the groupby() method, and it is a very important method to keep in your toolkit.

Also, count() is another method ( an aggregate function ): while using groupby on a dataframe, we apply some aggregate function to it. There are other aggregate functions as well, like mean(), sum(), min() and max(). We mostly use them to make our analysis easy.
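
When one aggregate isn’t enough, the .agg() method lets us apply several at once, even different ones per column. A hedged sketch, assuming a numeric 'score' column (that name is only illustrative):

# several aggregations per team in one go
df.groupby('teams').agg(
    matches=('score', 'count'),
    avg_score=('score', 'mean'),
    best_score=('score', 'max'),
)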

More on groupby, here.

pivot_table():

A pivot table is a way to quickly summarize large amounts of data.

# pivot table
df = pd.pivot_table(df, index=['A', 'B'])

Here, in the pivot_table() method, we’re passing the dataframe and also telling pandas that ‘A’ and ‘B’ form the index by using the “index” parameter.

What we can do with it is we can aggregate a numeric column as a cross tabulation against two categorical columns.
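
Putting that together, a typical call passes values (the numeric column to summarize), index and columns (the two categorical axes) and aggfunc. A hedged sketch reusing the column names from the .tsv example earlier:

# average rating for every genre / release-year combination
pd.pivot_table(df, values='rating', index='genres',
               columns='release_year', aggfunc='mean')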

More on pivot table, here.

Hope this helped. I’m going to share more of my journey, which will be totally beginner-friendly. Check out my other blogs on Medium or we can connect on LinkedIn.
