12 Pandas Methods To Master: A Complete Roadmap To Be A Data Scientist

Datasthan
8 min read · Aug 14, 2022


Last week I wrote a blog on Python for data science: Python For Data Science: A Complete Roadmap To Be A Data Scientist.

It seems like many people liked it. Well, then there is no point in stopping now, is there?

And that’s why this is gonna be the second part of the blog series “A Complete Roadmap To Be A Data Scientist”.

Let’s get started!

So you’ve learned Python, what’s next?

The obvious answer is Pandas. But…

Why Pandas?

A one-word answer is data. See, in the real world the raw data we get is so messy that it is practically impossible to work with. And here’s where Pandas comes into play. First of all, it provides a structure for the data so that it becomes much easier to handle. And then it provides functions to clean, merge, analyze, visualize… basically any step you want to perform on your data, Pandas will be there for you.

Now that it’s clear why Pandas, let’s import pandas as pd in your life.

Today we’ll be talking about the 12 most used methods in Pandas:

read_csv():

This function is like the gateway to data science, because without it we cannot read the data, and if we don’t have data, what’s the point of learning all of this? Now, besides reading the data, there are a few parameters of this function we need to learn about.

# opening a local csv file
df = pd.read_csv('census.csv')
df.head()

It is used to open a .csv file that is already stored on your local computer.

The head() method is used to show a small part of the data; by default it shows the first 5 rows. But we can customize it by passing another number, like df.head(10).

# opening a csv file from a URL
import requests
from io import StringIO

url = "https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0"}
req = requests.get(url, headers=headers)
data = StringIO(req.text)

pd.read_csv(data)

Whenever there is a .csv file uploaded on a website, we can directly fetch and use that using this piece of code.

Two libraries, along with Pandas, are used here: requests and StringIO (from Python’s io module).

The requests library is used to send a request to the server where the data is stored. There are different requests methods, here we’re using the get method. More on requests, here.

Here the “url” is the page url, where the data is stored.

The “headers” dictionary is sent to the server along with the request, so that the server treats our request as one coming from a regular browser and not from a bot.

# opening a local .tsv file
pd.read_csv('movie_titles_metadata.tsv', sep='\t',
            names=['sno','name','release_year','rating','votes','genres'])

First of all, .tsv means tab-separated values. So just like .csv, there are other delimited file formats as well, and .tsv is one of them.

Now, read_csv has a parameter called sep [ meaning separator ], and by default its value is ‘,’, which is valid for a .csv file but not for a .tsv file.

That’s why, while handling a .tsv file, we can use this parameter and set it as sep='\t' [ \t means tab ], so pandas will use a tab as the separator to read that file.

Also, the “names” parameter is used to name the columns: whenever it is hard for pandas to figure out which row to use as the header, or when there are no column names in our data at all, we can pass them with this parameter.

# setting a column as the index
pd.read_csv('aug_train.csv', index_col='enrollee_id')

The “index_col” parameter is used to tell pandas that we want this column as our index, and pandas will do that for us.

# using only particular columns
pd.read_csv('aug_train.csv',
            usecols=['enrollee_id','gender','education_level'])

There will be situations where we have a dataset with a large number of columns, and to complete our work we don’t really need all of them.

That’s why we use the “usecols” parameter and just pass the list of columns we want to use.

# encoding parameter
pd.read_csv('zomato.csv', encoding='latin-1')

Usually, when we work on a dataset, it opens without a fuss and we can get on with our work.

And then along comes some file that throws something like this at you: “UnicodeDecodeError”.

Why does this happen? Because the encoding of that file is different from UTF-8. UTF-8 is the common encoding used on a daily basis, and in read_csv the “encoding” parameter is set to UTF-8 by default.

Now, when we have a file with a different encoding, we can just use this “encoding” parameter and let pandas know which encoding it uses.

Otherwise, we can open the file in Notepad, save it with a different encoding, and then open it normally in pandas.

# skipping bad rows
pd.read_csv('BX-Books.csv', sep=';',
            encoding='latin-1', error_bad_lines=False)

There will be times when your data has some bad rows, meaning a row might contain more values than it is supposed to.

This is where this parameter comes in. If we set error_bad_lines=False, pandas will automatically skip those rows.
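
One caveat: in recent pandas versions error_bad_lines is deprecated (and removed in pandas 2.0) in favour of the on_bad_lines parameter. A minimal sketch of the newer form, assuming pandas 1.3 or later:

# newer pandas: skip malformed rows with on_bad_lines instead
pd.read_csv('BX-Books.csv', sep=';',
            encoding='latin-1', on_bad_lines='skip')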

# loading a huge dataset in chunks
dfs = pd.read_csv('aug_train.csv', chunksize=5000)

for chunk in dfs:
    print(chunk.shape)

Sometimes the dataset is so huge that it becomes way too much for our RAM to hold. So if we use this “chunksize” parameter, pandas will read the data in chunks, and it won’t eat up our RAM.

Just one thing to remember here: if we read the data in chunks, we need a loop to apply any further functions to each chunk.
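
For example, a small sketch of reading only a couple of columns chunk by chunk and then stitching the pieces back into one dataframe (the column names are the ones from the usecols example above):

# read in chunks, then combine the pieces into a single dataframe
chunks = pd.read_csv('aug_train.csv', chunksize=5000,
                     usecols=['enrollee_id', 'gender'])
df = pd.concat(chunks, ignore_index=True)
print(df.shape)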

read_json():

This is the function we use in pandas to read a json file.

# reading a json file
df = pd.read_json('data.json')
print(df.to_string())

The to_string() method is used to print the entire dataframe.
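
When the JSON is nested rather than a flat table, pandas also offers pd.json_normalize(), which flattens nested records into columns. A small sketch with made-up data:

# flattening nested json records into a dataframe
records = [
    {"name": "A", "address": {"city": "Delhi", "zip": "110001"}},
    {"name": "B", "address": {"city": "Mumbai", "zip": "400001"}},
]
df = pd.json_normalize(records)
print(df)   # columns: name, address.city, address.zip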

isnull():

Let’s say we imported our dataset and now we want to start working on it. But here’s the thing: always, always, always check whether there are any missing values or duplicate values in your data before you start working on it.

# checking for missing values
df.isnull().sum()

The isnull() method checks whether there are any null values ( or NaN values ) in your data, and then the sum() method ( an aggregate function; we’ll talk about it later in this blog ) just sums up the result for each column.
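
On bigger datasets it can be more readable to look at the share of missing values per column rather than the raw counts. A small sketch:

# percentage of missing values in each column, largest first
missing_pct = df.isnull().mean() * 100
print(missing_pct.sort_values(ascending=False))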

dropna():

This is the next method we should be using in Pandas.

# removing the missing values
df = df.dropna()

The dropna() method removes all the missing values from our dataframe. If we use it, every row that has a missing value will be dropped.
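
dropna() also takes a couple of handy parameters, such as subset to only consider certain columns and thresh to keep rows with at least a minimum number of non-null values. A hedged sketch (the column name is just a placeholder):

# drop a row only when the 'gender' column is missing
df = df.dropna(subset=['gender'])

# keep only rows that have at least 3 non-null values
df = df.dropna(thresh=3)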

fillna():

Sometimes we just cannot drop some rows, because there might be some important data we still need.

# replacing the null values
df.fillna(0, inplace=True)

The fillna() method helps us replace those null values with a particular placeholder that we pass to the method. Here we passed 0 as the placeholder.

Also, the “inplace=True” parameter updates the original dataframe in place when we apply the fillna() method.
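
Instead of a flat 0, a common choice for a numeric column is to fill with a statistic like the column’s mean or median. A minimal sketch, assuming a numeric column called 'votes' (the name is only illustrative):

# fill missing values in a numeric column with that column's mean
df['votes'] = df['votes'].fillna(df['votes'].mean())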

More on fixing null values, here.

to_datetime():

If our data has a date column, and within that column there are some NaN values, we can use this.

# replacing the null values in a datetime column
df['Date'] = pd.to_datetime(df['Date'])

What it does is, it changes those NaN values to NaT values, which stands for Not a Time.

Also, most of the time when we get our data, the date columns come as strings and not in a datetime format.

So if we use this method, the entire column is converted to datetime and we can freely use any other datetime functions on that column.
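
For instance, once the column is a proper datetime, the .dt accessor exposes the usual date parts. A small sketch:

# extract date parts after the conversion
df['year'] = df['Date'].dt.year
df['month'] = df['Date'].dt.month_name()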

duplicated():

As a data guy, we should be aware of duplicate rows in our data.

# checking duplicate values
df.duplicated().sum()

This method helps us check whether there are any duplicate rows in our dataframe, and again we chain the sum() method to get the total number of duplicate rows.

drop_duplicates():

Now, if we have duplicate rows in our data, we use this.

# dropping duplicate values
df.drop_duplicates(inplace=True)

By using this method, we drop all the duplicate rows from our data, and we also update the dataframe at the same time using that “inplace” parameter.
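
Both duplicated() and drop_duplicates() accept a subset parameter to define duplicates on specific columns only, plus keep to decide which copy survives. A hedged sketch (the column name is a placeholder borrowed from the earlier examples):

# rows count as duplicates when 'enrollee_id' repeats; keep the last occurrence
df = df.drop_duplicates(subset=['enrollee_id'], keep='last')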

merge():

Let’s say you’re working on some problem and the data you got is divided into 2 or more datasets, and you find that the solution you’re after can only be reached if the datasets are merged together. That’s why we use merge().

# merging 2 dataframes
df1 = pd.read_csv('data1.csv')
df2 = pd.read_csv('data2.csv')

newdf = df1.merge(df2, how='right')

The idea here is simple. We’ll merge these two dataframes on the basis of the right join.

Now what is a right join? Well, there are a total of 5 join types, and it is one of them: left join, right join, inner join, outer join and cross join.

What we’re doing here is using the “how” parameter to tell pandas to merge these 2 dataframes on the basis of a right join.
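
In practice we usually also tell pandas which column(s) to join on with the on parameter (or left_on / right_on when the key has different names in the two dataframes). A hedged sketch with placeholder column names:

# join on a shared key column, keeping every row from the left dataframe
newdf = df1.merge(df2, on='enrollee_id', how='left')

# when the key column is named differently in each dataframe
newdf = df1.merge(df2, left_on='enrollee_id', right_on='id', how='inner')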

More on merge and join, here.

corr():

When we’re working with a dataset and we need to find the correlation between two or more variables ( aka columns ), we use corr().

# finding correlation between variables
df = pd.read_csv('data.csv')

df.corr(method='pearson')

So what it does is it finds the correlation among the columns, and the “method” parameter is where we tell pandas which method to use to find the correlation.

There are a total of 3 methods: pearson, kendall and spearman. We pick one based on our needs.
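
Two small extras worth knowing: we can compute the correlation between just two columns directly, and in recent pandas versions (1.5+) we can pass numeric_only=True so non-numeric columns don’t get in the way. A hedged sketch with placeholder column names:

# correlation between two specific columns
df['votes'].corr(df['rating'])

# correlation matrix over the numeric columns only
df.corr(method='pearson', numeric_only=True)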

More on corr, here.

groupby():

Many times while working with data, we need to group the data on the basis of some variables and then get some insights from it, and in pandas we can do that with groupby().

# grouping data
df.groupby('teams').count()

Let’s say the ‘teams’ column has multiple team names. Now we want to see for each team what their performance is, so that we can find out which is the best team.

This is where we can use the groupby() method, and it is a very important method to keep in your toolkit.

Also, count() is another method ( an aggregate function ): while using groupby on a dataframe, we apply some aggregate function to it. There are other aggregate functions as well, like mean(), sum(), min() and max(). We mostly use them to make our analysis easy.
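
When one aggregate isn’t enough, the .agg() method lets us apply several at once, even different ones per column. A hedged sketch, assuming a numeric 'score' column (that name is only illustrative):

# several aggregations per team in one go
df.groupby('teams').agg(
    matches=('score', 'count'),
    avg_score=('score', 'mean'),
    best_score=('score', 'max'),
)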

More on groupby, here.

pivot_table():

A pivot table is a way to quickly summarize large amounts of data.

# pivot table
df = pd.pivot_table(df, index=['A', 'B'])

Here, in the pivot_table() method, we’re passing the dataframe and also telling pandas that ‘A’ and ‘B’ form the index by using the “index” parameter.

What we can do with it is we can aggregate a numeric column as a cross tabulation against two categorical columns.
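
Putting that together, a typical call passes values (the numeric column to summarize), index and columns (the two categorical axes) and aggfunc. A hedged sketch reusing the column names from the .tsv example earlier:

# average rating for every genre / release-year combination
pd.pivot_table(df, values='rating', index='genres',
               columns='release_year', aggfunc='mean')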

More on pivot table, here.

Hope this helped. I’m going to share more of my journey, which will be totally beginner-friendly. Check out my other blogs on Medium or we can connect on LinkedIn.
