Must-Know Pandas Parameters for Future Data Scientists
Imagine you are sitting in your favorite cafe, a warm cup of coffee with a perfect foam art in your hand. As you savor the first sip, you’re ready to dive into the world of data science, where mastering the right tool makes all the difference. Just like that barista perfected your coffee, you too can perfect your data manipulation skills with Pandas, a powerful Python library.
In this blog, we’ll explore the must-know Pandas parameters that will help you streamline your data analysis workflow, much like how a well-crafted coffee enhances your morning. Let’s get started!
Note that detailed explanations of NumPy and Pandas are available on Medium, so do check them out!
Check Out the Links!
- Data Analysis Beginner’s Guide
- NumPy Arrays — Day 01
- Introduction to Random Numbers — Day 02
- Mastering Essential NumPy Methods — Day 03
- Indexing and Slicing — Day 04
- Exploring NumPy Functions — Day 05
- Pandas Series
- Indexing and Slicing using Pandas Series
Table of Contents:
∘ What is the use of Pandas?
∘ Why learn Pandas If You Know Excel?
∘ Essential Pandas Parameters
∘ Interesting Task Using IPL Dataset and Pandas Parameters
∘ Analysis Tasks
∘ Conclusion
∘ Stay Connected With Us!
∘ Thank You!
∘ Bonus Suggestion
What is the use of Pandas?
Pandas is an open-source Python library that is mainly used for data manipulation and analysis. It is built on top of NumPy.
Think of Pandas as a pro version of Excel, one that lets you clean, analyze, and visualize data quickly and efficiently.
Why learn Pandas If You Know Excel?
If you’re already comfortable with Excel, you might be wondering why you should learn Pandas at all. Here’s a simple comparison to help you understand the benefits of Pandas:
— Handling Large Datasets:
- Excel: Great for small to medium-sized datasets, but struggles with large datasets (millions of rows).
- Pandas: Can easily handle much larger datasets without slowing down your computer.
— Complex Data Manipulation:
- Excel: Performing complex data operations can be cumbersome and error-prone, especially with multiple steps.
- Pandas: Provides powerful functions to filter, transform, and merge datasets with clear code. Complex operations become simpler and more reliable.
— Integration with Other Tools:
- Excel: Works well with Microsoft Office tools, but has limited integration with other software.
- Pandas: Easily integrates with the Python ecosystem, allowing you to use powerful libraries for tasks like machine learning (Scikit-learn), visualization(Matplotlib, Seaborn), and more.
Essential Pandas Parameters
You can get all the datasets mentioned in my blog on my GitHub, here.
- ‘header’:
Suppose your dataset’s column names aren’t in the very first row. How would you tell Pandas where to find them?
In such a case, we use the ‘header’ parameter, which tells Pandas which row to use as the column names. By default, Pandas uses the first row, but you can change this if your column names are in a different row.
import pandas as pd
df = pd.read_csv('test.csv', header = 1)
df.head()
Note that ‘header’ is zero-indexed: the default (header = 0) uses the first row, while (header = 1) tells Pandas that the column names are in the second row and skips everything above it.
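To make the zero-indexing concrete, here’s a minimal sketch using a tiny inline file (the contents are hypothetical) whose first line is junk and whose second line holds the real column names:

```python
import io
import pandas as pd

# Inline stand-in for a CSV file: line 0 is junk, line 1 has the column names
csv_data = io.StringIO(
    "exported from tool\n"
    "enrollee_id,gender\n"
    "1,Male\n"
    "2,Female\n"
)

# header is zero-indexed, so header=1 takes the second physical line as the
# header row; everything above it is skipped
df = pd.read_csv(csv_data, header=1)
print(list(df.columns))  # ['enrollee_id', 'gender']
```

With the default header=0, the junk line would have become the (single) column name instead.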
- ‘usecols’ parameter:
To get only some specific columns from the dataset, we use the ‘usecols’ parameter.
Let’s say you want to get data only from 3 specific columns like ‘enrollee_id’, ‘gender’, and ‘education_level’:
df = pd.read_csv('aug_train.csv', usecols = ['enrollee_id', 'gender', 'education_level'])
df.head()
- ‘index_col’ parameter:
Instead of the default serial-number index, if I want ‘enrollee_id’ to be my index column, we can use ‘index_col’ in such a case.
df = pd.read_csv('aug_train.csv', index_col = 0)
df.head()
Output:
- ‘nrows’ parameter:
It tells Pandas how many rows of the file to read.
Suppose the dataset has 1,650 rows and I want only the first 100; we can use ‘nrows’ for that.
df = pd.read_csv('aug_train.csv', nrows = 100)
df
- dtype parameter:
It lets you define the data type for one or more columns while reading the file.
We can see that the ‘target’ column is of ‘float’ data type. But, I want to convert it to ‘int’ data type.
df = pd.read_csv('aug_train.csv', dtype = {'target': 'int'})
df.head()
Output:
We can see that the ‘target’ column is now converted to an integer data type.
Isn’t it that simple?
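One caveat worth knowing: casting to a plain ‘int’ raises an error if the column contains missing values, because NaN cannot be stored in a regular integer column. Pandas’ nullable ‘Int64’ dtype handles that case. Here’s a minimal sketch using a tiny inline CSV (hypothetical values) in place of aug_train.csv:

```python
import io
import pandas as pd

# Inline stand-in for the dataset; the third row has a missing 'target'
csv_data = io.StringIO(
    "enrollee_id,target\n"
    "1,1.0\n"
    "2,0.0\n"
    "3,\n"
)

# dtype={'target': 'int'} would fail here because of the missing value;
# the nullable 'Int64' dtype stores integers alongside <NA>
df = pd.read_csv(csv_data, dtype={'target': 'Int64'})
print(df['target'].dtype)  # Int64
```

If your column has no missing values, the plain ‘int’ shown above works fine.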
- ‘parse_dates’ Parameter:
ipl_df = pd.read_csv('IPL Matches 2008-2020.csv')
ipl_df.dtypes
Output:
In this dataset, we want to convert the ‘date’ column from ‘object’ to ‘date’ data type.
ipl_df = pd.read_csv('IPL Matches 2008-2020.csv', parse_dates = ['date'])
ipl_df.dtypes
Output:
The ‘parse_dates’ parameter helps convert specified columns to date format automatically. This is especially useful for time series data.
- ‘na_values’ parameter:
The ‘na_values’ help Pandas identify which strings to consider as the missing values (NaN). This helps in cleaning your data more effectively.
ipl_df = pd.read_csv('IPL Matches 2008-2020.csv', na_values=['NULL', 'NA'])
ipl_df
You might be wondering why we need to specify ‘na_values’ since the dataset might already have NaN values. However, there are cases where missing values are represented by different characters, such as a hyphen (-) or other strings. These are not automatically recognized as NaN or null values, which can lead to errors if not handled properly from the start.
To avoid such errors, it’s best to specify these characters or strings as null values at the loading stage. This way, Pandas will treat them as missing values right from the beginning.
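Here’s a small sketch of that hyphen case, using a tiny inline CSV (hypothetical scores) instead of the IPL file:

```python
import io
import pandas as pd

# Inline stand-in where a missing score is written as '-'
csv_data = io.StringIO(
    "team,score\n"
    "RCB,160\n"
    "CSK,-\n"
    "MI,155\n"
)

# Without na_values, '-' stays as text and the whole column is read as 'object';
# listing it in na_values turns it into a proper NaN and keeps the column numeric
df = pd.read_csv(csv_data, na_values=['-'])
print(df['score'].isna().sum())  # 1
```

You can confirm the difference by reading the same data without na_values and checking df['score'].dtype.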
- ‘converters’ parameter:
Original data frame:
Now, instead of ‘Royal Challengers Bangalore’, I want ‘RCB’ to appear in my dataset.
def rename(name):
    if name == 'Royal Challengers Bangalore':
        return 'RCB'
    else:
        return name

df = pd.read_csv('IPL Matches 2008-2020.csv', converters={'team1': rename})
df['team1'].head()
Output:
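For a one-off renaming like this, the same thing can be written inline with a lambda. A minimal sketch, using an inline CSV with made-up rows in place of the IPL file:

```python
import io
import pandas as pd

# Inline stand-in for the IPL file (hypothetical rows)
csv_data = io.StringIO(
    "team1,team2\n"
    "Royal Challengers Bangalore,Chennai Super Kings\n"
    "Mumbai Indians,Royal Challengers Bangalore\n"
)

# converters applies the function to every value of the named column while reading
df = pd.read_csv(
    csv_data,
    converters={'team1': lambda name: 'RCB' if name == 'Royal Challengers Bangalore' else name},
)
print(df['team1'].tolist())  # ['RCB', 'Mumbai Indians']
```

Note that the converter here was applied only to ‘team1’; the same value in ‘team2’ is left untouched unless you add a converter for that column too.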
Interesting Task Using IPL Dataset and Pandas Parameters
- Load the IPL dataset and take a quick look at the first few rows and the data types.
- Convert the ‘date’ column to the DateTime format for easier time-based analysis.
- Specify custom missing value indicators (e.g., ‘NULL’, ‘NA’) and verify they are treated as NaN.
- Set a column like ‘id’ or ‘match_id’ as the index for easier row referencing.
- Load only the necessary columns to focus on specific analyses (e.g., ‘date’, ‘team1’, ‘team2’, ‘winner’).
- Ensure that numeric columns like ‘win_by_runs’ and ‘win_by_wickets’ are read in the correct format.
- Suppose the dataset has a column ‘score’ with values like ‘10,000’. Use a converter to clean these values.
- For quick testing or specific analyses, load only the first 500 rows.
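Several of the steps above can be combined into a single read_csv call. Here’s a sketch of that combined load, using a tiny inline stand-in for ‘IPL Matches 2008-2020.csv’ (the rows and the exact column names are assumptions; adjust them to match the real file):

```python
import io
import pandas as pd

# Inline stand-in for the IPL dataset (hypothetical rows)
csv_data = io.StringIO(
    "id,date,team1,team2,winner\n"
    "1,2008-04-18,RCB,KKR,KKR\n"
    "2,2008-04-19,CSK,KXIP,NULL\n"
    "3,2008-04-19,MI,RCB,MI\n"
)

# Combining the parameters covered above in one call
df = pd.read_csv(
    csv_data,
    usecols=['id', 'date', 'team1', 'team2', 'winner'],  # only the needed columns
    index_col='id',               # use the match id as the row index
    parse_dates=['date'],         # read the date column as datetime
    na_values=['NULL', 'NA'],     # treat these strings as missing
    nrows=2,                      # quick testing: only the first 2 data rows
)
print(len(df), df['date'].dtype)
```

Note that when both are given, the ‘index_col’ column must also appear in ‘usecols’.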
Analysis Tasks
- Task 1: Determine the number of matches each team has won.
- Task 2: Analyze the performance trends of teams over the years.
- Task 3: Identify which teams have the most wins at home vs. away.
- Task 4: Visualize the number of matches played each year using a bar chart.
- Task 5: Calculate and visualize the win margins (runs and wickets) for different teams.
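As a hint for Task 1, value_counts does most of the work. A minimal sketch with a made-up ‘winner’ column standing in for the real dataset:

```python
import io
import pandas as pd

# Hypothetical mini dataset; the real IPL file has a 'winner' column like this
csv_data = io.StringIO("winner\nMI\nCSK\nMI\nRCB\nMI\nCSK\n")

df = pd.read_csv(csv_data)

# value_counts tallies how many matches each team has won,
# sorted from most to fewest wins
wins = df['winner'].value_counts()
print(wins['MI'])  # 3
```

The remaining tasks build on the same idea, grouping by year, team, or venue before counting or plotting.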
Conclusion
Hopefully, this gives you a clear understanding of how essential Pandas parameters can simplify and enhance your data manipulation tasks. By mastering these parameters, you can handle various data challenges efficiently and set a strong foundation for your journey in data science.
Want to learn more about importing files like CSV and TSV? Check out my previous article:
Stay Connected With Us!
- Subscribe to our new YouTube Channel for detailed tutorial videos, all for free.
Thank You!
If you liked this blog, please like, share, and follow us on Medium. Also, don’t forget to subscribe to our email list to get the latest articles on data science and machine learning.
See you with a new blog, till then keep learning, and keep smiling!
Bonus Suggestion
If you want the answer or code for the task above, please let me know in the comments. I’ll be happy to assist you further by creating a separate blog with complete details. Additionally, if you prefer a tutorial video, please let me know, and I’ll be glad to accommodate your request.