Elevate Your Data Analysis Skills with Pandas

Master Pandas: Essential Tips & Tricks

CyCoderX
Python’s Gurus
11 min read · Jun 15, 2024


Photo by Luke Chesser on Unsplash

Throughout my journey, I regularly share insights on Python programming, from beginner tips to advanced techniques, helping others enhance their coding skills and stay updated with the latest trends. My content spans various topics, including Python, SQL, databases, tech advancements in AI and more. Your engagement, whether through claps, comments, or following me, fuels my passion for creating and sharing more informative content.

In this article, we will focus on one of the fundamental libraries in Python’s data science toolkit: Pandas. Whether you’re a seasoned programmer or just starting out, understanding Pandas can significantly enhance your data manipulation and computational efficiency. We’ll cover the basics, explore key features and provide practical examples to get you up and running with Pandas in no time.

So, let’s dive into the world of Pandas and unlock new possibilities in Python programming!

If you’re interested in more Python or SQL content, please consider following me. Alternatively, you can click here to check out my Python list on Medium.

Introduction to Pandas and the Example Dataset

Pandas is a powerhouse library for data manipulation and analysis in Python. In this article, we’ll explore a set of essential Pandas operations that will elevate your proficiency in handling and analyzing datasets. To illustrate them, we’ll use an example dataset focused on sales data.

If you don’t have Pandas installed, simply run pip install pandas.

Example Dataset

Importing Pandas: Before you can use any Pandas functions, you need to import the Pandas library. This is simply done with the line import pandas as pd at the beginning of your script, which allows you to access all of Pandas’ functions using the prefix pd.

import pandas as pd

# Creating a basic example dataset
data = {
    'Product': ['Laptop', 'Tablet', 'Smartphone', 'Monitor', 'Keyboard'],
    'Price': [1000, 300, 800, 200, 50],
    'Quantity': [50, 150, 200, 75, 100],
    'Category': ['Electronics', 'Electronics', 'Electronics', 'Accessories', 'Accessories']
}

df = pd.DataFrame(data)
print(df)

This code creates a simple DataFrame with columns for products, prices, quantities and categories. It’s a foundational example that we’ll use to demonstrate various Pandas functionalities.

Are you also interested in SQL and Database content? Click here to check out my list on Medium.

Example Analysis

Read Data from a CSV File

Reading data from a CSV file is one of the most common operations in Pandas. Here’s how you can do it:

# Read data from a CSV file
raw_df = pd.read_csv('sales_data.csv')
raw_df.head()

After reading the data, it’s common to want to view a portion of the DataFrame to ensure it was loaded correctly or to get a quick sense of the data’s structure.

  • raw_df.head() displays the first five rows of the DataFrame. You can pass an integer to head() to display a different number of rows, e.g., raw_df.head(10) will display the first ten rows.
  • Similarly, raw_df.tail() shows the last five rows of the DataFrame. This can be helpful for checking the end of your dataset, especially if you suspect data issues or want to see the latest entries.

The pd.read_csv() function is highly customizable with several parameters to handle different data peculiarities. For example:

  • sep or delimiter can be used if your data isn't comma-separated (which is the default).
  • index_col can be used to specify a column from the CSV to use as the index of the DataFrame.
  • usecols can be specified to load only a subset of columns.
  • dtype can be used to force columns into certain data types.
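
Here’s a minimal sketch of how these parameters combine in a single call. Note that the column names used here ('Date', 'Product', 'Price') are hypothetical and assume the CSV contains them:

# Sketch of pd.read_csv() with common parameters; the columns
# 'Date', 'Product' and 'Price' are hypothetical
subset_df = pd.read_csv(
    'sales_data.csv',
    sep=',',                               # explicit separator (comma is the default)
    index_col='Date',                      # use the 'Date' column as the row index
    usecols=['Date', 'Product', 'Price'],  # load only this subset of columns
    dtype={'Price': 'float64'}             # force 'Price' into a float column
)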

Display Basic Information About the DataFrame

Understanding the structure and composition of your DataFrame is crucial. The info() method provides a concise summary:

# Display basic information about the DataFrame
df.info()

The info() method on a DataFrame provides essential details about the DataFrame's structure and data types. It outputs a summary that includes the total number of entries, the number of non-null entries per column, the data type of each column and the memory usage of the DataFrame. Here's a detailed breakdown of what df.info() offers:

  • Column Names: It lists all column names in the DataFrame.
  • Data Types: Shows the data type of each column (e.g., int, float, object, datetime).
  • Non-Null Count: Indicates the number of non-null entries in each column. This is helpful for identifying columns with missing values.
  • Memory Usage: Provides information on the memory consumption of the DataFrame, which can be crucial for handling large datasets or optimizing performance.

Descriptive Statistics of Numeric Columns

To get a quick statistical summary of your data, use the describe() method:

# Descriptive statistics of numeric columns
df.describe()

The describe() method offers a quick statistical summary of the numeric columns in your DataFrame, which is very useful for getting a sense of the data's distribution. This method primarily applies to columns that contain numeric data, automatically excluding non-numeric data types unless specifically included. It’s a valuable tool for quickly identifying potential outliers, understanding the central tendency and the spread of your data. Here’s what df.describe() provides:

  • Count: The number of non-null entries in each column.
  • Mean: The average value of each column.
  • Std (Standard Deviation): Measures the amount of variation or dispersion in the column.
  • Min: The smallest value in each column.
  • 25% (First Quartile): The value below which 25% of the data falls.
  • 50% (Median): The middle value of the data.
  • 75% (Third Quartile): The value below which 75% of the data falls.
  • Max: The largest value in each column.
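
Since describe() skips non-numeric columns by default, you can pass include='all' to also summarize text columns such as 'Product' and 'Category'; a minimal sketch:

# Include all columns: text columns get count/unique/top/freq,
# numeric columns get the usual statistics (NaN where not applicable)
df.describe(include='all')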

Together, df.info() and df.describe() provide a powerful starting point for data analysis by giving a thorough understanding of the DataFrame’s content, structure and the statistical properties of the numeric fields. This insight is crucial before diving into more complex data manipulation or analysis tasks.

Interested in more Python content? Click here to check out my list on Medium.

Grouping and Aggregating Data

Grouping data and performing aggregate functions is a powerful feature of Pandas. Here’s an example of how to calculate the mean price and the total quantity in each category:

# Grouping and aggregating data
grouped_df = df.groupby('Category').agg({'Price': 'mean', 'Quantity': 'sum'})
print(grouped_df)

Grouping and aggregating data in Pandas allows you to perform complex data analysis easily. Here’s a concise explanation of how the groupby() and agg() functions work together in your DataFrame:

Using df.groupby()

The groupby() function splits the data into groups based on some criteria. Typically, you specify the column(s) you want to group by.

Using agg()

After grouping, the agg() function is used to perform aggregation operations, turning the grouped data into a single summary value per group. You can specify different aggregation functions for different columns using a dictionary.

In this example:

  • Grouping: Data is grouped by the 'Category' column.
  • Aggregating: Calculates the average of 'Price' and the sum of 'Quantity' for each category.

Output

The resulting grouped_df will be a DataFrame where each category from the original DataFrame is now an index, with columns showing the average price and total quantity for each category. This approach is efficient for summarizing data by categories.
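
If you prefer descriptive column names in the result, recent versions of Pandas also support named aggregation, where each keyword argument becomes a column in the output; a minimal sketch on the same data:

# Named aggregation: each keyword becomes a column in the result
summary_df = df.groupby('Category').agg(
    avg_price=('Price', 'mean'),
    total_quantity=('Quantity', 'sum')
)
print(summary_df)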

Filtering Data

Filtering data in Pandas based on conditions is a powerful way to narrow down your analysis to only the relevant subsets, for example selecting only the products priced above $500.

Using Condition-Based Filtering

To filter data, you apply a condition directly to the DataFrame. The condition is typically stated as a comparison that returns a boolean Series (mask), which Pandas uses to select rows where the condition is True.

# Filtering data based on conditions
filtered_df = df[df['Price'] > 500]
print(filtered_df)

In this example:

  • Condition: The condition df['Price'] > 500 checks which rows in the DataFrame have a 'Price' column value greater than 500.
  • Filtering: df[df['Price'] > 500] selects only those rows where the condition is true. The result is a new DataFrame, filtered_df, containing only products priced above $500.

Output

The output filtered_df will contain all rows from the original DataFrame where the product's price exceeds $500, displaying the entire row's data for each matching entry. This makes it easy to focus on higher-priced products or to further analyze this specific data segment.
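
Conditions can also be combined with & (and), | (or) and ~ (not), as long as each individual condition is wrapped in parentheses; a minimal sketch using the same DataFrame:

# Products priced above $500 that also belong to 'Electronics'
expensive_electronics = df[(df['Price'] > 500) & (df['Category'] == 'Electronics')]
print(expensive_electronics)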


Sorting Data

Sorting data in Pandas is a straightforward operation that allows you to organize your DataFrame in a meaningful order based on one or more columns. Here’s a brief explanation and example on how to sort data by price in descending order:

Using sort_values()

The sort_values() method sorts a DataFrame by the values of one or more columns. You can specify whether you want the sorting to be in ascending or descending order.

# Sorting data
sorted_df = df.sort_values(by='Price', ascending=False)
print(sorted_df)

In this example:

  • Sorting: The method sort_values(by='Price', ascending=False) is used to sort the DataFrame based on the 'Price' column. The parameter ascending=False specifies that the sorting should be in descending order, meaning the higher prices will appear at the top of the DataFrame.

Output

The resulting sorted_df will be a new DataFrame where the entries are ordered from the highest to the lowest price. This type of sorting is particularly useful when you need to quickly identify the most expensive or least expensive items in your dataset.
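
sort_values() also accepts a list of columns together with a matching list of directions, which sorts by each column in turn; a minimal sketch:

# Sort by 'Category' (A-Z), then by 'Price' within each category (high to low)
multi_sorted_df = df.sort_values(by=['Category', 'Price'], ascending=[True, False])
print(multi_sorted_df)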

Creating New Columns

Creating new columns in a Pandas DataFrame based on existing data is an essential task in data analysis, allowing you to derive new insights from your data. Here’s how you can calculate and add a new column for total revenue for each product:

You can create a new column in a DataFrame by simply assigning a calculation or value to it using square bracket notation.

# Creating new columns
df['Revenue'] = df['Price'] * df['Quantity']
print(df)

In this example:

  • Creating Revenue Column: The new column 'Revenue' is created by multiplying the 'Price' and 'Quantity' columns. This operation is performed element-wise, meaning each row's Price and Quantity are multiplied to compute the Revenue for that particular row.
  • The expression df['Revenue'] = df['Price'] * df['Quantity'] directly adds this new column to the existing DataFrame.

Output

After executing the code, the DataFrame df will now include an additional column named 'Revenue'. This column contains the calculated revenue for each product (price multiplied by quantity). The updated DataFrame is then printed, showing all existing data along with the new Revenue column.

This approach is particularly useful for enhancing datasets with additional metrics needed for further analysis or reporting.
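
If you’d rather not modify df in place, the assign() method returns a new DataFrame with the extra column, which is convenient in method chains; a minimal sketch:

# assign() leaves df untouched and returns a copy with the new column
df_with_revenue = df.assign(Revenue=df['Price'] * df['Quantity'])
print(df_with_revenue)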

Merging DataFrames

Merging DataFrames in Pandas is a powerful technique for combining data from different sources based on common columns or indices. Here’s a concise explanation and example on how to merge two DataFrames based on the ‘Category’ column:

Creating a Second DataFrame

Before merging, you often need to prepare or create the secondary DataFrame that you want to merge with the original one.

# Merging DataFrames
other_data = {
'Category': ['Electronics', 'Accessories'],
'Store': ['Store A', 'Store B']
}
other_df = pd.DataFrame(other_data)
merged_df = pd.merge(df, other_df, on='Category', how='left')
print(merged_df)

In this example:

  • Creating other_df: A new DataFrame other_df is created from a dictionary. It includes a 'Category' column and a 'Store' column, which will be used to add store information to the main DataFrame df.
  • Merging: The pd.merge() function is used to combine df and other_df based on the 'Category' column.
  • on='Category' specifies the column to merge on, which must be present in both DataFrames.
  • how='left' specifies the type of join. A left join means that all rows from df will be included in the merged DataFrame, and where categories match in other_df, the 'Store' information will be added.

Output

The output merged_df will be a new DataFrame where:

  • Each row from the original df is retained.
  • Where the ‘Category’ in df matches the 'Category' in other_df, the 'Store' information from other_df is added.
  • If there’s no match, the ‘Store’ column in the resulting DataFrame will show NaN (not a number) for that row.

This merged DataFrame allows you to analyze the original data along with the additional ‘Store’ information, enhancing the dataset’s utility for further analysis or reporting.
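
Changing the how parameter changes which rows survive the merge; a minimal sketch contrasting two common join types on the same DataFrames:

# Inner join: rows whose 'Category' has no match in other_df are dropped
inner_df = pd.merge(df, other_df, on='Category', how='inner')

# Outer join: keeps all categories from both sides, filling gaps with NaN
outer_df = pd.merge(df, other_df, on='Category', how='outer')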

Interested in learning how to manage environment variables in Python? Click here to read my Medium article on using the dotenv module for secure configurations.

Pivot Table

Creating pivot tables in Pandas is an excellent way to summarize and analyze data efficiently. Pivot tables reorganize data, allowing you to focus on specific segments and metrics. Here’s a simple guide on how to create a pivot table to calculate the mean price by category:

Using pivot_table()

The pivot_table() function is used to create a spreadsheet-style pivot table as a DataFrame. It can take various parameters to customize the summarization of the data.

# Creating a pivot table
pivot_table = df.pivot_table(index='Category', values='Price', aggfunc='mean')
print(pivot_table)

In this example:

  • Creating the Pivot Table:
  • index='Category' sets the 'Category' column as the index of the pivot table. This means the data will be grouped by the unique values in the 'Category' column.
  • values='Price' specifies that the values in the 'Price' column will be used to calculate the mean.
  • aggfunc='mean' defines the aggregation function to apply, which in this case is the mean. This function will calculate the average price for each category.

Output

The resulting pivot_table is a DataFrame where each index is a unique category from the original DataFrame. The corresponding values are the mean prices for those categories. This setup is particularly useful for quickly identifying average pricing trends across different product categories, aiding in strategic decisions or further statistical analysis.

This method offers a clear and concise way to view important metrics organized by any categorical variable, enhancing your ability to make data-driven decisions.
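
pivot_table() can also summarize several value columns at once, each with its own aggregation function; a minimal sketch that averages prices and sums quantities per category:

# One aggregation function per value column
multi_pivot = df.pivot_table(
    index='Category',
    values=['Price', 'Quantity'],
    aggfunc={'Price': 'mean', 'Quantity': 'sum'}
)
print(multi_pivot)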

Conclusion

In this comprehensive guide, we’ve explored the versatile capabilities of Pandas, a fundamental library in Python’s data science toolkit, which is indispensable for effective data manipulation and analysis.

Through examples, we demonstrated basic to advanced operations such as merging, filtering, and creating pivot tables, showing how Pandas efficiently manages large datasets to provide insightful analyses.

Mastering these operations will enhance your data analysis skills, enabling you to address diverse data-driven challenges with speed and accuracy. This knowledge will serve as a valuable asset in your programming toolkit, helping you make informed decisions and explore new avenues in Python and beyond.

Stay tuned, as I will continue to expand further on this topic.

Final Words:

Thank you for taking the time to read my article.

This article was first published on Medium by CyCoderX.

Hey There! I’m CyCoderX, a data engineer who loves crafting end-to-end solutions. I write articles about Python, SQL, AI, Data Engineering, lifestyle and more!

Join me as we explore the exciting world of tech, data and beyond!

For similar articles and updates, feel free to explore my Medium profile.

If you enjoyed this article, consider following for future updates.

Interested in Python content and tips? Click here to check out my list on Medium.

Interested in more SQL, Databases and Data Engineering content? Click here to find out more!

Happy Coding!

What did you think about this article? If you found it helpful or have any feedback, I’d love to hear from you in the comments below!

Python’s Gurus🚀

Thank you for being a part of the Python’s Gurus community!

Before you go:

  • Be sure to clap x50 times and follow the writer 👏
  • Follow us: Newsletter
  • Do you aspire to become a Guru too? Submit your best article or draft to reach our audience.

