Mastering Data Manipulation: A Guide to Merge, Append, and Pivot in Pandas

Punyakeerthi BL
6 min read · Jul 4, 2024

Before proceeding with this article, please read the following for continuation:

Slicing and Dicing Pandas DataFrames

Merge

Merging in Python, specifically within the pandas library, is a fundamental technique for data manipulation. It allows you to seamlessly combine data from two or more DataFrames based on a shared column or set of columns. This process is akin to joining tables in relational databases, where you connect related data sets using a common field.

Effectively merging DataFrames unlocks a multitude of possibilities for data analysis. You can enrich existing datasets with additional information, identify trends across various sources, and create more comprehensive views of your data.

Understanding the Merge

Imagine you have two DataFrames: one containing customer information (customer ID, name, contact details) and another with purchase history (order ID, customer ID, product details). By merging these DataFrames on the customer ID column, you can create a single, unified dataset that includes both customer details and their purchase history.

Merging offers a high degree of control over how data is combined. You can specify:

  • Merge keys: The columns used to match rows between the DataFrames.
  • Join type: How to handle rows that don’t have a matching key in the other DataFrame. There are four common join types:
      • Inner: Only rows with matching keys in both DataFrames are included.
      • Left: All rows from the left DataFrame are included, even if there’s no match in the right DataFrame (the right-hand columns are filled with NaN for unmatched rows).
      • Right: All rows from the right DataFrame are included, even if there’s no match in the left DataFrame (the left-hand columns are filled with NaN for unmatched rows).
      • Outer: All rows from both DataFrames are included, regardless of whether there’s a match; columns from the side with no match are filled with NaN.
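Merge keys do not have to share a name in both DataFrames. The following is a minimal sketch, assuming two hypothetical tables (customers and orders, with made-up values) whose key columns are called CustomerID and CustID; it uses the left_on and right_on parameters of pd.merge() together with a left join.

```python
import pandas as pd

# Hypothetical tables: the key column is named differently in each one.
customers = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Name": ["Alice", "Bob", "Charlie"],
})
orders = pd.DataFrame({
    "OrderID": [101, 102, 103],
    "CustID": [1, 1, 4],
    "Amount": [50, 75, 20],
})

# left_on / right_on name the key column on each side.
# how="left" keeps every customer, whether or not they have orders.
merged = pd.merge(
    customers,
    orders,
    how="left",
    left_on="CustomerID",
    right_on="CustID",
)
print(merged)
```

Alice appears twice because she has two orders, while Bob and Charlie each appear once with NaN in the order columns. The order for CustID 4 is dropped because a left join ignores unmatched rows on the right.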

Unleashing the Power of Merging

Merging empowers you to perform various data analysis tasks, such as:

  • Enriching Data: Add additional attributes to existing data for a more comprehensive picture.
  • Identifying Relationships: Uncover connections between data points across different datasets.
  • Creating New Features: Combine data to generate new features that enhance analysis.
  • Data Cleaning: Merge with reference data to identify and rectify inconsistencies.
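As one illustration of the data-cleaning use, the sketch below merges a table against reference data and flags rows that have no match. The tables and column names (orders, countries, CountryCode) are hypothetical; indicator=True is a real pd.merge() option that adds a _merge column recording where each row came from.

```python
import pandas as pd

# Hypothetical order data containing one invalid country code ("XX").
orders = pd.DataFrame({
    "OrderID": [1, 2, 3],
    "CountryCode": ["US", "DE", "XX"],
})
# Hypothetical reference table of valid codes.
countries = pd.DataFrame({
    "CountryCode": ["US", "DE", "FR"],
    "Country": ["United States", "Germany", "France"],
})

# A left join keeps every order; indicator=True adds a "_merge" column
# whose value is "both" for matched rows and "left_only" for unmatched ones.
checked = orders.merge(countries, on="CountryCode", how="left", indicator=True)

# Rows whose code was not found in the reference table.
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["OrderID", "CountryCode"]])
```

The same merge also enriches the valid rows with the Country name, so one call covers both the enrichment and the consistency check.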

By mastering merging techniques, you can transform raw data into valuable insights, fostering more informed decision-making.

In conclusion, merging DataFrames in Python is a cornerstone of data manipulation. It empowers you to effortlessly combine data sets, unlocking a world of possibilities for data exploration and analysis. As you delve deeper into data science, proficiency in merging will become an invaluable asset in your data wrangling toolkit.

The example below merges a small customer table with an orders table on CustomerID, once for each join type.

Code Example:

Python

import pandas as pd
data1 = {'CustomerID': [1, 2, 3], 'Name': ['Alice', 'Bob','Charlie'],
'City': ['New York', 'Los Angeles', 'Chicago']}
df1 = pd.DataFrame(data1)
data2 = {'CustomerID': [1, 2, 4], 'OrderAmount': [100, 200, 300]}
df2 = pd.DataFrame(data2)
# Inner Join (default)
merged_inner = pd.merge(df1, df2, on='CustomerID')
print(merged_inner)
# Left Join
merged_left = pd.merge(df1, df2, how='left', on='CustomerID')
print(merged_left)
# Right Join
merged_right = pd.merge(df1, df2, how='right', on='CustomerID')
print(merged_right)
# Outer Join
merged_outer = pd.merge(df1, df2, how='outer', on='CustomerID')
print(merged_outer)

Output:

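   CustomerID   Name         City  OrderAmount
0           1  Alice     New York          100
1           2    Bob  Los Angeles          200
   CustomerID     Name         City  OrderAmount
0           1    Alice     New York        100.0
1           2      Bob  Los Angeles        200.0
2           3  Charlie      Chicago          NaN
   CustomerID   Name         City  OrderAmount
0           1  Alice     New York          100
1           2    Bob  Los Angeles          200
2           4    NaN          NaN          300
   CustomerID     Name         City  OrderAmount
0           1    Alice     New York        100.0
1           2      Bob  Los Angeles        200.0
2           3  Charlie      Chicago          NaN
3           4      NaN          NaN        300.0

The four tables appear in the order of the print calls: inner, left, right, outer. Where a join introduces NaN into OrderAmount, pandas converts that column from integers to floats, which is why the left and outer results show 100.0 rather than 100.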

Appending DataFrames in Pandas

Vertically stacking DataFrames, also known as appending, is a fundamental operation in Pandas. It allows you to combine multiple DataFrames, adding their rows one below the other, to create a larger dataset. This is particularly useful when the DataFrames share the same columns but represent independent data points that you want to analyze together.

Here’s a breakdown of why and how appending DataFrames is done in Pandas:

When to Use Appending

Use appending when you have multiple DataFrames with:

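  • Matching Columns: Appending works best when all DataFrames share the same column names and compatible data types. pd.concat() aligns columns by name, not by key values. A column that is missing from one DataFrame does not raise an error; it is filled with NaN for the rows of that DataFrame, which is rarely what you want.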
  • Independent Data: The DataFrames represent separate datasets that you want to combine for analysis, not relational data requiring merging based on conditions.
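It can help to see what pd.concat() actually does when the column sets do not line up. This is a small sketch with hypothetical frames jan and feb; join="inner" is a real pd.concat() option that keeps only the columns common to all inputs.

```python
import pandas as pd

# Hypothetical frames that share column "A" but differ in the second column.
jan = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
feb = pd.DataFrame({"A": [5, 6], "C": [7, 8]})

# Default behaviour (join="outer"): the result has the union of all columns,
# and cells with no source value are filled with NaN.
stacked = pd.concat([jan, feb], ignore_index=True)
print(stacked)

# join="inner" keeps only the columns present in every frame.
common = pd.concat([jan, feb], join="inner", ignore_index=True)
print(common)
```

If unexpected NaN values show up after appending, a mismatch in column names (including differences in case or stray whitespace) is a likely cause.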

How to Append DataFrames

There are two primary ways to append DataFrames in Pandas:

  1. Using pd.concat(): This is the preferred and more flexible method. The pd.concat() function takes a list of DataFrames as input and concatenates them along a specified axis. By default (axis=0), it concatenates vertically:

Python

import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [7, 8, 9], 'B': [10, 11, 12]})
appended_df = pd.concat([df1, df2], axis=0)
print(appended_df)
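  2. Using DataFrame.append() (removed): The DataFrame.append() method was deprecated in pandas 1.4 and removed in pandas 2.0, so calling it on a current installation raises an AttributeError. You may still meet it in older code; replace it with pd.concat():

Python

# Legacy form, works only on pandas versions before 2.0:
# appended_df = df1.append(df2, ignore_index=True)

# Equivalent call that works on any pandas version
appended_df = pd.concat([df1, df2], ignore_index=True)  # Reset index for clarity
print(appended_df)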

Additional Considerations

  • Index Handling: By default, pd.concat() keeps the original indices of the DataFrames being appended. You can use the ignore_index=True argument to reset the index for the resulting DataFrame.
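  • Data Type Consistency: Appending works most cleanly when each column has the same data type in every DataFrame. Where the types differ, pandas does not raise an error. Instead it upcasts the column to a common type (for example, int64 combined with float64 becomes float64, and numbers combined with strings become object), which can hide problems until later in the analysis.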
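Both considerations can be checked in a few lines. The sketch below uses two hypothetical single-column frames, one holding integers and one holding floats, and compares the index with and without ignore_index=True.

```python
import pandas as pd

# Hypothetical frames: same column name, different dtypes (int64 vs float64).
first = pd.DataFrame({"A": [1, 2]})
second = pd.DataFrame({"A": [3.5, 4.5]})

# Default: each frame keeps its own index, so labels repeat.
kept = pd.concat([first, second])
print(kept.index.tolist())   # [0, 1, 0, 1]

# ignore_index=True builds a fresh 0..n-1 index.
reset = pd.concat([first, second], ignore_index=True)
print(reset.index.tolist())  # [0, 1, 2, 3]

# The integer and float columns are combined into a float64 column.
print(reset["A"].dtype)      # float64
```

Repeated index labels are easy to overlook and can make later label-based lookups such as kept.loc[0] return several rows, so resetting the index is usually the safer choice unless the original labels carry meaning.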

Appending DataFrames is a powerful tool for building larger datasets from independent sources. By understanding when and how to use it effectively, you can streamline your data manipulation tasks in Pandas.

Pivot Tables: Unlocking Trends and Patterns

Ever felt overwhelmed by a massive spreadsheet? Drowning in numbers without a clear picture of what they mean? Pivot tables are your lifeline! These powerful tools transform raw data into insightful summaries, making it a breeze to identify trends and patterns.

At its core, a pivot table is like a magic box for data analysis. Imagine a large table with rows and columns. A pivot table lets you choose specific columns to group your data by, like product category or customer region. Then, you can pick another column, perhaps sales figures, and calculate totals, averages, or other values for each group.

Here’s why pivot tables are champions:

  • Effortless Summarization: Say goodbye to complex formulas! Pivot tables automatically calculate sums, averages, counts, and more for your chosen groups. No need to write endless equations — just drag and drop fields to get the insights you need.
  • Unveiling Trends: Spot hidden patterns within your data. Pivot tables can reveal which products sell best in certain regions, how customer demographics affect buying habits, or how sales fluctuate over time.
  • Flexibility at Your Fingertips: Change your perspective with ease! Pivot tables allow you to rearrange your data on the fly. Simply drag and drop fields to different areas of the table to see your information from a fresh angle.

Let’s say you’re running an online store. A pivot table can help you analyze your sales data by:

  • Product Category: See which categories generate the most revenue or have the highest average order value.
  • Customer Location: Identify top regions or countries for sales.
  • Time Period: Track sales trends by month, quarter, or year.

With these insights, you can make informed decisions about product development, marketing campaigns, and resource allocation.
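The online-store breakdowns above can be sketched in pandas with DataFrame.pivot_table(). The sales table here is invented for illustration (the column names Date, Category, Region, and Revenue are assumptions, not data from a real store), and the Month column is derived with strftime so that the index labels are plain strings.

```python
import pandas as pd

# Hypothetical sales records for a small online store.
sales = pd.DataFrame({
    "Date": pd.to_datetime([
        "2024-01-05", "2024-01-20", "2024-02-03", "2024-02-14", "2024-02-28",
    ]),
    "Category": ["Books", "Toys", "Books", "Books", "Toys"],
    "Region": ["East", "West", "East", "West", "East"],
    "Revenue": [120, 80, 200, 50, 100],
})
# Derive a "YYYY-MM" label to group by month.
sales["Month"] = sales["Date"].dt.strftime("%Y-%m")

# Total and average revenue per product category.
by_category = sales.pivot_table(
    values="Revenue", index="Category", aggfunc=["sum", "mean"]
)
print(by_category)

# Revenue per month (rows) and region (columns); fill_value=0 replaces
# the NaN that would otherwise mark combinations with no sales.
by_month_region = sales.pivot_table(
    values="Revenue", index="Month", columns="Region",
    aggfunc="sum", fill_value=0,
)
print(by_month_region)
```

Swapping the index and columns arguments is the pandas counterpart of dragging a field to a different area of a spreadsheet pivot table.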

Pivot tables are not just for spreadsheets anymore. Many data analysis and business intelligence tools offer similar functionality, and in pandas the equivalent is the DataFrame.pivot_table() method. So, the next time you’re faced with a mountain of data, remember the power of pivot tables. They’ll help you transform information overload into actionable insights!

Code Example:

Python

import pandas as pd

# Sample sales data
data = {'Product': ['A', 'B', 'A', 'C', 'B'],
        'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago'],
        'Sales': [100, 200, 300, 400, 500]}
df = pd.DataFrame(data)

# Pivot table: total sales for each product (rows) in each city (columns).
# Passing the aggregation as the string 'sum' avoids the warning that
# recent pandas versions emit for aggfunc=np.sum.
pivot_table = df.pivot_table(values='Sales', index='Product', columns='City', aggfunc='sum')
print(pivot_table)

Output:

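City     Chicago  Los Angeles  New York
Product                                
A          300.0          NaN     100.0
B          500.0        200.0       NaN
C            NaN          NaN     400.0

NaN marks product and city combinations that have no sales in the sample data. Pass fill_value=0 to pivot_table() if you would rather see zeros.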
