Data Manipulation with Pandas

8 min readApr 4, 2024

Data Manipulation is the process of adjusting data to make it easier to read and understand. Pandas, short for “Python Data Analysis Library,” a Python library, simplifies this process by providing tools to import, clean, restructure, analyze, and export data. It is a popular open-source tool used for data manipulation, analysis, and visualization. In this blog post, we’ll explore how to use Pandas for data manipulation.

Install Pandas using pip install pandas

Import Pandas using import pandas as pd

1. Reading and Writing Data

Importing external datasets is crucial for data manipulation work. Pandas provides various methods to import data, including from CSV files and Excel sheets.

# Importing data from a CSV file
df = pd.read_csv('file.csv')

# Importing data from an Excel file
df = pd.read_excel('file.xlsx')

Pandas also allows you to export manipulated data back to different file formats.

# Exporting data to a CSV file
df.to_csv('new_file.csv')

# Exporting data to an Excel file
df.to_excel('new_file.xlsx')

2. Pandas Data Structures

Pandas offers two primary data structures that are essential for data manipulation tasks: Series and DataFrames.

Series: A Series is a one-dimensional labeled array. Here’s an example of creating a Series with some student names and their scores:

import pandas as pd

# Create a Series with student names as labels and scores as data
student_scores = pd.Series([85, 92, 78, 95], index=["Alex", "Ben", "Clara", "Deric"])

# Print the Series
print(student_scores)

#Output:

Alex     85 
Ben      92 
Clara    78 
Deric    95 
dtype: int64

DataFrame: A DataFrame is a two-dimensional labeled data structure resembling a spreadsheet. Here’s an example of creating a DataFrame with student information:

# Create a dictionary with student data
student_data = {
    "Name": ["Alex", "Ben", "Clara", "Deric", "Eva", "Frank"],
    "Age": [20, 21, 19, 22, 23, 20],
    "Major": ["CS", "Math", "Physics", "Economics", "Biology", "Chemistry"],
    "Score": [85, 92, 78, 95, 88, 90],
    "Hometown": ["York", "Angeles", "Chicago", "Houston", "Boston", "Francisco"],
    "Graduation Year": [2022, 2023, 2022, 2023, 2024, 2023],
    "Scholarship": [False, True, False, True, False, False]
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(student_data)

# Print the DataFrame 
print(df)

#Output:

    Name  Age      Major  Score   Hometown  Graduation Year  Scholarship
0   Alex   20         CS     85       York             2022        False
1    Ben   21       Math     92    Angeles             2023         True
2  Clara   19    Physics     78    Chicago             2022        False
3  Deric   22  Economics     95    Houston             2023         True
4    Eva   23    Biology     88     Boston             2024        False
5  Frank   20  Chemistry     90  Francisco             2023        False

3. Data Exploration

Before performing data manipulation, it’s essential to explore and understand your dataset. Pandas provides several handy functions for this purpose. Let’s explore them using the above student_data dataframe example.

head(): This function returns the first few rows in the DataFrame. By default, it returns the first 5 rows. Similarly, tail() returns the last few rows in the DataFrame.

# Display the first few rows of the DataFrame
print(df.head())

#Output:

    Name  Age      Major  Score Hometown  Graduation Year  Scholarship
0   Alex   20         CS     85     York             2022        False
1    Ben   21       Math     92  Angeles             2023         True
2  Clara   19    Physics     78  Chicago             2022        False
3  Deric   22  Economics     95  Houston             2023         True
4    Eva   23    Biology     88   Boston             2024        False

describe(): This function provides a statistical summary of the numerical columns in the DataFrame.

# Display a summary of the DataFrame
print(df.describe())

#Output:

             Age      Score  Graduation Year
count   6.000000   6.000000         6.000000
mean   20.833333  88.000000      2022.833333
std     1.471960   5.966574         0.752773
min    19.000000  78.000000      2022.000000
25%    20.000000  85.750000      2022.250000
50%    20.500000  89.000000      2023.000000
75%    21.750000  91.500000      2023.000000
max    23.000000  95.000000      2024.000000

info(): This function displays a concise summary of a DataFrame, including data types, non-null values, and memory usage.

# Display information about the DataFrame
print(df.info())

# Output:

<class 'pandas.core.frame.DataFrame'>       
RangeIndex: 6 entries, 0 to 5
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Name             6 non-null      object
 1   Age              6 non-null      int64 
 2   Major            6 non-null      object
 3   Score            6 non-null      int64 
 4   Hometown         6 non-null      object
 5   Graduation Year  6 non-null      int64 
 6   Scholarship      6 non-null      bool  
dtypes: bool(1), int64(3), object(3)        
memory usage: 422.0+ bytes

df.shape : Used to get a tuple containing the DataFrame’s dimensions.

# Display DataFrame Dimensions
print(df.shape)

# Output:
(6, 7)

4. Handling Missing Data

Managing missing values helps avoid errors and incorrect conclusions in data analysis. Without complete data, our conclusions may be flawed. By filling in missing information, we make sure our findings are accurate and trustworthy.

Identifying missing values:

# Check for missing values in the DataFrame
print(df.isnull().sum())

Dropping or filling missing values:

# Drop rows with missing values
df.dropna()


# Fill missing values with a specific value
df.fillna(0)


# Fill missing values using a function
df.fillna(df.mean())

5. Data Selection and Filtering

Pandas provides various methods to select and filter data from a DataFrame based on labels, integer positions, or boolean conditions.

Selecting Columns:

# Select a single column
df['Major']


# Select multiple columns
df[['Name', 'Age', 'Major']]

Filtering Rows:

# Filter rows based on a condition
df[df['Score'] > 90]


# Filter rows with multiple conditions
df[(df['Graduation Year'] == 2023) & (df['Scholarship'] == True)]

Selecting Rows and Columns using loc and iloc :

.loc: This is used for label-based indexing. You can select data by row and column labels.

# Select rows and columns using label-based indexing
selected_data = df.loc[df['Major'] == 'CS', ['Name', 'Score']]
print(selected_data)

# Output: 

   Name  Score
0  Alex     85

.iloc: This is used for position-based indexing. You can select data by row and column numbers.

# Select rows and columns using integer-based indexing
selected_data = df.iloc[[0, 2, 4], [0, 2, 3]]
print(selected_data)

# Output:

    Name    Major  Score
0   Alex       CS     85
2  Clara  Physics     78
4    Eva  Biology     88

6. Data Transformation

Data transformation techniques in Pandas, such as applying functions, mathematical operations, and string operations, are essential for shaping data to meet specific analysis or modeling requirements. These methods enable you to create new columns, modify existing ones, and perform complex calculations.

Applying Functions:

# Apply a function to each element of a column
df['Score'].apply(lambda x: 'A' if x >= 90 else 'B')

Mathematical Operations:

# Perform mathematical operations on columns
df['Score'] + 2

String Operations:

# Perform string operations on columns
df['Hometown'].str.upper()

Renaming Columns:

# Renaming the "Hometown" column to "City"
df.rename(columns={"Hometown": "City"}, inplace=True)

Sorting:

# Sort a DataFrame by column
df.sort_values(by=['Score'], ascending=[False])

7. Data Aggregation and Grouping

Data Aggregation is the process of combining data in a manner that results in new information. The most common aggregations are sum, mean, maximum, minimum, etc.

Grouping is the process of categorizing data based on given conditions so that a group of data can be analyzed.

# Group the data by Major and calculate the mean of the scores
grouped_data = df.groupby('Scholarship')['Score'].mean()

print(grouped_data)

# Output:

Scholarship
False    85.25
True     93.50
Name: Score, dtype: float64

After grouping data, you can apply aggregate functions to each group. This enables you to summarize or calculate statistics for each group, providing comparisons between groups.

8. Merging and Joining Data

Merges:

In Pandas, merge is the process of combining two or more DataFrames based on common columns. The different types of merges are:

Inner Merge: Only the common values in both the left and right dataframes are kept.
Left Merge: Combines all rows from the left DataFrame with matching rows from the right DataFrame, filling missing values with NaN.
Right Merge: Combines all rows from the right DataFrame with matching rows from the left DataFrame, filling missing values with NaN.
Outer Merge: All the values from both the left and right dataframes are kept.

To demonstrate merging, let’s create a new DataFrame about student courses containing student names (common column) and their enrolled courses:

# Create a new dataframe
df_new = pd.DataFrame({'Name': ['Alex', 'Ben', 'Clara', 'Deric'], 
'Courses': ['Algorithms', 'Calculus', 'Quantum Physics', 'Macroeconomics']})

# Left merge
left_merge = df.merge(df_new, on='Name', how="left")

print(left_merge)

# Output: 

    Name  Age      Major  Score   Hometown  Graduation Year  Scholarship          Courses
0   Alex   20         CS     85       York             2022        False       Algorithms
1    Ben   21       Math     92    Angeles             2023         True         Calculus
2  Clara   19    Physics     78    Chicago             2022        False  Quantum Physics
3  Deric   22  Economics     95    Houston             2023         True   Macroeconomics
4    Eva   23    Biology     88     Boston             2024        False              NaN
5  Frank   20  Chemistry     90  Francisco             2023        False              NaN

In Pandas merge() function, on specifies the column(s) used to match rows between DataFrames, while how defines the type of merge operation.

Joins:

This method is used to join rows from two or more DataFrames based on their index values. The join() method works similarly to merge() but only merges on the index values instead of specific columns. Thus, the join() method doesn’t need a common column. It’s handy for combining data that has a direct index-based relationship.

The method has four types of joins: Inner Join, Left Join, Right Join, and Outer Join.

# Create a new dataframe
df_new = pd.DataFrame({'Courses': ['Algorithms', 'Calculus', 'Quantum Physics', 'Macroeconomics']})

# Inner join
inner_join = df.join(df_new, how="inner")

print(inner_join)

# Output:

    Name  Age      Major  Score Hometown  Graduation Year  Scholarship          Courses
0   Alex   20         CS     85     York             2022        False       Algorithms
1    Ben   21       Math     92  Angeles             2023         True         Calculus
2  Clara   19    Physics     78  Chicago             2022        False  Quantum Physics
3  Deric   22  Economics     95  Houston             2023         True   Macroeconomics

Concatenation:

concat() is used to append or concatenate rows or columns from one DataFrame to another. It doesn’t consider common values or matching indexes. Use it when you want to simply stack DataFrames:

Vertically (axis=0): Appends rows from one DataFrame to another, stacking them vertically.
Horizontally (axis=1): Appends columns from one DataFrame to another, stacking them horizontally.

Assume we have another DataFrame (df_new) of student information with similar columns:

# Concatenate the two DataFrames vertically
pd.concat([df, df_new], axis=0, ignore_index=True)

9. Working with Categorical data

Working with categorical data in Pandas involves handling variables that represent discrete, non-numerical values, such as gender, nationality, or status. Pandas offers a specific data type, Categorical, which provides more efficient storage and additional methods for working with categorical variables.

# Convert a column to Categorical
df['Major'] = df['Major'].astype('category')

print(df['Major'])

# Output:

0           CS
1         Math
2      Physics
3    Economics
4      Biology
5    Chemistry
Name: Major, dtype: category
Categories (6, object): ['Biology', 'CS', 'Chemistry', 'Economics', 'Math', 'Physics']

10. Data Visualization

With pandas, you can quickly create visual graphs like line plots, bar charts, and scatter plots from your data tables. These graphs help you understand patterns and relationships in your data easily, making it simpler to communicate your findings visually.

# Line Plot
df['Score'].plot()

# Histogram
df['Age'].hist()

# Scatter Plot
df.plot.scatter(x='Age', y='Score')

# Box Plot
df.boxplot(column='Score', by='Major')

To display the plots created with Pandas, you need to use plt.show() from the Matplotlib library.

Conclusion

Pandas provides many useful tools to clean, change, and work with data. It helps us fix missing data, combine datasets, transform & manipulate data, and calculate new values. While this blog provided an introductory overview, pandas’ capabilities extend to more complex transformations. Enjoy Pandas’ power!