Data Manipulation with Pandas

Sanya Gubrani
8 min readApr 4, 2024

--

Data Manipulation is the process of adjusting data to make it easier to read and understand. Pandas, short for “Python Data Analysis Library,” a Python library, simplifies this process by providing tools to import, clean, restructure, analyze, and export data. It is a popular open-source tool used for data manipulation, analysis, and visualization. In this blog post, we’ll explore how to use Pandas for data manipulation.

Install Pandas using pip install pandas

Import Pandas using import pandas as pd

1. Reading and Writing Data

Importing external datasets is crucial for data manipulation work. Pandas provides various methods to import data, including from CSV files and Excel sheets.

# Importing data from a CSV file
df = pd.read_csv('file.csv')

# Importing data from an Excel file
df = pd.read_excel('file.xlsx')

Pandas also allows you to export manipulated data back to different file formats.

# Exporting data to a CSV file
df.to_csv('new_file.csv')

# Exporting data to an Excel file
df.to_excel('new_file.xlsx')

2. Pandas Data Structures

Pandas offers two primary data structures that are essential for data manipulation tasks: Series and DataFrames.

  • Series: A Series is a one-dimensional labeled array. Here’s an example of creating a Series with some student names and their scores:
import pandas as pd

# Create a Series with student names as labels and scores as data
student_scores = pd.Series([85, 92, 78, 95], index=["Alex", "Ben", "Clara", "Deric"])

# Print the Series
print(student_scores)
#Output:

Alex 85
Ben 92
Clara 78
Deric 95
dtype: int64
  • DataFrame: A DataFrame is a two-dimensional labeled data structure resembling a spreadsheet. Here’s an example of creating a DataFrame with student information:
# Create a dictionary with student data
student_data = {
"Name": ["Alex", "Ben", "Clara", "Deric", "Eva", "Frank"],
"Age": [20, 21, 19, 22, 23, 20],
"Major": ["CS", "Math", "Physics", "Economics", "Biology", "Chemistry"],
"Score": [85, 92, 78, 95, 88, 90],
"Hometown": ["York", "Angeles", "Chicago", "Houston", "Boston", "Francisco"],
"Graduation Year": [2022, 2023, 2022, 2023, 2024, 2023],
"Scholarship": [False, True, False, True, False, False]
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(student_data)

# Print the DataFrame
print(df)
#Output:

Name Age Major Score Hometown Graduation Year Scholarship
0 Alex 20 CS 85 York 2022 False
1 Ben 21 Math 92 Angeles 2023 True
2 Clara 19 Physics 78 Chicago 2022 False
3 Deric 22 Economics 95 Houston 2023 True
4 Eva 23 Biology 88 Boston 2024 False
5 Frank 20 Chemistry 90 Francisco 2023 False

3. Data Exploration

Before performing data manipulation, it’s essential to explore and understand your dataset. Pandas provides several handy functions for this purpose. Let’s explore them using the above student_data dataframe example.

  • head(): This function returns the first few rows in the DataFrame. By default, it returns the first 5 rows. Similarly, tail() returns the last few rows in the DataFrame.
# Display the first few rows of the DataFrame
print(df.head())
#Output:

Name Age Major Score Hometown Graduation Year Scholarship
0 Alex 20 CS 85 York 2022 False
1 Ben 21 Math 92 Angeles 2023 True
2 Clara 19 Physics 78 Chicago 2022 False
3 Deric 22 Economics 95 Houston 2023 True
4 Eva 23 Biology 88 Boston 2024 False
  • describe(): This function provides a statistical summary of the numerical columns in the DataFrame.
# Display a summary of the DataFrame
print(df.describe())
#Output:

Age Score Graduation Year
count 6.000000 6.000000 6.000000
mean 20.833333 88.000000 2022.833333
std 1.471960 5.966574 0.752773
min 19.000000 78.000000 2022.000000
25% 20.000000 85.750000 2022.250000
50% 20.500000 89.000000 2023.000000
75% 21.750000 91.500000 2023.000000
max 23.000000 95.000000 2024.000000
  • info(): This function displays a concise summary of a DataFrame, including data types, non-null values, and memory usage.
# Display information about the DataFrame
print(df.info())
# Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 6 non-null object
1 Age 6 non-null int64
2 Major 6 non-null object
3 Score 6 non-null int64
4 Hometown 6 non-null object
5 Graduation Year 6 non-null int64
6 Scholarship 6 non-null bool
dtypes: bool(1), int64(3), object(3)
memory usage: 422.0+ bytes
  • df.shape : Used to get a tuple containing the DataFrame’s dimensions.
# Display DataFrame Dimensions
print(df.shape)

# Output:
(6, 7)

4. Handling Missing Data

Managing missing values helps avoid errors and incorrect conclusions in data analysis. Without complete data, our conclusions may be flawed. By filling in missing information, we make sure our findings are accurate and trustworthy.

Identifying missing values:

# Check for missing values in the DataFrame
print(df.isnull().sum())

Dropping or filling missing values:

# Drop rows with missing values
df.dropna()


# Fill missing values with a specific value
df.fillna(0)


# Fill missing values using a function
df.fillna(df.mean())

5. Data Selection and Filtering

Pandas provides various methods to select and filter data from a DataFrame based on labels, integer positions, or boolean conditions.

Selecting Columns:

# Select a single column
df['Major']


# Select multiple columns
df[['Name', 'Age', 'Major']]

Filtering Rows:

# Filter rows based on a condition
df[df['Score'] > 90]


# Filter rows with multiple conditions
df[(df['Graduation Year'] == 2023) & (df['Scholarship'] == True)]

Selecting Rows and Columns using loc and iloc :

.loc: This is used for label-based indexing. You can select data by row and column labels.

# Select rows and columns using label-based indexing
selected_data = df.loc[df['Major'] == 'CS', ['Name', 'Score']]
print(selected_data)
# Output: 

Name Score
0 Alex 85

.iloc: This is used for position-based indexing. You can select data by row and column numbers.

# Select rows and columns using integer-based indexing
selected_data = df.iloc[[0, 2, 4], [0, 2, 3]]
print(selected_data)
# Output:

Name Major Score
0 Alex CS 85
2 Clara Physics 78
4 Eva Biology 88

6. Data Transformation

Data transformation techniques in Pandas, such as applying functions, mathematical operations, and string operations, are essential for shaping data to meet specific analysis or modeling requirements. These methods enable you to create new columns, modify existing ones, and perform complex calculations.

Applying Functions:

# Apply a function to each element of a column
df['Score'].apply(lambda x: 'A' if x >= 90 else 'B')

Mathematical Operations:

# Perform mathematical operations on columns
df['Score'] + 2

String Operations:

# Perform string operations on columns
df['Hometown'].str.upper()

Renaming Columns:

# Renaming the "Hometown" column to "City"
df.rename(columns={"Hometown": "City"}, inplace=True)

Sorting:

# Sort a DataFrame by column
df.sort_values(by=['Score'], ascending=[False])

7. Data Aggregation and Grouping

Data Aggregation is the process of combining data in a manner that results in new information. The most common aggregations are sum, mean, maximum, minimum, etc.

Grouping is the process of categorizing data based on given conditions so that a group of data can be analyzed.

# Group the data by Major and calculate the mean of the scores
grouped_data = df.groupby('Scholarship')['Score'].mean()

print(grouped_data)
# Output:

Scholarship
False 85.25
True 93.50
Name: Score, dtype: float64

After grouping data, you can apply aggregate functions to each group. This enables you to summarize or calculate statistics for each group, providing comparisons between groups.

8. Merging and Joining Data

Merges:

In Pandas, merge is the process of combining two or more DataFrames based on common columns. The different types of merges are:

  • Inner Merge: Only the common values in both the left and right dataframes are kept.
  • Left Merge: Combines all rows from the left DataFrame with matching rows from the right DataFrame, filling missing values with NaN.
  • Right Merge: Combines all rows from the right DataFrame with matching rows from the left DataFrame, filling missing values with NaN.
  • Outer Merge: All the values from both the left and right dataframes are kept.

To demonstrate merging, let’s create a new DataFrame about student courses containing student names (common column) and their enrolled courses:

# Create a new dataframe
df_new = pd.DataFrame({'Name': ['Alex', 'Ben', 'Clara', 'Deric'],
'Courses': ['Algorithms', 'Calculus', 'Quantum Physics', 'Macroeconomics']})

# Left merge
left_merge = df.merge(df_new, on='Name', how="left")

print(left_merge)
# Output: 

Name Age Major Score Hometown Graduation Year Scholarship Courses
0 Alex 20 CS 85 York 2022 False Algorithms
1 Ben 21 Math 92 Angeles 2023 True Calculus
2 Clara 19 Physics 78 Chicago 2022 False Quantum Physics
3 Deric 22 Economics 95 Houston 2023 True Macroeconomics
4 Eva 23 Biology 88 Boston 2024 False NaN
5 Frank 20 Chemistry 90 Francisco 2023 False NaN

In Pandas merge() function, on specifies the column(s) used to match rows between DataFrames, while how defines the type of merge operation.

Joins:

This method is used to join rows from two or more DataFrames based on their index values. The join() method works similarly to merge() but only merges on the index values instead of specific columns. Thus, the join() method doesn’t need a common column. It’s handy for combining data that has a direct index-based relationship.

The method has four types of joins: Inner Join, Left Join, Right Join, and Outer Join.

# Create a new dataframe
df_new = pd.DataFrame({'Courses': ['Algorithms', 'Calculus', 'Quantum Physics', 'Macroeconomics']})

# Inner join
inner_join = df.join(df_new, how="inner")

print(inner_join)
# Output:

Name Age Major Score Hometown Graduation Year Scholarship Courses
0 Alex 20 CS 85 York 2022 False Algorithms
1 Ben 21 Math 92 Angeles 2023 True Calculus
2 Clara 19 Physics 78 Chicago 2022 False Quantum Physics
3 Deric 22 Economics 95 Houston 2023 True Macroeconomics

Concatenation:

concat() is used to append or concatenate rows or columns from one DataFrame to another. It doesn’t consider common values or matching indexes. Use it when you want to simply stack DataFrames:

  • Vertically (axis=0): Appends rows from one DataFrame to another, stacking them vertically.
  • Horizontally (axis=1): Appends columns from one DataFrame to another, stacking them horizontally.

Assume we have another DataFrame (df_new) of student information with similar columns:

# Concatenate the two DataFrames vertically
pd.concat([df, df_new], axis=0, ignore_index=True)

9. Working with Categorical data

Working with categorical data in Pandas involves handling variables that represent discrete, non-numerical values, such as gender, nationality, or status. Pandas offers a specific data type, Categorical, which provides more efficient storage and additional methods for working with categorical variables.

# Convert a column to Categorical
df['Major'] = df['Major'].astype('category')

print(df['Major'])
# Output:

0 CS
1 Math
2 Physics
3 Economics
4 Biology
5 Chemistry
Name: Major, dtype: category
Categories (6, object): ['Biology', 'CS', 'Chemistry', 'Economics', 'Math', 'Physics']

10. Data Visualization

With pandas, you can quickly create visual graphs like line plots, bar charts, and scatter plots from your data tables. These graphs help you understand patterns and relationships in your data easily, making it simpler to communicate your findings visually.

# Line Plot
df['Score'].plot()

# Histogram
df['Age'].hist()

# Scatter Plot
df.plot.scatter(x='Age', y='Score')

# Box Plot
df.boxplot(column='Score', by='Major')

To display the plots created with Pandas, you need to use plt.show() from the Matplotlib library.

Conclusion

Pandas provides many useful tools to clean, change, and work with data. It helps us fix missing data, combine datasets, transform & manipulate data, and calculate new values. While this blog provided an introductory overview, pandas’ capabilities extend to more complex transformations. Enjoy Pandas’ power!

--

--