Journey Analytics Engineer #4: Data Wrangling with Python and Pandas

Ashhab Karami · Published in The Deep Hub · 7 min read · May 16, 2024

Introduction to Data Wrangling

Data wrangling, often a preliminary step towards data analysis, involves cleaning, structuring, and enriching raw data into a desired format for better decision making in less time. Python, along with its libraries like Pandas, provides extensive tools to handle these tasks effectively.

Introduction to DataFrames in Python

Imagine a table like you’d see in Excel, filled with rows and columns. That’s what we call a DataFrame in Python. It’s a way to store data neatly, where each column has its name (like ‘Age’ or ‘Name’) and a specific kind of data (like numbers or text). This setup helps anyone working with data to organize and handle it easily.
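
For instance, here is a minimal sketch of building a small DataFrame by hand; the column names and values below are made up purely for illustration:

import pandas as pd

# Each dictionary key becomes a column name; each list becomes that column's values
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol'],
    'Age': [34, 28, 45]
})
print(df)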

What is Pandas?

Pandas is a tool in Python designed to help with data analysis and manipulation. Think of it as a magic kit that helps you manage the data in your DataFrame. Whether you’re organizing large amounts of information, fixing messy data, or trying to understand complex datasets, Pandas gives you the tools to do it efficiently.

Getting Started with Pandas

How to Use Pandas:

First, you need to tell Python that you want to use Pandas. Here’s how you do it:


import pandas as pd

This line of code means you’re bringing in the Pandas toolkit and naming it pd, so you don’t have to type ‘pandas’ all the time—you can just use pd instead.

How to Load Data into a DataFrame

Example: Loading Data from a CSV File:

One of the most common tasks in data analysis is bringing in data, which often comes in a CSV file (like a simpler version of an Excel spreadsheet).

Here’s how you do it:

df = pd.read_csv("/content/mock_data.csv")
print(df)

What this does:

  • pd.read_csv("/content/mock_data.csv"): This tells Pandas to read the CSV file at the specified path and put the data into a DataFrame called df (a few optional parameters are sketched after this list).
  • print(df): This line shows you what’s inside df right there in your output, letting you check that everything looks good.
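
The read_csv call above also accepts optional parameters for less tidy files. A short sketch of a few commonly used ones; the column names here are assumptions:

df = pd.read_csv(
    "/content/mock_data.csv",
    sep=",",                   # column separator (a comma is the default)
    usecols=["Name", "Age"],   # load only these columns
    nrows=100,                 # read just the first 100 data rows
)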

Viewing the Top Rows of Your DataFrame

Using df.head():

When you first load data, it’s useful to peek at the first few rows to make sure everything looks right. This is what df.head() does. By default, it shows you the first five rows of your DataFrame.

Example:

print(df.head(10))

This command tells Python to display the first 10 rows of your DataFrame df. It’s a quick way to get a sense of what your data looks like.

Viewing the Bottom Rows of Your DataFrame

Using df.tail():

Just as you can look at the top rows with df.head(), you can also check out the last few rows with df.tail(). This is especially useful to see how the data ends and ensure everything is consistent throughout.

Example:

print(df.tail(10))

This command will show the last 10 rows of the DataFrame df. It helps you verify the end portions of your data, especially in large datasets.

Understanding Data Types in Your DataFrame

Using df.dtypes:

It’s important to know what type of data each column holds. Are the numbers integers? Is the date actually recognized as a date? Knowing this can help you decide how to process the data.

Example:

result = df.dtypes
print(result)

This line of code will list the data types of each column in your DataFrame df. For example, it will tell you if a column is storing text (object), numbers (int64 or float64), dates (datetime64), etc. This is crucial for understanding how you can work with each column of data.
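
If a column's type is wrong, you can convert it. Here is a hedged sketch, assuming a hypothetical 'order_date' column stored as text and an 'Age' column stored as floats:

# Parse a text column into real datetimes ('order_date' is an assumed column name)
df['order_date'] = pd.to_datetime(df['order_date'])

# Convert a float column to integers (requires the column to have no missing values)
df['Age'] = df['Age'].astype('int64')

print(df.dtypes)  # confirm the new types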

Handling Missing Data in Pandas

In any dataset, it’s common to encounter missing values. These can arise for various reasons such as errors during data collection, corruption, or even by design if data is not applicable. Pandas provides several methods to address missing data, ensuring your dataset is accurate and ready for analysis.

What are Missing Values?

Missing values, or NaNs, indicate data that is absent. In Python’s Pandas library, these are typically shown as NaN (Not a Number), and managing them is crucial for maintaining a robust dataset.
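
Before picking a strategy, it helps to measure how much data is actually missing. A quick sketch using standard Pandas calls:

# Count missing values in each column
print(df.isna().sum())

# Fraction of rows with at least one missing value
print(df.isna().any(axis=1).mean())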

Strategies to Handle Missing Data

There are two main strategies to deal with missing values:

  1. Removing Missing Data
  2. Filling in Missing Data

1. Removing Missing Data

If a dataset has only a few missing values, or if those values are likely to introduce bias, it might be best to simply remove them. This can be done by dropping the rows or columns that contain missing values.

Example:

# Drop any rows that have missing values
cleaned_df = df.dropna()

This method is straightforward but can result in a loss of data, which might not be ideal if the dataset isn’t large.
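
dropna also takes optional parameters for finer control. A short sketch of a few of them; the 'Age' column name is an assumption:

# Drop columns (instead of rows) that contain any missing value
cleaned_cols = df.dropna(axis=1)

# Keep only rows with at least 3 non-missing values
mostly_complete = df.dropna(thresh=3)

# Only consider missing values in one specific column
cleaned_subset = df.dropna(subset=['Age'])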

2. Filling in Missing Data

A more nuanced approach involves replacing missing values with a substitute that represents a typical value for that column. This can be a fixed number, the mean or median, the mode (most frequent value), or a value computed by another method such as interpolation.

Filling with a Fixed Value: This is useful when you can safely assume a default value for missing data.

df_filled = df.fillna(0) # Replaces all NaNs with 0

Filling with the Median or Mode: For numerical data, the median is less sensitive to outliers than the mean. For categorical data, mode replacement keeps the data’s original distribution.

median_value = df['column_name'].median()
df_filled = df['column_name'].fillna(median_value)

mode_value = df['column_name'].mode()[0]  # mode() returns a Series
df_filled = df['column_name'].fillna(mode_value)

Interpolation: This method is useful for numerical and time series data, where the value can be estimated by interpolating between points.

df_interpolated = df.interpolate()
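
For time series specifically, Pandas can space the estimates by the actual timestamps rather than by row position. A minimal sketch with made-up daily readings and a two-day gap:

import pandas as pd

# A small series with a datetime index and two missing readings
dates = pd.date_range("2024-01-01", periods=5, freq="D")
readings = pd.Series([1.0, None, None, 4.0, 5.0], index=dates)

# method='time' weights the interpolation by the timestamps themselves
print(readings.interpolate(method="time"))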

Converting Categorical Data to Numerical Using One-Hot Encoding

Machine learning models typically require all input and feature data to be numeric. This is because mathematical equations underpin most model operations. Categorical data, which includes labels or categories, must be converted into numbers before you can feed it into your model. One common method to do this is through one-hot encoding.

What is One-Hot Encoding?

One-hot encoding transforms each categorical value into a new categorical column and assigns a binary value of 1 or 0. Each column represents one possible value of the feature, and a ‘1’ in that column indicates that the value is present, while a ‘0’ indicates it is absent. This method effectively creates a binary vector for each category.

Why Use One-Hot Encoding?

One-hot encoding is used because it allows the algorithms to treat each category distinctly without any implicit ordering (which might not be intended). This is particularly important for nominal categorical data where no ordinal relationship exists.

Practical Example: One-Hot Encoding in Pandas

Suppose you are analyzing data on flights and their delays. You have a category called ‘Delay’ with values like ‘Delayed’ and ‘Not Delayed’. Here’s how you can apply one-hot encoding to this categorical data:

Example:

import pandas as pd
# Sample data
data = {'Flight': [1, 2, 3, 4],
        'Delay': ['Delayed', 'Not Delayed', 'Delayed', 'Not Delayed']}
df = pd.DataFrame(data)
# Applying one-hot encoding
df_encoded = pd.get_dummies(df, columns=['Delay'])
print(df_encoded)

Output Explanation: In the resulting DataFrame, the ‘Delay’ column is replaced by two new columns:

  • ‘Delay_Delayed’: This column has a ‘1’ if the flight is delayed, and ‘0’ otherwise.
  • ‘Delay_Not Delayed’: This column has a ‘1’ if the flight is not delayed, and ‘0’ otherwise.

Result:

   Flight  Delay_Delayed  Delay_Not Delayed
0       1              1                  0
1       2              0                  1
2       3              1                  0
3       4              0                  1

Each row now has a binary indicator for whether the flight was delayed or not, allowing these categorical variables to be directly used in statistical models and machine learning algorithms without assuming any order or hierarchy.
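
One design note: for a two-value category like 'Delay', the two new columns are mirror images of each other, so one is redundant and can cause collinearity in linear models. If that matters for your model, get_dummies can drop one level:

# drop_first=True drops the first category level ('Delayed' here),
# keeping only the 'Delay_Not Delayed' indicator column
df_encoded = pd.get_dummies(df, columns=['Delay'], drop_first=True)
print(df_encoded)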

Understanding and Handling Data Outliers and Anomalies

When analyzing data, not all observations contribute positively. Some can distort the outcome. These are known as outliers and anomalies.

What are Outliers and Anomalies?

  • Outliers: These are data points that differ significantly from other observations. They can occur due to variability in the measurement or may indicate experimental errors.
  • Anomalies: These are unusual points in the data that do not fit a well-defined pattern or expected behavior. They are often indicative of a problem such as a data entry error, faulty equipment, or fraud.

Identifying Outliers with Interquartile Range (IQR)

Box Plot Explanation:

The box plot is an excellent tool for visualizing the distribution of data and spotting outliers (a plotting sketch follows this list). Here’s how it helps:

  • Lower Quartile (Q1): This is the median of the lower half of the data.
  • Upper Quartile (Q3): This is the median of the upper half of the data.
  • Interquartile Range (IQR): This is the difference between Q3 and Q1. It measures the statistical spread of the middle 50% of the data.
  • Whiskers: These lines extend from Q1 and Q3 to the smallest and largest values that fall within 1.5 times the IQR of the quartiles; points beyond that range are treated as outliers.
  • Outliers: Points that lie beyond the whiskers are considered outliers.
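
Here is a minimal plotting sketch, assuming Matplotlib is installed; it uses the same example dataset as the IQR calculation below:

import pandas as pd
import matplotlib.pyplot as plt

# Same example data as in the IQR calculation that follows
df = pd.DataFrame({'Values': [10, 12, 12, 13, 12, 11, 14, 13, 15,
                              102, 12, 14, 14, 100, 14]})

# Pandas delegates the drawing to Matplotlib; the two extreme values
# (100 and 102) appear as individual points beyond the upper whisker
df.boxplot(column='Values')
plt.show()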

Example of Calculating IQR and Identifying Outliers:

import pandas as pd

# Example dataset
data = {'Values': [10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 14, 100, 14]}
df = pd.DataFrame(data)

# Calculating Q1, Q3, and IQR
Q1 = df['Values'].quantile(0.25)
Q3 = df['Values'].quantile(0.75)
IQR = Q3 - Q1

# Defining outliers
outliers = df[(df['Values'] < (Q1 - 1.5 * IQR)) | (df['Values'] > (Q3 + 1.5 * IQR))]
print("Outliers:\n", outliers)
Output:

Outliers:
     Values
9       102
13      100

Expected Results

The expected output will show which entries in the dataset are considered outliers based on the IQR method. For the provided data, let’s calculate the actual values:

  • Q1 (25th percentile): 12
  • Q3 (75th percentile): 14
  • IQR: Q3 - Q1 = 14 - 12 = 2

Now, calculate the outlier thresholds:

  • Lower bound: Q1 - 1.5 * IQR = 12 - 3 = 9
  • Upper bound: Q3 + 1.5 * IQR = 14 + 3 = 17

Outliers in this dataset are:

  • 102 (significantly higher than 17)
  • 100 (significantly higher than 17)

These are the values identified and listed as outliers in the output shown above.

Handling Outliers

Once identified, you have several options for handling outliers:

  • Removing: Delete the outlier values if you determine they are due to errors or are not relevant to the purpose of your analysis.
  • Adjusting: If outliers are expected but extreme, they can be capped at a certain value (see the clipping sketch after this list).
  • Keeping: In some analyses, outliers are essential and should be included to understand the variability in data.
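
As a concrete illustration of the 'Adjusting' option, here is a minimal capping sketch that reuses the Q1, Q3, and IQR values computed in the earlier example; Series.clip is a standard Pandas method:

# Cap values outside the IQR bounds computed above
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df['Values'] = df['Values'].clip(lower=lower_bound, upper=upper_bound)
# The outliers 102 and 100 are now capped at 17, the upper bound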

Summary

Understanding the nature of your data, including outliers and anomalies, is essential for accurate analysis. Tools like the box plot provide a clear visual representation and help in making informed decisions about which data points to investigate further.
