Data cleaning mastery: Clean and organize your data like a pro.

Learn data-wrangling techniques in Python to transform your raw data into a reliable foundation for predictive models.

Asish Biswas
AnalyticSoul
May 22, 2024


In this lesson, we’ll explore essential data-wrangling techniques that transform raw data into a more usable and structured format. Data wrangling is like sculpting raw marble into a beautiful statue: it requires skill, precision, and creativity. By the end of this lesson, you’ll be equipped to handle missing data, perform group operations, and transpose DataFrames for better analysis.

1. Handling Missing Data

One of the common problems we run into as data scientists or data analysts is handling null or empty values in a dataset. Some machine learning algorithms can’t deal with missing values at all, and others perform poorly when missing values are present. Thankfully, the pandas DataFrame comes with handy functions to identify missing values:

  • isnull()
  • notnull()
  • isna()
  • info()

Try these functions on your DataFrame to understand their effects and use cases.
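As a quick illustration, here is a minimal sketch on a small, made-up DataFrame (the column names are invented for this example) showing what each of these functions returns:

import pandas as pd
import numpy as np

# a tiny, hypothetical DataFrame with one missing price
df = pd.DataFrame({'price': [10.0, np.nan, 7.5], 'qty': [1, 2, 3]})

print(df.isnull())      # element-wise mask: True where a value is missing
print(df.notnull())     # the inverse mask: True where a value is present
print(df.isna().sum())  # isna() is an alias of isnull(); sum() counts per column
df.info()               # dtypes and non-null counts for every column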

Once we’ve identified and understood the missing values, we can move on to the next phase: dealing with those records or individual values. Imputation and removing records are two common ways to handle missing values. Imputation refers to the process of filling in missing or incomplete data values with estimated or substituted values. Let’s see how we can fill in missing values through imputation.

import pandas as pd

# load the dataset
df_transactions = pd.read_csv('data/sample_transactions_with_missing_value.csv', header=0)

# count the number of missing values per column
print('Count number of missing values')
print(df_transactions.isna().sum())

Count number of missing values
CustomerID       0
InvoiceNo      250
UnitPrice       33
Quantity         0
InvoiceDate     40
dtype: int64

# show the records with a missing unit price
print('Records with missing unit-price')
df_missing = df_transactions[df_transactions['UnitPrice'].isna()]
print(df_missing)

# impute the missing unit prices with the median of that column
# (we fill only the UnitPrice column, so other columns keep their values)
print('Impute missing values with median of that column')
df_imputed = df_transactions.copy()
df_imputed['UnitPrice'] = df_imputed['UnitPrice'].fillna(df_transactions['UnitPrice'].median())
print(df_imputed)

Handling missing values with imputation

Run the accompanying notebook to see it in action.

If appropriate, you can instead remove the records containing empty values using the DataFrame’s dropna() method, as sketched below.
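Here is a minimal sketch of both variants; the subset columns are an assumption based on the columns shown above:

# drop any row that still contains at least one missing value
df_complete = df_transactions.dropna()

# or drop rows only when specific columns are missing
df_complete = df_transactions.dropna(subset=['InvoiceNo', 'InvoiceDate'])
print(df_complete.isna().sum())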

2. Transposing

Transposing a dataset means swapping the columns and the rows. It helps us to see the data from a different angle. Let’s see it in action with the descriptive statistics DataFrame we get from the describe() method.

# describe the dataset
print('Normal details')
print(df_transactions.describe())

# see the describe output in a transposed format
print('Transposed details')
print(df_transactions.describe().T)

You can see that the columns have become row indexes, and the row indexes have become column names.

3. Calculating Customer Sales with Group by

We are going to continue using the DataFrame where we imputed the missing unit-prices with the median unit-price.

The groupby() method involves splitting the records into groups, applying a function to each group, and combining the results. You can think of it as collapsing the data down to one row per unique entity.

Before starting the groupby() operation, let’s first create a new column called TotalSale by multiplying UnitPrice and Quantity in order to get the total sale amount per invoice. Then we’ll group customers based on their IDs and aggregate their total sale amount.

# calculate total sale
df_imputed['TotalSale'] = df_imputed['UnitPrice'] * df_imputed['Quantity']

# aggregate total sale per customer
df_summary = df_imputed.groupby('CustomerID').agg({'TotalSale': 'sum', 'InvoiceDate':'first'})
print('Total sale per customer')
print(df_summary.head())

There are several built-in aggregate functions, such as count(), min(), max(), and std(). If the built-in functions don't fulfill your requirements, you can apply custom lambda functions while performing groupby() operations, as sketched below.
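Here is a minimal sketch using pandas named aggregation; the output column names (total_sale, avg_sale, sale_range) are made up for this example:

# combine built-in aggregates with a custom lambda per customer
df_custom = df_imputed.groupby('CustomerID').agg(
    total_sale=('TotalSale', 'sum'),
    avg_sale=('TotalSale', 'mean'),
    sale_range=('TotalSale', lambda s: s.max() - s.min()),  # custom lambda
)
print(df_custom.head())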

Data wrangling is like untangling a knotted necklace — it requires patience, skill, and the right tools. Armed with these techniques, you’ll be well-prepared to clean, reshape, and analyze your data effectively.

Please refer to the GitHub repository to access the Jupyter Notebooks and practice along.

Join the community

Join our vibrant learning community on Discord! You’ll find a supportive space to ask questions, share insights, and collaborate with fellow learners. Dive in, collaborate, and let’s grow together! We can’t wait to see you there!

Next, we’ll start building machine learning models and applying them to solve real-life use cases.
