Hey! You Forgot to Clean Your Data

Let’s learn data cleaning using Python.

Ahmad Mizan Nur Haq
Data And Beyond
7 min read · Apr 28, 2024


What is data cleaning or data cleansing?

Data cleaning is the process of detecting, correcting, or removing incorrect, corrupted, or incomplete data in a dataset or table. It can also be understood as the process of removing or discarding data that does not belong in the dataset.

Why is data cleaning important?

Dirty, inconsistent, or corrupted data can skew results and lead to wrong conclusions, so the ability to clean data properly has a high impact on the quality of the results.

A Real Case example from Wikipedia

For instance, the government may want to analyze population census figures to decide which regions require further spending and investment on infrastructure and services. In this case, it will be important to have access to reliable data to avoid erroneous fiscal decisions.

3 techniques that can be used for data cleaning using Python

Let’s try to break down what components can be used.

1. Handling missing values

Handling missing values is an important step in data cleaning. If missing data is not handled properly, it can cause bias in the analysis. This bias can affect the accuracy of the model and statistical predictions, leading to incorrect conclusions.

Approaches we can use in handling missing values:

Deletion

  • Listwise deletion: delete every row that has a missing value. This approach can be effective if the missing values are randomly distributed and the sample size is large enough.
  • Pairwise deletion: perform the analysis on the available pairs of variables, excluding cases with missing values only for the variables involved. This approach retains more data but may lead to biased results if the missing values are not random.
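As a sketch of the two deletion strategies, using a small hypothetical dataframe:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'age': [25, np.nan, 30, 35],
    'income': [100.0, 50.0, np.nan, 200.0],
})

# Listwise deletion: drop every row that has any missing value
listwise = df.dropna()

# Pairwise-style deletion: keep rows that are complete only for the
# specific variables being analysed (here just 'age')
pairwise_age = df.dropna(subset=['age'])

print(len(listwise))      # 2 rows survive listwise deletion
print(len(pairwise_age))  # 3 rows are usable for an 'age'-only analysis
```

Pairwise deletion keeps an extra row here because the row with a missing `income` is still valid for an analysis that only involves `age`.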

Imputation

  • Mean, median, or mode imputation: in this method, the missing values are replaced with the mean, median, or mode of the variable in question.

Here is how you handle missing values using Python

Let’s imagine a transportation company that has a dataset about their service users. This dataset includes information such as:

  • user_id: a unique ID for each user
  • umur: the user’s age
  • jenis_kelamin: the user’s gender
  • alamat_email: the user’s email address
  • jumlah_pembayaran: the amount the user paid for the service
# Don't forget to import the libraries
import pandas as pd
import numpy as np

# Create the dataframe
data = {
    'user_id': [1, 2, 3, 4, 5],
    'umur': [25, np.nan, 30, 35, np.nan],
    'jenis_kelamin': ['L', 'P', np.nan, 'L', 'P'],
    'alamat_email': ['contoh1@contoh.com', 'contoh2@contoh.com', np.nan, 'contoh4@contoh.com', 'contoh5@contoh.com'],
    'jumlah_pembayaran': [100.0, 50.0, np.nan, 200.0, np.nan]
}

df = pd.DataFrame(data)
df

Deletion

This involves deleting rows or columns that contain missing values.

# Drop rows with missing values
df_dropped = df.dropna()
df_dropped


Imputation

# Fill missing numeric values with the rounded mean
df_mean_filled = df.copy()
df_mean_filled['umur'] = df_mean_filled['umur'].fillna(df_mean_filled['umur'].mean().round())
df_mean_filled['jumlah_pembayaran'] = df_mean_filled['jumlah_pembayaran'].fillna(df_mean_filled['jumlah_pembayaran'].mean().round())
df_mean_filled
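For the categorical columns, the mode can be used in the same way. A small sketch on a hypothetical column like `jenis_kelamin` above; note that `Series.mode` can return several values on a tie, so `[0]` picks the first:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'jenis_kelamin': ['L', 'P', np.nan, 'L', 'P'],
})

# mode() returns the most frequent value(s), sorted; [0] takes the first
df['jenis_kelamin'] = df['jenis_kelamin'].fillna(df['jenis_kelamin'].mode()[0])
print(df['jenis_kelamin'].tolist())  # ['L', 'P', 'L', 'L', 'P']
```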


2. Removing Duplicated Data

One of the disadvantages of data duplication is the increased size of a dataset.

Three types of data duplication

  • Exact duplication: occurs when all attributes or variables in a record are identical to another record in the dataset.
  • Partial duplication: occurs when some attributes or variables in a record are identical to another record, but not all of them.
  • Subset duplication: occurs when records are identical on a specific subset of variables rather than the entire record.

Let’s consider a dataframe that contains duplication of the kinds described above.
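The original illustration is not reproduced here, so as a stand-in, here is a hypothetical dataframe that contains both an exact duplicate and a partial duplicate (the column names follow the earlier example):

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3],
    'alamat_email': ['contoh1@contoh.com', 'contoh1@contoh.com',
                     'contoh2@contoh.com', 'contoh3@contoh.com',
                     'contoh3@contoh.com'],
    'jumlah_pembayaran': [100.0, 100.0, 50.0, 200.0, 150.0],
})

# Row 1 is an exact duplicate of row 0; row 4 shares user_id and
# alamat_email with row 3 but differs in jumlah_pembayaran
print(df.duplicated().sum())  # 1 exact duplicate
```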

Delete rows that are identical to other rows

df_deduped = df.drop_duplicates()
df_deduped

Removing duplication based on a subset of specific columns

In the example data, the record with user_id 1 appears twice.

df_subset= df.drop_duplicates(subset=['user_id', 'alamat_email'])
df_subset

Duplicates that have the same value in both the ‘user_id’ and ‘alamat_email’ columns will be deleted.

Partial duplication: remove the duplicates and keep the first occurrence

# Keep the first occurrence of each duplicated row
df_keep_first = df.copy()
df_keep_first.drop_duplicates(subset='alamat_email', keep='first', inplace=True)
df_keep_first

By using keep='first', the first row of each duplicate group is kept.
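Before deleting anything, it can help to inspect which rows pandas flags as duplicates. A small sketch with hypothetical email values:

```python
import pandas as pd

df = pd.DataFrame({
    'alamat_email': ['a@x.com', 'b@x.com', 'a@x.com', 'a@x.com'],
})

# keep='first' (the default): the first occurrence is NOT flagged
mask = df['alamat_email'].duplicated(keep='first')
print(mask.tolist())  # [False, False, True, True]

# keep=False flags every member of a duplicate group
print(df['alamat_email'].duplicated(keep=False).tolist())  # [True, False, True, True]
```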

3. Fixing inconsistent data formats and types

Here we will try to understand what is meant by incorrect data format and data type.

Incorrect data format

Data may be stored in formats that do not conform to the expected standards, such as dates in different formats or numerical values with extra characters.

Incorrect Data Types

The data provided has the wrong data type, such as a string instead of a numeric value, or vice versa.

To avoid cases of inconsistent data formats and data types:

  • Use standardized data formats, covering date formats, currency symbols, units of measurement, and more.
  • Define a set of standard formats. For example, all dates should be in YYYY-MM-DD format, and currency amounts should use a consistent currency symbol and number of decimal places.
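As a sketch of the currency part of such a standard, assuming hypothetical raw values that mix symbols and thousands separators:

```python
import pandas as pd

prices = pd.Series(['$1,200.50', 'USD 300', '45.5'])

# Remove everything except digits and the decimal point, then convert
cleaned = (prices.str.replace(r'[^0-9.]', '', regex=True)
                 .astype(float)
                 .round(2))
print(cleaned.tolist())  # [1200.5, 300.0, 45.5]
```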

Suppose we have a dataframe with the name df:

Incorrect Format and data type case

import pandas as pd

# Create a DataFrame with dates stored as YYYY-MM-DD strings
data = {
    'date_of_birth': ['1996-05-25', '2000-03-12',
                      '1989-11-30', '1985-07-18',
                      '2003-09-05'],  # string format YYYY-MM-DD
    'name': ['John', 'Jane', 'Michael', 'Emily', 'Daniel']
}

df = pd.DataFrame(data)

Then we want to inspect the data types of the dataframe

# just call
df.info()

If we want to change to datetime, we just write it like this

df['date_of_birth'] = pd.to_datetime(df['date_of_birth'])
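If the column may contain malformed values, `pd.to_datetime` also accepts an explicit format and an errors parameter; errors='coerce' turns unparseable entries into NaT instead of raising. A small sketch with a hypothetical bad value:

```python
import pandas as pd

raw = pd.Series(['1996-05-25', '2000-03-12', 'not a date'])

# Parse with an explicit format; bad entries become NaT instead of errors
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')
print(parsed.isna().tolist())  # [False, False, True]
```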

Define a set of standard formats. For example, all dates must be in DD-MM-YYYY format

# Change the date format to DD-MM-YYYY (note: strftime returns strings again)
df['date_of_birth'] = df['date_of_birth'].dt.strftime('%d-%m-%Y')
df

The case where we need to do data type conversion


# Create the DataFrame
data = {
    'age': ['25', '20', '30', '35', '18', '30', '30'],
    'gender': ['P', 'W', 'P', 'P', 'W', 'W', 'P'],
}
df = pd.DataFrame(data)
df

In the gender column we will map P to ‘Pria’ (Male) and W to ‘Wanita’ (Female).

# Value mapping and data type conversion
df['gender'] = df['gender'].replace({'P': 'Pria', 'W': 'Wanita'})  # map codes to full words
df['age'] = df['age'].astype(int)  # convert 'age' from strings to integers
df

What are the benefits of data cleaning?

The following are some of the important benefits of data cleaning:

  • Clean data reduces recurring errors in the dataset, which improves overall data quality.
  • Clean data makes analysis more accurate. By removing missing values, duplicates, and other noise, analysis can reveal the real patterns and trends in the data.
  • Clean data reduces the risk of inventory shortages, improper deliveries, and other business problems, resulting in cost savings.

Let’s Practice It

Now that we have learned the basics of data cleaning, let’s build a simple data pre-processing application using Python and the Streamlit framework.

Flowchart of the data pre-processing application.

Let’s code

  1. Don’t forget to import the libraries

import pandas as pd
import streamlit as st
# If Streamlit isn't installed yet, open a terminal and run "pip install streamlit"

2. For data input, we create an upload feature

# Load data from user input
uploaded_file = st.file_uploader("Upload data here (CSV file)", type=["csv"])

3. Define a handle_missing_values function responsible for handling missing values in the DataFrame

def handle_missing_values(df):
    st.write("## Handling Missing Values")
    # Show the data before handling missing values
    st.write("### Data Before Handling Missing Values:")
    st.write(df)

    # Handle missing values
    for column in df.columns:
        if df[column].dtype == 'int64' or df[column].dtype == 'float64':
            # Numeric column: fill missing values with the mean
            df[column] = df[column].fillna(df[column].mean())
        else:
            # String column: fill missing values with the mode
            df[column] = df[column].fillna(df[column].mode()[0])
    return df

Here we handle the missing values with mean or mode imputation, depending on the column type.
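The same mean/mode imputation loop can be exercised outside Streamlit. A minimal sketch with hypothetical sample data:

```python
import pandas as pd
import numpy as np

def impute(df):
    # Numeric columns get the mean, everything else gets the mode
    for column in df.columns:
        if pd.api.types.is_numeric_dtype(df[column]):
            df[column] = df[column].fillna(df[column].mean())
        else:
            df[column] = df[column].fillna(df[column].mode()[0])
    return df

df = impute(pd.DataFrame({
    'umur': [25.0, np.nan, 35.0],
    'jenis_kelamin': ['L', 'P', np.nan],
}))
print(df['umur'].tolist())  # [25.0, 30.0, 35.0]
```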

4. Create a button to trigger the data process

if uploaded_file is not None:
    # Read the CSV data
    df = pd.read_csv(uploaded_file)

    # Button to process the data
    if st.button("Process Data"):
        df = handle_missing_values(df)

5. Create a download button for pre-processed data

        # Download button for the processed data
        csv = df.to_csv(index=False)
        st.download_button(label="Download Processed Data", data=csv, file_name="processed_data.csv", mime="text/csv")

Check out the full code on GitHub

or test the app here: Streamlit (cleaningkuy.streamlit.app)

