Data Cleaning - Exploratory Data Analysis

Rina Mondal
Women in Technology
Jul 4, 2024

Data cleaning, also known as data cleansing or data preprocessing, is a crucial step in the data preparation process that involves detecting and correcting errors or inconsistencies in data to improve its quality.

Here are the main tasks that typically come under Data Cleaning:

1. Handling Missing Data

— Imputation: filling missing values with the column mean, median, or mode.
— Forward or backward fill: propagating the previous or next observed value.
— Interpolation: estimating missing values using linear or non-linear methods.
— Model-based imputation: using predictive models to estimate missing values.
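The first three techniques can be sketched in pandas; the series below and its values are illustrative:

```python
import pandas as pd
import numpy as np

# Illustrative series with missing values
s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

# Mean imputation: fill gaps with the mean of the observed values
mean_filled = s.fillna(s.mean())

# Forward fill: propagate the previous observed value
ffilled = s.ffill()

# Backward fill: propagate the next observed value
bfilled = s.bfill()

# Linear interpolation: estimate gaps from neighboring values
interpolated = s.interpolate(method='linear')
```

Which technique is appropriate depends on the data: forward fill suits time series, while mean or median imputation suits cross-sectional data.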

2. Handling Outliers

Detection: Identifying outliers using statistical methods such as z-scores, IQR (Interquartile Range), or visualizations like box plots.

Capping/Clipping: Setting a threshold to limit extreme values.
Transformation: Applying transformations to reduce the impact of outliers (e.g., log transformation).
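A minimal sketch of IQR-based detection, clipping, and a log transformation, using made-up values with one obvious outlier:

```python
import pandas as pd
import numpy as np

s = pd.Series([10, 12, 11, 13, 12, 95])  # 95 is an outlier

# IQR detection: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]

# Capping/clipping: limit extreme values to the IQR bounds
clipped = s.clip(lower, upper)

# Log transformation: compress the range of extreme values
logged = np.log1p(s)
```

`np.log1p` (log of 1 + x) is used instead of a plain log so that zero values do not produce negative infinity.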

3. Dealing with Duplicates

Detection: Identifying duplicate rows or entries in the dataset.

Removal: Deleting duplicate rows to ensure each observation is unique.

import pandas as pd

# Sample dataset
data = {
    'CustomerID': [101, 102, 101, 103, 104],
    'ProductID': ['A1', 'B2', 'A1', 'C3', 'D4'],
    'Quantity': [2, 1, 2, 5, 1],
    'Price': [50, 30, 50, 20, 100],
    'OrderDate': ['2023-06-01', '2023-06-02', '2023-06-01', '2023-06-03', '2023-06-04']
}

df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Check for duplicates
duplicates = df.duplicated()
print("Duplicates in DataFrame:")
print(duplicates)

# Here, the third row (index 2) is an exact duplicate of the first,
# so duplicated() returns True for it and False for the first occurrence.

# Remove duplicates
df_cleaned = df.drop_duplicates()
print("DataFrame after removing duplicates:")
print(df_cleaned)

# Remove duplicates based on specific columns
df_cleaned_specific = df.drop_duplicates(subset=['CustomerID', 'ProductID', 'OrderDate'])
print("DataFrame after removing duplicates based on specific columns:")
print(df_cleaned_specific)

# Remove duplicates but keep the last occurrence
df_cleaned_last = df.drop_duplicates(keep='last')
print("DataFrame after removing duplicates and keeping the last occurrence:")
print(df_cleaned_last)

4. Correcting Data Types

Conversion: Ensuring that each column has the appropriate data type (e.g., converting strings to dates, integers to floats).

df['date_column'] = pd.to_datetime(df['date_column'])
df['numeric_column'] = df['numeric_column'].astype(float)
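A self-contained sketch of these conversions; the column names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'order_date': ['2023-06-01', '2023-06-02'],
    'quantity': ['2', '5'],  # numbers stored as strings
})

# Convert strings to proper datetime and integer types
df['order_date'] = pd.to_datetime(df['order_date'])
df['quantity'] = df['quantity'].astype(int)

print(df.dtypes)
```

Correct dtypes matter downstream: datetime columns unlock `.dt` accessors and resampling, and numeric columns allow arithmetic and aggregation.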

5. Addressing Inconsistent Data

Standardization: Ensuring consistency in data representation (e.g., consistent date formats, uniform text cases).

df['text_column'] = df['text_column'].str.lower()
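A short example of why standardization matters; the city values are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'city': ['  New York', 'new york', 'NEW YORK ', 'Boston']})

# Strip stray whitespace and normalize case so identical values match
df['city'] = df['city'].str.strip().str.lower()

# The three "New York" variants now collapse into a single value
print(df['city'].nunique())
```

Without this step, grouping or joining on the column would treat each spelling variant as a different city.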

6. Handling Erroneous Data

Identifying and correcting errors in the data (e.g., incorrect values, typographical errors).

df.loc[df['age'] < 0, 'age'] = np.nan  # Assuming negative age is an error

import pandas as pd
import numpy as np

# Sample dataset
data = {
    'CustomerID': [101, 102, 103, 104, 105],
    'Name': ['John Doe', 'Jane Doe', 'Alice Smith', 'Bob Johnson', 'Charlie Brown'],
    'Age': [25, -3, 34, -5, 45]
}

df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Replace negative ages with NaN
df.loc[df['Age'] < 0, 'Age'] = np.nan

print("DataFrame after replacing negative ages with NaN:")
print(df)

Correction: fixing identified errors through manual edits, algorithmic adjustments, or cross-checks against external sources.

df['price'] = df['price'].replace({r'\$': ''}, regex=True).astype(float)

import pandas as pd

# Sample dataset
data = {
    'ProductID': [1, 2, 3, 4, 5],
    'ProductName': ['Product A', 'Product B', 'Product C', 'Product D', 'Product E'],
    'Price': ['$10.50', '$23.99', '$5.00', '$15.75', '$8.30']
}

df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Replace dollar signs and convert to float
df['Price'] = df['Price'].replace({r'\$': ''}, regex=True).astype(float)

print("DataFrame after converting Price to float:")
print(df)

Data cleaning encompasses a variety of tasks aimed at improving data quality by handling missing values, correcting errors, addressing inconsistencies, and preparing data for analysis and modeling. Proper data cleaning ensures that the resulting dataset is accurate, consistent, and suitable for the intended analysis or machine learning tasks.

Complete Data Science Roadmap.

Blogs Related to Data Cleaning:

  1. Complete Data Cleaning.
  2. Remove Outliers using InterQuartile Range
  3. Remove Outliers using Z-Score
  4. Using Log Transformation to mitigate the effect of outliers

If you found this guide helpful, why not show some love? Give it a Clap 👏, and if you have questions or topics you’d like to explore further, drop a comment 💬 below 👇. If you appreciate my hard work, please follow me; that is the only way I can continue my passion.
