Data Cleaning Across Multiple Formats Using Python

A Comprehensive Guide to Handling Multiple Formats

Richard Warepam
ILLUMINATION
5 min read · Jul 24, 2024


Photo by fauxels

What types of data formats have you encountered so far?

From text files to images and videos, data comes in various formats.

Do you think data science professionals get to choose the format they work with?

The simple answer is: NO!

Dealing with messy, diverse data is an unavoidable part of professional work, and handling it well is a crucial skill. But don’t worry too much…

Your favorite data science mentor is here to help! In this article, I’ll share some practical insights on handling various data formats and sources in your data cleaning projects.

#Ad: Explore my eBooks

Hello readers, this is a way to appreciate my writing and support me in producing more valuable content.

I have written several short eBooks on Data Science and AI. More than 2K readers have already bought them and found them helpful! Check them out.

By Author: My Gumroad Shop

Click on the Links:

Learn Data Science with me | Prepare SQL with me | Prepare Statistics with me

Kindly check out my eBooks to support my writing.

Why Data Cleaning Matters

Before we dive in, let’s address the elephant in the room:

Why should you care about data cleaning?

Simply put, “clean data” is the foundation of accurate analysis and reliable machine learning models.

Without it, you’re building on shaky ground: your analysis and your models are prone to breaking.

In short, if you ask me, “What does data cleaning bring to the table?”

I’d tell you it’s all about boosting performance, enhancing reliability, and improving accuracy in your analysis or ML models.

Common Data Formats You’ll Encounter

Throughout my data science journey, I’ve encountered various data formats.

But here’s the thing: you don’t need to master every single one.

Let me share the most common formats you’ll likely deal with in your projects.

Trust me, these are the ones that really matter!

  1. CSV (Comma-Separated Values)
  2. JSON (JavaScript Object Notation)
  3. XML (eXtensible Markup Language)
  4. SQL databases
  5. Parquet files

Now, I won’t sugarcoat it — each of these formats has its own quirks.

But don’t worry! With the right tools and a bit of practice, you’ll handle them like a pro.

Which Python Libraries Are Used?

Now that we’ve covered the most common data formats, let’s talk about the tools you need to handle them.

The simple answer, as always in data science, is Python.

But why Python, and how exactly?

Well, it’s all about the libraries. Python is famous for its vast collection of powerful libraries, and I’ll highlight some that are perfect for working with different data formats.

Here are my go-to libraries for tackling various data types:

  1. Pandas: This versatile library is fantastic for CSV, JSON, and even SQL databases.
  2. xml.etree.ElementTree: When you’re dealing with XML files, this is your best friend.
  3. SQLAlchemy: Think of this as your personal bridge to SQL databases.
  4. PyArrow: If you’re working with Parquet files, this library is excellent.

These libraries make handling different data formats a breeze.

Trust me, I’ve used them countless times in my projects, and they’ve never let me down.

In the next section, I’ll show you some practical examples of how to use these libraries effectively. Stay tuned!

The Data Cleaning Process: A Step-by-Step Guide

  • Understand Your Data: Before you start cleaning, get to know your data. What format is it in? What does each column represent? Are there any obvious issues?
  • Load the Data: Use the appropriate Python library to load your data.

For example:

import pandas as pd

# For CSV
data = pd.read_csv('your_file.csv')

# For JSON
data = pd.read_json('your_file.json')

# For XML
import xml.etree.ElementTree as ET

# Reading an XML file
tree = ET.parse('data.xml')
root = tree.getroot()

# Extracting data from XML
data = []
for child in root:
    data.append(child.attrib)

# For SQL Database
from sqlalchemy import create_engine

# Creating a database connection
engine = create_engine('sqlite:///data.db')

# Querying data from a SQL database
data = pd.read_sql('SELECT * FROM table_name', engine)

# For Parquet
import pyarrow.parquet as pq

# Reading a Parquet file with PyArrow
table = pq.read_table('data.parquet')

# Convert to pandas DataFrame if needed
data = table.to_pandas()
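Once the data is loaded, take a moment to actually look at it; that’s what step 1, “Understand Your Data”, is about. Here’s a minimal inspection sketch, assuming your data ended up in a pandas DataFrame called data:

# Getting to know the loaded DataFrame
print(data.shape)               # number of rows and columns
data.info()                     # column names, dtypes, non-null counts
print(data.head())              # a peek at the first few rows
print(data.isnull().sum())      # missing values per column
print(data.duplicated().sum())  # fully duplicated rows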
  • Handle Missing Values: Missing data can throw a wrench in your analysis. You might fill these with averages, or remove rows with missing data entirely:
# Fill missing values in numeric columns with the column mean
data = data.fillna(data.mean(numeric_only=True))

# Or remove rows with missing values
data = data.dropna()
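Filling everything with the mean is a blunt instrument. In practice, you’ll often pick a strategy per column: the median for skewed numeric columns, the most frequent value for categories. A quick sketch (the column names here are hypothetical):

# Column-specific imputation (column names are just examples)
data['age'] = data['age'].fillna(data['age'].median())      # numeric: median is robust to outliers
data['city'] = data['city'].fillna(data['city'].mode()[0])  # categorical: most frequent value
data['signup_date'] = data['signup_date'].ffill()           # ordered data: carry the last value forward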
  • Remove Duplicates: Duplicate data can skew your results. It’s usually safe to remove them:
data = data.drop_duplicates()
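Sometimes only a few key columns define what counts as a duplicate, and it’s worth checking how many there are before dropping anything. The column names below are hypothetical:

# How many fully duplicated rows are there?
print(data.duplicated().sum())

# Treat rows with the same customer_id and order_date as duplicates, keep the most recent entry
data = data.drop_duplicates(subset=['customer_id', 'order_date'], keep='last')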
  • Correct Inconsistencies: Look for and fix any inconsistencies in your data.

For example, you might want to ensure all text is in lowercase:

data['text_column'] = data['text_column'].str.lower()
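Lowercasing is only one kind of fix. Stripping stray whitespace and mapping spelling variants to a single label are just as common; here’s a small sketch with made-up values:

# Remove leading/trailing whitespace
data['text_column'] = data['text_column'].str.strip()

# Map spelling variants to one canonical label (values are hypothetical)
data['country'] = data['country'].replace({'USA': 'United States', 'U.S.': 'United States'})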
  • Transform Your Data: Sometimes, you need to change the shape of your data. This might involve encoding categories or scaling numerical values:
# One-hot encoding for categories
data = pd.get_dummies(data, columns=['category_column'])

# Scaling numerical data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data['numeric_column'] = scaler.fit_transform(data[['numeric_column']])
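One caveat if this data is headed for a machine learning model: fit the scaler on the training split only and then apply it to the test split, so no information leaks from test into training. A minimal sketch, assuming scikit-learn:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split first, then scale
train, test = train_test_split(data, test_size=0.2, random_state=42)
train, test = train.copy(), test.copy()

scaler = StandardScaler()
train['numeric_column'] = scaler.fit_transform(train[['numeric_column']])
test['numeric_column'] = scaler.transform(test[['numeric_column']])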

Bringing It All Together

Now, you’ve got a handle on various data formats, haven’t you?

In real-world projects, data flows in from multiple sources, so you’ll frequently need to merge it all together.

Here’s a simplified example:

# Reading customer data from CSV
customer_data = pd.read_csv('customers.csv')

# Reading transaction data from SQL
transaction_data = pd.read_sql('SELECT * FROM transactions', engine)

# Reading product data from JSON
product_data = pd.read_json('products.json')

# Reading operational data from Parquet (pandas can also read it directly)
parquet_data = pd.read_parquet('operations.parquet')

# Merging data
merged_data = pd.merge(customer_data, transaction_data, on='customer_id')
merged_data = pd.merge(merged_data, product_data, on='product_id')
merged_data = pd.merge(merged_data, parquet_data, on='operation_id')

# Cleaning and preprocessing the merged data
merged_data = merged_data.dropna()
merged_data['product_category'] = merged_data['product_category'].str.lower()
encoded_data = pd.get_dummies(merged_data, columns=['product_category'])

print(encoded_data.head())
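Merges are where rows quietly disappear or multiply, so it pays to check the result. A small sketch of the kind of checks I’d add, reusing the same column names as above:

# A left join keeps every customer, even those without transactions yet
merged_data = pd.merge(customer_data, transaction_data, on='customer_id', how='left')

# validate raises an error if the join isn't the relationship you expect
merged_data = pd.merge(merged_data, product_data, on='product_id', validate='many_to_one')

# Compare row counts before and after merging to catch silent row loss or duplication
print(len(customer_data), len(merged_data))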

Remember, data cleaning is more art than science. It requires patience, attention to detail, and a good understanding of your specific dataset.
