From Chaos to Clarity: The Art of Data Cleaning and Transformation

Sergio David
2 min readAug 14, 2023

--

The process of collecting data is only the beginning. Once we have the raw data, we often find that it’s messy, unstructured, or filled with inaccuracies. Data cleaning and transformation are essential steps to turn this raw data into valuable insights. Let’s explore these processes using tools like Pandas and Regex, along with examples and code snippets.

Cleaning with Pandas and Regex

Data cleaning involves handling missing values, correcting data types, removing duplicates, and more. We can use libraries like Pandas and Regex to automate much of this work.

Example: Cleaning City Coordinates

Imagine we have a DataFrame containing city names and coordinates as a single string. We need to split these into separate latitude and longitude columns.

1. Importing Libraries:

import pandas as pd
import re

2. Loading Data:

data = {'city_name': ['Berlin', 'Paris', 'Madrid'], 'coordinates': ['52.52;13.405', '48.85667;2.35222', '40.41694;-3.70333']}
df = pd.DataFrame(data)

3. Splitting Coordinates:

We can use Pandas’ `str.split` method along with Regex to divide the coordinates.

df[['latitude', 'longitude']] = df['coordinates'].str.split(';', expand=True)

4. Removing Unwanted Columns:

We can drop the original ‘coordinates’ column.

df.drop(columns=['coordinates'], inplace=True)

5. Converting Data Types:

We convert the latitude and longitude to numeric values.

df['latitude'] = pd.to_numeric(df['latitude'])
df['longitude'] = pd.to_numeric(df['longitude'])

Data Transformation

Data transformation involves reshaping, aggregating, or otherwise modifying data to prepare it for analysis.

Example: Aggregating Weather Data

Suppose we have weather data for different cities and want to find the average temperature per city.

1. Grouping Data:

We can use Pandas’ `groupby` method to group the data by city.

weather_df.groupby('city_name').mean()['temperature']

2. Reshaping Data:

We might also want to pivot the data to have temperatures across different dates.

weather_df.pivot(index='city_name', columns='date', values='temperature')

Conclusion

Data cleaning and transformation are foundational to data analysis. With tools like Pandas and Regex, we can clean messy data and transform it into the structure we need for our specific analysis.

These examples showcase how these tools can simplify complex tasks, like splitting strings into multiple columns or grouping and reshaping data. By mastering these techniques, we’re well on our way to turning raw data into actionable insights. Whether you’re a data scientist, analyst, or anyone working with data, understanding these processes is crucial in the modern data-driven world.

--

--

Sergio David

Data Scientist exploring the frontier of machine learning. Join me on Medium for insights into the evolving world of data.