Data Engineering Challenges Series: Data format and schema inconsistencies

Remis Haroon
7 min read · Jun 21, 2024


Introduction

Raw data usually contains impurities, and among the most disruptive are inconsistencies in data formats and schemas. These inconsistencies can lead to data integrity violations, complicate analytics, and hinder business processes. This article looks at the issues caused by data format and schema inconsistencies and how to manage them effectively.

Syntactic Inconsistencies

Syntactic inconsistencies arise when data values are not in the form they are supposed to be, for instance differing date formats within a dataset (e.g., DD/MM/YYYY vs. MM/DD/YYYY) or currency symbols that are missing or misplaced. Such inconsistencies can result in misinterpreted data and downstream processing errors. Addressing them requires data cleaning methods such as regular expressions, together with data standardization tools like pandas or pyjanitor, to ensure the data is consistently formatted before any analysis is performed.

Example: Multinational Retail Corporation Sales Data

Scenario: A multinational retail corporation operates in over 30 countries, each having its own local systems for recording sales data. Due to the lack of a unified system, each country submits their monthly sales data in various formats. Notably, the date formats vary significantly — some countries submit data in the format DD/MM/YYYY (e.g., 31/12/2023), others use MM/DD/YYYY (e.g., 12/31/2023), and a few even submit data in YYYY/MM/DD format (e.g., 2023/12/31). Additionally, some datasets include currency symbols attached to sales figures (e.g., €500, $500), while others do not.

Problem: The data aggregation team at the corporation’s headquarters needs to consolidate these reports for a global sales analysis. However, the varying date formats lead to incorrect sorting and aggregation of dates. Furthermore, the presence and absence of currency symbols introduce discrepancies in financial figures, complicating the financial analysis.

Resolution: To address these syntactic inconsistencies:

Regular Expressions and Pandas: The data engineers use Python’s pandas library along with regular expressions to standardize the date formats. They implement a function that detects and transforms any date format into a standard ISO format (YYYY-MM-DD). This ensures that all dates are uniformly formatted across the dataset, facilitating correct sorting and aggregation.

import pandas as pd
import re

# Sample data covering the three formats described above
data = {'date': ['31/12/2023', '12/31/2023', '2023/12/31']}
df = pd.DataFrame(data)

# Function to standardize any of the three date formats to ISO (YYYY-MM-DD)
def standardize_dates(date_str):
    if re.match(r'\d{4}/\d{2}/\d{2}$', date_str):  # YYYY/MM/DD
        return pd.to_datetime(date_str, format='%Y/%m/%d').strftime('%Y-%m-%d')
    try:
        # Try DD/MM/YYYY first
        return pd.to_datetime(date_str, format='%d/%m/%Y').strftime('%Y-%m-%d')
    except ValueError:
        # Fall back to MM/DD/YYYY when the first field is not a valid day/month pair
        return pd.to_datetime(date_str, format='%m/%d/%Y').strftime('%Y-%m-%d')

# Applying the function
df['standardized_date'] = df['date'].apply(standardize_dates)
print(df)

Currency Standardization: They employ a script that checks for currency symbols within the sales figures, removes them, and then stores currency information in a separate column. This process not only normalizes the sales data but also preserves the original currency information for future financial analysis.

# Sample sales figures with and without currency symbols (continuing the df above)
df['sales'] = ['€500', '$750', '1200']

def clean_currency(sales_str):
    # Capture the currency symbol, if any, then strip it from the numeric amount
    currency = re.findall(r'[$€]', sales_str)
    currency_type = currency[0] if currency else 'USD'  # Default to USD if no symbol
    amount = float(re.sub(r'[$€]', '', sales_str))
    return currency_type, amount

df['currency'], df['amount'] = zip(*df['sales'].apply(clean_currency))
print(df)

Automated Cleaning Scripts: To prevent these issues in future reports, they develop automated data cleaning scripts that standardize new incoming data as it is uploaded to their central system. These scripts are part of their ETL (Extract, Transform, Load) process, ensuring data is cleaned and standardized before it reaches the analytical team.

def automate_cleaning(data_frame):
    data_frame['standardized_date'] = data_frame['date'].apply(standardize_dates)
    data_frame['currency'], data_frame['amount'] = zip(*data_frame['sales'].apply(clean_currency))
    return data_frame

# Example of automating the process for new data
new_data = pd.DataFrame(...)  # Assume new data is loaded here
cleaned_data = automate_cleaning(new_data)

Outcome: With these measures in place, the corporation successfully standardizes its global sales data, enabling accurate and efficient analysis. The data cleaning processes not only rectify current inconsistencies but also streamline future data handling, saving time and reducing errors in subsequent reports.

Semantic Inconsistencies

Semantic inconsistencies occur when data values fall outside the expected values of a particular domain. Examples include an unexpected value in a ‘gender’ field that falls outside the standard categories, or negative numbers in an ‘age’ field. Such inconsistencies undermine the credibility of data analysis and lead to unreliable results. Addressing them requires applying domain knowledge and data validation rules to ensure values conform to the expected norms.

Handling Unusual Gender Values

Scenario: An international survey collects personal data from respondents worldwide, including a ‘gender’ field. However, instead of the standard responses ‘male’ or ‘female’, some entries include ‘N/A’, ‘unknown’, or various non-binary terms due to differing cultural norms or data entry errors.

Solution: To handle these semantic inconsistencies, the data team decides to standardize the ‘gender’ field to include ‘non-binary’ as an additional standardized category alongside ‘male’ and ‘female’. This approach respects the diversity of responses while maintaining the data’s analytical utility. They implement a Python script using pandas to map and standardize these values:

import pandas as pd

# Sample data
data = {'name': ['Alice', 'Bob', 'Charlie'], 'gender': ['female', 'unknown', 'N/A']}
df = pd.DataFrame(data)

# Standardizing gender entries
gender_map = {'male': 'male', 'female': 'female', 'unknown': 'non-binary', 'N/A': 'non-binary', 'none': 'non-binary'}
df['gender'] = df['gender'].map(gender_map).fillna('non-binary')
print(df)

This script not only standardizes the gender values but also assigns ‘non-binary’ to any unrecognized or missing entries, ensuring the data remains inclusive and analytically valid.

Correcting Invalid Age Entries

Scenario: A healthcare database records patient information, including age. Due to data entry errors, some age fields contain negative numbers or unrealistically high values (e.g., ages above 130 years).

Solution: The data management team uses validation rules to identify and correct these entries. They decide to set a reasonable age range (0 to 130 years) and treat any values outside this range as data entry errors. These entries are then flagged for review or replaced with the median age of the dataset. Here’s how they might implement this in Python:

import pandas as pd

# Sample data with age inconsistencies
data = {'name': ['Alice', 'Bob', 'Charlie'], 'age': [29, -1, 150]}
df = pd.DataFrame(data)

# Correcting invalid age values: replace out-of-range ages with the median of the valid ones
valid_age_range = (0, 130)
median_age = df[(df['age'] >= valid_age_range[0]) & (df['age'] <= valid_age_range[1])]['age'].median()
df['age'] = df['age'].apply(lambda x: median_age if x < valid_age_range[0] or x > valid_age_range[1] else x)
print(df)

This approach ensures that all ages in the dataset are within a plausible range, enhancing the reliability of any subsequent data analysis.

General Approach to Semantic Inconsistencies

For both examples, the key steps involve:

  • Identifying inconsistencies: Using data profiling to discover values that do not conform to expected norms.
  • Applying domain knowledge: Understanding what values are acceptable in the context of the specific field.
  • Implementing validation rules: Coding these rules into data cleaning scripts that standardize or correct values based on the predefined norms.

These examples highlight the importance of semantic validation in maintaining the integrity and utility of data, particularly when dealing with diverse and large-scale datasets.
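
To make these steps concrete, here is a minimal sketch of rule-based validation, assuming a simple hand-written rules dictionary of per-column validity checks (the column names and thresholds are illustrative, not taken from a specific dataset). It flags rows that violate any rule so they can be reviewed rather than silently corrected:

import pandas as pd

# Hypothetical validation rules: column name -> predicate returning True for valid values
rules = {
    'gender': lambda v: v in {'male', 'female', 'non-binary'},
    'age': lambda v: 0 <= v <= 130,
}

def flag_invalid_rows(df, rules):
    # Build a boolean mask marking rows that violate at least one rule
    invalid = pd.Series(False, index=df.index)
    for column, is_valid in rules.items():
        invalid |= ~df[column].apply(is_valid)
    return invalid

# Usage: flag rows for review instead of silently correcting them
df = pd.DataFrame({'gender': ['female', 'N/A'], 'age': [29, -1]})
df['needs_review'] = flag_invalid_rows(df, rules)
print(df)

Keeping the rules in one place makes them easy to extend as new domain constraints are discovered.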

Structural Inconsistencies

Structural inconsistencies concern differences in how data is structured or represented across sources, for example tables with different numbers of columns or different column naming conventions. Effective tactics include using data transformation and mapping tools, such as SQL and ETL processes, and applying data normalization techniques to achieve consistency across datasets.

Structural inconsistencies can pose significant challenges in data integration and analysis, especially when dealing with data from multiple sources. Here are two real-life examples demonstrating how to address these issues effectively:

Handling Different Column Names in Customer Data

Scenario: A company merges customer data from two different systems as part of a consolidation effort. System A uses the column names “FirstName” and “LastName,” while System B uses “GivenName” and “Surname” for similar purposes. Additionally, System B includes an extra column “MiddleInitial” that System A does not have.

Solution: To integrate these datasets into a unified customer database, the data engineering team uses SQL for data transformation and mapping. They create a unified schema in their database where data from both systems can be aligned. Here’s an example SQL query that they might use to standardize and merge the data:

SELECT
    FirstName AS GivenName,
    LastName AS Surname,
    NULL AS MiddleInitial -- Assign NULL for data from System A
FROM SystemA_Users
UNION ALL
SELECT
    GivenName,
    Surname,
    MiddleInitial
FROM SystemB_Users;

This SQL script standardizes the column names and handles the missing “MiddleInitial” by assigning NULL for entries from System A, ensuring the merged dataset is consistent and complete.
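
For teams doing this consolidation in Python rather than SQL, a roughly equivalent pandas sketch could look like the following (the sample rows are purely illustrative, and it assumes the two systems’ extracts have already been loaded as DataFrames):

import pandas as pd

# Illustrative extracts with the column names described above
system_a = pd.DataFrame({'FirstName': ['Alice'], 'LastName': ['Ng']})
system_b = pd.DataFrame({'GivenName': ['Bob'], 'Surname': ['Khan'], 'MiddleInitial': ['J']})

# Align System A to the unified schema and fill the column it lacks
system_a_unified = system_a.rename(columns={'FirstName': 'GivenName', 'LastName': 'Surname'})
system_a_unified['MiddleInitial'] = None  # System A has no middle initials

# Stack both sources into one consistent customer table
customers = pd.concat([system_a_unified, system_b], ignore_index=True)
print(customers)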

Normalizing Product Categories Across Different Retail Systems

Scenario: A retail conglomerate acquires a new company and needs to integrate its product database with existing databases. The new company categorizes products slightly differently and uses a separate encoding scheme.

Solution: The data team decides to normalize the product categories using an ETL (Extract, Transform, Load) process. They map the product categories from both companies to a new, unified set of categories. Here’s a conceptual approach using a data transformation tool:

  1. Extract: Data from both companies’ product databases is extracted.
  2. Transform:
  • Categorization rules are applied. For example, “Outerwear” in one database corresponds to “Jackets” and “Coats” in another.
  • A lookup table is created to map disparate category names to a unified naming convention.
  3. Load: The transformed data is loaded into a new, unified product database.

Here’s a simplified Python snippet that might be part of the transformation step:

import pandas as pd

# Sample product data from the acquired company's database (illustrative)
df = pd.DataFrame({'Product': ['Parka', 'Trench Coat'], 'Category': ['Jackets', 'Coats']})

# Sample mapping dictionary for product categories
category_map = {'Jackets': 'Outerwear', 'Coats': 'Outerwear', 'Outerwear': 'Outerwear'}

# Function to apply mapping
def map_category(category):
    return category_map.get(category, 'Other')  # Default to 'Other' if no match

# Applying the mapping to the dataset
df['UnifiedCategory'] = df['Category'].apply(map_category)
print(df)

This approach ensures that product information from both companies is represented consistently, facilitating accurate reporting and analytics across the newly merged inventory.

General Strategy for Handling Structural Inconsistencies

In both examples, the steps involve:

  • Identifying structural differences: Analyzing the schema and structure of datasets from different sources.
  • Designing a unified schema: Deciding on a common format that all data should conform to.
  • Implementing transformation rules: Using SQL, ETL tools, or programming scripts to map and transform data to fit the unified schema.

These strategies help in effectively managing structural inconsistencies, which is crucial for the integrity and usability of integrated data systems.
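
As a starting point for the first step, a small helper like the sketch below (an assumption about how one might approach it, not a standard library function) can profile which columns each source provides, making the structural gaps visible before a unified schema is designed:

import pandas as pd

def compare_schemas(frames):
    # Build a presence matrix: one row per column, one column per source
    all_columns = sorted(set().union(*(df.columns for df in frames.values())))
    return pd.DataFrame(
        {name: [col in df.columns for col in all_columns] for name, df in frames.items()},
        index=all_columns,
    )

# Usage with two illustrative sources
system_a = pd.DataFrame(columns=['FirstName', 'LastName'])
system_b = pd.DataFrame(columns=['GivenName', 'Surname', 'MiddleInitial'])
print(compare_schemas({'SystemA': system_a, 'SystemB': system_b}))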

Conclusion

The challenges of format and schema mismatches are significant but manageable with the right strategies and tools. Comprehensive data management practices keep an organization’s data reliable and robust, so it can support accurate and insightful decision-making.

Connect with Me on LinkedIn

I’d love to keep the conversation going. If you found this article insightful or have thoughts, experiences, and ideas to share, let’s connect on LinkedIn!

I’m always eager to engage with fellow professionals and enthusiasts in the field.
