ETL — Understanding It and Effectively Using It

Using ETL as an Enabler for Data Warehouses, Data Hubs, and Data Lakes

An Introduction to ETL

Why Do We Need ETL?

  • Offers deep historical context for the business.
  • Enhances Business Intelligence solutions for decision making.
  • Enables context and data aggregations so that the business can generate higher revenue and/or save money.
  • Enables a common data repository.
  • Allows verification of data transformation, aggregation, and calculation rules.
  • Allows sample data comparison between the source and target systems.
  • Helps improve productivity as it codifies and reuses logic without additional technical skills.

A typical ETL cycle moves through the following four stages (a minimal code sketch of how they fit together follows the list):

  1. Data Extraction
  2. Data Cleansing
  3. Transformation
  4. Load
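As a rough illustration only, here is a minimal sketch of the four stages chained together in Python. The function names, the hard-coded sample records, and the in-memory list standing in for the warehouse are assumptions made for the example, not a prescribed implementation.

# A minimal sketch of the four ETL stages chained together.
# The record layout and function names are illustrative assumptions.

def extract():
    # Stage 1: pull raw records from a source system (hard-coded here).
    return [
        {"id": 1, "name": " Alice ", "amount": "100.50"},
        {"id": 2, "name": None, "amount": "abc"},
    ]

def cleanse(records):
    # Stage 2: drop records that cannot be repaired, trim obvious noise.
    cleaned = []
    for r in records:
        if r["name"] is None:
            continue  # reject: required field missing
        r["name"] = r["name"].strip()
        cleaned.append(r)
    return cleaned

def transform(records):
    # Stage 3: convert types and derive new fields.
    for r in records:
        r["amount"] = float(r["amount"])
        r["amount_with_tax"] = round(r["amount"] * 1.1, 2)
    return records

def load(records, target):
    # Stage 4: write the transformed records to the target store.
    target.extend(records)

warehouse = []  # stand-in for the target data warehouse
load(transform(cleanse(extract())), warehouse)
print(warehouse)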

Traversing the Four Stages of ETL — Pointers to Keep in Mind

  1. Know and understand your data source — where you need to extract data from
  2. Audit your data source
  3. Study your approach for optimal data extraction
  4. Choose a suitable cleansing mechanism according to the extracted data
  5. Once the source data has been cleansed, perform the required transformations accordingly
  6. Know and understand your end destination for the data — where it will ultimately reside
  7. Load the data

Data Extraction and Data Cleaning

Data Source

  1. Databases
  2. Flat Files
  3. Web Services
  4. Other Sources such as RSS Feeds
Two concerns to keep in mind when extracting directly from source systems:

  1. Querying directly in the database for a large amount of data may slow down the source system and prevent the database from recording transactions in real time.
  2. Data in the source system may not be optimized for reporting and analysis.
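To make the idea of heterogeneous sources concrete, the sketch below pulls rows from a database table and from a flat file into one common record shape so that both can flow through the same pipeline. The in-memory SQLite database, the CSV content, and the column names are assumptions made purely for illustration.

import csv
import io
import sqlite3

# Source 1: a database table (an in-memory SQLite DB stands in for the real system).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
conn.execute("INSERT INTO orders VALUES (1, 'Acme', 250.0)")

db_rows = [
    {"id": row[0], "customer": row[1], "total": row[2]}
    for row in conn.execute("SELECT id, customer, total FROM orders")
]

# Source 2: a flat file (a CSV string stands in for a file on disk).
flat_file = io.StringIO("id,customer,total\n2,Globex,99.5\n")
file_rows = [
    {"id": int(r["id"]), "customer": r["customer"], "total": float(r["total"])}
    for r in csv.DictReader(flat_file)
]

# Both sources now share one record shape and can be processed together.
extracted = db_rows + file_rows
print(extracted)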

Usage and Latency

Auditing your Source Data

Analyzing Data Extraction

Steps to Perform Extraction

  1. Push Notification: It’s always nice if the source system is able to provide a notification that records have been modified, along with the details of the changes.
  2. Incremental/Full Extract: Some systems may not provide a push notification service but may be able to identify updated records and provide an extract of those records. During further ETL processing, the system needs to identify the changes and propagate them downstream (a watermark-based sketch of incremental extraction follows).
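One common way to implement an incremental extract is a watermark on a last-modified timestamp: each run pulls only rows changed since the previous run and remembers the newest timestamp it saw. The table name, the updated_at column, and the in-memory SQLite database below are assumptions for the sketch.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Alice", "2023-01-01"), (2, "Bob", "2023-02-15"), (3, "Carol", "2023-03-10")],
)

def incremental_extract(conn, last_watermark):
    # Pull only the records modified since the previous run.
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers WHERE updated_at > ?",
        (last_watermark,),
    ).fetchall()
    # The new watermark is the greatest timestamp seen in this batch.
    new_watermark = max((r[2] for r in rows), default=last_watermark)
    return rows, new_watermark

changed, watermark = incremental_extract(conn, "2023-02-01")
print(changed)    # only Bob and Carol are picked up
print(watermark)  # "2023-03-10", persisted for the next run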

Challenges Faced During the Extraction Process

  • Change in data formats over time.
  • Increase in data velocity and volume.
  • Rapid changes to data source credentials.
  • Null issues.
  • Change requests for new columns, dimensions, derivatives and features.
  • Writing source-specific code, which tends to create overhead for the future maintenance of ETL flows.

Data Cleansing Requirements

  • Detection and removal of all major errors and inconsistencies in data, whether dealing with a single source or integrating multiple sources.
  • Correction of mismatches, ensuring that columns are in the same order, and checking that the data is in the same format (such as date and currency).
  • Enriching or improving data by merging in additional information (such as adding data to asset details by combining data from the Purchasing, Sales and Marketing databases) if required.
  • Data cleansing should not be performed in isolation, but together with schema-related data transformations based on comprehensive metadata.
  • Mapping functions for data cleansing should be specified in a declarative way and be reusable for other data sources as well as for query processing (a small sketch of such a declarative rule set follows).
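One way to read the last point is to express cleansing rules as data rather than as source-specific code, so the same rule set can be reused across sources. The column names and the particular rules below are illustrative assumptions.

# Cleansing rules expressed declaratively as column -> list of functions,
# so the same rule set can be reused across different sources.
CLEANSING_RULES = {
    "email": [str.strip, str.lower],
    "country": [str.strip, str.upper],
    "amount": [lambda v: v.replace(",", ""), float],
}

def cleanse(record, rules):
    cleaned = dict(record)
    for column, funcs in rules.items():
        if cleaned.get(column) is None:
            continue  # leave missing values for a later null-handling rule
        for func in funcs:
            cleaned[column] = func(cleaned[column])
    return cleaned

raw = {"email": "  Alice@Example.COM ", "country": " us ", "amount": "1,250.75"}
print(cleanse(raw, CLEANSING_RULES))
# {'email': 'alice@example.com', 'country': 'US', 'amount': 1250.75}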

Data Cleansing Problems

Potential Problems with Data from a Single Source

  • Uniqueness
  • Misspelling
  • Redundancy/Duplicates
  • Outside domain range
  • Data entry errors
  • Referential integrity
  • Contradictory values
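Several of these single-source problems can be surfaced with simple profiling checks before cleansing begins. The column names, the age domain range, the reference set of department ids, and the sample records below are assumptions for the sketch.

from collections import Counter

records = [
    {"id": 1, "age": 34, "dept_id": 10},
    {"id": 2, "age": 212, "dept_id": 99},   # outside domain range, unknown department
    {"id": 2, "age": 45, "dept_id": 10},    # duplicate key
]
valid_dept_ids = {10, 20, 30}

# Uniqueness / duplicates: keys that appear more than once.
key_counts = Counter(r["id"] for r in records)
duplicates = [k for k, n in key_counts.items() if n > 1]

# Outside domain range: ages that are not plausible.
out_of_range = [r["id"] for r in records if not 0 <= r["age"] <= 120]

# Referential integrity: department ids with no match in the reference set.
orphans = [r["id"] for r in records if r["dept_id"] not in valid_dept_ids]

print(duplicates)    # [2]
print(out_of_range)  # [2]
print(orphans)       # [2]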

Potential Problems with Data from Multiple Sources

  • Naming conflicts at the schema level — using the same name for different things, or different names for the same thing
  • Structural conflicts
  • Inconsistent aggregation
  • Inconsistent timing
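Naming and structural conflicts across sources are often handled with a per-source mapping onto one canonical schema. The source names, column mappings, and the fixed currency conversion below are assumptions made only to illustrate the idea.

# One canonical schema, plus a per-source mapping that renames columns and
# normalizes units so records from both sources can be combined consistently.
SOURCE_MAPPINGS = {
    "crm": {"cust_name": "customer", "rev_usd": "revenue_usd"},
    "billing": {"client": "customer", "revenue_eur": "revenue_usd"},
}
EUR_TO_USD = 1.1  # assumed fixed rate purely for illustration

def to_canonical(record, source):
    mapped = {SOURCE_MAPPINGS[source].get(k, k): v for k, v in record.items()}
    if source == "billing":
        mapped["revenue_usd"] = round(mapped["revenue_usd"] * EUR_TO_USD, 2)
    return mapped

print(to_canonical({"cust_name": "Acme", "rev_usd": 1000.0}, "crm"))
print(to_canonical({"client": "Acme GmbH", "revenue_eur": 900.0}, "billing"))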

Data Cleansing Approach

Data Analysis

Data Profiling

Data Mining

Defining Transformations and Mapping Rules

Verification

Transformation

Backflow and Cleaned Data

Data Transformation

  1. Multistage Data Transformation: In this process, extracted data is moved to an intermediate area (staging) where transformation occurs prior to loading the data into the final target area (data warehouse).
  2. In-Warehouse Data Transformation: In this process, the flow is ELT (Extract, Load, and then Transform). The extracted data is loaded into the data warehouse and the transformation occurs there (a sketch of this follows).
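The ELT case can be sketched as loading the raw extract as-is into a staging table and running the transformation as SQL inside the warehouse. SQLite stands in for the warehouse here, and the table and column names are assumptions; real warehouses have their own bulk-load and SQL engines.

import sqlite3

warehouse = sqlite3.connect(":memory:")  # stand-in for the data warehouse

# Load: the raw extract is written to a staging table without any reshaping.
warehouse.execute("CREATE TABLE stg_sales (region TEXT, amount REAL)")
warehouse.executemany(
    "INSERT INTO stg_sales VALUES (?, ?)",
    [("north", 100.0), ("north", 150.0), ("south", 80.0)],
)

# Transform: the aggregation happens inside the warehouse after the load (ELT).
warehouse.execute("""
    CREATE TABLE sales_by_region AS
    SELECT region, SUM(amount) AS total_amount
    FROM stg_sales
    GROUP BY region
""")

print(warehouse.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall())
# [('north', 250.0), ('south', 80.0)]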

Basic Transformation

  • Format Standardization: Standardize the data type and length according to the field format to make it easy for end users to retrieve data.
  • Cleaning: Includes mapping of values to some derived/short meaning, like mapping ‘Male’ to ‘M’, null to ‘0’, etc.
  • Deduplication: Involves removing duplicate values.
  • Constraints Implementation: Establishment of key relationships across tables.
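The first three basic transformations can be sketched in a few lines: standardizing types and date formats, mapping values like ‘Male’ to ‘M’, and dropping duplicate rows. The field names, the mapping table, and the sample records are assumptions for the example.

GENDER_MAP = {"Male": "M", "Female": "F", None: "0"}

raw = [
    {"id": "001", "gender": "Male", "signup": "2023-1-5"},
    {"id": "001", "gender": "Male", "signup": "2023-1-5"},   # duplicate row
    {"id": "002", "gender": None, "signup": "2023-2-19"},
]

def transform(records):
    seen, result = set(), []
    for r in records:
        row = {
            "id": int(r["id"]),                                   # standardize data type
            "gender": GENDER_MAP.get(r["gender"], r["gender"]),   # clean/map values
            "signup": "-".join(p.zfill(2) for p in r["signup"].split("-")),  # standardize date format
        }
        key = tuple(row.values())
        if key in seen:
            continue  # deduplication
        seen.add(key)
        result.append(row)
    return result

print(transform(raw))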

Advanced Transformation

  • Decoding of Fields: Data coming from multiple sources is often described by varying field values, and legacy source systems frequently use fairly cryptic codes to represent business values. This makes it necessary to consolidate fields carrying similar information and to change obscure codes into values that make business sense to the users who consume the data.
  • Merging of Information: It’s common to merge related fields together and view the merged fields as a single entity, e.g. product, product price, product type, description, etc.
  • Splitting Single Fields: Splitting a single large text field into multiple fields for easier consumption, e.g. splitting a full name into first_name, middle_name and last_name.
  • Calculated and Derived Values: At times, an aggregation can be required on the dataset before loading it to a Data Warehouse, e.g. calculating total cost and profit margin.
  • Summarization: Values are summarized to obtain a total figure, which is subsequently calculated and stored at multiple levels as a business fact in multidimensional tables.
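A few of these advanced transformations side by side: splitting a full name into parts, deriving a profit margin, and summarizing profit at the product-type level. The field names, the margin formula, and the sample rows are assumptions for the sketch.

from collections import defaultdict

sales = [
    {"full_name": "Jane Q Doe", "product_type": "Widget", "revenue": 120.0, "cost": 80.0},
    {"full_name": "John Smith", "product_type": "Widget", "revenue": 200.0, "cost": 150.0},
]

summary = defaultdict(float)
for row in sales:
    # Splitting a single field into multiple fields.
    parts = row["full_name"].split()
    row["first_name"] = parts[0]
    row["last_name"] = parts[-1]
    row["middle_name"] = " ".join(parts[1:-1])

    # Calculated and derived values.
    row["profit"] = row["revenue"] - row["cost"]
    row["margin_pct"] = round(row["profit"] / row["revenue"] * 100, 1)

    # Summarization: roll profit up to the product-type level.
    summary[row["product_type"]] += row["profit"]

print(sales[0]["first_name"], sales[0]["middle_name"], sales[0]["last_name"])
print(dict(summary))  # {'Widget': 90.0}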

Data Loading

Aggregation

An Effective Loading Process

Referential Integrity Constraints

Other Considerations for Effective Loading

  • Indexes should be removed before loading data into the target. They may be rebuilt after loading.
  • Manage partitions. The most recommended strategy is to partition tables by a date interval such as a year, month or quarter, or by another attribute such as status or department.
  • In the case of incremental loading, the database needs to be synchronized with the source system. An incremental load is a more complex task than a full/historical load.
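The index advice and the incremental load can be sketched together, with SQLite standing in for the target. The table, index, and upsert statement below are assumptions for illustration (the upsert syntax needs a reasonably recent SQLite, and real warehouses have their own bulk-load and merge mechanisms).

import sqlite3

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE fact_sales (id INTEGER PRIMARY KEY, amount REAL)")
target.execute("CREATE INDEX idx_amount ON fact_sales(amount)")

batch = [(1, 100.0), (2, 250.0), (1, 120.0)]  # id 1 arrives again with a new amount

# Drop the secondary index before the bulk load, then rebuild it afterwards.
target.execute("DROP INDEX idx_amount")

# Incremental load: insert new rows, update existing ones (upsert).
target.executemany(
    """
    INSERT INTO fact_sales (id, amount) VALUES (?, ?)
    ON CONFLICT(id) DO UPDATE SET amount = excluded.amount
    """,
    batch,
)

target.execute("CREATE INDEX idx_amount ON fact_sales(amount)")
print(target.execute("SELECT * FROM fact_sales ORDER BY id").fetchall())
# [(1, 120.0), (2, 250.0)]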

Challenges with Incremental Loads

  • Ordering: To handle large amounts of data with high availability, data pipelines often leverage a distributed systems approach, which implies that data may be processed in a different order than it was received. If data is deleted or updated, processing in the wrong order will lead to data errors, so maintaining ordering is crucial for keeping data accurate.
  • Schema Evaluation: It is necessary to evaluate the source schema at the time of loading the data to ensure data consistency.
  • Monitoring Capability: Data coming from a variety of sources presents complexities and potential failures — an API being unavailable, network congestion or failure, API credential expiration, data incompleteness or inaccuracy. Monitoring is critical because recovering from these issues can be complex.
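The ordering concern is often handled by carrying a sequence number (or a reliable timestamp) on every change event and applying events in that order so that deletes and updates land correctly. The event shape below is an assumption for the sketch.

# Change events may arrive out of order from a distributed pipeline.
events = [
    {"seq": 3, "op": "delete", "id": 1, "data": None},
    {"seq": 1, "op": "insert", "id": 1, "data": {"name": "Alice"}},
    {"seq": 2, "op": "update", "id": 1, "data": {"name": "Alicia"}},
]

state = {}
# Applying events in sequence order keeps the target consistent:
# insert -> update -> delete, even though the delete arrived first.
for event in sorted(events, key=lambda e: e["seq"]):
    if event["op"] == "delete":
        state.pop(event["id"], None)
    else:
        state[event["id"]] = event["data"]

print(state)  # {} — record 1 was correctly deleted last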

Final Thoughts
