ETL Process

Supun Bandara
Aug 30, 2019

It’s all about business growth. Every business generates a huge amount of data from various sources, and by analyzing this data it can gain better insights that drive that growth.

To achieve growth, every business needs insights into how the organization itself is doing as well as how it is performing in the market. To gain those insights it needs to gather data from different sources: social trends, data from inside the organization, market research, customer feedback, and so on.

Once you have this data, the right kind of analysis yields valuable business insights, which in turn give you a guiding path to lead your business to success.

Where do we need ETL?

The problem is that we gather data from various sources, and those sources cannot be visualized directly. The most reliable solution to this problem is business intelligence: a set of techniques and tools for gathering data from various sources and storing it in a warehouse, where it can be used for different analytical purposes and is easier for end business users to understand.

To achieve a single data source for insights, we first need to combine the data from the different sources and filter it according to the business insights we want. That filtered data is then stored in the warehouse. This is where the ETL process comes into the picture.

What is ETL?

Extract, Transform and Load (ETL) is the process of extracting data from various sources, transforming it to meet business requirements, and loading it into the target data warehouse.
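As a rough end-to-end illustration, the whole cycle can be sketched in a few lines of Python with pandas and SQLite. This is only a minimal sketch under assumed names: the source file sales_export.csv, the warehouse database warehouse.db, the fact_sales table and the column names are all invented for the example.

```python
import sqlite3

import pandas as pd

# Extract: read raw data from a source system (here, a hypothetical CSV export).
raw = pd.read_csv("sales_export.csv")

# Transform: cleanse and reshape the data to meet the business requirement.
clean = raw.dropna(subset=["order_id"]).copy()              # drop rows missing the key
clean["order_date"] = pd.to_datetime(clean["order_date"])   # standardize the date format
clean["revenue"] = clean["quantity"] * clean["unit_price"]  # derive a new measure

# Load: write the result into the target data warehouse table.
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("fact_sales", conn, if_exists="append", index=False)
```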

Extract

In this step, data is extracted from the source system into the staging area. Any transformations are done in the staging area so that the performance of the source system is not degraded. Also, if corrupted data were copied directly from the source into the data warehouse database, rollback would be a challenge. The staging area allows the extracted data to be validated before it moves into the data warehouse. As a best practice, we normally keep a layer between the source and the destination to perform transformations on the extracted data.

Three major data extraction strategies are used in the industry (a query-level sketch follows the list):

  • Full Extraction

Extract all the data to the staging area without applying any condition.

  • Partial Extraction — With Update Notification

Extract only the data for which the source sends an update notification; if already-extracted data has been modified, only the modified data is extracted to the staging area.

  • Partial Extraction — Without Update Notification

Extract data according to a loading condition; for example, a daily scheduler extracts only that day’s data, and newly added data is extracted on its respective day.
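The difference between these strategies is easiest to see in the query that pulls data from the source. Below is a small sketch assuming a hypothetical source database with an orders table that carries a last_modified column; the database names and the last_run watermark are placeholders.

```python
import sqlite3

import pandas as pd

# Hypothetical source database with an "orders" table and a last_modified column.
source = sqlite3.connect("source_system.db")

# Full Extraction: pull everything into the staging area, no condition applied.
full = pd.read_sql_query("SELECT * FROM orders", source)

# Partial Extraction (without update notification): pull only rows changed since
# the last run, e.g. a daily scheduler extracting only that day's data.
last_run = "2019-08-29"
partial = pd.read_sql_query(
    "SELECT * FROM orders WHERE last_modified > ?",
    source,
    params=(last_run,),
)

# Land the extracted rows in the staging area for later transformation.
staging = sqlite3.connect("staging.db")
partial.to_sql("stg_orders", staging, if_exists="append", index=False)
```

In practice the watermark (last_run here) is usually persisted between runs so each extraction picks up exactly where the previous one stopped.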

Transform

The data extracted into the staging area/tables is raw data and cannot be used as-is for insights. Depending on the business scenarios it needs to be cleansed, mapped and transformed. As mentioned earlier, the business scenarios may call for many kinds of transformation, but there are also some basic operations we always cover in data transformation; a short pandas sketch follows the list below.

Basic / Business Transformation techniques

  • Selection & Matching — as we ingest data from various sources, we may need to select only the relevant data. While selecting data we must match columns against the other sources to find the related data.
  • Cleansing/Enrichment — here we mainly focus on removing null values, fixing formatting issues, schema-level validations, value enrichment, and data consolidation or summarization.
  • Data Standardization — applied according to the business scenarios; for example, standardizing a date and time format, job title formats, etc.
  • Splitting and Merging — splitting one column into several, or merging several columns into one.
  • Calculated and Derived Data — this technique has a huge impact on insights. We may need aggregate values, conversions, or unit-of-measure changes to highlight in an insight; we apply this technique according to the insights we plan to build.
  • File Type Change — in most big data solutions, the data is finally transformed into the Parquet format, an open-source file format for Hadoop. Parquet stores nested data structures in a flat columnar format. Compared with a traditional row-oriented approach, Parquet is more efficient in terms of storage and performance.
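A few of the techniques above, applied with pandas, might look like the following sketch. The staging file and all column names are invented for the example, and writing Parquet assumes a Parquet engine such as pyarrow is installed.

```python
import pandas as pd

# Hypothetical staging data produced by the extraction step.
stg = pd.read_parquet("stg_orders_raw.parquet")

# Cleansing/Enrichment: drop rows with no customer and fill missing country codes.
stg = stg.dropna(subset=["customer_id"]).copy()
stg["country"] = stg["country"].fillna("UNKNOWN")

# Standardization: one date format and one job-title format everywhere.
stg["order_date"] = pd.to_datetime(stg["order_date"]).dt.strftime("%Y-%m-%d")
stg["job_title"] = stg["job_title"].str.title()

# Splitting: break a full name into first-name and last-name columns.
stg[["first_name", "last_name"]] = stg["full_name"].str.split(" ", n=1, expand=True)

# Calculated/derived data: aggregate revenue per customer for the insight layer.
summary = stg.groupby("customer_id", as_index=False)["revenue"].sum()

# File type change: persist the result in the columnar Parquet format.
summary.to_parquet("customer_revenue.parquet", index=False)
```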

Loading

In this phase, we fetch the transformed data from the staging area and apply it to the data warehouse for storage. Loading is usually handled in one of three strategic ways, sketched after the list:

  • Initial Load — populate the data warehouse tables for the first time.
  • Incremental Load — apply ongoing changes periodically.
  • Full Refresh Load — truncate one or more data warehouse tables and reload them with fresh data.
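In SQL terms the three strategies translate roughly as follows. This sketch assumes a SQLite warehouse and a hypothetical staging table named stg_sales; each block is an alternative strategy, not a sequence to run in one go.

```python
import sqlite3

import pandas as pd

staging = sqlite3.connect("staging.db")
warehouse = sqlite3.connect("warehouse.db")
staged = pd.read_sql_query("SELECT * FROM stg_sales", staging)

# Initial Load: the warehouse table is created and populated for the first time.
staged.to_sql("fact_sales", warehouse, if_exists="fail", index=False)

# Incremental Load: on each scheduled run, append only the ongoing changes.
staged.to_sql("fact_sales", warehouse, if_exists="append", index=False)

# Full Refresh Load: truncate the table, then reload it with fresh data.
warehouse.execute("DELETE FROM fact_sales")
staged.to_sql("fact_sales", warehouse, if_exists="append", index=False)
warehouse.commit()
```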

When it comes to production-level design, as a best practice we normally run the ETL process in two hops, Source to Staging and Staging to Data Warehouse (see the sketch after the list below).

  • Source to Staging — Do the basic transformation
  • Staging to warehouse — Do the business level transformation
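Put together, the two hops can be wired as two small functions, each owning its own level of transformation. This is only a sketch of the pattern, reusing the hypothetical tables from the earlier snippets, not a full production design.

```python
import sqlite3

import pandas as pd

def source_to_staging():
    """Source -> Staging: extract and apply only the basic transformation."""
    raw = pd.read_sql_query("SELECT * FROM orders",
                            sqlite3.connect("source_system.db"))
    raw = raw.dropna(subset=["order_id"])  # basic cleansing only
    raw.to_sql("stg_orders", sqlite3.connect("staging.db"),
               if_exists="replace", index=False)

def staging_to_warehouse():
    """Staging -> Warehouse: apply the business-level transformation."""
    stg = pd.read_sql_query("SELECT * FROM stg_orders",
                            sqlite3.connect("staging.db"))
    fact = stg.groupby("customer_id", as_index=False)["revenue"].sum()
    fact.to_sql("fact_customer_revenue", sqlite3.connect("warehouse.db"),
                if_exists="replace", index=False)

if __name__ == "__main__":
    source_to_staging()
    staging_to_warehouse()
```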

What are the ETL tools around us?

There are a great many ETL tools around the world. Among them, I’ll list some of the more popular ETL tools in the current industry.

1. Incumbent batch ETL tools

  • IBM InfoSphere DataStage
  • Informatica PowerCenter
  • Microsoft SSIS
  • Oracle Data Integrator

2. Cloud-native ETL tools

  • Alooma
  • Fivetran
  • Matillion
  • Snaplogic
  • Stitch Data

3. Open source ETL tools

  • Apache Airflow
  • Apache Kafka
  • Apache NiFi
  • Talend Open Studio

4. Real-time ETL tools

  • Alooma
  • Confluent
  • StreamSets
  • Striim
