ETL Process.

Hansagee Savindre Jayawickrama
Aug 22, 2019

What is the ETL process?

ETL stands for Extract, Transform, and Load. It is the process used for data migration.
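To make the three stages concrete, here is a minimal sketch in plain Python. The CSV file, the customers table, and the field names are all hypothetical, used only for illustration:

```python
# A minimal sketch of the three ETL stages as plain Python functions.
# File names, table name, and fields are hypothetical.
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from a source CSV file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: cast types and normalize the name field."""
    return [(int(r["id"]), r["name"].strip().title()) for r in rows]

def load(rows, db_path):
    """Load: write transformed rows into a target SQLite table."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS customers (id INTEGER, name TEXT)")
    con.executemany("INSERT INTO customers VALUES (?, ?)", rows)
    con.commit()
    con.close()

load(transform(extract("customers.csv")), "warehouse.db")
```

Real pipelines replace each function with far more machinery, but the shape stays the same: extract produces raw records, transform reshapes them, load writes them to the target.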

What is Data Migration?

Data migration is the process of moving data from one location to another. Transform and load steps occur in any data migration.

(Image source: https://databricks.com/glossary/extract-transform-load)

Extraction :-

The first step is to extract data from the identified sources. This data can be of different types, and the sources can be database systems and applications. These sources often do not contain well-organized data.

The extraction process usually has to extract more data than necessary, since it is difficult to identify the exact subset of relevant data up front. Therefore, identifying the consistent, relevant data is done at a later stage.

Some transformations may be completed during the extraction process, depending on the capabilities of the source systems. The amount of extracted data varies from hundreds of kilobytes up to gigabytes, depending on the source system and the business situation.

The time required also varies, from days to hours or even minutes. Extraction from web server log files, for example, can be completed in a very short time.
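As a small illustration of that kind of extraction, here is a sketch that parses web server log lines into structured records. It assumes the Apache common log format; the regex and the sample line are illustrative only:

```python
# A sketch of extracting structured records from web server log lines,
# assuming the Apache common log format (an assumption, not a requirement).
import re

LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def extract_log_records(lines):
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m:  # skip lines that do not parse
            yield m.groupdict()

sample = '127.0.0.1 - - [22/Aug/2019:10:00:00 +0000] "GET / HTTP/1.1" 200 512'
print(next(extract_log_records([sample])))
```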

Through the extraction process, data can be consolidated, refined, and processed, and then stored in a centralized location that supports the transform step.

Three data extraction methods (a sketch of two of them follows the list) :-

  1. Partial extraction
  2. Partial extraction with update notification
  3. Full extraction
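Here is a sketch contrasting methods 1 and 3, assuming a hypothetical orders table with an updated_at timestamp column (method 2 is the same idea as method 1, except the source system pushes a notification of what changed):

```python
# A sketch contrasting full and partial (incremental) extraction.
# The "orders" table and "updated_at" column are hypothetical (SQLite).
import sqlite3

def full_extract(con):
    """Full extraction: pull every row, regardless of when it changed."""
    return con.execute("SELECT * FROM orders").fetchall()

def partial_extract(con, since):
    """Partial extraction: pull only rows changed after the last run."""
    return con.execute(
        "SELECT * FROM orders WHERE updated_at > ?", (since,)
    ).fetchall()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, "2019-08-01"), (2, "2019-08-21")])
print(len(full_extract(con)))                    # 2 rows
print(len(partial_extract(con, "2019-08-15")))   # 1 row
```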

Examples of data sources:

  1. Existing databases and legacy systems
  2. Cloud, hybrid, and on-premises environments
  3. Sales and marketing applications
  4. Mobile devices and apps
  5. CRM systems

Transform :-

The data extracted in the previous step is often raw and not usable in its original form. The transformation step solves this: it is a series of steps that converts raw data into the desired format. Before transforming, the data must be cleansed by handling inconsistent or missing values.
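A minimal sketch of this cleansing step, with hypothetical data and field names: rows with missing values are dropped, and inconsistent spellings are normalized to a single form.

```python
# A sketch of cleansing: drop rows with missing values and normalize
# inconsistent spellings. All data and field names are hypothetical.
raw = [
    {"id": 1, "country": "USA"},
    {"id": 2, "country": "U.S.A."},
    {"id": 3, "country": None},      # missing value: dropped below
]

NORMALIZE = {"U.S.A.": "USA", "United States": "USA"}

cleansed = [
    {**row, "country": NORMALIZE.get(row["country"], row["country"])}
    for row in raw
    if row["country"] is not None
]
print(cleansed)  # [{'id': 1, 'country': 'USA'}, {'id': 2, 'country': 'USA'}]
```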

Once the data is cleansed, transformation follows these steps (the mapping step is sketched after the list):

  1. Data discovery
  2. Data mapping
  3. Generating code
  4. Executing the code
  5. Review
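Here is a small sketch of the data mapping step and the code that executes it: a declarative source-to-target mapping that renames fields and applies type conversions. The field names are hypothetical.

```python
# A sketch of data mapping: each source field maps to a target field
# plus a converter. Source and target field names are hypothetical.
MAPPING = {
    "cust_nm": ("customer_name", str.strip),
    "ord_amt": ("order_amount", float),
    "ord_dt":  ("order_date", str),
}

def apply_mapping(source_row):
    """Execute the mapping: rename each field and apply its converter."""
    return {target: convert(source_row[src])
            for src, (target, convert) in MAPPING.items()}

row = {"cust_nm": "  Jane ", "ord_amt": "19.99", "ord_dt": "2019-08-22"}
print(apply_mapping(row))
# {'customer_name': 'Jane', 'order_amount': 19.99, 'order_date': '2019-08-22'}
```

Keeping the mapping declarative, separate from the code that runs it, is what makes the generate/execute/review cycle repeatable.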

There are some additional steps that may be used in customized operations (a few are sketched after the list):

  1. Filtering
  2. Merging
  3. Splitting a column into multiple columns and vice versa
  4. Joining together data from multiple sources
  5. Removing duplicate data
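A few of these operations sketched with pandas (my choice of tool here, not something the list prescribes), on hypothetical data:

```python
# A sketch of filtering, splitting a column, joining, and deduplicating,
# using pandas on hypothetical data.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "customer": ["ann smith", "bob jones", "bob jones", "cy wu"],
    "amount":   [10.0, 25.0, 25.0, 5.0],
})
customers = pd.DataFrame({"customer": ["ann smith", "bob jones", "cy wu"],
                          "segment":  ["retail", "wholesale", "retail"]})

df = orders.drop_duplicates()                  # removing duplicate data
df = df[df["amount"] > 6.0].copy()             # filtering
df[["first", "last"]] = df["customer"].str.split(" ", expand=True)  # splitting
df = df.merge(customers, on="customer")        # joining another source
print(df)
```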

Before executing the final step, data is loaded into a staging database. This is useful for quick investigation if something does not go as planned: data issues can be identified and repaired at this stage.
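For example, a quick sanity check against a staging table might look like this sketch (SQLite, and the table and check are hypothetical):

```python
# A sketch of a sanity check run while data is still in staging,
# before the final load. Table and check are hypothetical (SQLite).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE staging_orders (id INTEGER, amount REAL)")
con.executemany("INSERT INTO staging_orders VALUES (?, ?)",
                [(1, 10.0), (2, None)])

# Investigate issues while the data is still in staging:
bad = con.execute(
    "SELECT COUNT(*) FROM staging_orders WHERE amount IS NULL"
).fetchone()[0]
if bad:
    print(f"{bad} row(s) with missing amount; repair before the final load")
```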

Load :-

This is the final step in the ETL process. The converted data in the staging database is transferred into the target database. Data can be loaded all at once (full loading) or at scheduled intervals (incremental loading).
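A sketch contrasting the two loading modes, using SQLite and hypothetical staging and target tables:

```python
# A sketch of full vs. incremental loading from a staging table into a
# target table. Tables and the loaded_at column are hypothetical (SQLite).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE staging_orders (id INTEGER, loaded_at TEXT)")
con.execute("CREATE TABLE target_orders  (id INTEGER, loaded_at TEXT)")
con.executemany("INSERT INTO staging_orders VALUES (?, ?)",
                [(1, "2019-08-20"), (2, "2019-08-22")])

def full_load():
    """Full loading: replace the target's contents in one shot."""
    con.execute("DELETE FROM target_orders")
    con.execute("INSERT INTO target_orders SELECT * FROM staging_orders")

def incremental_load(since):
    """Incremental loading: append only rows staged after the last run."""
    con.execute("INSERT INTO target_orders "
                "SELECT * FROM staging_orders WHERE loaded_at > ?", (since,))

incremental_load("2019-08-21")
print(con.execute("SELECT COUNT(*) FROM target_orders").fetchone()[0])  # 1
```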

ETL tools :-

Incumbent batch ETL tools:-

  1. Microsoft SSIS
  2. Oracle Data Integrator
  3. IBM InfoSphere DataStage

Cloud native ETL tools:-

  1. Alooma
  2. Matillion
  3. SnapLogic

Open source ETL tools:-

  1. Apache Airflow
  2. Apache Kafka
  3. Talend Open Studio
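Of the open source tools above, Apache Airflow is a popular choice for orchestrating ETL pipelines. A minimal sketch of an ETL DAG, assuming Airflow 2.x and its TaskFlow API (which is newer than the Airflow available when this article was written); the task bodies are placeholders:

```python
# A minimal sketch of an ETL pipeline as an Apache Airflow DAG,
# assuming Airflow 2.x with the TaskFlow API. Task bodies are placeholders.
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule_interval="@daily", start_date=datetime(2019, 8, 22),
     catchup=False)
def etl_pipeline():
    @task
    def extract():
        return [{"id": 1, "amount": "100"}]   # stand-in for a real source

    @task
    def transform(rows):
        return [{**r, "amount": int(r["amount"])} for r in rows]

    @task
    def load(rows):
        print(f"loading {len(rows)} row(s) into the target")

    load(transform(extract()))

etl_pipeline()
```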

Real-time ETL tools:-

  1. Alooma
  2. Confluent
  3. Striim

If you are interested in learning more about this, resources such as the Databricks glossary linked above provide more detail.

Written by Hansagee Savindre Jayawickrama

Graduated from Sri Lanka Institute of Information Technology (SLIIT) and working as a Senior Software Engineer at IFS R&D Sri Lanka.