ETL Process.

Hansagee Savindre Jayawickrama
Aug 22, 2019

What is the ETL process?

ETL stands for Extract, Transform, and Load. It is the process used for data migration.
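To make the three stages concrete, here is a minimal sketch in plain Python. The CSV file, the customers table, and the field names are all hypothetical, used only for illustration:

```python
# A minimal sketch of the three ETL stages as plain Python functions.
# File names, table name, and fields are hypothetical.
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from a source CSV file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: cast types and normalize the name field."""
    return [(int(r["id"]), r["name"].strip().title()) for r in rows]

def load(rows, db_path):
    """Load: write transformed rows into a target SQLite table."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS customers (id INTEGER, name TEXT)")
    con.executemany("INSERT INTO customers VALUES (?, ?)", rows)
    con.commit()
    con.close()

load(transform(extract("customers.csv")), "warehouse.db")
```

Real pipelines replace each function with far more machinery, but the shape stays the same: extract produces raw records, transform reshapes them, load writes them to the target.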

What is Data Migration?

Data migration is the process of moving data from one location to another. Transform and load steps occur in any data migration.

(Image source: https://databricks.com/glossary/extract-transform-load)

Extraction :-

The first step is to extract data from the identified sources. This data can be of different types, and the sources can be database systems and applications. These sources often do not contain well-organized data.

The extraction process usually has to extract more data than necessary, since it is difficult to identify the exact subset of relevant data up front. Therefore, identifying the consistent, relevant data is done at a later stage.

Some transformations may be completed during the extraction process, depending on the capabilities of the source systems. The amount of extracted data varies from hundreds of kilobytes up to gigabytes, depending on the source system and the business situation.

The time required also varies, from days to hours or even minutes. Extraction from web server log files, for example, can be completed in a very short time.
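As a small illustration of that kind of extraction, here is a sketch that parses web server log lines into structured records. It assumes the Apache common log format; the regex and the sample line are illustrative only:

```python
# A sketch of extracting structured records from web server log lines,
# assuming the Apache common log format (an assumption, not a requirement).
import re

LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def extract_log_records(lines):
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m:  # skip lines that do not parse
            yield m.groupdict()

sample = '127.0.0.1 - - [22/Aug/2019:10:00:00 +0000] "GET / HTTP/1.1" 200 512'
print(next(extract_log_records([sample])))
```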

Through the extraction process, data can be consolidated, refined, and processed, and then stored in a centralized location that supports the transform step.

Three data extraction methods (a sketch of two of them follows the list) :-

  1. Partial extraction
  2. Partial extraction with update notification
  3. Full extraction
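Here is a sketch contrasting methods 1 and 3, assuming a hypothetical orders table with an updated_at timestamp column (method 2 is the same idea as method 1, except the source system pushes a notification of what changed):

```python
# A sketch contrasting full and partial (incremental) extraction.
# The "orders" table and "updated_at" column are hypothetical (SQLite).
import sqlite3

def full_extract(con):
    """Full extraction: pull every row, regardless of when it changed."""
    return con.execute("SELECT * FROM orders").fetchall()

def partial_extract(con, since):
    """Partial extraction: pull only rows changed after the last run."""
    return con.execute(
        "SELECT * FROM orders WHERE updated_at > ?", (since,)
    ).fetchall()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, "2019-08-01"), (2, "2019-08-21")])
print(len(full_extract(con)))                    # 2 rows
print(len(partial_extract(con, "2019-08-15")))   # 1 row
```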

Examples of data sources:

  1. Existing databases and legacy systems
  2. Cloud, hybrid, and on-premises environments
  3. Sales and marketing applications
  4. Mobile devices and apps
  5. CRM systems

Transform :-

The data extracted in the previous step is often raw and not usable in its original form. The transformation step solves this: it is a series of steps that converts raw data into the desired format. Before transforming, the data must be cleansed by handling inconsistent or missing values.
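A minimal sketch of this cleansing step, with hypothetical data and field names: rows with missing values are dropped, and inconsistent spellings are normalized to a single form.

```python
# A sketch of cleansing: drop rows with missing values and normalize
# inconsistent spellings. All data and field names are hypothetical.
raw = [
    {"id": 1, "country": "USA"},
    {"id": 2, "country": "U.S.A."},
    {"id": 3, "country": None},      # missing value: dropped below
]

NORMALIZE = {"U.S.A.": "USA", "United States": "USA"}

cleansed = [
    {**row, "country": NORMALIZE.get(row["country"], row["country"])}
    for row in raw
    if row["country"] is not None
]
print(cleansed)  # [{'id': 1, 'country': 'USA'}, {'id': 2, 'country': 'USA'}]
```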

Once the data is cleansed, transformation follows these steps (the mapping step is sketched after the list):

  1. Data discovery
  2. Data mapping
  3. Generating code
  4. Executing the code
  5. Review
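Here is a small sketch of the data mapping step and the code that executes it: a declarative source-to-target mapping that renames fields and applies type conversions. The field names are hypothetical.

```python
# A sketch of data mapping: each source field maps to a target field
# plus a converter. Source and target field names are hypothetical.
MAPPING = {
    "cust_nm": ("customer_name", str.strip),
    "ord_amt": ("order_amount", float),
    "ord_dt":  ("order_date", str),
}

def apply_mapping(source_row):
    """Execute the mapping: rename each field and apply its converter."""
    return {target: convert(source_row[src])
            for src, (target, convert) in MAPPING.items()}

row = {"cust_nm": "  Jane ", "ord_amt": "19.99", "ord_dt": "2019-08-22"}
print(apply_mapping(row))
# {'customer_name': 'Jane', 'order_amount': 19.99, 'order_date': '2019-08-22'}
```

Keeping the mapping declarative, separate from the code that runs it, is what makes the generate/execute/review cycle repeatable.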

There are some additional steps that may be used in customized operations (a few are sketched after the list):

  1. Filtering
  2. Merging
  3. Splitting a column into multiple columns and vice versa
  4. Joining together data from multiple sources
  5. Removing duplicate data
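A few of these operations sketched with pandas (my choice of tool here, not something the list prescribes), on hypothetical data:

```python
# A sketch of filtering, splitting a column, joining, and deduplicating,
# using pandas on hypothetical data.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "customer": ["ann smith", "bob jones", "bob jones", "cy wu"],
    "amount":   [10.0, 25.0, 25.0, 5.0],
})
customers = pd.DataFrame({"customer": ["ann smith", "bob jones", "cy wu"],
                          "segment":  ["retail", "wholesale", "retail"]})

df = orders.drop_duplicates()                  # removing duplicate data
df = df[df["amount"] > 6.0].copy()             # filtering
df[["first", "last"]] = df["customer"].str.split(" ", expand=True)  # splitting
df = df.merge(customers, on="customer")        # joining another source
print(df)
```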

Before executing the final step, data is loaded into a staging database. This is useful for quick investigation if something does not go as planned: data issues can be identified and repaired at this stage.
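For example, a quick sanity check against a staging table might look like this sketch (SQLite, and the table and check are hypothetical):

```python
# A sketch of a sanity check run while data is still in staging,
# before the final load. Table and check are hypothetical (SQLite).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE staging_orders (id INTEGER, amount REAL)")
con.executemany("INSERT INTO staging_orders VALUES (?, ?)",
                [(1, 10.0), (2, None)])

# Investigate issues while the data is still in staging:
bad = con.execute(
    "SELECT COUNT(*) FROM staging_orders WHERE amount IS NULL"
).fetchone()[0]
if bad:
    print(f"{bad} row(s) with missing amount; repair before the final load")
```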

Load :-

This is the final step in the ETL process. The converted data in the staging database is transferred into the target database. Data can be loaded all at once (full loading) or at scheduled intervals (incremental loading).
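A sketch contrasting the two loading modes, using SQLite and hypothetical staging and target tables:

```python
# A sketch of full vs. incremental loading from a staging table into a
# target table. Tables and the loaded_at column are hypothetical (SQLite).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE staging_orders (id INTEGER, loaded_at TEXT)")
con.execute("CREATE TABLE target_orders  (id INTEGER, loaded_at TEXT)")
con.executemany("INSERT INTO staging_orders VALUES (?, ?)",
                [(1, "2019-08-20"), (2, "2019-08-22")])

def full_load():
    """Full loading: replace the target's contents in one shot."""
    con.execute("DELETE FROM target_orders")
    con.execute("INSERT INTO target_orders SELECT * FROM staging_orders")

def incremental_load(since):
    """Incremental loading: append only rows staged after the last run."""
    con.execute("INSERT INTO target_orders "
                "SELECT * FROM staging_orders WHERE loaded_at > ?", (since,))

incremental_load("2019-08-21")
print(con.execute("SELECT COUNT(*) FROM target_orders").fetchone()[0])  # 1
```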

ETL tools :-

Incumbent batch ETL tools:-

  1. Microsoft SSIS
  2. Oracle Data Integrator
  3. IBM InfoSphere DataStage

Cloud native ETL tools:-

  1. Alooma
  2. Matillion
  3. SnapLogic

Open source ETL tools:-

  1. Apache Airflow
  2. Apache Kafka
  3. Talend Open Studio
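Of the open source tools above, Apache Airflow is a popular choice for orchestrating ETL pipelines. A minimal sketch of an ETL DAG, assuming Airflow 2.x and its TaskFlow API (which is newer than the Airflow available when this article was written); the task bodies are placeholders:

```python
# A minimal sketch of an ETL pipeline as an Apache Airflow DAG,
# assuming Airflow 2.x with the TaskFlow API. Task bodies are placeholders.
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule_interval="@daily", start_date=datetime(2019, 8, 22),
     catchup=False)
def etl_pipeline():
    @task
    def extract():
        return [{"id": 1, "amount": "100"}]   # stand-in for a real source

    @task
    def transform(rows):
        return [{**r, "amount": int(r["amount"])} for r in rows]

    @task
    def load(rows):
        print(f"loading {len(rows)} row(s) into the target")

    load(transform(extract()))

etl_pipeline()
```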

Real-time ETL tools:-

  1. Alooma
  2. Confluent
  3. Striim

If you are interested in learning more about this, resources such as the Databricks glossary linked above provide more detail.

Written by Hansagee Savindre Jayawickrama

Graduated from Sri Lanka Institute of Information Technology (SLIIT) and working as a Senior Software Engineer at IFS R&D Sri Lanka.