ETL Process.
What is ETL process?
ETL stands for Extract, Transform and Load. This process is used for Data Migration process.
What is Data Migration?
Data migration is the process that used to move data from one location to another location. In any data migration process Transform and Load processes are occurred.
Extraction :-
First step is to extract data from the identified sources. These data can be different types. Sources can be database systems and applications. These sources do not contain well organized data.
Extraction process has to extract more data than necessary since, it is difficult to identify the specific subset of relevant data. Therefor, identification of consistent data will be done at the later.
Some Transactions are completed during the Extraction process. It depends on the capabilities of the source systems’. Extracted data amount is varied from hundreds of kilobytes up to gigabytes. This amount depends on the source system and the business situation.
Time also varies from days/ hours and minutes. Web server log files can easily complete the extraction in very short time period.
By doing the extraction process data can be consolidate, refine and process. Data can be stored in centralized location that would be helpful for the Transform process.
Three Data Extraction methods :-
- Partial Extraction
- Partial Extraction with update notification
- Full extract
Examples for the data sources:
Existing databases and legacy systems
Cloud, hybrid and on-premises environments
sales and marketing applications
Mobile devices and apps
CRM systems
Transform :-
In the Extraction step data has been extracted and these data is often raw and not usable in its original form. As a solution for this, Transformation has to be completed. This step has number of steps to convert raw data into the desired format. Before transform, data must be cleansed by removing the inconsistency or missing values.
Once the data is cleansed, it should follow the following steps,
- Data discovery
- Data mapping
- Generating code
- Executing the code
- Review
There are some additional steps that may used in customized operations.
- Filtering
- Merging
- Splitting a column into multiple columns and vice versa
- Joining together data from multiple sources.
- Removing duplicate data.
Before execute the final step, data is loaded into staging database. This step is useful for a quick investigations, if something does not complete as planned. In this stage, data issues can be identified and repaired.
Load :-
Final step in the ETL process. During this process, converted data which is in the staging database is transferred into target database. Data can be loaded at once (Full loading) or scheduled intervals (Incremental loading).
ETL tools :-
Incumbent batch ETL tools:-
- Microsoft SSIS
- Oracle Data Integrator
- IBM InfoSphere DataStage
Cloud native ETL tools:-
- Alooma
- Matillion
- Snaplogic
Open source ETL tools:-
- Apache Airflow
- Apache Kafka
- Talend Open Studio
Real-time ETL tools:-
- Alooma
- Confluent
- Striim
If you are interesting to know more about this, you can follow below links and get more knowledge about this.