What is ETL?

Sciforce · Sep 27, 2019

In our articles on AI and Big Data in healthcare, we often refer to ETL as the core process behind them, yet we rarely write about ETL itself. In this post, we’ll give a short overview of the procedure and its applications in business.

ETL is an abbreviation of Extract, Transform, Load, which are three database functions:

  • Extract is the process of reading data that is assumed to be important. The data can either be raw data collected from multiple sources of different types or be taken from a source database.
  • Transform is the process of converting the extracted data from its previous format into the format required by another database. The transformation occurs by using rules or lookup tables or by combining the data with other data.
  • Load is the process of writing the data into the target database, data warehouse or another system.

ETL in its essence is a type of data integration used to blend data from multiple sources.
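To make the three steps concrete, here is a minimal sketch of a batch ETL job in Python, using only the standard library. The file name, table and column names are hypothetical and exist only for illustration; a real pipeline would point at actual sources and a real warehouse.

```python
import csv
import sqlite3

# Hypothetical source file and target database, used only for illustration.
SOURCE_CSV = "orders.csv"
TARGET_DB = "warehouse.db"

def extract(path):
    """Extract: read raw records from a CSV source."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    """Transform: convert extracted records into the format the target expects."""
    for row in rows:
        yield (
            row["order_id"],
            row["customer"].strip().lower(),   # normalize names
            round(float(row["amount"]), 2),    # cast amounts to numbers
        )

def load(records, db_path):
    """Load: write the transformed records into the target database."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS orders (id TEXT, customer TEXT, amount REAL)"
        )
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)

if __name__ == "__main__":
    load(transform(extract(SOURCE_CSV)), TARGET_DB)
```

SQLite stands in for the target warehouse here; the same extract/transform/load split applies whatever the source and destination happen to be.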

ETL vs. ELT

The ETL paradigm is inherent to Data Warehousing, but Big Data has significantly changed the order of these processes. In Big Data, data is “lifted and shifted” wholesale to a repository, such as a Data Lake, and is held there in its original format. It is transformed “on the fly” when Data Scientists need it, which gives the procedure of ELT, or Extract, Load, Transform.

One of the main benefits of ELT is a shorter load time. As we can take advantage of the built-in processing capability of data warehouses, we can reduce the time that data spends in transit. This capability is most useful when processing large data sets required for business intelligence and big data analytics.

In practice, however, things are not so black and white. Many Data Lakes, for example, contain intermediate merged and transformed data structures to ensure that each Data Scientist doesn’t repeat the same work, or carry it out in a different way.
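As a rough contrast with the ETL sketch above, an ELT flow loads the raw records first and pushes the transformation down into the target itself, typically as SQL run when analysts need it. SQLite again stands in for a real warehouse, and the table and column names are assumptions made for illustration.

```python
import csv
import sqlite3

# SQLite stands in for a cloud warehouse; all names are illustrative only.
with sqlite3.connect("warehouse.db") as conn:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders (order_id TEXT, customer TEXT, amount TEXT)"
    )

    # Load: move the source data wholesale, still in its original string format.
    with open("orders.csv", newline="") as f:
        rows = [(r["order_id"], r["customer"], r["amount"]) for r in csv.DictReader(f)]
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)

    # Transform: performed later, inside the warehouse, only when it is needed.
    conn.execute("""
        CREATE VIEW IF NOT EXISTS clean_orders AS
        SELECT order_id,
               lower(trim(customer)) AS customer,
               round(CAST(amount AS REAL), 2) AS amount
        FROM raw_orders
    """)
```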

Where are ETL/ELT used?

ETL is not a new technology: businesses have relied on it for many years to get a consolidated view of the data. The most common uses of ETL include:

ETL and traditional uses

Traditionally, ETL is used to consolidate, collect and join data from external suppliers or to migrate data from legacy systems to new systems with different data formats. ETL tools surface data from a data store in a format comprehensible to business people, making it easier to analyze and report on. The key beneficiaries of these applications are retailers and healthcare providers.

ETL and metadata

ETL provides a deep historical context and a consolidated view for the business by surfacing the metadata. As data architectures become more complex, it’s important to track how the different data elements are used and related within one organization. Metadata helps understand the lineage of data and its impact on other data assets in the organization.
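One lightweight way to picture this is recording a lineage entry alongside every transformation; the field names below are an assumption for illustration, not any particular tool’s schema.

```python
import datetime

# Illustrative lineage record; the fields are assumptions, not a standard schema.
lineage_entry = {
    "target": "warehouse.orders",
    "sources": ["crm.raw_orders", "billing.payments"],
    "transformation": "join on order_id; normalize customer names",
    "run_at": datetime.datetime.utcnow().isoformat(),
    "owner": "data-engineering",
}
```

Collected over time, such records let you trace where any data element came from and which downstream assets a change would affect.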

ETL and Data Quality

ETL and ELT are extensively used for data cleansing, profiling and auditing, ensuring that data is trustworthy. ETL tools can be integrated with data quality tools, such as those used for data profiling, deduplication or validation.
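A minimal sketch of the two simplest checks, validation and deduplication, might look like this; the rules and field names are examples, not a data quality tool’s API.

```python
# Example records; the second is a duplicate, the third fails validation.
rows = [
    {"order_id": "1", "amount": "19.99"},
    {"order_id": "1", "amount": "19.99"},
    {"order_id": "2", "amount": "-5"},
]

def validate(row):
    """Example rule: the amount must be present and non-negative."""
    try:
        return float(row["amount"]) >= 0
    except (KeyError, ValueError):
        return False

def deduplicate(records, key="order_id"):
    """Keep only the first occurrence of each key."""
    seen = set()
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            yield record

clean = list(deduplicate(filter(validate, rows)))
print(clean)  # only the first order survives both checks
```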

ETL and Self-Service Data Access

Self-service data preparation is a fast-growing field that puts the power of accessing, blending and transforming data into the hands of business users and other nontechnical data professionals. ETL codifies and reuses processes that move data without requiring technical skills to write code or scripts. With this approach integrated into the ETL process, less time is spent on data preparation, improving professionals’ productivity.

ETL/ELT for Big Data

As the demand for big data grows, ETL vendors add new transformations to support the emerging requirements for handling large amounts of data and new data sources. Adapters give access to a huge variety of data sources, including data from videos, social media, the Internet of Things, server logs, spatial data, streams, etc. Data integration tools interact with these adapters to extract and load data efficiently.

ETL for Hadoop

Taking a step further, advanced ETL tools load and convert structured and unstructured data into Hadoop. These tools can read and write multiple files in parallel, simplifying the way the data is merged into a common transformation process. ETL also supports integration across transactional systems, operational data stores, BI platforms, master data management (MDM) hubs and the cloud.
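For instance, with PySpark, a common engine on Hadoop clusters, reading a whole directory of files in parallel, transforming it and writing it back takes only a few lines; the paths and column names are placeholders for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hadoop-etl-sketch").getOrCreate()

# Read every JSON file under the landing path in parallel (placeholder paths).
raw = spark.read.json("hdfs:///data/landing/events/*.json")

# A trivial transformation step before merging into the common store.
clean = raw.dropDuplicates(["event_id"]).filter("event_type IS NOT NULL")

# Write back as Parquet, again in parallel across the cluster.
clean.write.mode("append").parquet("hdfs:///data/warehouse/events/")
```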

How is it built?

When building an ETL infrastructure, you need to integrate data sources and carefully plan and test to ensure you transform source data correctly. There are several basic ways to build an ETL process, loosely reflecting the stages in the evolution of ETL:

Traditional ETL batch processing

This approach requires meticulously preparing and validating data and transforming it in a rigid process. Data is transferred and processed in batches from source databases to the data warehouse. Traditional ETL works, but it is complex to build, rigid, and unsuitable for many modern requirements. Since building an enterprise ETL pipeline from scratch is challenging, ETL tools such as Stitch or Blendo are often used to automate the process.

ETL with stream processing

Modern data processes often include real-time data. In this case, you cannot extract and transform data in large batches, so you have to perform ETL on data streams. This approach implies using a modern stream processing framework such as Kafka or Spark, so that you can pull data in real time from a source, manipulate it on the fly, and load it to a target system such as Amazon Redshift.
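A rough sketch of that pattern with the kafka-python client: consume records, transform them on the fly, and hand them off to a loader in small batches. The topic name, broker address and the load step are assumptions; a real pipeline would load via COPY or INSERT into the target warehouse.

```python
import json
from kafka import KafkaConsumer  # kafka-python client

def load_to_warehouse(records):
    """Placeholder for the load step, e.g. a COPY/INSERT into Redshift."""
    print(f"loading {len(records)} records")

# The topic and broker address are placeholders for illustration.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

batch = []
for message in consumer:
    order = message.value
    # Transform on the fly: normalize a field before loading.
    order["customer"] = order["customer"].strip().lower()
    batch.append(order)

    # Micro-batch the load to the target warehouse.
    if len(batch) >= 500:
        load_to_warehouse(batch)
        batch.clear()
```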

Serverless ETL

The recent rise of serverless architecture has opened new possibilities for simplifying the ETL process, with no dedicated infrastructure to provision or manage. New ETL systems can handle provisioning, configuration, and scaling of the resources required to run your ETL jobs on a fully managed, scale-out Apache Spark environment. The most prominent such service is AWS Glue, which crawls data sources, identifies data formats, and suggests schemas and transformations, automating most of the work for users.
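With Glue, for example, a crawler run and an ETL job run can be triggered with a few boto3 calls; the crawler and job names below are placeholders for resources you would have defined in your own AWS account.

```python
import boto3

glue = boto3.client("glue")

# Crawl the source so Glue can identify formats and suggest a schema
# ("orders-crawler" is a placeholder for a crawler defined in your account).
glue.start_crawler(Name="orders-crawler")

# Run a serverless Spark ETL job; Glue provisions and scales the workers.
run = glue.start_job_run(JobName="orders-etl-job")
print(run["JobRunId"])
```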

Automated data pipeline without ETL

With the new cloud-based data warehouse technology, it is possible to do ETL without actually having an ETL system. Cloud-based automated data warehouses can have built-in end-to-end data management. They rely on a self-optimizing architecture with machine learning and natural language processing (NLP) at its core to automatically extract and transform data to match analytics requirements.

Organizations need ETL and ELT to bring data together, maintain accuracy and provide the auditing required for data warehousing, reporting and analytics. As a rule, ETL and ELT tools work in conjunction with other data integration tools and are closely related to other aspects of data management, such as data quality, data governance, virtualization and metadata. In present-day settings, however, it is important to expand one's view and adopt stream processing, or even fully automated data warehousing, to build a more effective data pipeline.
