ETL Procedures for Data Warehouses

Naftal Teddy Kerecha
Plumbers Of Data Science
4 min readMar 28, 2022

Before jumping into the discussion in the title let’s first understand the concept of a data warehouse. A data warehouse is a data management system that is designed to support analytical and business intelligence functions such as reporting, data visualization, and ad-hoc-querying.

In the majority of the cases, the data warehouse acts as a repository for data from many different sources that have a defined relationship. All the relevant data is extracted from its source, then transformed — In this step, the relationships between the data are defined, datatype conversion is done here, aggregations as well as other operations and finally the data is loaded into the data warehouse. The above steps effectively define the ETL(Extract, Transform & Load) procedure used to load data into a data warehouse.

The ETL procedures would vary wildly due to several factors such as:

  • The type of data the organization uses — This references the complexity of data, the amount of data, and the level of transformation required to get the data into the desired state for the end-users to freely query.
  • The type of sources the data is extracted from — This references the ability of the organization to obtain the most updated records at will if needed, most modern data storage solutions will have integration capabilities that make accessing refreshed data easily possible at any point in time. In situations where the refreshed data needs to be provided by an individual in the form of flat files or authorization is required before refreshed data can be accessed will have slowed ETL processes and the data warehouse may potentially have stale data.
  • The frequency at which the data changes — Depending on the nature of the operations that the firm undertakes the data they store may change very frequently and as such the ETL procedures may need to be run multiple times a day to prevent the data warehouse from holding stale data. The converse of this may be organizations that have data that changes slowly thus the ETL processes need to be initiated less frequently.
  • Nature of the use of the data warehouse — This focuses on optimizing the data warehouse update based on its use case, for example, If the reporting team generates reports and visualizations on a bi-weekly basis then the ETL procedure to obtain refreshed data can also be done at the same frequency.

As mentioned above certain factors will affect how long the ETL process takes, a breakdown of them is in the image below

Employing a data warehouse is becoming more and more commonplace, particularly as more and more organizations migrate to cloud solutions. Some of the main benefits are:

  • Data Warehouse acts as a filter for data — The data loaded into the warehouse is there because it serves a functional purpose to at least one end-user and thus not all the extracted data is loaded into the warehouse.
  • Improved query run-times — Since not all the extracted data ends us in the warehouse, ad-hoc or structured queries that are run against it have the benefit of reduced run-times.
  • Increased data consistency — The consolidation of data into one repository essentially creates one source of truth that is referenced for any relevant business questions that can be answered from the data.
  • Easier implementation of data democratization — Data Democratization is a data access principle that is focused on allowing data-literate individuals access to whatever data they require without constantly needing approval from IT personnel. With one true source of data, the access needs to be provisioned just to the data warehouse and can be restricted to only the columns they need through column-level encryption.

Data Warehouses can be implemented within Data Lakes to act as the repository container for the data lake. The data lake in this case would allow for the storage of vast amounts of raw data that may not have a defined purpose or use case just yet. The data identified as already having analytical or reporting significance can be loaded into the data warehouse while the rest is retained in the raw and landing stages.

I hope this helped shed more light on the concept of data warehouses and the ETL procedures associated with them.

--

--

Naftal Teddy Kerecha
Plumbers Of Data Science

Data Engineer | Bsc Data Science, Wilfrid Laurier University | Writer for CodeX, Nerd For Tech and Geek Culture | Passionate about Data and the Cloud