An example of a Simple Metadata Driven ETL Process implementation. Part 1

Dejan Todorovic
Published in Jeff Tech
Jun 25, 2020 · 7 min read

Have you ever found yourself in a situation where you had to implement a relatively simple ETL process to transfer data from one place to another while performing certain data transformations along the way? The result is needed asap, additional data checks, reporting and proper documentation can be introduced later, and buying a fancy new ETL tool is not really an option. This set of articles will give you an idea of how we tackled this at Jeff.

Let us assume that all of the required data sources are already accessible from within the database you are planning to use for your destination tables (i.e. the data warehouse). If that is not the case, there are a number of ways to achieve it: for example, importing data from external sources using a helper app written in Python or similar, registering text files on the database server as database tables, or creating database links directly to your data source(s).

So, once everything you need is accessible from within the database, how do you do it?

Basically, you would need to run a bunch of SQL insert, update and delete commands in order to transfer, transform and check the data. You could run these SQLs as one big anonymous SQL block, but things can easily get out of control in terms of error handling and maintenance. Once you are done with the data transfer queries, a number of additional SQLs for data checks should also be created, such as checks for value ranges, formats, lists of values and relations. These are very similar SQLs for each type of data check, which means that if you write them manually you will repeat yourself a lot; the only things that change are the table and column names.
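For illustration, a couple of such checks might look like the queries below; the staging table and column names are made up, only the repeating pattern matters:

```sql
-- Hypothetical data checks: the shape of each query is identical,
-- only the table and column names change.
SELECT id FROM stg_customer WHERE email IS NULL;                     -- mandatory check
SELECT id FROM stg_invoice  WHERE invoice_no IS NULL;                -- mandatory check
SELECT id FROM stg_invoice  WHERE amount NOT BETWEEN 0 AND 1000000;  -- value range check
```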

Instead of implementing a data transformation process for each individual data source, a better option is to create a small data model for storing metadata information and then create and run the SQL commands automatically. That way you will be able to easily change and add new functionality to the existing solution, e.g. changing the order in which data sources are processed or adding additional logging information.

Metadata

The most important function of an ETL process is transferring the data from one location (one or more tables) into another whilst performing certain data transformations. For more complex data transformations it is a good practice to break them down into simpler steps. In order to populate the destination data stores, we should be able to put together an INSERT INTO… SELECT command for each destination table that will select data from the source tables and insert it into the destination table. This information can be stored in a metadata table with the following attributes (a sketch of such a table follows the list):

  • SQL query, used for storing the SQL for reading and transforming the data from the data source tables; we will separate the query into three parts: select, from and where so that we can easily determine the related tables (more about that later),
  • The list of attributes of the destination table, which will be used to create a proper INSERT command with all of the affected attributes specified,
  • The name of the destination table.
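As a rough sketch, the metadata table could look something like the DDL below; the column names and types are illustrative, not the exact model we use:

```sql
-- Illustrative sketch of the data transfer metadata table; column names are assumptions.
CREATE TABLE etl_data_transfer (
    id                integer      PRIMARY KEY,
    destination_table varchar(128) NOT NULL,  -- name of the destination table
    destination_attrs text         NOT NULL,  -- list of destination attributes for the INSERT
    sql_select        text         NOT NULL,  -- SELECT part of the source query
    sql_from          text         NOT NULL,  -- FROM part, used to determine the related tables
    sql_where         text                    -- optional WHERE part
);

-- For each row the process then assembles and runs:
--   INSERT INTO <destination_table> (<destination_attrs>)
--   SELECT <sql_select> FROM <sql_from> [WHERE <sql_where>];
```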

Even if the volume of source data is not huge, implementing incremental data loads is something that should be considered; however, not all data load scenarios can be easily covered. Incremental loads require a way to identify the changes on the source side (newly inserted, updated or deleted data) and to refresh the destination data accordingly. The easiest way to accomplish that would be to identify all new, updated and deleted records from the source's data change log, but unfortunately change log information is not always available. Usually there is a numeric primary key populated from a sequence, an attribute holding the record creation timestamp and possibly a timestamp of the record's last update. That means we should be able to select all records created and/or changed after a specific point in time (the update window).

The issue with this approach is that records deleted on the source will not be picked up, so we need to determine which records were deleted and remove them from the destination. We can either compare old and new records using unique keys and remove the records deleted on the source or, especially when we cannot determine how to uniquely identify records (no unique key defined), delete all destination records from a specific time window and simply copy all records from the same time window from the source.
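To illustrate the last option, the sketch below assumes hypothetical src_orders and dwh_orders tables and a window defined on the record creation timestamp; :window_start stands for a bind parameter supplied by the ETL run:

```sql
-- Hypothetical "delete and re-copy the time window" strategy, assuming the window
-- is defined on the creation timestamp carried over into the destination.
DELETE FROM dwh_orders
 WHERE created_at >= :window_start;

INSERT INTO dwh_orders (src_id, customer_id, amount, created_at, updated_at)
SELECT o.id, o.customer_id, o.amount, o.created_at, o.updated_at
  FROM src_orders o
 WHERE o.created_at >= :window_start;
```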

For our case we will assume that the source data sets have a unique key defined by a single attribute, which we need to store in the destination tables as well in order to relate the source and the destination data. Furthermore, we want to update only the rows that were actually changed, so we need a way to determine which destination records should be updated. We can achieve that either by comparing all attribute values between the source and destination records or by creating a hash value over the set of source attributes and comparing only that hash with the hash value already stored in the destination table.
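A minimal sketch of the hash comparison, assuming PostgreSQL-style syntax, the built-in md5() function and made-up table and column names:

```sql
-- Update only the destination rows whose computed hash differs from the stored one.
UPDATE dwh_customer d
   SET name     = s.name,
       email    = s.email,
       row_hash = md5(coalesce(s.name, '') || '|' || coalesce(s.email, ''))
  FROM src_customer s
 WHERE d.src_id = s.id
   AND d.row_hash IS DISTINCT FROM md5(coalesce(s.name, '') || '|' || coalesce(s.email, ''));
```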

In order to automate the incremental update process described above, we need to add the following metadata to the etl_data_transfer table (an example row is given after the list):

  • Destination table unique key attribute,
  • Destination table set of attributes used for record comparison and creation of record hash values; since this is a set of values, we will store them in CSV format, and
  • Destination table update strategy, where we can set whether the destination table is to be fully or incrementally updated, possibly adding a few different scenarios of incremental load.
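Extending the earlier sketch with these three columns, a single metadata row might then look like this (all names and values are made up):

```sql
-- Hypothetical metadata row for an incrementally loaded destination table.
INSERT INTO etl_data_transfer
    (id, destination_table, destination_attrs, sql_select, sql_from, sql_where,
     unique_key_attr, hash_attrs, update_strategy)
VALUES
    (1,
     'dwh_customer',
     'src_id, name, email',
     'c.id, c.name, c.email',
     'src_customer c',
     NULL,
     'src_id',       -- unique key attribute
     'name,email',   -- CSV list used for record comparison / hash value
     'INCREMENTAL'); -- vs. 'FULL'
```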

The ETL process should also be able to run arbitrary SQL scripts in order to truncate destination tables, refresh materialized views, refresh database statistics or similar. For this purpose, a metadata attribute to store the SQL script is required.
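A hypothetical metadata table for this step type can be very simple; the name etl_sql_script and its columns are assumptions:

```sql
-- Illustrative sketch; table and column names are assumptions.
CREATE TABLE etl_sql_script (
    id          integer PRIMARY KEY,
    description varchar(256),
    sql_text    text NOT NULL  -- e.g. 'TRUNCATE TABLE dwh_customer' or a statistics refresh call
);
```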

Checking data validity is an important element of every ETL process regardless of the volume of data. Some basic data checks we should consider are listed below (an example of a generated check follows the list):

  • Data consistency check (check if a relation from one table to another is valid),
  • Check whether the attribute value is empty,
  • Check the attribute value (the value is within a data range, the value comes from a specific list of values, a set of attributes have expected values in relation to one another, etc.).
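As an illustration, a generated consistency (REFERENCE) check could look like the query below; the table and column names are placeholders that would come from the metadata:

```sql
-- Hypothetical generated REFERENCE check: report destination rows pointing at a missing parent.
SELECT d.src_id, 'dwh_invoice.customer_id' AS checked_attribute
  FROM dwh_invoice d
  LEFT JOIN dwh_customer r ON r.src_id = d.customer_id
 WHERE d.customer_id IS NOT NULL
   AND r.src_id IS NULL;
```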

Destination tables should be created in a way that enables even erroneous data to be stored (i.e. without constraints defined on the attributes that are updated). In case of data errors, it is important to relate the errors to the data source so that we can unambiguously pinpoint the source of the error. For that purpose a unique key on the destination data is required.

For the data check element we need the following metadata attributes:

  • Data check type (REFERENCE, MANDATORY, VALUE),
  • Destination table name,
  • Destination table identity attribute, used for relating potential errors back to the erroneous record,
  • Destination table attribute name, storing the name of attribute to be checked,
  • SQL expression, used for checking data values against other attribute values.

The following additional attributes are required in order to check the references between the tables (a sketch of the resulting metadata table follows the list):

  • Referenced table name,
  • Referenced table attribute name.
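Putting the attributes from both lists together, a sketch of the data check metadata table might look like this; the table name etl_data_check and the column names are illustrative:

```sql
-- Illustrative sketch; names are assumptions, not the exact model.
CREATE TABLE etl_data_check (
    id             integer      PRIMARY KEY,
    check_type     varchar(16)  NOT NULL,  -- 'REFERENCE', 'MANDATORY' or 'VALUE'
    dest_table     varchar(128) NOT NULL,  -- destination table to be checked
    dest_id_attr   varchar(128) NOT NULL,  -- identity attribute for pinpointing erroneous records
    dest_attr      varchar(128) NOT NULL,  -- attribute to be checked
    sql_expression text,                   -- expression used for VALUE checks
    ref_table      varchar(128),           -- referenced table (REFERENCE checks only)
    ref_attr       varchar(128)            -- referenced attribute (REFERENCE checks only)
);
```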

Once the ETL steps are defined, we have to put them all together and define the sequence of execution. For that we need a table that relates all ETL step types and defines the order of execution. We need the following attributes (a sketch follows the list):

  • Step type (data transfer, data check, SQL script),
  • Relation to the step metadata (per step type),
  • Execution order.
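Reusing the hypothetical table names from the earlier sketches, the execution order table could look like this:

```sql
-- Illustrative sketch; one nullable reference per step type, only the one matching
-- step_type is populated for a given row.
CREATE TABLE etl_step (
    id               integer     PRIMARY KEY,
    step_type        varchar(16) NOT NULL,  -- 'TRANSFER', 'CHECK' or 'SCRIPT'
    data_transfer_id integer REFERENCES etl_data_transfer (id),
    data_check_id    integer REFERENCES etl_data_check (id),
    sql_script_id    integer REFERENCES etl_sql_script (id),
    execution_order  integer     NOT NULL
);
```

Keeping one nullable foreign key per step type is just one way to model the relation; a single polymorphic reference column would work as well, at the cost of losing the foreign key constraints.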

The only thing left to consider is adding identity and system attributes, such as creation/modification dates and descriptions, to all metadata tables, and then we are ready to put together the metadata ER model.

We have now defined what our ETL needs to do. In the following article we will explain how we put the wheels in motion with the ETL execution control part, show the power of metadata driven systems and offer some implementation hints, so stay tuned.
