Paving the way for Data Warehouse Continuous Testing

Andrei Claudiu
METRO SYSTEMS Romania
Mar 29, 2018 · 7 min read

I. Introduction

The concept of a data warehouse is rather simple. Data is extracted from multiple source systems, transformed, and then stored for Business Intelligence purposes.

Using an Agile methodology in data warehouse development helps achieve a quicker development cycle, but it also puts the existing challenges in the spotlight even more than a traditional development methodology would. Such challenges include maintaining a high level of data quality, keeping database performance good, and responding well to changing business requirements.

The data warehouse will inevitably grow, with more and more tables filled with more and more data. New business requirements bring with them the possibility of growing technical debt, as well as data quality issues. Ideally these issues are found and fixed early, before users notice them, preferably during the testing phase of development or even during the analysis stage.

Applying Scrum to data warehouse development requires even more attention to the testing phase. It is imperative to shorten the time between implementation and quality feedback, as the development cycles are far shorter than in a Waterfall development style. To achieve this we turn to test automation (not to be confused with automated testing).

Test automation relies on tests, which in data warehousing are a bit different from those in other technology branches. Data warehouse tables hold huge amounts of data, which is also why some types of constraints, such as foreign keys, cannot be used.

Most of the tests are SQL statements used to ensure the correct implementation of the business cases. They can also be used to check data consistency, verify the transformation logic, look for generated duplicates, validate the data model and perform many other checks (more here).
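
As a minimal sketch of what such a test can look like (the table and column names below are purely illustrative, not part of our actual model), a consistency check between a source import table and its target might be written like this:

```sql
-- Illustrative consistency test: compare the number of rows delivered by
-- the source import table with the number of rows loaded into the target.
-- SALES_IMPORT and SALES_TARGET are hypothetical names.
SELECT
    'SALES_IMPORT vs SALES_TARGET'       AS checked_object,
    src.row_cnt                          AS source_rows,
    tgt.row_cnt                          AS target_rows,
    CASE WHEN src.row_cnt = tgt.row_cnt
         THEN 'PASSED' ELSE 'FAILED' END AS test_status
FROM       (SELECT COUNT(*) AS row_cnt FROM SALES_IMPORT) AS src
CROSS JOIN (SELECT COUNT(*) AS row_cnt FROM SALES_TARGET) AS tgt;
```

Written this way, the statement is re-runnable and has no side effects, which is exactly what the standardized test structure described below requires.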

Test automation is very much based on standardization. Tests have to follow some basic rules in order to make them re-runnable, independent and environment-agnostic. For this reason the tests follow a certain structure, described in more detail by my colleague Alexandru Dragan in this article.

Prerequisites and enablers

The architectural design of the data warehouse is very important when implementing a test automation tool. In our case, the data warehouse architects who developed it many years ago made some important choices which had a great impact on our approach.

The first enabler was the splitting of the data warehouse into multiple smaller ETL processes, based on business logic, called Importing Systems. An Importing System is an ETL process with its own transformation tables and SQL transformation scripts. This Importing System-based structure makes it possible to associate a test with one of the Importing Systems, meaning that we can easily select which tests should be executed at any given moment.

Importing System diagram
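
To illustrate the idea of associating tests with Importing Systems, a hypothetical registration table (names are assumptions for the sketch, not the tool's actual schema) could look like this:

```sql
-- Hypothetical test registry: each test belongs to one Importing System,
-- so the tool can select exactly the tests relevant to that system.
CREATE TABLE TEST_REGISTRY (
    test_id          INTEGER NOT NULL PRIMARY KEY,
    importing_system VARCHAR(50) NOT NULL,  -- e.g. 'SALES', 'STOCK'
    run_on           VARCHAR(12) NOT NULL,  -- 'INSTALL' or 'EXECUTION'
    test_sql         CLOB        NOT NULL   -- the SQL statement to run
);

-- Selecting the tests to queue after an execution of the 'SALES' system:
SELECT test_id, test_sql
FROM TEST_REGISTRY
WHERE importing_system = 'SALES'
  AND run_on = 'EXECUTION';
```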

On top of this, we benefit from exhaustive logging of installations and Importing System executions. This makes it easy to determine when to run the tests, since it only makes sense to test an Importing System after its installation or execution, depending on the test.

II. Data warehouse test automation tool

The test automation application is a web application, which provides a good user interface and makes it easy to manage tests and test results. However, the most important part of this testing framework happens in the background, unseen by the end user.

The test automation tool polls data related to installations and Importing System runs every few minutes, across the multiple environments at our disposal. The retrieved data is processed and compared with the existing information in order to determine what action the tool should take.
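
A rough sketch of the polling step, assuming a hypothetical execution log table IS_EXECUTION_LOG and a tool-side bookkeeping table PROCESSED_EXECUTIONS (neither name comes from the actual system):

```sql
-- Hypothetical polling query: Importing System executions that have
-- finished since the last poll and have not yet been handled by the tool.
SELECT ex.execution_id,
       ex.importing_system,
       ex.environment,
       ex.finished_at
FROM IS_EXECUTION_LOG AS ex
LEFT JOIN PROCESSED_EXECUTIONS AS done
       ON done.execution_id = ex.execution_id
WHERE ex.status = 'FINISHED'
  AND done.execution_id IS NULL;   -- new since the last poll
```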

This process is too complex to demonstrate fully in code, but in short it is an ETL process with three main steps:

Metadata refresh process

Following the detection of an installation, the application gathers metadata about the installed Importing System. This metadata can directly reveal flaws in the Importing System logic. Practical examples include finding unused tables, or finding tables that are used before being inserted into. Keep in mind that an Importing System can have hundreds of steps and hundreds of tables, with some very complex logic behind each step.
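
As an illustration only, assuming the gathered metadata were stored in two hypothetical tables, IS_TABLES (the tables of an Importing System) and IS_TABLE_USAGE (how each script step reads or writes them), such flaws could be detected with queries along these lines:

```sql
-- Unused tables: they belong to the Importing System, but no script step
-- ever reads them. (Table and column names are assumptions.)
SELECT t.importing_system, t.table_name
FROM IS_TABLES AS t
LEFT JOIN IS_TABLE_USAGE AS u
       ON u.table_name = t.table_name
      AND u.usage_type = 'READ'
WHERE u.table_name IS NULL;

-- Tables used before being inserted into: the first step that reads the
-- table comes before the first step that writes it.
SELECT r.table_name
FROM (SELECT table_name, MIN(step_no) AS first_read
      FROM IS_TABLE_USAGE
      WHERE usage_type = 'READ'
      GROUP BY table_name) AS r
JOIN (SELECT table_name, MIN(step_no) AS first_write
      FROM IS_TABLE_USAGE
      WHERE usage_type = 'WRITE'
      GROUP BY table_name) AS w
  ON w.table_name = r.table_name
WHERE r.first_read < w.first_write;
```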

To really test a system, as much information as possible is needed to determine how and what to test, which makes this metadata invaluable for our purposes.

Another good use of the gathered metadata is to help generate “Generic tests”. These generated tests help perform validations on a large scale. For example, by determining all of an Importing System's tables, a query checking for certain behavior patterns at the Importing System level can easily be generated.

A simple example of a generic test is one checking for duplicates. Notice that this test does not check specific business rules, but full-row duplicates. This kind of test can reveal issues coming either from the source systems or from the internal Importing System logic. Such cases can be very harmful, impacting data warehouse performance and decreasing data quality. The root cause can be an incomplete business key (used in JOIN clauses).
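
For a single table the check is trivial to write by hand; what the generic test adds is doing it for every table of an Importing System without manual effort. A hand-written version for one hypothetical table might look like this:

```sql
-- Hand-written full-row duplicate check for one hypothetical table.
-- The column list is illustrative; in practice it covers all columns.
SELECT col_a, col_b, col_c,
       COUNT(*) AS duplicate_cnt
FROM TRANSFORM_SALES_DAILY
GROUP BY col_a, col_b, col_c
HAVING COUNT(*) > 1;
```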

The application generates this generic test in compliance with our standard test structure using the following simple steps:

· Create temporary table (Volatile table in Teradata DBMS)

· For each Importing System table insert the relevant data (table name, environment, count, distinct count and their difference)

Volatile data insertion

· Determine the test status

Test status

· Return all of the content from the temporary table

· Drop temporary table
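
Putting the steps above together, a minimal sketch of such a generated test could look as follows. The table name, the sample environment and the single-table insert are assumptions for illustration; the real statement is generated per Importing System from the gathered metadata, with one insert per table.

```sql
-- Step 1: temporary result holder (volatile table in Teradata).
CREATE VOLATILE TABLE duplicate_check_result (
    table_name   VARCHAR(128),
    environment  VARCHAR(30),
    row_cnt      BIGINT,
    distinct_cnt BIGINT,
    diff_cnt     BIGINT,
    test_status  VARCHAR(10)
) ON COMMIT PRESERVE ROWS;

-- Step 2: one generated INSERT per Importing System table
-- (shown here for a single hypothetical table).
INSERT INTO duplicate_check_result
SELECT 'TRANSFORM_SALES_DAILY',
       'TEST',
       all_rows.cnt,
       dist_rows.cnt,
       all_rows.cnt - dist_rows.cnt,
       CAST(NULL AS VARCHAR(10))
FROM       (SELECT COUNT(*) AS cnt FROM TRANSFORM_SALES_DAILY) AS all_rows
CROSS JOIN (SELECT COUNT(*) AS cnt
            FROM (SELECT DISTINCT * FROM TRANSFORM_SALES_DAILY) AS d) AS dist_rows;

-- Step 3: determine the test status per table.
UPDATE duplicate_check_result
SET test_status = CASE WHEN diff_cnt = 0 THEN 'PASSED' ELSE 'FAILED' END;

-- Step 4: return the content of the temporary table as the test result.
SELECT * FROM duplicate_check_result;

-- Step 5: clean up.
DROP TABLE duplicate_check_result;
```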

Since the application gathers the metadata regarding Importing Systems, the tables to be tested are already known and the test adapts to changes in the Importing System. For this reason, no maintenance is needed for this kind of test.

Test executions

Besides detecting Importing System installations, we detect their executions as well, and the application handles test execution based on this information. As soon as an Importing System is executed, the application determines which tests should be run and places them in a queue.

The application handles test execution on multiple environments, managing multiple database connections while taking care to avoid connection flooding (i.e. a maximum of four active connections at once). All of the described processes happen asynchronously, in an efficient, highly performant fashion.

Users get their test results only a few minutes after the execution of an Importing System. The test result consists of several relevant pieces of information such as:

Test result components

An article from Alexandru Dragan describing how the test result page is populated is available here.

Test result view

The number of tests executed in a given week is somewhere between 10,000 and 15,000, depending on user activity on the environments. This, together with the fact that about 10% of the tests fail with good reason, demonstrates the utility of this application.

Test scheduling

While the application handles the execution of tests, their scheduling is a different subject. In order to maintain good control over test management, a couple of mechanisms are in place.

The two mechanisms are the use of Test Batches and Test Running Definitions.

Tests are grouped into test batches. Scheduling is done through running definitions, which simply link one or more test batches to one or more environments.

Test scheduling diagram
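
A possible relational sketch of these two mechanisms (table and column names are assumptions, not the tool's actual schema):

```sql
-- Tests are grouped into batches.
CREATE TABLE TEST_BATCH (
    batch_id   INTEGER NOT NULL PRIMARY KEY,
    batch_name VARCHAR(100) NOT NULL
);

-- A running definition is the scheduling unit ...
CREATE TABLE RUNNING_DEFINITION (
    definition_id   INTEGER NOT NULL PRIMARY KEY,
    definition_name VARCHAR(100) NOT NULL
);

-- ... linking one or more test batches ...
CREATE TABLE DEFINITION_BATCH (
    definition_id INTEGER NOT NULL,
    batch_id      INTEGER NOT NULL
);

-- ... to one or more environments.
CREATE TABLE DEFINITION_ENVIRONMENT (
    definition_id INTEGER NOT NULL,
    environment   VARCHAR(30) NOT NULL  -- e.g. 'DEV', 'TEST', 'PROD'
);
```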

A step beyond

As mentioned before, in order to perform quality testing, a good deal of information regarding the tested Importing System is gathered. This information is required, as without it no testing can be done; it is, in essence, the metadata of the tested systems.

This metadata is valuable both in testing and in development, as it helps perform analysis faster and gain a better understanding of the systems being developed. For this reason, all of the gathered metadata is made available to the users of the application. Such data includes:

· Available Environments — view for managing environments

· General information regarding Importing System executions and installations — gives a full picture of the activity across all of the available environments

· Table information — import tables, transformation tables, target tables, together with information regarding their use in the Importing System scripts.

· Import information — data regarding what was imported for each system and table.

It is also worth mentioning that the application is available for all members of all teams working in the data warehouse department, using various security groups to determine the types of allowed actions.

III. Conclusion

Since the beginning of this project, the objectives have changed bit by bit, and so have the requirements from our users. The current result is a mature application that is easy to use for testing and analysis purposes, and easy to develop on.

Changing the way of working from a Waterfall methodology to Agile is a difficult process. Along with changes in processes and approaches, tools have to adapt as well. This makes it especially valuable to have a continuous testing tool for running regression and generic tests, as the feedback on development quality comes very quickly. Since the testing process has no pipeline in our case, it does not perform ContinuousTesting, but it does continuous testing all the same.

With all this in mind, I hope I have offered a clear view of the test automation possibilities in data warehouse development.

Test Automation expert in Business Intelligence at METRO Systems Romania, working with Teradata and developing .NET applications.