Test repository definition for data warehousing

Alexandru Dragan
METRO SYSTEMS Romania
Oct 11, 2017 · 5 min read

I. Introduction

Ever since software programming emerged, software testing has been discussed, standardized and shown to improve the quality of the final deliverable.

One of the most time-consuming testing activities is regression testing. Done manually, it requires tremendous effort that simply cannot fit in an agile sprint, so overall quality will surely suffer if it is not automated.

An automatic testing solution deals with both automatically generated tests (generic tests) and user-defined ones. This article focuses on the user-defined ones.

II. Context

For the last year I have been working with my colleague Claudiu on an automatic testing platform for the company’s data warehouse. It proved (and still proves) challenging in many areas. This article deals with making test creation as easy as possible, both for application users (developers, QA, operations) and for those supporting it.

III. Repository

From the start, we concentrated on a simple user interface that also prevents common errors at definition time, so that tests have a minimal chance of failing when they run.

I will briefly explain the structure (a hypothetical example follows the list):

  1. Statistics provide a quick overview of past test runs, which is particularly useful for checking whether a test failed or timed out more often than expected.
  2. Importing system is the core ETL entity within the data warehouse. Very roughly, it is made up of script files (mainly Teradata SQL) dealing with data transfer and processing.
  3. Setup statement is an optional field that defines the temporary tables to be used in the test.
  4. Status statement is a mandatory field that specifies the expression used to evaluate the test status (passed or failed).
  5. Result statement is an optional field that allows the user to include some context information (e.g. some records that should help investigate why the test failed).
  6. Teardown statement is reserved for dropping whatever volatile (temporary) tables were created by the setup statement.
  7. Test environment is a mandatory field that specifies the environment (database) against which the test is checked. The test is not accepted unless it runs on this environment (no syntax errors, no semantic errors, all referenced objects exist).
  8. Run timeout (seconds) specifies the maximum run time accepted for the test, on both the check environment and the actual environment. This helps stop tests that run too long (e.g. because of server overload or because they are poorly written).
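
To make the structure concrete, here is a hypothetical example of such a test, sketched as a plain C# object with the Teradata SQL embedded as strings. Every database, table and property name below is invented for this article, and the platform's internal representation may well differ; only the overall shape mirrors the fields described above.

    // Hypothetical example of a user-defined test; every identifier is invented.
    var test = new
    {
        ImportingSystem   = "SALES_IMPORT",   // the ETL entity under test
        TestEnvironment   = "DWH_TEST",       // later replaced by the runner with the actual environment
        RunTimeoutSeconds = 60,

        // Optional: volatile tables used by the test.
        SetupStatement = @"
            CREATE VOLATILE TABLE vt_orphans AS (
                SELECT f.store_key
                FROM DWH_TEST.sales_fact f
                LEFT JOIN DWH_TEST.store_dim d ON d.store_key = f.store_key
                WHERE d.store_key IS NULL
            ) WITH DATA ON COMMIT PRESERVE ROWS;",

        // Mandatory: must yield exactly one cell containing 'PASSED' or 'FAILED'.
        StatusStatement = @"
            SELECT CASE WHEN COUNT(*) = 0 THEN 'PASSED' ELSE 'FAILED' END
            FROM vt_orphans;",

        // Optional: context records that help investigate a failure.
        ResultStatement = "SELECT TOP 100 * FROM vt_orphans;",

        // Drops everything the setup created, so the test is rerunnable.
        TeardownStatement = "DROP TABLE vt_orphans;"
    };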

OK, so the user interface is clear, but the user can fill in quite a few free-text fields, so the chance of getting something wrong is not negligible. So, how do we validate all this information?

First of all, there is an important difference between an enabled and a disabled test. An enabled test is ready to be run, so stricter validations apply to it. The user can also define a disabled test (e.g. a draft), which is subject to less validation.

Validation list

  1. Test environment should be referenced at least once within the queries of an enabled test. This is crucial because the test-running job will replace it with the actual environment (database). (A couple of these checks are sketched after this list.)
  2. The size of the context information is limited by injecting a TOP value when none is specified or when the existing one is too big (e.g. no more than 1000 records are allowed to be returned).
  3. Time bound: all queries should execute in less than the specified timeout. To minimize waiting, this is checked after each test statement so that validation stops as soon as possible (i.e. we avoid finding out only after a long execution).
  4. Setup and teardown must either both be specified or both be left empty.
  5. Setup should execute with no errors and must not return any data.
  6. Status statement should contain a single statement that returns a single result set with a single cell (one row x one column). The cell may contain 'PASSED' or 'FAILED' if the test is disabled; otherwise the test must pass in order to be enabled (i.e. prove that your test at least runs successfully on the test environment of your choice).
  7. Result statement must run successfully.
  8. Teardown statement must run successfully and return no data.
  9. Rerunnable (idempotent): we check that the test can be run twice without errors. This ensures that the teardown is correctly written and drops everything the setup has created. Why this requirement? When running many tests, we noticed that from time to time Teradata refused our connections, although connection pooling was used. Our solution was to reuse the same connection for several test runs, while making sure they cannot interfere with each other.
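
To give a flavour of what the cheaper, purely textual checks might look like, here is a rough sketch. Every name in it is invented for this article, and the heavier rules (TOP injection, executing each statement on the test environment, the rerun check) can only be performed against Teradata itself, so they are not shown.

    using System;
    using System.Collections.Generic;

    // Illustrative only: the method shape, parameter names and messages are
    // invented; they do not come from the actual platform.
    public static class TestDefinitionValidator
    {
        public static IEnumerable<string> Validate(
            string setup, string status, string result, string teardown,
            string testEnvironment, int runTimeoutSeconds, bool isEnabled)
        {
            // Check 1: for an enabled test, the chosen environment must appear in
            // the queries, because the runner later replaces it with the actual one.
            var allSql = string.Join("\n", setup, status, result, teardown);
            if (isEnabled && allSql.IndexOf(
                    testEnvironment, StringComparison.OrdinalIgnoreCase) < 0)
                yield return "The test environment is never referenced in the queries.";

            // Check 4: setup and teardown go together (both present or both empty).
            if (string.IsNullOrWhiteSpace(setup) != string.IsNullOrWhiteSpace(teardown))
                yield return "Setup and teardown must both be specified or both be empty.";

            // A timeout of zero or less can never be met (check 3 relies on it).
            if (runTimeoutSeconds <= 0)
                yield return "Run timeout must be a positive number of seconds.";
        }
    }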

More about timeouts

One important aspect of test definition is to ensure that tests do not run for too long or consume too many server resources. In theory, the Teradata.NET library allows running a query with a timeout, but that timeout seems to be ignored (the query can actually run much longer than defined). So a trick had to be used: asynchronously cancel the query once the timeout has been reached.
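
A minimal sketch of that idea, written against the generic ADO.NET DbCommand API rather than any Teradata-specific class, could look like the following. It assumes the provider honours DbCommand.Cancel(); the helper name and the exception handling are invented for illustration.

    using System;
    using System.Data.Common;
    using System.Threading.Tasks;

    public static class TimeoutGuard
    {
        // Runs the command on a worker thread and cancels it server-side if it
        // does not complete within the allowed time.
        public static async Task<object> ExecuteScalarWithTimeoutAsync(
            DbCommand command, TimeSpan timeout)
        {
            var queryTask = Task.Run(() => command.ExecuteScalar());

            // Wait for whichever finishes first: the query or the timeout.
            var finished = await Task.WhenAny(queryTask, Task.Delay(timeout));

            if (finished != queryTask)
            {
                // Timeout reached: ask the server to abort the running request.
                command.Cancel();
                throw new TimeoutException(
                    $"Query exceeded the allowed {timeout.TotalSeconds} s and was cancelled.");
            }

            return await queryTask; // propagates any provider exception
        }
    }

The key point is simply that the cancellation is issued from outside the running query, instead of relying on the provider's own command timeout.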

IV. Conclusions

Although defining custom tests seems simple at first glance, many aspects have to be considered to keep them both simple and robust. This approach greatly helps data warehouse teams create and maintain tests in the data warehouse repository.

1. Simplicity means as few input fields as possible. The structure is standardized to minimize errors and to allow less technical users (e.g. analysts, consultants) to enter their own tests. The interface also allows copy-pasting SQL code from another tool (e.g. Teradata SQL Assistant) and having it work with virtually no change (i.e. just select the correct database).

2. Best practices are favored or even enforced by the standardized structure:

  • one business case per test, to ensure good defect triangulation
  • clear test structure: setup (arrange and act), status (assert), result (context information) and teardown
  • independent tests
  • reusability: tests are environment-agnostic (e.g. module test, pre-production)

3. Robustness takes the burden of being extremely careful when defining a test away from the user. The application does not allow enabling tests that do not run successfully on the chosen environment.

4. Overview (statistics) allows setting a reasonable timeout value based on actual run times, and also assessing whether the test pass ratio is reasonable (i.e. too many failures typically mean that the test itself is not correct).

I would love to hear how you deal with automated testing, whether it is built in-house or based on a third-party application stack.


Alexandru Dragan
METRO SYSTEMS Romania

A full-stack developer working with ASP.NET Core and Angular 2+. Also interested in knowledge management.