ETL Testing Best Practices

Ronan
Published in techburst
4 min read · Mar 2, 2018

For as long as enterprises have been using data as a fundamental component of Business Intelligence and as an important piece of the decision-making puzzle, there has been a need to integrate and consolidate disparate enterprise data sources in one place.

Analytic queries, BI software, and reporting tools all work best when data is in a standard, homogeneous environment, such as a data warehouse. Data integration essentially solves the complex problem of getting all the data from CRM software, text files, web analytics tools, ERP systems, and third-party data sources into one place.

ETL is one of the oldest and most widely used methods of data integration, but like many aspects of technology, it has undergone several important changes in recent years, particularly with the rise of cloud computing services, streaming data, and Big Data. Below, you’ll learn what exactly ETL entails, why ETL testing matters, and some best practices for performing ETL testing in modern enterprises.

What Is ETL?

ETL is a data integration process composed of three phases: extract, transform, and load. ETL first became popular in the 1970s when enterprises began to build data repositories containing multiple sources of information.

More specifically, the ETL process can be viewed as follows:

  • Extract: This step pulls data in multiple formats from many sources and applies validation rules to ensure data has the expected values.
  • Transform: This step applies a set of rules or functions to the extracted data so that it meets both the business and technical needs of the end target system. Transformations include joining, aggregating, and sorting data, selecting specific columns, and deriving new values. Data should end up in a single, clean, consolidated format after transformations.
  • Load: Depending on the business needs, consolidated data is loaded into an end target system either in increments or with a full load.
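
The three phases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CSV sample, the `orders` table, and the function names `extract`, `transform`, and `load` are all hypothetical, and an in-memory SQLite database stands in for the target system.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real pipeline would read CRM exports, APIs, etc.
RAW_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,not_a_number
3,carol,7.25
"""

def extract(text):
    """Parse CSV rows and keep only those whose amount is numeric."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        try:
            float(row["amount"])
        except ValueError:
            continue  # fails the validation rule; simply skipped in this sketch
        rows.append(row)
    return rows

def transform(rows):
    """Select columns, normalize names, and derive an integer cents value."""
    return [
        (int(r["order_id"]), r["customer"].title(), round(float(r["amount"]) * 100))
        for r in rows
    ]

def load(conn, records):
    """Full load: empty the target table, then insert every record."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount_cents INTEGER)"
    )
    conn.execute("DELETE FROM orders")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
```

An incremental load would replace the `DELETE` with logic that inserts or updates only new and changed records.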

ETL can be done with the help of an ETL tool, which is specially designed software that helps with data integration. Some enterprises opt for hand-coding the ETL process. Cloud computing services have also led to the growth of data pipelines as-a-service, in which cloud vendors offer web services to help enterprises reliably ETL their data.

ETL Testing

ETL might sound straightforward, but it is a complex process. Underpinning the success of any data integration effort, regardless of the method used, is the need to test for accuracy. A 2015 study revealed that U.S. organizations believe 32 percent of their data is inaccurate. ETL testing essentially involves the verification and validation of data passing through an ETL pipeline. After all, if the data that ends up in the target systems is not accurate, neither are the decisions and analyses based on that information.

Some of the issues ETL testing aims to address include:

  • Data validation
  • Constraint validation: ensuring the constraints for tables are properly defined
  • Data completeness: checking all expected data has been loaded
  • Data correctness: making sure the data has been accurately recorded
  • Date validation: ensuring dates are valid and in the expected format
  • Data cleanliness: removing unnecessary columns
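
A few of the checks listed above can be expressed as small functions. The sketch below is one possible shape, not a standard API: the function names and the `%Y-%m-%d` date format are assumptions chosen for illustration.

```python
from datetime import datetime

def missing_keys(source_keys, target_keys):
    """Data completeness: keys present in the source but absent from the target."""
    return sorted(set(source_keys) - set(target_keys))

def duplicate_keys(keys):
    """Constraint validation: values that would violate a uniqueness constraint."""
    seen, dupes = set(), set()
    for key in keys:
        if key in seen:
            dupes.add(key)
        else:
            seen.add(key)
    return sorted(dupes)

def invalid_dates(values, fmt="%Y-%m-%d"):
    """Date validation: values that do not parse in the expected format."""
    bad = []
    for value in values:
        try:
            datetime.strptime(value, fmt)
        except (TypeError, ValueError):
            bad.append(value)
    return bad
```

Each function returns the offending values rather than a bare pass/fail flag, which makes a failed check easier to investigate.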

ETL Testing Best Practices

ETL testing can be quite time-consuming, and as with any testing effort, it’s important to follow some best practices to ensure fast, accurate, and optimal testing.

Exhaustive Data Validation

One of the main drivers of data quality is the idea that data should be complete, meaning all the data required to meet business demand is available in the data resource. Testers must use tools that can exhaustively validate data at all intermediate stages in the ETL process to ensure data completeness. Much of this testing can be automated.
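
One way to validate data at each intermediate stage is to compute the same summary at every stage and compare it with the first. The sketch below assumes the stages are available as lists of dictionaries and that the transformations are expected to preserve row counts and totals, which will not hold for every pipeline; `profile` and `compare_stages` are hypothetical names.

```python
def profile(rows, key, measure):
    """Summarize a stage: row count, distinct keys, and total of a numeric column."""
    return {
        "rows": len(rows),
        "distinct_keys": len({r[key] for r in rows}),
        "total": sum(r[measure] for r in rows),
    }

def compare_stages(stages, key, measure):
    """Compare every stage with the first one and return the mismatches found."""
    names = list(stages)
    baseline = profile(stages[names[0]], key, measure)
    problems = []
    for name in names[1:]:
        current = profile(stages[name], key, measure)
        for metric, expected in baseline.items():
            if current[metric] != expected:
                problems.append((name, metric, expected, current[metric]))
    return problems
```

Monetary values are best compared as integers (for example, cents) so that floating-point rounding does not produce false mismatches.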

Conduct Business Test Cases

These tests aim to verify that the data fulfills the mission-critical business requirements. The tests check whether data has been mapped appropriately during the transformation stage according to the business rules defined by the specific enterprise using ETL to integrate its data.
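
A business test case can be written as a table of source values paired with the target values the business expects. The tier rule below is invented purely for illustration; a real suite would encode the rules your enterprise has actually defined.

```python
def map_tier(annual_spend):
    """Hypothetical business rule: derive a customer tier from annual spend."""
    if annual_spend >= 10_000:
        return "gold"
    if annual_spend >= 1_000:
        return "silver"
    return "bronze"

# Each case pairs a source value with the value the business expects in the target.
BUSINESS_CASES = [
    (0, "bronze"),
    (999, "bronze"),
    (1_000, "silver"),
    (9_999, "silver"),
    (10_000, "gold"),
]

def run_business_cases(transform, cases):
    """Return (input, expected, actual) for every case the transformation gets wrong."""
    failures = []
    for source_value, expected in cases:
        actual = transform(source_value)
        if actual != expected:
            failures.append((source_value, expected, actual))
    return failures
```

Placing cases on the boundaries of each rule (999 versus 1,000 here) tends to catch off-by-one mapping errors.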

Automate

Increased automation is at the heart of all modern software development and testing processes, and ETL testing is no exception. Enterprises are looking to shift to Agile ETL testing, in which data issues are addressed on the fly, mostly automatically, without interrupting the data integration process.

Enterprises should be looking to create automated test scripts for much of the ETL process, including validations and metadata testing. There will of course be a need to create manual test cases from time to time, but much of the focus needs to be on automation so that ETL testing is fast and optimized. The data pipeline as-a-service concept plays an important role in enabling more automation.
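
As a rough sketch of what an automated test script might look like, the example below uses Python's built-in `unittest` module to run one metadata test and one validation test. The `orders` table, the expected schema, and the `connect_target` helper are hypothetical; an in-memory SQLite database stands in for the real target system.

```python
import sqlite3
import unittest

# Assumed column names and declared types for the hypothetical target table.
EXPECTED_SCHEMA = {"order_id": "INTEGER", "customer": "TEXT", "amount_cents": "INTEGER"}

def connect_target():
    """Stand-in for a connection to the target system, with one sample row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT NOT NULL, amount_cents INTEGER)"
    )
    conn.execute("INSERT INTO orders VALUES (1, 'Alice', 1050)")
    return conn

class OrdersTableTests(unittest.TestCase):
    def setUp(self):
        self.conn = connect_target()

    def tearDown(self):
        self.conn.close()

    def test_metadata_matches_expected_schema(self):
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        rows = self.conn.execute("PRAGMA table_info(orders)")
        actual = {row[1]: row[2] for row in rows}
        self.assertEqual(actual, EXPECTED_SCHEMA)

    def test_amounts_are_not_negative(self):
        query = "SELECT COUNT(*) FROM orders WHERE amount_cents < 0"
        self.assertEqual(self.conn.execute(query).fetchone()[0], 0)

def run_suite():
    """Run the tests programmatically, e.g. from a scheduler or CI job."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(OrdersTableTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_suite()` after each load and failing the job when `wasSuccessful()` is false is one simple way to make these checks part of the pipeline itself.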

Use The Right ETL Tool

In the context of testing, your company’s chosen ETL tool should have the capability to report all invalid data. This can help ensure no such data makes it into the target system.
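
If you hand-code part of the pipeline, the same reporting capability can be approximated by separating valid rows from rejected ones and recording why each was rejected. The validation rules and field names below are assumptions made for the example.

```python
def validate_row(row):
    """Return the reasons a row is invalid; an empty list means it is valid."""
    reasons = []
    if not row.get("customer"):
        reasons.append("missing customer")
    try:
        if float(row.get("amount", "")) < 0:
            reasons.append("negative amount")
    except (TypeError, ValueError):
        reasons.append("amount is not numeric")
    return reasons

def partition(rows):
    """Split rows into those safe to load and a report of rejected rows."""
    valid, rejected = [], []
    for number, row in enumerate(rows, start=1):
        reasons = validate_row(row)
        if reasons:
            rejected.append({"row_number": number, "row": row, "reasons": reasons})
        else:
            valid.append(row)
    return valid, rejected
```

Only the `valid` list would be passed on to the load step, while `rejected` can be written to a log or a quarantine table for review.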

Test For Speed

The speed of the ETL process is arguably as important as data accuracy. Enterprises can’t afford time-consuming ETL when their BI analysts need ready access to data for use with analytics tools. Speed should be tested by ensuring that the data loads into the target system within a given timeframe, and any inefficiencies should be removed to speed up the process.
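
A basic speed test times the load and fails when it exceeds an agreed budget. In this sketch, `timed` and `assert_within` are hypothetical helpers, and the budget itself is something each team would have to set from its own requirements.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def assert_within(elapsed_seconds, budget_seconds, label="load"):
    """Raise AssertionError if a step took longer than its time budget."""
    if elapsed_seconds > budget_seconds:
        raise AssertionError(
            f"{label} took {elapsed_seconds:.2f}s, budget is {budget_seconds:.2f}s"
        )
```

For example, `_, elapsed = timed(run_load)` followed by `assert_within(elapsed, 900)` would flag a load that runs longer than 15 minutes, where `run_load` is whatever function performs your load.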

Closing Thoughts

ETL is one of the oldest forms of data integration, and it is still relevant for modern enterprises looking to become data-driven. However, the ETL process has shifted with the emergence of cloud computing and data pipelines offered as-a-service. Regardless of the method your enterprise uses to ETL data into a target data warehouse or other analytical store, it’s imperative to fully test your ETL to ensure accuracy, consistency, and speed.

Ronan writes about technical IT topics, including cybersecurity, software development, and cloud computing. Visit: http://ronanthewriter.com/blog/