DATA ENGINEERING, AUTOMATION TESTING
Common Database Testing Problems
In an earlier post in my series on automation testing, I talked about the broad categories of automation tests to write for databases and warehouses. I’ll go over specific types of tests in a later post in this series. For now, you can get the gist of it here.
This post covers the common problems data engineers face when testing databases and warehouses. Most of the problems I’ll discuss I have experienced first-hand; the rest I learned about through discussions with my awesome friends and colleagues in the data engineering space. Let’s get started.
Too Many Calls
When you have to run thousands of tests against your database or warehouse, consider not making one call per test, especially if you’re testing on a database instance that other people are using. Too many connection requests mean extra work for the connection manager. That alone is rarely a huge problem, but it adds to the overhead, as it takes time to establish a connection, check grants and privileges, and so on.
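One way to cut the per-test overhead is to open a single connection and run every check over it. Here is a minimal sketch using Python’s built-in sqlite3 as a stand-in for the real database; the table, column names, and `run_check` helper are all hypothetical, but the same pattern applies to any DB-API driver:

```python
import sqlite3

# Hypothetical setup: an in-memory SQLite database stands in for the warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 20.0)])

def run_check(sql, expected):
    # Every check reuses the one open connection instead of dialing the
    # database again, avoiding per-test connection overhead.
    (actual,) = conn.execute(sql).fetchone()
    return actual == expected

results = [
    run_check("SELECT COUNT(*) FROM orders", 2),
    run_check("SELECT SUM(amount) FROM orders", 30.0),
]
```

In production you’d typically get the same effect from a connection pool provided by your driver or ORM rather than a hand-rolled global connection.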
Grouping of Tests
The main problem with too many database calls is the network. If your data centre is far away, a single round trip to the database can cost over a quarter of a second. This can really hurt when you’re working on scaling your application. Although you can’t always do it, try to reduce the number of calls to a minimum by batching the tests. Wherever batching isn’t possible, look at reusing open database connections to reduce the connection overhead.
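Batching often means folding several checks into one query so that a single round trip answers all of them. A small sketch of the idea, again using sqlite3 as a stand-in (the `events` table and the specific checks are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, ts TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(1, "2021-01-01"), (2, "2021-01-02"), (2, "2021-01-03")])

# One round trip evaluates three checks at once instead of three calls.
row_count, null_ids, dup_ids = conn.execute("""
    SELECT COUNT(*),
           SUM(CASE WHEN id IS NULL THEN 1 ELSE 0 END),
           COUNT(id) - COUNT(DISTINCT id)
    FROM events
""").fetchone()

checks = {
    "has_rows": row_count > 0,
    "no_null_ids": null_ids == 0,
    "no_duplicate_ids": dup_ids == 0,  # fails here: id 2 appears twice
}
```

Three network round trips collapse into one, which matters far more than the marginal query cost when the data centre is far away.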
Architect your testing infrastructure so that it’s easy to group tests, for instance by tags. You should be able to run a specific group of tests at any given time without having to run all of them. You shouldn’t have to comment tests out to skip them; that invites errors, as I have found out the hard way.
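Tag-based grouping can be as simple as a registry that a decorator populates; you then run only the group you ask for. A minimal sketch (the tag names and test functions are invented for illustration):

```python
from collections import defaultdict

REGISTRY = defaultdict(list)

def tagged(*tags):
    # Decorator that registers a test function under one or more tags.
    def wrap(fn):
        for tag in tags:
            REGISTRY[tag].append(fn)
        return fn
    return wrap

@tagged("smoke", "schema")
def check_table_exists():
    return True

@tagged("nightly")
def check_row_counts():
    return True

def run(tag):
    # Run only the requested group; nothing gets commented out.
    return {fn.__name__: fn() for fn in REGISTRY[tag]}
```

If you use pytest, its built-in markers (`@pytest.mark.smoke` plus `pytest -m smoke`) give you the same ability without rolling your own registry.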
Referring to the earlier point about network latency, see the inter-region latency matrix published for the different AWS regions.
Cost from Overkill
An analogy from the SRE world is relevant here. To ensure the reliability of systems, there was a movement to capture every possible metric, analyze every possible log line, trigger an enormous number of alerts based on that analysis, and expect engineers to respond to them all. The same lesson applies to database testing: you don’t want important failures and markers to get lost in the noise of petty failures. Assign priorities, issue alerts on critical test failures, and look at the rest later.
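Triaging by priority can be sketched in a few lines: attach a severity to each test result, alert only on critical failures, and queue the rest for later review. The severity labels and test names below are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    name: str
    passed: bool
    severity: str  # "critical" or "minor"

def triage(results):
    # Alert immediately on critical failures; park minor ones for review.
    alerts = [r.name for r in results if not r.passed and r.severity == "critical"]
    backlog = [r.name for r in results if not r.passed and r.severity == "minor"]
    return alerts, backlog

results = [
    TestResult("pk_unique", passed=False, severity="critical"),
    TestResult("naming_convention", passed=False, severity="minor"),
    TestResult("row_count", passed=True, severity="critical"),
]
alerts, backlog = triage(results)
```

Only `alerts` should page anyone; `backlog` belongs in a dashboard or a periodic report.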
Get Rid of Unnecessary Tests
Overdoing it doesn’t guarantee reliability. Test whatever needs to be tested; everything else is good to have as long as it doesn’t hinder the testing exercise. Also, it’s important to realize that a test doesn’t have to run forever. Periodically revisit the scope of testing: in my experience, some tests stop being useful or important over time, and it’s best to remove them.
While creating test modules, you’ll end up defining the same tests over and over without realizing that you’re checking for the same thing in two different ways. Look for such patterns early on; otherwise, the test code will pile up, and a cleanup will cost as much as it did to develop the tests in the first place.
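One cheap way to surface such duplicates is to fingerprint each test’s query: normalize case, whitespace, and literals, then hash. Textually different queries that check the same thing collide. A rough sketch, with invented test names and queries; real SQL normalization is harder than this regex, so treat collisions as candidates for review, not proof:

```python
import hashlib
import re

def fingerprint(sql):
    # Normalize case and whitespace, and mask numeric literals, so that
    # superficially different queries produce the same fingerprint.
    text = re.sub(r"\s+", " ", sql.strip().lower())
    text = re.sub(r"\b\d+\b", "?", text)
    return hashlib.sha1(text.encode()).hexdigest()

def find_duplicates(tests):
    seen, dupes = {}, []
    for name, sql in tests.items():
        fp = fingerprint(sql)
        if fp in seen:
            dupes.append((seen[fp], name))
        else:
            seen[fp] = name
    return dupes

tests = {
    "orders_not_empty": "SELECT COUNT(*) > 0 FROM orders",
    "orders_have_rows": "select count(*) > 0  from orders",
}
dupes = find_duplicates(tests)
```

Running this over your whole test catalogue once in a while keeps the pile from growing in the first place.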
The Practical Test Pyramid
by Ham Vocke
Blocking Database Resources
Data testing can be highly compute-intensive, especially for data warehouses. You will probably need to run aggregations, do full table scans, sort data, and so on. Since this requires a good amount of computing resources, you shouldn’t risk running the tests in an environment that other people are using.
Impact on Read/Write Workloads
Heavy queries can bring the system to a halt, affecting everyone using the database. Ideally, you’d have a separate environment for testing. If you don’t, run the tests during off-hours and, if possible, limit the resources available to your tests so that they don’t bring your instance or cluster down.
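Most engines have their own knob for this (for example, Postgres’s `statement_timeout`, or workload queues in warehouses). To show the shape of the idea in a runnable way, here is a sketch using sqlite3’s progress handler as a crude cap on cumulative query work; the budget number is arbitrary and the cross join exists only to blow through it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Crude resource cap: abort once the whole run has burned through roughly
# BUDGET SQLite VM instructions (counted in handler-interval increments).
BUDGET = 100_000
state = {"ops": 0}

def watchdog():
    state["ops"] += 1000
    return 1 if state["ops"] > BUDGET else 0  # non-zero aborts the statement

conn.set_progress_handler(watchdog, 1000)  # invoked every 1000 VM instructions

conn.execute("CREATE TABLE n (x INTEGER)")
conn.executemany("INSERT INTO n VALUES (?)", [(i,) for i in range(1000)])

try:
    # A runaway test: a three-way cross join over a billion combinations.
    conn.execute("SELECT COUNT(*) FROM n a, n b, n c").fetchone()
    aborted = False
except sqlite3.OperationalError:
    aborted = True
```

The point isn’t this particular mechanism; it’s that a runaway test should fail fast rather than starve everyone else on the instance.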
Also, ensure that you’re not running tests in parallel against the same underlying database objects. To learn more about that, read the following introduction to transaction isolation in OLTP systems. Although isolation doesn’t exist in data warehouses in the same way, there are still limits on concurrent reads for other reasons.
Demystifying Database Systems, Part 1: An Introduction to Transaction Isolation Levels
Using Database/Warehouse Metadata Too Much
Database structure and metadata tests don’t consume many resources, but they can block your work if you overuse the database metadata. Frequently accessing metadata can take metadata locks in the database, which slow down reads and writes.
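A simple mitigation is to hit the catalog once per run and let every test consult a cached copy. A sketch using sqlite3’s `PRAGMA table_info` as the metadata source (the `users` table is invented; on a real warehouse you’d read `information_schema` instead):

```python
import sqlite3
from functools import lru_cache

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")

@lru_cache(maxsize=None)
def table_columns(table):
    # First call hits the catalog; every later call for the same table
    # is served from the cache, so no metadata locks are taken again.
    return tuple(row[1] for row in conn.execute(f"PRAGMA table_info({table})"))

cols = table_columns("users")        # catalog query
cols_again = table_columns("users")  # cache hit
```

Dozens of schema tests can then share one catalog read instead of issuing one each.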
Replication Lag on Read-Replicas
In transactional systems with replication set up, if you run insert-heavy tests on the main (master) database, the data will be replicated down to the replicas unless you have explicitly configured replication to prevent that from happening. Master databases are usually more powerful to cater to high write volumes, while read replicas are lighter on compute and memory. The read replicas may lag if they can’t keep up with the incoming changes generated by write-heavy tests on the master.
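A sensible guard is to check replica lag before unleashing write-heavy tests and back off above a threshold. The sketch below stubs out the lag probe; on MySQL you would read `Seconds_Behind_Master` from `SHOW SLAVE STATUS` (`Seconds_Behind_Source` in `SHOW REPLICA STATUS` on newer versions). The threshold and function names are assumptions:

```python
MAX_LAG_SECONDS = 30  # illustrative threshold

def should_run_write_tests(fetch_lag_seconds):
    # fetch_lag_seconds is a stand-in for querying the replica's status.
    # None means the lag is unknown (e.g. replication is broken), which
    # is treated as "do not run".
    lag = fetch_lag_seconds()
    return lag is not None and lag <= MAX_LAG_SECONDS

# Simulated probes standing in for real status queries:
ok = should_run_write_tests(lambda: 5)        # replica healthy
blocked = should_run_write_tests(lambda: 120)  # replica far behind
unknown = should_run_write_tests(lambda: None) # replication state unknown
```

Treating "unknown" the same as "too far behind" errs on the safe side for anyone reading from the replicas.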
What to Look for if Your MySQL Replication is Lagging
Cost of Serverless
Almost all major cloud platforms provide a serverless query engine on top of cheap storage like AWS S3 or Azure Blob Storage. These are commonly used for dumping raw data from multiple sources. This, essentially, becomes your data lake.
Cloud providers offer a way to query the data lake without provisioning any compute resources, which makes it a serverless query engine. Almost all such offerings are priced by the amount of data scanned, so you need to be careful about the type of tests you run against serverless query engines frequently.
It’s a good practice to allocate a budget for testing and issue alerts when the budget is close to being breached. Moreover, keep re-evaluating your testing needs as time passes; you can decrease the frequency of some tests after a period of observation.
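Tracking the budget can be as simple as accumulating bytes scanned per test and warning before the limit is hit. The budget size, warning ratio, and test names below are illustrative, not tied to any particular vendor’s pricing:

```python
class ScanBudget:
    """Track bytes scanned by tests against a budget and warn early."""

    def __init__(self, budget_bytes, warn_ratio=0.8):
        self.budget = budget_bytes
        self.warn_ratio = warn_ratio
        self.used = 0

    def record(self, test_name, bytes_scanned):
        # Returns an alert message when the budget is breached or nearly
        # so, and None while spend is comfortably inside it.
        self.used += bytes_scanned
        if self.used >= self.budget:
            return f"BUDGET EXCEEDED after {test_name}"
        if self.used >= self.warn_ratio * self.budget:
            return f"WARNING: {self.used / self.budget:.0%} of scan budget used"
        return None

budget = ScanBudget(budget_bytes=1_000_000_000_000)        # 1 TB, illustrative
msg1 = budget.record("row_count_check", 200_000_000_000)    # 0.2 TB scanned
msg2 = budget.record("full_table_dupes", 700_000_000_000)   # 0.9 TB cumulative
```

Wire the returned messages into whatever alerting channel your team already watches, so the warning arrives before the bill does.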
Readability of Tests
This problem arises when tests become as complex as the code itself, and sometimes even more so. You can avoid this by organizing the tests better.
As discussed earlier, modularizing tests helps a lot in reducing clutter. Having good visibility into what’s already being tested is central to fixing the readability problem. Your data test coverage should be well-documented and kept up to date for every team member to look up, not just for coverage but to learn about testing patterns and best practices. This goes a long way toward making the testing effort a success.
Software Testing Documentation: Types & Examples | Performance Lab
Coming Up Soon
I’ll talk more about specific automation testing anti-patterns and discuss the testing technology landscape in the upcoming posts. You can find more of my blog posts here. Reach out to me on LinkedIn if you are keen on discussing your experiences with automation testing.
We, at Servian, have helped several companies build their data systems from scratch and fix issues with existing infrastructures. You can find more about us here.