Automated Testing and Production Data

Thomas Packer, Ph.D.
Published in TP on CAI
3 min read · Oct 22, 2019

Say you develop software that relies on data. Not an uncommon situation. It is good practice to get that data cheaply, e.g. as a byproduct of the software being used naturally in production. So, say, your production system generates data, either from users entering data or by recording the behavior of users. When you go back to maintain this software or develop new features for it, test-driven development is a good practice to adopt from software engineering, even if you are a data scientist. But the question arises sooner or later: should those tests rely on data that appears in the production environment? I originally thought yes, and I was dogmatic about the test data matching production data exactly. Not any more.

Photo by Science in HD on Unsplash

The following is a monologue that came out of a discussion on a certain work project that leveraged certain data (dictionaries) that are periodically updated in production. It is also about automated unit testing and test-driven development, which is a best practice I like to follow. The dilemma is this: do we automatically update data in dev and QA environments from changes that happen in production, when those changes might affect automated testing?

My Recommended Policy

Automatically copy production data from the production environment for “pre-production” or “prod parallel” testing in one environment (call it Staging or UAT). Do not automatically copy production data to the dev or QA environments. You can manually create test cases from production data when necessary. If we find that data is missing in QA/UAT, it should be added there and propagated to production. Production users should certainly continue to update the data in production.

More Details

I have revised my opinion about making dev and QA environments match production, after some further reading and thinking. There is a practice called “pre-production” or “production parallel” testing which should be adopted in some software engineering and data science projects that are rolled out to production. It is a good idea because it lets us catch bugs in new code that only show up with production data. However, it requires only a single pre-production environment, e.g. Staging or UAT.
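One way to picture production parallel testing is as a comparison of the current code and the candidate code on the same production-like records. The following is only a minimal sketch under that assumption; `parallel_run`, `current_fn`, and `candidate_fn` are hypothetical names, not part of any project discussed here.

```python
def parallel_run(records, current_fn, candidate_fn):
    """Run both versions on every record and collect the disagreements."""
    mismatches = []
    for record in records:
        expected = current_fn(record)
        try:
            actual = candidate_fn(record)
        except Exception as exc:
            # A crash on real data is also a finding worth reporting.
            actual = f"raised {exc!r}"
        if actual != expected:
            mismatches.append((record, expected, actual))
    return mismatches


# Toy usage: the candidate changes capitalization, which only shows up
# on records that are not already capitalized.
records = ["alice", " Bob "]
report = parallel_run(records, str.strip, lambda s: s.strip().title())
```

In this toy run `report` holds a single entry for `"alice"`. In a real Staging or UAT environment the records would come from the automatically copied production data.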

I don’t think automated unit testing should rely on production files that will be updated automatically. I’m not even sure that manual QA testing should be performed on data that matches production data exactly.

Two reasons:

  1. Production data will not always be complete enough to test all the edge cases we need to test before deploying code to production. I have seen this to be the case in the past with smaller production data files. Therefore, the dev and QA environment data should be allowed to be different from prod data so it can be more complete.
  2. My original concern was that if QA and especially DEV data is overwritten automatically and frequently, our tests are shooting at a moving target. They may break even when there are no bugs simply because the data changed in unexpected ways. Therefore, the DEV and QA environment data should not be updated automatically so test cases don’t need to be rewritten frequently.
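Both reasons point toward unit tests that run against a small fixture kept in version control instead of a live copy of production data. Here is a rough sketch of what that could look like, written as pytest-style test functions; the dictionary contents and the `expand` function are invented for illustration.

```python
# Hypothetical fixture, checked into the repo and changed only by hand.
# It can include edge cases that production data has never contained.
FIXTURE_DICTIONARY = {
    "nyc": "New York City",
    "sf": "San Francisco",
    "n/a": "not applicable",  # edge case: punctuation in the key
}


def expand(term, dictionary):
    """Return the expansion of term, or term unchanged if it is unknown."""
    return dictionary.get(term.strip().lower(), term)


def test_known_abbreviation_is_expanded():
    assert expand("NYC", FIXTURE_DICTIONARY) == "New York City"


def test_surrounding_whitespace_is_ignored():
    assert expand("  sf ", FIXTURE_DICTIONARY) == "San Francisco"


def test_punctuation_edge_case():
    assert expand("N/A", FIXTURE_DICTIONARY) == "not applicable"


def test_unknown_term_passes_through():
    assert expand("LA", FIXTURE_DICTIONARY) == "LA"
```

Because the fixture only changes when someone edits it deliberately, a failing test here points at the code and not at a data refresh.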

Production data is valuable for testing in these environments, but should probably be copied selectively and manually.
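Selective, manual copying could be as simple as a small script that a developer runs on purpose, naming the records they want. This is a sketch only: the function names, the shape of the production export (a plain dict), and the JSON fixture format are all assumptions.

```python
import json


def select_test_cases(prod_records, wanted_keys):
    """Copy only the named records out of a production export."""
    missing = [key for key in wanted_keys if key not in prod_records]
    if missing:
        raise KeyError(f"not found in production export: {missing}")
    return {key: prod_records[key] for key in wanted_keys}


def write_fixture(cases, path):
    """Save the selected cases as a reviewable, diff-friendly JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cases, f, indent=2, sort_keys=True, ensure_ascii=False)


# Toy usage: pick one interesting record out of a larger export.
prod_export = {"nyc": "New York City", "sf": "San Francisco", "la": "Los Angeles"}
cases = select_test_cases(prod_export, ["la"])
```

The resulting file can go through code review like any other change, so the test data moves only when someone decides it should.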

My recommendation:

Automate the copying of production data to UAT. Do not automate copying of production data to QA or Dev. I could be persuaded to automatically copy data from PROD to QA, but I’m not currently a fan. Get the input of your manual QA testers.

There is also a good discussion here: https://sqa.stackexchange.com/questions/5737/copying-production-data-to-a-qa-environment


I do data science (QU, NLP, conversational AI). I write applicable-allegorical fiction. I draw pictures. I have a PhD in computer science and I love my family.