Data Veracity and Trusting our Test Data
Inaccurate and manipulated information threatens to compromise the insights companies rely on to plan, operate, and grow. Unverified data is a new type of vulnerability, and one that every business leveraging digital technologies must address. — Accenture’s 2018 Tech Vision
When collating data for Glasswall's Core Team Dashboard, I noticed the sample data passed into the dashboard was unbalanced and inaccurate. This led me on a data journey.
Why is this relevant to Glasswall?
We at Glasswall rely on web-scraped test data to calculate file throughput and manage rates. This is NOT a true reflection of the engine's performance out in the real world, because the data can often be outdated and unverified.
How can we also be sure that the constant use of this test data hasn't led to accidental manipulation of the data itself, feeding inaccurate information into our decision making?
The presence of bad data in a system isn’t always the result of malicious intent, but may be a sign that a process isn’t working the way it was intended.
How can we verify that the test data we collate is leading us to develop the correct solutions? Bad test data produces inaccurate results, and inaccurate results lead to bad business decisions. We may be heading towards a point where data is shaping the Glasswall Core Team's decisions about which work gets prioritised, e.g. which important bugs get fixed based on customer data. This could be more dangerous to a company than a data breach, because it can send serious funds and resources in a direction that provides no value, e.g. an oil company being told by its data to drill in a location with no oil.
Currently we web-scrape for files, but this practice does not guarantee that the files in our specification-conforming test sets are non-malicious.
So what are possible solutions to this problem?
Methods to prevent unintentional data manipulation, for example a lock on files once they are obtained or once a non-conforming file has been deliberately created (see the first sketch after this list).
Timestamps on collected data, so that new 'live' data keeps flowing through, the latest files are used for the tests that need them, and older data is still available when the need arises.
Multiple, automated validation checks when obtaining new test data. This lets us verify good data and eliminate bad data, or in our case categorise files as conforming or not conforming to the file specification (see the second sketch after this list).
Securing test data in storage that can scale, be backed up, and be accessed quickly, such as AWS S3.
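The first two ideas can be as simple as a hash-and-timestamp manifest. The sketch below is a minimal illustration rather than our actual tooling: the manifest path and function names are hypothetical, and a SHA-256 digest stands in for the 'lock' described above.

```python
import hashlib
import json
import time
from pathlib import Path

MANIFEST = Path("test_data_manifest.json")  # hypothetical location

def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large samples never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def record_files(data_dir: Path) -> None:
    """'Lock' each collected file by recording its hash and collection time."""
    manifest = {
        str(path): {"sha256": sha256_of(path), "collected_at": time.time()}
        for path in data_dir.rglob("*") if path.is_file()
    }
    MANIFEST.write_text(json.dumps(manifest, indent=2))

def verify_files() -> list[str]:
    """Return the files whose contents no longer match the recorded hash."""
    manifest = json.loads(MANIFEST.read_text())
    return [
        name for name, meta in manifest.items()
        if not Path(name).is_file() or sha256_of(Path(name)) != meta["sha256"]
    ]
```

Running verify_files() before each test run flags anything modified since collection, and the collected_at timestamps make it easy to prefer the freshest files, or deliberately reach back for older ones when needed.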
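The validation and storage ideas combine naturally: categorise each newly scraped file as it arrives and store it under a bucket prefix that reflects the outcome. Again, this is a sketch under stated assumptions: it uses boto3 with an illustrative bucket name, and conforms_to_spec is a placeholder for the real validation step, not an existing Glasswall API.

```python
import os
import boto3  # assumes AWS credentials are already configured in the environment

def conforms_to_spec(path: str) -> bool:
    """Placeholder: swap in the real check against the file-format specification."""
    raise NotImplementedError("plug the actual validation step in here")

def ingest(paths: list[str], bucket: str = "example-test-data") -> None:
    """Categorise newly scraped files and upload them under separate prefixes."""
    s3 = boto3.client("s3")
    for path in paths:
        prefix = "conforming" if conforms_to_spec(path) else "non-conforming"
        # Keying objects by category lets a test suite pull exactly the set it
        # needs, while S3 handles the scale and backup concerns mentioned above.
        s3.upload_file(path, bucket, f"{prefix}/{os.path.basename(path)}")
```

Automating this at the point of collection means bad or non-conforming files are separated out before they can skew any throughput or manage-rate figures.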
There are many solutions to an issue such as this.
Data Integrity is Crucial
Thirty-five years ago, Soviet watch officer Stanislav Petrov jumped out of his chair. According to the satellite system he was monitoring on September 26, 1983, the United States had launched a nuclear missile at the Soviet Union. Protocol dictated that Petrov notify Soviet leaders, who would order an immediate counterattack.
Fortunately for the world, Petrov wasn’t convinced that the alerts were true. He didn’t notify his superiors, thereby preventing a global catastrophe. With no other alerts to show such attacks were underway, Petrov knew that the data the system was showing didn’t match what was expected. That, combined with his understanding of the risks if he followed protocol, informed his ultimate decision.
The Soviets later determined that their satellites had confused the reflection of sunlight off clouds for a missile launch. By questioning the validity of data, Stanislav Petrov had saved the world from nuclear disaster.