Data Engineering and Neutronian Certification

Vivek Vijayan
affinityanswers-tech
3 min read · Jan 7, 2021

When purchasing third-party data, buyers place a high level of trust in the provider. Without standardized measurements, buyers have no objective basis for deciding whether to trust a data provider. Affinity Answers was recently certified by Neutronian, a team of independent data quality measurement experts. Read more about why we did this exercise here and about the certification itself here. The certification exercise involved teams from Product Management, Legal, and Technology. Let’s dive into the aspects of the certification that Data Engineering dealt with.

Dataset Characteristics

What matters is the consistency and completeness of data, both sourced and processed. For this exercise we shared with Neutronian a week’s worth of data from each of the past three to four months; for example, seven days' worth of data collected from the first Monday of every month for the past four months, shared without any pruning. This enables the data quality measurement experts to analyze aspects such as:

  • Duplication of records and effectiveness of de-duplication
  • Consistency of data formats, for example, whether timestamps use one format and one common time zone
  • Consistency of data used for machine learning
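The sampling scheme and the first two checks above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the procedure Neutronian or Affinity Answers used: the function names, the `event_time` field, and the choice of key fields for detecting duplicates are all hypothetical.

```python
from __future__ import annotations

from collections import Counter
from datetime import date, datetime, timedelta


def first_monday(year: int, month: int) -> date:
    """Return the first Monday of the given month."""
    first = date(year, month, 1)
    # weekday(): Monday is 0, so the offset below is between 0 and 6 days.
    return first + timedelta(days=(7 - first.weekday()) % 7)


def sample_windows(year: int, month: int, months: int = 4) -> list[tuple[date, date]]:
    """Seven-day windows starting on the first Monday of each of the
    `months` months ending with (year, month), oldest first."""
    windows = []
    for back in range(months - 1, -1, -1):
        # Step back `back` months, wrapping across year boundaries.
        index = year * 12 + (month - 1) - back
        start = first_monday(index // 12, index % 12 + 1)
        windows.append((start, start + timedelta(days=6)))
    return windows


def duplicate_ratio(records: list[dict], key_fields: tuple[str, ...]) -> float:
    """Fraction of records that repeat an already-seen key (hypothetical key)."""
    if not records:
        return 0.0
    counts = Counter(tuple(r[f] for f in key_fields) for r in records)
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(records)


def non_utc_timestamps(records: list[dict], field: str = "event_time") -> list[str]:
    """Return timestamp strings that are not timezone-aware ISO 8601 in UTC."""
    bad = []
    for r in records:
        raw = r[field]
        try:
            parsed = datetime.fromisoformat(raw)
        except ValueError:
            bad.append(raw)  # not parseable as ISO 8601 at all
            continue
        # Naive timestamps have utcoffset() None, so they are flagged too.
        if parsed.utcoffset() != timedelta(0):
            bad.append(raw)
    return bad
```

Running `duplicate_ratio` on the raw sample and again on the de-duplicated output gives a rough measure of how effective de-duplication is, and a non-empty result from `non_utc_timestamps` points at records whose time format or zone is inconsistent.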

Data Storage and Protection

What matters here is basic but important: where data is stored, how it is protected, and how long it is retained.

  • Securing data storage in terms of who has access, and adoption of the principle of least privilege
  • Access information and storage of access logs, especially where PII (Personally Identifiable Information) is kept in the logs for non-repudiation
  • Mechanism for purging data; some data cannot be stored beyond a particular time frame even if it is (pseudo-)anonymized
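As an illustration of the purging point, here is a minimal sketch of a retention check in Python. It assumes each record carries a timezone-aware `collected_at` datetime and that the retention limit is expressed in days; both the field name and the function are hypothetical, and a real purge would act on the storage system itself rather than on an in-memory list.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone


def split_by_retention(
    records: list[dict],
    retention_days: int,
    now: datetime | None = None,
    field: str = "collected_at",
) -> tuple[list[dict], list[dict]]:
    """Split records into (keep, purge) based on a retention window.

    A record is purged when its timestamp is older than `retention_days`
    before `now`. Timestamps must be timezone-aware datetimes.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    keep, purge = [], []
    for record in records:
        # Records exactly at the cutoff are still within the window.
        (keep if record[field] >= cutoff else purge).append(record)
    return keep, purge
```

Passing `now` explicitly keeps the check reproducible, which helps when the same purge decision has to be demonstrated to an assessor later.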

System Configuration and Operation

This part resembled the good old CMMI process assessment. Some of the aspects the assessment covered were:

  • Change management, for example the deployment process for the software we develop
  • Identity and access management, both internally and externally
  • Vulnerability management, such as applying security patches

What did we learn from this exercise?

  • This is a test of your best practices. The certification is not akin to FDA approval of a drug, where data buyers queue up to buy your data soon after certification. However, it tells you where you stand and where you need to catch up. In our case, we were pleasantly surprised by some findings where we exceeded expectations, and it was a good lesson to learn that we were missing a few things that were simple yet important.
  • Be transparent. Data shared with the certification organization for this exercise should not be tampered with to make it appear “perfect”. No one is perfect; it is better to know where you stand. And do not fudge your practices. For example, one of our products had no test suites, which was not an omission but followed from that product's own peculiarities. We said so, and the data quality experts at Neutronian appreciated the transparency: we should do things only if they are necessary, not because the whole world is doing them.
  • Learn how your architecture and design can be made more efficient. As a Data Engineering team, we were impressed with the architecture and design suggestions Neutronian provided, which we thought went beyond what a certification organization is usually expected to do. This was perhaps the best part of the whole exercise for this team, although in the larger sense the exercise was of much greater value to Affinity Answers.

Finally, we now proudly say that we are “Certified by Neutronian”.
