Ensuring Data Validity

Wyatt Shapiro
Published in B6 Engineering
Jan 11, 2021

A principal part of data engineering is to provide valid data to users. When ingesting data from APIs, files, or other data feeds, it is essential to check that the data conforms to what you expect.

It is difficult to trust an application if it supplies questionable data. The same goes for a predictive analysis that relies on questionable data; hence the saying “garbage in, garbage out”. With data verification checks in place, we can maintain with confidence that our data at scale is good (and getting better), rather than treating quality as an afterthought.


So how can we improve the quality of data?

Let’s dive into an example within the commercial real estate industry. A core product offering for a brokerage could be an application that displays recent commercial transactions and associated buyers and sellers. This would allow a broker to identify buyers who are active and contact them in order to pitch a listing.

During the ingestion of transaction data and contact data from public sources or third parties, we would want to ensure that a buyer’s contact info is correct before presenting it to the user. We could first check that the phone number has the expected number of digits for the associated country. Then, we could check that the phone number has a valid area code. Although these are helpful steps toward ensuring valid data, they are not sufficient. Even with the correct number of digits and a valid area code, the phone number provided may not be in service, might be a general company line without a direct extension, or may not be the preferred number for that person! The phone number would pass every test a machine can run, yet still fail to reach the desired individual.

Even if a piece of data conforms to the schema and the unique characteristics of a field, it does not mean it is valid to a human.

Therefore, we believe that data validity can be distilled into two distinct parts: “Machine Validation” and “Human Validation”.

What does “Machine Validation” mean?

“Machine Validation” is a process that can be employed by a machine to ensure data follows certain field constraints. This can be as simple as type checking, or as involved as identifying and fixing typos in email domains using known mail servers. Here is a non-exhaustive list:

Email validations

  • Address conforms to a standard email format (ex. Local part, @ symbol, and domain are all present)
  • Typos in known mail domains are identified and fixed (ex. gmial.com becomes gmail.com)

Phone validations

  • Only contains numbers
  • Extensions are identified and separated
  • 10 digits are present for US numbers
  • Area code is valid in US

Property validations

  • Street address contains matching zip code
  • Measurements are within an appropriate limit (ex. Building height should be at least 0 feet but no more than 2,717 feet)
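
To make this concrete, here is a minimal sketch of what a few of these rule-based checks could look like in Python. The area code table and function names are illustrative placeholders, not our production rules:

```python
import re

# Illustrative stand-in for a full table of valid US (NANP) area codes
VALID_US_AREA_CODES = {"212", "332", "646", "718", "917"}

def machine_validate_us_phone(phone: str) -> bool:
    """Digits only, exactly 10 of them, and a recognized area code."""
    digits = re.sub(r"\D", "", phone)  # strip formatting characters
    return len(digits) == 10 and digits[:3] in VALID_US_AREA_CODES

def machine_validate_building_height(height_ft: float) -> bool:
    """Height should be at least 0 feet but no more than 2,717 feet."""
    return 0 <= height_ft <= 2717

machine_validate_us_phone("(212) 555-0123")  # True: passes every machine check
machine_validate_building_height(-10)        # False: fails the range constraint
```

Of course, a number that clears all of these checks can still be out of service or unwanted, which is exactly where human validation picks up.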

What does “Human Validation” mean?

“Human Validation” is a process that can be employed by a human to ensure data is accurate and actionable. This can include many different (often subconscious) rules applied to data such as:

Contact Info (Email/Phone/Mailing Address) validations

  • Contact info is up to date (ex. Person has left a company and now receives email elsewhere)
  • Contact info is the preferred method of communication (ex. Office number should be called before cell)
  • Contact info does not contain a typo (ex. Email is incorrect because two characters are swapped)
  • Person does not want to be contacted

Property validations

  • Listed legal entity on the transaction is not the true owner; another company or person actually owns the property
  • Property type is currently residential but zoned in an area that would make it suitable for a development sale

We realize that Machine and Human Validations have started to converge in some ways. Years ago, it may have been very difficult for a machine to validate that a typo occurred in an email domain or to identify that a phone number is disconnected, but current machines are covering more and more ground. However, we believe it is still useful to separate these two processes in order to maximize data quality. Even if data is flagged as ‘invalid’ by a machine, we may still want to pass it along, because a human could fix it!

How do we ensure data quality at B6?

Here are some of the tools and processes we have actually applied to help improve the quality of our data within our Machine and Human Validation framework:

Machine validations

  • Validate patterns in various fields using Regex
  • Normalize phones using python-phonenumbers
  • Normalize address formats using Scourgify
  • Normalize emails using python-email-validator
  • Fix common misspellings of email domains using pymailcheck
  • Remove previously ‘invalid’ (bounced or unsubscribed) emails using email client’s Suppression List
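
As a rough illustration, a normalization pass built on a few of these libraries might look like the sketch below. This is a simplified pipeline with made-up sample inputs, not our production code:

```python
import phonenumbers
import pymailcheck
from email_validator import validate_email, EmailNotValidError

def normalize_phone(raw: str, region: str = "US") -> str | None:
    """Parse a phone number and normalize it to E.164, or return None if invalid."""
    try:
        parsed = phonenumbers.parse(raw, region)
    except phonenumbers.NumberParseException:
        return None
    if not phonenumbers.is_valid_number(parsed):
        return None
    return phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164)

def normalize_email(raw: str) -> str | None:
    """Fix common domain typos, then validate and normalize the address."""
    suggestion = pymailcheck.suggest(raw)  # dict of suggestions, or False if none
    candidate = suggestion["full"] if suggestion else raw
    try:
        # Syntax-only check; pass check_deliverability=True to also query DNS
        result = validate_email(candidate, check_deliverability=False)
    except EmailNotValidError:
        return None
    return result.normalized  # use result.email on email-validator < 2.0

normalize_phone("(212) 555-0123")      # '+12125550123'
normalize_email("jane.doe@gmial.com")  # 'jane.doe@gmail.com'
```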

Human validations

  • Identify preferred email/phone/address using boolean verification fields in Salesforce (that expire if activity has not been detected in months; see the sketch after this list)
  • Identify Primary Owner for each property using boolean field (that reset after a verified sale)
  • Maintain Do Not Send email list that users can add to manually
  • Improve true property ownership and transaction details using our Weekly Transfer Verification funnel (check out a previous post detailing this)
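
For instance, an expiring verification flag could be modeled along the lines of the sketch below (hypothetical field names and window; in practice these live on our Salesforce records):

```python
from datetime import datetime, timedelta

# Illustrative expiry window for "activity has not been detected in months"
VERIFICATION_TTL = timedelta(days=180)

def is_still_verified(verified: bool, last_activity: datetime) -> bool:
    """A human-set verification flag only holds while recent activity backs it up."""
    return verified and (datetime.now() - last_activity) < VERIFICATION_TTL
```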

Where could we go next?

In the future, it would be great to design a suite of systems or tools that can evaluate and verify data sources more holistically. This could help quantify the reliability of each data source and help our machines and humans prefer trustworthy sources. We could also get more granular by tracking reliability for each field within a source, to catch cases where one field from a data source is dependable while another is not. Evaluating data quality over time would also tell us whether to rely on a source more or less. Finally, we could envision applying a metric to quantify human validations, which could open the door to identifying the most discerning business users interacting with the data.
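
As a starting point, field-level reliability could be as simple as the share of a source's values that survive validation. The function below is a hypothetical sketch of such a metric:

```python
from typing import Callable

def field_reliability(records: list[dict], field: str,
                      is_valid: Callable[[object], bool]) -> float:
    """Fraction of a source's non-missing values for `field` that pass validation."""
    values = [r[field] for r in records if r.get(field) is not None]
    if not values:
        return 0.0  # no data to judge, so no credit
    return sum(1 for v in values if is_valid(v)) / len(values)

# ex. score a source's phone column with the earlier rule-based check:
# field_reliability(source_records, "phone", machine_validate_us_phone)
```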
