Using domain specific data to build better data processes

Twinkl Data Team
Twinkl Educational Publishers
4 min read · Jun 16, 2022

Learn about how we’ve utilised the value stored in our data assets to build a powerful email domain validation algorithm.

This article was originally published by Oscar South, Data Scientist at Twinkl, at https://www.twinkl.co.uk/blog/data-informed-domain-validation

Data-informed data pipelines allow us to keep the friction between our customers and their next teaching engagement to an absolute minimum.

At Twinkl, we help those who teach. This vision informs every decision we make, and in line with this we want to provide the least friction possible to teachers, parents and tutors who want to access our teaching resources. Typo in your email address on the sign up page because you’re in a hurry to prep for your next class? No worries, come on in.

-> all email addresses used as examples in this document are fictionalised <-

This presents us with a problem, though: we want to be able to get in touch to provide you with the best possible services and support, so that you can spend more of your valuable time on teaching. How can we get in touch with Mrs Davies to keep her up to date with the best teaching resource recommendations that are going to streamline her day, when her email address in our database is “k.davies@gmal,com”? What about Mr Jones at “rjones@9llowland.com.au” and Ms Tailor at “terri.taylor@royal.yorksch.uk”?

The challenges of detecting such varied errors are subtle, and the costs of a poor implementation could be high: we would not want to create any additional barriers for our customers, so this is a situation where Type I errors (false positives) must be minimised and where the way we apply any identified correction must be carefully considered. Both the inputs and outputs of the process are strings (email addresses), which have their own technical substructure. A semi-structured data type like this can be very messy to work with, while at the same time having a syntax that must be maintained to a technical specification. It is clearly a task that needs to be approached with both caution and creativity.
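To make that substructure a little more concrete, here is a minimal sketch in Python (purely illustrative, and not our production code) of how an address can be split into its local part and dot-separated domain labels, with a deliberately simple syntax check:

    # Illustrative sketch only: shows the substructure of an email address,
    # i.e. a local part, an "@" separator and a domain of dot-separated labels.
    import re

    # A deliberately simple syntax check; real validation is much stricter.
    EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

    def split_address(address):
        """Split an address into its local part and domain labels, or None if malformed."""
        if not EMAIL_RE.match(address):
            return None
        local, _, domain = address.rpartition("@")
        return local, domain.lower().split(".")

    print(split_address("k.davies@gmal,com"))        # None: the comma breaks the syntax
    print(split_address("rjones@9llowland.com.au"))  # ('rjones', ['9llowland', 'com', 'au'])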

How do we solve this problem? With smarter data processes, of course! At Twinkl, we recognise that the data stored in our databases is a business asset that can be used to create value for our customers, as well as informing and facilitating construction of more powerful and useful data processes. To summarise the process we built to solve this problem:

-> technical talk incoming <-

We built a proprietary hierarchical model that learns on the domains we see most regularly in our database. The model breaks each address apart into its sub-parts, then identifies which addresses are likely to be incorrect using an ensemble decision that combines similarity measures against frequently seen domains, syntax validity checks and internal domain expertise encoded into the process. Smart stuff! Throughout the process there are a number of iterative optimisation steps, whose outputs we visualise and examine to find the ‘sweet spots’ at which to fix hyperparameters.
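As a rough sketch of the similarity step only, the fragment below compares an unseen domain against a small, made-up table of frequently seen domains using Python’s standard-library difflib, with hand-picked support and similarity thresholds. The real ensemble, its similarity measures and its hyperparameters are proprietary and considerably more involved.

    # Rough sketch of the similarity step, not the proprietary ensemble.
    # The frequency table and both thresholds below are made up for illustration.
    from collections import Counter
    import difflib

    frequent_domains = Counter({"gmail.com": 120000, "hotmail.com": 80000,
                                "outlook.com": 60000, "yahoo.co.uk": 40000})

    MIN_SUPPORT = 500      # only trust domains we have seen often enough
    MIN_SIMILARITY = 0.85  # below this, leave the address alone (avoid false positives)

    def suggest_domain(domain):
        """Return a likely intended domain, or None if there is no confident match."""
        if domain in frequent_domains:
            return None  # already a well-known domain, nothing to fix
        candidates = [d for d, n in frequent_domains.items() if n >= MIN_SUPPORT]
        best = difflib.get_close_matches(domain, candidates, n=1, cutoff=MIN_SIMILARITY)
        return best[0] if best else None

    print(suggest_domain("gmal.com"))     # 'gmail.com'
    print(suggest_domain("teacher.org"))  # None: nothing similar enough, so do nothing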

The hierarchical aspect of the algorithm stems from the international nature of our business, which means customers fall into various subgroups across different regions. The model breaks the data apart into these hierarchical categories, learns and runs inside each category, and then converges back into a generalised ‘global’ form of the algorithm. The final classification and correction is selected as the results pass through a cascading layer of filters implementing a wide range of domain-specific business logic rules to prune and curate the outputs. It is in this layer of domain-specific logic that we rigorously identify and exclude any potential Type I (false positive) errors. There are additional checks and balances in the deployment process to ensure that every action we take is in line with our vision to help those who teach.
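The sketch below gives a flavour of that structure under the same caveats: per-region frequency tables with a global fallback, and a cascade of simple, invented business-rule filters where any single filter can veto a suggested correction. The real regions, rules and thresholds are internal.

    # Simplified sketch of the hierarchical + cascading-filter idea.
    from collections import Counter, defaultdict

    def build_frequency_tables(records):
        """Learn a domain frequency table per region, plus a global fallback."""
        per_region, global_table = defaultdict(Counter), Counter()
        for region, domain in records:
            per_region[region][domain] += 1
            global_table[domain] += 1
        return per_region, global_table

    def domain_table_for(region, per_region, global_table, min_rows=1000):
        """Use the regional table if it has enough data, otherwise fall back to global."""
        regional = per_region.get(region)
        return regional if regional and sum(regional.values()) >= min_rows else global_table

    # Cascading business-logic filters: every filter must pass for a correction to survive.
    def original_is_not_a_known_domain(original, suggestion, known):
        return original not in known  # never "correct" a domain we already trust

    def suggestion_actually_differs(original, suggestion, known):
        return suggestion is not None and suggestion != original

    FILTERS = [original_is_not_a_known_domain, suggestion_actually_differs]

    def accept_correction(original, suggestion, known):
        return all(f(original, suggestion, known) for f in FILTERS)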

We’ve found that this structured approach has allowed us to identify errors in, and validate automated suggestions for, both extremely common and very specific or uncommon domains quite accurately. Have a look at some fictionalised examples based on real validations flagged and suggested by the algorithm:
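In the same spirit, here is a purely illustrative run that chains the earlier sketches over the fictional addresses from the start of this article (assuming split_address and suggest_domain from above are in scope). The flags and suggestions shown come from the toy code, not from the production algorithm.

    # Purely illustrative: chain the toy sketches over the fictional addresses above.
    for address in ["k.davies@gmal,com", "rjones@9llowland.com.au",
                    "terri.taylor@royal.yorksch.uk"]:
        parts = split_address(address)
        if parts is None:
            print(address, "-> flagged: fails the basic syntax check")
            continue
        local, labels = parts
        suggestion = suggest_domain(".".join(labels))
        print(address, "->", f"{local}@{suggestion}" if suggestion else "left unchanged")

    # k.davies@gmal,com -> flagged: fails the basic syntax check
    # the other two are left unchanged: no confident match, so no friction is added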

-> all email addresses used as examples in this document are fictionalised <-

-> technical talk over <-

These kinds of data transformation processes may seem mundane on the surface. But as soon as you start digging into the problem and its potential solutions, solving the challenges of implementing them at scale, and thinking about the deployment, maintenance and governance of such a process (bearing in mind that dozens or hundreds of processes like this are being developed and deployed all the time by our diversely skilled data team), it quickly becomes highly rewarding work whose positive impact is far reaching.

If you like the sound of the above, remember that we’re always looking to hire Data Scientists into our team.

Check out some more articles from other members of the data team.
