The Use Case Treasure-Trove (Part I): Reducing Administration Costs with AI-driven Record Linkage

Published in

Machine Learning Rambling

4 min read · May 13, 2018

Yes, Data Science is exciting. Yes, Hacking is fun. But which Use Cases are valuable and have an impact on your business?

As you have certainly experienced, finding the right use case for your company is a daunting task. Every day, you get hundreds of emails from internal managers claiming their data is highly valuable. They need your help. However, after careful analysis, your team of Data Scientists reports that the data set is too small, lacks targets, or suffers from other, even more annoying problems.

In this non-technical series, I take you through common but highly valuable Data Science Use Cases, answering the following questions for you:

What are they about precisely?
Are they applicable to your business?
What does the end product look like?
What should you expect from your team of Data Scientists?

In this first part, I tackle Record Linkage and how it will reduce your administration costs.

Story Time

It is another sunny Monday morning at the Head Office. Sophia arrives at the office. She glances over the unending list of new unread emails. She opens the first one.

Her manager, James, reports that last week was pretty successful. A large batch of new clients was registered recently, but although the number of reported registered forms seems to match, some client data seems to be missing.

Sophia scrolls through the different databases. There she sees it again! The same problem she keeps encountering over and over again: transcription errors.

Transcription errors. Misfiling. Common administrative challenges that Machine Learning can tackle, reducing your costs.

She will have to once again reallocate administrative staff to check the database record by record and manually correct the data.

“These clients will have to wait”, she says. “What a waste of time and resources”, she thinks.

Yes, it is, Sophia, it sure is!

It does not matter which industry you belong to. Every day, countless forms are registered. It could be containers from Hong Kong arriving by ship in Rotterdam with a mistyped customs ID. Or what about misplaced client IDs in a mortgage form? The list goes on…

This problem is commonly referred to as Record Linkage.

What is Record Linkage precisely?

“Record linkage is about extracting information on a single entity, e.g. client, commodity, etc., from different datasets which may or may not share common identifiers. These identifiers could be keys, identification numbers, etc.”

Sounds like something that should never go wrong, doesn’t it?

Unfortunately, that is not the case.

With the digitalization of forms, a great deal of historical analog data had to be transcribed. Moreover, people will always make mistakes. Even the best secretaries mistype a name, a client ID, etc. The larger the company, the more often such oversights occur.

Automating this data sanitisation reduces the administrative cost of supervising these processes.

What does the end product look like?

The simplified work-flow is illustrated in the figure below.

The client has been registered in different facilities, some of which are outside your company and are often the source of faulty data. The data is then uploaded to your central database. A sanitisation of the data sets is scheduled at regular intervals. Each new, unchecked entry in the databases is checked by your (in-house developed) black-box Machine Learning model, which links the entries, delivers the final form and updates the central database with the cleaned entries.

What to expect from your Data Scientists?

The most common approach is usually called Fuzzy Matching. Your team analyses the common errors in the database and uses algorithms that estimate the similarity between records using distance measures.

A basic example of such a distance measure is the Levenshtein distance. It quantifies how different two Strings (i.e. words, texts, or sequences of characters) are by counting the minimum number of single-character deletions, insertions, or substitutions required to turn one String into the other.
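For the curious, here is a minimal Python sketch of the idea, not a production implementation. The helper names `similarity` and `best_match`, the lower-casing step and the 0.8 threshold are assumptions made for illustration; your Data Scientists will tune such choices to your data.

```python
from typing import Optional, Sequence


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to a score in [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def best_match(entry: str, candidates: Sequence[str],
               threshold: float = 0.8) -> Optional[str]:
    """Return the most similar candidate, or None if nothing is
    similar enough. The threshold is an illustrative assumption."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: similarity(entry.lower(), c.lower()))
    if similarity(entry.lower(), best.lower()) >= threshold:
        return best
    return None


# Hypothetical example: a mistyped client name is linked to the clean record.
print(levenshtein("kitten", "sitting"))  # 3
print(best_match("Sophia Mueller", ["Sophia Muller", "James Miller"]))
```

Entries that fall below the threshold are not linked automatically; in practice these are the cases you would still route to a human for review.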

A Siamese Network is a Deep Learning method that can be used for Record Linkage.

You will hear terms such as cosine similarity, TF-IDF, Doc2Vec, variational auto-encoders, triplet networks, siamese networks, etc. Whether such advanced techniques are necessary depends on the quality of the data. There are no silver bullets, after all…
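To give a feeling for the first two terms, here is a rough Python sketch that compares records by the cosine similarity of TF-IDF-weighted character trigrams. The function names, the choice of trigrams and the smoothed IDF formula are assumptions made for illustration; a real team would more likely reach for an established library.

```python
import math
from collections import Counter
from typing import Dict, List


def char_ngrams(text: str, n: int = 3) -> List[str]:
    """Split a lower-cased, space-padded string into overlapping n-grams."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def tfidf_vectors(records: List[str], n: int = 3) -> List[Dict[str, float]]:
    """Turn each record into a sparse {n-gram: weight} vector.
    Rare n-grams get a higher weight than n-grams shared by many records."""
    counts = [Counter(char_ngrams(r, n)) for r in records]
    # Number of records in which each n-gram appears at least once.
    doc_freq = Counter(gram for c in counts for gram in c)
    total = len(records)
    vectors = []
    for c in counts:
        vectors.append({
            gram: tf * (math.log((1 + total) / (1 + doc_freq[gram])) + 1)
            for gram, tf in c.items()  # smoothed IDF, an illustrative choice
        })
    return vectors


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(gram, 0.0) for gram, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical records: the first two describe the same entity with a typo.
records = [
    "Rotterdam Port Authority",
    "Roterdam Port Authority",
    "Hong Kong Shipping Ltd",
]
vecs = tfidf_vectors(records)
print(cosine(vecs[0], vecs[1]))  # high: same entity, one missing letter
print(cosine(vecs[0], vecs[2]))  # low: unrelated entity
```

Unlike the Levenshtein distance, this comparison does not care much about word order, which can help when fields such as first and last name are swapped.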

Conclusion

Record Linkage is a common problem across industries. Administrative costs can be reduced using a sanitisation work-flow. At the center of this flow lies a black-box machine learning model, which compares newly added entries in your database and matches them based on similarity measures.

References

Christen P., Advanced record linkage methods: scalability, classification and privacy
AT&T Bell Laboratories, Signature Verification using a “Siamese” Time Delay Neural Network
Navarro, Gonzalo (2001). A guided tour to approximate string matching
Bell RM, Keesey J, Richards T. The urge to merge: linking vital statistics records and Medicaid claims.