Entity resolution, also known as record linkage, is the task of disambiguating real-world entities from data. In other words, it's the process of identifying and resolving multiple occurrences of a single entity to reveal a clearer picture of the information within the data. It's simple enough conceptually, but exceedingly difficult to achieve in practice and at scale, which is why there aren't many master data management solutions available.
In 2019, we formed the DataLabs team to tackle this problem and provide our users with entity resolution tools that are performant, flexible, and customized to their particular use case.
Our Namara platform and its suite of tools constantly evolve, and our users have access to more than 250,000 datasets in addition to whatever data they’re using internally. With that many datasets in the mix, automating the process of solving for data variety becomes an absolute necessity.
We often find duplicated entities that need to be resolved or linked, whether within a single data source or across multiple sources. A reliable entity resolution tool is critical to ensuring that the most refined and aggregated information is available to our clients and their entire organizations. After many trials and iterations (and a few failures…), we now proudly offer entity resolution as a core enrichment service for our clients.
Why does anybody need entity resolution?
A typical data scientist spends 80% of their time cleansing and preparing data. That's a shocking statistic. Because processing and refining data is a prerequisite for producing data-driven insights, a growing number of industries are adopting machine learning approaches to improve their productivity.
We built an internal tool to help refine and manage data efficiently. As more and more data is added to an ecosystem, an operational layer to tie it all together becomes increasingly important; the volume also becomes impossible for humans to manage manually. Record linkage has limitless applications across every sector, and through our current client deployments we understand how the need for it grows along with the world of data. Data is good and more data is great, but connecting data is the key to learning from it.
How does ThinkData resolve entities?
For entity disambiguation, we first classify the entity type (e.g. organization, address, etc.) and preprocess the data in a way that best suits that entity. The data type informs how we optimize the tokens (the units the input is broken into) within the entity and how we distribute the computing load. We then translate each feature into a vector representation and use a compressed sparse matrix to compute pairwise similarity, linking duplicate entities on a graph structure.
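As a rough illustration of those steps, here is a minimal sketch using scikit-learn; it is an assumption on our part, not ThinkData's actual implementation. Character n-gram TF-IDF vectors stand in for the custom embedding, the vectorizer's output is a compressed sparse row matrix, and a union-find pass plays the role of the graph linking step. The names and threshold are invented for illustration.

```python
# Hypothetical similarity-join pipeline sketch (not ThinkData's actual code).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

names = [
    "Acme Corporation",
    "Acme Corp.",
    "ACME Corp",
    "Globex Inc",
    "Globex Incorporated",
]

# 1. Vectorize: character n-grams tolerate typos and short forms.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
matrix = vectorizer.fit_transform(names)  # compressed sparse row matrix

# 2. Pairwise similarity computed from the sparse matrix.
sim = cosine_similarity(matrix)

# 3. Link records whose similarity clears a threshold; union-find plays
#    the role of connected components on the similarity graph.
parent = list(range(len(names)))

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path compression
        i = parent[i]
    return i

THRESHOLD = 0.5  # illustrative; in practice tuned per entity type
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if sim[i, j] >= THRESHOLD:
            parent[find(i)] = find(j)

clusters = {}
for i, name in enumerate(names):
    clusters.setdefault(find(i), []).append(name)
print(list(clusters.values()))
```

A real deployment replaces the all-pairs loop with a blocked or approximate nearest-neighbor search, since the quadratic comparison does not scale.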
The example in the image gives you an idea of the disparity data scientists face in the ways a single company can be recorded: typos, short forms, omissions, and variations abound, whether within a single dataset or across many.
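To make that concrete, here is a toy normalization pass of the kind a preprocessing step might apply to company names; the abbreviation map and the example names are hypothetical, not drawn from our pipeline.

```python
import re

# Illustrative abbreviation map; a production list would be far larger.
ABBREVIATIONS = {
    "corp": "corporation",
    "inc": "incorporated",
    "intl": "international",
    "ltd": "limited",
}

def normalize_company(name: str) -> str:
    name = name.lower()
    name = re.sub(r"[^\w\s]", " ", name)                # strip punctuation
    tokens = [ABBREVIATIONS.get(t, t) for t in name.split()]  # expand short forms
    return " ".join(tokens)

variants = [
    "Intl. Business Machines Corp.",
    "International Business Machines Corporation",
]
# Both variants collapse to the same canonical string.
print({normalize_company(v) for v in variants})
```

Normalization alone won't catch typos or omissions, which is why the similarity step above still operates on character-level features rather than exact strings.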
Instead of using conventional word embedding models, we designed our own embedding model for more accurate, efficient, and scalable entity resolution.
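We can't reproduce the proprietary model here, but a generic sketch shows why a custom character-level approach can beat conventional word embeddings on dirty data: hashing character n-grams into a fixed-size vector (the "hashing trick") keeps memory bounded regardless of vocabulary size. The dimension, n-gram length, and names below are illustrative only.

```python
import hashlib
import numpy as np

# Illustrative character n-gram hashing embedder; not ThinkData's actual model.
def embed(text: str, dim: int = 256, n: int = 3) -> np.ndarray:
    vec = np.zeros(dim)
    padded = f" {text.lower()} "  # pad so word edges form n-grams too
    for i in range(len(padded) - n + 1):
        gram = padded[i : i + n]
        # A stable hash maps each n-gram to a fixed bucket, so the vector
        # size never grows with the vocabulary.
        bucket = int(hashlib.md5(gram.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

a = embed("Acme Corporation")
b = embed("Acme Corp.")
c = embed("Globex Inc")
# Unit vectors, so the dot product is cosine similarity.
print(round(float(a @ b), 3), round(float(a @ c), 3))
```

Because the output dimension is fixed, these vectors slot directly into the sparse pairwise-similarity computation described above and shard cleanly across compute nodes.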
The goal is to distribute the workload optimally across multiple compute nodes using Spark so that users can work with large datasets efficiently.
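A production job would express this in Spark, but the core idea behind distributing the comparisons, blocking records by a cheap key so each node only compares candidates within its own block, can be sketched in plain Python. The key function and records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

records = [
    "Acme Corporation",
    "Acme Corp",
    "Globex Inc",
    "Globex Incorporated",
    "Initech LLC",
]

# Blocking key: first token, lowercased. In a Spark job this key would drive
# the shuffle (e.g. keying the records and grouping by key), so each executor
# compares only the records inside its own block.
def block_key(name: str) -> str:
    return name.lower().split()[0]

blocks = defaultdict(list)
for r in records:
    blocks[block_key(r)].append(r)

candidate_pairs = [p for b in blocks.values() for p in combinations(b, 2)]
# → [('Acme Corporation', 'Acme Corp'), ('Globex Inc', 'Globex Incorporated')]
print(candidate_pairs)
```

Full pairwise comparison of these five records would cost 10 comparisons; blocking cuts that to 2, and the savings grow quadratically with dataset size, which is what makes the distributed job tractable.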
Connecting data from any number of sources at scale
As we grow our data variety, our data refinement capabilities using entity resolution will improve in parallel. This means that data scientists, instead of processing datasets on a case-by-case basis, will have an automated solution that creates links between data points, producing robust master data records that lead to deeper insight. The less time data scientists spend mired in prep and processing, the more time they can spend on actual data science.
The future of data enrichment
We’ve achieved many of the goals we set out to hit, including besting the accuracy of leading entity resolution tools when handling dirty data. However, that doesn’t mean we haven’t set new goals. We’re working hard to automate the process moving forward, and to create an automatic data-knowledge transformation pipeline.
By training on one of the largest catalogs of public data in the world, we have designed our solution around the real-world dirty data environment that most data scientists face daily. Rather than building a product in a sanitary, synthetic environment, we’re designing tools that manage and neutralize data variety quickly, effectively, and at scale.
Want to learn more about data enrichment?
Request a consultation with one of our data experts to talk about our data services and how ThinkData’s tech can advance your projects. If you’re interested in learning more, read how MaRS Discovery District applied entity resolution to Ontario Businesses.
Originally published at https://blog.thinkdataworks.com.