Entity resolution with Namara and Thomson Reuters PermID

Yulia Chepurna
Jan 25, 2018

A growing number of leading organizations now rely on previously overlooked external data to derive new insights, map them to attributes, and feed them directly into their predictive models, thereby gaining a competitive edge. Examples of such alt data are countless: they range from analyzing government bids to adjust an estimate of a company’s valuation, to relying on historical weather data to anticipate a spike in customer activity, and even to processing a series of satellite pictures of a popular chain’s parking lots to predict the direction of its stock.

For the most part, external data feeds are complementary and are used to augment carefully curated internal datasets. This creates a new challenge: how do you ingest such feeds while maintaining the integrity of a reference dataset? What happens if some of these sources overlap and contain duplicate, or near-duplicate, records? Then imagine something worse: these duplicates carry different values for the same attribute. How should these records be consolidated then?

Mapping such data points is known as an entity resolution (or entity mapping) problem, and it is not a trivial one. Organizations usually resort to internal master data management strategies to address it. However, these strategies are quite context-specific and hence pretty laborious to put in place. There is also a trade-off between exact matching, which might dismiss a good number of valid records, and fuzzy matching that is too relaxed, which results in a high number of false positives.
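To make the trade-off concrete, here is a minimal sketch that uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure. The company names and thresholds are made up for illustration, and this is not the matching logic of any tool discussed in this post.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match(name: str, candidates: list[str], threshold: float) -> list[str]:
    """Return the candidates whose similarity to `name` reaches `threshold`."""
    return [c for c in candidates if similarity(name, c) >= threshold]


# Illustrative names only: the first two are the same business written
# differently, the third is a different business with a similar name.
candidates = ["Acme Holdings Inc.", "ACME HOLDINGS INC", "Acme Hotels Inc."]

exact = match("Acme Holdings Inc.", candidates, threshold=1.0)
strict_fuzzy = match("Acme Holdings Inc.", candidates, threshold=0.95)
relaxed_fuzzy = match("Acme Holdings Inc.", candidates, threshold=0.8)

print(exact)
print(strict_fuzzy)
print(relaxed_fuzzy)
```

With a threshold of 1.0 the variant without the trailing period is dismissed, while the relaxed threshold also lets the unrelated hotel business through. Picking a threshold between those extremes is exactly the context-specific tuning that makes in-house solutions laborious.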

On the bright side, there are a number of widely recognized third-party solutions for different domains. In this post we are going to look at how we at ThinkData Works use some of the Thomson Reuters APIs for business entity matching and data enrichment.

Let’s consider the following scenario: an organization wants to use Namara to integrate a few premium datasets with its internal CRM. They are particularly interested in:

  • Canadian Company Capabilities
  • the Corporations Canada corporate directory
  • the TMX feed

Let’s take a quick look at these sources.

Their internal dataset is a pretty typical CRM with an internal identifier, company name, company category, location, and, of course, contact information (see the breakdown of the attributes by category below).

Internal attributes of the master dataset (internal CRM)

Canadian Company Capabilities is a database of local businesses curated by Industry Canada. It is aimed at opening up export opportunities, as well as facilitating the search for prospective partners and the analysis of the competition. Hence it is quite rich with respect to the classification of a company’s activities, services provided, and goods manufactured, its financial situation, and its representatives’ contact information (up to 60 for one of them!).

Canadian Company Capabilities attributes (over 50)

The corporate directory contains the information that Canadian corporations are required to file, including standard identifiers such as business and corporation numbers, the corporation’s names and type, its governing legislation, whether the corporation is active or not (and why), location information, and lists of activities and directors.

Corporations Canada attributes (around 30)

The TMX feed consists of information pertaining to a traded security, including its symbol, CUSIP, corresponding company names, nature of business, number of outstanding shares, dividend factor, etc.

TMX feed attributes (around 25)

Even though all of these datasets have at least one unique identifier for every record, these identifiers are internal to each dataset and cannot be used as a key for a join. One possible workaround would be to join on company name. However, after a quick examination, we can clearly see that records like “Shopify Commerce Inc.”, “SHOPIFY INC” and “847871746RC0001 Inc.” would not be easily disambiguated.
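A rough sketch of why a name-based join falls short: even after a simple normalization pass (lowercasing, stripping punctuation, dropping legal suffixes), the three names quoted above still produce three different join keys. The suffix list and the `normalize` helper are illustrative assumptions, not part of Namara or PermID.

```python
import re

# Illustrative, incomplete list of legal suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"inc", "incorporated", "ltd", "limited", "corp", "corporation"}


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and drop legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


names = ["Shopify Commerce Inc.", "SHOPIFY INC", "847871746RC0001 Inc."]
keys = {name: normalize(name) for name in names}

for name, key in keys.items():
    print(f"{name!r} -> {key!r}")
```

Normalization removes the cosmetic differences, but it cannot tell us whether these three keys refer to the same business. That takes reference data about the entities themselves.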

This is where Thomson Reuters PermID comes into play. PermID is a unique identifier assigned to a variety of different business entities (organizations, persons, instruments, and quotes) in Thomson Reuters’ internal universe of linked data.

Open PermID comes with the following set of APIs for entity querying and retrieval:

  • Record Matching
  • Entity Search
  • Tagging

We will focus only on the first two. The Record Matching API lets you run a set of business entities against the Thomson Reuters database and returns a ranked list of the best possible matches. To match an organization, you need to specify its name and, optionally, some of the following arguments:

  • standard identifiers (ticker or RIC)
  • street
  • city
  • postal code
  • state
  • website

You can see a sample response below:

So first, we’ll run all of the datasets through this API and assign PermIDs to the companies whose best match scored above the cut-off point (internally set to a match score of 85%). Since all of the datasets now share a globally unique key, we can join them seamlessly. Voilà!
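The cut-off and join steps could be sketched as follows. The match-result fields (`local_id`, `permid`, `score`), the score expressed as a fraction between 0 and 1, and the placeholder PermID values are all assumptions for illustration rather than the actual response format.

```python
CUTOFF = 0.85  # the 85% cut-off mentioned above, expressed as a fraction


def assign_permids(matches: list[dict], cutoff: float = CUTOFF) -> dict:
    """Map each local record ID to the PermID of its best match at or above the cut-off."""
    best = {}
    for m in matches:
        if m["score"] < cutoff:
            continue  # too uncertain: the record stays unmatched
        current = best.get(m["local_id"])
        if current is None or m["score"] > current["score"]:
            best[m["local_id"]] = m
    return {local_id: m["permid"] for local_id, m in best.items()}


def attach_permid(rows: list[dict], permids: dict, id_field: str) -> list[dict]:
    """Return copies of the rows that received a PermID, with a 'permid' key added."""
    return [{**row, "permid": permids[row[id_field]]}
            for row in rows if row[id_field] in permids]


def join_on_permid(left: list[dict], right: list[dict]) -> list[dict]:
    """Inner join two lists of rows on their 'permid' key."""
    index = {}
    for row in right:
        index.setdefault(row["permid"], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l["permid"], [])]


# Made-up data: two CRM records and one security record.
crm_rows = [{"crm_id": "crm-1", "name": "Example Widgets"},
            {"crm_id": "crm-2", "name": "Sample Foods"}]
crm_matches = [{"local_id": "crm-1", "permid": "P-001", "score": 0.97},
               {"local_id": "crm-1", "permid": "P-009", "score": 0.88},
               {"local_id": "crm-2", "permid": "P-002", "score": 0.60}]
tmx_rows = [{"tmx_id": "tmx-7", "symbol": "EXW"}]
tmx_matches = [{"local_id": "tmx-7", "permid": "P-001", "score": 0.91}]

crm = attach_permid(crm_rows, assign_permids(crm_matches), "crm_id")
tmx = attach_permid(tmx_rows, assign_permids(tmx_matches), "tmx_id")
joined = join_on_permid(crm, tmx)
print(joined)
```

Records that fall below the cut-off are simply left out of the join in this sketch. In practice they could be routed to manual review instead.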

And that is not all! The Thomson Reuters Entity Search API gives you access to descriptive fields for 3,460,500 organizations, 240,000 equity instruments, and 1,170,000 equity quotes, using their PermIDs. You can see the attributes typically available for organizations below.

PermID entity attributes (20)

So the last step in our enrichment process is to augment these joined datasets with metadata coming from the Entity Search API. We query the metadata for all of the unique PermIDs attached to the original CRM via entity lookup and append it to the resulting dataset.
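A minimal sketch of this enrichment step is shown below. The `lookup` callable stands in for a call to the entity lookup and is passed in as a parameter, so no endpoint or response shape is assumed here. The stub function, its returned fields, and the placeholder PermID values are hypothetical.

```python
from typing import Callable, Optional


def enrich(rows: list[dict], lookup: Callable[[str], Optional[dict]]) -> list[dict]:
    """Append entity metadata to each row, querying every unique PermID only once."""
    unique_ids = {row["permid"] for row in rows}
    # An empty dict is used when the lookup returns nothing for a PermID.
    metadata = {permid: lookup(permid) or {} for permid in unique_ids}
    return [{**row, **metadata[row["permid"]]} for row in rows]


# Stand-in for a real entity lookup; it records which IDs were requested.
requested = []


def fake_lookup(permid: str) -> Optional[dict]:
    requested.append(permid)
    known = {"P-001": {"org_status": "Active", "hq_country": "Canada"}}
    return known.get(permid)


rows = [
    {"crm_id": "crm-1", "permid": "P-001"},
    {"crm_id": "crm-4", "permid": "P-001"},  # same entity, second CRM record
    {"crm_id": "crm-5", "permid": "P-404"},  # no metadata available
]
enriched = enrich(rows, fake_lookup)
print(enriched)
```

Deduplicating the PermIDs before querying keeps the number of lookups proportional to the number of distinct entities rather than the number of CRM rows.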

This is how, starting from a basic business directory with some 20 attributes, we arrived at a holistic representation of both private and public businesses with almost 150 different attributes. Click here to explore the final output dataset on Namara.

***

Thomson Reuters is one of our exclusive data partners. Follow us on Twitter to stay up to date with the amazing work they are doing with us. Don’t hesitate to contact us if you have any questions. And if you liked this post, make sure to check out our case study on using Unity for joining and enriching geospatial datasets.