Using RecordLinkage Toolkit to Link Records

Sze Zhong LIM
Published in Data And Beyond · 12 min read · Mar 8, 2024

Anyone who has tried to work with user-entered data knows how tough it is to manage and gain insights from it. One example is allowing users to key in their company name without any constraints or guidance: ‘SomeCompanyName Pte. Ltd.’ could be written as ‘somecompanyname’, ‘somecompanyname pte ltd’ (without the dots and in lower case), ‘somecompanyname pvt’, or some other variation.

To resolve this issue, we will have to check for similarity and group them together.
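As a taste of what grouping the variants could look like, here is a minimal normalisation sketch. The function name and suffix list are my own illustration, not part of any library:

```python
import re

# Illustrative list of legal-entity suffixes; extend as your data demands.
SUFFIXES = {"pte", "ltd", "pvt", "llp", "inc"}

def normalise(name: str) -> str:
    """Lowercase, strip punctuation, and drop common legal suffixes
    so variants of the same company name collapse to one key."""
    tokens = re.sub(r"[^a-z0-9 ]", "", name.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)

print(normalise("SomeCompanyName Pte. Ltd."))  # somecompanyname
print(normalise("somecompanyname pvt"))        # somecompanyname
```

Normalisation alone will not catch typos, though, which is where similarity scoring comes in.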

Problem Joining Datasets

Recently, I had an issue when trying to join two datasets from ACRA and HDB (both downloaded from data.gov.sg). The HDB table had uen and company_name columns, while the ACRA table had uen and entity_name. By right, the HDB uen and company_name data should be a subset of the ACRA uen and entity_name data. However, life is never perfect.

I did expect the HDB company_name and ACRA entity_name to have very minor discrepancies, like a dot here and there. As such, I decided to join using the uen from both datasets. Surprisingly, some uen values from the HDB dataset could not be found in the ACRA dataset. The error could have been in either table.
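The unmatched-uen check itself is a simple anti-join. A toy sketch with sets, using hypothetical uen values in place of the real CSV columns:

```python
# Hypothetical uen values standing in for the two CSVs.
hdb_uens = {"200416909E", "201003101W", "199201624D"}
acra_uens = {"200416906E", "201003101W", "199201624D"}

# uens present in HDB but absent from ACRA are the rows that fail the join.
unmatched = hdb_uens - acra_uens
print(unmatched)  # {'200416909E'}
```

In pandas, the same thing can be done with a left merge and the indicator flag, keeping the rows marked left_only.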

A snapshot showing 4 HDB entities not joining with the ACRA entities information.

I did a check in the ACRA entity csv and found that the acra_uen was slightly different: the HDB uen was 200416909E whereas the ACRA uen was 200416906E. Most likely one of them was wrong. I also realized that the HDB company_name and ACRA entity_name did not match exactly. The HDB company_name for index 241 used PTE LTD whereas the ACRA entity_name used PTE. LTD.

ACRA information csv downloaded from data.gov.sg

The idea was to find company names with a similarity of 90% and above (for this particular case), as there were only very minor differences. For this, I decided to use RecordLinkage after asking ChatGPT.
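As a quick illustration of what a 90% bar buys you, the snippet below scores two made-up name pairs. Note it uses the stdlib SequenceMatcher as a stand-in, which is a different metric from Jaro-Winkler, so the exact numbers differ from what recordlinkage would give:

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]; stdlib stand-in for Jaro-Winkler."""
    return SequenceMatcher(None, a, b).ratio()

# A stray dot here and there barely dents the score...
near = ratio("SOMECO PTE LTD", "SOMECO PTE. LTD.")
# ...while a genuinely different name falls well below the bar.
far = ratio("SOMECO PTE LTD", "OTHERCORP PTE LTD")
print(round(near, 3), round(far, 3))
```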

Using RecordLinkage

The documentation for RecordLinkage can be found here.

The recordlinkage project makes extensive use of data manipulation tools like pandas and numpy.

Record linkage is used to link data from multiple data sources or to find duplicates in a single data source. In computer science, record linkage is also known as data matching or deduplication (in case of search duplicate records within a single file).

There are 5 steps to record linkage. Namely:

  1. Preprocessing
  2. Indexing
  3. Comparing
  4. Classification
  5. Evaluation

However, the scope of this article will be on how RecordLinkage was applied to join our table, the experiences, and the mistakes I learned from trying it out.

Using full index (A mistake)

I started using recordlinkage totally blind, so my first attempt was simply to run the ChatGPT code and see what turned up:

# Try to use recordlinkage
indexer = recordlinkage.Index()
indexer.full()
candidate_links = indexer.index(hdb_df['hdb_company_name'], acra_df['acra_entity_name'])
print('here')
print(type(candidate_links))
print(candidate_links)
compare_cl = recordlinkage.Compare()
compare_cl.string('hdb_company_name', 'acra_entity_name', method='jarowinkler', threshold=0.9)

features = compare_cl.compute(candidate_links, hdb_df, acra_df)
print(features)

matches = features[features.sum(axis=1) >= 1]
print(matches)
Code snippet from my Jupyter Notebook where using full index failed

I used the full index method, as can be seen on Line 3: indexer.full().

For the candidate links, I used only the 2 columns, as can be seen from candidate_links = indexer.index(hdb_df['hdb_company_name'], acra_df['acra_entity_name'])

After that, I compared the strings from hdb_company_name and acra_entity_name using the jarowinkler method, keeping pairs with a similarity of 90% or more.

I then got the code to compute the features and return the values that met the requirement.

This attempt failed due to memory issues. To build a full index, the number of candidate pairs is the number of rows in HDB multiplied by the number of rows in ACRA.
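A back-of-envelope with made-up row counts shows how quickly the full index blows up:

```python
# Hypothetical row counts; real registry datasets can be larger still.
n_hdb, n_acra = 5_000, 600_000

# A full index materialises one candidate pair per cross combination.
n_pairs = n_hdb * n_acra
print(f"{n_pairs:,} candidate pairs")  # 3,000,000,000 candidate pairs
```

Even at a few bytes per pair, that is tens of gigabytes before a single comparison has run.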

Using full index, but smaller dataset (Still not the best, but could work)

I then thought that, to reduce memory usage, I could focus on a smaller dataset. I could achieve this by:

  1. Creating a smaller dataset of HDB for those which cannot match the ACRA records
  2. Creating a smaller dataset of ACRA by assuming that the first 4 letters of the entity_name and company_name are exactly the same.
# HDB dataset reduced to 4 rows. Only with those with nulls.
empty_uen = hdb_2[hdb_2['acra_uen'].isna()]

# Creating a list of first 4 letters of the 4 rows.
empty_uen_first_4 = empty_uen['hdb_company_name'].str[:4].unique()

# ACRA dataset reduced to only those with the same first 4 letters.
filtered_acra = acra_df[acra_df['acra_entity_name'].str[:4]\
.isin(empty_uen_first_4)]
Code snippet from my Jupyter Notebook where the dataset is reduced.

Basically, the full index will now only have 4 × 29899 rows, which is far fewer than before. This worked out because my HDB dataset happened to be small. However, I wondered what would happen with a bigger HDB dataset and more nulls to fill. Even if I could loop through the problem rows one by one and append the results to a pandas.DataFrame before joining it back to the main DataFrame, that did not seem too smart.

Output from running the Full Index method with a smaller dataset

As expected, the result was 119596 rows for the full index. After matching at 99.9% similarity, one of the indices did not find a match; only the rest were fulfilled. I decided to lower the similarity. When I lowered the threshold to 90%, you can see that, except for index 313, the other indices had many other similar matches.

Upon investigating HDB index 1175 and some of the matched ACRA indices, we can see that a 90% similarity is actually quite wide. The company’s name is THE CARPENTER’S WORKSHOP PTE LTD, and words like “The”, “Pte” and “Ltd” are relatively common across many companies.

What I learnt from this is:

  1. We can actually use another column for the check. For instance, the postal code: we can block on the postal code being the same, instead of basing it on the first 4 letters being the same.
  2. We can start from a high threshold and step down until every individual index has found its highest-threshold match, using a loop. I suspect going from 99.9% down to 95% in decrements of 1% should be sufficient, but even setting the floor at 90% is fine, as the loop will terminate accordingly. A loop may be computationally expensive if the data is huge, so I would use this method sparingly too.
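The descending-threshold idea in point 2 can be sketched as below. SequenceMatcher again stands in for Jaro-Winkler, and the function is my own illustration, not a recordlinkage API:

```python
from difflib import SequenceMatcher

def first_threshold_match(name, candidates, start=0.999, floor=0.90, step=0.01):
    """Walk the threshold down from `start` to `floor` and return the
    candidates clearing the first threshold at which anything matches."""
    scores = {c: SequenceMatcher(None, name, c).ratio() for c in candidates}
    t = start
    while t >= floor:
        hits = [c for c, s in scores.items() if s >= t]
        if hits:
            return t, hits
        t -= step
    return None, []
```

The loop stops at the highest threshold that still yields a hit, which keeps the spurious 90% matches out whenever a tighter match exists.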

Using block index with smaller dataset (Still not the best way)

Based on my learnings from the above, I decided to use the block index method to see what happens. I used the smaller HDB dataset, the one with only 4 records, but the full ACRA dataset instead of the one resized on the assumption of identical first 4 letters.

indexer1 = recordlinkage.Index()
indexer1.block('hdb_postal_code','acra_postal_code')
candidate_links = indexer1.index(empty_uen, acra_df)
compare_cl = recordlinkage.Compare()
compare_cl.string('hdb_company_name', 'acra_entity_name', method='jarowinkler', threshold=0.9)
features = compare_cl.compute(candidate_links, empty_uen, acra_df)
print(features)
matches = features[features.sum(axis=1) >= 1]
print(matches)
print(matches.index)
print(type(matches.index))

I started with a threshold of 90%, since I expected the probability of two different companies sharing the same postal code to be pretty low. I got the result below.

Snapshot of result using the block index method

As I expected, even with a threshold of 90% the matches were still 1-for-1. However, what I did not expect was that one of the indices was missing: there was no match for index 313 from the HDB dataset. My suspicion was that the postal codes did not match, hence the lack of a result.

After checking in Jupyter Notebook (obtaining the ACRA index from the previous full-index exercise), I realized I was correct. The HDB dataset (bottom) showed a postal code of 756939, whereas the ACRA dataset (top) showed a postal code of 539483: a totally different address. However, we know from the uen that both refer to the same company. There is a one-digit difference in the uen, and the company_name and entity_name are 100% similar.

Investigation results for HDB index 313 as compared to ACRA record.

I felt it would be best to combine the best of both worlds to eliminate these issues. In the meantime, I read the recordlinkage documentation and found that there is a SortedNeighbourhood indexing method that can also be used.

Using SortedNeighbourhood indexing

The explanation from the recordlinkage documentation for Sorted Neighbourhood indexing is as below:

This algorithm returns record pairs that agree on the sorting key, but also records pairs in their neighbourhood. A large window size results in more record pairs. A window size of 1 returns the blocking index.

The Sorted Neighbourhood Index method is a great method when there is a relatively large amount of spelling mistakes. Blocking will fail in that situation because it excludes too many records on minor spelling mistakes.

I understood the general gist of it, but did not really understand how it works. One good reference I found online is a PDF by the Hasso Plattner Institut.

Another simpler to understand but less detailed example can be found from the link given below.

I decided to try the Sorted Neighbourhood (SN) indexing method in 3 different combinations:

  1. SN with uen, comparing company_name and entity_name
  2. SN with company_name and entity_name, comparing company_name and entity_name
  3. SN with company_name and entity_name, comparing uen

The code is as below:

# For Combination SN uen, comparing company_name
indexer2 = recordlinkage.Index()
indexer2.sortedneighbourhood('hdb_uen','acra_uen')
candidate_links = indexer2.index(empty_uen, acra_df)
compare_cl = recordlinkage.Compare()
compare_cl.string('hdb_company_name', 'acra_entity_name', method='jarowinkler', threshold=0.9)
features = compare_cl.compute(candidate_links, empty_uen, acra_df)
matches = features[features.sum(axis=1) >= 1].sort_index(level=0)
print(features.sort_index(level=0))
print(matches)
print(matches.index)
print(type(matches.index))

# For Combination SN company_name, comparing company_name
indexer3 = recordlinkage.Index()
indexer3.sortedneighbourhood('hdb_company_name','acra_entity_name')
candidate_links = indexer3.index(empty_uen, acra_df)
compare_cl = recordlinkage.Compare()
compare_cl.string('hdb_company_name', 'acra_entity_name', method='jarowinkler', threshold=0.9)
features = compare_cl.compute(candidate_links, empty_uen, acra_df)
matches = features[features.sum(axis=1) >= 1].sort_index(level=0)
print(features.sort_index(level=0))
print(matches)
print(matches.index)
print(type(matches.index))

# For Combination SN company_name, comparing uen
indexer4 = recordlinkage.Index()
indexer4.sortedneighbourhood('hdb_company_name','acra_entity_name')
candidate_links = indexer4.index(empty_uen, acra_df)
compare_cl = recordlinkage.Compare()
compare_cl.string('hdb_uen', 'acra_uen', method='jarowinkler', threshold=0.9)
features = compare_cl.compute(candidate_links, empty_uen, acra_df)
matches = features[features.sum(axis=1) >= 1].sort_index(level=0)
print(features.sort_index(level=0))
print(matches)
print(matches.index)
print(type(matches.index))

It is important to note that the Sorted Neighbourhood method has a very important ‘window’ parameter, which defaults to 3 in recordlinkage. The window size determines how many candidate pairs are generated: a window size of 1 returns the blocking index (essentially an exact match), whereas a huge window size will be computationally expensive.

It should be noted that, based on my experimentation, the window size has to be a positive, odd integer.
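To make the windowing concrete, here is a toy pure-Python sketch of sorted-neighbourhood pairing. It is my own simplification (the real library sorts both sides together and handles ties differently), but it shows why window=1 degenerates to near-exact matching:

```python
from bisect import bisect_left

def sn_pairs(left, right, window=3):
    """Pair each left key with the right keys inside a window centred
    on its position in the sorted right-hand keys (toy version)."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    order = sorted(range(len(right)), key=lambda i: right[i])
    keys = [right[i] for i in order]
    pairs = []
    for li, key in enumerate(left):
        pos = bisect_left(keys, key)            # sorted position of this key
        lo, hi = max(0, pos - half), min(len(keys), pos + half + 1)
        pairs.extend((li, ri) for ri in order[lo:hi])
    return pairs

print(sn_pairs(["bob"], ["carol", "bob", "alice"], window=3))
```

With window=3, each left record picks up its sorted neighbour on either side, so small spelling differences still land inside the window.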

Documentation extracted from RecordLinkage site here.

Although I am doing these 3 combinations just to experiment, by right I should go with the last combination for what I am looking for. This is because I want the Sorted Neighbourhood indexing to find similar names, and then only check the similarity of the uen. This makes sense especially with the default window of 3, because there are not too many companies with similar names. We will also be able to tolerate typos in the company names (through the SN) and typos in the uen (through the comparison).

The first combination, using the SN method on the uen and then comparing the company name, does not make sense because many uens are similar, and the SN window limits the number of combinations. We could expand the window size, but as the uen is randomly generated, that still would not be computationally justified.

The second combination is also not useful because it first uses the SN method on the company name, then uses the similarity algorithm to check the company name again. Pretty much a double-effort exercise.

Below are the results of using a default window of 3 with the 3 different combinations.

Combination 1 with only 2 results
Combination 2 with more than 4 results
Combination 3 with exactly 4 results.

Just as we expected, Combination 3 yielded the best results.

However, let’s not get ahead of ourselves, and instead dive deeper into the individual results.

Combo 1 — SN with uen, comparing company_name and entity_name

Using index 241 as an example, we can see that the SN generated only 2 matches for each index. The uen numbers were very close, but basically useless because of the window size.

Combo 1 with default window size of 3

To experiment with the window size, I set it to 101 to see what would happen, expecting an increase in index matches. As the results show, the number of candidate pairs increased from 8 to 400. We can once again see that the uen matches are there but of no use to us: out of 4 indices, only 3 found relevant matches.

Lastly, just for fun, I used a window size of 1001. Instead of 400 rows, it returned 3998. But the final result is sadly the same as with a window size of 101: out of 4 indices, only 3 found relevant matches.

Combo 2 — SN with company_name and entity_name, comparing company_name and entity_name

We will use index 241 again to show why the double effort is not useful. It can be seen from the results that we are not utilizing the uen at all, so the comparison adds nothing beyond what the indexing on the company name already did.

Combo 3 — SN with company_name and entity_name, comparing uen

The results for Combo 3 are good, as the concept is aligned with what we are looking for: similarities in company_name (within a limited window) that allow for minor deviations, combined with a comparison of similar uens.

The match is good.

Wrap Up for Sorted Neighbourhood Method

I think the SN method is good, but it is important to know the context of how it will be applied before actually applying it to the dataset. The window parameter is also crucial, as it affects the computation resources.

Wrap Up for RecordLinkage

There are basically 4 types of indexing methods, as shown below:

Snapshot from University of Waterloo. Link here.

For our particular case, we tried the full index, block index and Sorted Neighbourhood index. I do think Sorted Neighbourhood combo 3 is a good way to resolve issues for this particular dataset going forward. However, for bigger and more complex datasets, I might consider a combination of block index and Sorted Neighbourhood index, wrapped in an iterative loop, to get the final results.

To wrap everything important up in a few points, below are the key points to take note when using recordlinkage to solve the problem:

  1. Zero in on the problematic subset first to reduce the index build-up size.
  2. Understand the context of the problem and figure out which indexing method, or combination of indexing methods, best resolves the issue. Try to find similar columns; in this case, uen and the company names clearly identify the records.
  3. For Sorted Neighbourhood, the window parameter is very important and should be set to an appropriate size based on the allowable variation in the data. This can only be done after understanding that variation through your domain knowledge.
  4. When comparing, it is important to try out different thresholds to get a gauge of the allowable variation too.

I did not cover the other parts of RecordLinkage, such as Classification and Evaluation, as this particular dataset is relatively clean. However, if your dataset contains a lot of user-entered data, as mentioned in the introduction, you will likely need to explore the Classification and Evaluation portions.
