DATA STORIES | STRING MANIPULATION | KNIME ANALYTICS PLATFORM

Matching “Non-Matching” Strings

A Guide to Resolving Non-Matching Records with KNIME

Hans Samson
Low Code for Data Science

--

Introduction

Matching data files can be a challenge, especially when it comes to strings. Every year I take up this challenge when I match the pop songs from the Top2000 annual list with my base collection of all Top2000 annual lists since 1999. The Top2000, a compilation of the 2000 most popular pop songs of all time, has been compiled yearly and broadcast between Christmas and New Year’s Eve by NPOradio2 in the Netherlands. However, due to different spellings and the use of punctuation in one of the files, I end up with non-matches for records that should actually match.

In this blog post, I’ll describe step by step how “a human in the loop” uses KNIME to generate rules that are automatically processed. This process ensures that the string in one input file is made equivalent to the string in the other input file, making a proper match possible. I will delve into the process and showcase my KNIME workflow using the Top2000 as a real-world case.

Download my KNIME workflow from the KNIME Community Hub. The data used in the KNIME workflow is not the complete dataset, but a sample to support the methodology described in this blog.

Matching Files

The first step is to examine the result of matching the new list with the compiled list. Experience over the years shows that a very large part (about 95%) of the 2000 songs in the latest list should match the songs in the base list. Only a small portion of the entries in the new list will be new for the latest year.

Since both artists and titles can appear multiple times in the list (The Beatles appear 60 times; the title “Crazy” appears 5 times, by Aerosmith, Gnarls Barkley, Icehouse, Patsy Cline, and Seal), the songs are matched on the combination of song title and performing artist. Each entry on the list, a combination of a title and an artist, is made unique in both files by concatenating title and artist into one variable with the Column Aggregator node. The choice of delimiter may look a bit strange, but at the end of the workflow it will turn out to be a useful choice.

Result of concatenating “title” and “artist” by using the Column Aggregator node with *|* as the delimiter.
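The concatenation step can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the KNIME node itself; the column names are assumptions for the example. It also shows why the unusual *|* delimiter pays off: it is very unlikely to occur inside a real title or artist name, so the key can be split back into its two parts unambiguously.

```python
# Sketch of the Column Aggregator step: build one unique match key
# per song by joining title and artist with the *|* delimiter.
DELIM = "*|*"

def make_key(title: str, artist: str) -> str:
    """Concatenate title and artist into a single match key."""
    return f"{title}{DELIM}{artist}"

songs = [
    {"title": "Crazy", "artist": "Seal"},
    {"title": "Crazy", "artist": "Gnarls Barkley"},
]
keys = [make_key(s["title"], s["artist"]) for s in songs]

# Because *|* never appears in the data itself, the key splits
# back cleanly into its original parts.
title, artist = keys[0].split(DELIM)
```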

The Column Aggregator gives me a key to match all songs from recent years and the latest annual list with the Joiner node.

The configuration of the Joiner node. Don’t forget the checkboxes to split the results into matched and unmatched rows.

The Joiner node has three output ports. For now, the outputs containing the unmatched records are of interest. However, there may still be matches to be made here, where one record is simply spelled slightly differently than the other. To discover this, we will compare the records from both output ports one by one.
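The three output ports of the Joiner node can be mimicked with plain set operations. This is only a conceptual sketch with made-up keys, assuming the match keys built earlier:

```python
def join_split(base_keys, new_keys):
    """Split keys into matched rows and the two unmatched remainders,
    mimicking the three output ports of KNIME's Joiner node."""
    base, new = set(base_keys), set(new_keys)
    matched = base & new          # inner-join port
    unmatched_base = base - new   # unmatched rows from the base file
    unmatched_new = new - base    # unmatched rows from the new file
    return matched, unmatched_base, unmatched_new

m, ub, un = join_split(
    ["Crazy*|*Seal", "Hotel California*|*Eagles"],
    ["Crazy*|*Seal", "Hotel California*|*The Eagles"],
)
# The second song fails to match only because of the artist spelling.
```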

Comparing Records

In order to compare the non-matches from both output ports of the Joiner node, I create all possible combinations with the Cross Joiner node. All non-matches (2929) from the base file are crossed with all non-matches (179) from the new file. This results in a total of 2929 x 179 = 524,291 records. These are the records that differ in some way and need to be examined to see if they might still be the same combination of title and artist. Going through and evaluating 525K records by hand is impossible, but it can be done automatically: the String Similarity node calculates the degree to which two strings match (or differ).
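The cross join itself is nothing more than the Cartesian product of the two unmatched sets, which is easy to verify with Python's standard library:

```python
from itertools import product

def cross_join(base_rows, new_rows):
    """All pairwise combinations of two row sets,
    like KNIME's Cross Joiner node."""
    return list(product(base_rows, new_rows))

# 2929 base non-matches x 179 new non-matches = 524,291 candidate pairs.
pairs = cross_join(range(2929), range(179))
```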

String Similarity

There are several methods to calculate similarity with the String Similarity node. In this case I usually opt for the Levenshtein method, but you can also use another method such as Jaro-Winkler or n-gram overlap, or choose a hybrid form by combining multiple similarity measures.

The Levenshtein distance measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into another. It is computed by constructing a matrix, where each cell represents the cost of transforming substrings. The bottom-right cell of the matrix provides the distance, indicating the similarity between the two strings.
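The matrix computation described above can be written compactly in Python. This is a textbook sketch of the algorithm, not KNIME's internal implementation; the normalized similarity at the end is one common convention (higher means more similar) and KNIME may scale its score differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b (dynamic programming)."""
    # prev[j] holds the edit distance between the processed prefix
    # of a and b[:j]; only one previous matrix row is kept in memory.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]  # the bottom-right cell of the matrix

def similarity(a: str, b: str) -> float:
    """One common normalization: 1.0 for identical strings, 0.0 for
    strings with nothing in common (assumed convention, see lead-in)."""
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)
```

For example, `levenshtein("kitten", "sitting")` is 3: two substitutions and one insertion.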

With the Levenshtein measure in KNIME, the higher the value, the more similar the two strings are. By sorting the table of candidate matches from high to low, you can quickly see with your own eyes at which Levenshtein value the matches become interesting to evaluate, and therefore which records definitely don’t need attention.

Evaluating Matches

After the two Row Filter nodes, which first filter out duplicate records in the new file (keeping the record with the maximum similarity score) and then filter on the Levenshtein value (the choice of similarity threshold is arbitrary), a number of non-matches remain that need attention from a human in the loop. Evaluating the non-matches is done in a Table View node. If a record is deemed a match, you check it, and a new column is added to the data table indicating which records have been checked (true) and are therefore matches.
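The two filtering steps can be sketched as follows. The threshold value and the tuple layout are illustrative assumptions, not taken from the workflow:

```python
THRESHOLD = 0.5  # arbitrary cut-off, as in the text

def best_candidates(scored_pairs, threshold=THRESHOLD):
    """Keep, per record from the new file, only its best-scoring base
    candidate, then drop candidates below the similarity threshold.
    scored_pairs: iterable of (new_key, base_key, similarity) tuples."""
    best = {}
    for new_key, base_key, sim in scored_pairs:
        if new_key not in best or sim > best[new_key][1]:
            best[new_key] = (base_key, sim)
    return {k: v for k, v in best.items() if v[1] >= threshold}

pairs = [
    ("Crazy*|*Seal", "Crazy*|*SEAL", 0.9),
    ("Crazy*|*Seal", "Lazy*|*Deep Purple", 0.3),
]
result = best_candidates(pairs)
# Only the high-scoring candidate survives for human evaluation.
```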

In the Table View node, check the non-matches that should match, and close the view (Apply settings temporarily).

A few steps back: the goal is to match all records from the new file to the base file, provided they are the same songs, so that only this year’s new entries remain as non-matches. To match the non-matches, I need to adjust the title and artist in the new file to the notation used in the base file. For this, I will use the Rule Engine Dictionary node.

Adjusting Titles and Artists in the New File

The Rule Engine Dictionary node has two inputs: one for the input data (the latest Top2000 file) to which the rules apply, and one for the rules applied to that input. The syntax of the rules for the dictionary has a special structure. The beauty of KNIME is that this structure can be replicated, with the non-matches from the Table View, in a String Manipulation node.

Configuration of the String Manipulation node to create the rules needed in the Rule Engine.
The rules created by the String Manipulation node, so the entry from new_concat will match base_concat.
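Conceptually, the String Manipulation node turns each confirmed match into one rule line of the form `condition => outcome`. The sketch below generates such rule strings in Python; the column names new_concat and base_concat follow the text, and the quoting is simplified (real titles containing quotes would need escaping):

```python
def make_rule(new_concat: str, base_concat: str) -> str:
    """Build one Rule Engine (Dictionary) rule: when the key in the new
    file equals new_concat, replace it with base_concat."""
    return f'$new_concat$ = "{new_concat}" => "{base_concat}"'

rule = make_rule("Hotel California*|*The Eagles",
                 "Hotel California*|*Eagles")
```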

The workflow also includes a Table Creator node with one rule (TRUE => $new_concat$). This rule ensures that all records that do not match any of the rules created in the String Manipulation node keep their original value (as in the new file). The rules in the Rule Engine node are executed from top to bottom, so it is important that the Table Creator node is connected to the lower port of the Concatenate node.

Configuration of the Rule Engine (Dictionary) node.

The Final Steps

By applying the rules in the Rule Engine node, the new_concat column of the latest Top2000 file is adjusted in such a way that it will match the base file with the history of all the year lists. I now have an updated version of the new Top2000 list.

After matching, the only dropouts left are the records that are completely new to the Top2000 list (70 songs in 2023).

--

Hans Samson
Low Code for Data Science

Hans is a data analyst/data scientist (but what's in a name)