SSIS Fuzzy Grouping
A simple scenario to use Fuzzy Grouping in SSIS
To put simply, the Fuzzy Grouping transformation lets you group rows of data together based on the similarity of the values in a specified column. It is essentially used as a data cleaning component, which identifies rows that are possible duplicates and then selects a specific row of data to use in standardizing the data. This is especially useful for performing aggregations on this data after fuzzy grouping has been done.
Take for example a table containing data about the number of shares owned by certain users. The table contains two columns: Name, Shares — with the ‘Name’ being the name of the user and ‘Shares’ being the number of shares owned by that user.
The table contains the following values:
Name | Shares
adeeb | 20
james | 15
jamess| 35
adeeba | 5
mark | 60
Not going into detail regarding how the data was sourced in SSIS (could have been an Excel File or an OLE DB Source). Performing a Fuzzy Group will let you specify which of the columns to perform fuzzy matching on, which of the columns to ‘pass through’ to output and what the name of the group output column should be. If needed, you can adjust the similarity threshold in the advanced tab i.e. threshold of 0.5 means 50% similarity to the searched value will be displayed. The output of the fuzzy grouping transformation contains several other rows such as _score, which indicates the similarity to the searched value, a _key_out field where rows with the same value in the _key_out field is part of the same group. The important output field we will be concerned about is the group output field, which will contain the unique name that groups all other similar values. Let’s say we named this column ‘Name_Cleaned’. Assuming that the fuzzy match grouped ‘adeeb’ and ‘adeeba’, as well as ‘james’ and ‘jamess’, it will contain the same value in the Name_Cleaned field per group. So both ‘adeeba’ and ‘adeeb’ rows will have either ‘adeeb’ or ‘adeeba’ in the Name_Cleaned field, and the ‘james’ and ‘jamess’ rows will both have either ‘james’ or ‘jamess’ in the Name_Cleaned field depending on which name was chosen as the unique namemby the transformation component. We will assume that ‘adeeb’ was chosen for the first group, and ‘james’ for the second. Of course, there is also ‘mark’, which has just one result so it remains ‘mark’.
Once the fuzzy grouping transformation is set up, it can be connected (output) to an aggregate transformation. We will group by the ‘Name_Cleaned’ column (which is now unique per group) and SUM the ‘Shares’ column. The resulting output would be:
adeeb | 25
james | 50
mark | 60
And there you have it, a simple example of using Fuzzy Grouping. For a more elaborate tutorial on using Fuzzy Grouping in a similar scenario, have a look here. Note that Fuzzy Grouping and Fuzzy Lookups are two different transformations. I will cover examples of Fuzzy Lookups in a separate post.