Implementing Soundex in Power Query (M)
I’ve recently been working with some data in Power BI containing contact names and looking into how I might be able to clean it up and reduce issues with typos etc., or attempt fuzzy joining data. I’ve looked at similar challenges in T-SQL before using the SOUNDEX
function to give me a place to start in this area and was hoping that Power Query had something similar… unfortunately not.
Fortunately, the M language that sits behind Power Query is quite extensive and I figured I could probably implement this in a function that could be called from my transformation.
(It had occurred to me that I could probably use an R transformation for this but from a personal perspective, it’s a welcome opportunity to see a bit more of what the M language is capable of)
The Soundex I’m Looking For…
The algorithm I’m implementing is the American Soundex, but this code could be tweaked for the other types if you want to mix things up a bit.
I’ve created some simple people data from Mockaroo and here’s a small sample:
If I choose a few othese last names from the above table, based on the algorithm, my expected output would be:
- Paynton =>
P535
- Epple =>
E140
- Bremen =>
B655
The Finished Product
If you just want to get on with using it then you can follow the instructions below to get started. I have commented and whitespaced the code fairly heavily but if you want to learn more about how it works then I have written a detailed analysis after this section.
Disclaimer: the code hasn’t been performance tested or tuned and might run quite terribly with a lot of rows. I’d like to optimise this further and I think that there’s a lot of low hanging fruit in there if I ever get to it…
2018/04/12: there’s a slightly better version in the comments below.
- Create a blank query (in Power Query or Excel) — the instructions assume you’ve named the query
fn_Soundex
but you can rename accordingly if it doesn’t suit your personal preference. - Open up the Advanced Editor and paste in the following code:
Once you click Done, you should get something like the following in your Power Query editor:
I can then add this to my People query:
- Under Add Column, click Invoke Custom Function.
- Fill out the dialog with the necessary parameters and click OK, e.g.:
You’ll now see a new column for your soundex result, e.g.:
Doing a quick spot-check against our earlier choices shows that we’re doing this right. I have also done a lot more testing against a more extensive list of names and these all look good too!
Breaking it Down
This section helps to decompose the approach taken to show how some of the algorithm has been solved using the M language. We’ll work with the Paynton surname as our example and step through the code and look at the output.
Cleaning the Input
First things first, we need to do a couple of things to our input:
- Get the text down to alpha characters only (a through z). Even though the algorithm throws a lot of characters out, if the first character is a vowel or an excluded consonant then we still need to use it for our generated code.
- As M is case-sensitive, convert the string to upper-case.
We can do this in one go by converting each character to its numeric representation and surrounding it with the case conversion:
#"Cleaned String" =
Text.Upper(
Text.Remove(
string,
List.Transform(
{0..64, 91..96, 123..50000},
each Character.FromNumber(_)
)
)
)
Submitting our data will simply result in an upper-case version of the string we submitted which is ideal, but if our name contains whitespace, hyphens or some other non-alpha character for whatever reason, then it will strip it out.
Grabbing the First Character
Our generated soundex will always want the first alpha character, so this is simply getting the first character of the cleansed input:
#"First Character" = Text.Start(#"Cleaned String", 1),
For our example, we’ll simply end up with P
here.
Creating our Working Data for the Digits
So far, so straighforward. The next set of steps gives us our working data for the generation of the 3-digit portion of the generated code.
In some other language I might have solved this with a simple loop and counter to allow me to iterate over the remaining characters, checking and mapping accordingly, and then checking the previous characters for consecutive exclusions. I could apply a set-based approach here to be in-keeping with M’s modus operandi:
- Create a table with single column; each row representing each individual character.
- Add an index column, starting at
0
and incrementing by1
, which is a proxy for a loop counter. - Add an empty column to hold our mapped character.
#"Split Character List" =
Table.FromList(
Text.ToList(Text.Range(#"Cleaned String", 1)),
Splitter.SplitByNothing(),
{"Character"}
),
#"Character Index Column" =
Table.AddIndexColumn(
#"Split Character List",
"Index",
0,
1
),
#"Soundex Code Column" =
Table.AddColumn(
#"Character Index Column",
"Soundex",
each null,
type text
)
After this stage, we have the following table established:
Processing the Table and Mapping Values
The largest portion of the code is concerned with applying the mappings to the data. The basic idea here is that we look at each row and populate the Soundex column accordingly:
- If the character is a vowel or excluded consonant, then set it to
null
. - If the character is usually permitted but the subsequent character is a repetition, then set this to
null
also.
I’d considered filtering excluded characters first but this way, if i want to tweak my algorithm to handle other characters differently, then I have the capacity to do so here without too much refactoring:
#"Processed Characters" = Table.FromRecords(
Table.TransformRows(
#"Soundex Code Column",
each
Record.TransformFields(
_,
{"Soundex", (f)=>
let
/* Set flag to true if consecutive characters found */
omitChar = (
if [Index] <> 0
and #"Soundex Code Column"
{[Index] - 1}[Character] = [Character]
then true
else false
),
/* Check for character in list and assign a numeric
value, if not eligible for omission */
soundexChar = (
if omitChar = false
then
if List.Contains(
{"B","P","F","V"},
[Character]
)
then "1"
else if List.Contains(
{"C","G","J","K","Q","S","X","Z"},
[Character]
)
then "2"
else if List.Contains(
{"D","T"},
[Character]
)
then "3"
else if List.Contains(
{"L"},
[Character]
)
then "4"
else if List.Contains(
{"M","N"},
[Character]
)
then "5"
else if List.Contains(
{"R"},
[Character]
)
then "6"
else null
else null
)
in
soundexChar
}
)
)
)
The code applies a table transformation, then a nested record transformation, singling out the Soundex column for update. Doing it this way allows us to access the other records in the row fairly easily.
This means that we can use the current Index columm value and subtract 1
from it to access the previous row in the table and compare it.
The sizeable if..else
branching has been done due to the lack of a switch
(or similar) structure in M, but our algorithm is simple enough that it isn’t too much of a problem here.
At this point, our query now looks like this:
As tihs doesn’t clarify the consecutive letter situation very well, let’s take a short detour and look at what happens if we submit Paynnton:
Removal of Unmapped Characters
As our excluded characters aren’t needed in the generated soundex code, we just remove any null
entries from the result set:
#"Removed Placeholders" =
Table.SelectRows(
#"Processed Characters",
each [Soundex] <> null
)
Our table now looks like this:
Final Preparation of the Digits
Now that we’re done mapping, we just need to do the following:
- Flatten down the digits to a continuous string of text.
- If the flattened string is more than 3 digits, cap it at 3. For example,
52314
would resolve to523
. - If the flattened string is less than 3 digits, pad the end with zeros. For example,
1
would become100
, ornull
would become000
.
#"Soundex Digit Combiner" = Combiner.CombineTextByDelimiter(""),#"Soundex Digits" =
Text.PadEnd(
Text.Start(
#"Soundex Digit Combiner"(
Table.ToList(
Table.SelectColumns(#"Removed Placeholders", {"Soundex"})
)
),
3
),
3,
"0"
)
We firstly create a Combiner
with an empty delimiter. This is used to join our list back together.
The inner Table.ToList
is fed into the invocation of this Combiner
, which is wrapped with a Text.Start
to get the first three characters of the flattened string. The outer Text.PadEnd
adds the extra zeroes if the inner string is less than the required minimum length.
And Finally… the Generated Code
For this we just stick together the first letter with the three digits we’ve derived:
#"Soundex Code" = #"First Character" & #"Soundex Digits"
And we’re done!