Implementing Soundex in Power Query (M)

Daniel Marsh-Patrick
Daniel Marsh-Patrick
7 min readAug 16, 2017

I’ve recently been working with some data in Power BI containing contact names and looking into how I might be able to clean it up and reduce issues with typos etc., or attempt fuzzy joining data. I’ve looked at similar challenges in T-SQL before using the SOUNDEX function to give me a place to start in this area and was hoping that Power Query had something similar… unfortunately not.

Fortunately, the M language that sits behind Power Query is quite extensive and I figured I could probably implement this in a function that could be called from my transformation.

(It had occurred to me that I could probably use an R transformation for this but from a personal perspective, it’s a welcome opportunity to see a bit more of what the M language is capable of)

The Soundex I’m Looking For…

The algorithm I’m implementing is the American Soundex, but this code could be tweaked for the other types if you want to mix things up a bit.

I’ve created some simple people data from Mockaroo and here’s a small sample:

Some of our sample data

If I choose a few othese last names from the above table, based on the algorithm, my expected output would be:

  • Paynton => P535
  • Epple => E140
  • Bremen => B655

The Finished Product

If you just want to get on with using it then you can follow the instructions below to get started. I have commented and whitespaced the code fairly heavily but if you want to learn more about how it works then I have written a detailed analysis after this section.

Disclaimer: the code hasn’t been performance tested or tuned and might run quite terribly with a lot of rows. I’d like to optimise this further and I think that there’s a lot of low hanging fruit in there if I ever get to it…

2018/04/12: there’s a slightly better version in the comments below.

  1. Create a blank query (in Power Query or Excel) — the instructions assume you’ve named the query fn_Soundex but you can rename accordingly if it doesn’t suit your personal preference.
  2. Open up the Advanced Editor and paste in the following code:

Once you click Done, you should get something like the following in your Power Query editor:

fn_Soundex added to Power Query

I can then add this to my People query:

  1. Under Add Column, click Invoke Custom Function.
  2. Fill out the dialog with the necessary parameters and click OK, e.g.:
Adding the function column to soundex our Last Name field

You’ll now see a new column for your soundex result, e.g.:

Soundex calculated!

Doing a quick spot-check against our earlier choices shows that we’re doing this right. I have also done a lot more testing against a more extensive list of names and these all look good too!

Breaking it Down

This section helps to decompose the approach taken to show how some of the algorithm has been solved using the M language. We’ll work with the Paynton surname as our example and step through the code and look at the output.

Cleaning the Input

First things first, we need to do a couple of things to our input:

  1. Get the text down to alpha characters only (a through z). Even though the algorithm throws a lot of characters out, if the first character is a vowel or an excluded consonant then we still need to use it for our generated code.
  2. As M is case-sensitive, convert the string to upper-case.

We can do this in one go by converting each character to its numeric representation and surrounding it with the case conversion:

#"Cleaned String" = 
Text.Upper(
Text.Remove(
string,
List.Transform(
{0..64, 91..96, 123..50000},
each Character.FromNumber(_)
)
)
)

Submitting our data will simply result in an upper-case version of the string we submitted which is ideal, but if our name contains whitespace, hyphens or some other non-alpha character for whatever reason, then it will strip it out.

Grabbing the First Character

Our generated soundex will always want the first alpha character, so this is simply getting the first character of the cleansed input:

#"First Character" = Text.Start(#"Cleaned String", 1),

For our example, we’ll simply end up with P here.

Creating our Working Data for the Digits

So far, so straighforward. The next set of steps gives us our working data for the generation of the 3-digit portion of the generated code.

In some other language I might have solved this with a simple loop and counter to allow me to iterate over the remaining characters, checking and mapping accordingly, and then checking the previous characters for consecutive exclusions. I could apply a set-based approach here to be in-keeping with M’s modus operandi:

  1. Create a table with single column; each row representing each individual character.
  2. Add an index column, starting at 0 and incrementing by 1, which is a proxy for a loop counter.
  3. Add an empty column to hold our mapped character.
#"Split Character List" =
Table.FromList(
Text.ToList(Text.Range(#"Cleaned String", 1)),
Splitter.SplitByNothing(),
{"Character"}
),
#"Character Index Column" =
Table.AddIndexColumn(
#"Split Character List",
"Index",
0,
1
),
#"Soundex Code Column" =
Table.AddColumn(
#"Character Index Column",
"Soundex",
each null,
type text
)

After this stage, we have the following table established:

Working table for remaining characters, with index and column to store mapping

Processing the Table and Mapping Values

The largest portion of the code is concerned with applying the mappings to the data. The basic idea here is that we look at each row and populate the Soundex column accordingly:

  • If the character is a vowel or excluded consonant, then set it to null.
  • If the character is usually permitted but the subsequent character is a repetition, then set this to null also.

I’d considered filtering excluded characters first but this way, if i want to tweak my algorithm to handle other characters differently, then I have the capacity to do so here without too much refactoring:

#"Processed Characters" = Table.FromRecords(
Table.TransformRows(
#"Soundex Code Column",
each
Record.TransformFields(
_,
{"Soundex", (f)=>
let
/* Set flag to true if consecutive characters found */
omitChar = (
if [Index] <> 0
and #"Soundex Code Column"
{[Index] - 1}[Character] = [Character]
then true
else false
),
/* Check for character in list and assign a numeric
value, if not eligible for omission */
soundexChar = (
if omitChar = false
then
if List.Contains(
{"B","P","F","V"},
[Character]
)
then "1"
else if List.Contains(
{"C","G","J","K","Q","S","X","Z"},
[Character]
)
then "2"
else if List.Contains(
{"D","T"},
[Character]
)
then "3"
else if List.Contains(
{"L"},
[Character]
)
then "4"
else if List.Contains(
{"M","N"},
[Character]
)
then "5"
else if List.Contains(
{"R"},
[Character]
)
then "6"
else null
else null
)
in
soundexChar
}
)
)
)

The code applies a table transformation, then a nested record transformation, singling out the Soundex column for update. Doing it this way allows us to access the other records in the row fairly easily.

This means that we can use the current Index columm value and subtract 1 from it to access the previous row in the table and compare it.

The sizeable if..else branching has been done due to the lack of a switch (or similar) structure in M, but our algorithm is simple enough that it isn’t too much of a problem here.

At this point, our query now looks like this:

Working table after mapping (noting ignored vowels)

As tihs doesn’t clarify the consecutive letter situation very well, let’s take a short detour and look at what happens if we submit Paynnton:

…we get points for the first ’N’ but not the next, which is what we want :)

Removal of Unmapped Characters

As our excluded characters aren’t needed in the generated soundex code, we just remove any null entries from the result set:

#"Removed Placeholders" = 
Table.SelectRows(
#"Processed Characters",
each [Soundex] <> null
)

Our table now looks like this:

Characters that managed to stay on the island, and their mappings

Final Preparation of the Digits

Now that we’re done mapping, we just need to do the following:

  • Flatten down the digits to a continuous string of text.
  • If the flattened string is more than 3 digits, cap it at 3. For example, 52314 would resolve to 523.
  • If the flattened string is less than 3 digits, pad the end with zeros. For example, 1 would become 100, or null would become 000.
#"Soundex Digit Combiner" = Combiner.CombineTextByDelimiter(""),#"Soundex Digits" = 
Text.PadEnd(
Text.Start(
#"Soundex Digit Combiner"(
Table.ToList(
Table.SelectColumns(#"Removed Placeholders", {"Soundex"})
)
),
3
),
3,
"0"
)

We firstly create a Combiner with an empty delimiter. This is used to join our list back together.

The inner Table.ToList is fed into the invocation of this Combiner, which is wrapped with a Text.Start to get the first three characters of the flattened string. The outer Text.PadEnd adds the extra zeroes if the inner string is less than the required minimum length.

And Finally… the Generated Code

For this we just stick together the first letter with the three digits we’ve derived:

#"Soundex Code" = #"First Character" & #"Soundex Digits"

And we’re done!

--

--

Daniel Marsh-Patrick
Daniel Marsh-Patrick

Full-stack developer and BI afficianado, based in Auckland, NZ| I seem to enjoy writing about Power BI a lot | @the_d_mp