Perl 6 small stuff #13: Did you mean X?

Given that I wrote an article a while back suggesting that Perl 6 should change its name to Century, you’d probably expect me to have an opinion about the Raku debacle. Well, I don’t. Instead, I think the time has come to share something other than interesting peculiarities of Perl 6, namely real code I use in my work.

Too soon?

I work as a kind of Jack-of-all-trades at a Norwegian media company, mostly wrangling numbers. Now and then I need to write a few ad hoc scripts to help extract interesting data from various databases, web pages, etc. Some of these end up in production, deployed in the cloud. But a surprising amount of what I do is writing ad hoc code that solves a specific one-off problem: code that is neither particularly optimised nor especially robust, given that it will be used once and never leave my MacBook.

What follows is one example of the latter.

One day I noticed that quite a few of the searches our clients made in one of our databases returned zero results. It was easy to see that many of them were caused by incorrect spelling [1]. Wanting to investigate this a little further, I pulled six months’ worth of queries from our log database. Since this was not meant as anything but an experiment, I reduced the amount of data by removing multi-word queries and keeping only one-word queries consisting of alphabetic characters. I also normalized the words, i.e. I dropped accented versions of characters. That reduced the dataset to a manageable size. I stored the resulting dataset in a simple CSV file.

File: st2.csv (containing 726,609 queries; a subset shown here)
Gaza,94279,editorial
Kaulitz,1667,editorial
senkvald,0,editorial
senkveld,1608,editorial
plenklipper,0,editorial

The first column is the word that was searched for, the second is the number of results it returned (the third is irrelevant to the discussion here).

Looking at this data I started wondering whether I could make qualified guesses as to what words the misspellers actually thought they wrote. In short: Could I create data that could form the basis of a Google-like “Did you mean X” feature?

I knew that I could use the Levenshtein-Damerau algorithm (among others) to compare one word to another and figure out how similar they were. But it seemed like overkill to compare each word to all the words in, say, a dictionary. A program doing that would take forever to run.

But: If you look at the example CSV above you see that the word “senkvald” has zero results while “senkveld” has 1608 results. That made me realize that for each mis-spelled query there’d likely be a correctly spelled query that returned results. I could limit the comparisons to queries with results.
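As a quick aside before the full script: the dld routine from Text::Levenshtein::Damerau (the same routine my script below calls) does the actual comparison, so checking a single pair is a one-liner. A minimal sketch:

use Text::Levenshtein::Damerau;
# dld returns the number of edits needed to turn one word into the other
say dld("senkvald", "senkveld");    # 1 (the two words are a single substitution apart)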

What I needed was a program that created a list of mis-spelled zero-hit words and a list of words with hits; that optimisation alone would speed things up. In addition I thought that the speed would increase even further if I utilized Perl 6’s parallelisation features.

Now, bring out your torches and pitchforks, because I’m sure you’ll see lots of stuff to improve and optimize here. But in this case it’s a little beside the point, because some of the stuff I’ve done here is done because it’s simple to do and easy to read, and not because it’s the fastest way to do these things [2].

#!/usr/bin/env perl6
use v6.c;
use Text::Levenshtein::Damerau;

my %term;
my %alphabet-hits;
my %alphabet-zero-hits;
my @results;
my @promises;
my @zero-hit;
my @non-zero-hit;
my $max-no-comparisons = 0;       # running total of comparisons to be made
my $PARALLELS          = 4;
my $MAX-DISTANCE       = 0.3;
my $MIN-COMP-LENGTH    = 4;
my $MAX-COMPARISONS    = 1_000_000;
constant @ALPHABET = (("a".."z"), (<æ ø å>)).flat;

$*ERR.out-buffer = 0;             # turn off output buffering (see point 2 below)
$*OUT.out-buffer = 0;

read-words;
similarity;
sort-and-spurt;

sub read-words {
    my $file = open "st2.csv";
    for $file.lines -> $line {
        my @elms = $line.split(",");
        %term{lc(@elms[0])} += @elms[1];
    }
    $file.close;
    note "Read search queries.";

    for (%term.keys) -> $t {
        @zero-hit.push($t)     if %term{$t} == 0;
        @non-zero-hit.push($t) if %term{$t} >  0;
    }
    @zero-hit     = @zero-hit.grep(     { $_.chars >= $MIN-COMP-LENGTH } ).sort;
    @non-zero-hit = @non-zero-hit.grep( { $_.chars >= $MIN-COMP-LENGTH } ).sort;
    note "Processed " ~ %term.keys.elems ~ " terms.";

    for (@ALPHABET) -> $char {
        %alphabet-hits{$char}      = (@non-zero-hit.grep( { $_.starts-with($char) } ));
        %alphabet-zero-hits{$char} = (@zero-hit.grep(     { $_.starts-with($char) } ));
        $max-no-comparisons +=
            %alphabet-hits{$char}.elems * %alphabet-zero-hits{$char}.elems;
    }
    note "Split dictionaries into alphabetized chunks.";
}

sub sort-and-spurt {
    for (@results.sort) -> $o {
        my ($a, $b) = $o.split(',');
        my $near-equal =
            ($a.split('').unique.sort cmp $b.split('').unique.sort
             == Same ?? '*' !! '');
        say "$o,$near-equal";
    }
    note "Sorted and output finished.";
}

sub similarity {
    my $processed-comparisons = 0;
    my $zero-chunk-size = (@ALPHABET.elems / $PARALLELS).Int;
    my @alphabet-copy = @ALPHABET;

    for 1..$PARALLELS -> $t {
        $zero-chunk-size++ if
            ($t == $PARALLELS && $zero-chunk-size < @alphabet-copy.elems);
        my @alpha-chunk = @alphabet-copy.splice(0, $zero-chunk-size);
        push @promises, start compare(@alpha-chunk, $t);
    }
    note "Finished waiting: " ~ await Promise.allof(@promises);

    sub compare(@alphabet-subset, $parallel-no) {
        note "Parallel $parallel-no: " ~ @alphabet-subset.elems;
        for (@alphabet-subset) -> $start-char {
            for ((%alphabet-zero-hits{$start-char}).flat) -> $z {
                for ((%alphabet-hits{$start-char}).flat) -> $nz {
                    my $distance = dld($z, $nz);
                    my $fraction = $z.chars > $nz.chars ??
                        $distance / $z.chars !! $distance / $nz.chars;
                    if $fraction < $MAX-DISTANCE {
                        @results.push: $z ~ "," ~ $nz ~ "," ~ $fraction;
                    }
                    note "Compared $processed-comparisons of " ~
                        "$max-no-comparisons."
                        if $processed-comparisons %% 10_000;
                    return False if ++$processed-comparisons > $MAX-COMPARISONS;
                }
            }
        }
    }
}

This script did the job! I got a list of around 17,000 suggestions that could be useful for implementing a “Did you mean X” type of routine. But what was unexpectedly more useful, and more immediately useful at that, was that we could use it internally to improve how we tag our content. Instead of improving things automatically, it first and foremost improves manual work.

Anyway, back to the interesting stuff (at least to me) this uncovered about Perl 6:

  1. I used the keyword note a lot; note is like say, except that it outputs to STDERR, a little like warn in Perl 5. That way I could redirect the output of the program to a file while still being able to see the debug/error info.
  2. Output is often buffered. Most of the time I want it not to be. I want to see the results immediately, so I used this: $*OUT.out-buffer = 0; — This is equivalent to Perl 5’s $|++, except that the Perl 6 version is unexpectedly verbose but at the same time self-explanatory.
  3. constant @ALPHABET = (("a".."z"), (<æ ø å>)).flat; — as I had normalized all of the keywords and only kept keywords with Norwegian alphabetic characters, I had to generate a list of Norwegian characters. I couldn’t use a range, i.e. "a".."å", as that would contain all kinds of characters. So I combined two lists, the range "a".."z" and the list <æ ø å>, into one by using the .flat method.
  4. my @alpha-chunk = @alphabet-copy.splice(0, $zero-chunk-size); — I must have used this in other variants before, but splice felt like a new discovery. Splice is kind of like a super pop: it removes and returns not just a single element but an entire chunk of an array. In other words, the original array is shortened accordingly, while you can assign the removed elements to a new array (see the first sketch after this list). This was useful for splitting the array into pieces I could use when parallelising the comparison task.
  5. And parallelisation is dead simple! All I needed was the start keyword. I did this: push @promises, start compare(@alpha-chunk, $t); — the subroutine compare is parallelised by the start keyword. You don’t strictly need the push, but you should use it: the @promises array contains all the, well, promises. The line note "Finished waiting: " ~ await Promise.allof(@promises); ensures that all the parallels have finished running before moving on. That was all there was to it! The $PARALLELS variable controls how many parallels you want to run. I chose 4 as that’s the number of cores in my MacBook’s processor.
  6. A sub within a sub: Since the compare sub was only to be used by the similarity sub, I used Perl 6’s ability to have a sub within a sub. So the compare sub is only visible from within similarity. I probably did this only to impress myself.
  7. The similarity calculation my $distance = dld($z, $nz); my $fraction = $z.chars > $nz.chars ?? $distance / $z.chars !! $distance / $nz.chars; is a combination of Levenshtein-Damerau and my own stuff. The L-D algorithm — simplified — compares two strings and calculates how many edits you need to turn one word into the other (think of how many keyboard presses you’d need). You get an integer with the number of edits as the result. But two edits in a short word are more significant than two edits in a long word. So I create a factor: the number of edits divided by the length of the longer of the two words. This factor is a floating point number between 0 and 1; the lower the number, the more similar the words are. For example, dld("senkvald", "senkveld") is 1, and 1 / 8 = 0.125, comfortably below my 0.3 cut-off.
  8. In the line my $near-equal =
     ($a.split('').unique.sort cmp $b.split('').unique.sort
     == Same ?? '*' !! '');
    I reduce each word to a sorted array of the unique characters in that word. I.e. ABBA is reduced to the array <A B>, but so is BABA. My assumption was that matches containing identical sets of characters are more similar than the other matches, even though the edit distance may be larger. I.e. ABBA and BABA are more similar than ABBA and ABDA — <A B> and <A B D>. What I discovered is that if you compare two arrays with cmp, i.e. check whether their elements contain the same values in the same order, the comparison returns Same (see the second sketch after this list). In Perl 5 you get 0 as a result. In Perl 6 you can choose to check for 0 as well as Same and both will work, but I have to say Same is more self-explanatory.

So it’s safe to say that I learned a lot by programming this. It was equally nice that the results were useful!


NOTES
[1]
Of course quite a few of the zero-hit searches were caused by the fact that we didn’t have matching content, but those were beyond my scope.

[2] But I probably should have been more concerned with speed. Even though I’ve said earlier that speed depends on your point of view, slow is slow. Even with the parallelisation, this script took almost five hours to finish. It may be faster in the latest version of Rakudo; for this I used Rakudo Star 2018.06.


POSTSCRIPT
I had to do it, didn’t I? Since I’m flirting with two ladies at the moment, both Camelia and Julia, I couldn’t resist implementing the same thing in Julia version 1.0.0.

One thing I didn’t expect: Although Julia supports parallelisation, arguably just as easy if not easier to use than Perl 6’s, Julia’s parallelisation was unusable for this. If I understood the documentation correctly, Julia’s lightweight tasks are good at overlapping IO-bound work; but if you try to run CPU-bound code that way, the tasks are queued up and run on a single thread.

A win for Perl 6 there, in other words.

However, the Julia code was 2/3 the size of the Perl 6 version. And execution speed? Well, this being a Perl 6 blog I’m almost ashamed to admit that the non-parallelised Julia version finished the comparisons in 12 minutes, approximately 20 times faster than Perl 6.

When it comes to the size of the code, much of the difference is probably down to the fact that, having already written this once, I knew where I could make things easier. But the thinking behind the Julia version is the same, and how I filter and flip and flop the data is the same. That’s why I don’t have a good explanation for the speed difference (just to double-check I also made a Perl 5 version, single-threaded like the Julia version; it used more or less the same amount of time as my original Perl 6 version). One of these days I have to run the profiler on the Perl 6 version. I guess that much of the time is spent calculating the Levenshtein-Damerau distance, but I believe the for loops themselves take time as well.

While we’re talking about Levenshtein, it’s worth noting that Julia’s Levenshtein implementation automatically calculates a distance factor. I made my own in the Perl 6 version. The sharp-eyed reader will see that in my code the lower the factor is, the greater the similarity. The Julia version is exactly the opposite.

All the optimisation stuff is a story for another day, if ever. Meanwhile, here’s the Julia version. Thanks for bothering to read all the way to the end!

#!/usr/bin/env julia
using Base.Unicode
using StringDistances

alphabet = vcat('a':'z', ['æ', 'ø', 'å'])
dictionary = Dict{String,Int}()
zero_words = []
hit_words = []
similarity_count = 0
total_similarities = 0

function read_words()
    open("st2.csv") do file
        for ln in eachline(file)
            c, d = split(ln, ",")
            e = parse(Int, d)
            try
                dictionary[lowercase(c)] += e
            catch
                dictionary[lowercase(c)] = e
            end
        end
    end
    for (k, v) in enumerate(dictionary)
        (word, value) = v
        if value == 0
            push!(zero_words, word)
        else
            push!(hit_words, word)
        end
    end
    sort!(zero_words)
    sort!(hit_words)
    for c in alphabet
        hw = size(filter(p->startswith(p, c), hit_words))[1]
        zw = size(filter(p->startswith(p, c), zero_words))[1]
        global total_similarities += (hw * zw)
    end
end

function similarity()
    for k1 in zero_words
        for k2 in filter(p->startswith(p, string(first(k1))), hit_words)
            # compare returns a similarity score between 0 and 1 (1 = identical)
            x = compare(DamerauLevenshtein(), k1, k2)
            if x > 0.7
                println("$k1\t$k2\t$x")
            end
            global similarity_count += 1
            if similarity_count % 10000 == 0
                perc = (similarity_count / total_similarities) * 100
                println(stderr, "Progress... $perc % [ $similarity_count of $total_similarities]")
                flush(stdout)
            end
        end
    end
end

read_words()
similarity()