Text Search algorithms and Elixir — A case study

Juli.Smz
Elixir Labs
Published in
5 min readAug 24, 2022

A bit of context

This article arose from the need to find a sublist, within a list. The first attempt to solve it, albeit with very poor performance, was to use the Enum.chunk_every/2 and Enum.member?/2 functions. To my surprise, there are no native functions to solve this in the most popular programming languages. Therefore, for this purpose, I explored different search algorithms with the aim of proposing a new native function that would solve it. However, the results that I found along the way ended up redirecting the focus to string-searching algorithms.

Part 1 — The algorithms

Naive

There are different search algorithms. The simplest but not always the most efficient is called Naïve, also known as “brute force”.
This algorithm works by shifting the pattern just one place to the right. No matter if there is a match or not, it will always move one place to the right. Besides, match verification is carried out by using the first character of the pattern, rather than the last one, as another algorithm called Boyer-Moore does. We´ll get into Boyer-Moore just below.

Naive

Boyer-Moore

In order to improve the performance of Naïve, another algorithm called Boyer-Moore was created. It calculates how many positions the pattern should move to the right, starting from the last character instead of the first one.

The Boyer-Moore algorithm has 2 shift-rules that make it even better: “The bad-character rule” and “the good-prefix rule”. This means we have 3 variants, but we are only going to focus on the Boyer–Moore–Horspool, which applies “the bad-character rule”.

The key to this algorithm is the creation of the “shift table”, which will indicate the number of spaces that we must move, depending on which letter there was NO match. That is why it is called “Bad Character”.

Let’s take, for example, the following text and pattern:

T = ANPANMANAM
P = NNAAMAN

Before looking at the formula, let’s clarify that pattern_length is the total length of pattern (7) and pattern_index is the number assigned to each letter, starting with zero.

Indexes start on cero

To calculate the “shift” of each letter, except for the last character, we use the following formula:
pattern_length - pattern_index - 1

Boyer–Moore–Horspool 3 golden rules

When creating the algorithm we must consider the following:

  • When a letter is repeated, as in the case of N and A, the value is replaced by the new one.
  • When a character is the last one, 2 things might happen:
    A) If it was already defined, it maintains that value and it is not replaced again.
    B) If it was not previously defined, its value is pattern_length, in this case 7.
  • Any letter that is not within the pattern, such as L, has 7 as value. (Maximum Shift Value).

Now, let’s build the “Shift Table” starting with the maximun shift value:

This is the code example to build the table in Elixir:

Part 2 — Algorithms against “lists”

Now that we have the table ready, let’s search! We just have to take into account when a mismatch occurs, and at which letter. We look for that K letter in our table (%{K: V}), if it exists, we shift V spaces to the right.

For further information about the algorithm, check out this video: https://www.youtube.com/watch?v=PHXAOKQk2dw

One of the problems with this algorithm compared to Naïve, is the time it takes to create the shift table. But first, let’s take a look at Naïve’s code.

Because the Boyer-Moore algorithm works best with long text and patterns, we are going to apply 3 benchmark tests, using:

1)A string of 1658 characters, and a pattern of 155 characters.
2)A string of 1658 characters, and a pattern of 7 characters.
3)A string of 32 characters, and a pattern of 7 characters.

Results

Text of 1658 characters, and a pattern of 155 characters

Text of 1658 characters, and a pattern of 7 characters

Text of 32 characters, and a pattern of 7 characters

First conclusion

As we can see, Naive always comes out on top. If you look closely, you will notice that the script uses lists as its data type, so we have an O(n) efficiency problem. To go deeper into these Big O problems and why lists can be a problem, you can read this article.

Part 3 — Algorithms against “binaries”

As we know, community learning has rich potential. In this case, Steve Cohen @icecreamcohen, admin of the official Elixir Discord channel, contributed with an improvement using “binaries” (which Elixir uses as strings) instead of “lists”, and the results were surprising:

Results:

As you can see, the data type used makes a difference, and choosing this algorithm on a data type other than binary (lists, arrays, etc) is a serious mistake.

Excited with the results, we decided to compare the native function to find text, “String.contains?/2”, against the latest implementation of “Boyer–Moore–Horspool”.

The native function dominates our boyer-moore algorithm, but… why?

As we can see in the following native function code, the :binary module’s match function is used to perform the search:

Code of the native contains?/2 function

The secret is that :binary.match is a BIF. And if you are wondering what a BIF is, we are going to resort to the official Erlang definition:

BIFs are functions that are built into Erlang. They usually do tasks that are impossible to program in Erlang. For example, it’s impossible to turn a list into a tuple or to find the current time and date. To perform such an operation, we call a BIF.

In short, we are using Elixir to call native code via Erlang.

To conclude, despite the trials to identify the application of the best possible algorithm, there is no greater optimization than the use of a BIF in the “virtual machine”. As Steve Cohen says regarding the attempts to do otherwise:

a BIF in the VM will (almost always) crush your attempts.

This is so, unless you use Rust, but that’s a whole new chapter.

--

--