Web development and computer science can often seem worlds apart. But every so often, a practical problem connects with an interesting theoretical issue. This post is the story of a bug fix, and how it led us on a detour through the theory of ordering.
The set up for the story is pretty simple. We built a Kanban board to allow Multiverse employees to manage the progress of apprenticeship applications. Each board had a number of columns representing stages in the application process. Cards within each column could be reordered by dragging.
Shortly after the feature was deployed, users reported a bug: the order of cards within a column would sometimes change at random.
At Multiverse we use a Phoenix+Elm stack. The Kanban board is implemented as a single page application in Elm. The bug was caused by our decision to sort cards using Elm’s native 32-bit signed integer type. The first card to be inserted in a column was assigned the index (2³¹-1)/2 (where / indicates integer division). A card inserted before the first card was assigned the index (2³¹-1)/2/2. A card inserted after the first card was assigned the index ((2³¹-1)/2 + (2³¹-1))/2. A card inserted between two other cards with indices i and j was assigned the index (i + j)/2.
With a limited number of indices, repeated reordering of cards quickly led to multiple cards being assigned the same index. Consider the scenario where a card is repeatedly added and then moved into the first position of a column. How many moves before the new first card receives the same index as the old first card? Say we start with one card at index (2³¹-1)/2. The question is then “How many times do we have to divide (2³¹-1)/2 by 2 to reach zero?” In binary (2³¹-1)/2 is a sequence of 30 ones. For every integer division by two, we lose one of the ones. So it takes thirty divisions to get to zero. At this point, every card moved into the first position is assigned the index zero.
We’d foolishly assumed that cards would never be reordered enough times to run into the degenerate case. But it turns out that our users really like reordering cards!
Are we dense?
To fix the bug, we needed a system of indices that ensured that there was always another index available between any two indices. In other words, we needed a dense ordering of indices.
Our integer indices could not be densely ordered for two reasons. First of all, the ‘less than’ relation doesn’t yield a dense order over the integers (as, for example, there is no integer between 1 and 2). More fundamentally, the use of a fixed precision representation restricted us to a finite number of indices, making it impossible to define any dense order.
We could have switched to representing indices as arbitrary precision rationals. The ‘less than’ relation yields a dense order over the rationals, and it is easy to compute a rational that lies between any two other rationals. However, Elm lacks a native rational type, and if possible we wanted to avoid non-trivial conversions between Elm values and values stored in the database.
What about strings? Strings can be of any size, and have a pre-defined lexicographic ordering. This ordering is tantalisingly close to being dense, in the sense that there are an infinite number of pairs of strings that have an infinite number of strings between them. Unfortunately, there are also an infinite number of pairs of adjacent strings. Assuming A to be the first character of the alphabet, examples of these include A and AA, ABABBA and ABABBAA, BB and BBA. These examples illustrate the following generalisation:
- Two strings S and T are lexicographically adjacent iff T = SA or S = TA.
If we could find a way to avoid generating adjacent strings, it seemed that it should be possible to sort cards by the lexicographic order of their indices.
The key observation is the following. The set of strings can be decomposed into subsets such that
- none of the subsets is densely ordered by lexicographic precedence; and
- if one string is chosen from each of the subsets, the resulting set of strings is densely ordered by lexicographic precedence.
Each subset is based on a string S that does not end with A, and contains all strings of the form SA*. With subsets running left to right, and the strings in each subset running from top to bottom, the picture is as follows. It’s important to understand that there is an infinite number of columns (represented by ellipses) between each pair of columns shown adjacent in the diagram.
'' ... AAB ... AB ...
A ... AABA ... ABA ...
AA ... AABAA ... ABAA ...
AAA ... AABAAA ... ABAAA ...
..... ... ....... ... ... ...
A densely ordered index set can be derived by choosing any one string from each of the columns. To keep indices as short as possible, it makes sense to take the shortest string from each column. Setting aside the first column, these strings are easily identifiable as the strings that do not end in A.
We now have our index set. All that remains is to find a means of deriving an intermediate index from two existing indices.
Deriving intermediate indices
The following algorithm takes as its input two strings that do not end with the first character of the alphabet. It derives a third string lying between these two strings that also does not end with the first character of the alphabet.
Set the output to the empty string
Pad the shorter string to the length of the longer string by repeatedly appending the first character of the alphabet.
From left to right, loop through each pair of characters (c, d) from the lesser and greater strings respectively:
- Let e = max(c, (c + d)/2) (integer division).
- Append e to the output.
- If e ≠ c, terminate the loop early.
If the preceding loop did not terminate early, then append to the output the middle character of the alphabet (or any character other than the first).
If we start out with indices that don’t end in the first character of the alphabet, and derive subsequent indices using the algorithm, then we never end up with two indices that are lexicographically adjacent.
How can we be sure that the algorithm never outputs a string that’s adjacent to one or both of its inputs? First consider the case where the loop terminates early. This happens on the first occasion that e ≠ c, in which case, given the expression for e, c < e < d. Previous iterations of the loop have duplicated the first few characters of the lesser input string, and e lies between c and d, so the output string lies between the two input strings. Moreover, as c < e, the last character appended to the output cannot be the first character of the alphabet.
Now take the case where the loop fails to terminate early. In this case we have duplicated the (possibly padded) lesser string in the output. In this case, given that nether input string ends with the first character of the alphabet, appending any character yields a string between the two input strings. We can therefore choose to append a character other than the first. (If input strings could end in A, then inputs such as e.g. A and AA would become identical after padding, and the output string would be greater than both.)
In our code, we use a slightly more complicated padding logic to reduce the length of the output strings in some cases. The lesser string is padded using the first character of the alphabet whereas the greater string is padded using the last character of the alphabet. This has the effect that the output for e.g. YM and Z is YS, rather than the longer YMM.
How many moves?
Our Elm code uses strings over the alphabet A–Z. To date, the longest index in the database is
A quick simulation shows that after the first card is added to a column, it takes 60 moves of a card into the first position to generate this index. (So if Elm had native 64-bit integers, we still wouldn’t have noticed the bug!)
If you need a densely ordered index set, your first thought would probably be to reach for the rationals. However, not all programming languages or databases provide arbitrary precision rational types, and arithmetic on arbitrary precision rationals can often be quite memory intensive and slow. Consider strings instead. With a careful choice of initial indices and the right algorithm for deriving intermediate indices, the default lexicographic order suffices.
You may have noticed that we haven’t shown rigorously that the set of strings not ending in A is densely ordered by lexicographic precedence. We’ve made an attempt to work through this in more detail.
Join the team
Multiverse’s mission is to create a diverse group of future leaders. The products we’re building are designed to connect apprenticeship candidates to employers and to set them up for success throughout their qualification. Our social mission is at the heart of our business, so if doing good is as important as building a commercial business to you, you will find Multiverse incredibly rewarding.
To view our latest open roles, visit our careers page.