Regex bytecode interpreter: looking for needles in session haystacks

Vladimir Kazanov
Bumble Tech

--

Here at Badoo, 17 billion events, millions of user sessions and a huge number of virtual dates take place every day. Each event is carefully anonymised and stored in relational databases for subsequent analysis using SQL and SQL-like languages.

Contemporary distributed databases containing dozens of terabytes of data are a true miracle of engineering genius. However, SQL, as a reflection of relational algebra, in most of its standard versions doesn’t yet allow you to formulate queries in terms of ordered tuples.

In this final article in a series of articles on virtual machines (Part 1: Home-grown bytecode interpreters; Part 2: Optimising bytecode interpreters), I am going to explore an alternative approach to searching for sessions of interest, namely a regular expression engine (‘Piglet matcher’) defined by sequences of events.

The virtual machine, bytecode and compiler all come free-of-charge!

Concerning events and sessions

Let’s say we have a data depository allowing us to quickly retrieve events that constitute user sessions.

We want to find sessions based on queries along the following lines:

  • “count the total number of sessions where there is a specified subsequence of events”;
  • “find the parts of the session described by the specified template”;
  • “return the part of the session which occurred after the specified template”;
  • or “count how many sessions reached certain parts of the template.”

This is useful for all sorts of types of analysis: searching for suspicious sessions, funnel analysis etc.

The desired subsequences need to be described in some way. In its most simple form this task is similar to searching for a substring in the text; but a more powerful tool, namely regular expressions, is preferable. Contemporary implementations of regular expression engines mostly use (yes, you’ve guessed it!) virtual machines.

Creating small-scale virtual machines that match user sessions against regular expressions is what we’re going to do below. But let’s start by clarifying some definitions.

An event consists of an event type, time, context and a set of attributes specific to each of the relevant event types.

The type and context of each event are integers from predetermined lists. If everything is clear as far as event type goes, then the context will, for example, be the number of the screen on which the specified event occurred.

The attribute of an event is also an integer, the meaning of which is determined by the event type. A given event might not have any attributes at all, or it may have several.

A session is a sequence of events, sorted according to time.

Right, let’s get on with it! It’s all quiet on the Western front so let’s make a move.

Basic event sequence comparison

What distinguishes this virtual machine is that it’s passive with respect to incoming events. We don’t want to retain the whole session in the memory and let the virtual machine transition on its own from one event to another. Instead, we’re going to send events from the session to the virtual machine one-by-one.

Let’s write interface functions first:

matcher *matcher_create(uint8_t *bytecode);match_result matcher_accept(matcher *m, uint32_t event);void matcher_destroy(matcher *matcher);

If everything is clear in terms of the matcher_create and matcher_destroy functions, then it is worth commenting on matcher_accept. The matcher_accept function receives an instance of a virtual machine and the next event (32 bit: 16 bit on the event type and 16 bit on the context), and returns a code which sets out what the user code needs to do next:

typedef enum match_result {
// request the next event
MATCH_NEXT,
// regex matched, done with the session
MATCH_OK,
// regex did not match, done with the sessionне подавать
MATCH_FAIL,
// internal error
MATCH_ERROR,
} match_result;

Opcodes for the virtual machine:

typedef enum matcher_opcode {
//zero opcode, stops execution with an error
OP_ABORT,
// check current event’s type (arg — the type to check)
OP_NAME,
// check current event context (arg — the context to check)
OP_SCREEN,
// request a next event
OP_NEXT,
// successfully matched the session
OP_MATCH,
} matcher_opcode;

Main loop of the virtual machine:

match_result matcher_accept(matcher *m, uint32_t next_event)
{
#define NEXT_OP() \
(*m->ip++)
#define NEXT_ARG() \
((void)(m->ip += 2), (m->ip[-2] << 8) + m->ip[-1])
for (;;) {
uint8_t instruction = NEXT_OP();
switch (instruction) {
case OP_ABORT:{
return MATCH_ERROR;
}
case OP_NAME:{
uint16_t name = NEXT_ARG();
if (event_name(next_event) != name)
return MATCH_FAIL;
break;
}
case OP_SCREEN:{
uint16_t screen = NEXT_ARG();
if (event_screen(next_event) != screen)
return MATCH_FAIL;
break;
}
case OP_NEXT:{
return MATCH_NEXT;
}
case OP_MATCH:{
return MATCH_OK;
}
default:{
return MATCH_ERROR;
}
}
}
#undef NEXT_OP
#undef NEXT_ARG
}

In this simple version our virtual machine coherently compares the template described by the bytecode with incoming events. In this form it’s simply a rather long-winded comparison of prefixes relating to two strings: the desired template and the incoming string.

Prefixes are one thing, but what we need to be able to identify are the desired templates not only at the start but at any point in the session. A naïve solution would be to re-run comparisons for each session event. However, this would mean that each event would be viewed multiple times and a huge amount of algorithmic computation would also be required.

An example from the first article in this series uses backtracking to mimic re-running comparisons. The code in the example, of course, looks neater than the one given here, but that doesn’t mean the problem has gone away: each of the events has to be checked multiple times.

And that’s not something you can live with.

Me, myself and me yet again

Let’s revisit what we’re trying to do: we need to compare the template with incoming events, starting a new comparison in the case of each of the events. So why not do just that? Let the virtual machine go through the incoming events following several threads!

After all, this is what we need in order to create a new entity: a thread. Each thread stores a single pointer for the instruction in-hand:

typedef struct matcher_thread {
uint8_t *ip;
} matcher_thread;

Naturally, we’re not going to store an explicit pointer within the virtual machine itself. It’s going to be replaced by two lists of threads (more on this subject a little later):

typedef struct matcher {
uint8_t *bytecode;
/* Threads to be processed using the current event */
matcher_thread current_threads[MAX_THREAD_NUM];
uint8_t current_thread_num;
/* Threads to be processed using the event to follow */
matcher_thread next_threads[MAX_THREAD_NUM];
uint8_t next_thread_num;
} matcher;

Here is the updated main loop:

match_result matcher_accept(matcher *m, uint32_t next_event)
{
#define NEXT_OP(thread) \
(*(thread).ip++)
#define NEXT_ARG(thread) \
((void)((thread).ip += 2), ((thread).ip[-2] << 8) + (thread).ip[-1])
/* Every new event launches a new thread from the first bytecode instruction */
add_current_thread(m, initial_thread(m));
// On every event we check every thread
for (size_t thread_i = 0; thread_i < m->current_thread_num; thread_i++ ) {
matcher_thread current_thread = m->current_threads[thread_i];
bool thread_done = false;
while (!thread_done) {
uint8_t instruction = NEXT_OP(current_thread);
switch (instruction) {
case OP_ABORT:{
return MATCH_ERROR;
}
case OP_NAME:{
uint16_t name = NEXT_ARG(current_thread);
// if the current event does not correspond to a template described by the thread, the thread will just not make into the next_threads list
if (event_name(next_event) != name)
thread_done = true;
break;
}
case OP_SCREEN:{
uint16_t screen = NEXT_ARG(current_thread);
if (event_screen(next_event) != screen)
thread_done = true;
break;
}
case OP_NEXT:{
// a thread requested the next event, i.e. it must make it to the next_threads list
// next_threads
add_next_thread(m, current_thread);
thread_done = true;
break;
}
case OP_MATCH:{
return MATCH_OK;
}
default:{
return MATCH_ERROR;
}
}
}
}
/* Swap the current and next thread lists */
swap_current_and_next(m);
return MATCH_NEXT;
#undef NEXT_OP
#undef NEXT_ARG
}

For each event obtained, we first add a new thread to the current_threads list which checks the template right from the start. Then we start to work our way through the current_threads list, carrying out the instructions following the pointer for each of the threads.

If we come across the instruction NEXT, then the thread is moved to the next_threads list, that is to say it waits to receive the next event.

Should the thread template not match the event obtained, then the thread in question is simply not added to the next_threads list.

The MATCH instruction immediately exits the function, sending notification by means of a return code that the template matches the session.

Having worked our way through the list of threads, the current threads and next threads lists change places.

That is basically it. You could say that we are literally doing what we wanted to do: we are simultaneously checking several templates, running one new comparison process for each of the session events.

Multiple personalities and branching in templates

It is of course helpful to search for a template which describes the linear sequence of events, but we are looking to obtain fully-fledged regular expressions. And threads which we created at the previous stage will be of use to us here as well.

Let’s say that we want to obtain a sequence from two or three events of interest, something along the lines of a regular expression based on the strings: “a?bc”. In this sequence the symbol ‘a’ is optional.

How can this be expressed in bytecode? Easy! We can run two threads: one for each case, i.e. one with the symbol ‘a’ and one without. To this end we introduce an additional instruction (in the form SPLIT addr1, addr2), which runs two threads from the addresses given. Besides SPLIT, we will also need JUMP, which simply continues to run from the instruction specified in the direct argument:

typedef enum matcher_opcode {
OP_ABORT,
OP_NAME,
OP_SCREEN,
OP_NEXT,
// jump to an address specified
OP_JUMP,
// launch 2 new threads from both given addresses
OP_SPLIT,
OP_MATCH,
// not an op
OP_NUMBER_OF_OPS,
} matcher_opcode;

The loop itself and the other instructions don’t change, we simply introduce two new handlers:

// ...
case OP_JUMP:{
uint16_t offset = NEXT_ARG(current_thread);
add_current_thread(m, create_thread(m, offset));
break;
}
case OP_SPLIT:{
uint16_t left_offset = NEXT_ARG(current_thread);
uint16_t right_offset = NEXT_ARG(current_thread);
add_current_thread(m, create_thread(m, left_offset));
add_current_thread(m, create_thread(m, right_offset));
break;
}
// ...

Notice that the instructions add threads to the current list, that is to say they continue their work in the context of the current event. The thread, depending on which branching occurred, doesn’t get added to the list of subsequent threads.

The most surprising thing with this virtual machine for regular expressions is the fact that our threads and this couple of instructions are sufficient for expressing almost all generally accepted constructions when comparing strings.

Regular expressions defined on events

Now that we have a suitable virtual machine and instruments for it, we can address the actual syntax for our regular expressions.

Manually writing opcodes for more serious programs quickly becomes exhausting. Last time I didn’t do a fully-fledged parser, but the user true-grue based on the example of the mini-language PigletC demonstrated the capabilities of their raddsl library. I was so impressed by how succinct the code was that, with the help of raddsl, I wrote a small, 100–200 lines of code, regular expressions compiler in Python. The compiler and instructions on its use are available on GitHub. The result of the work of the compiler in the assembler language is understood by a utility which reads two files (a program for the virtual machine and a list of session events for testing purposes).

To start with we will limit ourselves to the type and context of the event. We designate the event type with a single digit: if we need to specify the context, we do so using a colon. Here is a very simple example:

> python regexp/regexp.py “13” # event type 13
NEXT
NAME 13
MATCH

And now for an example with context:

python regexp/regexp.py “13:12” # event type 13, context 12
NEXT
NAME 13
SCREEN 12
MATCH

The subsequent events need to be separated out in some way (for example with spaces):

> python regexp/regexp.py “13 11 10:9” 08:40:52
NEXT
NAME 13
NEXT
NAME 11
NEXT
NAME 10
SCREEN 9
MATCH

A more interesting template:

> python regexp/regexp.py “12|13”
SPLIT L0 L1
L0:
NEXT
NAME 12
JUMP L2
L1:
NEXT
NAME 13
L2:
MATCH

Notice the strings that end with a colon. These are labels. The SPLIT instruction creates two threads which continue running from labels L0 and L1, while the JUMP instruction at the end of the first branch that is being executed simply moves to the end of the branching.

You can alternate between longer chains of expressions, grouping subsequences together in brackets:

> python regexp/regexp.py “(1 2 3)|4”
SPLIT L0 L1
L0:
NEXT
NAME 1
NEXT
NAME 2
NEXT
NAME 3
JUMP L2
L1:
NEXT
NAME 4
L2:
MATCH

An unspecified event is marked with a dot:

> python regexp/regexp.py “. 1”
NEXT
NEXT
NAME 1
MATCH

If we want to indicate that a subsequence is optional, we follow it with a question mark:

> python regexp/regexp.py “1 2 3? 4”
NEXT
NAME 1
NEXT
NAME 2
SPLIT L0 L1
L0:
NEXT
NAME 3
L1:
NEXT
NAME 4
MATCH

It goes without saying that ordinary, multiple repetitions in regular expressions (plus or asterisk) are also supported:

> python regexp/regexp.py “1+ 2”
L0:
NEXT
NAME 1
SPLIT L0 L1
L1:
NEXT
NAME 2
MATCH

In this case we simply run the SPLIT instruction multiple times, launching new threads for each iteration.

Similarly in the case of the asterisk:

> python regexp/regexp.py “1* 2”
L0:
SPLIT L1 L2
L1:
NEXT
NAME 1
JUMP L0
L2:
NEXT
NAME 2
MATCH

Future prospects

Other extensions of the virtual machine described may also come in useful.

For example, it can be easily extended with an event attribute test. In the case of a real system I envisage using syntax along the lines of “1:2{3:4, 5:>3}”, which means the following: event 1 in context 2 with attribute 3 of value 4, and attribute value 5 greater than 3. In this case attributes can simply be sent to the matcher_accept function using an array.

If we also send the time interval between events to matcher_accept, then we can add syntax to the language which allows time to pass between events: “1 mindelta(120) 2” — which will mean the following: event 1, then a minimal interval of 120 seconds, then event 2. In combination with storing subsequences this allows us to gather information regarding user behaviour between the two subsequences of events.

Other useful things which are relatively easy to add: storing subsequences of regular expressions, differentiating between ‘greedy’ and ordinary operators, asterisk and plus etc. Theoretically speaking our virtual machine represents a nondeterministic finite automaton, i.e. that makes all these things easy to implement.

Conclusion

Our system is being developed for fast user interfaces, which is why the engine for storing sessions has been self-written and optimised particularly with a view to combing through all the sessions. All the billions of events, broken down into sessions, are tested for compliance with templates within seconds on a single server.

If, in your case, speed is less critical, then you can set up a similar system in the form of an extension to some more standard system for storing data along the lines of a traditional relational database or distributed file system.

On that point, the latest versions of the SQL standard already have a similar capability to the one described in the article and certain databases (Oracle and Vertica) have already implemented it. In turn Yandex ClickHouse implements its own SQL-like language, and it also has similar features.

Digressing from the subject of events and regular expressions, I would like to reiterate that virtual machines have an application much wider than might seem the case at first glance. This technique is suitable and is widely applicable in all cases where there is a need to clearly separate primitives, which the system engine understands, and a user-facing subsystem, for example, some kind of DSL or programming language.

This brings to close this series of articles on the subject of various applications or bytecode interpreters and virtual machines. I hope that you have enjoyed reading the articles and, it goes without saying, I will be happy to answer any questions you may have on the topic.

Informal bibliography

Bytecode interpreters for programming languages represent is a specialised interest and there is not much literature available on this subject. Personally, I enjoyed Iain Craig entitled “Virtual Machines” even though it doesn’t so much describe the implementation of interpreters, as describe abstract machines, mathematical models underlying various programming languages.

For a broader perspective there is another book on the subject of virtual machines entitled “Virtual Machines: Versatile Platforms for Systems and Processes” by James Smith. This is an introduction to various areas of application of virtualisation, including virtualisation of languages, processes and overall computer architecture.

The practical aspects of developing engines for regular expressions are rarely discussed in popular literature on compilers. ‘Piglet Matcher’ and the example from the first article are based on ideas from a wonderful series of articles by Russ Cox, one of the developers of the Google RE2 engine.

Regular expressions theory is expounded in all academic coursebooks on the subject of compilers. It is customary to refer to the famous ‘Dragon Book’, but I would recommend starting with the links given above.

When working on this article, for the first time I used an interesting system called raddsl for developing compilers quickly in Python. It was written by the user true-grue (thanks, Piotr!). It’s worth having a look at it if you’re looking to prototype language or to develop some DSL quickly.

--

--