Regular Expression Matching

Alexey Chashchegorov
10 min readMar 3, 2024

--

Brief

In this article I would like to make an overview of a hard level task solution published at “Leet code” related to regular expression matching. This task implementation shows the most essential parts of regular expression matching.

Task statement:

Statements of the task are:

input is lowercase string of english alphabet

string size limited to 40 symbols

any english alphabet symbol in regex have to have match at tested string

symbol ‘.’ expects that it can match any symbol

combination of english alphabet symbols with ‘*’ right after expects that any number of base symbols should occur ( even 0 symbols ) will match

combination ‘.*’ expect any number of any symbols can match

Samples:

isMatch(/*regex*/ “a”, /*string*/ “a”) is true

isMatch(/*regex*/ “a”, /*string*/ “b”) is false

isMatch(/*regex*/ “a”, /*string*/ “”) is false

isMatch(/*regex*/ “a*”, /*string*/ “”) is true

isMatch(/*regex*/ “a*”, /*string*/ “a”) is true

isMatch(/*regex*/ “a*”, /*string*/ “b”) is false

isMatch(/*regex*/ “.”, /*string*/ “”) is false

isMatch(/*regex*/ “.”, /*string*/ “a”) is true

isMatch(/*regex*/ “.”, /*string*/ “b”) is true

isMatch(/*regex*/ “.*”, /*string*/ “badasddsazxc”) is true

Next we will look at the approaches to the solution: what are the most significant parts of the task to consider.

Precise task overview:

Number of string to regular expression matching combinations we can do is huge. Attempting to combine them all and test the match at each of the incoming string parts makes the task hard at first glance.

Simplest regular expressions matching

Only way to find the solution is to limit the number of parameters we have. Let’s consider this task with the most primitive subset of regular expressions. Just simplest regular expressions classes will be considered first.

Picture 1. Simplest regular expression matching
  • Only one specific symbol should match for “specific symbol matching”. Implementation of the algorithm is a comparison of the symbols.
  • Only one symbol should match for “any symbol matching”. Implementation of the algorithm is just counting the number of symbols in the string.
  • All symbols should be equal to the first symbol in regular expression to match for “any count specific symbol matching”. Implementation is cycle over string to find that all symbols match first symbol in regular expression.
  • For “any count any symbols” matching all of the strings will be accepted.

Now we have ensured that step one of the task solution can be implemented.

Two parts complex regular expression

Next step is to find a solution for combination of any 2 possible regular expressions together to find the matching of the string. We can consider all of the combinations of simplest patterns at complex expression, but instead let’s combine them smart:

  • let’s consider each of the simplest regular expression as placeholders of the symbols intervals at string
  • this intervals must be continuous
  • this intervals cover all of the string data
Picture 2: application of 2 simplest patterns to the string

This way we can expect that only 4 combinations of string intervals matching of simplest regular expression exists:

  • no matching at both placeholders
  • matching in first placeholder only
  • matching at second placeholder only
  • matching at both placeholders

In the last scenario is string match to regex. In all other scenarios we can expect that dividing of the string to intervals for placeholders can be changed. It means that we can move symbols from one interval to another and vice versa to reach the matching (if it possible).

Picture 3: Moving symbols between intervals to reach the matching

It is possible to make this set of moves of placeholders borders canonical and correct to find the match if it exists.

Initialization of solution

Let’s consider that all of the placeholders have some capacity to match the intervals. For “one symbol regular expression” placeholder will have capacity equal to 1. For ”any count symbols placeholders” capacity is unlimited.
It means that it is fair to divide string into intervals by capacity on the initial stage of the matching algorithm. Let’s put all of the symbols of the tested string to placeholders according to placeholders capacity from left to right.

Picture 4: Initial dividing of the string by intervals for placeholders

As we can see on the picture, the first interval has unlimited capacity. That is why it contains all of the symbols on initialization of the algorithm.

This placement is fare:

  • first matching interval of the string can contains all of the symbols
  • we should to try all string to match first to simplest placeholder
  • matching to next placeholder reachable by moving intervals border right to left over the string
  • this moving will reflect all possible combinations of sharing symbols between patterns
  • in case of first match is always possible to find the solution
  • in case of all moves will not get a match — we can be sure that match is not possible

So to make algorithm solid at this state we need to:

  • make initial distribution of string intervals depends on simplest regular expressions capacity
  • make moving of the border of the intervals right to left at cycle to detect the match
  • algorithm should be prolonged until match not found or moves of the intervals borders possible

At this point we have step 2 completed: for just 2 intervals we can implement the solution of the task.

Three parts complex regular expression

Now let’s look at dividing the string to 3 placeholders. It is possible to apply the algorithm for 2 placeholders for 3 intervals with small modifications. Next it is possible to increase the number of placeholders to any count.

As we can see on Picture 5 the first move is from the first placeholder to the next one — left to right. It is the most atomic action we can do, we have no other actions to be made. But why do we select the first placeholder as the source of the move?

  • first placeholder have symbols to move out
  • next placeholder have capacity to accept the symbol from first pattern

We can define this criterion as “can move”.

On the next step we need to define a placeholder to be selected as a source of move. If most right placeholder have a symbol and this symbol does not match the placeholder — then addition of any symbol will not fix matching of this placeholder. This symbol can not belong to this pattern. It should be moved.

  • not matching most right placeholder that align “can move” criterion should be source of move
Picture 5: Moving symbols on 3 placeholders, part 1

At the second step “source of move placeholder” will be selected by strategy defined above. But if we try to use this strategy for the third step — we will find that no more steps are possible. All placeholders point to the string parts that match to simplest regular expression or to string without symbols inside.
That is why we need to define additional criteria to detect “source of move pattern” at the next step that the red arrow at Picture 5 shows. It is the special criteria for placeholders that have match.

Combined placeholder

Let’s consider first and second placeholders as a combined placeholder. This way we will find that the only option to have a match will be the transition symbol from this combined placeholder to the next placeholder. This is obvious because we have a match at the combined placeholder and do not have it at the next placeholder. It is the same way to act for two placeholders we considered.

But why do we not care to get second and third placeholders as combined placeholders? Without internal movement in this complex placeholder we can not reach its matching state by moving symbols from outside. That is why we should have equal state placeholders as complex placeholders to correctly use actions from using two placeholders. This means that new criteria should consider this complex placeholder at the beginning of the placeholders collection.

  • Most right matching placeholder with no unmatched predecessors placeholders need to be the source of move at second strategy.

Algorithm definition

Now we have all essential parts together and we can define the algorithms internals:

  • Algorithm based on segregating complex regular expression to collection of simplest regular expressions that contained in placeholders
  • Each placeholder stores simplest regular expression and points the string interval of the full string to match
  • During the algorithm atomic action is to move symbols from left to right intervals related to current and next placeholders
  • To select source of move placeholder two strategies expected to be applied
  • If first strategy detects the source of move placeholder — it will be used
  • First strategy, placeholder to be selected need to:
  • be most right placeholder with symbols
  • points to string interval that not match the simplests regular expression at placeholder
  • have capacity at next placeholder to collect symbol
  • If second strategy detects the source of move placeholder — it will be used instead of first strategy selection
  • Second strategy, placeholder to be selected need to:
  • be most right placeholder with symbols
  • points to string interval that match the simplests regular expression at placeholder
  • have capacity at next placeholder to collect symbol
  • no unmatched predecessors placeholders
  • Algorithm should work until full match of all string intervals and simplest regular expressions in placeholders detected or no more moves possible

Implementation details:

We will talk about implementation of the task in modern C++. There are several points to discuss about the implementation:

  • runtime complexity of the solution
  • ability to improve solution runtime
  • memory footprint

Solution runtime complexity

Let’s start from runtime complexity. Each step we moving border of placeholders from one symbol to another; it passes the symbols one by one. The most symbols the count border can pass is string length. Last border can not be moved at all. That is why worst case complexity based on moves can be defined as O(s²).

This complexity is not final. At each step we need to detect the most right unmatched and matched symbols by strategy one and two. Worst case complexity of this detection will be linear O(r).

So total complexity of the solution will be O(s² * r).

Improving solution runtime complexity

But we can make runtime complexity more effective and reach O(s² * log(r)).

Picture 6: fast detection of placeholders for the strategies

Every step next move source placeholder needs to be found. Worst case we need to traverse over all of the placeholders to find it. But we can make it better. We can create ordered binary tree of:

  • unmatched placeholders
  • matched placeholders
  • not matched placeholders that can move
  • matched placeholders that can move and have not matched next placeholder

This way the runtime complexity of every insertion and search in an ordered binary tree will be O(log(r)).

Detection of move source placeholder for first strategy will contain:

  • detection of first placeholder in not matched placeholders that can move ( O(log(r) )

Detection of move source for second strategy will contain:

  • detection of first placeholder in not matched placeholders ( O(log(r) )
  • detection of lower bound for matched placeholder from first step in matched placeholders that can move and have next placeholder unmatched ( O(log(r) )
  • detection of previous placeholder for previous step ( O(1) )

This is not all about reaching the runtime optimization goal: We need to build these ordered binary trees and have the ability to rebuild them fast. At every move on algorithm comes possible to change several parameters:

  • matching state of source of move placeholder
  • ability to move at source of move placeholder
  • ability to move at previous placeholder
  • matching state of next placeholder

We have all of these states at the ordered sets mentioned later. That is why we need to process maximum:

  • 4 operations of erasing of placeholders from ordered set
  • 4 operations to put it in some other ordered sets

In both cases we will have complexity for each move equal O(log(r)).

Simplest regular expression matching improvement

One more area that can increase the runtime is matching of the simplest regular expression in placeholder and string interval.

Just imagine if we will recount the simplest regular expression matching symbol by symbol at each algorithm step. Complexity of each step of this solution will be linear O(s). This is not an option — we can make it O(1) by using unmatched symbols count and detecting which symbol comes in/out to the string interval that placeholder points.

Picture 7: fast detection of simplest regex matching.
  • On each symbol coming we will
  • increase not matched symbols count on symbol itself not matched
  • decrease not matched symbols count on symbol itself matched
  • On not matched symbols count equal to 0 we can detect that simplest regex matched the string

Memory footprint

Now let’s talk about memory footprint. It should be O(s+r) best case. It is a reachable way if we can not hold all of the intervals and simplest regular expressions itself — but just have reference to some parts of incoming data.

At C++ most appropriate way for this is using std::string_view classes to hold regular expression and string interval references instead of storing values at placeholders as string members. Even in case of runtime optimization of next move source placeholder detection for both strategies of algorithm we will have memory footprint complexity equal to O(s+r). Because ordered binary tree space complexity is O(r).

Conclusion:

Algorithms described in the article have:

  • O(s² * log(r)) runtime complexity
  • O(r+s) memory complexity
  • can be easily implemented at C++
  • can be used for matching of string to regular expressions with a limited amount of simplest patterns.

--

--