Understanding RegEx using deep learning

Shubhadeep Roychowdhury
CodistAI
Jun 30, 2019 · 5 min read
(Cover image source: https://www.sitepoint.com/demystifying-regex-with-practical-examples/)

I was staring at my screen and there, in green over my black VSCode background, was this — `^(http|https|ftp):[\/]{2}([a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,4})(:[0-9]+)?\/?([a-zA-Z0-9\-\._\?\,\'\/\\\+&%\$#\=~]*)`. I could tell it was, maybe, talking about some kind of URL pattern. But what was the meaning of it? I scratched my head. Once, twice, many times. No answer. StackOverflow was not going to be very useful, because I had a RegEx and wanted to understand its intent, not the other way round. What was I to do? I had little choice apart from diving into this cryptic line and trying to understand it piece by piece. “Urgh,” I mumbled, “I wish there were an easier way to do this. I have a deadline coming up. I really don’t have the time.”

This is my story, over and over again. How many times have I cursed a previous developer for leaving behind a series of mysterious legacy RegExes? I can’t even count. As Jamie Zawinski truly said,

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

This is how we came up with the idea of CodistAI. We wanted to help developers get into source code faster. We leverage Machine Learning and traditional Programming Language Theory, experimenting with several smaller problems and training models on them.

Understandably, one of the first problems we are tackling is generating the intent of a RegEx from the expression itself, without any prior domain knowledge.

Today we are releasing an early version of this algorithm as a Google Colab notebook, and this post tells the story of building it. If you want to stay in the loop about the exciting science we are working on, you can let us know here.

Ok, without further ado, let’s dive into the algorithm.

Well, not really. First we need to look at the data. Data is one of the most important parts of building a deep learning model. So, let’s begin.

The data

We are using the dataset provided with a paper published in 2016 by Nicholas Locascio et al. The goal of that paper was exactly the opposite of what we are doing here: going from intent to RegEx. This line of research has a long history, and we strongly suggest you go through the original paper to learn more about it.

The data and the code from that paper (partly in Python and Lua) can be found in the accompanying GitHub repo.

After obtaining the data and doing some pre-processing on it (you can check out the pre-processing code in the Colab mentioned above), if we pick one random {X, Y} (Source, Target) tuple from the final result, it may look like this —

RegEx => \ b ( [ <CAP> ] ) & ( [ <VOW> ] ) \ b
Intent => lines that have words containing a capital letter as well as a vowel

Here we have replaced the usual RegEx character classes, like A-Z or AEIOUaeiou, with <CAP> or <VOW>. In fact, we replace all such RegEx expressions with special symbols, to normalize the input for the model. We also tokenize the regex so that later, when a data reader reads it, each symbol is easy to separate — and that way we (hopefully) end up with a smaller vocabulary.

This pre-processing largely follows the pre-processing from the original paper.
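To make the idea concrete, here is a minimal sketch of this normalization and tokenization step. It is not the exact code from the Colab notebook — the character-class mapping, the function names and the example regex are illustrative only.

import re

# Hypothetical mapping from verbose character classes to placeholder symbols.
# The real notebook uses its own (larger) mapping; this is just for illustration.
CLASS_MAP = {
    "A-Za-z": "<LET>",
    "A-Z": "<CAP>",
    "a-z": "<LOW>",
    "0-9": "<NUM>",
    "AEIOUaeiou": "<VOW>",
}

def normalize_regex(regex: str) -> str:
    """Replace verbose character classes with compact placeholder symbols."""
    for pattern, symbol in CLASS_MAP.items():
        regex = regex.replace(pattern, symbol)
    return regex

def tokenize_regex(regex: str) -> list:
    """Split a regex into single-symbol tokens, keeping <...> placeholders whole."""
    return re.findall(r"<[A-Z]+>|\\.|.", regex)

example = r"\b([A-Z])&([AEIOUaeiou])\b"
print(" ".join(tokenize_regex(normalize_regex(example))))
# -> \b ( [ <CAP> ] ) & ( [ <VOW> ] ) \b
# (the actual notebook also splits escape sequences like \b into two tokens,
#  which is why the tuple above shows "\ b")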

The Model

Before I jump into the details, I would just like to show you the final model that we used in the Colab.

SimpleSeq2Seq(
  (_source_embedder): BasicTextFieldEmbedder(
    (token_embedder_tokens): Embedding()
  )
  (_encoder): PytorchSeq2SeqWrapper(
    (_module): LSTM(256, 256, num_layers=3, batch_first=True, dropout=0.25)
  )
  (_attention): LinearAttention(
    (_activation): Tanh()
  )
  (_target_embedder): Embedding()
  (_decoder_cell): LSTMCell(512, 256)
  (_output_projection_layer): Linear(in_features=256, out_features=230, bias=True)
)

As you can see, it is a Seq2Seq network with one source and one target embedding, a three-layer LSTM as the encoder and a single-layer LSTM cell as the decoder. We also use the classic Bahdanau attention mechanism in the decoder.

If all of the above sounds a bit unfamiliar, I would suggest taking a look at the official PyTorch tutorial on Seq2Seq, or, if you really want to dive deep, I cannot recommend enough the open course from Stanford, taught by none other than Christopher Manning himself!

One last thing: we are heavy users of AllenNLP at CodistAI. I don’t even want to think about the time when this library was not around and I was doing all the messy pre-processing of text data by hand. What a horror! So a big shout-out to the whole team behind AllenNLP. You guys rock!
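For the curious, here is a rough sketch of how a model with the shape printed above could be put together, assuming the AllenNLP 0.8-era API that was current at the time of writing. The vocabulary handling, `max_decoding_steps`, `beam_size` and the target namespace are assumptions for illustration, not the exact Colab code.

import torch
from allennlp.data.vocabulary import Vocabulary
from allennlp.models.encoder_decoders.simple_seq2seq import SimpleSeq2Seq
from allennlp.modules.attention import LinearAttention
from allennlp.modules.seq2seq_encoders import PytorchSeq2SeqWrapper
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
from allennlp.modules.token_embedders import Embedding
from allennlp.nn.activations import Activation

EMBEDDING_DIM = HIDDEN_DIM = 256  # matches the printout above

# Placeholder vocabulary; in practice it is built from the pre-processed
# training instances (e.g. with Vocabulary.from_instances).
vocab = Vocabulary()

source_embedder = BasicTextFieldEmbedder(
    {"tokens": Embedding(vocab.get_vocab_size("tokens"), EMBEDDING_DIM)}
)
encoder = PytorchSeq2SeqWrapper(
    torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, num_layers=3,
                  batch_first=True, dropout=0.25)
)
attention = LinearAttention(HIDDEN_DIM, HIDDEN_DIM,
                            activation=Activation.by_name("tanh")())

model = SimpleSeq2Seq(
    vocab,
    source_embedder=source_embedder,
    encoder=encoder,
    max_decoding_steps=50,              # assumed upper bound on intent length
    target_embedding_dim=EMBEDDING_DIM,
    target_namespace="target_tokens",   # assumed namespace for intent tokens
    attention=attention,
    beam_size=8,
)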

The Training

We trained this model for up to 100 epochs with a patience of 10 (which means the AllenNLP trainer automatically stops training if the validation loss has not improved for 10 consecutive epochs). In our case, the model reached its peak at the 23rd epoch, and consequently the training stopped after 32 epochs.
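In code, the training setup looks roughly like the sketch below, again assuming the AllenNLP 0.8-era Trainer and BucketIterator; the batch size, optimizer and dataset variables are illustrative and come from the earlier steps, not from this snippet.

import torch
from allennlp.data.iterators import BucketIterator
from allennlp.training.trainer import Trainer

# `model`, `vocab`, `train_dataset` and `validation_dataset` are assumed to
# come from the previous steps (reading and indexing the pre-processed data).
iterator = BucketIterator(batch_size=32,
                          sorting_keys=[("source_tokens", "num_tokens")])
iterator.index_with(vocab)

trainer = Trainer(
    model=model,
    optimizer=torch.optim.Adam(model.parameters(), lr=1e-3),
    iterator=iterator,
    train_dataset=train_dataset,
    validation_dataset=validation_dataset,
    num_epochs=100,   # upper bound on the number of epochs
    patience=10,      # stop early if validation loss stalls for 10 epochs
)
trainer.train()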

The result

Now that the training is done, it is time for some predictions. We picked the first 30 examples from the validation set and ran predictions on them. The results look really promising, and, if I dare say, sometimes the model’s predictions are actually better than the gold standard. In some cases, though, the predictions are not so good. As with any deep-learning-based product, we have a lot more to do before we reach an even better result.
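Getting a prediction out of the trained model can look something like the sketch below, using AllenNLP’s Seq2Seq predictor. The reader configuration shown here (default settings) is an assumption; the Colab notebook builds its own reader to match the tokenized regex format.

from allennlp.data.dataset_readers import Seq2SeqDatasetReader
from allennlp.predictors.simple_seq2seq import SimpleSeq2SeqPredictor

# `model` is the trained SimpleSeq2Seq from above; the reader should be the
# same one used to build the training instances (defaults shown here are an
# assumption, not the exact Colab configuration).
reader = Seq2SeqDatasetReader()
predictor = SimpleSeq2SeqPredictor(model, reader)

tokenized_regex = r"\ b ( [ <CAP> ] ) & ( [ <VOW> ] ) \ b"
output = predictor.predict(tokenized_regex)
print(" ".join(output["predicted_tokens"]))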

Here are some samples

Matching —

Prediction => lines with words ending with a number before lower - case letter
Gold => lines with words ending with number before lower-case letter

Better —

Prediction => lines with the string < M0 > before a vowel and the string < M1 > at least once in them
Gold => lines with the string <M0> before a vowel and string <M1>

Worse —

Prediction => lines with the string < M0 > followed by 2 capital letters
Gold => lines with the string <M0> followed by a capital letter , 6 or more times in them

Well, to be fair, the only thing wrong with the last one is that the prediction said 2 instead of 6. There are ways to get around this.

Thank you for getting this far! We have also made a web app available, where we will be adding interactive ways to use these small models; the model you just saw is already there. You can check it out here. Please let us know what you think. We will be back soon with another interesting post and algorithm. Till then, au revoir!
