Language Generation with Recurrent Models

LSTM, Sampling, Smart Code Completion Tool

Jake Batsuuri
Computronium Blog
9 min read · Mar 21, 2021

How Do You Generate Sequence Data?

The general way is to train a machine learning model, then ask it to predict the next token, whether the tokens are characters, words, or n-grams. A model with this predictive capability is called a Language Model. The model is essentially learning the latent space, i.e. the statistical structure of the given data.

This model spits out an output based on an input. We then feed that output back in as part of the input for another round of text generation, and repeat the process.

More concretely, given a sequence like “Cat in the ha”, the language model would predict “t”, assuming the model was trained on a Dr. Seuss corpus.

The output unit for a character-level language model would be a softmax activation over all the possible characters.

Imagine the 26 letters of English: for a given sequence of text, there is a probability distribution over those 26 letters. For our “cat in the ha” sequence, the letter “t” would probably have the highest probability, say 0.25, whereas “r” might be 0.03 and “m” might be 0.05, and so on.

So when we generate the next character in the sequence, we are sampling from a probability space. There are some approaches to this.

Which Sampling Strategy To Pick?

Greedy Sampling

If we always go with the highest-probability character, our model will probably never mess up, but the text it generates will be pretty stale, clichéd, and common. This sampling has minimum entropy.

Pure Stochastic Sampling

On the other hand, if we pick uniformly at random, we might as well generate a meaningless sequence of characters like “wkrnj1lkm32l3kremflsdcm”. This sampling has maximum entropy.

Somewhere in Between?

However, if we sample probabilistically from the softmax output, we would pick “t” 0.25 of the time, which gives the less likely characters a chance to appear at least some of the time. This method has an entropy somewhere between the minimum and the maximum, and better yet, we can control it with a knob.

The softmax temperature is a value we can use to adjust how randomly we want to sample from the probability space. A temperature near 0, say 0.01, is nearly deterministic; a temperature near 1, say 0.99, is very random.

The way we do this is by taking a distribution and computing a reweighted distribution according to our entropy preference.
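
As a sketch, the reweighting could look like the helper below; the name reweight_distribution is my own, and the 0.5 default is just an example:

```python
import numpy as np

def reweight_distribution(original_distribution, temperature=0.5):
    # Scale the log-probabilities by temperature: a low temperature
    # sharpens the distribution, a high temperature flattens it.
    distribution = np.log(original_distribution) / temperature
    distribution = np.exp(distribution)
    # Renormalize so the probabilities sum to 1 again.
    return distribution / np.sum(distribution)
```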

How To Implement Character-Level LSTM Text Generation?

First we download a large corpus to train our network with:
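
A minimal sketch of this step in Keras; the Nietzsche text from the stock Keras example is an assumption here, and any large plain-text file would do:

```python
import tensorflow.keras as keras

# Fetch a large public-domain corpus (assumed: the Nietzsche text
# used by the standard Keras text-generation example).
path = keras.utils.get_file(
    'nietzsche.txt',
    origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
text = open(path, encoding='utf-8').read().lower()
print('Corpus length:', len(text))
```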

Then we vectorize the characters in the text:
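
A sketch of the usual one-hot scheme: extract overlapping sequences of characters and use the character that follows each one as the target. The sequence length of 60 and step of 3 are assumptions:

```python
import numpy as np

maxlen = 60   # length of each input sequence (assumed)
step = 3      # start a new sequence every 3 characters (assumed)

sentences = []    # input sequences
next_chars = []   # target character for each sequence

for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])

chars = sorted(set(text))                  # unique characters in the corpus
char_indices = {c: i for i, c in enumerate(chars)}

# One-hot encode the inputs and targets.
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
```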

And create a model and compile it:
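
A minimal single-layer model matching the description above, with the softmax output over all possible characters; the 128 units, RMSprop optimizer, and epoch count are assumptions:

```python
from tensorflow.keras import layers, models, optimizers

model = models.Sequential([
    layers.LSTM(128, input_shape=(maxlen, len(chars))),
    # Softmax over the character vocabulary, as discussed above.
    layers.Dense(len(chars), activation='softmax'),
])
model.compile(loss='categorical_crossentropy',
              optimizer=optimizers.RMSprop(learning_rate=0.01))
model.fit(x, y, batch_size=128, epochs=10)
```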

And finally adjust the temperature and give it a random prompt and let the trained model predict the next character:
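
A sketch of the sampling step and generation loop; the sample helper reweights the softmax output by temperature before drawing a character, and the 400-character length and 0.5 temperature are assumptions:

```python
import random

def sample(preds, temperature=1.0):
    # Reweight the softmax output by temperature, then draw one index.
    preds = np.asarray(preds).astype('float64')
    preds = np.exp(np.log(preds) / temperature)
    preds = preds / np.sum(preds)
    return np.argmax(np.random.multinomial(1, preds, 1))

# Pick a random prompt from the corpus and generate from it.
start = random.randint(0, len(text) - maxlen - 1)
generated = text[start: start + maxlen]
print('Seed:', generated)

for _ in range(400):
    # One-hot encode the current window of maxlen characters.
    sampled = np.zeros((1, maxlen, len(chars)))
    for t, char in enumerate(generated[-maxlen:]):
        sampled[0, t, char_indices[char]] = 1
    preds = model.predict(sampled, verbose=0)[0]
    generated += chars[sample(preds, temperature=0.5)]

print(generated)
```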

The output for this prompt is a little nonsensical, but considering that it’s a single-layer LSTM that takes a couple of minutes to train, it’s pretty okay.

What would happen if we swapped the corpus for a codebase?

Swift Public GitHub Repository

I merged just a few random files and trained a model, getting gibberish like this:

To be fair, this was only a 50k-character corpus.

ThreeJS Portable Library

This is a single file containing most of the ThreeJS core library, about 1.3 million characters long. Furthermore, since code needs to be more structured, I decided to reduce the temperature to 0.35:

This stuff gives pseudocode a whole new meaning. How can we make this more useful?

What If We Just Try to Make a Smart Code Completion Tool?

When we are looking for code completion, it is usually just to complete the current line of code; it’s never really multi-line stuff. Although maybe a Smart Code Snippet Tool could be cool too.

A typical line is about 50 characters long. Furthermore, our code completion tool shouldn’t really invent new code; we are just trying to save time by completing code we type very frequently. We almost want deterministic output, so we might use a near-deterministic temperature of 0.05.

Another consideration is how long a prompt string to use: most people will type out a bit and then wait for code completion. I chose a string length of 10:
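
Putting those choices together, a hypothetical complete() wrapper around the generation loop above might look like this, using the near-deterministic temperature and completing roughly one line’s worth of characters:

```python
def complete(prompt, length=50, temperature=0.05):
    # Hypothetical helper: complete roughly one line (~50 characters)
    # from whatever the user has typed so far.
    generated = prompt
    for _ in range(length):
        sampled = np.zeros((1, maxlen, len(chars)))
        # Encode the prompt at the start of the window; for prompts
        # shorter than maxlen the remaining timesteps stay all-zero
        # (a simplification).
        for t, char in enumerate(generated[-maxlen:]):
            sampled[0, t, char_indices[char]] = 1
        preds = model.predict(sampled, verbose=0)[0]
        generated += chars[sample(preds, temperature)]
    return generated

print(complete('vertice'))
```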

Test #1

Our first random prompt is “vertice”. In the original corpus this appears a lot; some of the original uses were:

Our smart code completion tool, outputs this:

Which doesn’t make sense. However:

Would have made sense, as it appears several times throughout the codebase. We just need a lower temperature.

Test #2

Now we use a temperature of 0.01, and our prompt is:

And the real code has uses like:

And our code completion tool outputs:

Which looks better, but this particular line of code never appears in the original codebase.

It’s very clear what’s happening: our character-level code generation makes sense when you consider the words hyper-locally, but it’s just not capturing the larger meaning of even a single line of code.

We could do word-level tokenization or stack the LSTM layers.

Test #3

So in this one, I stacked two LSTM layers; make sure the preceding LSTM returns full sequences:
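
A sketch of the stacked variant; the essential detail is return_sequences=True on the first LSTM, so the second LSTM receives the full sequence of hidden states rather than only the last one:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    # The first LSTM must return the full sequence of hidden states,
    # not just the final one, so the second LSTM has a sequence to consume.
    layers.LSTM(128, return_sequences=True,
                input_shape=(maxlen, len(chars))),
    layers.LSTM(128),
    layers.Dense(len(chars), activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
```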

Our prompt is:

And the real code has uses like:

And our code completion tool outputs:

Which isn’t bad, but also it’s not quite capturing our intent.

To improve this in general, we can add a couple more features, like doing code completion only from the start of a line and doing word- or n-gram-level code prediction.

Up Next…

Coming up next is probably more Computational Linguistics Theory. If you would like me to write another article explaining a topic in-depth, please leave a comment.

For the table of contents and more content click here.
