Language Generation with Recurrent Models
LSTM, Sampling, Smart Code Completion Tool
How Do You Generate Sequence Data?
The general way is to train a machine learning model and then ask it to predict the next token, whether the tokens are characters, words or n-grams. A model with this predictive capability is called a Language Model. The model is basically learning the latent space, i.e. the statistical structure, of the given data.
The model spits out an output token for a given input sequence. We then add that output back onto the input, feed it in for another round of text generation, and repeat the process.
More concretely, given a sequence like “Cat in the ha”, the language model would predict “t”, assuming the model was trained on a Dr. Seuss corpus.
The output unit for a character-level language model would be a softmax activation over all the possible characters.
Imagine the 26 letters of English: for a given sequence of text, there is a probability distribution over those 26 letters. For our “cat in the ha” sequence, the letter “t” would probably have the highest probability, say 0.25, whereas “r” might be 0.03, “m” might be 0.05, and so on.
So when we generate the next character in the sequence, we are sampling from a probability distribution. There are a few approaches to this.
Which Sampling Strategy To Pick?
Greedy Sampling
If we always go with the highest-probability character, our model will probably never mess up, but the text it generates will be pretty stale, clichéd and repetitive. This sampling strategy has minimum entropy.
Pure Stochastic Sampling
On the other hand, if we pick characters uniformly at random, we might as well generate meaningless sequences of characters like “wkrnj1lkm32l3kremflsdcm”. This sampling strategy has maximum entropy.
Somewhere in Between?
However, if we sample probabilistically from the softmax output, we would pick “t” 25% of the time, which gives the other, less likely characters a chance to appear at least some of the time. This method has an entropy somewhere between the minimum and the maximum, and even better, we can control it with a knob.
The softmax temperature is a value we can use to adjust how randomly we want to sample from the probability space. A value near 0 (say 0.01) makes sampling nearly deterministic, a value of 1 leaves the distribution unchanged, and higher values make sampling increasingly random.
The way we do this is by taking the original distribution and reweighting it according to our entropy preference.
import numpy as np

def reweight_distribution(original_distribution, temperature=0.5):
    # Lower temperatures sharpen the distribution, higher temperatures flatten it.
    distribution = np.log(original_distribution) / temperature
    distribution = np.exp(distribution)
    # Renormalize so the reweighted probabilities sum to 1 again.
    return distribution / np.sum(distribution)
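As a quick usage sketch (the vocabulary and probabilities here are made up purely for illustration): greedy sampling just takes the argmax, while temperature sampling draws from the reweighted distribution.

vocab = ['t', 'e', 'a', 'r']                   # toy vocabulary
probs = np.array([0.5, 0.3, 0.15, 0.05])       # illustrative softmax output

greedy_pick = vocab[np.argmax(probs)]          # always 't', the most likely character

# Sharpen the distribution, then sample an index from it.
reweighted = reweight_distribution(probs, temperature=0.5)
sampled_pick = vocab[np.random.choice(len(vocab), p=reweighted)]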
How To Implement Character-Level LSTM Text Generation?
First we download a large corpus to train our network with:
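Something like the following works, assuming the Nietzsche corpus from the Keras examples (that choice is an assumption; any large plain-text corpus would do):

from tensorflow import keras
import numpy as np

# Download a large plain-text corpus and lowercase it.
path = keras.utils.get_file(
    'nietzsche.txt',
    origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
text = open(path).read().lower()
print('Corpus length:', len(text))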
Then we vectorize the characters in the text:
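Roughly like this, assuming a sequence length (maxlen) of 60 and a sampling step of 3, which are my own choices here: we cut the text into overlapping sequences and one-hot encode each sequence along with the character that follows it.

maxlen = 60   # length of each extracted character sequence
step = 3      # sample a new sequence every `step` characters

sentences, next_chars = [], []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])

chars = sorted(set(text))                        # unique characters in the corpus
char_indices = {c: i for i, c in enumerate(chars)}

# One-hot encode the sequences and their target characters.
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1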
And create a model and compile it:
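Something along these lines, assuming 128 LSTM units and an RMSprop optimizer (the exact hyperparameters are an assumption, not what the logs show):

from tensorflow.keras import layers, models, optimizers

model = models.Sequential([
    layers.LSTM(128, input_shape=(maxlen, len(chars))),  # single recurrent layer
    layers.Dense(len(chars), activation='softmax'),      # distribution over all characters
])
model.compile(loss='categorical_crossentropy',
              optimizer=optimizers.RMSprop(learning_rate=0.01))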
And finally adjust the temperature and give it a random prompt and let the trained model predict the next character:
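A sketch of that loop, where the sample helper reweights the softmax output just like reweight_distribution above and the prompt is a random maxlen-character slice of the corpus (the epoch count and output length are illustrative):

import random

def sample(preds, temperature=1.0):
    # Reweight the model's softmax output and draw one character index from it.
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds + 1e-10) / temperature
    preds = np.exp(preds) / np.sum(np.exp(preds))
    return np.random.choice(len(preds), p=preds)

model.fit(x, y, batch_size=128, epochs=5)

start = random.randint(0, len(text) - maxlen - 1)
generated = text[start: start + maxlen]              # random prompt from the corpus
for _ in range(200):
    one_hot = np.zeros((1, maxlen, len(chars)))
    for t, char in enumerate(generated[-maxlen:]):
        one_hot[0, t, char_indices[char]] = 1
    preds = model.predict(one_hot, verbose=0)[0]
    generated += chars[sample(preds, temperature=0.5)]
print(generated)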
The output for this text is a little nonsensical, but considering that it’s a single-layer LSTM that takes only a couple of minutes per epoch to train, it’s pretty okay.
epoch 1: 1565/1565 [==============================] - 168s 107ms/step - loss: 1.3974
Generating with seed: "a race which seeks to rise above its hereditary baseness and" (temperature: 0.5)
a race which seeks to rise above its hereditary baseness and interermal and man was other things has been commands of the simplic of the senses and straight and weakness of the seemed and come the instincts of the whole person of the work and although the most

epoch 2: 1565/1565 [==============================] - 171s 109ms/step - loss: 1.3848
Generating with seed: "admire and still readier to turn away. 36 =objection.=--o" (temperature: 0.5)
admire and still readier to turn away. 36 =objection.=--one has at the spirit of the stronger of the our graditions of a being body one conscience and interpretation of the child are better of contemplate himself and consciences of the contemplate of the su

epoch 3: 1565/1565 [==============================] - 167s 107ms/step - loss: 1.3746
Generating with seed: "taste when it is counter to our vanity. 177. with regard to" (temperature: 0.5)
taste when it is counter to our vanity. 177. with regard to the world for consential men, this suffering and shours it makes called as the profound in the amplement of distinguish and solitude and there is not to the point of many perhaps the stronger and hav

epoch 4: 1565/1565 [==============================] - 171s 109ms/step - loss: 1.3643
Generating with seed: " vital spot of truth when he warns all those endowed with re" (temperature: 0.5)
 vital spot of truth when he warns all those endowed with regard and the through and every him are whom will the world of a new and that in the same "nature and in the souls and as he artist and the possible the antilition of the grateful of the say still dece

epoch 5: 1565/1565 [==============================] - 167s 107ms/step - loss: 1.3556
Generating with seed: " and profound enough to receive such belated fugitives. 256" (temperature: 0.5)
 and profound enough to receive such belated fugitives. 256. a man in all the spirit of the superficial of prosonded with a man and by the same spirit of the powerful of a purposition of the conditions of the spirit is a fanish of the superstition of the spir
What Would Happen If We Replaced the Corpus with a Codebase?
Swift Public GitHub Repository
I merged just a few random files from the repository and trained a model, which produced gibberish like this:
133/133 [==============================] - 14s 109ms/step - loss: 1.3357
Generating with seed: "if() if(swift_built_standalone) project(swift c cxx asm) en" (temperature: 0.5)
if() if(swift_built_standalone) project(swift c cxx asm) endif() option(swift_host_variant_arch}") set(swift_host_variant_arch_default "${swift_host_variant_arch_default "${cmake_march_sdk_default "acchos "") endif("${cmake_system_name}") set(swift_host_vari
To be fair, this corpus was only about 50k characters long.
ThreeJS Portable Library
This corpus is a single file containing most of the ThreeJS core library, about 1.3 million characters long. Furthermore, since code needs to be more structured than prose, I decided to reduce the temperature to 0.35:
epoch 1: 3593/3593 [==============================] - 409s 113ms/step - loss: 1.8626
Generating with seed: " math.log(math.max(width, height)) * math.log2e; " (temperature: 0.35)
 math.log(math.max(width, height)) * math.log2e; this.startandition = new points(this.matrix);
this.matrixworld = shadow.color.prototype.color();
this._caches = new vector3();
}
var material = new vector3();
function = function

epoch 2: 3593/3593 [==============================] - 399s 111ms/step - loss: 1.0805
Generating with seed: ".sqrt(this.distancetosquared(v)); }; _proto.distanc" (temperature: 0.35)
.sqrt(this.distancetosquared(v));
};
_proto.distance = function () {
var vertex.settexture(array, array, offset);
return this;
}; _proto.getpointlined( vector3();
this.component = function (intensity, origin, color) {
This stuff gives pseudocode a whole new meaning. How can we make this more useful?
What If We Just Try to Make a Smart Code Completion Tool?
When we are looking for code completion, we usually just want to complete the current line of code; it’s rarely multi-line stuff. Although maybe a Smart Code Snippet Tool could be cool too.
A typical line of code is about 50 characters long. Furthermore, our code completion tool shouldn’t really invent new code; we are just trying to save time by completing code we type very frequently. We almost want deterministic output, so we might use a very low temperature like 0.05.
Another consideration is how long the prompt should be; most people will type out a bit and then wait for code completion. I chose a prompt length of 10 characters:
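Sketched out, a hypothetical complete_line helper (my own construction, reusing maxlen, chars, char_indices and the sample function from the earlier sketches) would feed in the prompt and keep predicting characters until it hits a newline:

def complete_line(model, prompt, temperature=0.05, max_new_chars=50):
    # Hypothetical helper: predict characters until the current line looks finished.
    completion = prompt
    for _ in range(max_new_chars):
        one_hot = np.zeros((1, maxlen, len(chars)))
        window = completion[-maxlen:].rjust(maxlen)      # left-pad short prompts
        for t, char in enumerate(window):
            if char in char_indices:
                one_hot[0, t, char_indices[char]] = 1
        preds = model.predict(one_hot, verbose=0)[0]
        next_char = chars[sample(preds, temperature)]
        if next_char == '\n':                            # stop at the end of the line
            break
        completion += next_char
    return completion

print(complete_line(model, 'vertice'))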
Test #1
Our first random prompt is “vertice”. In the original corpus this appears a lot; some of the original uses were:
var vertices = [];
vertices.push(x, -y, 0);
_this.setAttribute('position', new Float32BufferAttribute(vertices, 3));
Our smart code completion tool outputs this:
vertices = this.groups.prototype.color.clearcoatnormalmap
Which doesn’t make sense. However:
vertices = this.groups;
Would have made sense, as it appears several times throughout the codebase. We just need a lower temperature.
Test #2
Now we use a temperature of 0.01 and our prompt is:
"ditherin"
And the real code has uses like:
this.dithering = source.dithering;
dithering_fragment: dithering_fragment,
parameters.dithering ? '#define DITHERING' : '',
And our code completion tool outputs:
dithering = new vector3();
Which looks better, but this particular line of code never appears in the original codebase.
It’s very clear what’s happening: our character-level code generation makes sense when you consider the words hyper-locally, but it’s just not capturing the larger meaning of even a short line of code.
We could do word-level tokenization or stack LSTM layers.
Test #3
So in this one, I stacked two LSTM layers; make sure the preceding LSTM returns its full sequences:
return_sequences=True
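A sketch of that two-layer stack (the layer sizes are my assumption; the rest mirrors the earlier model):

model = models.Sequential([
    # The first LSTM must return its full output sequence so the second
    # LSTM receives one vector per timestep instead of only the final state.
    layers.LSTM(128, return_sequences=True, input_shape=(maxlen, len(chars))),
    layers.LSTM(128),
    layers.Dense(len(chars), activation='softmax'),
])
model.compile(loss='categorical_crossentropy',
              optimizer=optimizers.RMSprop(learning_rate=0.01))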
Our prompt is:
"py.call(th"
And the real code has uses like:
Light.prototype.copy.call(this, source);
_Object3D.prototype.copy.call(this, source, false);
And our code completion tool outputs:
py.call(this, context);
Which isn’t bad, but it’s still not quite capturing our intent.
To improve this in general, we could add a couple more features, like triggering code completion only from the start of a line and doing word- or n-gram-level prediction.
Other Articles
This post is part of a series of stories that explores the fundamentals of deep learning:
1. Linear Algebra Data Structures and Operations: Objects and Operations
2. Computationally Efficient Matrices and Matrix Decompositions: Inverses, Linear Dependence, Eigen-decompositions, SVD
3. Probability Theory Ideas and Concepts: Definitions, Expectation, Variance
4. Useful Probability Distributions and Structured Probabilistic Models: Activation Functions, Measure and Information Theory
5. Numerical Method Considerations for Machine Learning: Overflow, Underflow, Gradients and Gradient Based Optimizations
6. Gradient Based Optimizations: Taylor Series, Constrained Optimization, Linear Least Squares
7. Machine Learning Background Necessary for Deep Learning I: Generalization, MLE, Kullback-Leibler Divergence
8. Machine Learning Background Necessary for Deep Learning II: Regularization, Capacity, Parameters, Hyper-parameters
9. Principal Component Analysis Breakdown: Motivation, Derivation
10. Feed-forward Neural Networks: Layers, definitions, Kernel Trick
11. Gradient Based Optimizations Under The Deep Learning Lens: Stochastic Gradient Descent, Cost Function, Maximum Likelihood
12. Output Units For Deep Learning: Stochastic Gradient Descent, Cost Function, Maximum Likelihood
13. Hidden Units For Deep Learning: Activation Functions, Performance, Architecture
14. The Common Approach to Binary Classification: The most generic way to setup your deep learning models to categorize movie reviews
15. General Architectural Design Considerations for Neural Networks: Universal Approximation Theorem, Depth, Connections
16. Classifying Text Data into Multiple Classes: Single-Label Multi-class Classification
17. Convolutional Models Overview: Convolutions, Kernels, Downsampling & Properties
18. Working Understanding of Convolutional Models: Creating, Preprocessing, Data Augmentation, Feature Extraction, Fine Tuning
19. Convolutional Models for Sequential Data: And easing into Recurrent Neural Networks
20. Recurrent Models Overview: Recurrent Layers: SimpleRNN, LSTM, GRU
21. Language Processing with Recurrent Models: Bidirectional RNNs, Encoding, Word Embeddings and Tips
22. Language Generating with Recurrent Models: LSTM, Sampling, Smart Code Completion Tool
Up Next…
Coming up next is probably more Computational Linguistics Theory. If you would like me to write another article explaining a topic in depth, please leave a comment.
For the table of contents and more content click here.