Learn how to build powerful contextual word embeddings with ELMo

In this article, we will dive into deep contextual word embeddings, train our own custom ELMo embeddings, and use them in Natural Language Processing (NLP) tasks such as Named Entity Recognition (NER) and text classification.

Getting started with Word Embedding

Word embeddings are an essential part of any NLP model because they give meaning to words. It all started with Word2Vec, which ignited the spark in the NLP world and was followed by GloVe.

Word2Vec showed that we can use a vector (a list of numbers) to represent words in a way that captures semantic, or meaning-related, relationships.

We will not go further into these word embeddings here, but the vital point is that they assign a single, fixed meaning to each word. This is a major drawback, because the meaning of a word changes with its context, so these embeddings are not the best option for language modelling.

Take a look at these sentences below as an example.

The plane took off at exactly nine o’clock.
The plane surface is a must for any cricket pitch.
Plane geometry is fun to study.

You can see how the meaning of the word “plane” changes with context. This made it essential to find a way to capture a word’s meaning in changing contexts while retaining the contextual information. And so, contextualized word embeddings came into the picture. In the following sections, we’ll learn how Embeddings from Language Models (ELMo) helped overcome the limitations of traditional word embedding methods like GloVe and Word2Vec.
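To make the limitation concrete, here is a minimal sketch of a static embedding lookup. The three-dimensional vectors and the `static_embeddings` table are invented for illustration; real Word2Vec or GloVe vectors have hundreds of dimensions.

```python
# Toy static embedding table: one fixed vector per word (values are invented).
static_embeddings = {
    "plane": [0.21, -0.47, 0.88],
    "geometry": [0.10, 0.35, -0.62],
    "took": [-0.30, 0.12, 0.05],
}

sentences = [
    "The plane took off at exactly nine o'clock.",
    "The plane surface is a must for any cricket pitch.",
    "Plane geometry is fun to study.",
]


def lookup(word):
    """Return the fixed vector for a word, ignoring the sentence it appears in."""
    return static_embeddings[word.lower()]


# The lookup never sees the sentence, so every occurrence of "plane"
# maps to the same vector regardless of its meaning there.
vectors = [lookup("plane") for _ in sentences]
print(all(v == vectors[0] for v in vectors))
```

Because the sentence is never an input to `lookup`, the aircraft, the flat surface and the branch of geometry all share one vector.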

ELMo: Deep contextualized word representation

Instead of using a fixed embedding for each word, as models like GloVe do, ELMo looks at the entire sentence before assigning each word in it an embedding.

How does it do it? Using Long Short-Term Memory (LSTM) networks.

Illustrated guide to LSTM

It uses a bi-directional LSTM trained on a specific task, language modelling, to create contextual word embeddings.

ELMo provided a momentous stride towards better language modelling and language understanding. After being trained on a massive dataset, the ELMo LSTM can be used as a component in other NLP models that need to handle language.

ELMo stands for Embeddings from Language Models, and hence it also has the ability to predict the next word in a sentence, which is, essentially, what Language Models do. When trained on a large dataset, the model also starts to pick up on language patterns.

Such models allow you to determine that, if you see a phrase like “I am going to write with a”, the word “pencil” is a more reasonable next word than “frog”. It’s unlikely that the model will guess the exact next word in this example, but it can tell plausible continuations from implausible ones.
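The idea of scoring candidate next words can be sketched with a toy count-based model. This is not how ELMo works internally (ELMo uses LSTMs, not bigram counts), and the four-sentence corpus is invented for illustration.

```python
from collections import Counter, defaultdict

# Tiny invented corpus; a real language model is trained on millions of sentences.
corpus = [
    "i write with a pencil",
    "she writes with a pen",
    "he draws with a pencil",
    "the frog sat on a log",
]

# Count how often each word follows each preceding word (a bigram model).
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        next_word_counts[prev][nxt] += 1


def next_word_probability(prev, candidate):
    """Estimate P(candidate | prev) from the bigram counts."""
    counts = next_word_counts[prev]
    total = sum(counts.values())
    return counts[candidate] / total if total else 0.0


print(next_word_probability("a", "pencil"))
print(next_word_probability("a", "frog"))
```

In this corpus “pencil” follows “a” in two of four cases and “frog” never does, so the model ranks “pencil” higher even though it cannot be sure of the exact word.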

ELMo uses a bi-directional LSTM in training, so that its language model predicts not only the next word from the words before it, but also the previous word from the words after it. It contains a 2-layer bidirectional LSTM backbone, with a residual connection added between the first and second layers. Residual connections allow gradients to flow through a network directly, without passing through the non-linear activation functions. The high-level intuition is that residual connections help deep models train more successfully.
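A residual connection can be sketched in a few lines of NumPy. The `layer` function below is a stand-in (a linear map followed by tanh) rather than a real LSTM, and the weights are random placeholders.

```python
import numpy as np


def layer(x, weights):
    """A stand-in for an LSTM layer: a linear map followed by tanh."""
    return np.tanh(x @ weights)


rng = np.random.default_rng(0)
x = rng.normal(size=(4,))            # input to the first layer
w1 = rng.normal(size=(4, 4)) * 0.1   # hypothetical layer weights
w2 = rng.normal(size=(4, 4)) * 0.1

h1 = layer(x, w1)
# Residual connection: the second layer's output is added to its input,
# so h1 (and its gradient) also reaches the output along an identity path.
h2 = layer(h1, w2) + h1
print(h2.shape)
```

The `+ h1` term is the whole trick: even if the second layer's non-linearity squashes its gradient, the identity path still carries it back to the first layer.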

What also made ELMo interesting is how the language model is used after training. Assume that we are looking at the n-th word in our input. Using our trained 2-layer language model, we take the word representation x_n as well as the bidirectional hidden layer representations h_{1,n} and h_{2,n}. Then, we combine them into a new weighted, task-specific representation. This looks as follows:

Figure: An example of combining the bidirectional hidden representations and word representation for “happy” to get an ELMo-specific representation.

Here, the function F multiplies each vector by its own weight, learned for the downstream task, and sums the weighted vectors into a single representation.
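A minimal sketch of this combination step is shown below. It assumes the formulation in which the raw per-layer weights are passed through a softmax and the result is rescaled by a single scalar; the names `combine_layers`, `raw_weights` and `gamma`, and the toy four-dimensional vectors, are illustrative only.

```python
import numpy as np


def combine_layers(layers, raw_weights, gamma=1.0):
    """Collapse the per-layer vectors for one token into a single vector.

    layers      : array-like of shape (n_layers, dim), e.g. [x_n, h_1n, h_2n]
    raw_weights : one unnormalised scalar weight per layer (learned per task)
    gamma       : scalar that rescales the whole vector (learned per task)
    """
    layers = np.asarray(layers, dtype=float)
    raw_weights = np.asarray(raw_weights, dtype=float)
    # Softmax turns the raw weights into positive values that sum to 1.
    exp = np.exp(raw_weights - raw_weights.max())
    s = exp / exp.sum()
    # Weighted sum over the layer axis, then a global rescale.
    return gamma * (s[:, None] * layers).sum(axis=0)


x_n = [1.0, 0.0, 0.0, 0.0]
h_1n = [0.0, 1.0, 0.0, 0.0]
h_2n = [0.0, 0.0, 1.0, 0.0]
elmo_n = combine_layers([x_n, h_1n, h_2n], raw_weights=[0.0, 0.0, 0.0])
print(elmo_n)
```

With equal raw weights each layer contributes one third; a downstream task can shift the weights towards whichever layer is most useful to it.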

Salient features

  • ELMo word representations are purely character-based, which allows the network to use morphological clues to form robust representations for out-of-vocabulary tokens unseen during training.
  • Unlike other word embeddings, it generates word vectors at run time.
  • It gives an embedding for anything you put in (characters, words, sentences, paragraphs), but it is built with sentence-level input in mind.

Now that we have some familiarity with how ELMo embeddings work in language modelling, let’s get started with the code.

Training ELMo on corpus

In this part of the tutorial, we’re going to train our own ELMo model for deep contextualized word embeddings from scratch. Training ELMo is a pretty straightforward task. You will need to install the TensorFlow GPU library before starting, as we are using the TensorFlow version of ELMo (bilm-tf).

Install Python version 3.5 or later, TensorFlow version 1.2 and h5py. The setup.py and test commands below are run from the root of the bilm-tf repository, which is cloned in a later step:

pip install tensorflow-gpu==1.2 h5py
python setup.py install

Ensure the tests pass in your environment by running the following line of code:

python -m unittest discover tests/

To train and evaluate the embeddings, you need to provide:

  • a vocabulary file
  • a set of training files
  • a set of held out files

The vocabulary file is a text file with one token per line. It must include the special tokens, and it should be sorted in descending order by token count in your training data. The first three lines should be the special tokens <S>, </S> and <UNK>.

The training data should be randomly split into many training files, each containing one slice of the data. Each file contains pre-tokenized and white space separated text, one sentence per line. Don’t include the <S> or </S> tokens in your training data.
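The vocabulary format described above can be produced with a short script. This is a sketch, not part of bilm-tf: the helper names `build_vocab_lines` and `write_vocab` are hypothetical, the third special token is taken to be `<UNK>` as in the bilm-tf README, and the input is assumed to be pre-tokenized, whitespace-separated sentences.

```python
from collections import Counter

# Special tokens that must occupy the first three lines of the vocabulary file.
SPECIAL_TOKENS = ["<S>", "</S>", "<UNK>"]


def build_vocab_lines(sentences):
    """Return vocabulary lines: special tokens first, then tokens by descending count."""
    counts = Counter(token for sentence in sentences for token in sentence.split())
    # most_common() sorts by descending frequency.
    return SPECIAL_TOKENS + [token for token, _ in counts.most_common()]


def write_vocab(sentences, path):
    """Write one token per line to `path`."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(build_vocab_lines(sentences)) + "\n")


sentences = ["the plane took off", "the plane surface is flat"]
print(build_vocab_lines(sentences)[:5])
```

For a real corpus you would stream the training files line by line instead of holding all sentences in a list.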

Once these three sets of files are ready, the rest is simple. Clone the repo first:

git clone https://github.com/allenai/bilm-tf.git

Training the model

The hyperparameters used to train the ELMo model can be found in bin/train_elmo.py. Set the number of GPUs you have available and adjust the other settings accordingly in train_elmo.py.
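One setting that usually has to match your own corpus is the number of training tokens. Assuming the script expects this as one of its settings (called `n_train_tokens` here; check your copy of train_elmo.py), a rough count can be obtained with the hypothetical helper below, which simply counts whitespace-separated tokens.

```python
import glob


def count_training_tokens(train_prefix):
    """Count whitespace-separated tokens across all files matching the glob."""
    total = 0
    for path in sorted(glob.glob(train_prefix)):
        with open(path, encoding="utf-8") as f:
            for line in f:
                total += len(line.split())
    return total
```

For example, `count_training_tokens('/path/to/training-folder/*')` uses the same glob pattern that is passed to `--train_prefix`.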


We are ready now. To start training, run the command below:

python bin/train_elmo.py \
--train_prefix='/path/to/training-folder/*' \
--vocab_file /path/to/vocab.txt \
--save_dir /output_path/to/checkpoint

If everything goes right, you will see your model training. Conveniently, after every 100 batches it shows how long training has taken. So, after the first 100 batches, you can estimate the total time the entire training process will take.
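The extrapolation is simple arithmetic, sketched below. It assumes the total number of batches is the number of epochs times the training tokens divided by the tokens consumed per batch (batch size × unroll steps × number of GPUs); that formula and all the example values are assumptions, so substitute the settings from your own train_elmo.py.

```python
def estimate_training_hours(n_train_tokens, n_epochs, batch_size, unroll_steps,
                            n_gpus, seconds_per_100_batches):
    """Extrapolate total training time from the time taken by 100 batches."""
    tokens_per_batch = batch_size * unroll_steps * n_gpus
    batches_per_epoch = n_train_tokens // tokens_per_batch
    total_batches = n_epochs * batches_per_epoch
    total_seconds = total_batches / 100 * seconds_per_100_batches
    return total_batches, total_seconds / 3600


# Illustrative numbers only: 10M tokens, 10 epochs, one GPU, 60 s per 100 batches.
total_batches, hours = estimate_training_hours(
    n_train_tokens=10_000_000, n_epochs=10, batch_size=128,
    unroll_steps=20, n_gpus=1, seconds_per_100_batches=60.0)
print(total_batches, round(hours, 2))
```

The estimate ignores checkpointing overhead and any slowdown from other jobs on the same machine, so treat it as a lower bound.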

Evaluate the trained model

If you are done with training, it’s time to evaluate your model.

Use bin/run_test.py to evaluate a trained model, e.g.

python bin/run_test.py \
--test_prefix='/path/to/heldout-folder/*' \
--vocab_file /path/to/vocab.txt \
--save_dir /output_path/to/checkpoint

After running run_test.py, you will get the perplexity score of your model on the held-out data.
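Perplexity is the exponential of the average negative log-probability the model assigns to each held-out token, so lower is better. The sketch below uses made-up token probabilities to show the calculation; it is not the code run_test.py uses.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    avg_neg_log_prob = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_prob)


# A model that gives every token probability 1/4 is as uncertain as a
# uniform choice among four words, so its perplexity is 4 (up to rounding).
log_probs = [math.log(0.25)] * 10
print(perplexity(log_probs))
```

A perplexity of k can be read as the model being, on average, as unsure as if it were choosing uniformly among k words.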

Convert the TensorFlow checkpoint to HDF5 for prediction

After training, set n_characters to 262 in options.json. Then run:

python bin/dump_weights.py \
--save_dir /output_path/to/checkpoint \
--outfile /output_path/to/weights.hdf5
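If you prefer not to edit options.json by hand, the change can be scripted. The sketch below assumes that `n_characters` sits under a `char_cnn` key, as in the options files published with bilm-tf; the helper name `set_n_characters` is hypothetical, so inspect your own file before relying on it.

```python
import json


def set_n_characters(options_path, n_characters=262):
    """Rewrite options.json with char_cnn.n_characters set for inference."""
    with open(options_path, encoding="utf-8") as f:
        options = json.load(f)
    # Assumption: the value lives under the "char_cnn" key.
    options["char_cnn"]["n_characters"] = n_characters
    with open(options_path, "w", encoding="utf-8") as f:
        json.dump(options, f, indent=2)
    return options
```

Call it with the path to the options.json of your trained model before running dump_weights.py.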

It’s time for some real action!

You can plug ELMo into almost any NLP pipeline. If you are interested in using the ELMo embeddings you trained to build your own custom NER model, then here is a great article for it.

Here, I would like to give an example of text classification using ELMo word embeddings. I am going to use emotion data created by our organization, labelled with 4 emotions: anger, sad, neutral and happy. There are only 700 sentences in the dataset. I am not going to use deep learning here, just a simple logistic regression model.

You can use ELMo embeddings without any hassle with the help of TensorFlow Hub. TensorFlow Hub is a library that enables transfer learning by making many pre-trained machine learning models available for different tasks. ELMo is one such example. Many trained models are stored there, so we just need to pull them into our pipeline.

How easy is that?

To use TensorFlow hub we need to install it first.

$ pip install "tensorflow>=1.7.0"
$ pip install tensorflow-hub

So, let's start with loading the libraries.

import pandas as pd
import numpy as np
import spacy
from tqdm import tqdm
import re
import time
import pickle

Loading the ELMo embedding

import tensorflow_hub as hub
import tensorflow as tf

elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)

Setting trainable=True lets us fine-tune some of the parameters of the ELMo module.

To turn any sentence into ELMo vectors, you just need to pass a list of strings to the elmo object.

# Extract ELMo features
x = ["Nothing suits me like suit"]
embeddings = elmo(x, signature="default", as_dict=True)["elmo"]
embeddings.shape

Output: TensorShape([Dimension(1), Dimension(5), Dimension(1024)])

  • The first dimension of this tensor represents the number of samples, that is, the number of input strings (here, one sentence).
  • The second dimension represents the number of tokens in the longest string in the input list.
  • The third dimension is equal to the length of the ELMo vector.

We will extract ELMo vectors for the train and test sets. For each text (stored in the clean_tweet column in the code below), we will take the mean of the ELMo vectors of its constituent tokens.

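def elmo_vectors(x):
    embeddings = elmo(x.tolist(), signature="default", as_dict=True)["elmo"]

    with tf.Session() as sess:
        # initialise the module's variables and lookup tables before use
        sess.run(tf.global_variables_initializer())
        sess.run(tf.tables_initializer())
        # return average of ELMo features
        return sess.run(tf.reduce_mean(embeddings, 1))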
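What the averaging step does to the tensor shape can be seen with a small NumPy stand-in. Random values replace the real ELMo output here, and `mean(axis=1)` plays the role of `tf.reduce_mean(embeddings, 1)`.

```python
import numpy as np

# Stand-in for the ELMo output: 2 sentences, up to 5 tokens, 1024 features.
# Random values are used here instead of real embeddings.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(2, 5, 1024))

# Averaging over axis 1 (the token axis) leaves one vector per sentence.
sentence_vectors = token_embeddings.mean(axis=1)
print(sentence_vectors.shape)
```

Each sentence is reduced to a single 1024-dimensional vector, which is the fixed-length input a classifier such as logistic regression needs.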

To save memory, I am passing 100 sentences per batch while computing the embeddings. This step takes time, because computing ELMo embeddings is computationally heavy.

list_train = [train[i:i+100] for i in range(0,train.shape[0],100)]
list_test = [test[i:i+100] for i in range(0,test.shape[0],100)]
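The slicing pattern above works on any sliceable sequence. Here is the same idea as a small standalone function, with a plain list standing in for the DataFrame; `make_batches` is a hypothetical helper name.

```python
def make_batches(items, batch_size=100):
    """Split a sequence into consecutive slices of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


batches = make_batches(list(range(250)), batch_size=100)
print([len(b) for b in batches])
```

The last batch simply holds whatever is left over, so no rows are dropped when the dataset size is not a multiple of 100.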

Now, we will iterate through these batches and extract the ELMo vectors.

# Extract ELMo embeddings
elmo_train = [elmo_vectors(x['clean_tweet']) for x in list_train]
elmo_test = [elmo_vectors(x['clean_tweet']) for x in list_test]

We can concatenate them back to a single array and save it.

elmo_train_new = np.concatenate(elmo_train, axis = 0)
elmo_test_new = np.concatenate(elmo_test, axis = 0)
# save elmo_train_new
with open("elmo_train_03032019.pickle", "wb") as pickle_out:
    pickle.dump(elmo_train_new, pickle_out)

# save elmo_test_new
with open("elmo_test_03032019.pickle", "wb") as pickle_out:
    pickle.dump(elmo_test_new, pickle_out)
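Saving the vectors only pays off if they are reloaded later instead of being recomputed. A minimal sketch of the round trip is shown below; `save_pickle` and `load_pickle` are hypothetical helper names.

```python
import pickle


def save_pickle(obj, path):
    """Serialise obj to path; the with-block closes the file even on error."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Load a previously pickled object from path."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

For example, `elmo_train_new = load_pickle("elmo_train_03032019.pickle")` restores the training vectors in a later session. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.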

Time to build the model

We will use the ELMo vectors of the train dataset to build a classification model. Then, we will use the model to make predictions on the test set. First split the data.

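from sklearn.model_selection import train_test_split

# The remaining arguments were cut off in the source; the label column name,
# test_size and random_state below are assumptions.
xtrain, xvalid, ytrain, yvalid = train_test_split(
    elmo_train_new, train['label'], test_size=0.2, random_state=42)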

Now build a simple logistic regression baseline model.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

lreg = LogisticRegression()
lreg.fit(xtrain, ytrain)

Time for results

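preds_valid = lreg.predict(xvalid)
# With four emotion classes f1_score needs an explicit averaging mode;
# average="weighted" is an assumption, the source does not state which was used.
f1_score(yvalid, preds_valid, average="weighted")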

Output: 0.789976
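With more than two classes, an F1 score is an aggregate of per-class scores, and scikit-learn's `f1_score` exposes the aggregation through its `average` parameter. The pure-Python sketch below shows one common choice, macro averaging, on a handful of invented labels; it does not reproduce the score reported above.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label seen."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Invented labels for illustration only.
y_true = ["anger", "sad", "neutral", "happy", "happy", "sad"]
y_pred = ["anger", "sad", "happy", "happy", "happy", "neutral"]
print(per_class_f1(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

Macro averaging treats every emotion equally, whereas weighted averaging scales each class by its frequency; on an imbalanced dataset the two can differ noticeably, so it is worth stating which one a reported score uses.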

This is a good enough F1 score for such a small dataset, which suggests that the ELMo features carry useful signal for the task.

ELMo has had a major impact on word embeddings in Natural Language Processing (NLP), and it is now widely used for various NLP tasks. I hope this article gave you an insight into how contextual embeddings work, and why ELMo works well in language modelling as well as other NLP tasks.
