Why You Should Care About Byte-Level Sequence-to-Sequence Models in NLP

Original image by Christina Morillo (Pexels.com)

Sequence-to-sequence models are ubiquitous in NLP nowadays. They are applied to all kinds of problems, from part-of-speech tagging to question-answering systems. Usually, the text they deal with is treated as a sequence of words. While this proves to be powerful, word-level models do come with their own set of problems.

For example, there are many different words in any given language, so word-level models have to maintain very big vocabularies. These can become too large to fit into memory, which might make it hard to run a model on smaller devices like a smartphone. Moreover, unknown words might still be encountered, and how to deal with them is not a solved problem. Lastly, some languages have many different forms (morphological variants) of the same word. Word-level models treat all of them as distinct words, which is not the most efficient way of dealing with things.

This is where byte-level models come in. While byte-level sequence models have been gaining traction in NLP of late, they are still relatively unknown to a large group of NLP researchers and practitioners. This is everyone’s loss, as byte-level models combine highly desirable properties with elegant solutions to the long-standing problems mentioned above.

This blog post explains how byte-level sequence-to-sequence models work, how this brings about the benefits they have, and how they relate to other models — character-level and word-level models, in particular.

To make things more concrete, we will focus on a particular use case: sequence-to-sequence models, with recurrent neural networks (RNNs).
We will use the task of machine reading — a task where a computer reads a document and has to answer questions about it — as a running example throughout the text.

The content below is based on research I did during an internship at Google Research, which culminated in this paper.

Why should you care about byte-level sequence-to-sequence models?

Let’s start with the take-home messages. Why would we want to use bytes? It turns out that reading and writing textual content byte by byte, rather than word by word or character by character, has quite a few advantages:

✅ Small input vocabulary → As a result, small model size

✅ No out-of-vocabulary problem

✅ An elegant way of dealing with many morphological variants per word

As a bonus, bytes provide us with a universal encoding scheme across languages and they allow for an apples-to-apples comparison between models.

So, is it all clear sailing and nothing but blue skies?!?
Sadly, no…

Byte-level models do come with a drawback, which is:

❌ Longer unroll length for the RNN models → as a result: less input can be read

To jump ahead a bit, byte-level models are in fact quite similar to character-level models and their performance is often on par with word-level models (as I showed in this paper on a machine reading task). This may lead you to think that, well, in that case, perhaps there is nothing too thrilling about them after all? But this would be missing the very point.

If we consider the point listed first above again, i.e. much smaller model size, what it comes down to, really, is great performance with far fewer parameters. This is very good news!

To add to that, on languages rather different from English — like, e.g., Russian and Turkish which are morphologically much richer than English is, and for which words as units of input are less suitable as a result — byte-level models have an edge over word-level models. This is better news still!

In short, many advantages and desirable computational properties. Let’s turn to an explanation of how byte-level models work. After that, I will show how each of the points mentioned above (small input vocabulary and model size, no out-of-vocabulary problem, an elegant way of dealing with many morphological variants per word, and a longer unroll length for the RNN models) follows from that.

As a bonus, we’ll see how bytes provide us with a universal encoding scheme across languages and how they allow for an apples-to-apples comparison between models.

How do byte-level models work?

Before we turn to bytes, first let’s have a look at what a typical word-level encoder-decoder, or sequence-to-sequence model looks like:

Figure 1. Word-level sequence-to-sequence model, for answering questions.

The inputs at the bottom in Figure 1 are vectors representing words (also called word embeddings), depicted as columns of yellow squares.
They are processed by recurrent cells, represented by orange circles, that take the word vectors as inputs and maintain an internal state, keeping track of everything read so far. The blue cells start off with the internal state of the last orange cell as input. They have word embeddings as output, corresponding to the words in the answer. The yellow/orange bit is what is called the encoder. The blue bit is the decoder.

As can be deduced from this figure, the model needs to have access to a dictionary of word embeddings. Every word in the input, and every word in the output corresponds to one unique word vector in this dictionary.
A model dealing with English texts typically stores vectors for 100,000 different words.

Now let’s have a look at how this works at byte level:

Figure 2. Byte-level sequence-to-sequence model.

Here, in Figure 2, we see a byte-level sequence-to-sequence model. The inputs are embeddings, just as with the word-level model, but in this case, the embeddings correspond to bytes. The first byte is the one corresponding to the character ‘W’, the second corresponds to ‘h’, etc.

Just to be clear on one detail here, as this sometimes is a source of confusion, the bytes themselves are not the embeddings. Rather, they are used as indices to the entries of an embedding matrix. So, byte 0000 0010, i.e. the value 2, points to the row at index 2 of the embedding matrix (the third row, as the indexing is 0-based), which itself is a, let’s say, 200-dimensional vector.

In short: every byte in the input, and every byte in the output corresponds to one unique vector in the dictionary, i.e., the embedding matrix. As there are 256 possible bytes, each byte-level model stores 256 vectors as its dictionary.
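To make the lookup concrete, here is a minimal sketch in plain Python. It is an illustration only: the names `embedding_matrix` and `embed_bytes` are made up for this example, the vectors are random rather than learned, and UTF-8 is assumed as the encoding of the input.

```python
import random

NUM_BYTES = 256        # every possible byte value, 0..255
EMBEDDING_DIM = 200    # illustrative embedding size, as in the text

random.seed(0)
# One row per byte value. In a real model these values are learned;
# here they are random placeholders.
embedding_matrix = [
    [random.uniform(-0.1, 0.1) for _ in range(EMBEDDING_DIM)]
    for _ in range(NUM_BYTES)
]

def embed_bytes(text, encoding="utf-8"):
    """Encode text to bytes and look up one embedding row per byte."""
    byte_ids = list(text.encode(encoding))   # ints in 0..255
    return byte_ids, [embedding_matrix[b] for b in byte_ids]

byte_ids, vectors = embed_bytes("Wh")
print(byte_ids)          # [87, 104], the byte values of 'W' and 'h'
print(len(vectors[0]))   # 200
```

The byte value is only ever used as a row index; the vector found at that row is what the recurrent cells actually consume.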

Small input vocabulary → As a result, small model size

This immediately brings us to the first advantage of byte-level models: they have a vocabulary size of 256, corresponding to all possible bytes. This is considerably smaller than that of a word-level model that has to store, let’s say, 100,000 embeddings, one for every word in its vocabulary.

For a typical embedding size of 200, the byte-level model has 256 × 200 = 51,200 values in its embedding matrix. The word-level model, however, has 100,000 × 200 = 20,000,000 values to keep track of. This, I think it is safe to say, is a substantial difference.
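The arithmetic can be checked in a few lines of Python. The helper name `embedding_parameters` is made up for this sketch, and it counts only the embedding matrix, not the recurrent weights or any output layer.

```python
def embedding_parameters(vocab_size, embedding_dim):
    """Number of values stored in a vocab_size x embedding_dim matrix."""
    return vocab_size * embedding_dim

EMBEDDING_DIM = 200
byte_level = embedding_parameters(256, EMBEDDING_DIM)
word_level = embedding_parameters(100_000, EMBEDDING_DIM)

print(byte_level)                # 51200
print(word_level)                # 20000000
print(word_level / byte_level)   # 390.625
```

So, under these example sizes, the word-level embedding matrix is roughly 390 times larger than the byte-level one.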

No out-of-vocabulary problem

What happens when a word-level model encounters a word not included in its vocabulary of 100,000 words? This is a hard problem to solve, it turns out, and the topic of much research. People have proposed many solutions, such as keeping a set of otherwise unused embeddings around that can function as placeholders for the unknown words. When the unknown word is needed as output, we can simply copy it from the input.
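As a rough illustration of the placeholder idea, consider the sketch below. It is a toy version under stated assumptions, not the mechanism of any particular paper: the vocabulary, the `<unk0>`-style placeholder names, and the helper functions are all invented for this example.

```python
VOCAB = {"where", "is", "the", "capital", "of"}   # toy vocabulary (assumption)
NUM_PLACEHOLDERS = 3                               # spare embedding slots

def replace_unknown(tokens):
    """Map each out-of-vocabulary word to a placeholder token."""
    mapping = {}          # unknown word -> placeholder
    encoded = []
    for tok in tokens:
        if tok in VOCAB:
            encoded.append(tok)
            continue
        if tok not in mapping:
            if len(mapping) >= NUM_PLACEHOLDERS:
                raise ValueError("ran out of placeholder slots")
            mapping[tok] = f"<unk{len(mapping)}>"
        encoded.append(mapping[tok])
    return encoded, mapping

def restore_unknown(tokens, mapping):
    """Copy the original words back in place of placeholders."""
    reverse = {ph: word for word, ph in mapping.items()}
    return [reverse.get(tok, tok) for tok in tokens]

encoded, mapping = replace_unknown("where is amsterdam".split())
print(encoded)                                # ['where', 'is', '<unk0>']
print(restore_unknown(["<unk0>"], mapping))   # ['amsterdam']
```

Note that the model never learns anything about the word "amsterdam" itself here; it only learns to move a placeholder from input to output.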

Character-level models can stumble upon unexpected characters too. Even if we are dealing with Wikipedia data, from just the English part of Wikipedia, a Wikipedia page might contain some Chinese script, or perhaps some Greek symbols if it is about mathematics.

While the placeholder mechanism is quite powerful (especially in languages with little inflection, no cases, etc.) and the odd couple of unknown characters might simply be ignored, it is interesting to note that when a model is reading bytes, out-of-vocabulary input is simply impossible, as the set of 256 bytes is exhaustive.

This doesn’t mean the end of all problems, of course — byte-level models might still be bad at dealing with mathematical equations, for example — but it does make for one problem less to worry about.
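A quick way to convince yourself of this is to encode some text from different scripts and check that every resulting value is a valid index into a 256-row embedding matrix. The sketch below assumes UTF-8 input; the helper name `byte_ids` and the sample strings are chosen for illustration.

```python
def byte_ids(text, encoding="utf-8"):
    """Return the byte values of text under the given encoding."""
    return list(text.encode(encoding))

samples = [
    "chair",
    "kolayla\u015ft\u0131ram\u0131yoruz",  # Turkish word used later in the text
    "\u4e2d\u6587",                          # two characters of Chinese script
    "\u03b1 + \u03b2",                       # Greek symbols, as in a formula
]

for text in samples:
    ids = byte_ids(text)
    # Every id is a valid row index into a 256-row embedding matrix.
    assert all(0 <= i < 256 for i in ids)
    print(ascii(text), len(ids))
```

Whatever the script, the model only ever sees values between 0 and 255, so there is no input it has no embedding for.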

“When a model is reading bytes, out-of-vocabulary input is simply impossible, as the set of 256 bytes is exhaustive.”

As a slight aside, there are hybrid models that perform word-level reading and writing, but resort to character-level or byte-level reading when an out-of-vocabulary word is encountered. Without going into full detail here, what was shown in this paper is that such a hybrid model is the top performer on the English data. On Turkish data, however, a fully byte-level model outperforms it.

Dealing with many morphological variants of words

Smaller vocabulary, a universal encoding scheme, no out-of-vocabulary problems… all of this is very nice for byte-level models of course, but all the same, word-level models do work. So, really… why would we want to avoid having words as inputs? This is where the morphologically rich languages mentioned above come into play.

In English (and similar languages like Dutch, German, French, etc.), words carry relatively atomic units of meaning. As an example, let’s take the word chair. The word chairs means the same as what chair means, just a couple more of them.

Similarly, walk denotes some activity, and the word walked is about the same activity, where the -ed part indicates it having taken place in the past.

“So, really… why would we want to avoid having words as inputs?”

Now, contrast this to Turkish.

In Turkish, the word kolay means ‘easy’. The word kolaylaştırabiliriz means ‘we can make it easier’, while kolaylaştıramıyoruz means ‘we cannot make it easier’. Similarly, zor means ‘hard’ or ‘difficult’, zorlaştırabiliriz means ‘we can make it harder’, while zorlaştıramıyoruz means ‘we cannot make it harder’.

Turkish, here, illustrates what it means to be a morphologically rich language.
In short, a lot of what in English would be expressed by separate words, is expressed in Turkish by ‘morphing’ the word; changing its shape, by, e.g., attaching suffixes to a stem. All of a sudden, the concept of word gets a different twist here. What is a word in Turkish, we might think of as a phrase, or an entire sentence, in English.

Let’s tie this back to sequence models again. A model dealing with English would have a word embedding for chair and another one for chairs. So that is two embeddings for two forms of the word chair. In Turkish, now, many, many more forms exist for a word like kolay, as illustrated above. So, instead of storing just two embeddings per word, we would have to store way, way more.
In fact, the number of words that can be made from one word (or stem, really) in Turkish is more or less infinite, just as there are infinitely many sentences in English.

A similar argument can be made for languages which have a lot of cases, like, e.g., Russian. Every stem comes with a lot of different forms, corresponding to the different cases. To cover the same number of stems in Russian as one would do in English, a vocabulary would be needed that is bigger by an order of magnitude. For a general purpose model, this can become prohibitively expensive, or, to put it plainly, far too big to fit into memory.

Lastly, while in a word-level model the semantic information for chair and chairs is duplicated between the two word vectors it has to maintain for these two words, a lot of the information shared between these words can be stored in the matrices a byte-level model maintains, which might even allow it to generalize in cases of typos (chairss) or infrequent word forms not observed during training (“This couch is rather chairy.”).

Longer unroll length for the RNN models → As a result: less input can be read

To read the sentence “Where is Amsterdam”, a word-level model would need 3 steps, one for each word. A byte- or character-level model would need 18, one for each letter and space. Average word length varies across languages, but it is unavoidable for character-level models to need more steps than word-level models, and for byte-level models to need at least as many steps as character-level models, since a single character may take up more than one byte. If a very long input is to be dealt with, this might prove to be a showstopper.
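The step counts are easy to reproduce. The sketch below assumes whitespace tokenization for words, Unicode code points for characters, and UTF-8 for bytes; the function name `sequence_lengths` is made up for this example.

```python
def sequence_lengths(text, encoding="utf-8"):
    """Steps needed to read text at word, character and byte level."""
    return {
        "words": len(text.split()),
        "characters": len(text),
        "bytes": len(text.encode(encoding)),
    }

english = sequence_lengths("Where is Amsterdam")
turkish = sequence_lengths("kolayla\u015ft\u0131rabiliriz")
print(english)  # {'words': 3, 'characters': 18, 'bytes': 18}
print(turkish)  # {'words': 1, 'characters': 19, 'bytes': 21}
```

For the English sentence, characters and bytes coincide. The Turkish word contains two non-ASCII letters that each take two bytes in UTF-8, so the byte-level model needs two more steps than a character-level one.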

BONUS I: Universal encoding scheme across languages

Now, at this point, you might wonder, wouldn’t most of what is said so far go for characters, much as it goes for bytes?

And the answer is, to a certain extent, yes, it would.

In fact, for English, when ASCII encoding is used, there is no difference between reading characters or bytes. ASCII defines 128 characters, each of which is stored in a single byte, so every ASCII character corresponds to exactly one byte value.

However, as we all know, but sometimes forget, English is not the only language in the world. There are other languages, like Dutch, German and Turkish that use characters not represented in the ASCII character set, like é, ä, and ş. Moreover, to write in Russian, people use a different alphabet altogether, with letters from Cyrillic script.

To encode all these non-ASCII characters, countless encoding schemes have been invented, like UTF-8, UTF-16, ISO-8859-1 and so on, all of which go beyond the 128 characters in the ASCII set.

“As we all know, but sometimes forget, English is not the only language in the world.”

Character-level models, as the name indicates, take characters as units of input and output. To do this, they have to pick an encoding scheme and a particular set of characters they expect to deal with. Wouldn’t it be nice, now, if we could do away with having to make these decisions? How about if we could represent all languages and alphabets in one format, one universal encoding scheme across languages?

The answer, which should come as no surprise at this point, is that we can, of course, by using bytes.

When we read bytes as input, it doesn’t matter what encoding scheme was used to encode the input, UTF-8, ISO-8859-1 or anything else. The encoder RNN will just figure this out. And similarly for the decoder, which has to output bytes adhering to a certain encoding scheme provided in the training material. It turns out, fortunately, that for an RNN these are trivial tasks to accomplish, and in all experimental results I have seen, I never came across a single encoding error after the first couple of training iterations.
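To see what the model is actually confronted with, the sketch below encodes the same character under three encodings using Python's built-in codecs. The choice of the character é and of these three encodings is just for illustration.

```python
text = "\u00e9"  # the character 'é'

utf8 = list(text.encode("utf-8"))
latin1 = list(text.encode("iso-8859-1"))
utf16 = list(text.encode("utf-16-le"))

print(utf8)    # [195, 169]
print(latin1)  # [233]
print(utf16)   # [233, 0]

# Whatever the encoding, every value is a valid index into a
# 256-row embedding matrix.
assert all(0 <= b < 256 for b in utf8 + latin1 + utf16)
```

The byte sequences differ per encoding, but they are all drawn from the same 256 values. What the model has to learn is which byte patterns occur in its training data, not a new vocabulary.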

BONUS II: Apples-to-apples comparison between models

Byte-level models allow for a fair comparison between models. This point is closely related to the previous one about bytes providing a universal encoding scheme across languages. Any word-level model comes with a vocabulary size, and more importantly, with a choice as to which words to include in the vocabulary. It is very unlikely that two researchers, even if they are working with the same data, end up with exactly the same words in the vocabularies of their models.

“It is very unlikely that two researchers, even if they are working with the same data, end up with exactly the same words in the vocabularies of their models.”

This also goes for character-level models, albeit to a lesser extent. Which characters can be dealt with by the model? What diacritics are expected? What punctuation symbols are considered? Can tabs occur as white space characters?

While it is possible to come up with reasonable defaults for any model, it is nice to note that when two byte-level models are compared that differ in architecture, any difference they might have in terms of performance can never be due to differences in choices made regarding their input/output vocabularies. The difference in performance has to do with the difference in architecture.

To conclude

Above, I tried to explain the intuitions behind byte-level models, and why I am enthusiastic about them.

Byte-level models provide an elegant way of dealing with the out-of-vocabulary problem. Byte-level models perform on par with state-of-the-art word-level models on English, and can do even better on morphologically more involved languages. This is good news, as byte-level models have far fewer parameters.
In short: reading and outputting bytes really works ;-)

This doesn’t mean that byte-level models are the only solution to the problems covered above. There is very interesting work on morpheme-level models or even models that deal with arbitrary parts of words, where the models themselves figure out what the best way to split up words is.

“In short: reading and outputting bytes really works.”

What all of these models have in common is that they allow for quite complex morphological phenomena to be dealt with without resorting to involved rule-based systems, or machine-learned models for morphological analysis that may need large manually curated data sets for training.

Finally, I have to admit I skipped over many (well, very many) details.
I didn’t come up with all of this just like that though.

This blog post is based on work I did during an internship at Google Research in Mountain View, California, published in this AAAI’18 paper, the BibTeX of which, just in case you want to cite it, is:

@inproceedings{byte-level2018kenter,
title={Byte-level Machine Reading across Morphologically Varied Languages},
author={Tom Kenter and Llion Jones and Daniel Hewlett},
booktitle={Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)},
year={2018}
}