
To read the sentence “Where is Amsterdam”, a word-level model would need 3 steps, one for each word. A byte- or character-level model would need 18, one for each character, counting the two spaces. Average word length varies across languages, but it is unavoidable for character-level models to need…
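The step counts above can be checked with a few lines of Python. This is only a sketch: it assumes that "words" are whitespace-separated tokens and that the text is encoded as UTF-8. The variable names are illustrative.

```python
sentence = "Where is Amsterdam"

# Word-level: one step per whitespace-separated token.
word_steps = len(sentence.split())

# Character-level: one step per character, spaces included.
char_steps = len(sentence)

# Byte-level: one step per UTF-8 byte. For plain ASCII text this
# equals the character count; non-ASCII characters take more bytes.
byte_steps = len(sentence.encode("utf-8"))

print(word_steps, char_steps, byte_steps)  # 3 18 18
```

For text outside ASCII, the byte count exceeds the character count, so a byte-level model needs even more steps than a character-level one.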
One detail here is sometimes a source of confusion: the bytes themselves are not the embeddings. Rather, they are used as indices into an embedding matrix. So, byte 0000 0010 correspo…
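The lookup described above can be illustrated with a minimal sketch in plain Python. Everything here is an assumption made for illustration: the matrix has 256 rows (one per possible byte value), the embedding width `EMBED_DIM` is a made-up small number, the values are random rather than learned, and rows are indexed from zero, so byte 0000 0010 (decimal 2) selects the row at index 2.

```python
import random

VOCAB_SIZE = 256  # one row per possible byte value
EMBED_DIM = 4     # hypothetical width, kept tiny for readability

# Stand-in for a learned embedding matrix: random values, fixed seed.
random.seed(0)
embedding_matrix = [
    [random.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(text):
    """Map each UTF-8 byte of `text` to its row in the embedding matrix."""
    return [embedding_matrix[b] for b in text.encode("utf-8")]

# The byte is only an index; the embedding is the row it points to.
byte_value = 0b00000010                # decimal 2
vector = embedding_matrix[byte_value]  # a list of EMBED_DIM floats

vectors = embed("Where is Amsterdam")
print(byte_value, len(vectors), len(vectors[0]))  # 2 18 4
```

In a real model the rows would be trained parameters, but the indexing step is the same: the byte value never enters the network directly, only the vector stored at that position does.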