Weaknesses of WordPiece Tokenization

Findings from the front lines of NLP at VMware

Rick Battle
Mar 29, 2021

In the VMware R&D AI Lab we’ve been working with BERT since Google released it back in 2018. We were impressed by its performance on standard benchmarks (GLUE, SQuAD, etc.), so we expected it to show a similar level of improvement on our internal benchmarks. Long story short, it was better than our existing models, but not as dramatically better as we had expected it to be. After a deep dive into the numerous failure cases, we linked many of the failures back to poor tokenization. In this article we will:

  • Summarize the weaknesses we’ve found with WordPiece tokenization that caused BERT to not be a silver bullet
  • Show you what we’re doing at VMware to overcome those weaknesses
  • Demonstrate a statistically significant performance improvement on our internal benchmarks with no model changes — solely by correcting the tokenization errors

A Brief Introduction to WordPiece Tokenization

WordPiece tokenization takes unstructured text and prepares it for ingestion into a machine learning model (most commonly BERT or other Transformer-based models). It breaks every word into a root token and as many sub-tokens as are required to build the original word.

Correct Examples:

"abstracting" ⇒ [ 'abstract', '##ing' ]"alerting" ⇒ [ 'alert', '##ing' ]"visualize" ⇒ [ 'visual', '##ize' ]"virtualize" ⇒ [ 'virtual', '##ize' ]"foundational" ⇒ [ 'foundation', '##al' ]"workload" ⇒ [ 'work', '##load' ]

The intent is to have a dramatically smaller, yet more expressive vocabulary, thus maximizing the semantic overlap between different versions of a single word. This works well for conjugated terms and suffixes. The model isn’t forced to learn that “virtualize” is the verb form of “virtual” separately. Instead, it sees two occurrences of the word “virtual”, thus making the learned embedding for that token more robust.
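If you want to poke at these splits yourself, the Hugging Face transformers library makes it easy. Here is a minimal sketch; the exact sub-tokens depend on the vocabulary of the checkpoint you load (bert-base-uncased is assumed below):

from transformers import AutoTokenizer

# Inspect WordPiece splits for yourself. The exact sub-tokens depend on the
# checkpoint's vocabulary; bert-base-uncased is assumed here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for word in ["virtualize", "workload", "vSphere", "Kubernetes"]:
    print(word, "=>", tokenizer.tokenize(word))

# With that vocabulary you should see splits like the examples in this article,
# e.g. "virtualize" => ['virtual', '##ize'] and "vSphere" => ['vs', '##pher', '##e'].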

Unfortunately, this fails for words that don’t have a good root, like product names and technical jargon, and for multi-compound words, which appear not to have been considered when WordPiece was designed.

Incorrect Examples:

"vSphere" ⇒ [ 'vs', '##pher', '##e' ]"Kubernetes" ⇒ [ 'ku', '##ber', '##net', '##es' ]"acceptunverifiedcertificates" ⇒ [ 'accept', '##un', '##ver', '##ified', '##cer', '##ti', '##fi', '##cate', '##s' ]

The first two examples are demonstrations of the out-of-vocabulary problem. There is no good root term in BERT’s vocabulary for vSphere or Kubernetes, so they get split into little pieces. The third is an example of what I’ve come to call the sub-token soup problem: a meaningless sequence of sub-tokens that doesn’t enhance the model’s understanding of what’s being said.

This set of problems really became apparent when we were attempting to use BERT for a question-answering task based on VMware documentation. Any time comprehension of the question (or selection of the correct answer) relied on understanding a technical term or reading a compound word, the model returned a (typically amusing but) completely useless answer.

The remainder of this article will identify the sources of sub-token soup and propose a solution for each one. Our solution for out-of-vocabulary terms will be the subject of a future article.

The Spelling Mistake Problem

Spelling mistakes can cause a massive decrease in model performance. This problem was quantified in the paper User Generated Data: Achilles’ Heel of BERT, in which the researchers demonstrated the following reduction in model performance as a function of spelling error rate (the percentage of words in a given input string that are misspelled):

Spelling Error Rate | Classification Score
--------------------+---------------------
0% | 0.89
5% | 0.78
10% | 0.60
15% | 0.45
20% | 0.35

If you think a 20% spelling error rate is excessive (as I first did while reading the paper), consider the following example:

A common question people ask Google is, “What is vSphere?”

If I feed that question into our Automated Question Answering system, it will happily return “VMware vSphere is a suite of software components for virtualization.” A perfect answer.

If I mangle the product name by even a single letter, “What is vShere?”

(Note the missing “p”. This string has a 25% spelling error rate if you count the question mark, and a 33% spelling error rate if you don’t!)

The system comes back with, “You can enable or disable vSGX when you create a virtual machine or edit an existing virtual machine.” An answer unrelated to the question.

The reading comprehension model was fine-tuned on SQuAD 2. If you check the score, it’s below the “answer found” threshold, so the model is actually saying “No answer found”. In the entirety of our publicly available content, the question, “What is vShere?” is unanswerable because of a single missing letter.

Why does a single missing letter cause such an astounding failure? The answer lies in how the word is tokenized. For our in-house version of BERT, vBERT, we’ve added vSphere to its vocabulary, so “vSphere” is tokenized as itself, just in lowercase: [ “vsphere” ]. When it’s misspelled without the “p”, it gets tokenized as [ “vs”, “##her”, “##e” ]. And remember, BERT doesn’t see either the original input characters or the characters of the sub-tokens. It doesn’t know that your spelling was only off by one letter. Each token is simply mapped to the index of its pre-trained vector. Semantically, it can be thought of as the difference between our product name “vSphere”, and the abbreviation of “versus”, the female pronoun “her”*, and the letter “e”*. That’s what I’m talking about when I say “sub-token soup”: a string of meaningless tokens that only confuses the model.

*Note: The sub-tokens “##her” and “##e” don’t actually have the same semantic meaning as “her” and “e”. This is simply an illustrative analogy since it can be challenging to conceptualize the semantic meaning of sub-words. The point is, “vsphere” and the sequence “vs”, “##her”, “##e” will have extremely different vector representations, which is the ultimate cause of failure.

The Compound Word Problem

WordPiece was designed to handle suffixes and simple compound words, which it does quite well. The word “endpoint” is tokenized as [ “end”, “##point” ]. Even “endpointtype” is tokenized well to [ “end”, “##point”, “##type” ]. Those splits perfectly capture what’s being said. WordPiece fails when it gets the word boundaries wrong. Take “forwarderendpoint” for example. You and I can easily parse that out to “forwarder endpoint”, but WordPiece will tokenize it as [ “forward”, “##ere”, “##nd”, “##point” ]. It got “forward” and “##point” right, but because WordPiece is greedy, it picked the largest sub-token it could find, so it picked “##ere” instead of “##er” and subsequently got “##nd” instead of “##end”. And, as we saw in the previous section, being off by a single letter can have dramatic downstream effects.
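To see exactly where the greediness bites, here is a simplified re-implementation of WordPiece’s longest-match-first loop. This is an illustrative sketch, not Google’s code, and the toy vocabulary is chosen purely to reproduce the “forwarderendpoint” example:

# Simplified longest-match-first WordPiece loop (illustrative only, not
# Google's implementation). The toy vocabulary is chosen to reproduce the
# "forwarderendpoint" example above.
def wordpiece(word, vocab, unk="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end, current = len(word), None
        # Greedily try the longest remaining substring first.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                current = piece
                break
            end -= 1
        if current is None:
            return [unk]  # no piece matched; the whole word becomes unknown
        tokens.append(current)
        start = end
    return tokens

toy_vocab = {"forward", "##er", "##ere", "##end", "##nd", "##point"}
print(wordpiece("forwarderendpoint", toy_vocab))
# -> ['forward', '##ere', '##nd', '##point'] because '##ere' beats '##er'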

Our documentation contains tens of thousands of compound words. Being highly technical in nature, it contains vast amounts of things like API calls, code fragments, system logs, and lots of other forms of non-standard English, all of which tend to carry an abnormally high number of compound words. Here are just a few more examples containing “endpoint”:

"ipamendpointid" ⇒ [ 'ipa', '##men', '##dp', '##oint', '##id' ]"ldpprotocolendpoint" ⇒ [ 'ld', '##pp', '##rot', '##oco', '##len', '##dp', '##oint' ]"nioendpoint" ⇒ [ 'ni', '##oe', '##nd', '##point' ]"udpendpointprocess" ⇒ [ 'ud', '##pen', '##dp', '##oint', '##pro ##ces', '##s' ]"vapiendpoint" ⇒ [ 'va', '##pie', '##nd', '##point' ]"vsphereendpointname" ⇒ [ 'vs', '##pher', '##een', '##dp', '##oint', '##name' ]

Hopefully, at this point, you’re starting to get a good sense of what sub-token soup really looks like. I also hope you’re not getting tired of seeing it because there’s plenty more where that came from …

The Prefix Problem

As stated previously, WordPiece was designed to handle suffixes and simple compound words. It has no concept of prefixes. So, if the word you’re trying to tokenize wasn’t popular enough to make it into the root vocabulary with its prefix, weird things happen. Here are just a few examples:

"descheduled" ⇒ [ 'des', '##ched', '##uled' ]"deprioritize" ⇒ [ 'dep', '##rio', '##rit', '##ize' ]"disaggregated" ⇒ [ 'di', '##sa', '##gg', '##re', '##gated' ]"disallowed" ⇒ [ 'di', '##sal', '##lowe', '##d' ]"disconnection" ⇒ [ 'disco', '##nne', '##ction' ]"nonexistent" ⇒ [ 'none', '##xi', '##sten', '##t' ]"repackaging" ⇒ [ 'rep', '##ack', '##aging' ]"repopulate" ⇒ [ 'rep', '##op', '##ulate' ]"rescheduling" ⇒ [ 'res', '##ched', '##ulin', '##g' ]"unreachable" ⇒ [ 'un', '##rea', '##cha', '##ble' ]"unprotects" ⇒ [ 'un', '##pro', '##tec', '##ts' ]"unpartitioned" ⇒ [ 'un', '##par', '##ti', '##tion', '##ed' ]

If the idea behind WordPiece is to increase model understanding by decreasing vocabulary size and relying on the common meaning between previously seen sub-tokens, then prefixes are really throwing things off. And, that’s before we consider prefixed words inside compound words …

"vspheredisconnection" ⇒ [ 'vs', '##pher', '##ed', '##is', '##con', '##ne', '##ction' ]

When the same word is tokenized differently depending on what surrounds it, the model’s pre-trained token weights no longer line up with the semantic meaning of what’s being said.

The Abbreviation Problem

People are lazy. They get tired of typing complete words over and over again. WordPiece has no ability to map between the abbreviated form and the long form of a word. Here are a few examples:

"command" ⇒ "cmd" ⇒ [ 'cm', '##d' ]"configuration" ⇒ "config" ⇒ [ 'con', '##fi', '##g' ]"parameter" ⇒ "param" ⇒ [ 'para', '##m' ]"utility" ⇒ "util" ⇒ [ 'ut', '##il' ]

The Lazy (and Wrong) Answer

The modern solution to the problems described above is generally to throw more pre-training data at ever-larger models and let them sort it all out themselves. With an input corpus like C4 (Colossal Clean Crawled Corpus) and models like T5 (Text-To-Text Transfer Transformer), who cares that there’s no sub-token overlap between virtualize [ “virtual”, “##ize”] and unvirtualize [ “un”, “##vir”, “##tua”, “##li”, “##ze” ]? Given enough examples, the model will eventually learn robust representations of the sub-token soup.

A Better Answer

Unfortunately, the lazy answer isn’t available to us. While more content containing plain English is easy to come by, there is a finite amount of VMware-specific content … for whatever reason, people don’t go around discussing disaggregating their unpartitioned vSAN clusters via vAPIEndpoints. VMware-speak (or any niche topic full of jargon) can and should be treated more like a low-resource language. As such, we need to solve the vocabulary mismatch problem on the front end, since we don’t have the amount of content that would be needed to train a model to see through the noise.

To accomplish this, we built a VMware-specific Natural Language Preprocessor (vNLP). It’s a pipeline of selectable stages that performs various preprocessing actions. The first stage fixes encoding problems, such as mojibake, and decodes HTML character codes. The second stage strips repeated characters. For example, people like to draw lines in text to put a separator in their content, like:

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Lines like that are semantically meaningless and eat up precious tokens, so we strip them out. The third stage strips the remaining special characters, like “[]{}()`~!@#$%^&*-=_+;:<>”. They also tend to be semantically meaningless and are thus wasted tokens. In the fourth stage, we use regex to transform known patterns into their semantic names. For example:

"2018-05-03T01:59:50.817Z +0000" ⇒ "timestamp""2620:124:6020:c002:ffff:ffff:a097:9e39/128" ⇒ "ip address""127.0.0.1:8080" ⇒ "ip address""00:00:0A:BB:28:FC" ⇒ "mac address""7.0.0-0.0.32156387" ⇒ "version number""5accf9a3-226ee135-7ca6-0025b521a1b4" ⇒ "volume id""0xFFC67FF0" ⇒ "hex id"

In general, the language model doesn’t need to know the exact IP address; it just needs to know that there’s an IP address there. This translates long strings of what would otherwise be noise (there are 18 tokens in the timestamp above) into a semantically meaningful word or two. The remaining stages attempt to deal with the sub-token soup problems laid out in the first half of this article.
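We can’t share vNLP itself, but a heavily simplified sketch of these early stages might look like the following. The ftfy library and the handful of regex patterns below are illustrative stand-ins for the much larger rule set we actually maintain, and this sketch applies the pattern replacement before the special-character strip so the colons and dashes survive long enough to be matched:

# A heavily simplified sketch of the early vNLP stages described above.
# The real pipeline has many more rules; the patterns here are illustrative.
import html
import re

import ftfy  # pip install ftfy -- fixes mojibake

# Known formats replaced by a semantic name.
PATTERNS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z?( \+\d{4})?\b"), "timestamp"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}(?::\d+)?\b"), "ip address"),
    (re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b"), "mac address"),
    (re.compile(r"\b0x[0-9A-Fa-f]+\b"), "hex id"),
]

def preprocess(text: str) -> str:
    # Fix encoding problems (mojibake, HTML character codes).
    text = html.unescape(ftfy.fix_text(text))
    # Collapse runs of repeated separator characters ("-----", "=====", ...).
    text = re.sub(r"([-=_*#~.])\1{3,}", " ", text)
    # Replace known patterns with their semantic name (done before the strip
    # below so ':' and '-' are still present for the regexes to match).
    for pattern, name in PATTERNS:
        text = pattern.sub(name, text)
    # Strip remaining special characters.
    text = re.sub(r"[\[\]{}()`~!@#$%^&*\-=_+;:<>]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("Host 127.0.0.1:8080 failed at 2018-05-03T01:59:50.817Z +0000 ----------"))
# -> "Host ip address failed at timestamp"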

Solving the Spelling Mistake Problem

This is going to sound obvious … because it is. The best way to handle spelling mistakes is to … correct them. So, we trained a VMware-specific spelling corrector. The first version was trained only on VMware-specific content. It worked quite well for correcting VMware-specific content, but when we tested it on more general content, like an email that contained VMware-specific words, we found it attempted to correct normal English that wasn’t wrong; it was just outside of its training corpus. So, for version two, we trained it on a combination of VMware-specific content, a hand-cleansed copy of r/vmware, and “high quality” Wikipedia content. This worked a lot better for both VMware-specific content and more general content. It still struggles to correct spelling mistakes made inside compound words, but that’s a research problem for another day.
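We can’t publish the corrector itself, but as a mental model, a classic frequency-based corrector in the spirit of Peter Norvig’s well-known essay, trained on a combined in-domain and general-English word list, is a reasonable approximation. In the sketch below, domain_corpus.txt is a placeholder for whatever training text you use; our production corrector is more sophisticated:

# A classic frequency-based spelling corrector, shown only as a mental model
# of the approach -- our production corrector is more sophisticated.
# "domain_corpus.txt" is a placeholder for your in-domain + general-English text.
import re
from collections import Counter

WORDS = Counter(re.findall(r"[a-z0-9]+", open("domain_corpus.txt").read().lower()))

def edits1(word):
    letters = "abcdefghijklmnopqrstuvwxyz0123456789"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    if word in WORDS:  # already a known word, leave it alone
        return word
    candidates = [w for w in edits1(word) if w in WORDS]
    return max(candidates, key=WORDS.get) if candidates else word

print(correct("vshere"))  # -> "vsphere", provided the training corpus contains it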

Solving the Compound Word Problem

After looking at thousands of compound words that were tokenized incorrectly, the solution to the compound word problem became as obvious as the solution to the spelling mistake problem. The best way to handle compound words is to split them up. You and I don’t read compound words as one long word. We break them up in our heads, then read each word individually. Language models can’t do that, so we have to break the compound words up for them.

"ipamendpointid" ⇒ "ipam endpoint id""ldpprotocolendpoint" ⇒ "ldp protocol endpoint""nioendpoint" ⇒ "nio endpoint""udpendpointprocess" ⇒ "udp endpoint process""vapiendpoint" ⇒ "v api endpoint""vsphereendpointname" ⇒ "vsphere endpoint name"

This turned out to be orders of magnitude harder than fixing spelling mistakes. The solution is delicate, to say the least. Getting things to split correctly involves fiddling with word occurrence counts* and gets very complicated when you have acronyms inside compound words. This stage got so complicated, we ended up having to build a regression suite to track the changes, since fixing one incorrect split could cause dozens of new incorrect splits. As of today, our regression suite has almost 22,000 tests in it with just under 250 known failure cases. A significant portion of the known failures involve spelling mistakes. For example:

Current Output:
"addgoupurlsetting" ⇒ "add go up url setting"
Expected Output:
"addgoupurlsetting" ⇒ "add group url setting"

*When deciding how to split compound words, like “diskspace” for example, the algorithm must decide between “disk space” (correct) and “disks pace” (incorrect). To do so, the algorithm checks the word occurrence counts for each possible sub-word and computes the most likely split. If it produces the incorrect split, as in this case, I would try raising the occurrence counts for “disk” and “space” and lowering the count for “pace” until it got the split correct … then re-run the regression suite to see how many new incorrect splits I caused by changing the counts … and go futz with more occurrence counts until it all settled out.
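To make the occurrence-count idea concrete, here is a minimal frequency-based splitter that scores every possible segmentation with unigram counts and keeps the most probable one (similar in spirit to public word-segmentation utilities). The counts below are illustrative; our production stage layers acronym handling and the regression suite on top of this basic idea:

# Minimal frequency-based compound-word splitter, to make the occurrence-count
# idea concrete. COUNTS is an illustrative stand-in for the tuned frequency
# table described above; the production stage has many more rules.
import math
from functools import lru_cache

COUNTS = {"disk": 900, "space": 800, "disks": 300, "pace": 200,
          "endpoint": 700, "vsphere": 500, "name": 1000}
TOTAL = sum(COUNTS.values())

def score(word):
    # Log-probability of a candidate sub-word; unknown sub-words are heavily penalized.
    return math.log(COUNTS.get(word, 0.01) / TOTAL)

@lru_cache(maxsize=None)
def split(text):
    if not text:
        return 0.0, []
    best = (float("-inf"), [text])
    for i in range(1, len(text) + 1):
        head, tail = text[:i], text[i:]
        tail_score, tail_words = split(tail)
        candidate = (score(head) + tail_score, [head] + tail_words)
        if candidate[0] > best[0]:
            best = candidate
    return best

print(split("diskspace")[1])            # -> ['disk', 'space']
print(split("vsphereendpointname")[1])  # -> ['vsphere', 'endpoint', 'name']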

Solving the Prefix Problem

This is the most controversial stage in our preprocessing pipeline. If I could wave a magic wand, I’d add the concept of prefixes to WordPiece. Something like:

"descheduled" ⇒ [ 'de##', 'scheduled' ]"disallowed" ⇒ [ 'dis##', 'allow', '##ed' ]"nonexistent" ⇒ [ 'non##', 'existent' ]"repopulate" ⇒ [ 're##', 'pop', '##ulate' ]"unreachable" ⇒ [ 'un##', 'reach', '##able' ]

That would be the ideal answer.

But, since we don’t live in an ideal world, our solution is to treat prefixes like compound words and split them off. If the prefixed form of a word isn’t in BERT’s vocabulary and isn’t sufficiently common for us to assume BERT is already familiar with the WordPiece-split version of the word, then the prefix gets chopped off:

"descheduled" ⇒ "de scheduled" ⇒ [ 'de', 'scheduled' ]"unprotects" ⇒ "un protects" ⇒ [ 'un', 'protects' ]"unpartitioned" ⇒ "un partitioned" ⇒ [ 'un', 'partition', '##ed' ]"acceptunverifiedcertificates" ⇒ "accept un verified certificates" ⇒ [ 'accept', 'un', 'verified', 'certificates' ]

Would this strategy work if we were using vanilla BERT? Maybe? But, because we pre-trained vBERT on the split-off prefixes, we believe it can read them just fine. And, since our corpus has a large number of prefixed words which are uncommon in regular English, minimizing sub-token soup is extremely important, given our limited pre-training corpus.
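A minimal version of that rule might look like the following; the prefix list, the frequency threshold, and the toy vocabulary and counts are illustrative stand-ins for vBERT’s vocabulary and our corpus statistics:

# Minimal sketch of the prefix-splitting rule described above. PREFIXES, the
# frequency threshold, and the example vocab/counts are illustrative stand-ins
# for vBERT's vocabulary and our corpus statistics.
PREFIXES = ("un", "non", "de", "dis", "re", "pre", "mis")
FREQ_THRESHOLD = 50  # "sufficiently common" cutoff (assumed value)

def split_prefix(word, vocab, counts):
    # Leave the word alone if BERT already knows it, or if it is common enough
    # that we assume BERT has learned its WordPiece-split form during pre-training.
    if word in vocab or counts.get(word, 0) >= FREQ_THRESHOLD:
        return word
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix) + 2:
            return prefix + " " + word[len(prefix):]
    return word

vocab = {"unhappy"}           # pretend vocabulary
counts = {"unreachable": 10}  # pretend corpus counts
print(split_prefix("descheduled", vocab, counts))  # -> "de scheduled"
print(split_prefix("unhappy", vocab, counts))      # -> "unhappy" (already in vocab)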

Solving the Abbreviation Problem

With the controversial part out of the way, we’re back to exceedingly obvious solutions. The best way to handle abbreviations is to expand them back to the original word.

"addr" ⇒ "address""cmd" ⇒ "command""env" ⇒ "environment""param" ⇒ "parameter""stmt" ⇒ "statement"

For speed reasons, these are context-free replacements. We could do something more sophisticated that would take context into account to handle ambiguous contractions (like “pwd”, which could be “password” or the Linux command “print working directory”, and “desc”, which could be “description” or “descending”). But, vNLP is already a bit slow (still usable for real-time inference, but barely so), and switching from regex to something context-sensitive would add an order of magnitude or more computational complexity, so we expand the ones we can get right without context and let BERT sort out the rest.
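Concretely, context-free replacement boils down to a word-boundary regex driving a lookup table, along the lines of this sketch (the table here is a tiny illustrative subset of the real one):

# Context-free abbreviation expansion: a word-boundary regex driving a lookup
# table. The table here is a tiny illustrative subset of the real one.
import re

EXPANSIONS = {
    "addr": "address",
    "cmd": "command",
    "env": "environment",
    "param": "parameter",
    "stmt": "statement",
}
ABBREV_RE = re.compile(r"\b(" + "|".join(EXPANSIONS) + r")\b", re.IGNORECASE)

def expand_abbreviations(text: str) -> str:
    return ABBREV_RE.sub(lambda m: EXPANSIONS[m.group(1).lower()], text)

print(expand_abbreviations("set the env param in the startup cmd"))
# -> "set the environment parameter in the startup command"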

Performance Improvement

I can’t share a lot of detail about our internal benchmarks, but I can share raw numbers:

Task | Input Cleansed with vNLP | Test Accuracy
-----+--------------------------+--------------
1 | No | 0.9145
1 | Yes | 0.9540
-----+--------------------------+--------------
2 | No | 0.9693
2 | Yes | 0.9750
-----+--------------------------+--------------
3 | No | 0.4983
3 | Yes | 0.5167

Statistically significant improvement across the board!

Conclusion

Real-world data is messy. WordPiece tokenization is an excellent evolution from stemming and lemmatization, but it’s not without its weaknesses. It struggles with spelling mistakes, multi-compound words, prefixes, and abbreviations. In this article, we showed real-world examples of those cases and presented VMware’s Natural Language Preprocessor (vNLP) as a pipeline of solutions to each of WordPiece’s weaknesses.

  • Spelling mistake? Fix it.
  • Compound word? Split it.
  • Prefix? Treat it like a compound word and split it.
  • Abbreviation? Expand it.

Just as with any low-resource language, corporate-speak generally won’t have a large enough corpus to allow the language model to sort out the sub-token soup problem on its own, so we have to do everything we can to minimize vocabulary mismatch.

Disclaimer

The views and opinions expressed here are my own and don’t necessarily represent VMware’s positions, strategies, or opinions.
