Posted by Robby Neale, Software Engineer
TensorFlow provides a wide range of ops that greatly aid in building models from images and video. However, many models begin with text, and the language models built from that text require some preprocessing before the text can be fed into the model. For example, the Text Classification tutorial that uses the IMDB dataset begins with text data that has already been converted into integer IDs. Preprocessing done outside the graph can create skew if it differs between training and inference, and keeping the two sets of preprocessing steps coordinated requires extra work.
TF.Text is a TensorFlow 2.0 library that can be easily installed with pip. It is designed to ease this problem by providing ops for the preprocessing regularly found in text-based models, along with other features useful for language modeling that core TensorFlow does not provide. The most common of these operations is text tokenization, the process of breaking up a string into tokens. Commonly, these tokens are words, numbers, and/or punctuation.
Each of the included tokenizers returns RaggedTensors, with the innermost dimension of tokens mapping to the original individual strings. As a result, the rank of the output is one greater than the rank of the input: tokenizing a batch of N strings (shape [N]) yields a ragged tensor of shape [N, (number of tokens)], where the number of tokens can differ per string. If you are unfamiliar with RaggedTensors, please review the ragged tensor guide.
We are initially making available three new tokenizers (as proposed in a recent RFC). The most basic is the whitespace tokenizer, which splits UTF-8 strings on ICU-defined whitespace characters (e.g., space, tab, newline).
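To make the output shape concrete, here is a minimal plain-Python sketch of whitespace tokenization over a batch of strings. It is not TF.Text's implementation: the function name `whitespace_tokenize` is hypothetical, nested Python lists stand in for a RaggedTensor, and `str.split()` uses Python's notion of whitespace, which is assumed to be close to (but not necessarily identical to) the ICU definition.

```python
def whitespace_tokenize(batch):
    """Split each string on whitespace (hypothetical helper, not a TF.Text API).

    The input is a flat list of strings (rank 1). The output is a list of
    token lists (rank 2) whose inner lengths may differ, which is the same
    ragged structure described above.
    """
    return [s.split() for s in batch]


docs = ["everything not saved will be lost.", "Sad☹"]
tokens = whitespace_tokenize(docs)

print(tokens)
# [['everything', 'not', 'saved', 'will', 'be', 'lost.'], ['Sad☹']]

# One row per input string, with a different number of tokens in each row.
print([len(row) for row in tokens])
# [6, 1]
```

Note that the trailing period stays attached to "lost." because only whitespace acts as a separator.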
The initial release also includes a Unicode script tokenizer, which splits UTF-8 strings based on Unicode script boundaries. Unicode scripts are collections of characters and symbols that have historically related language derivations. View the International Components for Unicode (ICU) UScriptCode values for the complete set of enumerations. It’s worth noting that this tokenizer is similar to the whitespace tokenizer, with the most apparent difference being that it splits punctuation (USCRIPT_COMMON) from language texts.
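The idea of splitting on script boundaries can be sketched in plain Python. This is only an approximation under stated assumptions: the standard library does not expose ICU's UScriptCode values, so the hypothetical helper `approx_script` below guesses a letter's script from the first word of its Unicode character name and treats every non-letter, non-whitespace character as "COMMON". Real script assignment (for example for combining marks) is more involved.

```python
import itertools
import unicodedata


def approx_script(ch):
    """Rough stand-in for a Unicode script lookup (assumption, not ICU)."""
    if ch.isspace():
        return None  # whitespace separates tokens and is dropped
    if unicodedata.category(ch).startswith("L"):
        # e.g. "LATIN SMALL LETTER A" -> "LATIN"
        return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "COMMON"  # punctuation, digits, symbols


def script_tokenize(s):
    """Group consecutive characters that share the same approximate script."""
    return [
        "".join(group)
        for script, group in itertools.groupby(s, key=approx_script)
        if script is not None
    ]


print(script_tokenize("everything not saved will be lost."))
# ['everything', 'not', 'saved', 'will', 'be', 'lost', '.']
print(script_tokenize("Sad☹"))
# ['Sad', '☹']
```

Compared with splitting on whitespace alone, the period and the symbol now become tokens of their own.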
The final tokenizer provided in the TF.Text launch is a wordpiece tokenizer. It is an unsupervised text tokenizer that requires a predetermined vocabulary, which it uses to split tokens further into subwords (prefixes and suffixes). Wordpiece is commonly used in BERT models.
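As a rough illustration of how a fixed vocabulary drives subword splitting, the sketch below uses greedy longest-match-first lookup, which is how wordpiece is commonly described for BERT. Several things here are assumptions rather than details from this post: the function name `wordpiece`, the "##" marker for word-continuing pieces, the "[UNK]" fallback, and the tiny example vocabulary.

```python
def wordpiece(token, vocab, unk="[UNK]", cont="##"):
    """Split one token into subwords by greedy longest-match-first lookup.

    Hypothetical helper for illustration; not the TF.Text implementation.
    """
    pieces = []
    start = 0
    while start < len(token):
        end = len(token)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = token[start:end]
            if start > 0:
                candidate = cont + candidate  # mark a non-initial piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole token is unknown
        pieces.append(match)
        start = end
    return pieces


# Made-up vocabulary, for illustration only.
vocab = {"they", "##'", "##re", "the", "great", "##est"}

print(wordpiece("greatest", vocab))  # ['great', '##est']
print(wordpiece("they're", vocab))   # ['they', "##'", '##re']
print(wordpiece("xyz", vocab))       # ['[UNK]']
```

Because the vocabulary is fixed ahead of time, tokens are typically produced by one of the simpler tokenizers first and then passed through a step like this one.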
Each of these tokenizers operates on UTF-8 encoded strings and includes an option for getting byte offsets into the original string. This lets the caller know where each token starts and ends, in bytes, within the original string.
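Byte offsets differ from character offsets as soon as a string contains multi-byte UTF-8 characters. The plain-Python sketch below shows the idea for whitespace splitting. The function name `whitespace_tokenize_with_offsets` and the convention of an exclusive end offset are assumptions made for this illustration, and the regular expression only treats ASCII whitespace as a separator.

```python
import re


def whitespace_tokenize_with_offsets(s):
    """Return tokens plus their start/end byte offsets in the UTF-8 encoding.

    Hypothetical helper for illustration; end offsets are exclusive.
    """
    data = s.encode("utf-8")
    tokens, starts, ends = [], [], []
    # On bytes, \S+ matches runs of bytes that are not ASCII whitespace.
    for m in re.finditer(rb"\S+", data):
        tokens.append(m.group().decode("utf-8"))
        starts.append(m.start())
        ends.append(m.end())
    return tokens, starts, ends


tokens, starts, ends = whitespace_tokenize_with_offsets("Sad☹ day")
print(tokens)  # ['Sad☹', 'day']
print(starts)  # [0, 7]
print(ends)    # [6, 10]
```

The symbol "☹" takes three bytes in UTF-8, so the first token spans bytes 0 to 6 even though it is only four characters long. Slicing the encoded bytes with these offsets recovers each token exactly.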
This just scratches the surface of TF.Text. Along with these tokenizers, we are also including ops for normalization, n-grams, sequence constraints for labeling, and more! We encourage you to visit our GitHub repository and try using these ops in your own model development. Installation is easy with pip:
pip install tensorflow-text
And for more in-depth working examples, please view our Colab notebook. It includes a variety of code snippets for many of the newly available ops not discussed here. We look forward to continuing this effort and providing even more tools to make your language models easier to build in TensorFlow.