Introducing TF.Text

Jun 10, 2019

Posted by Robby Neale, Software Engineer

TensorFlow provides a wide range of ops that greatly aid in building models from images and video. However, many models begin with text, and the language models built from them require some preprocessing before the text can be fed into the model. For example, the Text Classification tutorial that uses the IMDB dataset begins with text data that has already been converted into integer IDs. Preprocessing done outside the graph can create skew if it differs between training and inference, and it takes extra work to keep the two sets of preprocessing steps coordinated.

TF.Text is a TensorFlow 2.0 library that can be easily installed with pip. It is designed to ease this problem by providing ops for the preprocessing regularly found in text-based models, along with other features useful for language modeling that core TensorFlow does not provide. The most common of these operations is text tokenization, the process of breaking up a string into tokens. Commonly, these tokens are words, numbers, and/or punctuation.

Each of the included tokenizers returns RaggedTensors, with the innermost dimension of tokens mapping to the original individual strings. As a result, the rank of the resulting shape is increased by one. This is illustrated below, but please also review the ragged tensor guide if you are unfamiliar with RaggedTensors.
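The rank change can be pictured without TensorFlow at all. The following is a minimal plain-Python sketch, not TF.Text code: nested lists stand in for a RaggedTensor, and tokenize_batch and rank are hypothetical helper names introduced only for this illustration.

```python
def tokenize_batch(strings):
    """Split each string on whitespace; rows may end up with different lengths."""
    return [s.split() for s in strings]


def rank(nested):
    """Count the nesting depth of a (possibly ragged) list of lists."""
    depth = 0
    while isinstance(nested, list):
        depth += 1
        # Descend into the first element; stop on an empty list.
        nested = nested[0] if nested else None
    return depth


batch = ["everything not saved will be lost.", "Sad☹"]  # rank 1: a list of strings
tokens = tokenize_batch(batch)                          # rank 2: one row per string

print(rank(batch), rank(tokens))        # 1 2
print([len(row) for row in tokens])     # [6, 1] -> rows are ragged
```

Each input string becomes one row of tokens, and because rows differ in length the result cannot be a regular dense array. That is the situation a RaggedTensor is meant to represent.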


We are initially making available three new tokenizers (as proposed in a recent RFC). The most basic is the whitespace tokenizer, which splits UTF-8 strings on ICU-defined whitespace characters (e.g. space, tab, newline).
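As a rough illustration of the behavior described above, here is a plain-Python approximation. It assumes Python's built-in notion of Unicode whitespace is close enough to ICU's for everyday text, which is an approximation and not a guarantee; whitespace_tokenize is a hypothetical name, not a TF.Text function.

```python
def whitespace_tokenize(s):
    # str.split() with no arguments splits on runs of whitespace
    # (space, tab, newline, ...) and never yields empty tokens.
    return s.split()


print(whitespace_tokenize("never tell me\tthe  odds.\nNever!"))
# ['never', 'tell', 'me', 'the', 'odds.', 'Never!']
```

Note that punctuation stays attached to the neighboring word ("odds." and "Never!"), since only whitespace acts as a separator.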

The initial release also includes a Unicode script tokenizer, which splits UTF-8 strings based on Unicode script boundaries. Unicode scripts are collections of characters and symbols that have historically related language derivations. View the International Components for Unicode (ICU) UScriptCode values for the complete set of enumerations. It’s worth noting that this tokenizer is similar to the whitespace tokenizer, with the most apparent difference being that it splits punctuation (USCRIPT_COMMON) from language text (e.g. USCRIPT_LATIN, USCRIPT_CYRILLIC, etc.).
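The idea of splitting on script boundaries can be sketched in plain Python. This is only an approximation under stated assumptions: the standard library has no Unicode script property, so the sketch guesses a letter's script from the first word of its Unicode character name and treats every non-letter (punctuation, digits, symbols) as "COMMON". It is not how ICU or TF.Text assigns scripts, and script_label and script_tokenize are hypothetical names.

```python
import unicodedata
from itertools import groupby


def script_label(ch):
    """Return a rough script label for one character, or None for whitespace."""
    if ch.isspace():
        return None
    if unicodedata.category(ch).startswith("L"):
        # e.g. 'LATIN SMALL LETTER A' -> 'LATIN' (heuristic, not a real script lookup)
        return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "COMMON"


def script_tokenize(s):
    """Group consecutive characters that share a label; drop whitespace runs."""
    return [
        "".join(group)
        for label, group in groupby(s, key=script_label)
        if label is not None
    ]


print(script_tokenize("everything not saved will be lost."))
# ['everything', 'not', 'saved', 'will', 'be', 'lost', '.']
print(script_tokenize("Sad☹"))
# ['Sad', '☹']
```

Compared with a whitespace-only split, the trailing period and the emoticon come out as separate tokens because their label differs from that of the Latin letters next to them.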

The final tokenizer provided in the TF.Text launch is a wordpiece tokenizer. It is an unsupervised text tokenizer that requires a predetermined vocabulary to split tokens further into subwords (prefixes and suffixes). Wordpiece is commonly used in BERT models.
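To make the subword splitting concrete, here is a minimal sketch of greedy longest-match-first lookup, the strategy commonly associated with wordpiece in BERT. The tiny vocabulary, the "##" marker for non-initial pieces, the "[UNK]" fallback, and the function name wordpiece are all assumptions made for illustration; the sketch is not the TF.Text implementation.

```python
def wordpiece(token, vocab, unk="[UNK]", suffix_indicator="##"):
    """Split one token into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(token):
        end = len(token)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = token[start:end]
            if start > 0:
                candidate = suffix_indicator + candidate  # mark non-initial pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole token is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabulary, chosen only for this example.
vocab = {"they", "##'", "##re", "the", "great", "##est"}

print([wordpiece(t, vocab) for t in ["they're", "the", "greatest"]])
# [['they', "##'", '##re'], ['the'], ['great', '##est']]
```

The vocabulary must exist before tokenization, which is why the tokenizer needs it as an input. Tokens that cannot be covered by vocabulary pieces fall back to the unknown marker.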

Each of these tokenizers operates on UTF-8 encoded strings and includes an option for getting byte offsets into the original string. This lets the caller know the byte alignment in the original string for each token that was created.
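Byte offsets differ from character positions as soon as a string contains multi-byte UTF-8 characters. The sketch below shows the idea in plain Python, under the assumption that only ASCII whitespace separates tokens; whitespace_tokenize_with_offsets is a hypothetical helper, and the start/limit naming (limit is exclusive) is a convention chosen for this example.

```python
import re


def whitespace_tokenize_with_offsets(s):
    """Return tokens plus their start and (exclusive) limit byte offsets in UTF-8."""
    data = s.encode("utf-8")
    tokens, starts, limits = [], [], []
    # On a bytes pattern, \S matches any byte that is not ASCII whitespace,
    # so multi-byte characters are never split in the middle.
    for m in re.finditer(rb"\S+", data):
        tokens.append(m.group().decode("utf-8"))
        starts.append(m.start())
        limits.append(m.end())
    return tokens, starts, limits


tokens, starts, limits = whitespace_tokenize_with_offsets("Sad☹ day")
print(tokens)   # ['Sad☹', 'day']
print(starts)   # [0, 7]
print(limits)   # [6, 10]
```

Here "☹" occupies three bytes, so "day" starts at byte 7 even though it is the sixth character (index 5) of the string. Slicing the encoded bytes with a token's start and limit recovers that token exactly.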


This just scratches the surface of TF.Text. Along with these tokenizers, we are also including ops for normalization, n-grams, sequence constraints for labeling, and more! We encourage you to visit our GitHub repository and try using these ops in your own model development. Installation is easy with pip:

pip install tensorflow-text

For more in-depth working examples, please view our Colab notebook. It includes a variety of code snippets for many of the newly available ops not discussed here. We look forward to continuing this effort and providing even more tools to make your language models easier to build in TensorFlow.

