Simple NLP in Python

Tokenization With TextBlob package

Natalia Kuzminykh
Quick Code
5 min read · Apr 21, 2022


The amount of textual data on the Internet has risen enormously in the past decades, and there is no doubt that analyzing this information needs to be automated. This is where the TextBlob package becomes a handy tool.

TextBlob is a fairly simple Python library for performing various natural language processing tasks, ranging from part-of-speech tagging, noun phrase extraction, tokenization, and sentiment analysis to classification and more.

Furthermore, it works with both Python 2 and 3 and requires no special technical prerequisites. In this tutorial, I focus on one of the basic NLP tasks that is essential for any novice specialist: tokenization.
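If you want to follow along, TextBlob is available from PyPI. The second command below is the documented step that downloads the NLTK corpora the library relies on under the hood:

pip install textblob
python -m textblob.download_corpora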

What is Tokenization?

Before going deeper into the field of NLP, you first need to understand the difference between two key terms:

  • Corpus (or corpora in plural) is simply a certain collection of language data (like texts). Corpora are commonly used for training different machine learning models for text classification or sentiment analysis.
  • Token is a single unit of text extracted from the primary corpus; in other words, a token is an output of tokenization.

Tokenization, or word segmentation, is the process of splitting the sentences or words of a corpus into small units, i.e. tokens.

Tokenization in Natural Language Processing

Taken together, these concepts can be illustrated with the following sentence:

  • Input (corpus): The evil that men do lives after them
  • Output (tokens): | The | evil | that | men | do | lives | after | them |

Here, the input sentence is tokenized on the basis of spaces between words. One can also tokenize characters from a single word (e.g., a-p-p-l-e from apple) or separate sentences from one text.
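To make this concrete before bringing in any library, plain Python already gives us a naive version of both flavours of tokenization: splitting on spaces for words, and casting to a list for characters.

# Naive whitespace tokenization of the sample sentence
sentence = "The evil that men do lives after them"
print(sentence.split())
# ['The', 'evil', 'that', 'men', 'do', 'lives', 'after', 'them']

# Character-level tokenization of a single word
print(list("apple"))
# ['a', 'p', 'p', 'l', 'e']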

Tokenization is one of the elementary and crucial stages of language processing: it transforms unstructured textual material into data that can be used further in developing models for machine translation, search engine optimization, and numerous business applications.

Implementing Tokenization in Code


First of all, we need to define a sample corpus that will be tokenized in the subsequent analysis, and then wrap it in a TextBlob object. For instance, let's try to tokenize a part of the poem If by Rudyard Kipling:

If you can force your heart and nerve and sinew
To serve your turn long after they are gone,
And so hold on when there is nothing in you
Except the Will which says to them: “Hold on!”

from textblob import TextBlob

# Creating the corpus
corpus = '''If you can force your heart and nerve and sinew to serve your turn long after they are gone. And so hold on when there is nothing in you except the Will which says to them: 'Hold on!'
'''

Once the corpus is created, it should be passed as an argument to the TextBlob constructor:

blob_object = TextBlob(corpus)

After this, we can perform various operations on this blob_object. It already contains our corpus, categorized to a degree.
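Before moving on to the dedicated tokenization attributes, it's worth knowing that a TextBlob behaves much like a native Python string. The quick sketch below relies on the string-like slicing and case methods that TextBlob's documentation describes:

# TextBlob objects support common str operations
print(blob_object[0:6])    # slicing returns a new TextBlob: "If you"
print(blob_object.upper()) # the whole corpus in upper case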

Word Tokenization

Finally, to get the tokenized words, we simply access the words attribute of the created blob_object. This gives us a list of Word objects, which behave very similarly to str objects:

from textblob import TextBlob

corpus = '''If you can force your heart and nerve and sinew to serve your turn long after they are gone. And so hold on when there is nothing in you except the Will which says to them: 'Hold on!'
'''
blob_object = TextBlob(corpus)

# Word tokenization of the sample corpus
corpus_words = blob_object.words
# To see all tokens
print(corpus_words)
# To count the number of tokens
print(len(corpus_words))

These commands should give you the following set of tokens, together with their count:

['If', 'you', 'can', 'force', 'your', 'heart', 'and', 'nerve', 'and', 'sinew', 'to', 'serve', 'your', 'turn', 'long', 'after', 'they', 'are', 'gone', 'and', 'so', 'hold', 'on', 'when', 'there', 'is', 'nothing', 'in', 'you', 'except', 'the', 'Will', 'which', 'says', 'to', 'them', 'Hold', 'on']
38
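Since each token is a textblob.Word rather than a plain string, it also carries word-level helpers on top of the usual str behaviour. A small sketch using the inflection methods from TextBlob's Word API (the expected outputs in the comments assume the token list above):

# Word objects add inflection helpers on top of str behaviour
print(corpus_words[3])             # force
print(corpus_words[3].pluralize()) # forces
print(corpus_words[5].pluralize()) # hearts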

It’s worth noting that this approach splits words mainly on whitespace and drops punctuation, as the output above shows. We can change this behaviour and tokenize on TAB characters instead:

from textblob import TextBlob
from nltk.tokenize import TabTokenizer

# Note the \t after the first sentence: this is what the TabTokenizer splits on
corpus = '''If you can force your heart and nerve and sinew to serve your turn long after they are gone.\tAnd so hold on when there is nothing in you except the Will which says to them: 'Hold on!'
'''
tokenizer = TabTokenizer()
blob_object = TextBlob(corpus, tokenizer=tokenizer)

# Tokenization of the sample corpus with the custom tokenizer
corpus_words = blob_object.tokens
# To see all tokens
print(corpus_words)

Note that we’ve added a TAB (the \t escape) after the first sentence here. Now the tokenized corpus looks something like this:

['If you can force your heart and nerve and sinew to serve your turn long after they are gone.',
"And so hold on when there is nothing in you except the Will which says to them: 'Hold on!'"]

nltk.tokenize contains other tokenization options as well, including a SpaceTokenizer that splits on single spaces, along with useful tokenizers such as LineTokenizer, BlankLineTokenizer and WordPunctTokenizer. Any of them can be passed to TextBlob in the same way as above.
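As a small sketch of this pattern, here is the second line of our corpus tokenized with WordPunctTokenizer, which, unlike the default word tokenization we saw earlier, keeps punctuation as separate tokens:

from textblob import TextBlob
from nltk.tokenize import WordPunctTokenizer

corpus = "And so hold on when there is nothing in you except the Will which says to them: 'Hold on!'"

# WordPunctTokenizer splits alphabetic and punctuation sequences apart
blob_object = TextBlob(corpus, tokenizer=WordPunctTokenizer())
print(blob_object.tokens)
# Punctuation such as the colon and quote marks now shows up as separate tokens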

Sentence Tokenization

To tokenize on a sentence level, we’ll use the same blob_object. This time, instead of the words attribute, we will use the sentences attribute. This returns a list of Sentence objects:

from textblob import TextBlob

corpus = '''If you can force your heart and nerve and sinew to serve your turn long after they are gone. And so hold on when there is nothing in you except the Will which says to them: 'Hold on!'
'''
blob_object = TextBlob(corpus)

# Sentence tokenization of the sample corpus
corpus_sentence = blob_object.sentences
# To identify all tokens
print(corpus_sentence)
# To count the number of tokens
print(len(corpus_sentence))

Output:

[Sentence("If you can force your heart and nerve and sinew to serve your turn long after they are gone"), 
Sentence("And so hold on when there is nothing in you except the Will which says to them: 'Hold on!")]
2
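Each Sentence object exposes the same attributes as the parent blob, so, for example, the word tokenization from the previous section can be applied to every sentence in turn:

# Sentence objects support the same attributes as a TextBlob,
# so we can tokenize each sentence into words separately
for sentence in corpus_sentence:
    print(sentence.words)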

Conclusion

Tokenization is a significant data pre-processing step in NLP that involves breaking a text down into smaller chunks called tokens. These tokens can be individual words, sentences, or characters of the original text.

TextBlob is a great library to get into NLP with, since it offers a simple API that lets users quickly jump into common NLP tasks. In the following article, you can explore text analysis with the TextBlob package further and get to know N-Grams Detection methods:

If you enjoy reading stories like these and want to support me as a writer, consider signing up to become a Medium member. It’s $5 a month, giving you unlimited access to stories on Medium. If you sign up using my link, I’ll earn a small commission.

You can also support the sleepless nights I spend creating content by buying me a coffee.
