Top 5 Tokenization Techniques in Natural Language Processing in Python

Ajay Khanna
5 min read · Feb 9, 2022


Tokenization is the process of splitting a text object into smaller units known as tokens. Examples of tokens can be words, characters, numbers, symbols, or n-grams.

The most common tokenization process is whitespace (unigram) tokenization. In this process the entire text is split into words by breaking it at whitespace. For example, the sentence "I went to New Delhi" is split into the unigrams "I", "went", "to", "New", "Delhi".

In other words, tokenization splits a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.
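As a quick illustration of whitespace tokenization, here is a minimal sketch using the example sentence above:

# Whitespace tokenization: split the sentence wherever there is a space
sentence = "I went to New Delhi"
sentence.split()

Output:
['I', 'went', 'to', 'New', 'Delhi']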

Different Techniques to Perform Tokenization in Python

Tokenization Using Python’s split() function

text = """Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data."""

# split() with no arguments splits on whitespace
text.split()

Output:
['Natural', 'language', 'processing', '(NLP)', 'is', 'a', 'subfield', 'of', 'linguistics,', 'computer', 'science,', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'language,', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data.']

Tokenization Using Regular Expressions (RegEx)

import re

text = """Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data."""

# Keep runs of word characters (and apostrophes) as tokens; punctuation is dropped
tokens = re.findall(r"[\w']+", text)

Output:
['Natural', 'language', 'processing', 'NLP', 'is', 'a', 'subfield', 'of', 'linguistics', 'computer', 'science', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'language', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data']
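RegEx can also approximate sentence tokenization. A minimal sketch, splitting on sentence-ending punctuation (a simplification that will mis-handle abbreviations such as "Dr."):

import re

text = "NLP is fun. It powers chatbots! Does it also power search engines?"

# Split after '.', '!' or '?' when followed by whitespace (naive rule)
re.split(r"(?<=[.!?])\s+", text)

Output:
['NLP is fun.', 'It powers chatbots!', 'Does it also power search engines?']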

Tokenization Using NLTK

NLTK provides a tokenize module with two commonly used functions:

Word tokenize: the word_tokenize() method splits a sentence into tokens or words.

Sentence tokenize: the sent_tokenize() method splits a document or paragraph into sentences.

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

text = """Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data."""

word_tokenize(text)

Output:
['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'language', ',', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data', '.']
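sent_tokenize() works the same way at the document level. A short sketch, assuming the pre-trained punkt model has already been fetched with nltk.download('punkt'):

from nltk.tokenize import sent_tokenize

text = """Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence. It is concerned with how to program computers to process and analyze large amounts of natural language data."""

sent_tokenize(text)

Output:
['Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence.', 'It is concerned with how to program computers to process and analyze large amounts of natural language data.']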

Tokenization Using SpaCy

spaCy is an open-source library for advanced NLP. It supports 49+ languages and provides state-of-the-art processing speed.

Check out full code at https://spacy.io/api/tokenizer

from spacy.lang.en import English

# Load a blank English pipeline (tokenizer only)
nlp = English()

text = """Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data."""

# The "nlp" object is used to create documents with linguistic annotations
my_doc = nlp(text)

# Create a list of word tokens
token_list = []
for token in my_doc:
    token_list.append(token.text)
token_list

Output:
['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'language', ',', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data', '.']
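spaCy can also segment text into sentences. A minimal sketch, assuming spaCy v3, where a rule-based sentencizer component is added to the blank English pipeline:

from spacy.lang.en import English

nlp = English()
nlp.add_pipe("sentencizer")  # rule-based sentence boundary detection (spaCy v3 API)

doc = nlp("Natural language processing is a subfield of linguistics. It studies how computers handle human language.")
[sent.text for sent in doc.sents]

Output:
['Natural language processing is a subfield of linguistics.', 'It studies how computers handle human language.']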

Tokenization Using Keras

Keras is an open-source neural network library for Python. It is easy to use and runs on top of TensorFlow. In the NLP context, Keras's text preprocessing utilities help clean the unstructured text data we typically collect.

Check out the code at https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer

from keras.preprocessing.text import text_to_word_sequence

# define the text
text = """Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data."""

# tokenize: lowercases the text and strips punctuation before splitting
result = text_to_word_sequence(text)
result

Output:
['natural', 'language', 'processing', 'nlp', 'is', 'a', 'subfield', 'of', 'linguistics', 'computer', 'science', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'language', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data']
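The Tokenizer class linked above goes one step beyond text_to_word_sequence(): it builds a vocabulary from a corpus and maps each word to an integer id, which is what a neural network actually consumes. A brief sketch (the import path varies across Keras/TensorFlow versions, and newer releases favour tf.keras.layers.TextVectorization instead):

from keras.preprocessing.text import Tokenizer

docs = ["Natural language processing is fun",
        "Computers process natural language data"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(docs)          # build the word index from the corpus
tokenizer.word_index                  # e.g. {'natural': 1, 'language': 2, ...}
tokenizer.texts_to_sequences(docs)    # each document as a list of integer ids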

Tokenization Using Gensim

Another tokenization method we can use comes from the Gensim library. It is an open-source library for unsupervised topic modeling and natural language processing, designed to automatically extract semantic topics from a given document.

Check out full code at https://tedboy.github.io/nlps/generated/generated/gensim.utils.tokenize.html

from gensim.utils import tokenize

text = """Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data."""

list(tokenize(text))

Output:
['Natural', 'language', 'processing', 'NLP', 'is', 'a', 'subfield', 'of', 'linguistics', 'computer', 'science', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'language', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data']
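Gensim also ships gensim.utils.simple_preprocess(), which lowercases, tokenizes, and drops very short or very long tokens in a single call; a brief sketch:

from gensim.utils import simple_preprocess

text = "Natural language processing (NLP) is a subfield of linguistics."

simple_preprocess(text)

Output (single-character tokens such as 'a' are dropped by default):
['natural', 'language', 'processing', 'nlp', 'is', 'subfield', 'of', 'linguistics']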

Tokenization Using PunktSentenceTokenizer

This tokenizer divides a text into a list of sentences. It uses an unsupervised algorithm to build a model of abbreviations, collocations, and words that start sentences, and it must be trained on a large collection of plain text in the target language before it can be used. The NLTK data package includes a pre-trained Punkt tokenizer for English.

Check out more details at https://www.nltk.org/_modules/nltk/tokenize/punkt.html

import nltk
from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import state_union

nltk.download('state_union')

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

# Train a custom Punkt model on the 2005 address, then tokenize the 2006 address
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
tokenized[:5]

Output:
["White House photo by Eric DraperEvery time I'm invited to this rostrum, I'm humbled by the privilege, and mindful of the history we've seen together.", 'We have gathered under this Capitol dome in moments of national mourning and national achievement.', 'We have served America through one of the most consequential periods of our history -- and it has been my honor to serve with you.', 'In a system of two parties, two chambers, and two elected branches, there will always be differences and debate.', 'But even tough debates can be conducted in a civil tone, and our differences cannot be allowed to harden into anger.']
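Training a custom model as above is optional: the pre-trained English Punkt model mentioned earlier is what sent_tokenize() uses under the hood, and it already knows about common abbreviations. A short sketch, assuming the punkt data has been downloaded:

import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')

# The pre-trained model should recognise "U.S." as an abbreviation, not a sentence end
sent_tokenize("The U.S. economy grew last year. Growth was modest.")

Output:
['The U.S. economy grew last year.', 'Growth was modest.']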


Ajay Khanna

Data Scientist with expertise in Text Mining, ML, DL, and NLP