LEXICAL PROCESSING

MANOJ KUMAR
May 3, 2020 · 3 min read


TEXT ANALYTICS PROCESS FLOW

To process textual data for machine learning, we need to perform the following steps:

  1. Lexical processing of text: In this step, we convert raw text into words, sentences, paragraphs, etc.
  2. Syntactic processing of text: In this step, we try to understand the relationships among the words used in the sentences.
  3. Semantic processing of text: In this step, we try to understand the meaning of the text.

As part of this series, I will be explaining the different techniques for lexical processing of text.

To perform lexical processing of text, we carry out “Tokenization” and “Extraction of features from text”.

Tokenization

Tokenization is a technique used to split text into smaller elements. These elements can be characters, words, sentences, or paragraphs, depending on the type of application we are working on.

E.g. you must have heard about spam detectors; in a spam detector, we break the message or email text into words to identify whether the message or email is spam or ham.
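
As a rough illustration (my own sketch, not from the article), the tokenised words of a message could feed a very simple keyword-based check like the one below; the word list and the sample message are made up for demonstration only:

sample_message = "Congratulations! You have won a FREE prize. Click now"

# split the message into lowercase word tokens
tokens = sample_message.lower().split()

# a made-up list of "spammy" words; a real detector would learn such features from data
spam_words = {"free", "prize", "winner", "click"}

# flag the message if any token (with punctuation stripped) appears in the list
is_spam = any(token.strip(".,!?") in spam_words for token in tokens)
print(is_spam)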

This technique of splitting text into words, sentences, or paragraphs is called “tokenization”. There is a library called “NLTK” in Python which has different types of tokenizers available; the most popular are:

a. Word Tokenizer: As the name suggests, it is used to split the text into words.

b. Sentence Tokenizer: It is used to split the text into sentences.

c. Tweet Tokenizer: It is used to extract the words, emojis, and hashtags that we generally use when posting text on social media.

d. Regex Tokenizer: It is used to write a custom tokenizer using regex patterns, according to the requirements of the application we are working on.

Let’s see how we can use Python to perform tokenization. You can copy the below code into a Jupyter notebook and run it to see tokenization in action.

Token.py

# # Tokenization
#
# The notebook contains four types of tokenisation techniques:
# 1. Word tokenization
# 2. Sentence tokenization
# 3. Tweet tokenization
# 4. Custom tokenization using regular expressions

# ### 1. Word tokenization

text = "We are entering a new world. The technologies of machine learning, speech recognition, and natural language understanding are reaching a nexus of capability. The end result is that we'll soon have artificially intelligent assistants to help us in every aspect of our lives."
print(text)

print(text.split())
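# Note: the built-in split() breaks only on whitespace, so punctuation stays attached to the words (e.g. 'world.' and 'capability.').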

# Tokenising using nltk word tokeniser

from nltk.tokenize import word_tokenize
words = word_tokenize(text)
print(words)

# NLTK’s word tokeniser not only breaks on whitespaces but also breaks contraction words such as we’ll into “we” and “‘ll”.

# ### 2. Sentence tokeniser

# Tokenising based on sentence requires you to split on the period (‘.’). Let’s use nltk sentence tokeniser.

from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text)

print(sentences)
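# With the sample text above, this should yield three sentences as separate tokens.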

# ### 3. Tweet tokeniser

# A problem with the word tokeniser is that it fails to tokenise emojis and other complex special characters such as words with hashtags. Emojis are common these days and people use them all the time.

message = "Another Record Of its own…#Tubelight gets its own emoji..FIRST EVER fr Hindi Cinema , Kmaal krte ho \
@BeingSalmanKhan @kabirkhankk 👏👌✌"

print(word_tokenize(message))

# The word tokeniser is not able to split the emojis, which is something that we don't want. Emojis have their own significance in areas like sentiment analysis, where a happy face or a sad face alone can prove to be a really good predictor of the sentiment. Similarly, the hashtag is broken into two tokens. A hashtag is used for searching specific topics or photos in social media apps such as Instagram and Facebook, so there you want to use the hashtag as is.
#
# Let’s use the tweet tokeniser of nltk to tokenise this message.

from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()

tknzr.tokenize(message)

# As you can see, it handles all the emojis and the hashtags pretty well.

# Now, there is a tokeniser that takes a regular expression and returns tokens based on the pattern of that regular expression.
#
# Let's look at how you can use the regular expression tokeniser.

from nltk.tokenize import regexp_tokenize
pattern = r"#[\w]+"

regexp_tokenize(message, pattern)
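# With the pattern above, this should return just the hashtag token, e.g. ['#Tubelight'].

As a further illustration (my own addition, not part of the original notebook), the same regexp_tokenize function can be given a different pattern, for example to pull out the @-mentions instead of the hashtag:

mention_pattern = r"@[\w]+"
# this should return the handles mentioned in the tweet, e.g. ['@BeingSalmanKhan', '@kabirkhankk']
print(regexp_tokenize(message, mention_pattern))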

If you have any queries about the above article, please put them in the comments section; I would love to clarify them.

In next week's article, I will be covering the details of other techniques for lexical processing.
