Text Segmentation

Normalization, Tokenization, Sentence Segmentation + Useful Methods

Jake Batsuuri
Computronium Blog
7 min read · Aug 27, 2021


What does normalizing a text do?

We have previously used the method .lower() to turn all of the words lowercase, so that strings like “the” and “The” both become “the” and we don’t double-count them.
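For example, lowercasing before counting:

    raw = 'The dog chased the cat because THE cat meowed.'
    tokens = [w.lower() for w in raw.split()]
    print(tokens.count('the'))   # 3, since 'The', 'the' and 'THE' now match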

What if we want to do even more?

Stemming

For example, we can strip the affixes from words in a process called stemming. In the word “preprocessing”, there’s the prefix “pre-”, the suffix “-ing”, and the resulting stem “process”.

NLTK has several stemmers. You can make your own using regular expressions, but NLTK’s stemmers handle many irregular cases.

There are two classic stemmers, the Porter and the Lancaster:
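Both ship with NLTK; a quick comparison on a few tokens (the outputs in the comments are indicative):

    import nltk

    tokens = ['women', 'lying', 'in', 'ponds', 'distributing', 'swords']

    porter = nltk.PorterStemmer()
    lancaster = nltk.LancasterStemmer()

    print([porter.stem(t) for t in tokens])
    # ['women', 'lie', 'in', 'pond', 'distribut', 'sword']
    print([lancaster.stem(t) for t in tokens])
    # ['wom', 'lying', 'in', 'pond', 'distribut', 'sword']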

Both produce some wrong results, though the Porter stemmer’s output is slightly better; for example, it correctly maps “lying” to “lie” while the Lancaster stemmer leaves it untouched.

The following class finds concordances not by direct string search, but by matching the different variations of “lie”.
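A sketch along the lines of the NLTK book’s IndexedText class, which indexes a text by stems so that a concordance search for “lie” also surfaces “lying”, “lies”, and “lied”:

    import nltk
    # nltk.download('webtext')  # fetch the sample corpus on first use

    class IndexedText:
        def __init__(self, stemmer, text):
            self._text = text
            self._stemmer = stemmer
            # Map each stem to every position where a variant of it occurs.
            self._index = nltk.Index((self._stem(word), i)
                                     for (i, word) in enumerate(text))

        def concordance(self, word, width=40):
            key = self._stem(word)
            wc = width // 4          # words of context on each side
            for i in self._index[key]:
                lcontext = ' '.join(self._text[max(0, i - wc):i])
                rcontext = ' '.join(self._text[i:i + wc])
                print(lcontext[-width:].rjust(width), rcontext[:width])

        def _stem(self, word):
            return self._stemmer.stem(word).lower()

    porter = nltk.PorterStemmer()
    grail = nltk.corpus.webtext.words('grail.txt')
    IndexedText(porter, grail).concordance('lie')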

Lemmatization

The word that stemming produces may or may not be a dictionary term; the process of making sure it is a real word is called lemmatization.

It’s not much use to stem a word if the resulting root is not in the dictionary. The WordNet lemmatizer removes affixes only if the resulting word is in its dictionary, WordNet.
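A short example with nltk.WordNetLemmatizer:

    import nltk
    # nltk.download('wordnet')  # fetch the WordNet data on first use

    wnl = nltk.WordNetLemmatizer()
    print([wnl.lemmatize(t) for t in ['women', 'churches', 'feet', 'lying']])
    # ['woman', 'church', 'foot', 'lying']
    # 'lying' is untouched because the default part of speech is noun;
    # tell the lemmatizer it is a verb and it maps to the dictionary form:
    print(wnl.lemmatize('lying', pos='v'))   # 'lie'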

This idea of only stemming when the result is a real word is helpful when you want to do further analysis, such as compiling the vocabulary of a text.

Furthermore, you can map special words such as numbers, abbreviations, email addresses, and street addresses into special sub-vocabularies, which will improve the language model significantly.

To map these special words, you sometimes have to write the rules yourself.
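A minimal sketch of this kind of mapping, assuming placeholder symbols <NUM> and <EMAIL> that are our own invention:

    import re

    def map_special(token):
        # Hypothetical sub-vocabulary mapping: collapse every number and
        # every email address onto a single placeholder symbol each.
        if re.fullmatch(r'\d+(\.\d+)?', token):
            return '<NUM>'
        if re.fullmatch(r'[\w.+-]+@[\w-]+\.[\w.-]+', token):
            return '<EMAIL>'
        return token

    print([map_special(t) for t in ['write', 'to', 'jake@example.com', 'before', '2021']])
    # ['write', 'to', '<EMAIL>', 'before', '<NUM>']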

Regular Expression for Tokenization

Tokenization is a special type of segmentation where we segment the entire text into words, as opposed to sentences or phrases.

The simplest way to tokenize a text is to split it on whitespace.

The simplest way to segment a text into sentences is to split on periods.
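For instance, with str.split:

    raw = 'strange women lying in ponds\ndistributing swords'
    print(raw.split(' '))
    # ['strange', 'women', 'lying', 'in', 'ponds\ndistributing', 'swords']

    print('Hello world. This is a test.'.split('.'))
    # ['Hello world', ' This is a test', '']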

Then we come across a problem: a newline is gluing together what should be two separate tokens ('ponds\ndistributing' above). We can split on a set of spaces, tabs, and newlines instead:
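Continuing with the same raw string:

    import re

    print(re.split(r'[ \t\n]+', raw))
    # ['strange', 'women', 'lying', 'in', 'ponds', 'distributing', 'swords']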

This works much better.

We can even shorten the set to r'\s+', which matches any run of one or more whitespace characters.
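The shorthand produces identical tokens:

    print(re.split(r'\s+', raw))
    # ['strange', 'women', 'lying', 'in', 'ponds', 'distributing', 'swords']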

The Complement Method

Consider the range [a-zA-Z0-9_], which can be shortened to \w; this class matches word characters rather than spaces. Its complement is \W, which matches everything except letters, digits, and underscores.

Matching tokens directly with \w-style patterns is essentially more powerful than splitting, since it lets us write regular expressions for the tokens themselves, capturing forms such as “I’m” or “hot-tempered”.
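For example, nltk.regexp_tokenize takes such a pattern and matches the tokens directly, in the style of the NLTK book’s example:

    import nltk

    text = "That U.S.A. poster-print costs $12.40, I'm told..."
    pattern = r'''(?x)          # verbose flag: whitespace and comments ignored
          (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
        | \w+(?:[-']\w+)*       # words with internal hyphens or apostrophes
        | \$?\d+(?:\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
        | \.\.\.                # ellipsis
        | [][.,;"'?():_-]       # these are separate tokens
    '''
    print(nltk.regexp_tokenize(text, pattern))
    # ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', ',', "I'm", 'told', '...']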

This is the best resource I found for regexes.

Issues with Tokenization

Tokenization never gives a perfect solution across different types of text; the tokenizer must be adapted to each text type.

One way to handle this is to test your tokenizer against already-tokenized text and compare its output to make sure that it’s good.

Another issue for tokenizers is contractions such as “didn’t”, which must be treated as special cases.

Or what about cases where we can’t tell the word boundaries at all, such as when our input system just gets “doyouseethekittyseethedoggydoyoulikethekittylikethedoggy”?

Simulated Annealing
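The NLTK book tackles this with a non-deterministic search. Represent a candidate segmentation as a string of boundary bits, score it by the combined size of the lexicon and the segmented text (good segmentations reuse a small lexicon of frequent chunks), and use simulated annealing: randomly flip boundary bits, accepting improvements, while gradually shrinking the number of flips as the temperature cools. A sketch of that approach:

    from random import randint

    def segment(text, segs):
        """Cut text after every position whose boundary flag is '1'."""
        words, last = [], 0
        for i, flag in enumerate(segs):
            if flag == '1':
                words.append(text[last:i + 1])
                last = i + 1
        words.append(text[last:])
        return words

    def evaluate(text, segs):
        """Score = number of word tokens + size of the unique-word lexicon."""
        words = segment(text, segs)
        lexicon_size = sum(len(word) + 1 for word in set(words))
        return len(words) + lexicon_size

    def flip(segs, pos):
        return segs[:pos] + str(1 - int(segs[pos])) + segs[pos + 1:]

    def flip_n(segs, n):
        for _ in range(n):
            segs = flip(segs, randint(0, len(segs) - 1))
        return segs

    def anneal(text, segs, iterations, cooling_rate):
        temperature = float(len(segs))
        while temperature > 0.5:
            best_segs, best = segs, evaluate(text, segs)
            for _ in range(iterations):
                guess = flip_n(segs, round(temperature))
                score = evaluate(text, guess)
                if score < best:
                    best, best_segs = score, guess
            segs = best_segs
            temperature = temperature / cooling_rate
        return segs

    text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
    segs = '0' * (len(text) - 1)      # start with no boundaries at all
    print(segment(text, anneal(text, segs, 5000, 1.2)))

The search is random, so the segmentation it finds varies from run to run; lower scores are better, since each distinct word costs its length plus one in the lexicon and each token costs one in the text.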

Sentence Segmentation

What about segmenting into sentences?

Sentence segmentation is a bit harder: we can split on periods, but periods are also used in abbreviations and numbers, which makes it more difficult.
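In practice we can lean on NLTK’s pre-trained Punkt sentence tokenizer, which knows about common abbreviations:

    import nltk
    # nltk.download('punkt')  # fetch the Punkt model on first use

    text = ('Mr. Smith arrived at 3 p.m. to see Dr. Jones. '
            'They discussed the U.S. economy. Then he left.')
    for sent in nltk.sent_tokenize(text):
        print(sent)
    # Punkt usually keeps 'Mr.', 'p.m.' and 'U.S.' inside their sentences
    # instead of splitting on every period.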

Formatting

Lists to Strings
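To glue a list of tokens back into a single string, join on a separator:

    silly = ['We', 'called', 'him', 'Tortoise', 'because', 'he', 'taught', 'us', '.']
    print(' '.join(silly))   # We called him Tortoise because he taught us .
    print(';'.join(silly))   # We;called;him;Tortoise;because;he;taught;us;.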

Newline Printing
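print appends a newline by default; override it with the end parameter, or join on '\n' to put each item on its own line:

    words = ['one', 'two', 'three']
    for w in words:
        print(w, end=' ')      # stay on the same line
    print()                    # finish the line
    print('\n'.join(words))    # one word per line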

Frequency Distribution Counters
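NLTK’s FreqDist counts token frequencies and can print them as an aligned table:

    import nltk

    fdist = nltk.FreqDist('the quick brown fox jumps over the lazy dog'.split())
    print(fdist['the'])          # 2
    print(fdist.most_common(3))  # the three most frequent tokens, e.g. [('the', 2), ...]
    fdist.tabulate()             # samples and counts in aligned columns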

String Formatting Expressions

These help us construct nicely formatted outputs:

Conversion Specifiers

  • %s is for strings
  • %d is for decimal integers

The number of values in the tuple must match the number of conversion specifiers in the format string.
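For example, a string and an integer filled into their specifiers:

    print('%s has %d letters' % ('Monty', 5))
    # Monty has 5 letters
    # Passing too few values raises TypeError: not enough arguments for format string.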

More use cases:
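A common pattern is printing each word of a frequency distribution alongside its count:

    import nltk

    fdist = nltk.FreqDist(['dog', 'cat', 'dog', 'cat', 'dog', 'snake', 'dog', 'cat'])
    for word in sorted(fdist):
        print('%s->%d;' % (word, fdist[word]), end=' ')
    # cat->3; dog->4; snake->1;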

Lining It Up

Left Padding:
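A minimum field width pads the value with spaces on the left, right-aligning it:

    print('%6s' % 'dog')    # '   dog'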

Right Padding:
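A minus sign pads on the right instead, left-aligning the value:

    print('%-6s' % 'dog')   # 'dog   '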

Variable Padding:
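An asterisk reads the width from the argument tuple at run time:

    width = 10
    print('%*s' % (width, 'dog'))    # '       dog'
    print('%-*s' % (width, 'dog'))   # 'dog       '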

Decimals:
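A precision after the dot controls how many decimal places are printed:

    import math
    print('%.4f' % math.pi)      # 3.1416
    print('%6.2f%%' % 93.75)     # ' 93.75%'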

Writing To a File

If the data is non-text, convert it to a string first:
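A sketch in the style of the NLTK book’s example: write one word per line, converting any non-text data, like the count below, with str():

    import nltk

    output_file = open('output.txt', 'w')
    words = set(nltk.corpus.genesis.words('english-kjv.txt'))
    for word in sorted(words):
        print(word, file=output_file)

    # Non-text data must be converted to a string before being written out:
    print(str(len(words)), file=output_file)
    output_file.close()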


Up Next…

In the next article, we will explore Computational Complexity for language processing.

For the table of contents and more content click here.

References

Clark, Alexander. The Handbook of Computational Linguistics and Natural Language Processing. Wiley-Blackwell, 2013.

Eisenstein, Jacob. Introduction to Natural Language Processing. The MIT Press, 2019.

Bird, Steven, et al. Natural Language Processing with Python. O’Reilly, 2009.

Jurafsky, Dan, and James H. Martin. Speech and Language Processing. Pearson, 2014.

Barker-Plummer, Dave, et al. Language, Proof and Logic. CSLI Publications, 2011.
