Inputting & PreProcessing Text

Input Methods, String & Unicode, Regular Expression Use Cases

Jake Batsuuri
Computronium Blog
13 min read · Aug 25, 2021

--

NLTK has preprocessed texts. But we can also import and process our own texts.

Importing

To Import a Book as a Txt

urlopen is part of Python's standard library (urllib.request), so there is nothing to install; just import it and fetch the raw text:
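
Something like the following, where the URL is just an example of a plain-text book online (substitute any .txt URL you like):

```python
from urllib.request import urlopen

url = "https://www.gutenberg.org/files/2554/2554-0.txt"   # example URL
raw = urlopen(url).read().decode('utf8')
len(raw), raw[:75]
```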

Tokenization:
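
Assuming raw holds the text fetched above (the punkt tokenizer models may need to be downloaded first):

```python
import nltk
from nltk import word_tokenize

# nltk.download('punkt')   # uncomment on first use
tokens = word_tokenize(raw)
tokens[:8]
```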

Textization, or just turning it into NLTK’s Text object so we can run things like collocations:
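
Still assuming tokens from the previous step:

```python
text = nltk.Text(tokens)
text.collocations()   # frequently co-occurring word pairs
```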

Getting Just the Good Stuff

A lot of books will have headers and footers; here we just find the header index and the footer index and slice them off.
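
A sketch; the start and end markers ("PART I" and "THE END") are only assumptions about one particular book, so adjust them to your text:

```python
start = raw.find("PART I")    # index where the real content begins
end = raw.find("THE END")     # index where it stops
raw = raw[start:end]
```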

If there is more than one “THE END”, you can use rfind:
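
For example:

```python
end = raw.rfind("THE END")    # index of the last occurrence
```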

rfind returns the index of the last occurrence, searching from the end of the text.

Handling HTML

Printing the raw HTML (fetched with urlopen, just like before) shows tags, scripts and boilerplate mixed in with the text we want. Not that great, so let’s clean it up:

BeautifulSoup has tons of easy methods to get us the text:

Stringify:
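
A sketch using the third-party beautifulsoup4 package (pip install beautifulsoup4); the URL here is only a placeholder:

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://example.com/some-article.html"   # placeholder URL
html = urlopen(url).read().decode('utf8')
raw = BeautifulSoup(html, 'html.parser').get_text()
```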

Tokenify:
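
Then, as before, with word_tokenize from nltk:

```python
tokens = word_tokenize(raw)
```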

Textify:
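
And wrap it as a Text object again:

```python
text = nltk.Text(tokens)
```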

Reading Local Files
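
For a file in the current working directory (the file name here is hypothetical):

```python
f = open('document.txt')
raw = f.read()
```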

If the file is in a different directory, pass the path to open():
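
For example:

```python
f = open('/path/to/document.txt')   # hypothetical absolute path
```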

Print line by line:
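
For example:

```python
f = open('document.txt')
for line in f:
    print(line.strip())   # strip() drops the trailing newline
```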

Binary Files

Text sometimes comes as PDF and Word files; there are libraries for processing these, such as pypdf and pywin32.

User Input
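
A small example using Python’s built-in input():

```python
import nltk

s = input("Enter some text: ")
print("You typed", len(nltk.word_tokenize(s)), "words.")
```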

Strings

Single Quote
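
For example:

```python
monty = 'Monty Python'
```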

However, if you want to escape the single quotation mark itself:
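
Use a backslash:

```python
circus = 'Monty Python\'s Flying Circus'
```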

Or you can use the…

Double Quotation Mark
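
Inside double quotes the apostrophe needs no escaping:

```python
circus = "Monty Python's Flying Circus"
```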

Triple Quotation Mark
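
For comparison, here is a two-part string built with adjacent literals and a backslash line continuation; it is one continuous string with no newline inside it:

```python
verse = "The quick brown fox"\
        " jumps over the lazy dog"
print(verse)   # prints as a single line
```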

The problem is that the above doesn’t print a newline between the lines, but this does:
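
```python
verse = """The quick brown fox
jumps over the lazy dog"""
print(verse)   # two lines: the newline inside the quotes is kept
```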

This also works with three single quotation marks:
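
```python
verse = '''The quick brown fox
jumps over the lazy dog'''
```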

Concatenation
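
For example:

```python
'Monty' + 'Python'        # -> 'MontyPython'
'Monty' + ' ' + 'Python'  # -> 'Monty Python'
'very' * 3                # -> 'veryveryvery'
```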

Printing
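
For example:

```python
monty = 'Monty Python'
print(monty)
```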

Individual Chars
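
Indexing starts at 0:

```python
monty[0]    # -> 'M'
monty[6]    # -> 'P'
```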

Negative Indexing Chars

The last character is at index -1, and the index decreases (-2, -3, and so on) as you move backwards through the string.
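
For example:

```python
monty[-1]   # -> 'n'
monty[-2]   # -> 'o'
```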

Print Chars
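
For example:

```python
for char in monty:
    print(char, end=' ')   # M o n t y   P y t h o n
```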

Count Chars
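
A sketch using NLTK’s Gutenberg sample of Moby Dick:

```python
import nltk
from nltk.corpus import gutenberg

raw = gutenberg.raw('melville-moby_dick.txt')
fdist = nltk.FreqDist(ch.lower() for ch in raw if ch.isalpha())
fdist.most_common(5)   # the five most frequent letters
```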

You can also visualize this frequency distribution:
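
Assuming fdist from above:

```python
fdist.plot()   # requires matplotlib
```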

Each language has a characteristic letter-frequency distribution, which is a good way to distinguish between languages.

Substrings

The slice (m,n) contains the substring from index m through n-1.
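
For example:

```python
monty = 'Monty Python'
monty[6:10]   # -> 'Pyth' (indexes 6, 7, 8 and 9)
```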

Substring Membership
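
The in operator tests for substrings:

```python
'Python' in monty   # -> True
'python' in monty   # -> False, membership is case sensitive
```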

More Operations

  • s.find(t) Index of first instance of string t inside s (-1 if not found)
  • s.rfind(t) Index of last instance of string t inside s (-1 if not found)
  • s.index(t) Like s.find(t), except it raises ValueError if not found
  • s.rindex(t) Like s.rfind(t), except it raises ValueError if not found
  • s.join(text) Combine the words of the text into a string using s as the glue
  • s.split(t) Split s into a list wherever a t is found (whitespace by default)
  • s.splitlines() Split s into a list of strings, one per line
  • s.lower() A lowercased version of the string s
  • s.upper() An uppercased version of the string s
  • s.title() A titlecased version of the string s
  • s.strip() A copy of s without leading or trailing whitespace
  • s.replace(t, u) Replace instances of t with u inside s
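
A few of these in action:

```python
s = 'Monty Python'
s.find('Python')                # -> 6
s.split()                       # -> ['Monty', 'Python']
'-'.join(['Monty', 'Python'])   # -> 'Monty-Python'
s.lower()                       # -> 'monty python'
s.replace('Python', 'Hall')     # -> 'Monty Hall'
```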

Lists
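
Lists support indexing, slicing and concatenation much like strings, but they are mutable and can hold whole words as elements:

```python
beatles = ['john', 'paul', 'george', 'ringo']
beatles[2]             # -> 'george'
beatles + ['stuart']   # -> ['john', 'paul', 'george', 'ringo', 'stuart']
beatles[0] = 'JOHN'    # lists can be modified in place; strings cannot
```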

Unicode

ASCII can represent only 128 characters (256 in extended variants), because it uses at most 1 byte, or 8 bits, per character. UTF-8, by contrast, uses 1 to 4 bytes per character and can encode the full Unicode range of over a million code points.

We can manipulate Unicode strings exactly like normal strings; however, when we store them in a file or send them over a network, we store them as a stream of bytes. Simple encodings such as ASCII are often enough to support a single language.

Unicode can support many if not all languages and other special characters like emojis.

Since Unicode is the universal representation that sits between encodings, we say that translating from a specific byte encoding into Unicode is decoding, and translating out of Unicode into some byte encoding is encoding.
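
For example:

```python
s = 'Piñata'              # a Unicode string
data = s.encode('utf8')   # encoding: Unicode -> bytes
data.decode('utf8')       # decoding: bytes -> Unicode, back to 'Piñata'
```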

Code Point

Unicode supports over a million code points; each character is assigned a number in that space, which we call its code point.
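
For example:

```python
ord('ñ')        # -> 241, the code point of ñ
hex(ord('ñ'))   # -> '0xf1', conventionally written U+00F1
'\u00f1'        # -> 'ñ', the character written by its code point
```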

Glyphs

Fonts are a mapping from characters to glyphs. Glyphs are what appear in print and on screen; to the computer, characters themselves are just code points, conventionally written as four-to-six-digit hexadecimal numbers such as U+0144.

Codecs

In general, a codec is a device or program that encodes and decodes a data stream. In text processing, a codec is what translates between a particular byte encoding and Unicode; Python selects one via the encoding argument of open() or the codecs module.
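
For example, to read a Latin-2 encoded file (the file name here is hypothetical), we name the codec when opening it:

```python
f = open('polish-text.txt', encoding='latin2')
text = f.read()   # text is now ordinary Unicode, whatever the file's encoding was
```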

Ordinal
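
ord() returns a character’s ordinal (its code point) and chr() goes the other way:

```python
ord('a')   # -> 97
chr(97)    # -> 'a'
```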

Regular Expression Applications to Tokenizing

Lots of linguistic tasks require pattern matching. For example, to find words that end with ‘ed’, we could use w.endswith('ed').

Regular expressions help us do that very efficiently.
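
For example, using NLTK’s English word list (may require nltk.download('words')):

```python
import re
from nltk.corpus import words

wordlist = [w for w in words.words('en') if w.islower()]
[w for w in wordlist if re.search('ed$', w)][:5]   # words ending in 'ed'
```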

Basic Metacharacters

Metacharacters let us specify additional things, such as the start of the string, the end of the string, wildcards and so on.

Start: Caret

^ matches the start of the string; you can think of it as the imaginary space preceding the word.
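
Reusing wordlist and re from the earlier sketch:

```python
[w for w in wordlist if re.search('^pre', w)][:5]   # words starting with 'pre'
```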

End: Dollar Sign

$ matches the end of the string.
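
For example:

```python
[w for w in wordlist if re.search('ful$', w)][:5]   # words ending in 'ful'
```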

Single Character Wildcard: Dot
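
Each dot stands for exactly one character, which is handy for crossword-style searches; for example, 8-letter words with ‘j’ as the third letter and ‘t’ as the sixth:

```python
[w for w in wordlist if re.search('^..j..t..$', w)]
```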

Optional Characters: Question Mark

The ? makes the character before it optional, so the regular expression «^e-?mail$» matches both email and e-mail.
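
For example:

```python
[w for w in ['email', 'e-mail', 'mail'] if re.search('^e-?mail$', w)]
# -> ['email', 'e-mail']
```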

Ranges

The words “golf” and “hold” are textonyms: words entered with the same keystrokes on a T9 phone keypad (4653); see the example after the list.

  • Set = [ghi]
  • Range = [g-i]
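
For example, the keys 4, 6, 5, 3 correspond to these four character sets:

```python
[w for w in wordlist if re.search('^[ghi][mno][jkl][def]$', w)]
# four-letter words typed as 4653, e.g. 'gold', 'golf', 'hold', 'hole'
```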

Closures

  • The + in «^m+i+n+e+$» means one or more instances of the preceding character.
  • The * in «^m*i*n*e*$» means zero or more instances of the preceding character (see the example below).

These are known as Kleene closures; the closure of a pattern is the set of all strings it can match.

You can also combine closures with character ranges.
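
A sketch using NLTK’s chat corpus; the matches shown in the comments are only representative:

```python
import re
import nltk

chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))
[w for w in chat_words if re.search('^m+i+n+e+$', w)]    # e.g. 'miiiiinnnnnnnneeeeeeee'
[w for w in chat_words if re.search('^[ha]+$', w)][:5]   # e.g. 'a', 'aaahhhh', 'ha', 'haha'
```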

Logical Not Operator: Caret Inside a Bracket

«^[^aeiouAEIOU]+$» matches tokens that contain no vowels at all, so this would give us tokens like the following (see the sketch after the list):

  • :):):),
  • grrr,
  • cyb3r, and
  • zzzzzzzz
  • or just !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
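
Reusing chat_words from above:

```python
[w for w in chat_words if re.search('^[^aeiouAEIOU]+$', w)][:10]
```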

Matching Patterns with Separators: Escape with a Backslash

This gets us all decimal numbers:
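
A sketch over the treebank sample; the exact matches depend on the corpus:

```python
import re
from nltk.corpus import treebank

wsj = sorted(set(treebank.words()))
[w for w in wsj if re.search(r'^[0-9]+\.[0-9]+$', w)][:5]   # tokens like '0.5', '3.25'
```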

To get currency tokens such as “US$”, escape the dollar sign with a backslash.
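
Reusing wsj from the previous sketch:

```python
[w for w in wsj if re.search(r'^[A-Z]+\$$', w)]   # e.g. ['C$', 'US$']
```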

Limit Characters: Curly Brackets

Curly brackets give an exact repetition count; «^[0-9]{4}$», for example, picks out four-digit numbers such as years.
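
Reusing wsj from the earlier sketch:

```python
[w for w in wsj if re.search(r'^[0-9]{4}$', w)][:5]   # e.g. '1614', '1637', '1787'
```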

To apply repetition limits to several ranges at once, for example digits, a hyphen, then three to five lowercase letters:
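
```python
[w for w in wsj if re.search(r'^[0-9]+-[a-z]{3,5}$', w)][:5]   # e.g. '10-day', '60-day'
```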

Order of Operations: Brackets

What does «w(i|e|ai|oo)t» match?

Gives results like:
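
With re.findall on a made-up string we can see the effect of the parentheses: they both group the alternatives and capture them, so only the group is returned unless we use the non-capturing form (?:...):

```python
import re

re.findall(r'w(i|e|ai|oo)t', 'wit wet wait woot')     # -> ['i', 'e', 'ai', 'oo']
re.findall(r'w(?:i|e|ai|oo)t', 'wit wet wait woot')   # -> ['wit', 'wet', 'wait', 'woot']
```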

In an ordinary Python string, a backslash escapes the following character, so \b would be interpreted as the backspace character rather than the regex word boundary. To make sure the backslash reaches the re library untouched, we mark the string as a raw string with an r prefix, like r'\band\b'.

Extracting Word Pieces

The previous examples all used re.search(regexp, word); here we instead find every instance of a pattern with re.findall(regexp, word).

The example below finds all sequences of two or more characters from the set [aeiou]; the output is shown as a comment.
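
For instance:

```python
import re

word = 'supercalifragilisticexpialidocious'
re.findall(r'[aeiou]{2,}', word)   # -> ['ia', 'iou']
```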

Reconstructing Words from Word Pieces

This function removes vowels from words, unless the vowel is the word’s first or last character; the result reads like English shorthand.
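
A sketch of such a function, applied to the English sample of the Universal Declaration of Human Rights that ships with NLTK:

```python
import re
from nltk.corpus import udhr

def compress(word):
    # keep an initial vowel run, a final vowel run, and every consonant
    pieces = re.findall(r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]', word)
    return ''.join(pieces)

english = udhr.words('English-Latin1')
print(' '.join(compress(w) for w in english[:30]))
```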

Conditional Frequency Distributions

The output is a conditional frequency distribution of consonant-vowel sequences from the treebank corpus.

treebank.words() is a tokenized Wall Street Journal sample.
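
One way to build it; this sketch restricts the consonants to b–h just to keep the table readable:

```python
import re
import nltk
from nltk.corpus import treebank

wsj_words = [w.lower() for w in treebank.words() if w.isalpha()]
cvs = [cv for w in wsj_words for cv in re.findall(r'[bcdfgh][aeiou]', w)]
cfd = nltk.ConditionalFreqDist(cvs)   # each 2-letter string acts as a (consonant, vowel) pair
cfd.tabulate()
```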

Finding All Instances Of

We can also go in the reverse direction, from a consonant-vowel sequence back to every word that contains it.
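
A sketch reusing wsj_words, nltk and re from above:

```python
cv_word_pairs = [(cv, w) for w in wsj_words
                 for cv in re.findall(r'[bcdfgh][aeiou]', w)]
cv_index = nltk.Index(cv_word_pairs)
cv_index['ba'][:5]   # words containing the sequence 'ba'
```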

Finding Word Stems

Word stems are the core of a word, its root; in a search engine we want to query not just for a literal string but for all related words that share the stem.
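
A sketch with a handful of hypothetical tokens:

```python
import re

suffix_re = r'(ing|ly|ed|ious|ies|ive|es|s|ment)$'
tokens = ['processing', 'of', 'texts', 'is', 'really', 'fun']
[w for w in tokens if re.search(suffix_re, w)]
# -> ['processing', 'texts', 'is', 'really']
```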

Which finds all words with those suffixes.

If you apply the same pattern to a single word on its own with re.findall, the parentheses mean we get back only the suffix.
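
```python
re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
# -> ['ing']   (the group captures only the suffix)
```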

Finding Stems In a Better Way
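
A sketch of a stemmer built from a non-greedy stem group plus an optional suffix group:

```python
import re

def stem(word):
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    root, suffix = re.findall(regexp, word)[0]
    return root

[stem(w) for w in ['government', 'processing', 'lying', 'basis']]
# -> ['govern', 'process', 'ly', 'basi']
```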

Even this better method makes errors: the optional suffix sometimes strips too much, for example reducing “basis” to “basi” and “lying” to “ly”.

Searching Tokenized Text

What if you wanted to search multiple words? We can use regular expressions for that too.
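
NLTK’s Text.findall uses angle brackets to mark token boundaries; a sketch on the Moby Dick sample:

```python
import nltk
from nltk.corpus import gutenberg

moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
moby.findall(r"<a> (<.*>) <man>")   # prints matches like: monied; nervous; dangerous; ...
```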

The above regular expression will match “a (anything) man”. <.*> will match any single token.

  • If we omit the parentheses, findall returns the whole matched phrase
  • If we use the parentheses, it returns only the word inside them

To be able to match 3 word phrases:
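
For example, on the chat corpus:

```python
import nltk
from nltk.corpus import nps_chat

chat = nltk.Text(nps_chat.words())
chat.findall(r"<.*> <.*> <bro>")   # three-word phrases ending in 'bro'
```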

To be able to match sequences of 3 or more words that start with “l”:
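
Reusing chat from above:

```python
chat.findall(r"<l.*>{3,}")   # e.g. lol lol lol; la la la la la; ...
```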

Exploring Hypernyms

Some linguistic phenomena, such as superordinate (hypernym) words, tend to show up in characteristic surface patterns in text, like “x and other ys”.

To understand how this works:
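
For example, over a couple of Brown corpus categories:

```python
import nltk
from nltk.corpus import brown

hobbies_learned = nltk.Text(brown.words(categories=['hobbies', 'learned']))
hobbies_learned.findall(r"<\w*> <and> <other> <\w*s>")
# prints matches like: speed and other activities; water and other liquids; ...
```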

Notice that one result is “water and other liquids”, which tells us that water is a type of liquid: “liquid” is the hypernym and “water” is a hyponym.

Of course this method isn’t perfect; there can be false positives.


Up Next…

In the next article, we will explore Normalizing, Tokenizing and Sentence Segmentation.

For the table of contents and more content click here.

