Natural Language Tool Kit 3.5

Search Functions, Statistics, Pronoun Resolution

Jake Batsuuri
Computronium Blog
8 min read · Sep 30, 2020

Introduction

This article is meant to be a gentle introduction to NLTK. As always, we will try to balance mathematical rigor and programmatic ease of use with concrete, linguistically motivated examples.

In many ways, this article is the programmatic introduction to computational linguistics, and is a mirror to this article.

What is NLTK?

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

Getting Started

Importing

Importing a Book

Get Text Details
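
A minimal sketch of the setup, assuming NLTK is installed (pip install nltk) and you are working in an interactive session; the book corpora only need to be downloaded once:

    import nltk

    # Fetch the corpora used in the NLTK book (only needed once)
    nltk.download('book')

    # Loads text1 ... text9 and sent1 ... sent9 into the namespace
    from nltk.book import *

    # Basic details about a text
    text1         # <Text: Moby Dick by Herman Melville 1851>
    len(text1)    # total number of tokens in Moby Dick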

Basic Search Functions

Concordance

A concordance shows us every occurrence of a given string, with the context.
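
For example, with Moby Dick (text1) and the NLTK book's usual example word:

    # Every occurrence of "monstrous" in Moby Dick, with its surrounding context
    text1.concordance("monstrous")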

Similar

It is kind of the f⁻¹ of concordance: it gives us other words that are used in similar contexts to the given string.
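
Roughly:

    # Words used in contexts similar to "monstrous"
    text1.similar("monstrous")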

Common Contexts

This function brings up all contexts that are shared by two or more words. Here, by context we mean the words immediately surrounding the target, much like what concordance shows.
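
Something like, using Sense and Sensibility (text2):

    # Contexts shared by "monstrous" and "very"
    text2.common_contexts(["monstrous", "very"])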

Dispersion Plot

We can also find the location of a word in a text and plot it.

A Lexical Dispersion Plot that shows the density of Trump Tweets about Conspiracy Keywords
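
The figure above plots conspiracy-related keywords across Trump tweets; a minimal sketch of the same idea with the built-in inaugural corpus (text4), assuming matplotlib is installed:

    # Positional dispersion of selected words across the inaugural addresses
    text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])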

Counting Vocabulary

Tokens

Token is another, more NLP way of saying string. Specifically, it's a sequence of characters treated as a unit: a word, a punctuation mark, an emoji, and so on.
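
To count tokens we just take the length of the text; for example, with the Book of Genesis (text3):

    len(text3)    # total number of tokens in the Book of Genesis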

Sets

Remember that in mathematical sets, all duplicates collapse into a single element. So to count the number of unique tokens in a text, we turn the text into a set.
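
Roughly:

    set(text3)           # the vocabulary, duplicates collapsed
    sorted(set(text3))   # the same vocabulary in sorted order
    len(set(text3))      # the number of unique tokens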

Word Type

Word type is the form or spelling of a word, independent of its specific occurrences in the text.

Lexical Richness

This measures how varied the vocabulary of a text is. A common version is the ratio of unique word types to total tokens; its reciprocal gives the average number of times each word is used.
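
A minimal sketch, as a type-to-token ratio and its reciprocal:

    len(set(text3)) / len(text3)   # proportion of distinct words (lexical diversity)
    len(text3) / len(set(text3))   # average number of times each word type is used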

Word Occurrence

Word Occurrence as a Percentage
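
Something like:

    text3.count("smote")                  # how many times a word occurs
    100 * text4.count('a') / len(text4)   # its share of the text, as a percentage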

Texts as Lists of Words

Lists

The main way we think about a text is as a list of words, since this simplifies its analysis.
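
For example, sent1, the opening of Moby Dick, comes predefined with the book module:

    sent1        # ['Call', 'me', 'Ishmael', '.']
    len(sent1)   # 4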

Addition

Appending
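
A quick sketch of both:

    # Addition (concatenation) builds a new, longer list
    ['Monty', 'Python'] + ['and', 'the', 'Holy', 'Grail']

    # Appending modifies the list in place
    sent1.append("Some")
    sent1        # ['Call', 'me', 'Ishmael', '.', 'Some']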

Word from an Index

Index from a Word
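
For example, using the inaugural corpus text:

    text4[173]              # 'awaken'  (the word at a given index)
    text4.index('awaken')   # 173  (the index of the word's first occurrence)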

Slicing

Slicing is another term for getting a sub-list.

Slicing Middle

This includes the sixth, seventh and eighth elements.

Slicing Prefix

This gets you everything from the beginning up to, but not including, the given index; here, the first three elements.

Slicing Suffix
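
A sketch of all three slices, using a small made-up list:

    sent = ['word1', 'word2', 'word3', 'word4', 'word5',
            'word6', 'word7', 'word8', 'word9', 'word10']

    sent[5:8]   # middle: ['word6', 'word7', 'word8'], the sixth to eighth elements
    sent[:3]    # prefix: ['word1', 'word2', 'word3'], everything before index 3
    sent[5:]    # suffix: everything from index 5 to the end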

Variables

Strings

Characters

Characters Range

String Multiplication

String Addition

String Joins

String Split
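
A quick sketch of the string operations above:

    name = 'Monty'                   # a variable holding a string
    name[0]                          # 'M', a single character
    name[:4]                         # 'Mont', a range of characters
    name * 2                         # 'MontyMonty', string multiplication
    name + '!'                       # 'Monty!', string addition
    ' '.join(['Monty', 'Python'])    # 'Monty Python', list of words to string
    'Monty Python'.split()           # ['Monty', 'Python'], string to list of words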

Simple Statistics

Frequency Distributions

Ranked Frequency Distribution

Vocabulary

Plot

Hapaxes

Words that occur only once.
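
A minimal sketch covering the pieces above, using FreqDist:

    from nltk import FreqDist

    fdist1 = FreqDist(text1)             # frequency distribution over Moby Dick
    fdist1.most_common(50)               # ranked: the 50 most frequent tokens
    fdist1['whale']                      # count for a single word
    list(fdist1.keys())[:20]             # a peek at the vocabulary
    fdist1.plot(50, cumulative=True)     # cumulative frequency plot (needs matplotlib)
    fdist1.hapaxes()                     # words that occur only once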

Importing

Fine Grained Selection of Words

Here we select words with something like set-builder notation:

{w | w ∈ V & P(w)}

  • fdist5[w] > 7 ensures that these words occur more than seven times.
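
Putting it together, assuming fdist5 is a frequency distribution over the chat corpus text5 (as in the NLTK book):

    fdist5 = FreqDist(text5)

    # Long words (more than 7 characters) that also occur more than 7 times
    sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7)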

Collocation and Bigrams

Bigrams

Bigrams are just pairs of adjacent tokens, the nth and the (n+1)th. If we build a frequency distribution over these pairs, we can find collocations.
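
For example:

    from nltk import bigrams

    list(bigrams(['more', 'is', 'said', 'than', 'done']))
    # [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]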

Collocations

For example, which book are these collocations from?

That’s right, Moby Dick!

In the end, collocations are word pairs like “red wine” that occur together unusually often.
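
Something like (this prints the top collocations of a Text object):

    text1.collocations()   # e.g. "Sperm Whale; Moby Dick; White Whale; ..."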

Counting

This gives us the distribution of word lengths, that is, how frequently each word length occurs.
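
A sketch, building the distribution over word lengths rather than over the words themselves:

    fdist = FreqDist(len(w) for w in text1)   # distribution over word lengths
    fdist.most_common()                       # (length, count) pairs, most frequent first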

Max Item

Get Count by Key
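
And then:

    fdist.max()     # the most common word length
    fdist[3]        # how many tokens of length 3 there are
    fdist.freq(3)   # the same, as a proportion of all tokens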

Conditionals

Conjunction and Disjunction
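
For example, with sent7 and text7 (a Wall Street Journal sentence and text from the book module):

    # A single condition
    [w for w in sent7 if len(w) < 4]

    # Conjunction (and): long capitalized words
    sorted(w for w in set(text7) if w.istitle() and len(w) > 10)

    # Disjunction (or): tokens that end with a period or are digits
    sorted(w for w in set(sent7) if w.endswith('.') or w.isdigit())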

Iteration

Removing Double Counts

Removing Non Alphabetic

Looping with Conditionals and Printing
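
A sketch combining these:

    # Unique, alphabetic-only vocabulary, ignoring case: removes double counts
    # like "This"/"this" as well as punctuation and numbers
    len(set(word.lower() for word in text1 if word.isalpha()))

    # Looping with a conditional and printing
    for word in sent1:
        if word.endswith('l'):
            print(word)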

Word Sense Disambiguation

Consider these words, each with several contextual definitions:

serve: help with food or drink; hold an office; put ball into play
dish: plate; course of a meal; communications device

Or consider the word by, whose sense shifts with context:

The lost children were found by the searchers (agentive)
The lost children were found by the mountain (locative)
The lost children were found by the afternoon (temporal)

Pronoun Resolution

Another problem is the understanding of “who did what to whom”, exemplified by:

The thieves stole the paintings. They were subsequently sold.
The thieves stole the paintings. They were subsequently caught.
The thieves stole the paintings. They were subsequently found.

Antecedent

“They” is a pronoun, so we need to find its antecedent, which could be either the paintings or the thieves.

Anaphora Resolution

This is a computational technique for identifying what a pronoun or noun phrase refers to.

Semantic Role Labeling

This is a computational technique for identifying how a noun phrase relates to the verb.

Generating Language Output

Once we can do the above tasks, we can tackle problems like question answering and machine translation.

Question Answering

Machine Translation

Correct translation always depends on a correct understanding of the original text.

Machine Translation Not Converging

Textual Entailment

Other Articles

Up Next…

In the next article, we will explore Chomsky’s hierarchy of languages as it is one of the formal pillars of computational linguistics, and its results continue to shape modern research and development in NLP.

For the table of contents and more content click here.

