Natural Language Tool Kit 3.5

Search Functions, Statistics, Pronoun Resolution

Jake Batsuuri
Computronium Blog
8 min read · Sep 30, 2020

Introduction

This article is meant to be a gentle introduction to NLTK. As always, we will try to balance mathematical rigor and programmatic ease of use with concrete, linguistically motivated examples.

In many ways, this article is the programmatic introduction to computational linguistics, and is a mirror to this article.

What is NLTK?

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

Getting Started

Importing

Importing a Book

Get Text Details
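
A minimal sketch of the setup, assuming NLTK is installed (pip install nltk) and you are working in an interactive session; the book corpora only need to be downloaded once:

    import nltk

    # Fetch the corpora used in the NLTK book (only needed once)
    nltk.download('book')

    # Loads text1 ... text9 and sent1 ... sent9 into the namespace
    from nltk.book import *

    # Basic details about a text
    text1         # <Text: Moby Dick by Herman Melville 1851>
    len(text1)    # total number of tokens in Moby Dick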

Basic Search Functions

Concordance

A concordance shows us every occurrence of a given string, with the context.
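
For example, with Moby Dick (text1) and the NLTK book's usual example word:

    # Every occurrence of "monstrous" in Moby Dick, with its surrounding context
    text1.concordance("monstrous")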

Similar

It is kind of the f⁻¹ of concordance: it gives us other words that are used in similar contexts to the given string.
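
Roughly:

    # Words used in contexts similar to "monstrous"
    text1.similar("monstrous")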

Common Contexts

This function brings up all contexts that are shared by two or more words. Here, by context we mean the words immediately surrounding the target, much like what concordance shows.
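
Something like, using Sense and Sensibility (text2):

    # Contexts shared by "monstrous" and "very"
    text2.common_contexts(["monstrous", "very"])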

Dispersion Plot

We can also find the location of a word in a text and plot it.

A Lexical Dispersion Plot that shows the density of Trump Tweets about Conspiracy Keywords
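
The figure above plots conspiracy-related keywords across Trump tweets; a minimal sketch of the same idea with the built-in inaugural corpus (text4), assuming matplotlib is installed:

    # Positional dispersion of selected words across the inaugural addresses
    text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])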

Counting Vocabulary

Tokens

Token is another, more NLP way of saying string. Specifically, it's a sequence of characters treated as a unit: a word, a punctuation mark, an emoji, and so on.
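
To count tokens we just take the length of the text; for example, with the Book of Genesis (text3):

    len(text3)    # total number of tokens in the Book of Genesis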

Sets

Remember that in mathematical sets, all duplicates collapse into a single element. So to count the number of unique tokens in a text, we turn the text into a set.
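
Roughly:

    set(text3)           # the vocabulary, duplicates collapsed
    sorted(set(text3))   # the same vocabulary in sorted order
    len(set(text3))      # the number of unique tokens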

Word Type

Word type is the form or spelling of a word, independent of its specific occurrences in the text.

Lexical Richness

This measures how varied the vocabulary of a text is. A common version is the ratio of unique word types to total tokens; its reciprocal gives the average number of times each word is used.
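
A minimal sketch, as a type-to-token ratio and its reciprocal:

    len(set(text3)) / len(text3)   # proportion of distinct words (lexical diversity)
    len(text3) / len(set(text3))   # average number of times each word type is used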

Word Occurrence

Word Occurrence as a Percentage
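
Something like:

    text3.count("smote")                  # how many times a word occurs
    100 * text4.count('a') / len(text4)   # its share of the text, as a percentage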

Texts as Lists of Words

Lists

The main way we think about a text is as a list of words, since this simplifies its analysis.
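
For example, sent1, the opening of Moby Dick, comes predefined with the book module:

    sent1        # ['Call', 'me', 'Ishmael', '.']
    len(sent1)   # 4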

Addition

Appending
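
A quick sketch of both:

    # Addition (concatenation) builds a new, longer list
    ['Monty', 'Python'] + ['and', 'the', 'Holy', 'Grail']

    # Appending modifies the list in place
    sent1.append("Some")
    sent1        # ['Call', 'me', 'Ishmael', '.', 'Some']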

Word from an Index

Index from a Word
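
For example, using the inaugural corpus text:

    text4[173]              # 'awaken'  (the word at a given index)
    text4.index('awaken')   # 173  (the index of the word's first occurrence)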

Slicing

Slicing is another term for getting a sub-list.

Slicing Middle

This includes the sixth, seventh and eighth elements.

Slicing Prefix

This gets you everything from the beginning up to, but not including, the given index; here, the first three elements.

Slicing Suffix
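
A sketch of all three slices, using a small made-up list:

    sent = ['word1', 'word2', 'word3', 'word4', 'word5',
            'word6', 'word7', 'word8', 'word9', 'word10']

    sent[5:8]   # middle: ['word6', 'word7', 'word8'], the sixth to eighth elements
    sent[:3]    # prefix: ['word1', 'word2', 'word3'], everything before index 3
    sent[5:]    # suffix: everything from index 5 to the end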

Variables

Strings

Characters

Characters Range

String Multiplication

String Addition

String Joins

String Split
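
A quick sketch of the string operations above:

    name = 'Monty'                   # a variable holding a string
    name[0]                          # 'M', a single character
    name[:4]                         # 'Mont', a range of characters
    name * 2                         # 'MontyMonty', string multiplication
    name + '!'                       # 'Monty!', string addition
    ' '.join(['Monty', 'Python'])    # 'Monty Python', list of words to string
    'Monty Python'.split()           # ['Monty', 'Python'], string to list of words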

Simple Statistics

Frequency Distributions

Ranked Frequency Distribution

Vocabulary

Plot

Hapaxes

Words that occur only once.
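
A minimal sketch covering the pieces above, using FreqDist:

    from nltk import FreqDist

    fdist1 = FreqDist(text1)             # frequency distribution over Moby Dick
    fdist1.most_common(50)               # ranked: the 50 most frequent tokens
    fdist1['whale']                      # count for a single word
    list(fdist1.keys())[:20]             # a peek at the vocabulary
    fdist1.plot(50, cumulative=True)     # cumulative frequency plot (needs matplotlib)
    fdist1.hapaxes()                     # words that occur only once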

Importing

Fine Grained Selection of Words

Here we select words with something like set-builder notation:

{w | w ∈ V & P(w)}

  • fdist5[w] > 7 ensures that these words occur more than seven times.
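
Putting it together, assuming fdist5 is a frequency distribution over the chat corpus text5 (as in the NLTK book):

    fdist5 = FreqDist(text5)

    # Long words (more than 7 characters) that also occur more than 7 times
    sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7)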

Collocation and Bigrams

Bigrams

Bigrams are just pairs of adjacent tokens, the nth and the (n+1)th. If we build a frequency distribution over these pairs, we can find collocations.
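
For example:

    from nltk import bigrams

    list(bigrams(['more', 'is', 'said', 'than', 'done']))
    # [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]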

Collocations

For example, which book are these collocations from?

That’s right, Moby Dick!

In the end, collocations are word pairs like “red wine” that occur together unusually often.
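
Something like (this prints the top collocations of a Text object):

    text1.collocations()   # e.g. "Sperm Whale; Moby Dick; White Whale; ..."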

Counting

This gives us the distribution of word lengths, that is, how frequently each word length occurs.
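
A sketch, building the distribution over word lengths rather than over the words themselves:

    fdist = FreqDist(len(w) for w in text1)   # distribution over word lengths
    fdist.most_common()                       # (length, count) pairs, most frequent first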

Max Item

Get Count by Key
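
And then:

    fdist.max()     # the most common word length
    fdist[3]        # how many tokens of length 3 there are
    fdist.freq(3)   # the same, as a proportion of all tokens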

Conditionals

Conjunction and Disjunction
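
For example, with sent7 and text7 (a Wall Street Journal sentence and text from the book module):

    # A single condition
    [w for w in sent7 if len(w) < 4]

    # Conjunction (and): long capitalized words
    sorted(w for w in set(text7) if w.istitle() and len(w) > 10)

    # Disjunction (or): tokens that end with a period or are digits
    sorted(w for w in set(sent7) if w.endswith('.') or w.isdigit())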

Iteration

Removing Double Counts

Removing Non Alphabetic

Looping with Conditionals and Printing
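
A sketch combining these:

    # Unique, alphabetic-only vocabulary, ignoring case: removes double counts
    # like "This"/"this" as well as punctuation and numbers
    len(set(word.lower() for word in text1 if word.isalpha()))

    # Looping with a conditional and printing
    for word in sent1:
        if word.endswith('l'):
            print(word)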

Word Sense Disambiguation

Consider these words, each with several contextual definitions:

serve: help with food or drink; hold an office; put ball into play
dish: plate; course of a meal; communications device

Or consider the word by, whose sense shifts with context:

The lost children were found by the searchers (agentive)
The lost children were found by the mountain (locative)
The lost children were found by the afternoon (temporal)

Pronoun Resolution

Another problem is the understanding of “who did what to whom”, exemplified by:

The thieves stole the paintings. They were subsequently sold.
The thieves stole the paintings. They were subsequently caught.
The thieves stole the paintings. They were subsequently found.

Antecedent

“They” is a pronoun, so we need to find its antecedent, which could be either the paintings or the thieves.

Anaphora Resolution

This is a computational technique for identifying what a pronoun or noun phrase refers to.

Semantic Role Labeling

This is a computational technique for identifying how a noun phrase relates to the verb.

Generating Language Output

Once we can do the above tasks, we can tackle problems like question answering and machine translation.

Question Answering

Machine Translation

Correct translation always depends on a correct understanding of the original text.

Machine Translation Not Converging

Textual Entailment

Other Articles

Up Next…

In the next article, we will explore Chomsky’s hierarchy of languages as it is one of the formal pillars of computational linguistics, and its results continue to shape modern research and development in NLP.

For the table of contents and more content click here.

