Natural Language Tool Kit 3.5

Search Functions, Statistics, Pronoun Resolution

Jake Batsuuri
Sep 30

Introduction

In many ways, this article is the programmatic introduction to computational linguistics, mirroring the more theoretical articles in this series.

What is NLTK?
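
NLTK (the Natural Language Toolkit) is an open-source Python library for working with human-language data: tokenizing, tagging, parsing, and classifying text, with dozens of bundled corpora. Version 3.5 is used throughout this article.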

Getting Started

Importing
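
Assuming NLTK is already installed (pip install nltk), the corpora behind the book examples are fetched with:

import nltk
nltk.download('book')  # downloads the texts used by the nltk.book module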

Importing a Book
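
Importing everything from nltk.book loads nine texts and nine sentences, and prints the banner below:

from nltk.book import *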

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

Get Text Details
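
Evaluating a text object by name prints a short description:

text1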

<Text: Moby Dick by Herman Melville 1851>

Basic Search Functions

Concordance
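
A concordance shows every occurrence of a word together with its surrounding context. The output below matches the NLTK book's example:

text1.concordance("monstrous")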

Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u

Similar
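
similar() lists words that appear in the same contexts as the query word; for "monstrous" in Moby Dick:

text1.similar("monstrous")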

true contemptible christian abundant few part mean careful puzzled
mystifying passing curious loving wise doleful gamesome singular
delightfully perilous fearless

Common Contexts
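
common_contexts() shows the contexts shared by two or more words. The output below matches the NLTK book's query against Sense and Sensibility:

text2.common_contexts(["monstrous", "very"])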

be_glad am_glad a_pretty is_pretty a_lucky

Dispersion Plot

[Figure: a lexical dispersion plot showing the density of Trump tweets about conspiracy keywords.]
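
The plot above was built from a corpus of Trump tweets, which isn't bundled with NLTK. As an illustrative stand-in, the NLTK book plots keywords across the inaugural corpus:

text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])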

Counting Vocabulary

Tokens
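
len() counts the tokens in a text. The text being measured isn't shown in the article; assuming the inaugural corpus (text4), which later examples also use:

len(text4)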

149797

Token is the NLP term for what a programmer would call a string. Specifically, it's a sequence of characters: words, punctuation, emojis, and so on.

Sets
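
set() collapses the tokens to the distinct items, so its length counts unique tokens:

len(set(text4))  # again assuming text4, to match the token count above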

9913

Word Type
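
Each distinct item in that set is a word type, so the count above is the number of word types (distinct words and punctuation symbols) in the text.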

Lexical Richness

Lexical richness relates the number of word types to the number of tokens; dividing tokens by types gives the average number of times each word occurs.
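
A minimal sketch of both measures, following the NLTK book's lexical_diversity:

def lexical_diversity(text):
    return len(set(text)) / len(text)  # proportion of distinct word types

len(text4) / len(set(text4))  # average occurrences per word type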

Word Occurrence
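
count() returns the number of occurrences of a word. The word being counted isn't shown in the article; a call of this shape produces the number below:

text4.count("nation")  # hypothetical word and text; the original query isn't shown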

21

Word Occurrence as a Percentage
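
Scaling the count by the text's length gives a percentage; the value below is consistent with the NLTK book's example of counting 'a' in text4:

100 * text4.count('a') / len(text4)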

1.457973123627309

Texts as Lists of Words

Lists

Addition
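
Lists concatenate with +; the output below is the first sentence of the inaugural corpus (sent4) followed by the first sentence of Moby Dick (sent1):

sent4 + sent1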

['Fellow',
'-',
'Citizens',
'of',
'the',
'Senate',
'and',
'of',
'the',
'House',
'of',
'Representatives',
':',
'Call',
'me',
'Ishmael',
'.']

Appending
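
append() adds a single item in place:

sent1.append('Some')  # evidently run twice, judging by the doubled 'Some' below
sent1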

['Call', 'me', 'Ishmael', '.', 'Some', 'Some']

Word from an Index
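
Indexing retrieves the token at a given position:

text4[173]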

awaken

Index from a Word
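
index() finds the position of the first occurrence of a token:

text4.index('awaken')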

173

Slicing
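
Slicing extracts a sublist by index range; the output below matches the NLTK book's slice of the chat corpus:

text5[16715:16735]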

['U86', 'thats', 'why', 'something', 'like', 'gamefly', 'is', 'so', 'good',
'because', 'you', 'can', 'actually', 'play', 'a', 'full', 'game', 'without',
'buying', 'it']

Slicing Middle

This includes the sixth, seventh and eighth elements.
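
Using the NLTK book's toy list of ten words:

sent = ['word1', 'word2', 'word3', 'word4', 'word5',
        'word6', 'word7', 'word8', 'word9', 'word10']
sent[5:8]  # zero-based indexing: the sixth through eighth elements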

['word6', 'word7', 'word8']

Slicing Prefix

This gets you everything from the beginning up to and including the third element.
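
Omitting the start index slices from the beginning:

sent[:3]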

['word1', 'word2', 'word3']

Slicing Suffix
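
Omitting the end index slices through to the end; the output below is the closing passage of Sense and Sensibility:

text2[141525:]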

['among', 'the', 'merits', 'and', 'the', 'happiness', 'of', 'Elinor', 'and', 'Marianne',
',', 'let', 'it', 'not', 'be', 'ranked', 'as', 'the', 'least', 'considerable', ',',
'that', 'though', 'sisters', ',', 'and', 'living', 'almost', 'within', 'sight', 'of',
'each', 'other', ',', 'they', 'could', 'live', 'without', 'disagreement', 'between',
'themselves', ',', 'or', 'producing', 'coolness', 'between', 'their', 'husbands', '.',
'THE', 'END']

Variables

Strings

Characters
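
All of the string examples that follow assume the NLTK book's variable, name = 'Monty'. Indexing a string returns a single character:

name = 'Monty'
name[0]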

'M'

Characters Range
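
Slicing a string returns a substring:

name[:4]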

'Mont'

String Multiplication
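
Multiplication repeats a string:

name * 2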

'MontyMonty'

String Addition
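
Addition concatenates strings:

name + '!'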

'Monty!'

String Joins
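
join() glues a list of strings together with the given separator:

' '.join(['Monty', 'Python'])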

'Monty Python'

String Split
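
split() breaks a string apart, on whitespace by default:

'Monty Python'.split()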

['Monty', 'Python']

Simple Statistics

Frequency Distributions
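
A frequency distribution tallies how many times each token occurs:

from nltk import FreqDist

fdist1 = FreqDist(text1)  # counts every token in Moby Dick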

Ranked Frequency Distribution
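
most_common() ranks the entries by count:

fdist1.most_common(50)  # the 50 most frequent tokens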

Vocabulary
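
The distribution's keys are the vocabulary, i.e. the distinct tokens of the text:

vocabulary = fdist1.keys()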

Plot
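
plot() draws the frequencies of the top tokens; the NLTK book's version plots the cumulative counts (matplotlib must be installed):

fdist1.plot(50, cumulative=True)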

Hapaxes
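
hapaxes() lists the words that occur exactly once:

fdist1.hapaxes()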

Importing

Fine Grained Selection of Words

{w | w ∈ V & P(w)}

  • fdist5[w] > 7 ensures that these words occur more than seven times.
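
Translated into Python, with the vocabulary V = set(text5) and fdist5 a frequency distribution over the chat corpus; the NLTK book's version of this selection also requires len(w) > 7:

fdist5 = FreqDist(text5)
sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7)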

Collocation and Bigrams

Bigrams
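
bigrams() pairs each token with its successor. The article doesn't show the input text, but judging by the output it is a synopsis of the film Tenet; a truncated stand-in:

from nltk import bigrams, word_tokenize

# stand-in for the article's (unshown) source text
synopsis = ("After having a near-death experience, an unnamed CIA operative, "
            "known as the Protagonist, is selected")
list(bigrams(word_tokenize(synopsis)))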

[('After', 'having'),
('having', 'a'),
('a', 'near-death'),
('near-death', 'experience'),
('experience', ','),
(',', 'an'),
('an', 'unnamed'),
('unnamed', 'CIA'),
('CIA', 'operative'),
('operative', ','),
(',', 'known'),
('known', 'as'),
('as', 'the'),
('the', 'Protagonist'),
('Protagonist', ','),
(',', 'is'),
('is', 'selected'),
...
('future', 'of'),
('of', 'our'),
('our', 'world'),
('world', 'is'),
('is', 'hanging'),
('hanging', 'by'),
('by', 'a'),
('a', 'thread'),
('thread', '.'),
('.', 'Can'),
('Can', 'two'),
('two', 'agents'),
('agents', 'working'),
('working', 'alone'),
('alone', 'avert'),
('avert', 'the'),
('the', 'impending'),
('impending', 'Armageddon'),
('Armageddon', '?')]

Bigrams are just pairs of adjacent tokens, the nth and the (n+1)th. If we run a frequency distribution over these pairs, we can find collocations.

Collocations
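
collocations() prints the word pairs that occur together far more often than chance would predict:

text1.collocations()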

Sperm Whale; White Whale; old man; Captain Ahab; sperm
whale; Right Whale; Captain Peleg; New Bedford; Cape Horn; cried Ahab;
years ago; lower jaw; never mind; Father Mapple; cried Stubb; chief
mate; white whale; ivory leg; one hand

For example, which book are these collocations from?

That’s right, Moby Dick!

In the end, collocations are sequences of words like “red wine” that occur together unusually often.

Counting
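
FreqDist can count anything hashable, not just words; here it counts word lengths:

fdist = FreqDist(len(w) for w in text1)
fdist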

FreqDist({1: 47933,
2: 38513,
3: 50223,
4: 42345,
5: 26597,
6: 17111,
7: 14399,
8: 9966,
9: 6428,
10: 3528,
11: 1873,
12: 1053,
13: 567,
14: 177,
15: 70,
16: 22,
17: 12,
18: 1,
20: 1})

This gives us each word length and the number of words of that length in the text.

Max Item
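
max() returns the key with the highest count, here the most common word length:

fdist.max()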

3

Get Count by Key
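
Indexing the distribution by a key returns its count; 50,223 of Moby Dick's words are three characters long:

fdist[3]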

50223

Conditionals

Conjunction and Disjunction
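
Conditions can be combined with and (conjunction) and or (disjunction), for example:

sorted(w for w in set(text7) if '-' in w and 'index' in w)
sorted(w for w in set(text2) if 'cie' in w or 'ei' in w)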

Iteration
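
A list comprehension visits every token; here each one is uppercased:

[w.upper() for w in text1]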

['[', 'MOBY', 'DICK', 'BY', 'HERMAN', 'MELVILLE', '1851', ']', 'ETYMOLOGY', '.', ...]

Removing Double Counts
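
Lowercasing tokens before building the set stops 'This' and 'this' from being counted as two words:

len(set(word.lower() for word in text1))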

Removing Non Alphabetic
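
Adding an isalpha() filter also drops punctuation and numbers:

len(set(word.lower() for word in text1 if word.isalpha()))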

Looping with Conditionals and Printing
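
A sketch of the two loops behind the output below: the first prints only the titlecase words of sent1, the second classifies every token:

for word in sent1:
    if word.istitle():
        print(word)

for word in sent1:
    if word.islower():
        print(word, 'is a lowercase word')
    elif word.istitle():
        print(word, 'is a titlecase word')
    else:
        print(word, 'is punctuation')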

Call
Ishmael
Call is a titlecase word
me is a lowercase word
Ishmael is a titlecase word
. is punctuation

Word Sense Disambiguation

a. serve: help with food or drink; hold an office; put ball into play
b. dish: plate; course of a meal; communications device

Or consider the word by:

a. The lost children were found by the searchers (agentive)
b. The lost children were found by the mountain (locative)
c. The lost children were found by the afternoon (temporal)
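
NLTK ships a simple Lesk-based disambiguator in nltk.wsd. It isn't used in the original article, but a minimal illustration, assuming the WordNet data has been downloaded, looks like this:

from nltk import word_tokenize
from nltk.wsd import lesk

# picks the WordNet sense of 'dish' that best overlaps this context
print(lesk(word_tokenize("She served the dish to the guests"), 'dish'))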

Pronoun Resolution

a. The thieves stole the paintings. They were subsequently sold.
b. The thieves stole the paintings. They were subsequently caught.
c. The thieves stole the paintings. They were subsequently found.

Antecedent
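
The noun phrase a pronoun refers back to is called its antecedent; in the sentences above, the candidate antecedents for they are the thieves and the paintings.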

Anaphora Resolution
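
Anaphora resolution is the task of identifying what a pronoun or noun phrase refers to, that is, of choosing the right antecedent.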

Semantic Role Labeling
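
Semantic role labeling identifies how each noun phrase relates to the verb: as agent, patient, instrument, and so on.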

Generating Language Output
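
Getting pronouns right also matters when a system has to produce language, as the question answering and machine translation examples below show.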

Question Answering

a. Text: ... The thieves stole the paintings. They were subsequently sold. ...
b. Human: Who or what was sold?
c. Machine: The paintings.

Machine Translation

a. The thieves stole the paintings. They were subsequently found.
b. Les voleurs ont volé les peintures. Ils ont été trouvés plus tard. (the thieves)
c. Les voleurs ont volé les peintures. Elles ont été trouvées plus tard. (the paintings)

Correct translation always depends on a correct understanding of the original text.

Machine Translation Not Converging
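
The NLTK book illustrated this with its old babelize_shell() demo (since removed from the library): translating a sentence back and forth between English and German never converges on a stable result.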

0> how long before the next flight to Alice Springs?
1> wie lang vor dem folgenden Flug zu Alice Springs?
2> how long before the following flight to Alice jump?
3> wie lang vor dem folgenden Flug zu Alice springen Sie?
4> how long before the following flight to Alice do you jump?
5> wie lang, bevor der folgende Flug zu Alice tun, Sie springen?
6> how long, before the following flight to Alice does, do you jump?
7> wie lang bevor der folgende Flug zu Alice tut, tun Sie springen?
8> how long before the following flight to Alice does, do you jump?
9> wie lang, bevor der folgende Flug zu Alice tut, tun Sie springen?
10> how long, before the following flight does to Alice, do do you jump?
11> wie lang bevor der folgende Flug zu Alice tut, Sie tun Sprung?
12> how long before the following flight does leap to Alice, does you?
0> The pig that John found looked happy
1> Das Schwein, das John fand, schaute glücklich
2> The pig, which found John, looked happy

Textual Entailment

a. Text: David Golinkin is the editor or author of 18 books, and over 150
responsa, articles, sermons and books
b. Hypothesis: Golinkin has written 18 books

Other Articles

This post is part of a series of stories that explores the fundamentals of natural language processing:

1. Context of Natural Language Processing
Motivations, Disciplines, Approaches, Outlook
2. Notes on Formal Language Theory
Objects, Operations, Regular Expressions and Finite State Automata
3. Natural Language Tool Kit 3.5
Search Functions, Statistics, Pronoun Resolution

Up Next…

For the table of contents and more content click here.

References

Eisenstein, Jacob. Introduction to Natural Language Processing. The MIT Press, 2019.

Bird, Steven, et al. Natural Language Processing with Python. O’Reilly, 2009.

Jurafsky, Dan, and James H. Martin. Speech and Language Processing. Pearson, 2014.

Barker-Plummer, Dave, et al. Language, Proof and Logic. CSLI Publications, 2011.
