Natural Language Tool Kit 3.5
Search Functions, Statistics, Pronoun Resolution
Introduction
This article is meant to be a gentle introduction to NLTK. As always, we will try to balance mathematical rigor and programmatic ease of use with concrete, linguistically motivated examples.
In many ways, this article is the programmatic introduction to computational linguistics, and mirrors the more conceptual article earlier in this series.
What is NLTK?
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.
Getting Started
Importing
Importing a Book
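A minimal setup sketch, assuming NLTK is already installed via `pip install nltk`; the star-import binds `text1`..`text9` and `sent1`..`sent9` and prints the banner shown below:

```python
import nltk

# One-time download of the book corpora, then the import used throughout
# this article (uncomment on first run):
# nltk.download("book")
# from nltk.book import *
```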
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
Get Text Details
<Text: Moby Dick by Herman Melville 1851>
Basic Search Functions
Concordance
A concordance shows us every occurrence of a given string together with its surrounding context.
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
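The NLTK call producing the output above is `text1.concordance("monstrous")`. Under the hood, a concordance just scans the token list and prints a window around each hit; a minimal pure-Python sketch of the idea, on a toy token list:

```python
def concordance(tokens, word, width=2):
    """Collect every occurrence of `word` with `width` tokens of context."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            window = tokens[max(0, i - width):i + width + 1]
            lines.append(" ".join(window))
    return lines

toy = "we saw a most monstrous whale , a monstrous size indeed".split()
for line in concordance(toy, "monstrous"):
    print(line)
# a most monstrous whale ,
# , a monstrous size indeed
```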
Similar
The similar function is, loosely, the f⁻¹ of concordance: instead of showing the contexts of one word, it gives us other words that are used in similar contexts to the given string.
true contemptible christian abundant few part mean careful puzzled
mystifying passing curious loving wise doleful gamesome singular
delightfully perilous fearless
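The NLTK call here is `text1.similar("monstrous")`. A rough sketch of the idea, collecting words that share a (previous, next) context with the target; the real implementation also ranks candidates by frequency:

```python
from collections import defaultdict

def similar(tokens, word):
    """Words that appear in the same (prev, next) context as `word`."""
    slots = defaultdict(set)  # (prev, next) -> words seen in that slot
    for prev, cur, nxt in zip(tokens, tokens[1:], tokens[2:]):
        slots[(prev, nxt)].add(cur)
    shared = set()
    for words in slots.values():
        if word in words:
            shared |= words - {word}
    return sorted(shared)

toy = "a very monstrous whale and a very curious whale".split()
print(similar(toy, "monstrous"))  # ['curious']
```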
Common Contexts
This function brings up all contexts that are shared by two or more words. Here a context is the pair of words immediately surrounding the target, much like a single concordance line.
be_glad am_glad a_pretty is_pretty a_lucky
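The NLTK call is `text2.common_contexts(["monstrous", "very"])`; an output like `a_pretty` means both words were seen between "a" and "pretty". A pure-Python sketch on toy data:

```python
def common_contexts(tokens, w1, w2):
    """Contexts (prev, next) in which both w1 and w2 occur."""
    def contexts(word):
        return {(p, n) for p, c, n in zip(tokens, tokens[1:], tokens[2:])
                if c == word}
    return contexts(w1) & contexts(w2)

toy = "a pretty girl and a lucky girl smiled".split()
print(common_contexts(toy, "pretty", "lucky"))  # {('a', 'girl')}
```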
Dispersion Plot
We can also find the location of a word in a text and plot it.
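The NLTK call is `text4.dispersion_plot([...])`, which requires matplotlib. All the plot really marks is the token offset of each occurrence; computing those offsets is one line:

```python
# Toy stand-in for text4; a dispersion plot draws a tick at each offset.
toy = "citizens of the union , citizens of America".split()
offsets = [i for i, tok in enumerate(toy) if tok == "citizens"]
print(offsets)  # [0, 5]
```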
Counting Vocabulary
Tokens
149797
Token is the NLP term for such a string unit. Specifically, it's a sequence of characters, which may be a word, a punctuation mark, an emoji, and so on.
Sets
Remember that in a mathematical set, all duplicates collapse into a single element. So to count the number of unique tokens in a text, we take the length of its set.
9913
Word Type
A word type is the form or spelling of a word, independent of any particular occurrence in the text.
Lexical Richness
Lexical richness here is the average number of times each word type occurs: the total token count divided by the number of unique types.
Word Occurrence
21
Word Occurrence as a Percentage
1.457973123627309
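The counts above come from `len(text)`, `len(set(text))`, `text.count(word)` and a simple ratio. A sketch on a toy token list (the numbers below are for the toy list, not the corpus):

```python
tokens = ["in", "the", "beginning", "the", "word", "was", "the", "word", "."]

n_tokens = len(tokens)                # total tokens
n_types = len(set(tokens))            # unique word types
richness = n_tokens / n_types         # average occurrences per type
count_the = tokens.count("the")       # occurrences of a single word
pct_the = 100 * count_the / n_tokens  # occurrence as a percentage

print(n_tokens, n_types, richness, count_the, pct_the)
```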
Texts as Lists of Words
Lists
The main way we think about a text is as a list of words, since this simplifies its analysis.
Addition
['Fellow',
'-',
'Citizens',
'of',
'the',
'Senate',
'and',
'of',
'the',
'House',
'of',
'Representatives',
':',
'Call',
'me',
'Ishmael',
'.']
Appending
['Call', 'me', 'Ishmael', '.', 'Some', 'Some']
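List addition concatenates two texts (above, `sent4 + sent1`), while `append` adds a single element in place; running `append` twice is why `'Some'` appears twice. A sketch with `sent4` abridged to its first tokens:

```python
sent1 = ["Call", "me", "Ishmael", "."]
sent4_start = ["Fellow", "-", "Citizens"]  # first tokens of sent4 (abridged)

combined = sent4_start + sent1  # addition concatenates two lists
print(combined)

sent1.append("Some")  # append modifies the list in place
sent1.append("Some")
print(sent1)  # ['Call', 'me', 'Ishmael', '.', 'Some', 'Some']
```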
Word from an Index
awaken
Index from a Word
173
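Indexing works in both directions: `text4[173]` returns the token at an offset, and `text4.index('awaken')` returns the first offset of a token. On a toy list:

```python
toy = ["will", "surely", "awaken", "the", "nation"]
print(toy.index("awaken"))  # 2 : index from a word
print(toy[2])               # 'awaken' : word from an index
```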
Slicing
Slicing is another term for getting a sub-list.
['U86', 'thats', 'why', 'something', 'like', 'gamefly', 'is', 'so', 'good',
'because', 'you', 'can', 'actually', 'play', 'a', 'full', 'game', 'without',
'buying', 'it']
Slicing Middle
This includes the sixth, seventh and eighth elements.
['word6', 'word7', 'word8']
Slicing Prefix
This gets you everything from the beginning up to and including the third element.
['word1', 'word2', 'word3']
Slicing Suffix
['among', 'the', 'merits', 'and', 'the', 'happiness', 'of', 'Elinor', 'and', 'Marianne',
',', 'let', 'it', 'not', 'be', 'ranked', 'as', 'the', 'least', 'considerable', ',',
'that', 'though', 'sisters', ',', 'and', 'living', 'almost', 'within', 'sight', 'of',
'each', 'other', ',', 'they', 'could', 'live', 'without', 'disagreement', 'between',
'themselves', ',', 'or', 'producing', 'coolness', 'between', 'their', 'husbands', '.',
'THE', 'END']
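The slices above follow Python's half-open convention: `sent[5:8]` starts at index 5 (the sixth element) and stops before index 8. All three together:

```python
sent = ["word1", "word2", "word3", "word4", "word5",
        "word6", "word7", "word8", "word9", "word10"]

print(sent[5:8])  # middle: ['word6', 'word7', 'word8']
print(sent[:3])   # prefix: ['word1', 'word2', 'word3']
print(sent[8:])   # suffix: ['word9', 'word10']
```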
Variables
Strings
Characters
'M'
Characters Range
'Mont'
String Multiplication
'MontyMonty'
String Addition
'Monty!'
String Joins
'Monty Python'
String Split
['Monty', 'Python']
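The string operations above, gathered in one sketch:

```python
name = "Monty"
print(name[0])                        # 'M' : a character
print(name[:4])                       # 'Mont' : a character range
print(name * 2)                       # 'MontyMonty' : multiplication
print(name + "!")                     # 'Monty!' : addition
print(" ".join(["Monty", "Python"]))  # 'Monty Python' : join
print("Monty Python".split())         # ['Monty', 'Python'] : split
```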
Simple Statistics
Frequency Distributions
Ranked Frequency Distribution
Vocabulary
Plot
Hapaxes
Words that occur only once.
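NLTK's `FreqDist` is a subclass of Python's `collections.Counter`, so its core operations can be sketched with a `Counter`. The real calls behind this section are `FreqDist(text1)`, `fdist1.most_common(50)`, `fdist1.plot(50, cumulative=True)` and `fdist1.hapaxes()`:

```python
from collections import Counter

tokens = "the whale the sea the lone harpoon".split()
fdist = Counter(tokens)

print(fdist.most_common(2))  # ranked frequency distribution
print(sorted(fdist))         # the vocabulary
print(sorted(w for w, c in fdist.items() if c == 1))  # hapaxes
```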
Fine Grained Selection of Words
Here we select words using set-builder notation:
{w| w ∈ V & P(w)}
- fdist5[w] > 7 ensures that these words occur more than seven times.
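A sketch of the set-builder selection, with a `Counter` standing in for `fdist5`; the thresholds are lowered to fit the toy data, where the original filters on `fdist5[w] > 7`:

```python
from collections import Counter

tokens = "a chat corpus has mystifying words , delightfully mystifying words".split()
fdist5 = Counter(tokens)
V = set(tokens)

# {w | w in V and P(w)}, with P(w): a long word occurring more than once
selected = sorted(w for w in V if len(w) > 7 and fdist5[w] > 1)
print(selected)  # ['mystifying']
```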
Collocation and Bigrams
Bigrams
[('After', 'having'),
('having', 'a'),
('a', 'near-death'),
('near-death', 'experience'),
('experience', ','),
(',', 'an'),
('an', 'unnamed'),
('unnamed', 'CIA'),
('CIA', 'operative'),
('operative', ','),
(',', 'known'),
('known', 'as'),
('as', 'the'),
('the', 'Protagonist'),
('Protagonist', ','),
(',', 'is'),
('is', 'selected'),
...
('future', 'of'),
('of', 'our'),
('our', 'world'),
('world', 'is'),
('is', 'hanging'),
('hanging', 'by'),
('by', 'a'),
('a', 'thread'),
('thread', '.'),
('.', 'Can'),
('Can', 'two'),
('two', 'agents'),
('agents', 'working'),
('working', 'alone'),
('alone', 'avert'),
('avert', 'the'),
('the', 'impending'),
('impending', 'Armageddon'),
('Armageddon', '?')]
Bigrams are just the tokens at positions n and n+1 paired together. If we run a frequency distribution over them, we can get collocations.
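Pairing each token with its successor is a one-liner with `zip` (NLTK also provides `nltk.bigrams`):

```python
tokens = ["After", "having", "a", "near-death", "experience"]
pairs = list(zip(tokens, tokens[1:]))
print(pairs)
# [('After', 'having'), ('having', 'a'), ('a', 'near-death'),
#  ('near-death', 'experience')]
```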
Collocations
Sperm Whale; White Whale; old man; Captain Ahab; sperm
whale; Right Whale; Captain Peleg; New Bedford; Cape Horn; cried Ahab;
years ago; lower jaw; never mind; Father Mapple; cried Stubb; chief
mate; white whale; ivory leg; one hand
For example, which book are these collocations from?
That’s right, Moby Dick!
In the end, collocations are word pairs like “red wine” that occur together unusually often.
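Collocation finding starts from bigram frequencies; a naive sketch simply counts bigrams, whereas NLTK's `text1.collocations()` additionally filters with statistical association measures so that rare but strongly associated pairs also rank well:

```python
from collections import Counter

tokens = "red wine and red wine with red tape".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
print(bigram_counts.most_common(1))  # [(('red', 'wine'), 2)]
```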
Counting
FreqDist({1: 47933,
2: 38513,
3: 50223,
4: 42345,
5: 26597,
6: 17111,
7: 14399,
8: 9966,
9: 6428,
10: 3528,
11: 1873,
12: 1053,
13: 567,
14: 177,
15: 70,
16: 22,
17: 12,
18: 1,
20: 1})
This gives us word lengths and their frequencies.
Max Item
3
Get Count by Key
50223
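The distribution above comes from `FreqDist(len(w) for w in text1)`; `.max()` returns the most frequent key, and indexing gives a count by key. With `Counter` on a toy list:

```python
from collections import Counter

tokens = "the cat sat on the mat".split()
fdist = Counter(len(w) for w in tokens)

print(fdist)                      # word lengths and their frequencies
print(max(fdist, key=fdist.get))  # 3 : the most common word length
print(fdist[3])                   # 5 : count for key 3
```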
Conditionals
Conjunction and Disjunction
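Conditionals filter a text with a list comprehension, and `and`/`or` combine the tests (conjunction and disjunction). A sketch on the opening tokens of `sent7`, abridged here:

```python
sent7_start = ["Pierre", "Vinken", ",", "61", "years", "old"]

print([w for w in sent7_start if len(w) < 4])                  # short tokens
print([w for w in sent7_start if w.istitle() and len(w) > 5])  # conjunction
```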
Iteration
['[', 'MOBY', 'DICK', 'BY', 'HERMAN', 'MELVILLE', '1851', ']', 'ETYMOLOGY', '.', ...]
Removing Double Counts
Removing Non Alphabetic
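`set()` collapses duplicate tokens, and `isalpha()` drops punctuation and numbers; lowercasing first also merges capitalised variants:

```python
tokens = ["MOBY", "Dick", "Dick", "1851", "]"]

print(len(tokens), len(set(tokens)))                # 5 4 : double counts removed
alpha = {w.lower() for w in tokens if w.isalpha()}  # alphabetic, case-folded
print(sorted(alpha))                                # ['dick', 'moby']
```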
Looping with Conditionals and Printing
Call
Ishmael
Call is a titlecase word
me is a lowercase word
Ishmael is a titlecase word
. is punctuation
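The first two lines above (Call, Ishmael) come from a simple loop printing the tokens of `sent1` that end in "l"; the classified lines come from a loop over string predicates, sketched here:

```python
for token in ["Call", "me", "Ishmael", "."]:
    if token.islower():
        print(token, "is a lowercase word")
    elif token.istitle():
        print(token, "is a titlecase word")
    else:
        print(token, "is punctuation")
```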
Word Sense Disambiguation
Consider the words with contextual definitions:
a. serve: help with food or drink; hold an office; put ball into play
b. dish: plate; course of a meal; communications device
Or consider the word by:
a. The lost children were found by the searchers (agentive)
b. The lost children were found by the mountain (locative)
c. The lost children were found by the afternoon (temporal)
Pronoun Resolution
Another problem is the understanding of “who did what to whom”, exemplified by:
a. The thieves stole the paintings. They were subsequently sold.
b. The thieves stole the paintings. They were subsequently caught.
c. The thieves stole the paintings. They were subsequently found.
Antecedent
“They” is a pronoun, so we need to find its antecedent, which could be either the paintings or the thieves.
Anaphora Resolution
This is a computational technique for identifying what a pronoun or noun phrase refers to.
Semantic Role Labeling
This is a computational technique for identifying how a noun phrase relates to the verb.
Generating Language Output
Once we can do the above tasks, we can tackle problems like question answering and machine translation.
Question Answering
a. Text: ... The thieves stole the paintings. They were subsequently sold. ...
b. Human: Who or what was sold?
c. Machine: The paintings.
Machine Translation
a. The thieves stole the paintings. They were subsequently found.
b. Les voleurs ont volé les peintures. Ils ont été trouvés plus tard. (the thieves)
c. Les voleurs ont volé les peintures. Elles ont été trouvées plus tard. (the paintings)
Correct translation always depends on a correct understanding of the original text.
Machine Translation Not Converging
0> how long before the next flight to Alice Springs?
1> wie lang vor dem folgenden Flug zu Alice Springs?
2> how long before the following flight to Alice jump?
3> wie lang vor dem folgenden Flug zu Alice springen Sie?
4> how long before the following flight to Alice do you jump?
5> wie lang, bevor der folgende Flug zu Alice tun, Sie springen?
6> how long, before the following flight to Alice does, do you jump?
7> wie lang bevor der folgende Flug zu Alice tut, tun Sie springen?
8> how long before the following flight to Alice does, do you jump?
9> wie lang, bevor der folgende Flug zu Alice tut, tun Sie springen?
10> how long, before the following flight does to Alice, do do you jump?
11> wie lang bevor der folgende Flug zu Alice tut, Sie tun Sprung?
12> how long before the following flight does leap to Alice, does you?
0> The pig that John found looked happy
1> Das Schwein, das John fand, schaute glücklich
2> The pig, which found John, looked happy
Textual Entailment
a. Text: David Golinkin is the editor or author of 18 books, and over 150
responsa, articles, sermons and books
b. Hypothesis: Golinkin has written 18 books
Other Articles
This post is part of a series of stories that explores the fundamentals of natural language processing:
1. Context of Natural Language Processing: Motivations, Disciplines, Approaches, Outlook
2. Notes on Formal Language Theory: Objects, Operations, Regular Expressions and Finite State Automata
3. Natural Language Tool Kit 3.5: Search Functions, Statistics, Pronoun Resolution
Up Next…
In the next article, we will explore Chomsky’s hierarchy of languages as it is one of the formal pillars of computational linguistics, and its results continue to shape modern research and development in NLP.
For the table of contents and more content click here.
References
Clark, Alexander. The Handbook of Computational Linguistics and Natural Language Processing. Wiley-Blackwell, 2013.
Eisenstein, Jacob. Introduction to Natural Language Processing. The MIT Press, 2019.
Bird, Steven, et al. Natural Language Processing with Python. O’Reilly, 2009.
Jurafsky, Dan, and James H. Martin. Speech and Language Processing. Pearson, 2014.
Barker-Plummer, Dave, et al. Language, Proof and Logic. CSLI Publ., Center for the Study of Language and Information, 2011.