Inputting & PreProcessing Text
Input Methods, String & Unicode, Regular Expression Use Cases
NLTK has preprocessed texts. But we can also import and process our own texts.
Importing
import nltk, re, pprint
To Import a Book as a Txt
urlopen comes with Python's standard library (urllib.request), so nothing needs to be installed:
import urllib.request
url = "https://www.gutenberg.org/files/2554/2554-0.txt"
raw = urllib.request.urlopen(url).read().decode('utf-8')
type(raw)
# <class 'str'>
len(raw)
# 1176831
raw[:75]
# 'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'
Tokenization:
tokens = nltk.word_tokenize(raw)
type(tokens)
# <class 'list'>
len(tokens)
# 255809
tokens[:10]
# ['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']
Textization, or just wrapping the tokens in NLTK's Text object so we can run methods like collocations:
text = nltk.Text(tokens)
type(text)
# <class 'nltk.text.Text'>
text[1020:1060]
# ['CHAPTER', 'I', 'On', 'an', 'exceptionally', 'hot', 'evening', 'early', 'in',
#  'July', 'a', 'young', 'man', 'came', 'out', 'of', 'the', 'garret', 'in',
#  'which', 'he', 'lodged', 'in', 'S', '.', 'Place', 'and', 'walked', 'slowly',
#  ',', 'as', 'though', 'in', 'hesitation', ',', 'towards', 'K', '.', 'bridge', '.']
text.collocations()
# Katerina Ivanovna; Pulcheria Alexandrovna; Avdotya Romanovna; Pyotr
# Petrovitch; Project Gutenberg; Marfa Petrovna; Rodion Romanovitch;
# Sofya Semyonovna; Nikodim Fomitch; did not; Hay Market; Andrey
# Semyonovitch; old woman; Literary Archive; Dmitri Prokofitch; great
# deal; United States; Praskovya Pavlovna; Porfiry Petrovitch; ear rings
Getting Just the Good Stuff
raw.find("CHAPTER I")
# 1007
raw.find("THE END")
# 148848
raw = raw[1007:148848]
A lot of books come with headers and footers (here, the Project Gutenberg boilerplate); we find the index where the content starts and the index where it ends, and slice out everything in between. Since find and rfind are string methods, we apply them to the raw string; after trimming, re-run word_tokenize and nltk.Text on the result.
If "THE END" occurs more than once, you can use:
raw.rfind("THE END")
which searches from the end of the string and returns the index of the last occurrence.
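As a toy illustration (using a made-up string, not the novel), find returns the index of the first occurrence and rfind the index of the last:
s = "THE END of part one ... THE END"
s.find("THE END")
# 0
s.rfind("THE END")
# 24
s.find("MISSING")
# -1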
Handling HTML
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urllib.request.urlopen(url).read()
print(html)
The printout:
b'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">\r\n<html>\r\n<head>\r\n<title>BBC NEWS | Health | Blondes \'to die out in 200 years\'</title>\r\n<meta name="keywords" content="BBC, News, BBC News, news online, world, uk, international, foreign, british, online, service">\r\n<meta (none)~RS~a~RS~International~RS~q~RS~~RS~z~RS~25~RS~">\r\n</noscript>\r\n\r\n\r\n\r\n<br>\r\n<link type="text/css" rel="stylesheet" href="/nol/shared/stylesheets/uki_globalstylesheet.css">\r\n\r\n</body>\r\n</html>\r\n'
Not that great, so let’s clean it up:
!pip install beautifulsoup4
Beautiful Soup has tons of easy methods for getting at the text:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())
Stringify:
text = soup.get_text()
type(text)
Tokenify:
tokens = nltk.word_tokenize(text)
tokens
Textify:
text = nltk.Text(tokens)
text.concordance('gene')
Reading Local Files
f = open('document.txt')
raw = f.read()
If Python can't find the file, check that it's in the current working directory:
import os
os.listdir('.')
Print line by line:
f = open('document.txt', 'r')
for line in f:
    print(line.strip())
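A common alternative in modern Python (not specific to NLTK) is a with block, which closes the file automatically when the block ends:
with open('document.txt', 'r') as f:
    for line in f:
        print(line.strip())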
Binary Files
Text sometimes arrives in PDF or Word format; there are libraries for processing these, such as pypdf and pywin32.
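As a rough sketch of what PDF extraction looks like with pypdf (the file name report.pdf is just a placeholder, and the exact API can vary between pypdf versions):
import nltk
from pypdf import PdfReader

reader = PdfReader('report.pdf')    # placeholder path to a local PDF
# pull the text of each page and join the pages into one string
raw = '\n'.join(page.extract_text() for page in reader.pages)
tokens = nltk.word_tokenize(raw)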
User Input
s = input("Enter some text: ")
print("You typed", len(nltk.word_tokenize(s)), "words.")
Strings
Single Quote
monty = 'Monty Python'
monty
'Monty Python'
However, if you want a single quote inside the string itself, you need to escape it:
circus = 'Monty Python\'s Flying Circus'
circus
"Monty Python's Flying Circus"
Or you can use the…
Double Quotation Mark
circus = "Monty Python's Flying Circus"
circus
"Monty Python's Flying Circus"
Triple Quotation Mark
couplet = "Shall I compare thee to a Summer's day?"\
"Thou are more lovely and more temperate:"print(couplet)
# Shall I compare thee to a Summer's day?Thou are more lovely and more temperate:
The problem is, the above doesn’t print newlines, but this does:
couplet = """Shall I compare thee to a Summer's day?
Thou are more lovely and more temperate:"""print(couplet)
# Shall I compare thee to a Summer's day?
# Thou are more lovely and more temperate:
This also works with three single quotation marks:
couplet = '''Rough winds do shake the darling buds of May,
And Summer's lease hath all too short a date:'''
print(couplet)
# Rough winds do shake the darling buds of May,
# And Summer's lease hath all too short a date:
Concatenation
'very' + 'very' + 'very'
# 'veryveryvery'
'very' * 3
# 'veryveryvery'
Printing
grail = 'Holy Grail'
print(monty + grail)
# Monty PythonHoly Grail
print(monty, grail)
# Monty Python Holy Grail
print(monty, "and the", grail)
# Monty Python and the Holy Grail
Individual Chars
monty[0]
# 'M'
monty[3]
# 't'
monty[5]
# ' '
Negative Indexing Chars
monty[-1]
# 'n'
monty[5]
# ' '
monty[-7]
# ' '
The last character is at index -1, and the indices count down (-2, -3, ...) as you move backwards through the string; monty[-7] and monty[5] pick out the same space character.
Print Chars
sent = 'colorless green ideas sleep furiously'
for char in sent:
    print(char, end=' ')
# c o l o r l e s s   g r e e n   i d e a s   s l e e p   f u r i o u s l y
Count Chars
from nltk.corpus import gutenberg
raw = gutenberg.raw('melville-moby_dick.txt')
fdist = nltk.FreqDist(ch.lower() for ch in raw if ch.isalpha())
[char for char, count in fdist.most_common()]
# ['e', 't', 'a', 'o', 'n', 'i', 's', 'h', 'r', 'l', 'd', 'u', 'm',
#  'c', 'w', 'f', 'g', 'p', 'b', 'y', 'v', 'k', 'q', 'j', 'x', 'z']
You can also visualize this frequency distribution:
fdist.plot()
Each language has a characteristic letter-frequency distribution, which makes distributions like this a good way to tell languages apart.
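As a sketch of that idea, we can compare letter frequencies across two languages using the UDHR corpus (the fileids below are assumed from the udhr corpus naming convention):
import nltk
nltk.download('udhr')
from nltk.corpus import udhr

def letter_freq(fileid):
    # frequency distribution over lowercased alphabetic characters
    raw = udhr.raw(fileid)
    return nltk.FreqDist(ch.lower() for ch in raw if ch.isalpha())

english = letter_freq('English-Latin1')
german = letter_freq('German_Deutsch-Latin1')
english.most_common(5)   # the most frequent letters and their counts differ
german.most_common(5)    # between the two languages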
Substrings
monty[6:10]
# 'Pyth'
The slice (m,n) contains the substring from index m through n-1.
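A quick check of that rule with monty from above:
monty[0:5]
# 'Monty'
monty[6:12]
# 'Python'
monty[6:]
# 'Python'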
Substring Membership
phrase = 'And now for something completely different'
if 'thing' in phrase:
    print('found "thing"')
# found "thing"
More Operations
- s.find(t): index of the first instance of string t inside s (-1 if not found)
- s.rfind(t): index of the last instance of string t inside s (-1 if not found)
- s.index(t): like s.find(t), except it raises ValueError if t is not found
- s.rindex(t): like s.rfind(t), except it raises ValueError if t is not found
- s.join(text): combine the words of text into a single string, using s as the glue
- s.split(t): split s into a list wherever t is found (whitespace by default)
- s.splitlines(): split s into a list of strings, one per line
- s.lower(): a lowercased version of the string s
- s.upper(): an uppercased version of the string s
- s.title(): a titlecased version of the string s
- s.strip(): a copy of s without leading or trailing whitespace
- s.replace(t, u): replace instances of t with u inside s
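A few of these in action on a throwaway string:
s = '  Monty Python  '
s.strip()
# 'Monty Python'
s.strip().lower()
# 'monty python'
s.strip().split(' ')
# ['Monty', 'Python']
'-'.join(['Monty', 'Python'])
# 'Monty-Python'
s.strip().replace('Python', 'Circus')
# 'Monty Circus'
s.strip().find('Python')
# 6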
Lists
beatles = ['John', 'Paul', 'George', 'Ringo']
beatles[2]
# 'George'
beatles[:2]
# ['John', 'Paul']
beatles + ['Brian']
# ['John', 'Paul', 'George', 'Ringo', 'Brian']
Unicode
ASCII uses 7 bits and covers 128 characters (extended 8-bit encodings cover 256), whereas Unicode defines more than a million code points, and UTF-8 encodes them using 1 to 4 bytes per character.
We can manipulate Unicode strings exactly as we do ordinary strings; however, when we store or transmit them, they are serialized as a stream of bytes. Encodings such as ASCII or Latin-1 are often enough to support a single language.
Unicode can support many if not all languages and other special characters like emojis.
Since Unicode acts as the universal intermediate representation, translating from a specific byte encoding into Unicode is called decoding, and translating out of Unicode into a specific byte encoding is called encoding.
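In Python 3 terms, a str holds Unicode text and a bytes object holds the encoded stream; a minimal sketch:
s = 'Zażółć'
b = s.encode('utf-8')      # encoding: out of Unicode into bytes
b
# b'Za\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87'
b.decode('utf-8')          # decoding: from bytes back into Unicode
# 'Zażółć'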
Code Point
Unicode supports over a million characters; each character is assigned a number in this space, called its code point.
Glyphs
Fonts are a mapping from characters to glyphs. Glyphs are what appear in print and on screen; the characters themselves are abstract code points, conventionally written as hexadecimal numbers such as U+0061.
Codecs
In this context, a codec is an encoder/decoder: software that translates text between a particular byte encoding and Unicode.
path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')
import codecs
f = codecs.open(path, encoding='latin2')
To write Unicode text back out in a particular encoding, open the file in write mode:
f = codecs.open(path, 'w', encoding='utf-8')
Ordinal
ord('a')
# 97
hex(97)
# '0x61'
char = '\u0061'
print(char)
# a
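The same works for non-ASCII code points, for example the Polish letter ń (code point U+0144):
ord('ń')
# 324
hex(324)
# '0x144'
chr(0x0144)
# 'ń'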
Regular Expression Applications to Tokenizing
Many linguistic tasks involve pattern matching. For example, to find words that end with 'ed' we could use endswith('ed').
Regular expressions help us do that very efficiently.
import re
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]
Basic Metacharacters
Metacharacters give a pattern extra meaning: they mark the start or end of a string, act as wildcards, and so on.
Start: Caret
^ matches the start of the string; you can think of it as anchoring the pattern to the position before the first character.
import nltk
nltk.download('words')
[w for w in wordlist if re.search('^pre', w)]
# [..., 'predestinately', 'predestination', 'predestinational', 'predestinationism',
#  'predestinationist', 'predestinative', 'predestinator', 'predestine', 'predestiny',
#  'predestitute', 'predestitution', 'predestroy', 'predestruction', 'predetach',
#  'predetachment', 'predetail', 'predetain', 'predetainer', 'predetect', 'predetention',
#  'predeterminability', 'predeterminable', 'predeterminant', 'predeterminate',
#  'predeterminately', 'predetermination', 'predeterminative', 'predetermine',
#  'predeterminer', ...]
End: Dollar Sign
[w for w in wordlist if re.search('ed$', w)]
# ['abaissed', 'abandoned', 'abased', 'abashed', 'abatised', 'abed', 'aborted', ...]
$ matches the end of the string.
Single Character Wildcard: Dot
[w for w in wordlist if re.search('^..j..t..$', w)]
# ['abjectly', 'adjuster', 'dejected', 'dejectly', 'injector', 'majestic', 'objectee',
#  'objector', 'rejecter', 'rejector', 'unjilted', 'unjolted', 'unjustly']
Optional Characters: Question Mark
«^e-?mail$»
The ? makes the character immediately before it optional, so the pattern above matches both email and e-mail.
[w for w in wordlist if re.search('^e-?mail$', w)]
Ranges
The words "golf" and "hold" are textonyms: words entered with the same sequence of keystrokes on a phone keypad.
[w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)] # ['gold', 'golf', 'hold', 'hole']
- Set = [ghi]
- Range = [g-i]
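Since [g-i], [m-o], [j-l] and [d-f] denote exactly the same character sets as [ghi], [mno], [jlk] and [def], the range form gives the same matches:
[w for w in wordlist if re.search('^[g-i][m-o][j-l][d-f]$', w)]
# ['gold', 'golf', 'hold', 'hole']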
Closures
import nltk
nltk.download('nps_chat')
chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))
[w for w in chat_words if re.search('^m+i+n+e+$', w)]
# ['miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee',
#  'miiiiiinnnnnnnnnneeeeeeee', 'mine',
#  'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']
- The + metacharacter in «^m+i+n+e+$» means one or more instances of the preceding character.
- The * metacharacter in «^m*i*n*e*$» means zero or more instances of the preceding character; run against chat_words it matches:
['', 'e', 'i', 'in', 'm', 'me', 'meeeeeeeeeeeee', 'mi', 'miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee', 'miiiiiinnnnnnnnnneeeeeeee', 'min', 'mine', 'mm', 'mmm', 'mmmm', 'mmmmm', 'mmmmmm', 'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee', 'mmmmmmmmmm', 'mmmmmmmmmmmmm', 'mmmmmmmmmmmmmm', 'n', 'ne']
The * and + operators are known as Kleene closures: they stand for zero or more (*) or one or more (+) repetitions of the preceding expression.
You can also do it with ranges:
[w for w in chat_words if re.search('^[ha]+$', w)]
The result:
['a', 'aaaaaaaaaaaaaaaaa', 'aaahhhh', 'ah', 'ahah', 'ahahah', 'ahh', 'ahhahahaha', 'ahhh', 'ahhhh', 'ahhhhhh', 'ahhhhhhhhhhhhhh', 'h', 'ha', 'haaa', 'hah', 'haha', 'hahaaa', 'hahah', 'hahaha', 'hahahaa', 'hahahah', 'hahahaha', 'hahahahaaa', 'hahahahahaha', 'hahahahahahaha', 'hahahahahahahahahahahahahahahaha', 'hahahhahah', 'hahhahahaha']
Logical Not Operator: Caret Inside a Bracket
«[^aeiouAEIOU]»
matches anything but a vowel, so this would give us tokens like:
- :):):)
- grrr
- cyb3r
- zzzzzzzz
- or just !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
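A sketch of that search, reusing chat_words from the closures example above (only the shape of the output is indicated in the comment):
[w for w in chat_words if re.search('^[^aeiouAEIOU]+$', w)][:10]
# the first few matches are tokens built entirely from non-vowels,
# such as the punctuation runs, 'grrr' and 'zzzzzzzz' examples listed above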
Matching Patterns with Separators: Escape with Backslash
import nltk
nltk.download('treebank')
wsj = sorted(set(nltk.corpus.treebank.words()))
[w for w in wsj if re.search('^[0-9]+\.[0-9]+$', w)]
This gets us all decimal numbers:
# ['0.0085', '0.05', '0.1', '0.16', '0.2', '0.25', '0.28', '0.3', '0.4', '0.5',
#  '0.50', '0.54', '0.56', '0.60', '0.7', '0.82', '0.84', '0.9', '0.95', '0.99',
#  '1.01', '1.1', '1.125', '1.14', '1.1650', '1.17', '1.18', '1.19', '1.2', ...]
To get currencies:
[w for w in wsj if re.search('^[A-Z]+\$$', w)]
The result:
['C$', 'US$']
Limit Characters: Curly Brackets
[w for w in wsj if re.search('^[0-9]{4}$', w)]
The result:
['1614', '1637', '1787', '1901', '1903', '1917', '1925', '1929', '1933', '1934', '1948', '1953', '1955', '1956', '1961', '1965', '1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1975', '1976', '1977', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2005', '2009', '2017', '2019', '2029', '3057', '8300']
The braces can also specify a range of repetition counts; {3,5} means between three and five instances of the preceding item:
[w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}$', w)]
And the result:
['10-day', '10-lap', '10-year', '100-share', '12-point', '12-year', '14-hour', '15-day', '150-point', '190-point', '20-point', '20-stock', '21-month', '237-seat', '240-page', '27-year', '30-day', '30-point', '30-share', '30-year', '300-day', '36-day', '36-store', '42-year', '50-state', '500-stock', '52-week', '69-point', '84-month', '87-store', '90-day']
Order of Operations: Parentheses
What does «w(i|e|ai|oo)t» match?
[w for w in wsj if re.search('^w(i|e|ai|oo)t', w)]
Gives results like:
['wait', 'waited', 'waiting', 'witches', 'with', 'withdraw', 'withdrawal', 'withdrawn', 'withdrew', 'withhold', 'within', 'without', 'withstand', 'witness', 'witnesses']
One more wrinkle: in Python string literals, a backslash starts an escape sequence (for example, '\b' is the backspace character), while in a regular expression \b means a word boundary. To make sure the backslash reaches the re library untouched, prefix the pattern with r to mark it as a raw string, like r'\band\b'.
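To see why the raw-string prefix matters, compare:
re.findall('\band\b', 'on land and sea')
# []        ('\b' here is a literal backspace character)
re.findall(r'\band\b', 'on land and sea')
# ['and']   (r'\b' reaches re as a word boundary)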
Extracting Word Pieces
word = 'supercalifragilisticexpialidocious'
list_vowels = re.findall(r'[aeiou]', word)
len(list_vowels)
The previous examples all used re.search(regex, word) to test whether a word matches; here we switch to re.findall(regex, word) to extract every matching piece. The example below finds all sequences of two or more characters from the set [aeiou].
import nltk
nltk.download('treebank')
wsj = sorted(set(nltk.corpus.treebank.words()))
fd = nltk.FreqDist(vs for word in wsj for vs in re.findall(r'[aeiou]{2,}', word))
fd.items()
The output is:
dict_items([('ea', 476), ('oi', 65), ('ou', 329), ('io', 549), ('ee', 217), ('ie', 331), ('ui', 95), ('ua', 109), ('ai', 261), ('ue', 105), ('ia', 253), ('ei', 86), ('iai', 1), ('oo', 174), ('au', 106), ('eau', 10), ('oa', 59), ('oei', 1), ('oe', 15), ('eo', 39), ('uu', 1), ('eu', 18), ('iu', 14), ('aii', 1), ('aiia', 1), ('ae', 11), ('aa', 3), ('oui', 6), ('ieu', 3), ('ao', 6), ('iou', 27), ('uee', 4), ('eou', 5), ('aia', 1), ('uie', 3), ('iao', 1), ('eei', 2), ('uo', 8), ('uou', 5), ('eea', 1), ('ueui', 1), ('ioa', 1), ('ooi', 1)])
Reconstructing Words from Word Pieces
regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'

def compress(word):
    pieces = re.findall(regexp, word)
    return ''.join(pieces)

print(nltk.tokenwrap(compress(w) for w in wsj[0:-75]))
This function removes word-internal vowels, keeping any vowel sequence at the very start or end of a word. The (truncated) result:
wgh wghd wghng wght wrd wlcme wlcmd wlfre wll wll-cnnctd wll-knwn wnt wre wht wht whl-ldr whls whn whn-ssd whnvr whre whrby whrwthl whthr whch whchvr whle whmscl whppng whpsw whrlng whstle whte wht-cllr who whle whlsle whlslr whm whse why wde wdly wdsprd wdgt wdgts wdw wld wfe wld wldly wll wllng wllngnss wn wndfll wndng wndw wne wn-byng wn-mkng wns wngs wnnr wnnrs wnnng wns wntr wrs wsdm wsh wtchs wth wthdrw wthdrwl wthdrwn wthdrw wthhld wthn wtht wthstnd wtnss wtnsss wvs wzrds wo wmn wmn wn wndr wng wrd wrd- prcssng wrds wrk wrkble wrkbks wrkd wrkr
Conditional Frequency Distributions
words = sorted(set(nltk.corpus.treebank.words()))
cvs = [cv for w in words for cv in re.findall(r'[bcdfghjklmnpqrstvxyz][aeiou]', w)]
cfd = nltk.ConditionalFreqDist(cvs)
cfd.tabulate()
The output is a conditional frequency distribution of consonant-vowel sequences from the treebank corpus:
a e i o u
b 166 202 113 133 89
c 345 427 240 554 149
d 127 632 433 121 111
f 115 150 206 128 80
g 126 329 124 60 62
h 270 336 276 224 45
j 7 22 4 26 37
k 19 216 111 8 8
l 458 675 573 288 107
m 321 453 290 193 62
n 289 545 286 153 73
p 233 359 133 256 98
q 0 0 0 0 109
r 679 1229 665 486 127
s 130 577 391 175 229
t 429 1053 1130 352 182
v 100 516 214 60 0
x 17 28 21 7 3
y 13 104 44 22 1
z 21 76 21 10 2
treebank.words() is a tokenized sample of the Wall Street Journal.
Finding All Instances of a Sequence
An index from each consonant-vowel sequence back to the words that contain it:
cv_word_pairs = [(cv, w) for w in words for cv in re.findall(r'[bcdfghjklmnpqrstvxyz][aeiou]', w)]
cv_index = nltk.Index(cv_word_pairs)
cv_index['ba']
The output:
['Albany', 'Atlanta-based', 'Barbados', 'Barbara', 'Barbaresco', 'Bermuda-based', 'Cabbage', 'Calif.-based', 'Carballo', 'Centerbank', 'Citibank', 'Conn.based', 'Embassy', 'Erbamont', 'Francisco-based', 'Freshbake', 'Garbage', 'Germany-based' ...
Finding Word Stems
Word stems are the core, or root, of a word; in a search engine we often want to match not just the literal query string but all related words that share its stem.
regex = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$'
[w for w in wsj if re.findall(regex, w)]
Which finds all words with those suffixes.
["'30s", "'40s", "'50s", "'80s", "'s", '1920s', '1940s', '1950s', '1960s', '1970s', '1980s', '1990s', '20s', '30s', '62%-owned', '8300s', 'ADRs', 'Absorbed', 'Academically', 'According', 'Achievement'...
Applying the same regular expression to a single word rather than a word list:
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')
we get:
[('process', 'es')]
Finding Stems In a Better Way
def stem(word):
    regex = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regex, word)[0]
    return stem

raw = """DENNIS: Listen, strange women lying in ponds
distributing swords is no basis for a system of government.
Supreme executive power derives from a mandate from the masses,
not from some farcical aquatic ceremony."""
tokens = nltk.word_tokenize(raw)
[stem(t) for t in tokens]
Which outputs:
['DENNIS', ':', 'Listen', ',', 'strange', 'women', 'ly', 'in', 'pond', 'distribut', 'sword', 'i', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern', '.', 'Supreme', 'execut', 'power', 'deriv', 'from', 'a', 'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']
Even this improved method makes errors, for example reducing lying to 'ly', is to 'i', basis to 'basi', and distributing to 'distribut'.
Searching Tokenized Text
What if you wanted to search multiple words? We can use regular expressions for that too.
from nltk.corpus import gutenberg
moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
moby.findall(r"<a> (<.*>) <man>")# monied; nervous; dangerous; white; white; white; pious; queer;
# good; mature; white; Cape; great; wise; wise; butterless; white;
# fiendish; pale; furious; better; certain; complete; dismasted;
# younger; brave; brave; brave; brave
The above regular expression matches the phrase "a (anything) man", where <.*> matches any single token.
- With the parentheses around <.*>, findall prints only the matched middle word.
- Without the parentheses, it prints the whole matched phrase:
moby.findall(r"<a> <.*> <man>")# a monied man; a nervous man; a dangerous man; a white man; a white # man; a white man; a pious man; a queer man; a good man; a mature
# man; a white man; a Cape man; a great man; a wise man; a wise man; # a butterless man; a white man; a fiendish man; a pale man; a
# furious man; a better man; a certain man; a complete man; a
# dismasted man; a younger man; a brave man; a brave man; a brave
# man; a brave man
To match three-word phrases (here, over the whole Gutenberg selection):
gut = nltk.Text(gutenberg.words())
gut.findall(r"<.*> <.*> <whale>")
# or a whale; as a whale; in the whale; that the whale; of a whale;
# name a whale; - piggledy whale; s ( whale; " This whale; of the
# whale; of one whale; like a whale; the wounded whale; of a whale;
# While the whale; say the whale; see a whale; of a whale; once a
# whale; of the ...
To match sequences of three or more words that start with "l":
moby.findall(r"<l.*>{3,}")
# little lower layer; little lower layer; lances lie levelled; long
# lance lightly; like live legs
Exploring Hypernyms
Some linguistic phenomena, such as superordinate words (hypernyms), tend to show up in characteristic surface patterns that we can search for directly.
import nltk
nltk.download('brown')
from nltk.corpus import brown
hobbies_learned = nltk.Text(brown.words(categories=['hobbies', 'learned']))
hobbies_learned.findall(r"<\w*> <and> <other> <\w*s>")
Some of the matches:
speed and other activities; water and other liquids; tomb and other landmarks; Statues and other monuments; pearls and other jewels; charts and other items; roads and other features; figures and other objects; military and other areas; demands and other factors; abstracts and other compilations; iron and other metals
Notice the match "water and other liquids": it tells us that water is a kind of liquid, so liquid is the hypernym and water is the hyponym.
Of course, this method isn't perfect; there can be false positives.
Other Articles
This post is part of a series of stories that explores the fundamentals of natural language processing:
1. Context of Natural Language Processing: Motivations, Disciplines, Approaches, Outlook
2. Notes on Formal Language Theory: Objects, Operations, Regular Expressions and Finite State Automata
3. Natural Language Tool Kit 3.5: Search Functions, Statistics, Pronoun Resolution
4. What Are Regular Languages? Minimization, Finite State Transducers, Regular Relations
5. What Are Context Free Languages? Grammars, Derivations, Expressiveness, Hierarchies
6. Inputting & PreProcessing Text: Input Methods, String & Unicode, Regular Expression Use Cases
Up Next…
In the next article, we will explore Normalizing, Tokenizing and Sentence Segmentation.
For the table of contents and more content click here.