NLP using spaCy and Beautiful Soup

@IanChriste
1 min read · Oct 5, 2019


Analyze the Project Gutenberg website — www.gutenberg.org

Important concepts of NLP — (1) Tokenization, (2) Lemmatization, (3) Part of Speech, and (4) Entity Recognition

I. Definitions:

Tokenization splits text into words, symbols, and punctuation. Lemmatization reduces each word to its base form (its lemma), standardizing words that share the same root meaning (you can think of it as text normalization). Part-of-speech tagging assigns grammatical properties — nouns, verbs, etc. — to words. Entity recognition classifies tokens into pre-defined categories — dates, names, etc.

II. Scraping the Data:

We pull in the data from the website using Requests, strip the HTML with Beautiful Soup, and convert the text to a spaCy Doc object by calling nlp().

We only keep text[:999999] to keep the document a manageable size.
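A sketch of the scraping step (the page URL below is illustrative, not the one used in the notebook):

```python
import requests
from bs4 import BeautifulSoup
import spacy

# Fetch a page from Project Gutenberg (illustrative URL)
response = requests.get("https://www.gutenberg.org/files/2701/2701-h/2701-h.htm")

# Strip the HTML tags with Beautiful Soup
soup = BeautifulSoup(response.text, "html.parser")
text = soup.get_text()

# Truncate for size, then build the spaCy Doc
nlp = spacy.load("en_core_web_sm")
doc = nlp(text[:999999])
```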

III. Parsing Data in Spacy

1. String and integer representations:

spaCy offers string and integer representations of each token (orth_ for the string and orth for the integer hash).
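For example (these lexical attributes work even in a blank pipeline, so no trained model is needed; the text is illustrative):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Call me Ishmael.")

for token in doc:
    # orth_ is the token's string; orth is its integer hash in the vocab
    print(token.orth_, token.orth)

# The hash round-trips through the shared string store
assert nlp.vocab.strings[doc[0].orth] == "Call"
```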

2. Removing Punctuation and Spacing:

To remove punctuation or whitespace tokens, use .is_punct and .is_space.
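A short sketch of the filtering step, again on an illustrative string:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Hello, world !")

# Keep only tokens that are neither punctuation nor whitespace
words = [token for token in doc if not token.is_punct and not token.is_space]
print([token.text for token in words])
```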

3. Lemmatization and Part of Speech:

.lemma_ reduces a word to its base form and .tag_ assigns grammatical properties.

4. Entity Recognition and Sentence Recognition

Notice that .ents and .sents live on the processed Doc object returned by nlp.

IV. Analyzing data in Spacy

1. Find how many verbs and nouns:

2. Find most common types:
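The same Counter approach works for the most frequent tokens, using most_common (a sketch; lowercasing and the punctuation filter are choices, not requirements):

```python
from collections import Counter
import spacy

nlp = spacy.blank("en")
doc = nlp("to be or not to be that is the question")

# Most common token strings (lowercased, punctuation excluded)
words = [token.lower_ for token in doc if not token.is_punct]
print(Counter(words).most_common(3))
```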

3. Visualize text with Displacy

from spacy import displacy

# Render the dependency parse of the 51st sentence
fifty = sentences[50]
displacy.render(fifty, style="dep")

Full code here:

https://github.com/back2basics01/Gutenburg/blob/master/Guttenburg_Final.ipynb
