NLP using spaCy and Beautiful Soup

@IanChriste
1 min read · Oct 5, 2019


Analyze the Project Gutenberg website — www.gutenberg.org

Important concepts of NLP — (1) Tokenization, (2) Lemmatization, (3) Part of Speech, and (4) Entity Recognition

I. Definitions:

Tokenization splits text into words, symbols, and punctuation. Lemmatization reduces each word to its base form (its lemma), standardizing words that share the same root meaning (you can think of it as text normalization). Part-of-speech tagging assigns grammatical properties — nouns, verbs, etc. — to words. Entity recognition classifies tokens into pre-defined categories — dates, names, etc.

II. Scraping the Data:

We pull in the data from the website using Requests, strip the HTML with Beautiful Soup, and convert the text to a spaCy Doc object by calling nlp().

We only keep text[:999999] to keep the document a manageable size.
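A sketch of the scraping step (the page URL below is illustrative, not the one used in the notebook):

```python
import requests
from bs4 import BeautifulSoup
import spacy

# Fetch a page from Project Gutenberg (illustrative URL)
response = requests.get("https://www.gutenberg.org/files/2701/2701-h/2701-h.htm")

# Strip the HTML tags with Beautiful Soup
soup = BeautifulSoup(response.text, "html.parser")
text = soup.get_text()

# Truncate for size, then build the spaCy Doc
nlp = spacy.load("en_core_web_sm")
doc = nlp(text[:999999])
```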

III. Parsing Data in Spacy

1. String and integer representations:

spaCy offers string and integer representations of each token (orth_ for the string and orth for the integer hash).
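For example (these lexical attributes work even in a blank pipeline, so no trained model is needed; the text is illustrative):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Call me Ishmael.")

for token in doc:
    # orth_ is the token's string; orth is its integer hash in the vocab
    print(token.orth_, token.orth)

# The hash round-trips through the shared string store
assert nlp.vocab.strings[doc[0].orth] == "Call"
```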

2. Removing Punctuation and Spacing:

To remove punctuation or whitespace tokens, use .is_punct and .is_space.
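A short sketch of the filtering step, again on an illustrative string:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Hello, world !")

# Keep only tokens that are neither punctuation nor whitespace
words = [token for token in doc if not token.is_punct and not token.is_space]
print([token.text for token in words])
```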

3. Lemmatization and Part of Speech:

.lemma_ reduces a word to its base form and .tag_ assigns grammatical properties.

4. Entity Recognition and Sentence Recognition

Notice that .ents and .sents live on the processed Doc object returned by nlp.

IV. Analyzing data in Spacy

1. Find how many verbs and nouns:

2. Find most common types:
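The same Counter approach works for the most frequent tokens, using most_common (a sketch; lowercasing and the punctuation filter are choices, not requirements):

```python
from collections import Counter
import spacy

nlp = spacy.blank("en")
doc = nlp("to be or not to be that is the question")

# Most common token strings (lowercased, punctuation excluded)
words = [token.lower_ for token in doc if not token.is_punct]
print(Counter(words).most_common(3))
```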

3. Visualize text with Displacy

from spacy import displacy

# Render the dependency parse of the 51st sentence
fifty = sentences[50]
displacy.render(fifty, style="dep")

Full code here:

https://github.com/back2basics01/Gutenburg/blob/master/Guttenburg_Final.ipynb
