NLP using spaCy and Beautiful Soup
Analyze the Project Gutenberg website — www.gutenberg.org
Important concepts of NLP — (1) Tokenization, (2) Lemmatization, (3) Part of Speech, and (4) Entity Recognition
I. Definitions:
Tokenization splits text into words, symbols, and punctuation. Lemmatization reduces a word to its base form, standardizing words that share the same root meaning (you could think of it as text normalization). Part-of-speech tagging assigns grammatical properties (noun, verb, etc.) to each word. Entity recognition classifies tokens into pre-defined categories (dates, names, etc.).
II. Scraping the Data:
We pull the data from the website using Requests, strip the HTML with Beautiful Soup, and convert the resulting text to a spaCy Doc object with nlp.
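A sketch of that pipeline; in the article the HTML comes from a live request (e.g. `requests.get(url).text` against a gutenberg.org page), but here a small literal snippet stands in for the response so the example is self-contained:

```python
from bs4 import BeautifulSoup
import spacy

# Stand-in for: html = requests.get(url).text
html = "<html><body><h1>Chapter 1</h1><p>It was the best of times.</p></body></html>"

# Beautiful Soup strips the markup, leaving raw text
text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)

# A blank tokenizer-only pipeline is enough to build a Doc for this sketch;
# the article uses a full trained model via nlp.
nlp = spacy.blank("en")
doc = nlp(text)
print(text)
```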
III. Parsing Data in Spacy
1. String and integer representations:
2. Removing Punctuation and Spacing:
3. Lemmatization and Part of Speech:
4. Entity Recognition and Sentence Recognition
IV. Analyzing data in Spacy
1. Find how many verbs and nouns:
2. Find most common types:
3. Visualize text with displaCy
from spacy import displacy
# `sentences` is the list of sentence spans from list(doc.sents)
fifty = sentences[50]  # pick the 51st sentence
displacy.render(fifty, style="dep")  # render its dependency parse
Full Code Here
https://github.com/back2basics01/Gutenburg/blob/master/Guttenburg_Final.ipynb