Natural Language Processing with the spaCy library

Déborah Mesquita
Mar 26, 2017 · 4 min read

➗ Tokenization

import spacy
nlp = spacy.load('en')  # the model
raw_text = "Seven years after the death of his wife, Mill was invited to contest Westminster. His feeling on the conduct of elections made him refuse to take any personal action in the matter, and he gave the frankest expression to his political views, but nevertheless he was elected by a large majority. He was not a conventional success in the House; as a speaker he lacked magnetism. But his influence was widely felt."
parsedData = nlp(raw_text)  # done, the text is now split into tokens
word = parsedData[0]
print(word.text, word.lower_)
>>> Seven seven
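
To give a feel for what tokenization involves, here is a minimal sketch that uses only Python's `re` module. It is not spaCy's tokenizer, which applies language-specific rules and exceptions; `simple_tokenize` is a hypothetical helper named for this illustration.

```python
import re

def simple_tokenize(text):
    """Split text into word and punctuation tokens (hypothetical helper, not spaCy's algorithm)."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Seven years after the death of his wife, Mill was invited.")
print(tokens[0], tokens[0].lower())
print(tokens[8:11])
```

Even this naive version separates the comma from "wife", which is why "was" lands at index 10, the same position used with `parsedData` in the next section.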

🗣 Part-of-speech tagging

for i, word in enumerate(parsedData):
    print(word.text, word.pos_)
    if i > 5:
        break
>>> Seven NUM
years NOUN
after ADP
the DET
death NOUN
of ADP
his ADJ
word = parsedData[10]  # the word 'was'
print("original:", word.text)
print("POS tag:", word.pos_)
print("fine-grained POS tag:", word.tag_)
>>> original: was
POS tag: VERB
fine-grained POS tag: VBD
# VBD means: VerbForm=fin Tense=past

📌 Named Entity Recognition (NER)

for word in parsedData:
    if word.ent_type_:
        print(word.text, word.ent_type_)
>>> Seven DATE
years DATE
Mill PERSON
Westminster PERSON
House ORG
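
The loop above prints one token per line, so "Seven years" shows up as two separate DATE tokens. spaCy exposes whole entity spans through `doc.ents`; the sketch below only illustrates the idea of merging adjacent tokens in plain Python. The `tagged` list is hand-built from the output above, with two unlabeled tokens added for illustration, and `group_entities` is a hypothetical helper. It would wrongly merge two neighbouring entities of the same type, which a real implementation avoids.

```python
from itertools import groupby

# (token text, entity type) pairs; an empty string means "not part of an entity"
tagged = [
    ("Seven", "DATE"), ("years", "DATE"), ("after", ""),
    ("Mill", "PERSON"), ("was", ""), ("House", "ORG"),
]

def group_entities(pairs):
    """Merge runs of adjacent tokens that share a non-empty entity type."""
    spans = []
    for label, run in groupby(pairs, key=lambda pair: pair[1]):
        if label:  # skip tokens with no entity type
            spans.append((" ".join(text for text, _ in run), label))
    return spans

print(group_entities(tagged))
# [('Seven years', 'DATE'), ('Mill', 'PERSON'), ('House', 'ORG')]
```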

🔪 Syntactic Parsing

The dependencies for the first sentence of our text:
for word in parsedData:
    print(word.text, word.dep_)
>>> Seven nummod
years nsubjpass
after prep
the det
death pobj
of prep
[...]

✅ Rule-based matching

from spacy.attrs import DEP
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
# nsubj (nominal subject): http://universaldependencies.org/en/dep/nsubj.html
# "SujeitoNominal" is Portuguese for "nominal subject"
matcher.add_pattern("SujeitoNominal", [{DEP: 'nsubj'}])
doc = nlp(raw_text)
matches = matcher(doc)  # run the matcher once and reuse the result
for ent_id, label, start, end in matches:
    print(doc[start:end].text)
>>> feeling
he
He
he
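
Conceptually, a pattern is a list of per-token attribute constraints that is slid across the document. The following is a rough sketch of that idea in plain Python, not spaCy's `Matcher` implementation. `match_pattern` is a hypothetical helper, and the dependency labels on the sample tokens were assigned by hand for illustration rather than produced by a parser.

```python
def match_pattern(tokens, pattern):
    """Return (start, end) index pairs where consecutive tokens satisfy the pattern."""
    matches = []
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        # every attribute required by a pattern entry must equal the token's attribute
        if all(
            all(tok.get(attr) == value for attr, value in spec.items())
            for tok, spec in zip(window, pattern)
        ):
            matches.append((start, start + width))
    return matches

# Hand-labelled tokens (assumed labels, for illustration only)
tokens = [
    {"text": "he", "dep": "nsubj"},
    {"text": "gave", "dep": "conj"},
    {"text": "the", "dep": "det"},
    {"text": "expression", "dep": "dobj"},
]

for start, end in match_pattern(tokens, [{"dep": "nsubj"}]):
    print(" ".join(tok["text"] for tok in tokens[start:end]))
```

A two-entry pattern such as `[{"dep": "det"}, {"dep": "dobj"}]` would match the span from "the" to "expression" in the same way.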

↗️ Word vectors

my,dog,and_,cat,and__,horse = nlp(u'my dog and cat and horse')
print(cat.similarity(dog))
print(cat.similarity(horse))
print(dog.similarity(horse))
>>> 0.801685428714
0.484733507195
0.624627638895
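
By default, `similarity` compares two word vectors with cosine similarity. Here is a minimal sketch of that measure in plain Python. The three-dimensional vectors are made up for illustration (real word vectors have hundreds of dimensions), so the numbers will not match the output above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, invented for this example
cat = [1.0, 0.9, 0.1]
dog = [0.9, 1.0, 0.2]
horse = [0.2, 0.4, 1.0]

print(cosine_similarity(cat, dog))
print(cosine_similarity(cat, horse))
```

Vectors pointing in nearly the same direction score close to 1, and orthogonal vectors score 0, which is why the toy `cat` and `dog` come out more similar than `cat` and `horse`.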

📈 And what about sentiment analysis?
