Get Started with Spacy.io and NLP

Dave Trichter
Aug 26 · 5 min read

-David S. Trichter

What is it: spaCy is a fast, modern package of pre-trained NLP models that helps you discover insights in your text data. I recommend some familiarity with the Python language, pandas DataFrames, and basic Natural Language Processing concepts before proceeding.


Installation is relatively painless. Note that each language model is installed separately, so pick and choose the ones you need (I’ve only used English so far, but am excited to try out some of the others!):

pip install -U spacy
python -m spacy download en_core_web_sm #English model
python -m spacy download es_core_news_sm #Spanish model
python -m spacy download de_core_news_sm #German model
python -m spacy download fr_core_news_sm #French model
python -m spacy download pt_core_news_sm #Portuguese model
python -m spacy download it_core_news_sm #Italian model
python -m spacy download nl_core_news_sm #Dutch model
python -m spacy download el_core_news_sm #Greek model
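
Each downloaded model is installed as an ordinary Python package, so one way to check which of the models above are already available is to look up the package names with the standard library. This is only a minimal sketch; the helper name `installed_models` is my own and not part of spaCy.

```python
from importlib.util import find_spec

# Package names of the small models listed above.
MODEL_NAMES = [
    "en_core_web_sm",
    "es_core_news_sm",
    "de_core_news_sm",
    "fr_core_news_sm",
    "pt_core_news_sm",
    "it_core_news_sm",
    "nl_core_news_sm",
    "el_core_news_sm",
]


def installed_models(names=MODEL_NAMES):
    """Return the names that can be found as importable packages.

    Hypothetical helper for illustration; it only checks that a package
    with that name exists, not that it is compatible with your spaCy version.
    """
    return [name for name in names if find_spec(name) is not None]


if __name__ == "__main__":
    print("Installed models:", installed_models())
```

Anything missing from the printed list still needs its own `python -m spacy download ...` command.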

They have even trained a multi-language model that supports identification of PER, LOC, ORG and MISC entities for English, German, Spanish, French, Italian, Portuguese and Russian (we’ll get to those entity codes in a minute).

python -m spacy download xx_ent_wiki_sm #Multi-language
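
As a quick reference for those entity codes, here is a small hand-written lookup table. The descriptions paraphrase the spaCy documentation for the multi-language model's label scheme, and the `describe_label` helper is a hypothetical name used only for illustration. The English models use a different, larger label set.

```python
# Hand-written descriptions of the four labels used by the multi-language model.
ENTITY_LABELS = {
    "PER": "Named person or family",
    "LOC": "Politically or geographically defined location",
    "ORG": "Corporate, governmental, or other organizational entity",
    "MISC": "Miscellaneous entities, e.g. events, nationalities, products",
}


def describe_label(label):
    """Look up a label, ignoring case; unknown labels get a placeholder."""
    return ENTITY_LABELS.get(label.upper(), "unknown label")


if __name__ == "__main__":
    for code in ENTITY_LABELS:
        print(f"{code}: {describe_label(code)}")
```

Once spaCy is installed, `spacy.explain("ORG")` gives a similar description straight from the library.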

From there, getting started is easy. First, import spaCy:

import spacy

Next, instantiate the spaCy model as “nlp”. From there, you can run the model on any text document:

nlp = spacy.load("en_core_web_sm")
doc = "Deficit Will Reach $1 Trillion Next Year, Budget Office Predicts"
doc = nlp(doc)

Once the text has been processed by the model, “doc” gives you access to many of spaCy’s features for use in your NLP analysis.


Some useful tools to explore SpaCy:

1. Explore spaCy functionality in a pandas DataFrame

import pandas as pd

# One row per token, one column per token attribute
df = pd.DataFrame()
df['text'] = [token.text for token in doc]
df['lemma'] = [token.lemma_ for token in doc]
df['is_punctuation'] = [token.is_punct for token in doc]
df['is_space'] = [token.is_space for token in doc]
df['shape'] = [token.shape_ for token in doc]
df['part_of_speech'] = [token.pos_ for token in doc]
df['pos_tag'] = [token.tag_ for token in doc]

2. Since one of the best features of SpaCy is its accurate prediction of part of speech tags, you may also want to explore your data with the following f-string loop:

for token in doc:
    print(f'TOKEN: {token.text} \nTAG: {token.tag_} \nEXPLANATION: {spacy.explain(token.tag_)}\n')
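
The “shape” column in the DataFrame above is the least self-explanatory of these attributes. Roughly, it replaces uppercase letters with X, lowercase letters with x, and digits with d, and cuts off long runs of the same symbol. The function below is my own approximation of that idea, written in plain Python so you can try it without a model; it is not spaCy's actual code, and the cut-off of four repeated symbols is an assumption on my part.

```python
def word_shape(text, max_run=4):
    """Approximate a spaCy-style shape string for a single token.

    Letters become 'X' or 'x', digits become 'd', anything else is kept.
    Runs of the same symbol are truncated after `max_run` repetitions.
    """
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            sym = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            sym = "d"
        else:
            sym = ch
        # Count how long the current run of identical symbols is.
        if sym == last:
            run += 1
        else:
            run = 1
            last = sym
        if run <= max_run:
            shape.append(sym)
    return "".join(shape)


if __name__ == "__main__":
    for word in ["Deficit", "Trillion", "2019", "G-7"]:
        print(word, word_shape(word))
```

Shapes are handy as features because “Deficit” and “Trillion” end up with the same shape even though the words differ.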

Visualizations

spaCy’s visualization tool, “displaCy”, can help you explore how the model is parsing your text.

from spacy import displacy
displacy.render(doc, style='ent', jupyter=True)
You can see that it is not perfect! ‘Budget Office Predicts’ is not an organization. Capitalizing every word in the headline most likely threw off spaCy’s model.
displacy.render(doc, style='dep', jupyter=True,
                options={'distance': 90,
                         'color': 'maroon',
                         'bg': 'lightblue',
                         'arrow_stroke': 1,
                         'arrow_width': 6})

displacy.render(doc, style='dep', jupyter=True,
                options={'distance': 90,
                         'color': 'white',
                         'bg': 'black',
                         'arrow_stroke': 1,
                         'arrow_width': 10,
                         'compact': True})

Cosine Similarity

Once you have a handle on these tools, you can then use Spacy to find cosine similarities between your documents.

wsj = nlp("President Trump sought to ease trade tensions with China and struck a more conciliatory note on the final day of the Group of Seven summit, where world leaders have pressured him to de-escalate the trade war.")
nytimes = nlp('President Trump shifted tone on his trade war with China yet again on Monday, expressing confidence that the two sides can reach a deal and calling President Xi Jinping a "great leader" three days after branding him an "enemy".')
foxnews = nlp("Trump 'dominated' the G-7 summit 'like no other president has done in years")
cnn = nlp("I think perhaps one of the biggest headlines coming out of this press conference that we just witnessed here in France is that the President would not be pinned down on this question of climate change.")
print(wsj.similarity(nytimes)) # 0.8223302796090933
print(wsj.similarity(foxnews)) # 0.7628043724615976
print(wsj.similarity(cnn)) # 0.7195118933250793
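
If you are wondering what those numbers mean: according to the spaCy documentation, the default similarity estimate is the cosine similarity of averaged word vectors. The sketch below shows that calculation in plain Python. The function names are mine, and the three-dimensional “word vectors” are made-up toy values, not anything a real model produces.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy word vectors for two tiny "documents" (invented values).
    doc_a = [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1]]
    doc_b = [[0.7, 0.2, 0.1], [0.1, 0.9, 0.4]]
    print(cosine_similarity(mean_vector(doc_a), mean_vector(doc_b)))
```

A score near 1 means the averaged vectors point in almost the same direction. Keep in mind that, as far as I know, the small `_sm` models do not ship with full word vectors, so scores from the larger models may be more meaningful.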

Pipeline

spaCy has a built-in pipeline that you can customize. Sometimes you will want the part-of-speech (POS) tagger, and sometimes you will not. Sometimes you might only care about document similarity. To see what you’re working with, all you have to do is:

nlp.pipeline
# [('tagger', <spacy.pipeline.Tagger at 0x121769f10>),
#  ('parser', <spacy.pipeline.DependencyParser at 0x121bc1a10>),
#  ('ner', <spacy.pipeline.EntityRecognizer at 0x121bc1f50>)]
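
Conceptually, that list of (name, component) pairs is applied to your text in order, each component adding its own annotations. Here is a toy sketch of the idea in plain Python. Everything in it (the dict-based “doc”, the `toy_tagger` and `toy_ner` functions, the `run_pipeline` helper) is invented for illustration and is not how spaCy is implemented internally.

```python
def tokenize(text):
    """Toy tokenizer: split on whitespace and wrap the tokens in a dict 'doc'."""
    return {"tokens": text.split()}


def toy_tagger(doc):
    # Stand-in for a tagger: only records whether each token is capitalized.
    doc["tags"] = ["CAP" if tok[:1].isupper() else "LOW" for tok in doc["tokens"]]
    return doc


def toy_ner(doc):
    # Stand-in for an entity recognizer: treats tokens starting with '$' as money.
    doc["ents"] = [tok for tok in doc["tokens"] if tok.startswith("$")]
    return doc


# Same structure as nlp.pipeline: a list of (name, component) pairs.
PIPELINE = [("tagger", toy_tagger), ("ner", toy_ner)]


def run_pipeline(text, pipeline=PIPELINE, disable=()):
    """Run each component in order, skipping any whose name is in `disable`."""
    doc = tokenize(text)
    for name, component in pipeline:
        if name in disable:
            continue
        doc = component(doc)
    return doc


if __name__ == "__main__":
    print(run_pipeline("Deficit Will Reach $1 Trillion"))
    print(run_pipeline("Deficit Will Reach $1 Trillion", disable=["ner"]))
```

In real spaCy you get the same effect by passing a `disable` list when loading a model, for example `spacy.load("en_core_web_sm", disable=["parser"])`, which can speed things up when you do not need a component.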

Classifier

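Finally, to add a classifier to your pipeline, create it with spaCy’s “create_pipe” method and attach it with “add_pipe”:

# Construction via create_pipe
textcat = nlp.create_pipe("textcat")
# Or, if each document should belong to exactly one class:
textcat = nlp.create_pipe("textcat", config={"exclusive_classes": True})
# create_pipe only builds the component; add_pipe attaches it to the pipeline (spaCy v2 API)
nlp.add_pipe(textcat, last=True)
# spaCy makes it easy to decide if you want your documents to fit into more than one class!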
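
My understanding of the “exclusive_classes” setting is that it decides whether the category scores compete with each other (they sum to one, so a document gets exactly one class) or are scored independently (a document can belong to several classes at once). The sketch below illustrates that difference with a softmax and a per-class logistic function. It is an illustration of the concept with made-up raw scores, not spaCy's internal code.

```python
import math


def softmax(scores):
    """Mutually exclusive classes: the outputs are positive and sum to 1."""
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(scores):
    """Independent classes: each output is its own probability between 0 and 1."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]


if __name__ == "__main__":
    labels = ["POLITICS", "ECONOMY", "SPORTS"]  # hypothetical categories
    raw = [2.0, 1.5, -3.0]                      # made-up raw model scores
    print(dict(zip(labels, softmax(raw))))      # one winner, scores sum to 1
    print(dict(zip(labels, sigmoid(raw))))      # several labels can score high
```

With the independent version, a headline about the budget deficit could score high for both of the first two hypothetical labels, which is usually what you want for news topics.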

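NOTE: After creating your model, you can even go back and update it with new examples:

from spacy.pipeline import TextCategorizer

textcat = TextCategorizer(nlp.vocab)
losses = {}
optimizer = nlp.begin_training()
# doc1 and doc2 are processed documents; gold1 and gold2 are their gold-standard annotations (placeholders)
textcat.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)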

Begin Training Your Model

textcat = TextCategorizer(nlp.vocab) 
nlp.pipeline.append(textcat)
optimizer = textcat.begin_training(pipeline=nlp.pipeline)

Finally, get predictions and scores

textcat = TextCategorizer(nlp.vocab) 
scores = textcat.predict([doc1, doc2])

Make sure to serialize your model so you can use it later…

#Save it
textcat = TextCategorizer(nlp.vocab)
textcat.to_disk("/path/to/textcat")
#Retrieve it
textcat = TextCategorizer(nlp.vocab)
textcat.from_disk("/path/to/textcat")

Conclusions

This has been a quick, wide, and somewhat messy look at what a spaCy pipeline has to offer. There are many other customizations that you can do, especially once you get into class construction. However, I find that when using a new piece of technology it is best to get started quickly, explore the entire modeling process, and then allow yourself to go back, fill in the gaps, and customize once you have a handle on the basics.

Much has been borrowed from the official spaCy documentation, so to see the full capabilities of the package, head over to:

spacy.io

Let me know if you have any questions!
