spaCy: Industrial-Strength NLP and its online interactive course

I consider myself an aspiring NLP (Natural Language Processing) engineer. I have tried my hand at a few projects that deal with NLP, with a wariness towards Deep Learning that I don’t fully comprehend.

I became aware of spaCy close to two years ago. I read about its concept of a Pipeline for NLP and found it both practical and interesting. But I neither got the inclination to take it for a spin, nor did fate bestow me with an opportunity to do so (I know, I know, I have to make my own opportunities rather than wait for them…). It reared its head again when I was checking out rasa_nlu (an open-source platform that allows you to build NLU systems), which supports the use of spaCy pipelines.

While working on another NLP-related project, one that would become more versatile with a pipeline flow and plug-and-play components, I finally had the motivation to try it out.

Luckily for me, spaCy now has an interactive course (similar to those on Codecademy) built specifically to teach you how to use spaCy (with Python, of course) effectively.

I have only finished the first of its four chapters, but that was enough to motivate me to write this review-ish (puff-piece?) article.

Over the course of the first chapter, it introduces topics that most coders who have ever tried their hand at NLP will have heard of, if not worked with: Tokenization, POS (Part-of-Speech) tagging, NER (Named Entity Recognition) and statistical models.

So here are the main reasons I have an infatuation with spaCy, which I hope turns into more.

  1. Understanding how to deal with text

During one of my adventures (reinventing the wheel, slowly and badly) I had tried encapsulating text in such a way as to provide ease of use when processing it (in my case, text statistics were the focus). This allowed me to realize the importance of the Document-based structure that spaCy employs. It allows for a quick and painless way of dealing with text without all the melodrama.
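To make that concrete, here is a minimal sketch (my own example, assuming spaCy is installed and using a blank English pipeline) of how the Doc container hands you tokens and spans without any manual bookkeeping:

```python
import spacy

# A blank English pipeline is enough to illustrate the Doc container.
nlp = spacy.blank("en")
doc = nlp("spaCy wraps raw text in a Doc object.")

# Tokens and spans come for free -- no manual splitting or index juggling.
first_token = doc[0]   # a Token
span = doc[0:2]        # a Span, i.e. a slice of the Doc
print(first_token.text, "|", span.text, "|", len(doc))
```

The Doc owns the tokenization, so every Token and Span stays consistent with the original text, which is exactly the melodrama-free behaviour I was trying to hand-roll myself.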


2. Most things you need in a nice little box
spaCy provides pre-built statistical models for several languages. These come in different sizes, which you can download and use based on your requirements. Each comes with the above-mentioned trio (Tokenization, POS tagging and NER, with a side of syntactic dependency parsing) all packed and ready. The design successfully hides the complexity behind an easy-to-use interface. Below is a gist trying to prove the point.

3. They understand ‘The World is Not Enough’
Like any NLP amateur, I was overjoyed with the structure available for dealing with text and the pre-built statistical models. But they understand that no model is complete and no shoe fits all. Given the variety and complexity of the applications that can be built with NLP, there is no way (is there?) to have a model that works for everything. They realize this and provide the ‘Matcher’. No, it’s not your run-of-the-mill Regular Expressions support; it is a thorough mechanism that uses spaCy’s structure and models to let you customize (and fill the gaps), so that you only have to do the extra work required on top of the available models.

4. Matcher and Rule-based matching

The way Matcher is built allows you to create complex rules in addition to simple token patterns. The support for extended attributes (such as LEMMA, POS, etc.) and sequencing (a pattern can be built to apply to a span, i.e. a sequence of tokens) puts no limits on what can be achieved. This lets us build complex sets of rules (maybe as expressive as DBMS queries) that can be used to detect any text sequence of our choice (in addition to whatever the model has to offer).

Take a look at this gist for a sample.
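If the gist is not visible, the following is a small sketch of rule-based matching (assuming spaCy v3; the pattern name, pattern and example sentence are my own, not from the course). It matches the word “voice” followed by one of two nouns, using the `IN` attribute operator on plain token text:

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline has no POS tagger, so this pattern matches on raw text.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Hypothetical pattern: "voice" followed by "apps" or "assistants".
pattern = [{"LOWER": "voice"}, {"LOWER": {"IN": ["apps", "assistants"]}}]
matcher.add("VOICE_PRODUCT", [pattern])

doc = nlp("Developers build voice apps and voice assistants every day.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```

With a full model loaded, the same mechanism lets you match on `LEMMA`, `POS` and friends, which is where it starts to feel like writing queries over text.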

I don’t think all my unabashed comments do justice to the experience of using it and learning it yourself.

Hop over to https://course.spacy.io/chapter1 to start the course.
