spaCy basic features: comparing performance for Portuguese, French and English

Wilame
Nov 19, 2018

spaCy is a great tool for NLP. The open source library is free, fast and easy to use. It also has useful features such as non-destructive tokenisation and support for many languages. But anyone who does NLP knows how hard it is to work with languages other than English. So today I will test some basic spaCy features on English, Portuguese and French to see how the tool deals with each of them.

How does spaCy work?

spaCy uses statistical models to do most of the language processing work. It relies on neural models for tagging, parsing and entity recognition. In order to use most of the features of spaCy, you have to download and install the models for the language you are working with.
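
As a rough sketch of that download-and-load step, assuming spaCy v3 and the small packages `en_core_web_sm`, `pt_core_news_sm` and `fr_core_news_sm` (the article does not say which packages were used, and `load_pipeline` is a hypothetical helper name):

```python
import spacy

# Assumed package names: the small pipelines spaCy publishes for each language.
MODEL_NAMES = {
    "en": "en_core_web_sm",
    "pt": "pt_core_news_sm",
    "fr": "fr_core_news_sm",
}


def load_pipeline(lang):
    """Load the trained pipeline for `lang`, with a hint if it is not installed."""
    name = MODEL_NAMES[lang]
    try:
        return spacy.load(name)
    except OSError as err:
        # spacy.load raises OSError when the package cannot be found.
        raise OSError(
            f"Model {name!r} not found; install it with: "
            f"python -m spacy download {name}"
        ) from err
```

Calling `load_pipeline("pt")` would then return the Portuguese pipeline, provided the package has been downloaded first.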

Here we find the first difference between the languages: at the time of writing, English has 4 available models, while Portuguese has only one, trained on Wikipedia articles. French has 2 available models, also trained on Wikipedia.

This means that features like named entities are slightly less complete for other languages than for English. You can, however, train your own models and submit them to spaCy if you need to.

Loading spaCy

I will analyse how the features are available for the three languages by using the same text translated into each of them. Each text is passed to an nlp object, which analyses each token in the text and returns a document (a Doc object) that holds extra metadata about the original text.

The resulting document still prints as the original text, but you can access the properties of each token individually later.

That being said, the first thing we want to see is how the processed text looks in Python when we use a simple print function.
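
The original snippet is not available, so here is a minimal sketch of the idea. The sample sentence is a stand-in rather than the text used in the article, and a blank English pipeline is used because printing needs only the tokeniser, not a downloaded model:

```python
import spacy

# A blank pipeline only tokenises, which is enough to show how a Doc prints.
nlp = spacy.blank("en")

# Stand-in sample sentence (not the text used in the article).
text = "The minister met the president in Paris."
doc = nlp(text)

# Tokenisation is non-destructive: printing the Doc gives back the input text.
print(doc)
```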

If you print the document, you will see that the text is printed like a normal string. There’s nothing exciting about it. The magic happens when you access the properties of each token using a for loop.

spaCy does a good job for all three languages here. By the way, the available properties are:

- Text: The original word text.
- Lemma: The base form of the word.
- POS: The simple part-of-speech tag.
- Tag: The detailed part-of-speech tag.
- Dep: Syntactic dependency, i.e. the relation between tokens.
- Shape: The word shape — capitalisation, punctuation, digits.
- Is alpha: Does the token consist of alphabetic characters?
- Is stop: Is the token part of a stop list, i.e. the most common words of the language?
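
The loop over these properties can be sketched as follows. `describe_tokens` is a hypothetical helper; the attribute names are the ones spaCy exposes on each token:

```python
import spacy


def describe_tokens(doc):
    """Collect the properties listed above for every token in a Doc."""
    rows = []
    for token in doc:
        rows.append((
            token.text,      # Text
            token.lemma_,    # Lemma
            token.pos_,      # POS
            token.tag_,      # Tag
            token.dep_,      # Dep
            token.shape_,    # Shape
            token.is_alpha,  # Is alpha
            token.is_stop,   # Is stop
        ))
    return rows
```

With a trained pipeline, for example `spacy.load("en_core_web_sm")` (an assumed model choice), every column is filled in. With a blank pipeline, the lemma, POS, tag and dependency columns stay empty because no tagger or parser has run.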

Named Entities

spaCy can recognise named entities in your text, but here we start to notice differences between the languages. The metadata available for English is much more detailed. I also noticed some classification problems for French and Portuguese.
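
A minimal sketch of how the entities can be read from a processed document. `list_entities` is a hypothetical helper, and the labels you get back depend on the model that produced the document:

```python
import spacy


def list_entities(doc):
    """Return (entity text, entity label) pairs found in a processed Doc."""
    return [(ent.text, ent.label_) for ent in doc.ents]
```

Running this on the same sentence processed by the English, Portuguese and French pipelines is one way to compare both the entities found and the label sets each model uses.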

Similarity

We can use spaCy to check whether two words or sentences are similar to one another. I tested similarity between the words “minister” and “president” and between “car” and “banana” for each of the languages. The results were pretty good for the first comparison, but not so good for the second one.
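
The comparison can be sketched like this. `text_similarity` is a hypothetical helper, and it assumes the pipeline passed in ships word vectors (in current spaCy releases that generally means a medium or large package, since the small ones carry no real word vectors):

```python
import spacy


def text_similarity(nlp, first, second):
    """Return the similarity of two texts as a plain Python float."""
    return float(nlp(first).similarity(nlp(second)))
```

With a vector-equipped pipeline, `text_similarity(nlp, "minister", "president")` would be expected to score higher than `text_similarity(nlp, "car", "banana")`. No scores are shown here because they depend entirely on the model used.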

Stop words

Detecting stop words is a very basic task for an NLP library. spaCy does the job correctly for all the languages, with one exception: in French, it treated “Le” as a non-stop word.
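
A small sketch for checking this yourself. `stop_word_report` is a hypothetical helper that shows spaCy's own flag next to a lowercase lookup in the language's stop list, which is one possible way to sidestep capitalisation issues such as the “Le” case:

```python
import spacy


def stop_word_report(nlp, text):
    """Return (text, is_stop flag, lowercase form in stop list) per token."""
    stop_words = nlp.Defaults.stop_words  # the language's built-in stop list
    return [
        (token.text, token.is_stop, token.lower_ in stop_words)
        for token in nlp(text)
    ]
```

Stop lists belong to the language defaults, so this works even with a blank pipeline such as `spacy.blank("fr")`.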

Lemmatisation

Here I found no problems for any of the languages.

Sentences

What’s a sentence? A good NLP tool has to be able to work this out without relying only on punctuation. spaCy does a good job of sentence recognition for structured text.
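
Sentence splitting can be sketched as below. `split_sentences` is a hypothetical helper, and it assumes the pipeline sets sentence boundaries, which the trained models do through their dependency parser:

```python
import spacy


def split_sentences(nlp, text):
    """Return the sentences spaCy finds in `text` as a list of strings."""
    doc = nlp(text)
    # doc.sents raises ValueError if no component has set sentence boundaries.
    return [sent.text for sent in doc.sents]
```

A blank pipeline has no parser, so it would need a boundary-setting component added first, for example with `nlp.add_pipe("sentencizer")` in spaCy v3. That component is rule-based and relies on punctuation, so it is not a substitute for the parser-based behaviour described above.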

Conclusions

I love spaCy, and the library is very good at basic tasks on structured English text (press, books etc.). But as you saw, we still need to do some extra work when it comes to other languages.

If you are using spaCy for a project that is not in English, I would strongly recommend that you explore all the customisation possibilities spaCy offers in order to tweak it for your project.
