spaCy or Spark NLP: A Benchmarking Comparison

The aim of this article is to run a realistic Natural Language Processing (NLP) scenario that compares two leading NLP libraries: John Snow Labs’ enterprise-grade Spark NLP and Explosion AI’s industrial-strength spaCy. Both are open source with commercially permissive licenses.

Comparing two libraries is not as simple as it sounds. Each is implemented differently and therefore has its own use cases, data pipelines, and characteristics. In this study, we design a detailed Spark NLP pipeline and write parallel spaCy code that mimics it, focusing primarily on runtime speed. We then compare the results in terms of memory usage, speed, and accuracy.

spaCy is one of the best-documented libraries I have seen, and it now includes a free course, which I have taken and would highly recommend for a quick brush-up. Spark NLP has a well-designed website loaded with useful information. For a quick start, I would recommend reading this article followed by this article. Many useful notebooks can be found in John Snow Labs’ repo for practice and a better understanding of Spark NLP’s dynamics.

Data:

We will be using a corpus of seven classics downloaded from Gutenberg.org. It contains approximately 4.9 million characters and 97 thousand sentences.

The whole notebook of the comparison and the corpus data can be found in my GitHub repo. Let’s start by examining the spaCy way.

Let’s take a brief pause here and observe the Regex Parser output. We asked for the noun chunker to return chunks that consist of a Determiner, an Adjective, and a Noun (proper, singular or plural).
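The chunk pattern described above can be sketched without either library. This is only an illustration of the Determiner + Adjective + Noun rule, not the notebook's code: it assumes the tokens have already been tagged with Penn Treebank tags, and the function name `find_chunks` and the sample sentence are made up for this example.

```python
from typing import List, Tuple

# Penn Treebank noun tags: singular, plural, proper singular, proper plural.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def find_chunks(tagged: List[Tuple[str, str]]) -> List[str]:
    """Return every Determiner + Adjective + Noun run in a tagged sentence."""
    chunks = []
    # Slide a three-token window over the sentence.
    for i in range(len(tagged) - 2):
        (w1, t1), (w2, t2), (w3, t3) = tagged[i:i + 3]
        if t1 == "DT" and t2 == "JJ" and t3 in NOUN_TAGS:
            chunks.append(f"{w1} {w2} {w3}")
    return chunks


# Hand-tagged sentence, invented for illustration (not from the corpus).
sentence = [
    ("The", "DT"), ("old", "JJ"), ("house", "NN"), ("stood", "VBD"),
    ("near", "IN"), ("some", "DT"), ("tall", "JJ"), ("trees", "NNS"),
]
print(find_chunks(sentence))  # ['The old house', 'some tall trees']
```

Restricting `NOUN_TAGS` to `{"NN"}` would give the singular-noun variant of the same rule.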

The results look pretty good.

Let’s continue building our blocks.

Let’s take a look at the regex matches:

Here are the results:

Let’s Venture Into The Characters…

Now that we have a dataset with many features, we have a plethora of options to dive into. Let’s examine the characters in the books by finding NER chunks that carry a ‘PERSON’ tag and consist of two words.

The code above returned 4832 names, which looks suspiciously high. Let’s inspect the Counter object:

Unfortunately, many tags are inaccurate. Note that some chapter titles and capitalized initials are wrongly returned with the ‘PERSON’ tag.
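The filter-and-count step discussed above can be sketched with the standard library alone. This is a minimal illustration, assuming the NER output has already been flattened into `(text, label)` pairs; the helper name `count_two_word_persons` and the sample entities are hypothetical and do not come from the notebook.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_two_word_persons(entities: Iterable[Tuple[str, str]]) -> Counter:
    """Count entity texts tagged PERSON that consist of exactly two words."""
    names = [
        text for text, label in entities
        if label == "PERSON" and len(text.split()) == 2
    ]
    return Counter(names)


# Made-up NER output. The last entry imitates a mis-tagged chapter title.
entities = [
    ("Stephen Dedalus", "PERSON"),
    ("Dublin", "GPE"),
    ("Stephen Dedalus", "PERSON"),
    ("Buck Mulligan", "PERSON"),
    ("Simon", "PERSON"),
    ("CHAPTER II", "PERSON"),
]

counts = count_two_word_persons(entities)
print(len(counts))            # number of distinct two-word names
print(counts.most_common(1))  # the most frequent name and its count
```

Note that the filter trusts the tagger: a mis-tagged two-word chapter title passes through just like a real name, which is why the raw count alone says little about accuracy.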

In writing the code above, mappings were used to keep processing fast, and the structure mimics the Spark NLP pipeline implemented below. Once we run similar code in Spark NLP, we will compare the results in terms of memory usage, speed, and accuracy.

Time to do things the Spark NLP way!

Now that our pieces are ready, let’s define the assembly line.

Let’s check regex matches according to our search criteria:
- A whole word that begins with a capital letter and ends with ‘ly’.
- ‘Stephen’ that is not followed by ‘Cardinal’ or ‘Proto’ but is followed by a word that starts with a capital letter.
- ‘Simon’ followed by a word that starts with a capital letter.
We are looking for at least two occurrences in each sentence…
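The three search criteria above can be approximated with Python's `re` module. This is a sketch under stated assumptions rather than the rules file used in the pipeline: the exact patterns are my own reading of the criteria, "at least two occurrences" is interpreted as two or more matches of any rule within one sentence, and the sample sentences are invented.

```python
import re
from typing import Dict, Iterable, List

# One pattern per criterion (assumed translations of the prose rules).
PATTERNS = [
    r"\b[A-Z]\w*ly\b",                          # capitalized word ending in 'ly'
    r"\bStephen\s(?!Cardinal|Proto)[A-Z]\w*",   # Stephen + capitalized word, with exclusions
    r"\bSimon\s[A-Z]\w*",                       # Simon + capitalized word
]
# Non-capturing groups so that findall returns the full matched text.
COMBINED = re.compile("|".join(f"(?:{p})" for p in PATTERNS))


def sentences_with_matches(
    sentences: Iterable[str], min_matches: int = 2
) -> Dict[str, List[str]]:
    """Map each sentence with at least `min_matches` hits to its matches."""
    hits = {}
    for sentence in sentences:
        found = COMBINED.findall(sentence)
        if len(found) >= min_matches:
            hits[sentence] = found
    return hits


# Invented sentences for illustration only.
samples = [
    "Stephen Dedalus met Simon Moonan.",
    "Stephen Cardinal spoke slowly.",
    "Suddenly Simon Dedalus laughed.",
    "Only one match here.",
]
for sentence, found in sentences_with_matches(samples).items():
    print(sentence, "->", found)
```

The negative lookahead `(?!Cardinal|Proto)` rejects any following word that merely starts with those strings, which is one plausible reading of the second criterion.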

Let’s take a look at the results…

The chunker annotator in our pipeline is going to return chunks that consist of a Determiner, an Adjective, and a singular Noun.

Here are the top 20 rows of the chunker results.

Let’s Venture Into The Characters…Spark NLP Way.

Here we examine the characters in the books again, this time using Spark NLP mechanics. Note the differences in accuracy compared to spaCy.

This time, the number of character names is limited to 1284. Let’s take a look at the 350 most common ones.

The list of names looks much more accurate. No wonder Spark NLP is preferred by enterprises!

Let’s Talk about Resource Consumption:
The system used for this study has an 8-core Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz and 32820 MB of memory, running Ubuntu 20.04.

Spark NLP uses less memory and runs twice as fast as spaCy. Coupled with Spark NLP’s higher accuracy, this provides good reasons to master the library!
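For readers who want to take similar measurements themselves, here is one possible way to time a pipeline run and record its peak memory using only the standard library. It is a sketch, not the method used for the figures in this article: the helper `profile` and the toy workload are hypothetical, and `tracemalloc` only sees allocations made through Python's allocator, so memory used inside the JVM (where Spark NLP does its work) or by native extensions would have to be measured at the process level instead.

```python
import time
import tracemalloc
from typing import Any, Callable, Tuple


def profile(fn: Callable[[], Any]) -> Tuple[Any, float, int]:
    """Run fn once; return (result, elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn()
    finally:
        # Always record the measurements and stop tracing, even on error.
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak


def toy_pipeline() -> list:
    # Stand-in workload; replace with a call that runs a real pipeline.
    return [word.upper() for word in "a small stand in workload".split()]


result, seconds, peak_bytes = profile(toy_pipeline)
print(f"{seconds:.4f} s, peak {peak_bytes / 1024:.1f} KiB")
```

Running each pipeline several times and comparing the median is generally more reliable than a single run, since the first call often includes model loading.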

In this article, we compared an NLP pipeline in both libraries. Although implementing the same process flow in two completely different libraries is very difficult, the code was mirrored to the maximum extent possible. As expected, Spark NLP proved to be faster and more accurate in terms of Named Entity Recognition. For small datasets, spaCy may be more practical and possibly even faster, but as the data size increases, Spark NLP’s speed advantage becomes clearly visible. No wonder Spark NLP is the weapon of first choice for enterprises.