
SpaCy or Spark NLP — A Benchmarking Comparison


The aim of this article is to run a realistic Natural Language Processing scenario to compare the leading linguistic programming libraries: enterprise-grade John Snow Labs’ Spark NLP and Explosion AI’s industrial-strength library spaCy, both of which are open-source with commercially permissive licenses.

Comparing two different libraries is not as simple as it sounds. Each library has different implementation methods and thus different use cases, data pipelines, and characteristics. In this study, a detailed Spark NLP pipeline is designed and parallel spaCy code mimicking it is written, with a primary focus on runtime speed. We will then compare the results in terms of memory usage, speed, and accuracy.

spaCy is one of the best-documented libraries I have seen, and it now includes a free course, which I have taken and would highly recommend for a quick brush-up. Spark NLP has a well-designed website loaded with useful information. For a quick start, I would recommend reading this article followed by this article. Many useful notebooks can be found in John Snow Labs’ repo for practice and a better understanding of Spark NLP’s dynamics.

Data:

We will be using a corpus of seven classics downloaded from Gutenberg.org, totaling approximately 4.9 million characters and 97 thousand sentences.

The whole notebook of the comparison and the corpus data can be found in my GitHub repo. Let’s start by examining the spaCy way.
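The snippets below assume a setup along the following lines. This is a minimal sketch: the spaCy model name, the punct pattern, and the pandas DataFrame df (one sentence per row in a ‘text’ column) are assumptions based on the code that follows.

import re
import pandas as pd
import spacy
from nltk.stem import PorterStemmer

nlp = spacy.load('en_core_web_lg')   # a model with word vectors, needed for token.vector in 2.12
punct = r"[^\w\s]"                   # punctuation pattern used by the normalizer steps
# df: a pandas DataFrame with one sentence per row in a 'text' column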

### 2.1 Clean Tabs and Whitespace
clean_shrink = lambda text : re.sub(r'\s+', ' ', text).strip()   # collapse tabs, newlines, and repeated spaces (str.replace does not accept regex)
df.loc[:,'document'] = df.text.map(clean_shrink)
### 2.2 Tokenizer
sentence_tokenizer = lambda sent : [token for token in nlp(sent)]
df.loc[:,'token']=df.document.map(sentence_tokenizer)
### 2.3 Normalizer
normalizer = lambda tokens : [re.sub(punct,'',token.text) for token in tokens if re.sub(punct,'',token.text) != '']
df.loc[:,'normalized']=df.token.map(normalizer)
### 2.4 Remove Stop Words
normalizer_and_stop = lambda tokens : [re.sub(punct,'',token.text) for token in tokens if re.sub(punct,'',token.text) != '' and not token.is_stop]
df.loc[:,'cleanTokens']=df.token.map(normalizer_and_stop)
### 2.5 Lemmatize
normalizer_and_stop_lemma = lambda tokens : [re.sub(punct,'',token.lemma_) for token in tokens if re.sub(punct,'',token.text) != '' and not token.is_stop]
df.loc[:,'lemma']=df.token.map(normalizer_and_stop_lemma)
### 2.6 Stemmer
stemmer = PorterStemmer()
stems = lambda tokens : [stemmer.stem(token.text) for token in tokens]
df.loc[:,'stem']=df.token.map(stems)
### 2.7 Part of Speech Tagging
normalizer_and_stop_pos = lambda tokens : [re.sub(punct,'',token.pos_) for token in tokens if re.sub(punct,'',token.text) != '' and not token.is_stop]
df.loc[:,'pos'] = df.token.map(normalizer_and_stop_pos)   # map over the spaCy tokens; cleanTokens holds plain strings with no .pos_
### 2.8 Token Assembler
token_assembler = lambda tokens : " ".join(tokens)
df.loc[:,'clean_text']=df.cleanTokens.map(token_assembler)
### 2.9 Tagger
tagger = lambda text : [(ent.text, ent.label_) for ent in nlp(text).ents]
df.loc[:,'ner_chunks']=df.loc[:,'document'].map(tagger)
### 2.10 Regex Parser
noun_chunker = lambda text : [(chnk, (chnk[0].pos_, chnk[1].pos_, chnk[2].tag_))
                              for chnk in nlp(text).noun_chunks
                              if len(chnk.text.split()) == 3
                              and chnk.text.replace(' ', '').isalpha()
                              and chnk[0].pos_ == 'DET'
                              and chnk[1].pos_ == 'ADJ'
                              and chnk[2].tag_ in ['NN', 'NNP']]
df.loc[:,'RegexpParser'] =df.loc[:,'document'].map(noun_chunker)

Let’s take a brief pause here and observe the Regex Parser output. We asked the noun chunker to return chunks that consist of a Determiner, an Adjective, and a Noun (a singular common or proper noun, i.e. NN or NNP).

[chunk for chunk in df.RegexpParser.values if chunk!=[]]

The results look pretty good.

[[(My dear Robinson, ('DET', 'ADJ', 'NNP'))],
[(this accidental souvenir, ('DET', 'ADJ', 'NN'))],
[(a great deal, ('DET', 'ADJ', 'NN'))],
[(a great amount, ('DET', 'ADJ', 'NN'))],
[(the local hunt, ('DET', 'ADJ', 'NN')),
(some surgical assistance, ('DET', 'ADJ', 'NN'))],
[(a remarkable power, ('DET', 'ADJ', 'NN'))],
[(a good deal, ('DET', 'ADJ', 'NN'))],
[(a fresh basis, ('DET', 'ADJ', 'NN')),
(this unknown visitor, ('DET', 'ADJ', 'NN'))],
[(the obvious conclusion, ('DET', 'ADJ', 'NN'))],
[(their good will, ('DET', 'ADJ', 'NN'))],
[(my dear Watson, ('DET', 'ADJ', 'NNP')),
(a young fellow, ('DET', 'ADJ', 'NN'))],
[(a favourite dog, ('DET', 'ADJ', 'NN'))],
[(the latter part, ('DET', 'ADJ', 'NN'))],
[(that local hunt, ('DET', 'ADJ', 'NN'))],
[(a heavy stick, ('DET', 'ADJ', 'NN'))],
[(a professional brother, ('DET', 'ADJ', 'NN'))],
[(the dramatic moment, ('DET', 'ADJ', 'NN'))],
[(a long nose, ('DET', 'ADJ', 'NN'))],
................

Let’s continue building our blocks.

### 2.11 N-Gram Generator
ngram_generator = lambda input_list: [*zip(*[input_list[i:] for i in range(n)])]
n=3
df.loc[:,'triGrams'] = df.loc[:,'token'].map(ngram_generator)
### 2.12 Word2Vec Embeddings
vector = lambda tokens: [(token.text, token.has_vector, token.vector, token.is_oov) for token in tokens]
df.loc[:,'vectors'] = df.loc[:,'token'].map(vector)
### 2.13 Regex Matcher
rules = r'''\b[A-Z]\w+ly\b|Stephen\s(?!Proto|Cardinal)[A-Z]\w+|Simon\s[A-Z]\w+'''
regex_matchers = lambda text : re.findall(rules,text)
df.loc[:,'Regex_matches'] =df.loc[:,'document'].map(regex_matchers)

Let’s take a look at the regex matches:

df.Regex_matches[df.Regex_matches.map(len)>1]

Here are the results:

13123 [Polly, Polly]
25669 [Sally, Sally]
27262 [Sally, Sally]
27273 [Polly, Sally]
27340 [Polly, Sally]
28311 [Pelly, Dolly]
42016 [Feely, Feely]
49802 [Finally, Emily]
52129 [Lively, Lively]
58295 [Silly, Milly]
62141 [Silly, Milly]
64811 [Only, Molly]
71650 [Hely, Daly]
74427 [Healy, Dolly]
77404 [Molly, Milly]
77437 [Milly, Molly]
81557 [Molly, Reilly]
84023 [Szombathely, Szombathely]
89594 [Healy, Joly]
92206 [Simon Dedalus, Stephen Dedalus]
92980 [Firstly, Nelly, Nelly, Nelly]
93046 [Szombathely, Karoly]
94402 [Reilly, Simon Dedalus, Hely]
94489 [Stephen Dedalus, Simon Dedalus]

Let’s Venture Into The Characters…

Now that we have a dataset with many features, we have a plethora of options to dive into. Let’s examine the characters in the books by finding NER chunks tagged ‘PERSON’ that consist of exactly two words.

flatlist = lambda l : [re.sub("[^a-zA-Z\s\']","",item[0]).title().strip()  for sublist in l for item in sublist if item[1]=='PERSON' and len(item[0].split())==2]
ner_chunks = df.ner_chunks.to_list()
names = flatlist(ner_chunks)
len(names)

The code above returned 4832 names, which looks suspiciously high. Let’s inspect the result of the Counter object:
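The counts below can be produced by tallying the extracted names with a Counter (a sketch; the exact call is not shown here):

from collections import Counter
Counter(names).most_common()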

('St Clare', 306),
('Buck Mulligan', 96),
('Aunt Chloe', 76),
('Martin Cunningham', 74),
('Masr George', 45),
('Ned Lambert', 44),
('Solomon Northup', 43),
('Tom Sawyer', 42),
('Aunt Sally', 37),
('Uncle Tom', 37),
('Ben Dollard', 36),
('Myles Crawford', 35),
('Blazes Boylan', 28),
('Uncle Abram', 24),
('J J', 22),
......
('L L', 5),
('Mark Twain', 5),
("Tom Sawyer'S", 5),
('Chapter Xi', 5),
('Chapter Xix', 5),
('Dey Wuz', 5),
('George Jackson', 5),
('Levi Bell', 5),
('King Lear', 5),
('Simon Legree', 5),
('Garrett Deasy', 5),
('A E', 5),
('S D', 5),
('Josie Powell', 5),
('Mrs Purefoy', 5),
('Ben Howth', 5),
('Bald Pat', 5),
('Barney Kiernan', 5),
('Michael Gunn', 5),
('C C', 5),

Unfortunately, many tags are inaccurate. Note the chapter titles and capitalized initials that are incorrectly returned as PERSON entities.

The code above relies on pandas map operations for speed and mirrors the Spark NLP pipeline implemented below. Once we run the equivalent code in Spark NLP, we will compare the results in terms of memory usage, speed, and accuracy.

Time to do things the Spark NLP way!
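Before defining the annotators, Spark NLP needs its imports and a running Spark session. A minimal setup sketch using the usual quick-start defaults (an assumption, not necessarily the exact configuration used for the benchmark):

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
import pyspark.sql.functions as F

spark = sparknlp.start()   # local Spark session with Spark NLP on the classpath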

documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")\
.setCleanupMode("shrink")
sentenceDetector = SentenceDetector()\
.setInputCols(['document'])\
.setOutputCol('sentences')
tokenizer = Tokenizer() \
.setInputCols(["sentences"]) \
.setOutputCol("token")
ngrams = NGramGenerator() \
.setInputCols(["token"]) \
.setOutputCol("ngrams") \
.setN(3) \
.setEnableCumulative(False)\
.setDelimiter("_") # Default is space
normalizer = Normalizer() \
.setInputCols(["token"]) \
.setOutputCol("normalized")\
.setLowercase(False)\
.setCleanupPatterns(["[^\w\d\s\.\!\?]"])
stopwords_cleaner = StopWordsCleaner()\
.setInputCols("normalized")\
.setOutputCol("cleanTokens")\
.setCaseSensitive(False)
lemma = LemmatizerModel.pretrained('lemma_antbnc') \
.setInputCols(["cleanTokens"]) \
.setOutputCol("lemma")
stemmer = Stemmer() \
.setInputCols(["token"]) \
.setOutputCol("stem")
pos = PerceptronModel.pretrained("pos_anc", 'en')\
.setInputCols("clean_text", "cleanTokens")\
.setOutputCol("pos")
chunker = Chunker()\
.setInputCols(["sentences", "pos"])\
.setOutputCol("chunk")\
.setRegexParsers(["<DT>+<JJ>*<NN>"]) ## Determiner - adjective - singular noun
tokenassembler = TokenAssembler()\
.setInputCols(["sentences", "cleanTokens"]) \
.setOutputCol("clean_text")
tokenizer2 = Tokenizer() \
.setInputCols(["clean_text"]) \
.setOutputCol("token2")
glove_embeddings = WordEmbeddingsModel.pretrained('glove_100d')\
.setInputCols(["document", "lemma"])\
.setOutputCol("embeddings")\
.setCaseSensitive(False)
onto_ner = NerDLModel.pretrained("onto_100", 'en') \
.setInputCols(["document", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentences", "token", "ner"]) \
.setOutputCol("ner_chunk")
rules = r'''\b[A-Z]\w+ly\b, a capitalized word ending with 'ly'
Stephen\s(?!Proto|Cardinal)[A-Z]\w+, 'Stephen' not followed by 'Proto' or 'Cardinal'
Simon\s[A-Z]\w+, 'Simon' followed by a capitalized word
'''
with open('ulyses_regex_rules.txt', 'w') as f:
    f.write(rules)
regex_matcher = RegexMatcher()\
.setInputCols('sentences')\
.setStrategy("MATCH_ALL")\
.setOutputCol("regex_matches")\
.setExternalRules(path='./ulyses_regex_rules.txt', delimiter=',')

Now that our pieces are ready, let’s define the assembly line.

nlpPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
ngrams,
normalizer,
stopwords_cleaner,
lemma,
stemmer,
tokenassembler,
tokenizer2,
pos,
chunker,
glove_embeddings,
onto_ner,
ner_converter,
regex_matcher

])
empty_df = spark.createDataFrame([['']]).toDF("text")
pipelineModel = nlpPipeline.fit(empty_df)
lib_result = pipelineModel.transform(library)
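Note that library, the Spark DataFrame holding the corpus, is not created in the snippets above. One plausible way to build it from the pandas frame used in the spaCy section (an assumption, not the author’s loading code):

library = spark.createDataFrame(df[['text']])   # same corpus, one row per sentence, in a 'text' column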

Let’s check the regex matches against our search criteria:
- A whole word that begins with a capital letter and ends with ‘ly’,
- ‘Stephen’ not followed by ‘Cardinal’ or ‘Proto’, but followed by a word that starts with a capital letter,
- ‘Simon’ followed by a word that starts with a capital letter.
We are looking for at least two occurrences in each sentence (see the sketch below for flattening the matches into match_df).
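The match_df used below holds the regex matcher output flattened into a plain array column; one way to produce it is with a Finisher (a sketch, since this step is not shown in the original):

finisher = Finisher()\
.setInputCols(["regex_matches"])\
.setOutputCols(["finished_regex_matches"])
match_df = finisher.transform(lib_result)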

match_df.filter(F.size('finished_regex_matches')>1).show(truncate = 50)

Let’s take a look at the results…

The chunker annotator in our pipeline is going to return chunks that consist of a Determiner, an Adjective, and a singular Noun.

lib_result.withColumn(
"tmp",
F.explode("chunk")) \
.select("tmp.*").select("begin","end","result","metadata.sentence").show(20,truncate = 100)

Here are the top 20 rows of the chunker results.

Let’s Venture Into The Characters… The Spark NLP Way.

Here we examine the characters in the books again, this time using Spark NLP mechanics. Please note the difference in accuracy compared to spaCy.
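The result_ner DataFrame used below is not built in the snippets above. A minimal sketch of one way to derive it from the pipeline output, assuming the entity label is read from the ner_chunk metadata that NerConverter populates:

result_ner = lib_result.withColumn("tmp", F.explode("ner_chunk"))\
.select(F.col("tmp.result").alias("ner_chunk"),
        F.col("tmp.metadata").getItem("entity").alias("ner_label"))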

l = result_ner.filter(result_ner.ner_label == "PERSON").select(F.expr("ner_chunk")).collect()
names = [re.sub("[^a-zA-Z\s\']", "", l_[0]).title() for l_ in l
         if l_[0].replace(' ', '').isalpha() and len(l_[0].strip().split()) == 2 and "’" not in l_[0]]
len(set(names))

This time, the number of characters is limited to 1284. Let’s take a look at the 350 most common names.

('Buck Mulligan', 93),
('Aunt Chloe', 82),
('Martin Cunningham', 71),
('Bayou Boeuf', 48),
('Aunt Sally', 39),
('Ned Lambert', 39),
('Mary Jane', 38),
('Solomon Northup', 36),
('John Thornton', 34),
('Myles Crawford', 33),
('Ben Dollard', 31),
('Sherlock Holmes', 30),
('Tom Sawyer', 30),
('John Eglinton', 29),
('Nosey Flynn', 28),
('Corny Kelleher', 27),
('Mrs Breen', 27),
('Father Conmee', 26),
('Uncle Tom', 25),
('John Wyse', 24),
('Henry Baskerville', 23),
('Uncle Abram', 22),
('Blazes Boylan', 19),
('Bob Doran', 18),
('Davy Byrne', 18),
('Coombe Tracey', 17),
('Aunt Phebe', 17),
('Simon Dedalus', 17),
........................
('Sandy Hill', 8),
('Theophilus Freeman', 8),
('Father Cowley', 8),
('Gregory B', 7),
('George Harris', 7),
('Rachel Halliday', 7),
('George Shelby', 7),
('Anne Hampton', 7),
('Peter Tanner', 7),
('Almidano Artifoni', 7),
('Hugo Baskerville', 6),
('Laura Lyons', 6),
('Aunt Polly', 6),
('Peter Wilks', 6),
.........................

The list of names looks much more accurate. No wonder Spark NLP is the enterprise favorite!

spaCy is good, but Spark NLP is better…

Let’s Talk About Resource Consumption:
The system used for this study has an 8-core Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz and 32,820 MB of memory, running Ubuntu 20.04.

In this comparison, Spark NLP used less memory and ran roughly twice as fast as spaCy. Coupled with Spark NLP’s higher accuracy, that is a good reason to master this library!
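The article does not show the measurement harness. One simple way to record wall-clock time and resident memory around a run is sketched below using psutil (an assumption, not the author’s setup); note that for Spark NLP most memory lives in the JVM, so this only captures the Python driver process.

import os, time, psutil

proc = psutil.Process(os.getpid())
start = time.time()
# ... run the spaCy or Spark NLP pipeline over the corpus here ...
print(f"wall time: {time.time() - start:.1f} s")
print(f"resident memory: {proc.memory_info().rss / 1024**2:.0f} MB")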

In this article, we compared an NLP pipeline implemented in both libraries. Although reproducing the same process flow in two completely different libraries is difficult, the code was matched as closely as possible. As expected, Spark NLP proved faster and more accurate at Named Entity Recognition. For small datasets, spaCy may be more practical and possibly even faster, but as the data size grows, Spark NLP’s speed advantage becomes clear. No wonder Spark NLP is the first choice for enterprises.
