NER model with ELMo Embeddings in 5 minutes in Spark-NLP

This tutorial consists of the following steps:

  1. Download the CoNLL dataset into your Spark environment

This article shows how to train a custom NER model in Python with the lightning-fast Spark-NLP library, leveraging the latest embeddings and techniques from the world of deep learning!

We will use a former SOTA model based on Named Entity Recognition with Bidirectional LSTM-CNNs by Chiu & Nichols, a neural network architecture that automatically detects word- and character-level features using a hybrid of a bidirectional LSTM and a CNN. We will feed this NER model the embeddings generated by ELMo, which itself runs on bidirectional LSTMs.

What is named entity recognition (NER)?

Named-entity recognition (NER), also known as entity identification, entity chunking, or entity extraction, is a sub-task of information extraction that aims to locate named entities in text and classify them into pre-defined categories such as the names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages, etc. Word embeddings are extremely important for NER and many other NLP tasks.

Example medical NER
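To make the "locate and classify" part of the definition concrete, here is a minimal sketch of how token-level tags can be collapsed into named entities. The sentence, the tags, and the helper name group_entities are made up for illustration; the B-/I-/O tagging scheme is assumed here because it is common in NER datasets, and this is not how Spark-NLP does it internally.

```python
from typing import List, Optional, Tuple


def group_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Collapse token-level B-/I-/O tags into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, Optional[str]]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        # An I- tag of the same type extends the entity that is currently open
        if prefix == "I" and current_tokens and label == current_type:
            current_tokens.append(token)
            continue
        # Any other tag closes the open entity, if there is one
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        # B- (or an I- without an open entity) starts a new entity; O starts nothing
        if prefix in ("B", "I"):
            current_tokens, current_type = [token], label

    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Hypothetical example sentence with hand-written tags
tokens = ["Angela", "Merkel", "visited", "Google", "in", "New", "York", "."]
tags = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC", "I-LOC", "O"]
print(group_entities(tokens, tags))
```

An NER model's job is to predict the tags list; grouping the tags into entities afterwards is plain bookkeeping.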

How does ELMo differ from past approaches?

ELMo, created by AllenNLP, broke the state of the art (SOTA) in many NLP tasks. Together with ULMFiT and the OpenAI transformer, ELMo brought about NLP's breakthrough ImageNet moment. These embedding techniques were a great step forward, yielding better results than older methods like word2vec or GloVe.

ELMo improving SOTA by 10-20%

How does it differ from newer models like BERT, XLNET or ALBERT?

In contrast to BERT and ALBERT, which are trained by masking random words in a sentence and predicting them (XLNet uses a related permutation-based objective), ELMo is trained to predict the next word in a sequence, once reading left to right and once right to left. ELMo relies on bidirectional LSTMs under the hood and, unlike BERT, XLNet, ALBERT, and USE, is not transformer-based. In case you want to try them out, they are all available in Spark NLP as annotators! We will cover them in upcoming tutorials.

ELMo's LSTMs under the hood. No transformers to see here, please move along!
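The difference between the two training objectives is easier to see on a toy sentence. The sketch below only builds (input, target) training examples; it trains nothing. The function names are hypothetical, and the masking is deliberately simplified (real masked language models use a more elaborate replacement scheme).

```python
import random
from typing import Dict, List, Tuple


def forward_lm_examples(tokens: List[str]) -> List[Tuple[List[str], str]]:
    """Next-word prediction: predict token i from the tokens to its left."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def backward_lm_examples(tokens: List[str]) -> List[Tuple[List[str], str]]:
    """Same idea in reverse: predict token i from the tokens to its right."""
    return [(tokens[i + 1:], tokens[i]) for i in range(len(tokens) - 1)]


def masked_lm_example(tokens: List[str], mask_prob: float = 0.15,
                      seed: int = 0) -> Tuple[List[str], Dict[int, str]]:
    """Masked prediction: hide random tokens and predict them from both sides."""
    rng = random.Random(seed)
    masked: List[str] = []
    targets: Dict[int, str] = {}
    for i, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[i] = token  # the model must recover this token
        else:
            masked.append(token)
    return masked, targets


sentence = ["the", "cat", "sat", "on", "the", "mat"]
print(forward_lm_examples(sentence)[:2])
print(backward_lm_examples(sentence)[:2])
print(masked_lm_example(sentence, mask_prob=0.5))
```

A bidirectional language model in the ELMo style combines the first two kinds of examples, while a masked model sees the left and right context of a hidden token at the same time.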

Let's get our hands dirty with some coding!

I have not yet set up Spark NLP, what should I do?

No problem, there are a lot of tutorials to quickly set up Spark NLP in many different environments! The fastest way is to either import the Databricks notebook or just click the Colab link, which opens a Google Colab notebook with this code ready to execute for you!

Click here to try out the premade notebook yourself in Google Colab!

1. Download the CoNLL dataset into the Databricks File System

We will download the CoNLL dataset, which is nowadays the standard in NER problems and is used as a benchmark in most papers. We can download it straight to a Databricks cluster or Colab notebook with this code snippet:

# Download the CoNLL 2003 dataset
import urllib.request
from pathlib import Path

download_path = "./eng.train"
url = "https://github.com/patverga/torch-ner-nlp-from-scratch/raw/master/data/conll2003/eng.train"

# Only download the file if it is not already on disk
if not Path(download_path).is_file():
    print("File not found, downloading it!")
    urllib.request.urlretrieve(url, download_path)
else:
    print("File already present.")

2. Read the CoNLL dataset into a Spark DataFrame

The readDataset() method of the CoNLL class handily parses the CoNLL format and creates a Spark DataFrame. It takes care of splitting each sentence into tokens, reading the part-of-speech and entity labels, indexing multi-sentence examples, and embedding the dataset in the internal Spark NLP document structure.

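# Load the CoNLL dataset into Spark
# `spark` is the active Spark session of the notebook; the path matches download_path from step 1
from sparknlp.training import CoNLL
training_data = CoNLL().readDataset(spark, './eng.train')
training_data.show()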
Dataset ready for NER tasks
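To get a feel for what the reader has to deal with, here is a rough pure-Python sketch of parsing CoNLL-style lines. It is not the Spark-NLP implementation. It assumes the usual CoNLL-2003 layout of one token per line with whitespace-separated columns (token, POS tag, chunk tag, entity label), blank lines between sentences, and -DOCSTART- lines between documents. The sample lines and the name parse_conll are illustrative.

```python
from typing import Iterable, List, Tuple

# One sentence is a list of (token, pos_tag, ner_label) triples
Sentence = List[Tuple[str, str, str]]


def parse_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group CoNLL-style lines into sentences of (token, pos, label) triples."""
    parsed: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        # Blank lines and document markers end the current sentence
        if not line or line.startswith("-DOCSTART-"):
            if current:
                parsed.append(current)
                current = []
            continue
        fields = line.split()
        # First column: token, second: POS tag, last: entity label
        current.append((fields[0], fields[1], fields[-1]))
    if current:
        parsed.append(current)
    return parsed


# Illustrative sample in the assumed layout
sample = """-DOCSTART- -X- O O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
""".splitlines()

for sentence in parse_conll(sample):
    print(sentence)
```

The Spark-NLP reader does this kind of grouping for you and additionally wraps the result in its annotation columns, so the sketch is only meant to demystify the file format.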

3. Define the NER ELMO pipeline

Since Spark-NLP does all of the heavy TensorFlow lifting for us, all that remains is defining each step of our pipeline with the handy components of the Spark-NLP library. This pipeline consists of just 2 pieces: the pre-trained ELMo model and the char CNNs - BiLSTM - CRF model. Usually, the columns document, sentence, token, pos, and label must be extracted manually from the dataset, but we already took care of that in the last step!

3.1 Define the imports

# Spark ML imports
from pyspark.ml import Pipeline
# Spark-NLP imports
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

3.2 Define the ELMO pipeline component

The call to ElmoEmbeddings.pretrained() returns a PySpark transformer, since the model is already pre-trained and does not need to be fitted.

3.2.1 Elmo parameters

The most relevant parameters are the following:

  • setInputCols: This defines which columns of the Spark DataFrame the ELMo component should read to generate its embeddings. It requires a sentence and a token column.

3.2.2 Elmo's Pooling layers

Here is a quick overview and explanation of the available pooling layers for ELMo:

  • word_emb: the character-based word representations with shape [batch_size, max_length, 512]

Refer to the paper for more specific information about the pooling layers.
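Besides the individual layers, ELMo also offers a combined output (the "elmo" pooling layer) that, as described in the ELMo paper, is a learned weighted sum of the model's layers. The sketch below illustrates that mixing step in plain Python, with toy 4-dimensional vectors instead of real embeddings. The function names and the numbers are assumptions for illustration only, not Spark-NLP code.

```python
import math
from typing import List


def softmax(weights: List[float]) -> List[float]:
    """Normalize raw weights so they are positive and sum to 1."""
    peak = max(weights)
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_mix(layers: List[List[float]], weights: List[float],
               gamma: float = 1.0) -> List[float]:
    """Weighted sum of per-layer vectors for one token, scaled by gamma."""
    normalized = softmax(weights)
    dim = len(layers[0])
    return [
        gamma * sum(s * layer[d] for s, layer in zip(normalized, layers))
        for d in range(dim)
    ]


# Toy vectors for a single token: one per layer (hypothetical values)
token_layers = [
    [1.0, 0.0, 0.0, 0.0],  # stands in for the word embedding layer
    [0.0, 1.0, 0.0, 0.0],  # stands in for the first LSTM layer
    [0.0, 0.0, 1.0, 0.0],  # stands in for the second LSTM layer
]

# Equal raw weights give every layer the same share
print(scalar_mix(token_layers, weights=[0.0, 0.0, 0.0]))
```

In the real model the weights and gamma are trainable, so a downstream task can decide how much each layer should contribute.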

Now let's build our ELMo pipeline component!

# Build the ELMo pipeline component
elmo = ElmoEmbeddings.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("elmo")

4. Define the NER char CNNs - BiLSTM - CRF pipeline component

The NER model will leverage the features generated by ELMo to learn the NER classes.

4.1 NER DL parameters

We can tweak the model a lot, but standard configurations usually work very well.

  • setLr: Initial learning rate.

Creating a NerDLApproach() object gives us a PySpark estimator. It needs to be fitted on labeled data before it can turn tokens and their embeddings into NER predictions.

# Build the NER pipeline component
nerTagger = NerDLApproach() \
    .setInputCols(["sentence", "token", "elmo"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(1)

5. Put it all together in one Pipeline

We now just need to put every pipeline object into a list and pass it to the PySpark Pipeline constructor:

# Combine the pipeline components and create the full pipeline
pipeline = Pipeline(
    stages=[
        elmo,
        nerTagger
    ])
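The pipeline above mixes a transformer (the pre-trained ELMo model) with an estimator (the untrained NER tagger). The following toy sketch shows, conceptually, how a pipeline can handle both kinds of stages during fitting. All class names here are hypothetical stand-ins written in plain Python. This is not PySpark's actual implementation, only the general contract as I understand it.

```python
class UppercaseTransformer:
    """Already 'trained': can transform right away, like a pretrained annotator."""

    def transform(self, rows):
        return [dict(row, upper=row["text"].upper()) for row in rows]


class MajorityLabelModel:
    """The fitted result of the estimator below; it only knows how to transform."""

    def __init__(self, label):
        self.label = label

    def transform(self, rows):
        return [dict(row, prediction=self.label) for row in rows]


class MajorityLabelEstimator:
    """Needs fit() first; fit() returns a model that can transform."""

    def fit(self, rows):
        labels = [row["label"] for row in rows]
        majority = max(set(labels), key=labels.count)
        return MajorityLabelModel(majority)


class ToyPipelineModel:
    """A fitted pipeline: every stage is a transformer, applied in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    """Fits estimators in order, feeding each stage the output of the previous one."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # estimator becomes a fitted model
            rows = stage.transform(rows)  # later stages see this stage's output
            fitted.append(stage)
        return ToyPipelineModel(fitted)


training_rows = [
    {"text": "a", "label": "O"},
    {"text": "b", "label": "O"},
    {"text": "c", "label": "PER"},
]
toy_model = ToyPipeline([UppercaseTransformer(), MajorityLabelEstimator()]).fit(training_rows)
print(toy_model.transform([{"text": "hi"}]))
```

The takeaway is that fitting a pipeline yields a new object made only of transformers, which is why the fitted pipeline can be applied to unlabeled text afterwards.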

5.1 Run the pipeline

Now we are finally ready to let the Spark cluster crunch some numbers and produce some NLP juice for us!
The fit call on a pipeline returns a trained NER model, which we can use to annotate raw data. We limit the dataset to 100 rows to save some time; in a real-world scenario you would train on the full dataset.

fitted_pipe = pipeline.fit(training_data.limit(100))
ner_df = fitted_pipe.transform(training_data.limit(100))
ner_df.show()

5.2 Check out the results

Let's grab the first row and look at the original text and the result of the NER pipeline:

ner_df.select(*['text', 'ner']).limit(1).show(truncate=False)
Named entities all recognized and ready to go!

The complete NLP pipeline code

# Spark imports
from pyspark.ml import Pipeline
from pyspark.sql.types import StringType
# Spark-NLP imports
import sparknlp
from sparknlp.annotator import *
from sparknlp.base import *

# Start (or reuse) a Spark session with Spark-NLP loaded
spark = sparknlp.start()

# Create a DataFrame with raw sample data
dfTest = spark.createDataFrame([
    "Spark-NLP would you be so nice and cook up some state of the art embeddings for me?",
    "Tensorflow is cool but can be tricky to get Running. With Spark-NLP, you save yourself a lot of trouble.",
    "I save so much time using Spark-NLP and it is so easy!"
], StringType()).toDF("text")

# Basic Spark NLP pipeline
# Main entry point for most Spark-NLP functionality: turns raw text into documents
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Detects sentences in the document column
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Tokenizes the detected sentences with a standard tokenizer
tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Load the pretrained ELMo model from the Spark-NLP repo
elmo = ElmoEmbeddings.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("elmo") \
    .setPoolingLayer("elmo")

# Put every piece of the pipeline together
nlpPipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    elmo,
])

# Fit our model and transform
nlp_model = nlpPipeline.fit(dfTest)
processed = nlp_model.transform(dfTest)
processed.show()
That was easy

Next Steps

If this whetted your appetite for more NLP and Spark in either Python or Scala, you should head to these Spark-NLP tutorials next

References

Article resources

What did we learn in this tutorial?

  • What is NER?
