This tutorial consists of the following steps:
- Download the CoNLL dataset into your Spark environment
- Read CoNLL dataset into Spark dataframe using Spark-NLP’s API
- Define ELMo pipeline
- Define the NER model on top of ELMo (Char CNNs - BiLSTM - CRF)
- Run the pipeline and check out the results
This article covers how to train a custom NER model in Python with the lightning-fast Spark-NLP library, leveraging recent embeddings and techniques from the world of deep learning!
We will use a prior SOTA model based on "Named Entity Recognition with Bidirectional LSTM-CNNs" by Chiu & Nichols, a neural network architecture that automatically detects word- and character-level features using a hybrid of a bidirectional LSTM and a CNN. To this NER model, we will feed the embeddings generated by ELMo, which also runs a bidirectional LSTM.
What is named entity recognition (NER)?
Named-entity recognition (NER), also known as entity identification, entity chunking or entity extraction, is a sub-task of information extraction that aims to locate named entities in text and classify them into pre-defined categories such as the names of persons, organizations and locations, expressions of time, quantities, monetary values, percentages, etc. Word embeddings are extremely important for NER and many other NLP tasks.
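To make the task concrete, here is a minimal plain-Python sketch of how token-level tags map to named entities. It assumes the IOB tagging scheme used by CoNLL-style datasets; the helper `extract_entities` and the sample sentence are made up for illustration and are not part of Spark NLP.

```python
from typing import List, Tuple

def extract_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (entity text, entity type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        # A tag looks like "B-PER", "I-ORG" or "O"
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and entity_type == current_type:
            # Same entity continues
            current_tokens.append(token)
            continue
        # Any other tag closes the entity that was open, if there was one
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
        if prefix in ("B", "I"):
            current_tokens, current_type = [token], entity_type
        else:
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities

# Hypothetical sample sentence with hand-written tags
tokens = ["Angela", "Merkel", "visited", "Paris", "in", "2019", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "O", "O"]
print(extract_entities(tokens, tags))
```

An NER model's job is to predict the `tags` list; grouping the tags into entities, as done here, is just bookkeeping.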
How does ELMo differ from past approaches?
ELMo, created by AllenNLP, broke the state of the art (SOTA) in many NLP tasks. Together with ULMFiT and the OpenAI transformer, ELMo brought about NLP’s breakthrough "ImageNet moment". These embedding techniques were a great step forward, delivering better results than older methods like word2vec or GloVe.
How does it differ from newer models like BERT, XLNET or ALBERT?
In contrast to BERT and ALBERT, which are trained by masking random words in a sentence and predicting them (XLNET uses a related permutation-based objective), ELMo is trained to predict the next word in a sequence, once reading left to right and once right to left. ELMo relies on bidirectional LSTMs under the hood and is not transformer-based like BERT, XLNET, ALBERT, and USE. In case you want to try them out, they are all available in Spark NLP as annotators! We will cover them in upcoming tutorials.
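The difference between the two training objectives can be sketched in a few lines of plain Python. This is a simplified illustration, not the actual training code of any of these models: the function names are hypothetical, real models work on large corpora and (for BERT) subword tokens, and masked positions are chosen at random during training rather than passed in by hand.

```python
from typing import Dict, List, Set, Tuple

def forward_lm_examples(tokens: List[str]) -> List[Tuple[List[str], str]]:
    """Next-word prediction: (left context, word to predict)."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def backward_lm_examples(tokens: List[str]) -> List[Tuple[List[str], str]]:
    """The same task read right to left: (right context, word to predict)."""
    return [(tokens[i + 1:], tokens[i]) for i in range(len(tokens) - 1)]

def masked_lm_example(tokens: List[str], positions: Set[int],
                      mask: str = "[MASK]") -> Tuple[List[str], Dict[int, str]]:
    """Masked-word prediction: hide some words, predict them from both sides."""
    masked = [mask if i in positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}
    return masked, targets

sentence = ["Spark", "NLP", "is", "fast"]
print(forward_lm_examples(sentence))
print(backward_lm_examples(sentence))
print(masked_lm_example(sentence, {2}))
```

ELMo trains on both the forward and the backward task and combines the two LSTMs' hidden states, whereas a masked model sees the left and right context of a hidden word at the same time.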
Let's get our hands dirty with some coding!
I have not yet set up Spark NLP, what should I do?
No problem, there are a lot of tutorials for quickly setting up Spark NLP in many different environments! The fastest way is to either import the Databricks notebook or just click the Colab link, which opens a Google Colab notebook with this code ready to execute for you!
1. Download the CoNLL dataset into the Databricks File System
We will download the CoNLL 2003 dataset, a standard NER benchmark used in most papers. We can download it straight to a Databricks cluster or Colab notebook with this code snippet:
# Download CoNLL 2003 Dataset
from pathlib import Path
import urllib.request

download_path = "./eng.train"
if not Path(download_path).is_file():
    print("File not found, downloading it!")
    url = "https://github.com/patverga/torch-ner-nlp-from-scratch/raw/master/data/conll2003/eng.train"
    urllib.request.urlretrieve(url, download_path)
else:
    print("File already present.")
2. Read the CoNLL dataset into a Spark dataframe
The readDataset() method of the CoNLL class handily parses the CoNLL format and creates a Spark dataframe. It takes care of parsing each sentence into tokens, reading the part-of-speech labels, indexing multi-sentence examples, and embedding the dataset in the internal Spark NLP document structure.
# Load the CoNLL dataset into Spark (same path as download_path above)
from sparknlp.training import CoNLL
training_data = CoNLL().readDataset(spark, "./eng.train")
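For intuition about what is being parsed: each non-empty line of a CoNLL 2003 file holds one token followed by its part-of-speech tag, chunk tag and NER tag, blank lines separate sentences, and `-DOCSTART-` lines mark document boundaries. The plain-Python sketch below illustrates that format on a tiny hand-written sample. It is not how Spark NLP implements `readDataset`, and `parse_conll` is a hypothetical helper.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str, str]]  # (token, POS tag, NER tag)

def parse_conll(text: str) -> List[Sentence]:
    """Split CoNLL 2003 formatted text into sentences of (token, POS, NER)."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and document markers end the current sentence
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        token, pos, _chunk, ner = line.split()
        current.append((token, pos, ner))
    if current:
        sentences.append(current)
    return sentences

# Hand-written sample in the four-column CoNLL 2003 layout
sample = """-DOCSTART- -X- O O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
"""

for sentence in parse_conll(sample):
    print(sentence)
```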
3. Define the NER ELMO pipeline
Since Spark-NLP does all of the heavy TensorFlow lifting for us, all that remains is defining each step of our pipeline with the handy components of the Spark-NLP library. This pipeline consists of just 2 pieces: the pre-trained ELMo model and the Char CNNs - BiLSTM - CRF model. Usually, the columns document, sentence, token, pos and label must be manually extracted from the dataset, but we already took care of that in the last step!
3.1 Define the imports
# Spark ML imports
from pyspark.ml import Pipeline

# Spark-NLP imports
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *
3.2 Define the ELMO pipeline component
3.2.1 ELMo parameters
The most relevant parameters are the following:
- setInputCols: Defines which columns of the Spark dataframe ELMo reads to generate its embeddings. It requires a sentence and a token column.
- setOutputCol: Defines the name of the column that holds the word embeddings in the final Spark dataframe.
- setBatchSize: How many sentences to embed at once. Larger batches can make the computation faster but cost more memory.
- setPoolingLayer: ELMo comes with multiple possible output layers: word_emb, lstm_outputs1, lstm_outputs2, or elmo.
- You can find a full parameter list here
3.2.2 ELMo's pooling layers
Here is a quick overview and explanation of the possible pooling layers for ELMo:
- word_emb: the character-based word representations with shape [batch_size, max_length, 512]
- lstm_outputs1: the first LSTM hidden state with shape [batch_size, max_length, 1024]
- lstm_outputs2: the second LSTM hidden state with shape [batch_size, max_length, 1024]
- elmo: the weighted sum of the 3 layers, where the weights are trainable. This tensor has shape [batch_size, max_length, 1024]
Refer to the paper for more specific info about the pooling layers.
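As a rough illustration of the elmo layer, the ELMo paper describes the combined vector as a scalar gamma times the sum of the three layers, each weighted by a softmax-normalized trainable weight. The sketch below applies that formula to a single token. Toy 4-dimensional vectors stand in for the real ones, the weights are picked by hand rather than trained, and `elmo_mix` is a hypothetical name, not a Spark NLP function.

```python
import math
from typing import List

def softmax(values: List[float]) -> List[float]:
    """Normalize raw weights so they are positive and sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def elmo_mix(layers: List[List[float]], raw_weights: List[float],
             gamma: float = 1.0) -> List[float]:
    """Weighted sum of layer vectors for one token: gamma * sum_j s_j * h_j."""
    s = softmax(raw_weights)
    dim = len(layers[0])
    return [gamma * sum(w * layer[d] for w, layer in zip(s, layers))
            for d in range(dim)]

# Toy stand-ins for one token's three layer outputs
word_emb = [1.0, 2.0, 3.0, 4.0]
lstm_outputs1 = [4.0, 5.0, 6.0, 7.0]
lstm_outputs2 = [7.0, 8.0, 9.0, 10.0]

# Equal raw weights give each layer the same share of the mix
mixed = elmo_mix([word_emb, lstm_outputs1, lstm_outputs2], raw_weights=[0.0, 0.0, 0.0])
print(mixed)
```

Choosing word_emb, lstm_outputs1 or lstm_outputs2 as the pooling layer corresponds to taking one of the three inputs on its own instead of the mix.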
Now let's build our ELMo pipeline component!
# Build the ELMo pipe component
elmo = ElmoEmbeddings.pretrained()\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("elmo")
4. Define the NER Char CNNs - BiLSTM - CRF pipeline component
The NER model will leverage the features generated by ELMo to learn the NER classes.
4.1 NER DL parameters
We can tweak the model a lot but standard configurations usually work very well.
- setLr: Initial learning rate.
- setPo: Learning rate decay coefficient. Real Learning Rate: lr / (1 + po * epoch).
- setDropout: Dropout coefficient.
- Refer to this page for additional parameter info
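The learning rate formula from the setPo description can be worked through in plain Python. The values for lr and po below are illustrative choices for the sake of the example, and `decayed_learning_rate` is a hypothetical helper that only mirrors the formula; it is not part of Spark NLP.

```python
def decayed_learning_rate(lr: float, po: float, epoch: int) -> float:
    """Real learning rate at a given epoch: lr / (1 + po * epoch)."""
    return lr / (1 + po * epoch)

# Illustrative values; epoch 0 uses the initial learning rate unchanged
for epoch in range(5):
    rate = decayed_learning_rate(lr=0.001, po=0.005, epoch=epoch)
    print(f"epoch {epoch}: {rate:.7f}")
```

A larger po makes the learning rate shrink faster, while po = 0 keeps it constant across epochs.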
Creating a NerDLApproach() object from the Spark-NLP library gives us a PySpark estimator. It needs to be fitted before the pipeline can transform text into embeddings and, finally, NER predictions.
# Build the NER pipe component
# The output column name "ner" is an assumed choice
nerTagger = NerDLApproach()\
    .setInputCols(["sentence", "token", "elmo"])\
    .setLabelColumn("label")\
    .setOutputCol("ner")
5. Put it all together in one Pipeline
We now just need to put every pipeline object into a list and pass it to the PySpark Pipeline constructor:
# Combine pipeline components together and create the full pipeline
pipeline = Pipeline(
    stages = [
        elmo,
        nerTagger
    ])
5.1 Run the pipeline
Now we are finally ready to let the Spark cluster crunch some numbers and produce some NLP juice for us!
The fit call on a pipeline returns a trained NER model, which we can use to annotate data. We are limiting the dataset to 100 rows to save some time; in a real-world scenario you would not want to do that.
fitted_pipe = pipeline.fit(training_data.limit(100))
ner_df = fitted_pipe.transform(training_data.limit(100))
5.2 Check out the results
Let's grab the first row and look at the original text and the result of the NER pipeline
The complete NLP pipeline code
import sparknlp
from pyspark.ml import Pipeline
from pyspark.sql.types import StringType
from sparknlp.pretrained import PretrainedPipeline
from sparknlp.annotator import *
from sparknlp.base import *

spark = sparknlp.start()

# Create a dataframe with raw sample data
dfTest = spark.createDataFrame([
    "Spark-NLP would you be so nice and cook up some state of the art embeddings for me?",
    "Tensorflow is cool but can be tricky to get Running. With Spark-NLP, you save yourself a lot of trouble.",
    "I save so much time using Spark-NLP and it is so easy!"
], StringType()).toDF("text")

# Basic Spark NLP pipeline
# Main entry point for most Spark-NLP functionality
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Detects sentences in the document column
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Tokenizes the input with a standard tokenizer
tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Load the pretrained ELMo model from the Spark-NLP repo
elmo = ElmoEmbeddings.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("elmo") \
    .setPoolingLayer("elmo")

# Put every piece of the pipeline together
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, elmo])

# Fit our model and transform
nlp_model = nlpPipeline.fit(dfTest)
processed = nlp_model.transform(dfTest)
If this whetted your appetite for more NLP and Spark in either Python or Scala, you should head to these Spark-NLP tutorials next:
- Introduction to Spark NLP: Foundations and Basic Components (Part-I)
- Introduction to: Spark NLP: Installation and Getting Started (Part-II)
- Spark NLP 101: Document Assembler
- Spark NLP 101: LightPipeline
- Spark NLP workshop repository
- Named Entity Recognition (NER) with BERT in Spark NLP
- Spark NLP YouTube Channel
What did we learn in this tutorial?
- What is NER?
- How does ELMO differ from past approaches?
- How does ELMO differ from newer approaches?
- How to build a scalable NER model with ELMO?
- How to achieve state of the art NER results quickly and easily?
- How to build an NER DL pipeline with Spark NLP and ELMO embeddings?
- How to parse or read a CoNLL dataset into Spark with Python?