Step-by-Step Tutorial: Conditional Random Fields for Named Entity Recognition in Python

Using Spark NLP capabilities to train and use CRF models for NER at scale.

Gursev Pirge
John Snow Labs
9 min read · Apr 25, 2023


Photo by Glen Carrie on Unsplash

TL;DR: Named Entity Recognition with Conditional Random Fields (NER CRF) is a machine learning approach available in Spark NLP for identifying and extracting named entities from unstructured text data. Spark NLP provides pretrained NER CRF models, and users can also train their own custom NER models with the CRF algorithm.

Conditional Random Fields (CRFs) are a class of probabilistic graphical models commonly used in machine learning and natural language processing (NLP) applications. In NLP, CRFs are used for sequence labeling tasks, which involve assigning labels to each element in a sequence of observations, such as assigning part-of-speech tags to words in a sentence or recognizing named entities (such as people, organizations, and locations) in a text.

CRFs use the observed data to predict the labels of the sequence, while taking into account the dependencies between neighboring labels. This makes them particularly effective for tasks where the labels of neighboring elements are dependent on each other, such as in NLP.

A NER CRF model is trained on a labeled dataset that includes examples of text with corresponding named entity labels. During training, the model learns to identify patterns and features in the input text that are associated with named entities, such as the presence of specific words or phrases, syntactic structures, or contextual information.
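
As a concrete illustration (a hand-made example, not taken from the dataset used later in this post), labeled training data for CRF-based NER is typically provided in the CoNLL format, one token per line with its part-of-speech tag, chunk tag and named-entity label in the IOB scheme:

John NNP B-NP B-PER
Smith NNP I-NP I-PER
works VBZ B-VP O
at IN B-PP O
Google NNP B-NP B-ORG
. . O O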

Once the model is trained, it can be used to predict named entities by assigning labels to each token based on the learned patterns and features. The predictions are made using a probabilistic framework that takes into account the dependencies between adjacent tokens in the sequence.
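
Concretely, a linear-chain CRF (this is the standard textbook formulation, not anything specific to Spark NLP) models the conditional probability of a label sequence y = (y_1, ..., y_n) given an input sequence x as:

P(y | x) = (1 / Z(x)) * exp( Σ_t Σ_k λ_k · f_k(y_{t-1}, y_t, x, t) )

where each f_k is a feature function over the current and previous labels and the observed input, λ_k is its learned weight, and Z(x) is a normalization term that sums over all possible label sequences. Training estimates the weights λ_k; prediction finds the most likely label sequence, typically with the Viterbi algorithm.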

The figure below shows the visualization of the named entities recognized from a sample text. The entities are extracted, labeled (as PERSON, DATE, ORG, etc.) and displayed on the original text. Please check the post named “Visualizing Named Entities with Spark NLP”, which gives details about the Ner Visualizer.

Extracted named entities, displayed by the Ner Visualizer.

Overall, NER CRF models are a powerful tool for automated named entity recognition. The NER CRF algorithm in Spark NLP is highly customizable and can be trained on a wide range of datasets and domains.

Just remember that there are many alternatives in Spark NLP for named entity recognition. The most accurate, but also most complex, are deep learning-based models, which have shown state-of-the-art performance on NER tasks.

There are also rule-based methods, which use a set of hand-crafted rules based on patterns, heuristics, and dictionaries to identify named entities in text. Users can also create their own custom rules and dictionaries to improve the performance of the NER system.

Finally, it is possible to use multiple models, like a hybrid system, to increase the accuracy of the model.

In this post, you will learn how to use Spark NLP to perform named entity recognition with CRF, both by using pretrained models and by training a custom model.

Let us start with a short Spark NLP introduction and then discuss the details of NER by CRF with some solid results.

Introduction to Spark NLP

Spark NLP is an open-source library maintained by John Snow Labs. It is built on top of Apache Spark and Spark ML and provides simple, performant & accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment.

Since its first release in July 2017, Spark NLP has grown into a full NLP tool, providing:

  • A single unified solution for all your NLP needs
  • Transfer learning and implementing the latest and greatest SOTA algorithms and models in NLP research
  • The most widely used NLP library in industry (5 years in a row)
  • The most scalable, accurate and fastest library in NLP history

Spark NLP comes with 17,800+ pretrained pipelines and models in more than 250 languages. It supports most NLP tasks and provides modules that can be used seamlessly in a cluster.

Spark NLP processes the data using pipelines, structures that contain all the steps to be run on the input data:

Spark NLP pipelines

Each step contains an annotator that performs a specific task such as tokenization, normalization, and dependency parsing. Each annotator takes one or more annotations as input and outputs a new annotation.

An annotator in Spark NLP is a component that performs a specific NLP task on a text document and adds annotations to it. An annotator takes an input text document and produces an output document with additional metadata, which can be used for further processing or analysis. For example, a named entity recognizer annotator might identify and tag entities such as people, organizations, and locations in a text document, while a sentiment analysis annotator might classify the sentiment of the text as positive, negative, or neutral.

Setup

To install Spark NLP in Python, simply use your favorite package manager (conda, pip, etc.). For example:

pip install spark-nlp
pip install pyspark

For other installation options for different environments and machines, please check the official documentation.

Then, simply import the library and start a Spark session:

import sparknlp

# Start Spark Session
spark = sparknlp.start()
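
To confirm that everything is in place, you can optionally print the library versions:

# Optional: check the Spark NLP and Apache Spark versions
print(sparknlp.version())
print(spark.version)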

NerCrfModel

NerCrfModel is an annotator in Spark NLP that extracts named entities based on a pretrained CRF model. In NLP, a pretrained model is a model that has been trained on a large amount of data for a specific task, in this case, for named entity recognition. Pretrained models are typically trained on massive datasets using powerful hardware and advanced algorithms.

The NerCrfModel annotator expects DOCUMENT, WORD_EMBEDDINGS, POS and TOKEN as input, and then will provide NAMED_ENTITY as output. Thus, we need the previous steps to generate those annotations that will be used as input to our annotator.

To understand the concept better, we will use the following model from the John Snow Labs Models Hub: Conditional Random Field Based Named Entity Recognizer, which automatically extracts the following entities using glove_100d embeddings:

ORG, PER, LOC, MISC

We will use train and test datasets from the John Snow Labs GitHub repository, so let us first download them:

!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/NER_NCBIconlltrain.txt
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/NER_NCBIconlltest.txt

Now, import the training dataset as a CoNLL file:

from sparknlp.training import CoNLL

trainingData = CoNLL().readDataset(spark, 'NER_NCBIconlltrain.txt')
trainingData.show(3)
First three rows of the training dataset.

And now the test dataset:

testData = CoNLL().readDataset(spark, 'NER_NCBIconlltest.txt')
testData.show(3)
First three rows of the test dataset.
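
Before building the pipeline, it can be useful to glance at the tag distribution in the training data. A minimal sketch, assuming the CoNLL reader produced the standard label column:

import pyspark.sql.functions as F

# Count how often each NER tag occurs in the training data (optional sanity check)
trainingData.select(F.explode(F.col("label.result")).alias("ner_tag")) \
    .groupBy("ner_tag") \
    .count() \
    .orderBy(F.desc("count")) \
    .show()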

Spark NLP uses the pipeline approach; the pipeline below includes the necessary stages to extract the entities from the text:

# Import the required modules and classes
from sparknlp.base import DocumentAssembler, Pipeline
from sparknlp.annotator import (
    Tokenizer,
    PerceptronModel,
    WordEmbeddingsModel,
    NerCrfModel,
    NerConverter
)

import pyspark.sql.functions as F

# Step 1: Transforms raw texts to `document` annotation
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

# Step 2: Tokenization
tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

# Step 3: Perceptron model to tag words' part-of-speech
posTagger = PerceptronModel.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("pos")

# Step 4: Glove100d Embeddings
embeddings = WordEmbeddingsModel.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("embeddings")

# Step 5: Entity Extraction
ner_model = NerCrfModel.pretrained() \
    .setInputCols(['document', 'token', 'pos', 'embeddings']) \
    .setOutputCol('ner')

# Step 6: Converts an IOB representation of NER to a user-friendly one
ner_converter = NerConverter() \
    .setInputCols(['document', 'token', 'ner']) \
    .setOutputCol('entities')

# Define the pipeline
pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    posTagger,
    embeddings,
    ner_model,
    ner_converter
])

# Fit the pipeline and transform the dataframe
model = pipeline.fit(trainingData)
result = model.transform(trainingData)

This model was trained using the ‘glove_100d’ embeddings, so we have to use the same embeddings when running it.
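
If you want to test the fitted pipeline on ad-hoc strings without creating a Spark dataframe, you can wrap it in a LightPipeline. A minimal sketch; the sample sentence is just an illustration:

from sparknlp.base import LightPipeline

# LightPipeline runs the fitted pipeline on plain Python strings, without Spark dataframes
light_model = LightPipeline(model)
light_result = light_model.annotate("Google announced a new office in Paris on Monday.")
print(light_result['entities'])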

Now, we will explode the results to get a nice dataframe of the entities. Here, chunks with no associated entity (tagged “O”) are filtered out.

result.select(F.explode(F.arrays_zip(result.entities.result,
                                     result.entities.metadata)).alias("cols")) \
    .select(F.expr("cols['0']").alias("chunk"),
            F.expr("cols['1']['entity']").alias("ner_label")) \
    .show(15, truncate=False)
Extracted chunks and the predicted labels.

As you can see, adding the NerConverter as the last stage helped us display only the chunks (combinations of valuable extracted entities), not all the extracted tokens. This stage served as a filtering step; additionally, tokens labeled with B- and I- prefixes are joined to form the full chunk.
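
If you are only interested in some of the entity types, NerConverter can also filter them with a white list. A small sketch, assuming we only want to keep persons and locations:

# Optional: keep only selected entity types (the labels shown are just an example)
ner_converter_filtered = NerConverter() \
    .setInputCols(['document', 'token', 'ner']) \
    .setOutputCol('entities') \
    .setWhiteList(['PER', 'LOC'])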

One-liner alternative

In October 2022, John Snow Labs released the open-source johnsnowlabs library, which gathers all the company's products, open-source and licensed, under one common library. This simplifies the workflow, especially for users working with more than one of the libraries (e.g., Spark NLP + Healthcare NLP). This new library is a wrapper around all John Snow Labs' libraries, and can be installed with pip:

pip install johnsnowlabs

Please check the official documentation for more examples and usage of this library. To run entity extraction by CRF with one line of code, we can simply:

# Import the NLP module which contains Spark NLP and NLU libraries
from johnsnowlabs import nlp

# Returns a pandas DataFrame with the predictions
nlp.load('ner.crf').predict("Donald Trump and Angela Merkel dont share many oppinions")
After using the one-liner model, the result shows the NERs and their labels.

The one-liner is based on default models for each NLP task. Depending on your requirements, you may want to use the one-liner for simplicity or customize the pipeline to choose specific models that fit your needs.

NOTE: When using the johnsnowlabs library, make sure you initialize the Spark session with the configuration you have available. Since some of the libraries are licensed, you may need to set the path to your license file. If you are only using the open-source libraries, you can start the session with spark = nlp.start(nlp=False). The default parameters of the start function include using the licensed Healthcare NLP library (nlp=True), but we can set it to False and use all the resources of the open-source libraries such as Spark NLP, Spark NLP Display, and NLU.
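
For example, to start an open-source-only session (as noted above, this skips the licensed libraries):

from johnsnowlabs import nlp

# Start a Spark session using only the open-source John Snow Labs libraries
spark = nlp.start(nlp=False)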

NerCrfApproach

To show the capabilities of the NerCrfApproach annotator for model training, let us train a model on the training dataset and then use this trained model to get predictions on the test dataset.

The pipeline below is quite similar to the one that we used for the NerCrfModel annotator, except for a few stages:

# Import the required modules and classes
from sparknlp.base import DocumentAssembler, Pipeline
from sparknlp.annotator import (
    SentenceDetector,
    Tokenizer,
    PerceptronModel,
    Word2VecModel,
    NerCrfApproach
)

import pyspark.sql.functions as F

# Step 1: Transforms raw texts to `document` annotation
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

# Step 2: Getting the sentences
sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Step 3: Tokenization
tokenizer = Tokenizer() \
    .setInputCols(['sentence']) \
    .setOutputCol('token')

# Step 4: Perceptron model to tag words' part-of-speech
posTagger = PerceptronModel.pretrained() \
    .setInputCols(["token", "sentence"]) \
    .setOutputCol("pos")

# Step 5: Word2Vec Embeddings
embeddings = Word2VecModel.pretrained() \
    .setInputCols(["token"]) \
    .setOutputCol("embeddings")

# Step 6: Model training
nerTagger = NerCrfApproach() \
    .setInputCols(["sentence", "token", "pos", "embeddings"]) \
    .setLabelColumn("label") \
    .setMinEpochs(1) \
    .setMaxEpochs(3) \
    .setOutputCol("ner")

# Define the pipeline
pipeline = Pipeline(stages=[
    document_assembler,
    sentence,
    tokenizer,
    posTagger,
    embeddings,
    nerTagger
])

In this pipeline, instead of the WordEmbeddingsModel annotator, we used Word2VecModel to generate the embeddings during model training, and got very satisfactory results.
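
NerCrfApproach also exposes several CRF-specific hyperparameters. The sketch below is illustrative only (the values roughly mirror the documented defaults rather than tuned settings); please verify the parameters against the current Spark NLP documentation:

# Illustrative NerCrfApproach configuration with additional hyperparameters
nerTagger_tuned = NerCrfApproach() \
    .setInputCols(["sentence", "token", "pos", "embeddings"]) \
    .setLabelColumn("label") \
    .setMaxEpochs(10) \
    .setL2(1.0) \
    .setC0(2250000) \
    .setLossEps(1e-3) \
    .setOutputCol("ner")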

We will use the training dataset for model training and then use the test dataset to get predictions:

# Fit the training dataset to the pipeline
pipelineModel = pipeline.fit(trainingData)

# Get the predictions by transforming the test dataset
predictions = pipelineModel.transform(testData)
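
Once training has finished, you may want to persist the fitted pipeline so it can be reused without retraining. A minimal sketch using standard Spark ML persistence; the path is just an example:

from pyspark.ml import PipelineModel

# Save the fitted pipeline to disk and reload it later
pipelineModel.write().overwrite().save("ner_crf_pipeline_model")
loaded_model = PipelineModel.load("ner_crf_pipeline_model")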

Now, we will explode the results to get a nice dataframe of the tokens, ground truths and the labels predicted by the model we just trained.

predictions.select(F.explode(F.arrays_zip(predictions.token.result,
                                          predictions.label.result,
                                          predictions.ner.result)).alias("cols")) \
    .select(F.expr("cols['0']").alias("token"),
            F.expr("cols['1']").alias("ground_truth"),
            F.expr("cols['2']").alias("prediction")) \
    .show(25, truncate=False)
Tokens, ground truths and the labels predicted by the model.

During training we used only three epochs (an epoch is one full pass of the model over the training examples); normally this number would be higher. Still, the results are satisfactory, with only one mistake in the table shown above.
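
As a quick, rough check of the trained model, you can compute the token-level agreement between the ground truth and the predictions (this is only a sanity check; a proper NER evaluation would use entity-level precision, recall and F1):

# Rough token-level accuracy: fraction of tokens whose predicted tag matches the ground truth
predictions.select(
        F.explode(F.arrays_zip(predictions.label.result,
                               predictions.ner.result)).alias("cols")) \
    .select((F.expr("cols['0']") == F.expr("cols['1']")).cast("int").alias("correct")) \
    .agg(F.avg("correct").alias("token_accuracy")) \
    .show()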


Conclusion

NER is a critical task in NLP that involves identifying and extracting entities from text data. CRF models are a popular approach for NER in NLP, as they can effectively model the dependencies between adjacent tokens in a sequence while making predictions.

Overall, the NER CRF algorithm in Spark NLP is a powerful tool for named entity recognition and extraction in NLP. It allows users to automate the identification and extraction of named entities from unstructured text data, saving time and effort for data scientists and developers.

By combining the NER CRF algorithm with other tools and components in Spark NLP, it is possible to create powerful and flexible NLP pipelines that can be adapted to a wide range of use cases and domains.
