Training and Deploying a Custom spaCy NER Model on Amazon SageMaker by Integrating Pre-trained Transformer-based Models from Hugging Face

Francesco Ladogana
Data Reply IT | DataTech
15 min read · Dec 5, 2022

Named Entity Recognition (NER) is the task of detecting named entities in text. Examples of entities are as follows:

- Person

- Location

- Organization

- Date

- Etc.

Entities can be single words or whole phrases.

This article explains how to train a model for a custom NER task on an Amazon SageMaker notebook instance, using the spaCy library and integrating it with a pre-trained Transformer-based Deep Learning model from Hugging Face. After running the training phase and obtaining the best weights, we will show how to build a SageMaker Endpoint to put the spaCy NER model into production. It is sufficient to set the computational resource requirements for Amazon SageMaker hosting of the model and to implement the inference script that extracts output entities from an input text with the spaCy NER model. An HTTPS endpoint is then created to achieve low-latency, high-throughput inference. Finally, we will show how to invoke the endpoint from an AWS Lambda function to get the entities from an input text.

Data Labeling for NER, Data Format used in spaCy 3 and Data Labeling Tools

Typically a NER task is reformulated as a Supervised Learning task. Suppose we have N texts in our Dataset and C entities that we would like to recognize. Each text in the Dataset can be seen as a sequence of L tokens, so the easiest way to label a text is to identify which tokens represent an entity of interest and label them with the entity they represent. Below is an example of labeling where C = {Organization, Date, Smartphone}:

In practice, there are several formats for representing a Dataset for named entity recognition. The most widely used are the IOB Scheme and the BILUO Scheme, where tokens that are not entities are tagged with the O tag, and entity tokens are tagged with their semantic category preceded by a prefix: B (beginning) and I (inside) in the IOB Scheme; B (beginning), I (inside), L (last token of a multi-token entity), and U (unit, i.e. a single-token entity) in the BILUO Scheme.

Example of labeling with IOB Scheme:

Example of labeling with BILUO Scheme:

Through the IOB Scheme and the BILUO Scheme, more detailed information is encoded in the text, and this improves the learning capabilities of the Machine Learning algorithm being used. The BILUO Scheme is more informative than the IOB Scheme, so conceptually it should enable better performance. In this paper it has been shown that the BILUO Scheme achieves higher performance than the IOB Scheme.
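To make the two schemes concrete, here is a minimal sketch (not part of the original article's code) that uses spaCy v3's training helpers to tag the example sentence; the character offsets are written by hand for the illustration.

import spacy
from spacy.training import offsets_to_biluo_tags, biluo_to_iob

nlp = spacy.blank("en")  # blank English pipeline: tokenizer only
doc = nlp("Apple today announced the iPhone 14 Pro Max")

# Character-offset annotations: (start, end, label)
entities = [(0, 5, "Organization"), (6, 11, "Date"), (26, 43, "Smartphone")]

biluo_tags = offsets_to_biluo_tags(doc, entities)  # BILUO Scheme
iob_tags = biluo_to_iob(biluo_tags)                # IOB Scheme

for token, biluo, iob in zip(doc, biluo_tags, iob_tags):
    print(f"{token.text}\t{biluo}\t{iob}")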

The previous data format in spaCy was the JSON format. In spaCy v3 the format used is a binary format, which is extremely storage-efficient. This format is created by serializing a DocBin, which is a collection of annotated Doc objects, saved in binary files with the .spacy extension.
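If you already have annotated Doc objects in memory, a minimal sketch of building and reading back a .spacy file directly (file names are placeholders) looks like this:

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc = nlp("Apple today announced the iPhone 14 Pro Max")
# char_span returns a Span for the given character offsets and label
doc.ents = [
    doc.char_span(0, 5, label="Organization"),
    doc.char_span(6, 11, label="Date"),
    doc.char_span(26, 43, label="Smartphone"),
]

db = DocBin(docs=[doc])          # collection of annotated Doc objects
db.to_disk("./train.spacy")      # binary format used by spaCy v3

# Reading it back later requires a shared vocab
db2 = DocBin().from_disk("./train.spacy")
docs = list(db2.get_docs(nlp.vocab))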

There are convert commands that help convert other formats into the binary format required by spaCy v3. The easiest way to obtain the binary format is to label the data following the IOB Scheme and export the labels to a TSV file, so the output should look something like this:

Apple B-Organization
today B-Date
announced O
the O
iPhone B-Smartphone
14 I-Smartphone
Pro I-Smartphone
Max I-Smartphone

and then use the following convert command to convert the TSV file to the spaCy JSON format:

!python -m spacy convert train_set.tsv ./ -t json -n 10 -c iob

NOTE: -n indicates the number of sentences per document. So if 10 is set as the value, and a given input doc has 20 sentences, it will be split into two docs! Pay attention to this parameter!

Then convert the spaCy JSON format to the binary format with the following convert command:

!python -m spacy convert train_set.json ./ -t spacy

This last command will automatically convert our labels to the BILUO Scheme, and we are then ready for the training phase in spaCy!

A fantastic tool for labeling data and exporting a TSV file with automatic conversion to the IOB Scheme is UBIAI. Another important tool is Label Studio; here it is explained how to convert data labeled with Label Studio into the binary format required by spaCy.

The Dataset used in this tutorial is NCBI-disease, a NER Dataset in the medical field. It has only one class, namely Disease, and train/val/test sets are available for it in TSV format with the IOB Scheme.

spaCy Training config file and spaCy Deep Neural Network architecture for NER

The easiest way to train spaCy pipelines is to create a config.cfg file, a training config file that includes all settings and hyperparameters. A config file is divided into sections and subsections indicated by square brackets and dot notation. Within the subsections you can enter values, and you can also use the @ syntax to call functions and their arguments in the configuration file. The functions must be registered in spaCy's function registry. You can also register your own functions, so that completely customized implementations can be used. Another important feature is that you can reference values from other sections using the following syntax: ${section.value}. The configuration file used to run the NER task in this tutorial is as follows:

[paths]
train = "./data/train.spacy"
dev = "./data/dev.spacy"
vectors = null
init_tok2vec = null
[system]
gpu_allocator = "pytorch"
seed = 0
[nlp]
lang = "en"
pipeline = ["transformer","ner"]
batch_size = 1024
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
[components]
[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 500
[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 1024
maxout_pieces = 2
use_upper = false
nO = null
[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"
[components.transformer]
factory = "transformer"
max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
mixed_precision = false
[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 500
stride = 100
[components.transformer.model.grad_scaler_config]
[components.transformer.model.tokenizer_config]
use_fast = true
[components.transformer.model.transformer_config]
[corpora]
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null
[training]
accumulate_gradient = 1
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null
[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256
get_length = null
[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false
[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.00005
[training.score_weights]
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
ents_per_type = null
[pretraining]
[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null
[initialize.components]
[initialize.tokenizer]

For example, [nlp] and [components] are two sections, while [components.transformer.model] is a subsection. The [components] section defines the available components (a pipeline usually has one or more of them), while [nlp] defines which components are used in the pipeline and their order. So the pipeline consists of the transformer and ner components.

Important details for the transformer component:

1) It allows a Transformer-based Deep Learning model from the Hugging Face library to be used in the spaCy pipeline.

2) The Transformer-based model is used as a feature-extraction network, which takes a sequence of tokens as input and returns contextual embeddings as output. The contextual embedding of a word is a real-valued vector representation that encodes the meaning of the word in a specific sentence. These models therefore understand each word based on its context (the sentence). In fact, a word can have different meanings in different sentences (contexts), and encoding (understanding) each word based on its context words is critical for good performance in NLP tasks.

3) In the subsection [components.transformer.model], name can be the name of any model that can be loaded by the transformers.AutoModel class. The best strategy is to apply the concept of Transfer Learning.

  • You should search for and select a model on Hugging Face that has been pre-trained on a large Dataset whose domain is equal or very similar to that of your data. By exploiting these weights and specializing them on your data, you can achieve excellent performance even with few training examples.
  • In this tutorial, since the training data come from the medical domain, we chose PubMedBERT, which has been pre-trained on a large amount of biomedical text. The PubMedBERT article can be found here; a quick check of the model name is sketched right below.
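As an optional sanity check (not part of the original article's code), you can verify that the chosen model name loads with the transformers auto classes before referencing it in the config:

from transformers import AutoModel, AutoTokenizer

model_name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"

# If these calls succeed, the same name can be used in [components.transformer.model]
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModel.from_pretrained(model_name)

print(type(tokenizer).__name__, type(model).__name__)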

4) Configurations of the transformers.AutoTokenizer class are entered in the subsection [components.transformer.model.tokenizer_config]. In our case, by setting use_fast = true, we use the fast version of the Tokenizer.

5) The subsection [components.transformer.model.get_spans] is used to manage long documents by cutting them into smaller sequences before running the transformer. Using strided_spans.v1, you have to set a value for the stride and window parameters (a small illustration follows this list):

  • window: Defines the maximum number of tokens in each sub-sequence obtained by splitting the original input text. For example, if window=300 and the input text has 700 tokens, the text will be split into sub-sequences of at most 300 tokens.
  • stride: Defines the step of the split, and therefore also the number of sub-sequences the text is split into. For example, with an input text of 700 tokens and stride=100, 700 // 100 = 7 sub-sequences are created: a new sub-sequence of contiguous tokens of length (at most) 300 starts every 100 tokens.
  • Obviously, if you set window and stride to the same value, each token will be present in one and only one sub-sequence. Setting a stride smaller than the window instead allows overlap, in the sense that some tokens will be present in more than one sub-sequence. This may be desirable because it allows all tokens to have both left and right context.
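The following toy function is a simplified illustration of the idea only (spaCy's actual strided_spans implementation handles the final windows slightly differently):

def strided_spans(tokens, window, stride):
    """Simplified sketch: one sub-sequence starts every `stride` tokens
    and contains at most `window` tokens."""
    return [tokens[start:start + window] for start in range(0, len(tokens), stride)]

tokens = [f"tok{i}" for i in range(700)]

overlapping = strided_spans(tokens, window=300, stride=100)  # overlapping sub-sequences
disjoint = strided_spans(tokens, window=300, stride=300)     # each token in exactly one sub-sequence

print(len(overlapping), len(disjoint))  # 7 sub-sequences vs. 3 sub-sequences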

Important details for the ner component:

spaCy allows you to create a component of NER that is transition-based, assigning entities to non-overlapping intervals of tokens. As described here, transition-based means: enter a while loop, at each step predicting a state transformation action from a given state, until a termination state is reached, at which point the predicted structure is read off the state. The “reduce” part of each step consists of extracting a particular set of tokens to represent the state, concatenating their transformer-encoded vectors, and passing the result through a feed-forward network to compute a single vector representing the state. The state vector is then passed through a feed-forward network to make the action prediction. It’s not really necessary to have a fully detailed mental model of this algorithm to use the NER, or even to customize its configuration in many ways. But that’s a quick sketch of what’s going on.

The last thing I want to point out is that with gpu_allocator = "pytorch" we use the GPU during the training phase. The rest of the configuration is easy to understand and can be reviewed in the documentation, though it is not crucial.
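If you prefer not to write the whole config by hand, spaCy can generate a starting point that you then adapt (for example, changing the model name under [components.transformer.model]). A possible command, with flags as of spaCy v3.x (check spacy init config --help for your version), is:

!python -m spacy init config config.cfg --lang en --pipeline ner --optimize accuracy --gpu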

The general spaCy Deep Neural Network architecture for NER with Transformer can be represented as follows:

Create S3 Bucket

Before training the model, it is necessary to create the bucket in which the model weights and inference code will be placed, so that the endpoint can be created whenever necessary. After creating the bucket we can, for example, create the spacy_ner_model/ "folder" and, inside it, the model_data/ folder that will contain the model weights and the inference_code/ folder that will contain the inference code:
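A minimal sketch of doing this with the AWS CLI (the bucket name is a placeholder; since S3 "folders" are just key prefixes, they can also be created implicitly on the first upload):

aws s3 mb s3://your-bucket-name
aws s3api put-object --bucket your-bucket-name --key spacy_ner_model/model_data/
aws s3api put-object --bucket your-bucket-name --key spacy_ner_model/inference_code/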

Setup Amazon SageMaker notebook instance

To create a SageMaker notebook instance:

  1. Open the SageMaker console at https://console.aws.amazon.com/sagemaker/.
  2. Choose Notebook instances, then choose Create notebook instance.
  3. Insert the Notebook instance name and choose the Notebook instance type. For this tutorial, we choose the ml.g4dn.xlarge instance type so that we have access to a GPU:

  4. The Additional configuration section also lets you specify the size, in GB, of the ML storage volume attached to the notebook instance. You can choose a size between 5 GB and 16,384 GB, in 1 GB increments. You can use the volume to clean up the training dataset or to temporarily store validation or other data.

  5. In the Permissions and encryption section, choose either an existing IAM role in your account that has the necessary permissions to access SageMaker resources, or choose Create a new role. If you choose Create a new role, SageMaker creates an IAM role named AmazonSageMaker-ExecutionRole-YYYYMMDDTHHmmSS. The AWS managed policy AmazonSageMakerFullAccess is attached to the role. The role provides permissions that allow the notebook instance to call SageMaker and Amazon S3.

  6. Choose Create notebook instance. When the status of the notebook instance is InService in the console, the notebook instance is ready to use. Choose Open JupyterLab to open the JupyterLab dashboard, then choose the conda_python3 kernel:

Finally, we can start using our notebook:

Training spaCy NER Model

After uploading config.cfg to JupyterLab we can finally start the training phase. When the model training is finished, the spaCy output containing the model configuration and weights is compressed into a model.tar.gz file and uploaded to the spacy_ner_model/model_data/ path within the bucket created earlier. Below is the Jupyter Notebook code for running the spaCy NER model training on the NCBI-disease Dataset:
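The full notebook is embedded in the original article; a minimal sketch of the main cells, assuming the IOB-labeled TSV files are in ./data/ and using a placeholder bucket name, could look like this:

# Install spaCy with transformer support (add the appropriate CUDA extra,
# e.g. spacy[cuda110], to enable the GPU; versions are indicative)
!pip install -U "spacy[transformers]"

# Convert the IOB-labeled TSV files to spaCy's binary format (two steps, as above)
!python -m spacy convert data/train.tsv ./data -t json -n 10 -c iob
!python -m spacy convert data/train.json ./data -t spacy
!python -m spacy convert data/dev.tsv ./data -t json -n 10 -c iob
!python -m spacy convert data/dev.json ./data -t spacy

# Train with the config.cfg shown above, on GPU 0; the best model ends up in output/model-best
!python -m spacy train config.cfg --output ./output --gpu-id 0

# Package the best model and upload it to S3
import tarfile
import boto3

with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("output/model-best", arcname=".")

boto3.client("s3").upload_file(
    "model.tar.gz", "your-bucket-name", "spacy_ner_model/model_data/model.tar.gz"
)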

We achieved an F1-score of 89.08%, a very strong result!

NOTE: To run the spaCy NER Model Training on a custom Dataset, simply upload your TSV files and it will be super easy to run the training phase for your application domain!

Deploy a Model in Amazon SageMaker

Create Inference code

You also need to create inference.py, which defines how to load the model, how to process the input data, how the model performs the prediction, and how to process the prediction output. In fact, we need to define the following four functions:

  • model_fn: Loads the model; the return value will be used in the predict_fn function.
  • input_fn: Takes the request data and deserializes it into an object for prediction; if needed, you can implement logic here that pre-processes the input. The return value will be used in the predict_fn function.
  • predict_fn: Takes the deserialized request object and performs inference with the loaded model. The return value will be used in the output_fn function.
  • output_fn: Takes the prediction result and serializes it according to the response content type. If needed, you can implement logic here that post-processes the prediction.

We need to implement inference.py in such a way that the input is text and the model used is a spaCy NER Model. Below is a possible implementation:
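The original implementation is embedded in the article; below is a minimal sketch of what such an inference.py can look like, consistent with the input/output format described next (the four function signatures follow the SageMaker PyTorch inference toolkit):

import json
import spacy


def model_fn(model_dir):
    # model.tar.gz is extracted into model_dir, which contains the spaCy pipeline
    return spacy.load(model_dir)


def input_fn(request_body, request_content_type):
    # Deserialize the request: we expect {"text": "..."}
    if request_content_type == "application/json":
        return json.loads(request_body)
    raise ValueError(f"Unsupported content type: {request_content_type}")


def predict_fn(input_data, model):
    # Run the spaCy NER pipeline on the input text
    doc = model(input_data["text"])
    predictions = {}
    for i, ent in enumerate(doc.ents):
        predictions[str(i)] = {
            "text_ent": ent.text,
            "start": ent.start_char,
            "end": ent.end_char,
            "label": ent.label_,
        }
    return predictions


def output_fn(prediction, response_content_type):
    # Serialize the prediction according to the response content type
    if response_content_type == "application/json":
        return json.dumps(prediction)
    raise ValueError(f"Unsupported content type: {response_content_type}")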

This inference.py allows the endpoint to accept as input a JSON object with a key called “text”, whose value is the input text from which to extract entities. The output is a JSON object in which, for each entity found by the model in the input text, there is an element containing the entity text (“text_ent”), its start and end positions in the input text, and the associated label.
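For example, a request body and a corresponding response might look like this (the predicted entity is purely illustrative; the actual output depends on the trained model):

{"text": "The patient was diagnosed with type 2 diabetes."}

{"0": {"text_ent": "type 2 diabetes", "start": 31, "end": 46, "label": "Disease"}}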

Amazon SageMaker provides prebuilt Docker images that include deep learning framework libraries and other dependencies needed for training and inference, for frameworks such as MXNet, TensorFlow, and PyTorch. Unfortunately, there is no prebuilt Docker image for spaCy. As we will see in the next section, we will exploit the prebuilt Docker image for PyTorch inference and, to use spaCy, we will simply create the following requirements.txt that installs spaCy:
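The exact file is embedded in the original article; a minimal version, assuming the pipeline only needs spaCy with transformer support, would be something like the following (pin the versions to those used during training):

# requirements.txt (versions are indicative)
spacy[transformers]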

Finally, you need to compress inference.py and requirements.txt into an inference.tar.gz file and upload it to the spacy_ner_model/inference_code/ path within the bucket created earlier.
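For example, from a shell with AWS credentials configured (the bucket name is a placeholder):

tar -czvf inference.tar.gz inference.py requirements.txt
aws s3 cp inference.tar.gz s3://your-bucket-name/spacy_ner_model/inference_code/inference.tar.gz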

Create an Endpoint and delete it programmatically

To create a SageMaker Endpoint and deploy the custom spaCy NER model, just run the following lines of code:

import sagemaker
from sagemaker.pytorch import PyTorchModel

sagemaker_role = 'your-sagemaker-role'

path_model_data_ner_model = "s3://your-bucket-name/spacy_ner_model/model_data/model.tar.gz"
source_dir_ner_model = "s3://your-bucket-name/spacy_ner_model/inference_code/inference.tar.gz"
entry_point_ner_model = 'inference.py'

endpoint_name_ner_model = 'endpoint-name-ner-model'
instance_type_ner_model = 'ml.m4.xlarge'
initial_instance_count_ner_model = 1

spacy_ner_model = PyTorchModel(model_data=path_model_data_ner_model,
                               role=sagemaker_role,
                               entry_point=entry_point_ner_model,
                               py_version='py3',
                               source_dir=source_dir_ner_model,
                               framework_version='1.7.1')

predictor_ner_model = spacy_ner_model.deploy(initial_instance_count=initial_instance_count_ner_model,
                                             instance_type=instance_type_ner_model,
                                             endpoint_name=endpoint_name_ner_model)

For example, this code can be inserted into an AWS Glue Job and run whenever the endpoint needs to be created.

To delete the endpoint automatically, simply run the following lines of code:

import boto3

region_name = 'your-region-name'
sagemaker_client = boto3.client('sagemaker', region_name=region_name)

endpoint_name_ner_model = 'endpoint-name-ner-model'

# Delete the endpoint and its endpoint configuration
# (by default, deploy() creates the endpoint config with the same name as the endpoint)
sagemaker_client.delete_endpoint(EndpointName=endpoint_name_ner_model)
sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_name_ner_model)

Again, a dedicated AWS Glue Job can be created and run when you need to delete the endpoint.

Invoke an Amazon SageMaker endpoint using AWS Lambda

The following script can be used to invoke the spaCy NER model endpoint from an AWS Lambda function:

import boto3
import json

NER_ENDPOINT = 'endpoint-name-ner-model'

# Client for the SageMaker runtime, used to invoke endpoints
sagemaker_client = boto3.client('sagemaker-runtime')

input_text = 'your-input-text'

response_ner_endpoint = sagemaker_client.invoke_endpoint(
    EndpointName=NER_ENDPOINT,
    Body=json.dumps({"text": input_text}),
    ContentType='application/json',
    Accept='application/json'
)

predictions_ner_endpoint = json.loads(response_ner_endpoint['Body'].read().decode())

for key, prediction_ner_endpoint in predictions_ner_endpoint.items():
    label_predicted = prediction_ner_endpoint["label"]
    text_entity = prediction_ner_endpoint["text_ent"]
    # ... use label_predicted and text_entity here

You can then perform a wide variety of operations on the output entities!

Remember to create an IAM Role for your Lambda that authorizes the function to call a SageMaker endpoint. As described here, simply create an IAM role that includes the following policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "sagemaker:InvokeEndpoint",
            "Resource": "*"
        }
    ]
}

Conclusions

Building a good model to perform NER on a new dataset can be a difficult and time-consuming task. In this article, we have shown that using spaCy to build a custom NER model allows the Data Scientist to write little or no code and to focus on data labeling and hyperparameter tuning, which are critical to good model performance.

In one or more components of a Cloud Architecture, it is necessary to invoke the model to obtain predictions and use them to achieve business goals. Therefore, we have shown how to create a SageMaker Endpoint and invoke it from an AWS Lambda function.

The code related to building the SageMaker endpoint has been structured in such a way that the endpoint can be created and deleted whenever necessary, always reusing the same weights. This solution makes it possible to:

  • Create and delete the endpoint automatically.
  • Pay for the endpoint only when it is needed.
  • Avoid re-running the training phase every time the model is deployed.

Special thanks to Gianluca Favuzzi and Luca Azzini, co-authors of this article.
