Training and Fine Tuning NER transformer models using spaCy3 and spacy-annotator

Jim Zieleman
8 min read · Mar 28, 2022

This article explains how to label data for Named Entity Recognition (NER) using spacy-annotator and train a transformer-based NER model using spaCy3. You will learn how to train a model from scratch and how to fine-tune existing weights on a GPU.

What is Named Entity Recognition?

Named entity recognition (NER) ‒ also called entity identification or entity extraction ‒ is a natural language processing (NLP) technique that automatically identifies named entities in a text and classifies them into predefined categories. Entities can be names of people, organizations, locations, times, quantities, monetary values, percentages, and more.
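To make this concrete, here is a rough sketch of what an NER annotation looks like: each entity is stored as a (start, end, label) character span. The sentence, the helper name span_for, and the labels picked here are assumptions for illustration only. A trained model predicts such spans; it does not look them up like this.

```python
# Illustrative only: entities as (start_char, end_char, label) spans.
text = "Maria Garcia flew to Paris on Friday."


def span_for(text, phrase, label):
    """Return the character span of the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)  # raises ValueError if the phrase is missing
    return (start, start + len(phrase), label)


entities = [
    span_for(text, "Maria Garcia", "PERSON"),
    span_for(text, "Paris", "GPE"),
    span_for(text, "Friday", "DATE"),
]

# Slice the text with each span to show what was labeled.
for start, end, label in entities:
    print(f"{text[start:end]!r} -> {label}")
```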

A quick summary of spacy-annotator

spacy-annotator is a library for creating training data for spaCy Named Entity Recognition (NER) models using ipywidgets.

Created by: Enrico Alemani

GitHub: https://github.com/ieriii/spacy-annotator

I came across this library early last year and used it to dip my toes into NLP. Most annotators are stuck behind a paywall, and people just entering the data science space may not want to commit $$$ to testing out various frameworks. spacy-annotator is perfect for testing the waters for NLP tasks.

Please note that this is not a substitute for enterprise-level labeling software. The team behind spaCy makes an amazing proprietary annotation tool called Prodigy, which covers far more than spacy-annotator does.

spacy-annotator was an absolute lifesaver for me. As a result, when spaCy3 rolled out and the training format changed, I felt obligated to contribute functionality to spacy-annotator that converts the annotator's output into a spaCy3 trainable format.

I will demonstrate a quick example of data labeling using this tool. For more details on spacy-annotator, please refer to the original tutorial by Enrico Alemani: https://enrico-alemani.medium.com/how-to-create-training-data-for-spacy-ner-models-using-ipywidgets-c4aa71bf61a2

Install Dependencies

Open up command prompt or terminal and run the following commands.

pip install spacy-annotator
pip install spacy
pip install spacy-transformers
pip install cupy
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
pip install pandas
python -m spacy download en_core_web_trf

Labeling Data

Create a folder.

Navigate to my tutorial repository here and save SPA_text.csv and SPA_example.ipynb to your folder.

Your directory structure should look like the following:

Open up SPA_example.ipynb in Jupyter notebook.

Here we import the modules needed and read in our unlabeled data.

For the purposes of this example we will load in an existing pretrained spaCy transformer model.

Then we instantiate our annotator. Here we use PERSON and GPE which are pretrained labels from the spaCy model (en_core_web_trf) we loaded in. Click here to see more examples and functionality.

The model preloaded into the annotator has already labeled some of our text. Verify that these labels are correct and add any labels that were missed in the example text.

Once you have finished labeling all your data you should see the following.

For the sake of the example, we will just save the same annotated data as both training and validation data (normally you would have more data and do train/test splits).
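For your own projects, a split could look something like the sketch below. It uses only the Python standard library. The function name train_dev_split, the 80/20 ratio, and the fixed seed are my assumptions, not something spacy-annotator provides.

```python
import random


def train_dev_split(examples, dev_fraction=0.2, seed=0):
    """Shuffle a copy of `examples` and split it into (train, dev) lists."""
    if not 0.0 < dev_fraction < 1.0:
        raise ValueError("dev_fraction must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


# Placeholder data standing in for your labeled rows.
examples = [f"sentence {i}" for i in range(10)]
train, dev = train_dev_split(examples, dev_fraction=0.2)
print(len(train), len(dev))
```

You would then save the two lists to train.spacy and dev.spacy separately, so the validation scores reflect text the model has not seen.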

Navigate back to your folder and you should see a similar file structure.

Congratulations, you have now labeled data and saved it in a spaCy3 trainable format.
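In this tutorial spacy-annotator writes the .spacy files for us. If you are curious what such a file holds, the sketch below shows one way to write character-offset annotations to a .spacy file with spaCy's own DocBin class. This is not spacy-annotator's implementation. The function names has_overlaps and save_docbin and the sample record are assumptions for illustration.

```python
def has_overlaps(entities):
    """Return True if any two (start, end, label) spans overlap."""
    ordered = sorted(entities)
    return any(prev[1] > cur[0] for prev, cur in zip(ordered, ordered[1:]))


def save_docbin(records, path, lang="en"):
    """Write (text, entities) records to a .spacy file. Requires spaCy v3."""
    import spacy  # imported here so the helper above works without spaCy
    from spacy.tokens import DocBin

    nlp = spacy.blank(lang)
    doc_bin = DocBin()
    for text, entities in records:
        if has_overlaps(entities):
            continue  # entities assigned to doc.ents must not overlap
        doc = nlp.make_doc(text)
        spans = [doc.char_span(s, e, label=label) for s, e, label in entities]
        # char_span returns None when offsets do not line up with token boundaries
        doc.ents = [span for span in spans if span is not None]
        doc_bin.add(doc)
    doc_bin.to_disk(path)


records = [("Maria Garcia flew to Paris.", [(0, 12, "PERSON"), (21, 26, "GPE")])]
# save_docbin(records, "train.spacy")  # uncomment once spaCy is installed
```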

Training your model from scratch

Go to this link: https://spacy.io/usage/training

The spaCy documentation on how to train and develop models is incredibly extensive and detailed. This tutorial provides a very basic introductory guide to give you a starting point for model development using the spaCy3 framework.

Scroll down until you see Quickstart and select the following. Then click the download button in the bottom right corner and save the config file to your tutorial folder.

Your folder should now look like the following:

Open up command prompt or terminal and navigate to the folder you created for this tutorial.

Then run the following command:

python -m spacy init fill-config base_config.cfg config.cfg

Your files should now look like this.

Open config.cfg in a text editor of your choice; I will be using Notepad. Under [paths], set train and dev to the file paths of train.spacy and dev.spacy respectively.

For the purposes of this tutorial, we are going to cut training short for time's sake. Under [training], set max_epochs to 0 (no limit on epochs, so training runs until the step limit is reached), max_steps to 100 (training stops after 100 steps), and eval_frequency to 20 (the model is evaluated and the metrics are logged every 20 steps).

Return to your command prompt or terminal and run the following command.

python -m spacy debug data config.cfg

You should see something like this.

If you encounter an AssertionError, go back to your config and replace each “\” in your file paths with “/”.

Ex. “C:/Workspace/My Tutorial/train.spacy”
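If you would rather not retype the path by hand, Python's standard pathlib module can do the conversion. This is a small sketch; the helper name to_forward_slashes is my own, and the path is the example one from this tutorial.

```python
from pathlib import PureWindowsPath


def to_forward_slashes(windows_path):
    """Rewrite a Windows path so it uses forward slashes."""
    return PureWindowsPath(windows_path).as_posix()


# The r"..." prefix keeps Python from treating the backslashes as escapes.
print(to_forward_slashes(r"C:\Workspace\My Tutorial\train.spacy"))
```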

Now we can finally train our model by running the following command.

python -m spacy train config.cfg --output ./output --gpu-id 0
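As an alternative to editing config.cfg by hand, spacy train accepts config overrides on the command line in dotted form, such as --paths.train and --training.max_steps. The sketch below only builds that command as a list of arguments; the helper name build_train_command and the relative file paths are assumptions for illustration.

```python
import sys


def build_train_command(config="config.cfg", output="./output", gpu_id=0, overrides=None):
    """Return the `spacy train` command as an argument list."""
    command = [sys.executable, "-m", "spacy", "train", config,
               "--output", output, "--gpu-id", str(gpu_id)]
    # Each override becomes a dotted flag, e.g. --training.max_steps 100
    for key, value in (overrides or {}).items():
        command += [f"--{key}", str(value)]
    return command


command = build_train_command(overrides={
    "paths.train": "./train.spacy",
    "paths.dev": "./dev.spacy",
    "training.max_steps": 100,
    "training.eval_frequency": 20,
})
print(" ".join(command))
# To launch training: import subprocess; subprocess.run(command, check=True)
```

Values passed this way take precedence over what is written in the config file for that run, which is handy when you want to try a few settings without touching the file.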

In this tutorial our dataset is very small and our training and validation sets are exactly the same. Hence, our scores will get very high because the model is overfitting. For your own projects you will need a separate validation set and more in-depth analysis.

Notice that our training scores are climbing from 0.00, so the model is learning.

Navigate to your folder and you should see the following file structure.

Click output and you will see your best model and last model saved.

To load your model back into Python, use the following lines of code.

import spacy
nlp = spacy.load(r"C:\Workspace\My Tutorial\output\model-last")
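Once the model is loaded you will probably want to look at what it predicts. Here is a minimal sketch: doc.ents, ent.text and ent.label_ are standard spaCy attributes, while the helper names entities_from_doc and load_and_tag, the relative model path, and the sample sentence are assumptions for illustration.

```python
def entities_from_doc(doc):
    """Return (text, label) pairs for every entity the pipeline found."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def load_and_tag(model_path, text):
    """Load a trained pipeline from disk and tag one string. Requires spaCy."""
    import spacy  # imported lazily; only needed when a model is actually loaded

    nlp = spacy.load(model_path)
    return entities_from_doc(nlp(text))


# Example call, assuming the output folder from this tutorial exists:
# print(load_and_tag("./output/model-last", "Maria Garcia flew to Paris."))
```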

Fine tuning your model

Fine-tuning works exactly like training your model from scratch, except for a few more changes to your config.

Go to this link: https://spacy.io/usage/training

The spaCy documentation on how to train and develop models is incredibly extensive and detailed. This tutorial provides a very basic introductory guide to give you a starting point for model development using the spaCy3 framework.

Scroll down until you see Quickstart and select the following. Then click the download button in the bottom right corner and save the config file to your tutorial folder.

Your folder should now look like the following:

Open up command prompt or terminal and navigate to the folder you created for this tutorial.

Then run the following command:

python -m spacy init fill-config base_config.cfg config.cfg

Your files should now look like this.

Open config.cfg in a text editor of your choice; I will be using Notepad. Under [paths], set train and dev to the file paths of train.spacy and dev.spacy respectively.

Under [components.ner], change factory = “ner” to source = “en_core_web_trf”. Under [components.transformer], likewise change factory = “transformer” to source = “en_core_web_trf”.
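The same edit can be sketched as a plain text replacement in Python. This is not a spaCy API, just string handling over the config text. The function name source_components is my own, and the sample string is an abbreviated, made-up excerpt rather than the full generated config.

```python
def source_components(config_text, components, source_model="en_core_web_trf"):
    """Swap `factory = ...` for `source = ...` in the given [components.*] blocks."""
    targets = {f"[components.{name}]" for name in components}
    current_section = None
    out = []
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            current_section = stripped  # remember which block we are inside
        if current_section in targets and stripped.startswith("factory"):
            line = f'source = "{source_model}"'
        out.append(line)
    return "\n".join(out)


# Abbreviated, hypothetical excerpt of a config file.
sample = '''[components.ner]
factory = "ner"

[components.ner.model]
hidden_width = 64

[components.transformer]
factory = "transformer"
'''
print(source_components(sample, ["ner", "transformer"]))
```

Only the factory lines inside the two named blocks are touched; nested blocks such as [components.ner.model] are left as they are.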

For the purposes of this tutorial, we are going to cut training short for time's sake. Under [training], set max_epochs to 0 (no limit on epochs, so training runs until the step limit is reached), max_steps to 100 (training stops after 100 steps), and eval_frequency to 20 (the model is evaluated and the metrics are logged every 20 steps).

Return to your command prompt or terminal and run the following command.

python -m spacy debug data config.cfg

You should see something like this.

If you encounter an AssertionError, go back to your config and replace each “\” in your file paths with “/”.

Ex. “C:/Workspace/My Tutorial/train.spacy”

Now we can finally train our model by running the following command.

python -m spacy train config.cfg --output ./output --gpu-id 0

In this tutorial our dataset is very small and our training and validation sets are exactly the same. Hence, our scores will get very high because the model is overfitting. For your own projects you will need a separate validation set and more in-depth analysis.

Notice that, unlike training from scratch, the score starts at a much higher number. Training begins from the existing weights in the pretrained model. Hence, we have “fine-tuned” our model.

Navigate to your folder and you should see the following file structure.

Click output and you will see your best model and last model saved.

To load your model back into Python, use the following lines of code.

import spacy
nlp = spacy.load(r"C:\Workspace\My Tutorial\output\model-last")

Last Words

Congratulations. You have now learned how to annotate data and how to train and fine-tune NER models using spacy-annotator and spaCy3. Best of luck with your future data science endeavors. If you plan to keep using spaCy3 in production and need a more powerful annotator, I absolutely recommend Prodigy.

Jim Zieleman

I do data science involving financial, NLP, plant, sound, and image data. NER/Relational models to OCR to Object Detection. github.com/LeafmanZ www.zieleman.dev