QUANTRIUM GUIDES

Automated spaCy Re-training

Save time and boost accuracy with automated spaCy re-training

Priyanshu Sankhala
Quantrium.ai

--

Introduction

spaCy is a popular NLP library that can be used for a variety of tasks, such as Named Entity Recognition (NER), text classification, and sentiment analysis. Its NER capabilities are particularly strong, and NER is a crucial component of Natural Language Processing (NLP) applications: it identifies and categorizes entities within text, such as names of people, organizations, locations, dates, and more.

NER models trained on a specific dataset perform well on that kind of data. However, as time goes on, the data landscape evolves. New names, entities, and patterns emerge, so you need to adapt your NER models to keep them effective.

We delved deeper into spaCy’s capabilities and found that the re-training process can be automated. Let’s see how.

Why Automated?

Suppose you are working on a project that uses NER to extract information from user reviews for a restaurant recommendation system. The names of new restaurants, local businesses, and even emerging food trends change constantly. Keeping your model up to date manually can be time-consuming and error-prone.

Manually training and re-training NER models can be arduous, involving data preprocessing, entity annotation, and feature engineering. It is also not scalable, because the same effort is required every time the model is retrained. Automation streamlines this process and often yields better model performance.

spaCy provides features that can be used to retrain a model, and the process can be automated. The benefits include:

  • Increased efficiency: Automation can save a significant amount of time and effort, especially for large datasets or datasets that are constantly changing.
  • Improved accuracy: Automation can improve the accuracy of spaCy models because the models are regularly updated with new data.
  • Reduced manual intervention: Automation reduces the need for manual intervention and frees up time for ML engineers to focus on other tasks.

The Automated Process — A Closer Look

Automatically updating your NER model with new data at regular intervals enables it to adapt to the evolving language and entities in your domain. Here’s how it works:

Data Collection

Set up a pipeline that continuously gathers new text data representing the domain or context in which your NER model operates. This data should include examples of the entities you want your model to recognize.
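As a rough illustration of the collection step, the sketch below picks up text files that have not been processed yet. It assumes a setup the article does not specify: a hypothetical inbox folder where raw .txt files arrive, and a set of file names that earlier runs have already handled.

```python
from pathlib import Path


def collect_new_texts(inbox_dir, seen_names):
    """Return (file name, text) pairs for .txt files not processed before.

    `inbox_dir` and `seen_names` are hypothetical: a folder where upstream
    jobs drop raw text, and the set of file names handled in earlier runs.
    """
    new_items = []
    for path in sorted(Path(inbox_dir).glob("*.txt")):
        if path.name in seen_names:
            continue  # already collected in a previous run
        text = path.read_text(encoding="utf-8").strip()
        if text:  # skip files that are empty or whitespace only
            new_items.append((path.name, text))
    return new_items
```

In a real pipeline the source could just as well be a database, a message queue, or an API; only the idea of tracking what has already been collected carries over.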

Data Preprocessing

Add a data preprocessing stage to your pipeline that cleans and formats the new data so that it is suitable for NER training. This step will require manual review or annotation of the training data. Once the data is prepared, combine your existing training data with the new data to create a comprehensive dataset for retraining, saved in spaCy’s binary .spacy format.
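A minimal sketch of this stage might look like the following. It assumes the annotations are available as character offsets, in the form (text, [(start, end, label), ...]); that layout and the function names are illustrative choices, not something the article prescribes. The serialization uses spaCy's DocBin class, which writes the .spacy format.

```python
def merge_examples(existing, new):
    """Combine old and new annotated examples, keeping one entry per text.

    Each example is (text, [(start_char, end_char, label), ...]).
    When a text appears in both lists, the newer annotation wins.
    """
    merged = {text: spans for text, spans in existing}
    merged.update({text: spans for text, spans in new})
    return list(merged.items())


def write_docbin(examples, output_path, lang="en"):
    """Serialize examples to a .spacy file using spaCy's DocBin."""
    import spacy  # imported here so merge_examples works without spaCy
    from spacy.tokens import DocBin

    nlp = spacy.blank(lang)  # tokenizer only, no trained components
    doc_bin = DocBin()
    for text, spans in examples:
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in spans:
            span = doc.char_span(start, end, label=label)
            if span is not None:  # None: offsets do not align to tokens
                ents.append(span)
        doc.ents = ents  # raises if annotated spans overlap
        doc_bin.add(doc)
    doc_bin.to_disk(output_path)
```

A call such as `write_docbin(merge_examples(old, new), "train.spacy")` would then produce the file used for retraining. Spans that do not align with token boundaries are silently skipped here; a production pipeline would more likely log them for manual review.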

Initiating Automated Retraining

Once the training data is generated, retraining can be triggered automatically as soon as the data files are added or updated, using spaCy’s training and update functionality. The run uses pre-defined hyperparameters and training configurations to adapt the model to new patterns and changes in the input data, which should lead to an upgraded model with improved performance.
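One possible way to wire up such a trigger is sketched below. It fingerprints the .spacy files in a data folder and launches spaCy's `spacy train` command only when the fingerprint changes. The folder layout, the helper names, and the choice of hashing over a file watcher are all assumptions made for illustration.

```python
import hashlib
import subprocess
import sys
from pathlib import Path


def data_fingerprint(data_dir):
    """Hash the names and contents of all .spacy files in a directory."""
    digest = hashlib.sha256()
    for path in sorted(Path(data_dir).glob("*.spacy")):
        digest.update(path.name.encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


def build_train_command(config_path, output_dir, train_path, dev_path):
    """Build the `spacy train` invocation as an argument list."""
    return [
        sys.executable, "-m", "spacy", "train", str(config_path),
        "--output", str(output_dir),
        "--paths.train", str(train_path),
        "--paths.dev", str(dev_path),
    ]


def retrain_if_changed(data_dir, last_fingerprint, command, run=subprocess.run):
    """Run `command` only when the data differs from the last run.

    Returns the current fingerprint so the caller can store it.
    """
    current = data_fingerprint(data_dir)
    if current != last_fingerprint:
        run(command, check=True)  # raises if training exits with an error
    return current
```

A scheduler (cron, a CI job, or a workflow tool) could call `retrain_if_changed` periodically and persist the returned fingerprint between runs. The hyperparameters themselves live in the training config file passed to `build_train_command`.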

Performance Monitoring and Model Selection

After retraining, add a stage to your pipeline that continuously monitors the model’s performance using metrics such as precision, recall, and F1-score. If the model’s performance is unsatisfactory, this stage should trigger retraining with a different set of hyperparameters, or you may need to switch to a different architecture to enhance the model’s capabilities.
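To make the metrics concrete, here is a small sketch that computes entity-level precision, recall, and F1 from gold and predicted spans, followed by a simple threshold check. The span representation (document id, start, end, label) and the 0.85 threshold are hypothetical; spaCy's own evaluation reports comparable scores under the keys `ents_p`, `ents_r`, and `ents_f`.

```python
def prf_scores(gold_spans, predicted_spans):
    """Return (precision, recall, f1) for exact-match entity spans.

    Each span is a hashable tuple such as (doc_id, start, end, label).
    A prediction counts as correct only if it matches a gold span exactly.
    """
    gold = set(gold_spans)
    predicted = set(predicted_spans)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def needs_retraining(f1, threshold=0.85):
    """Flag the model for another training run when F1 is below threshold.

    The default threshold is an illustrative value, not a recommendation.
    """
    return f1 < threshold
```

When `needs_retraining` returns True, the pipeline could launch another run with a different training config, which is where alternative hyperparameters or architectures would be specified.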

Use Cases for Automated spaCy Re-training

At Quantrium, we have used custom spaCy models, trained from scratch on our own classes and data, to automate various document intelligence processes in the finance and insurance industries. Examples include extracting and identifying keys and values in:

  • Paystubs or Salary Slips
  • Financial Statements (such as Balance Sheet, Profit & Loss and Cash Flow statements)
  • Tax Documents (such as Income Tax Returns, Form 26AS, W-2 and 1099 Forms)
  • Bank Statements and Passbooks
  • Insurance Claim Documents (Insurer, Insured Person, Amount, Charges, Penalty, X12 Codes, etc.)

Automated retraining with spaCy is a powerful strategy for maintaining the accuracy and relevance of your NER models. In the ever-evolving world of text data, staying up-to-date is not a luxury but a necessity. Embrace automation and watch your NER models shine in accuracy and adaptability, providing better insights and performance in your NLP applications.
