Building NLP datasets: the leap of faith

Published in

Adarga

2 min read · Jun 17, 2021

While the potential for natural language processing (NLP) to bring about significant, positive disruption across industries has been widely recognised, and its likely impact is now much better understood, examples of successful commercial deployments remain surprisingly limited.

In this post we explore one of the main reasons for this: delivering successful NLP projects to real-world customers requires expertise and patience.

In 2019, NLP “pipelines” (systems that read and understand textual data) have never been easier to build. A significant number of tech companies have developed software by leveraging open-source NLP libraries (like NLTK or spaCy) or commoditised NLP APIs (like Basis Tech’s Rosette Text Analytics or Google’s Natural Language).

That said, it is crucial to understand that these services deploy generic machine learning (ML) models, which serve general use cases perfectly well but may not answer the needs of a specific customer operating in a specific industry or sector.

So when these commercial adopters realise the limits of commoditised AI (which does a disservice to NLP specialists), they face the challenge of adapting tools they have not built themselves (where they have any control over them at all), with no realistic idea of the probability of success on their specific data. There are a number of proven techniques for adapting and specialising NLP models to new domains, particularly exciting new directions in deep learning, where word embeddings and neural network layers are reused and their weights updated using a dataset from the new target domain.
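To make the idea of reusing embeddings and updating their weights concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the two-dimensional "pretrained" embeddings, the starting classifier weights, the six-sentence target dataset and the function names are all hypothetical toy values, not a real model or anyone's production method. It starts from generic parameters and fine-tunes both the embeddings and a logistic layer on a small domain dataset.

```python
import math

# Hypothetical "pretrained" embeddings and classifier weights. In a real
# system these would come from a model trained on a large generic corpus.
PRETRAINED_EMBEDDINGS = {
    "tank": [0.8, 0.1],
    "water": [0.9, 0.0],
    "fish": [0.7, 0.2],
    "armoured": [0.1, 0.6],
    "brigade": [0.0, 0.7],
    "deployed": [0.2, 0.5],
}
PRETRAINED_WEIGHTS = [0.0, 0.5]
PRETRAINED_BIAS = 0.0

# Tiny hypothetical target-domain dataset: (tokens, label), 1 = military.
TARGET_DATA = [
    (["armoured", "brigade", "deployed"], 1),
    (["tank", "brigade"], 1),
    (["tank", "deployed"], 1),
    (["fish", "tank"], 0),
    (["water", "tank"], 0),
    (["fish", "water"], 0),
]


def sentence_vector(tokens, emb):
    """Represent a sentence as the mean of its word embeddings."""
    dim = len(next(iter(emb.values())))
    return [sum(emb[t][i] for t in tokens) / len(tokens) for i in range(dim)]


def predict(tokens, emb, w, b):
    """Probability that the sentence belongs to the target class."""
    x = sentence_vector(tokens, emb)
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(data, emb, w, b):
    """Average binary cross-entropy over the dataset."""
    total = 0.0
    for tokens, y in data:
        p = min(max(predict(tokens, emb, w, b), 1e-12), 1 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)


def fine_tune(data, emb, w, b, lr=0.5, epochs=300):
    """Update copies of the pretrained parameters with plain SGD."""
    emb = {t: list(v) for t, v in emb.items()}  # leave the originals intact
    w = list(w)
    for _ in range(epochs):
        for tokens, y in data:
            x = sentence_vector(tokens, emb)
            err = predict(tokens, emb, w, b) - y  # gradient of loss w.r.t. z
            # Embedding update first, so it uses the weights from before this step.
            for t in tokens:
                for i in range(len(w)):
                    emb[t][i] -= lr * err * w[i] / len(tokens)
            for i in range(len(w)):
                w[i] -= lr * err * x[i]
            b -= lr * err
    return emb, w, b


tuned_emb, tuned_w, tuned_b = fine_tune(
    TARGET_DATA, PRETRAINED_EMBEDDINGS, PRETRAINED_WEIGHTS, PRETRAINED_BIAS
)
print("loss before:", mean_loss(TARGET_DATA, PRETRAINED_EMBEDDINGS,
                                PRETRAINED_WEIGHTS, PRETRAINED_BIAS))
print("loss after: ", mean_loss(TARGET_DATA, tuned_emb, tuned_w, tuned_b))
```

The point of the sketch is the ambiguous word "tank": the generic starting parameters cannot tell the two senses apart, and only the labelled target-domain examples let the model separate them. Real projects would do the same thing with a deep learning framework and far more data, which is exactly why the quality of the domain dataset matters.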

On the question of whether they should invest in building their own NLP datasets, most companies will not take this leap of faith. However, data brings a significant advantage and is a key asset if created properly.

NLP has been at the core of Adarga’s technology stack since its inception: we have built expertise in a large range of approaches, methods and tasks that can be adapted to new domains or used to solve new problems. Adarga has followed NLP annotation best practices to build proprietary datasets and models for its target domain: defence and intelligence applications. NLP and subject matter experts have collaborated and redefined tasks such as Named Entity Recognition and Relationship Extraction for the needs of this specific domain, to achieve the best possible results.
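For readers unfamiliar with what an annotated NER dataset looks like, the following is a small illustrative sketch of one common convention: expert-marked entity spans converted into per-token BIO tags. The label names (`MIL_UNIT`, `LOCATION`), the example sentence and the `spans_to_bio` helper are hypothetical and chosen only for illustration; they are not Adarga's actual schema or tooling.

```python
from typing import List, Tuple

# A span is (start_token, end_token_exclusive, label).
Span = Tuple[int, int, str]


def spans_to_bio(tokens: List[str], spans: List[Span]) -> List[str]:
    """Convert annotated entity spans into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"span out of range: {(start, end, label)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end, label)}")
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


# Hypothetical annotated sentence with domain-specific entity labels.
tokens = ["The", "3rd", "Armoured", "Brigade", "deployed", "to", "Basra", "."]
spans = [(1, 4, "MIL_UNIT"), (6, 7, "LOCATION")]

for token, tag in zip(tokens, spans_to_bio(tokens, spans)):
    print(f"{token}\t{tag}")
```

A generic model would typically offer only broad labels such as organisation or location; redefining the task for a domain largely means deciding on labels like these with subject matter experts and annotating enough examples consistently.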

Written by Adarga