Training Data for NLP Algorithms: Your Options for Collecting or Creating Annotated Datasets

This post originally appeared on

Today, natural language processing practitioners have many options when it comes to acquiring annotated datasets to train their algorithms. Newbie data scientists and software engineers developing NLP models have several public datasets available to them as they begin their careers, as well as tools and solutions that make generating custom annotations far more efficient than in years past. (It isn’t like the old days when building an annotated corpus was uphill both ways! Or whatever.)

This new, options-galore world is thanks to the very cool data science community’s philosophy of sharing results and making data and tools open-source whenever possible, as well as the emergence of the aforementioned solution providers (some cooler than others, #justsayin). This blog outlines the training data choices out there, as well as which options work best for which use cases.

Sources for gathering and/or generating annotated datasets fall into two main buckets: pre-existing, publicly available datasets; and tools and solutions for creating your own corpus.

Pre-Existing, Publicly Available Datasets

Ten minutes of googling will yield you a bounty of links to public datasets, including collections built specifically for sentiment analysis and for health/biology-related data. These datasets are great for training on more general NLP tasks, and in some cases are enough to do the job. Though the data obviously isn’t custom-created for your particular problem, the annotation accuracy is usually good, so if you’re building a more generic model, you may be able to rely solely on already-available annotated data.

Tools & Solutions for Creating Your Own Annotated Corpus

If you need specialized, custom training data (say, sample conversations with some unique specifics for training a chatbot), you’re gonna need to create that yourself (or have it created). For annotating/generating your own training data, there are, broadly, three options:

  1. DIY (including using in-house resources or hiring temp workers)
  2. what we at Spare5 call “traditional crowdsourcing”
  3. a training data solution platform (we call that “Training Data as a Service” or TDaaS)

If you choose to do it yourself (which, yeesh, what an undertaking), there are annotation tools that could help. They’re essentially ready-made UIs into which you load your text and within which your annotators work, and they help you keep track of the process. Handling annotations in-house could work for you if you have plenty of time and internal resources (or the budget to hire contractors), and you don’t need much training data and/or you don’t anticipate needing to scale the process.

Going the traditional crowdsourcing route takes a bit more off your plate, but you’ll still need to create your own annotation tasks for the crowd (and maybe even write custom scripts). You’ll also have to rely on a largely unknown pool of workers, and QA the results yourself. If you have the time to manage the process, don’t need humans with specific skills or specialities to annotate, and don’t need highly accurate results on a tight timeline, this might be a good option for you.
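To give a feel for what "QA the results yourself" can involve, here is a minimal Python sketch of one common approach: collect several workers' labels per item, take the majority vote, and set aside low-agreement items for manual review. This is an illustration under stated assumptions, not any particular platform's method. The function name `aggregate_labels`, the `min_agreement` threshold of 0.6, and the sample data are all hypothetical.

```python
from collections import Counter


def aggregate_labels(annotations, min_agreement=0.6):
    """Majority-vote each item's labels.

    Items whose winning label has a vote share below min_agreement
    (a hypothetical threshold) are flagged for manual review instead.
    """
    accepted = {}
    needs_review = []
    for item_id, labels in annotations.items():
        if not labels:
            # No worker labeled this item, so a human has to look at it.
            needs_review.append(item_id)
            continue
        top_label, top_count = Counter(labels).most_common(1)[0]
        agreement = top_count / len(labels)
        if agreement >= min_agreement:
            accepted[item_id] = top_label
        else:
            needs_review.append(item_id)
    return accepted, needs_review


# Hypothetical crowd output: three workers labeled each sentence.
crowd_annotations = {
    "s1": ["positive", "positive", "negative"],
    "s2": ["negative", "negative", "negative"],
    "s3": ["positive", "negative", "neutral"],
}

accepted, needs_review = aggregate_labels(crowd_annotations)
print("accepted:", accepted)
print("needs review:", needs_review)
```

Simple majority voting treats every worker as equally reliable, which is a big assumption with an unknown crowd. In practice you would probably also want gold-standard check questions or per-worker accuracy tracking, and that is part of the management overhead described above.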

Using a complete training data solution platform, your third option, is the most hands-off* of the bunch and delivers the highest accuracy and specificity. And, counterintuitively, it usually works out to be the most cost-effective solution once you factor in the precious time the other two options demand for managing and QAing (not to mention the cost of temp hires or contractors). If you need specialized, highly accurate training data generated quickly, and a platform that encompasses everything from customizable annotation tools to skilled domain-specific annotators to an extensive QA process, a complete training data solution platform is your best bet.

*It can be somewhat hands-on if desired, too, at least with Spare5. We offer a customer portal through which you can monitor and manage your annotation task sets, if you wish.

It’s a pretty great time to be an NLP practitioner. The resources available for building, training, testing, and validating NLP models will only continue to grow in number, improve in functionality, and expand in features. We can’t wait to see (and power!) the coming years’ developments.

image credit: Mark Rasmuson via CC0 1.0