Labeling Data for your NLP Model: Examining Options and Best Practices

Ivan Lee
Datasaur
Aug 5, 2019

So you’re looking to deploy a new NLP model. Or perhaps one already exists, and your goal this quarter is to improve its precision or recall. You’ve tried multiple models and tweaked the parameters; now it’s time to feed in a fresh batch of labeled data. Your company has real-world data readily available, but it needs to be labeled so your model can learn to identify, classify and understand future inputs. This article will start with an introduction to real-world NLP use cases, examine options for labeling that data and offer insight into how Datasaur can help with your labeling needs.

A potential treasure trove of data

Labeling Options

Companies seeking to label their data have traditionally faced two classes of options. The first is to turn to crowd-sourcing vendors. A wave of companies offer services that take in client data and send it back with labels, functioning like an Amazon Mechanical Turk for AI. The advantage is access to armies of labelers at scale. However, because the labelers are paid on a per-label basis, incentives can be misaligned: labelers earn more by working faster, so there is a risk that quantity is prioritized over quality.

The other option is to build a labeling workforce in-house, using freely available software or developing internal labeling tools. Companies may opt for an internal workforce for the sake of quality, because of concerns about data privacy and security, or because the task requires expert labelers such as licensed doctors or lawyers. Some of our clients going this route used to turn to open-source options, or fall back on Microsoft Excel and Notepad++. Working with existing software can be the cheapest option upfront, but these tools are inefficient and lack key features. In the long term, teams end up incurring greater costs through wasted time and avoidable human mistakes. Others dedicate engineering resources to building ad-hoc web apps. While this can appeal to those with engineering roots, it is expensive to spend valuable engineering time reinventing the wheel and then maintaining the resulting tool.

State of the art…?

Datasaur to the Rescue

It was against this existing landscape that we started Datasaur. Our mission is to build the best data labeling tools so you don’t have to. Our existing text labeling tools are designed with the data labeler in mind. We understand your labelers deserve an interface attuned to their needs, providing all necessary supplementary information at a glance while keyboard shortcuts keep them working as efficiently as only a power user can.

We are also dedicated to building additional features based on lessons learned from years of experience managing labeling workforces. A team manager can assign multiple labelers to the same project and require consensus before a label is accepted. Built-in intelligence will leverage existing NLP advances so that labeling is more efficient and the output is of higher quality. Why should your labelers have to label “Nicole Kidman” as a person, or “Starbucks” as a coffee chain, from scratch? Our models can pre-label some of your data, or be used to validate the work of human labelers, combining the best of human judgment and machine intelligence.
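To make the two ideas above concrete, here is a minimal Python sketch of pre-labeling and consensus. It is an illustration only and does not show how Datasaur implements either feature: the lookup table `GAZETTEER` stands in for a real NLP model, and the function names, label names and `min_agreement` parameter are all hypothetical.

```python
from collections import Counter

# Hypothetical lookup table standing in for a real pre-labeling model.
# The label names are illustrative, not a fixed schema.
GAZETTEER = {
    "Nicole Kidman": "PERSON",
    "Starbucks": "COFFEE_CHAIN",
}


def pre_label(text):
    """Return sorted (start, end, label) spans for each known phrase in text."""
    spans = []
    for phrase, label in GAZETTEER.items():
        start = text.find(phrase)
        while start != -1:
            spans.append((start, start + len(phrase), label))
            start = text.find(phrase, start + 1)
    return sorted(spans)


def consensus(votes, min_agreement=2):
    """Return the winning label, or None if there is no clear consensus.

    A label wins only if it is the unique most common vote and at least
    min_agreement labelers chose it.
    """
    top = Counter(votes).most_common(2)
    if not top:
        return None
    label, count = top[0]
    tied = len(top) > 1 and top[1][1] == count
    if tied or count < min_agreement:
        return None
    return label


text = "Nicole Kidman ordered a latte at Starbucks."
suggestions = pre_label(text)  # spans a labeler could accept or correct

accepted = consensus(["PERSON", "PERSON", "ORGANIZATION"])
disputed = consensus(["PERSON", "ORGANIZATION"])  # None: send back for review
```

In this sketch a disputed item returns `None` rather than a guess, so a manager can route it to another labeler instead of accepting an uncertain label.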

Are you figuring out how to set up your labeling project? Do you have questions about best practices? Are you interested in learning more about Datasaur’s tools? Reach out to us at info@datasaur.ai.

