2nd Generation Crowd Annotation to Shift the AI-NLP Game

Nick Adams, Ph.D.
Jun 4
New crowd work tools are ready to take on big, complex language analysis projects and train the AI of the future.

They say data is the new oil — and AI will provide the oil derricks necessary to realize its immense value. By all accounts, we are in the early days of a boom. Forrester Research values the AI market at $1.2 trillion. Tractica values it at a more modest $10B, growing to $100B by 2025. Market Research Engine pegs it at $191B by 2024. And McKinsey Consulting expects AI to add $2.6 trillion of value annually in the marketing and sales sectors alone.

But not all AI is the same, because not all data are the same. While AI applications can sometimes quickly derive insights from numeric or image data, computers remain nearly as ineffectual as ever when faced with human language. The field of natural language processing (NLP) has progressed only slowly over the decades, and one fact remains the same: computers do not understand language the way we humans do. In fact, the best-performing NLP algorithms have all been trained (via supervised machine learning approaches) on data that was painstakingly labeled by humans. It’s no surprise that a raft of recent news stories report that a lot of whiz-bang AI is actually bootstrapped by factories of humans creating labeled training data. While this human-centric approach is very helpful for numeric and image data, it is absolutely necessary when working with natural language data.
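To make “supervised” concrete, here is a minimal sketch of that human-in-the-loop training pattern. The library (scikit-learn) and the toy labeled examples are illustrative assumptions of mine, not anything this post prescribes; the point is that a person assigns every label before the algorithm sees a single word.

```python
# A minimal sketch of supervised NLP training, assuming scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical human-labeled training data: a person read each text
# and tagged it before the algorithm ever saw it.
texts = [
    "The product arrived broken and support never replied.",
    "Fast shipping, and the quality exceeded my expectations.",
    "I want a refund; this is not what was advertised.",
    "Absolutely love it, will order again.",
]
labels = ["complaint", "praise", "complaint", "praise"]

# The model learns only to imitate the human judgments it was given;
# without those labels there is nothing to supervise.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["Terrible experience, the item never arrived."]))
```

Strip away the human-supplied labels and the algorithm has nothing to learn from, which is exactly why labeled training data has become so valuable.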

It may be better, then, to think of natural language data as the new shale oil. No typical oil derrick is going to extract its value. A lot of human effort will be required to build AI capable of understanding the $2.6T of marketing and sales insights lurking within the world’s marketing copy, business reports, and customer testimonials and complaints.

Fortunately, a new-generation natural language data labeling system is being made available by Thusly Inc. — a company I founded to help unlock and spread the knowledge and expertise that fuels our information economy. First-generation natural language labeling tools were of two types: expert-only or crowd-based. The expert-only tools — often called content analysis software, or CAQDAS — were designed for the highly skilled researcher who knows what to label in her documents and has only a few hundred documents to analyze. But they were almost never used for larger projects because they required the expert to first transfer her expertise (via onerous training and supervision) to other people so they could use the tool alongside her.

First-generation crowd annotation tools — still provided by Mechanical Turk, Figure Eight, and Alegion — were able to overcome some of these limitations on project size. Researchers using Mechanical Turk were even able to show that crowds of untrained online workers completing tasks independently could generate data labels as accurate as experts’ in a range of scenarios. But the tools were designed by engineers hacking together “better than nothing” user interfaces and experiences. They weren’t designed to meet the level of data rigor required by academic researchers and data scientists. They weren’t optimized to handle all the data, task, and people management required by larger, more complex annotation projects. And they weren’t designed around the importance of training data ‘traction’ (which I explain in a separate blog post). The first-gen tools don’t even support effective word-level annotation. Like the first-generation automobile, personal computer, or mechanized loom, the first-gen crowd tools were valiant efforts that left a lot of room for improvement.
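For readers wondering how untrained, independent workers can match experts, the usual trick is redundancy plus aggregation: have several workers label the same item and keep the majority answer. A rough sketch follows; the item IDs and labels are hypothetical, and real platforms layer on fancier quality controls, but majority voting captures the core idea.

```python
# A sketch of majority-vote aggregation over redundant crowd labels.
from collections import Counter

# Hypothetical raw annotations: item id -> labels from independent workers.
raw_annotations = {
    "doc-1": ["protest", "protest", "riot"],
    "doc-2": ["riot", "riot", "riot"],
    "doc-3": ["protest", "march", "protest"],
}

def majority_vote(votes):
    """Return the most common label and its share of the votes."""
    (label, count), = Counter(votes).most_common(1)
    return label, count / len(votes)

for item, votes in raw_annotations.items():
    label, agreement = majority_vote(votes)
    print(f"{item}: {label} (agreement {agreement:.0%})")
```

The agreement score doubles as a cheap quality signal: items where workers split are exactly the ones worth routing back to an expert.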

Now, a second-generation crowd annotation tool is finally available. It’s called TagWorks, and it has been designed, built, and rigorously tested by social scientists, data scientists, and a veteran engineering team. TagWorks is backed by the global leader in social science methods, SAGE Publishing, because it is optimized for the most exacting data quality standards, the most efficient data labeling processes, and the most user-friendly experiences. Our customers love TagWorks because it saves them months of project management effort while producing high-quality, science-grade data labels through a process they can actually validate, understand, and explain. And they love it because they no longer have to accept compromises: expert-only tools kept their projects too small for big-data insights, and first-gen crowd annotation tools were too cumbersome and clunky to manage, preventing researchers from gathering the granular annotations necessary to really dig for deeper insights.

Now, there’s a powerful all-in-one crowd annotation solution for researchers who want to go big and deep, without hiring full-time project managers and data scientists. And since TagWorks is the only second-generation crowd annotation system currently available, those who use it will not only reap the rewards of valuable insights, they will also get a leg up on the competition. To get in on this shale oil boom, shoot us an email at office@thusly.co, or visit our website — https://tag.works — to learn more.

Written by Nick Adams, Ph.D.

Here to help you take full advantage of your organization’s data and expertise.
