Annotations from a Crowd

Alibaba Tech
Feb 9, 2018

The New Approach to Information Extraction

This article is part of the Academic Alibaba series and is based on the paper “Adversarial Learning for Chinese NER from Crowd Annotations” by Yaosheng Yang, Meishan Zhang, Wenliang Chen, Wei Zhang, Haofen Wang, and Min Zhang, accepted at the 2018 AAAI Conference on Artificial Intelligence (AAAI 2018). The full paper can be read here.

Tagging the spans of text that carry particular kinds of information is essential for extracting information from documents. Taggers are usually trained on large amounts of data that have been labeled and annotated by domain experts, which is costly and slow to produce. An alternative, especially in new domains, is to crowdsource the training data. Although this approach is generally cheaper and faster, it can yield lower-quality annotations when contributors are not experts. This is one of the biggest limitations currently facing Named Entity Recognition (NER) systems.

Because Chinese lacks features such as morphological variation and capitalization, word segmentation is more difficult in Chinese text than in languages like English. This in turn makes it harder to extract information from machine-readable documents. Models trained on extensive data from one domain often perform poorly in another, especially on social media text, even when character-level tagging is used to mitigate the word segmentation problem.
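To make character-level tagging concrete, here is a minimal sketch in Python. It assumes a BIO labeling scheme, which is a common convention but not necessarily the one used in the paper, and the function name, example sentence, and `BRAND` label are hypothetical. Each character receives its own tag, so no word segmentation step is needed.

```python
def char_bio_tags(text, entities):
    """Turn entity spans into one BIO tag per character.

    entities: list of (start, end, label) tuples, end exclusive.
    """
    tags = ["O"] * len(text)
    for start, end, label in entities:
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"span ({start}, {end}) is outside the text")
        tags[start] = f"B-{label}"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining characters of the entity
    return tags


# Hypothetical e-commerce style sentence: "I want to buy a Huawei phone".
sentence = "我想买华为手机"
entities = [(3, 5, "BRAND")]                # characters 3-4 form the entity
tags = char_bio_tags(sentence, entities)

for char, tag in zip(sentence, tags):
    print(char, tag)
```

Because the tagger predicts one label per character, a segmentation error cannot split or merge an entity before the NER model ever sees it.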

Now, a group of researchers, including some from Alibaba Group, has developed a method called ALCrowd. ALCrowd uses adversarial training, which has yielded impressive results in image generation and natural language processing (NLP), to reduce the negative influence of inconsistent crowd annotators by identifying what the different annotators' labels have in common.
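As a rough illustration of what "consistency between annotators" means at the data level, the sketch below computes a per-character majority tag and agreement rate across several annotators. This is a simple voting heuristic, not ALCrowd's adversarial mechanism, and the function name and example annotations are hypothetical.

```python
from collections import Counter


def per_char_agreement(annotations):
    """Return (majority_tag, agreement_rate) for every character position.

    annotations: list of tag sequences, one per annotator, all the same length.
    """
    if len({len(a) for a in annotations}) != 1:
        raise ValueError("all annotators must tag the same number of characters")
    result = []
    for tags_at_pos in zip(*annotations):
        tag, count = Counter(tags_at_pos).most_common(1)[0]
        result.append((tag, count / len(annotations)))
    return result


# Three hypothetical annotators tagging the same 7-character sentence.
# Annotator 2 extends the entity two characters further than the others.
annotator_1 = ["O", "O", "O", "B-BRAND", "I-BRAND", "O", "O"]
annotator_2 = ["O", "O", "O", "B-BRAND", "I-BRAND", "I-BRAND", "I-BRAND"]
annotator_3 = ["O", "O", "O", "B-BRAND", "I-BRAND", "O", "O"]

agreement = per_char_agreement([annotator_1, annotator_2, annotator_3])
majority_tags = [tag for tag, _ in agreement]
```

Voting of this kind discards everything a dissenting annotator contributed at a position. An adversarial approach instead tries to learn representations that do not depend on who produced the label.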

The adversarial approach is implemented using a common and a private Bi-LSTM. The common Bi-LSTM captures annotator-generic information, that is, knowledge shared across annotators, while the private Bi-LSTM captures annotator-specific information. The team also adapted the LSTM-CRF model to tag the target text. An outline of this model is illustrated below.

Framework of the proposed model, comprising the Worker-Adversarial and Baseline parts
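
The interplay between the common Bi-LSTM and an annotator discriminator can be written as a generic adversarial objective. The formulation below is a standard shared-private sketch and is not taken from the paper: the symbols, the split of parameters, and the weight \lambda are all assumptions made for illustration.

```latex
% Illustrative notation (assumed, not from the paper):
%   \theta_c : common Bi-LSTM        \theta_p : private Bi-LSTM
%   \theta_t : LSTM-CRF tagger       \theta_d : annotator (worker) discriminator
%   \lambda > 0 : hypothetical weight on the adversarial term
\begin{align}
\hat{\theta}_d
  &= \arg\min_{\theta_d} \;
     \mathcal{L}_{\mathrm{adv}}(\theta_c, \theta_d) \\
(\hat{\theta}_c, \hat{\theta}_p, \hat{\theta}_t)
  &= \arg\min_{\theta_c, \theta_p, \theta_t} \;
     \mathcal{L}_{\mathrm{NER}}(\theta_c, \theta_p, \theta_t)
     - \lambda \, \mathcal{L}_{\mathrm{adv}}(\theta_c, \theta_d)
\end{align}
```

Here \mathcal{L}_{\mathrm{adv}} stands for the loss of predicting which annotator produced a labeled sentence from the common features, and \mathcal{L}_{\mathrm{NER}} for the tagging loss. In the first line \theta_c is held fixed, and in the second \theta_d is held fixed. The discriminator tries to identify the annotator, while the common Bi-LSTM is trained to make that hard, which pushes it toward features that do not depend on any one annotator.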

The proposed ALCrowd system was applied to Chinese NER in the Dialog and E-commerce domains, where it outperformed the baseline systems. These results indicate that crowd annotation is a feasible low-cost way to train an NER system, even when the annotations contain inconsistencies.
