The New Approach to Information Extraction
This article is part of the Academic Alibaba series and is taken from the paper entitled “Adversarial Learning for Chinese NER from Crowd Annotations” by Yaosheng Yang, Meishan Zhang, Wenliang Chen, Wei Zhang, Haofen Wang, and Min Zhang, accepted at the 2018 AAAI Conference on Artificial Intelligence (AAAI 2018). The full paper can be read here.
Tags that classify particular pieces of information in text are essential for extracting information from documents. Systems that assign these tags usually require vast amounts of training data that have been labeled and annotated by industry experts, which is a costly and slow process. An alternative, especially in new domains, is to crowdsource the training data. Although this approach is generally cheaper and faster, it can yield lower-quality annotations when contributors are not experts. This is one of the biggest limitations currently facing Named Entity Recognition (NER) systems.
Because Chinese text lacks cues such as morphological variation and capitalization, word segmentation is more difficult than in languages like English. This makes it harder to extract information from machine-readable documents. Applying extensive training data across domains often leads to poor results, especially in a social media context, even when character-level tagging is used to mitigate the word segmentation problem.
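To make character-level tagging concrete, here is a minimal sketch that turns entity spans into one tag per character, so no word segmentation is needed beforehand. It assumes the common BIO scheme (B = beginning, I = inside, O = outside); the example sentence, the span offsets, and the label names BRAND and PRODUCT are hypothetical and are not taken from the paper.

```python
def spans_to_char_tags(text, spans):
    """Convert entity spans into one BIO tag per character.

    spans: list of (start, end, label) tuples, with `end` exclusive.
    """
    tags = ["O"] * len(text)
    for start, end, label in spans:
        if not 0 <= start < end <= len(text):
            raise ValueError(f"span out of range: {(start, end, label)}")
        tags[start] = f"B-{label}"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining characters of the entity
    return tags


# Hypothetical sentence: "I want to buy an Apple phone"
text = "我想买苹果手机"
spans = [(3, 5, "BRAND"), (5, 7, "PRODUCT")]  # illustrative annotations only

char_tags = spans_to_char_tags(text, spans)
for ch, tag in zip(text, char_tags):
    print(ch, tag)
```

Each character receives its own label, so the tagger never has to decide where words begin and end.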
Now, a group of researchers, including members of Alibaba Group, has developed a method called ALCrowd. ALCrowd uses adversarial training, which has yielded impressive results in image generation and natural language processing (NLP), to reduce the negative influence of inconsistent crowd annotations by identifying where different annotators agree.
The adversarial approach is implemented using a common Bi-LSTM and a private Bi-LSTM. The common Bi-LSTM captures annotator-generic information, that is, knowledge shared across annotators, while the private Bi-LSTM captures annotator-specific information. The team also adapted the LSTM-CRF model to tag the target text. An outline of this model is illustrated below.
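Separately from the outline, the tagging step of an LSTM-CRF model can be sketched in a few lines. In such a model the LSTM features give a score for each tag at each character, and the CRF layer adds tag-to-tag transition scores and picks the best-scoring sequence overall. The sketch below shows only that decoding step (Viterbi search) in plain Python. It is not the authors' implementation and does not cover the adversarial training: the function name `viterbi_decode`, the tag set, and all scores are made up for illustration.

```python
def viterbi_decode(emissions, transitions, tags):
    """Return the highest-scoring tag sequence and its score.

    emissions: list of dicts, one per character, mapping tag -> score
    transitions: dict mapping (prev_tag, tag) -> score; missing pairs score 0.0
    """
    if not emissions:
        return [], 0.0
    # best[t][tag] = best score of any path that ends in `tag` at position t
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []  # back[t - 1][tag] = best previous tag for `tag` at position t
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for tag in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions.get((p, tag), 0.0))
            scores[tag] = best[-1][prev] + transitions.get((prev, tag), 0.0) + emissions[t][tag]
            pointers[tag] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag to recover the path
    last = max(tags, key=lambda tag: best[-1][tag])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


# Made-up scores for three characters; in a real tagger they would come from the network
tags = ["O", "B-BRAND", "I-BRAND"]
emissions = [
    {"O": 2.0, "B-BRAND": 0.5, "I-BRAND": 0.1},
    {"O": 0.4, "B-BRAND": 1.0, "I-BRAND": 1.2},
    {"O": 0.3, "B-BRAND": 0.2, "I-BRAND": 1.5},
]
transitions = {("O", "I-BRAND"): -10.0}  # discourage an entity that starts with I-

path, score = viterbi_decode(emissions, transitions, tags)
print(path, score)
```

With these scores, choosing the best tag independently at each position would give O, I-BRAND, I-BRAND, which starts an entity with an inside tag. The transition penalty steers the decoder to O, B-BRAND, I-BRAND instead, which is the kind of sequence-level consistency a CRF layer contributes.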
The proposed ALCrowd system was used for Chinese NER in the Dialog and E-commerce domains and achieved positive results, outperforming baseline systems. These results verify that crowd annotation is a feasible low-cost solution for training an NER system, even if it contains inconsistencies.