Coner OST Alpha Phase III Blog #3: The Bigger Picture

Daniel
7 min read · Aug 2, 2018


After last week’s technical halfway update on Coner, and its introduction blog the week before, I would like to take this week’s Coner blog post as an opportunity to illustrate the bigger picture of why human feedback on extracted document entities matters. I will also discuss the challenges of proper user incentivisation (how to motivate users to deliver quality work, and lots of it) and how an OST-powered branded token economy can be leveraged to boost user engagement! Finally, I have summarised my progress on the Coner POC this week at the bottom of this post, in case you are a bit short of time to read the whole article (even though I recommend it ;) )

Intelligent document search in digital libraries has been a long-standing challenge, as it requires automatic, high-precision deep-metadata generation for each document uploaded to such a digital library. In contrast to “metadata”, which describes properties of a paper (title, authors, citations, etc.), “deep-metadata” is information about the meaning of the actual content (full text) of a paper. Examples of such deep-metadata are automatically generated lists of entities of different types that occur in a paper, for instance entities that describe a dataset or method used. If we could accurately extract these typed entities, it would allow for much more intelligent search and exploration of digital libraries, because we could search for papers based on occurrences of all kinds of entities. An example user query could be “Which methods are commonly applied to Wikipedia’s full text dataset?”.
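
To make this concrete, here is a minimal Python sketch (with made-up paper titles and entity lists, not Coner’s actual data model) of how typed entities attached to papers could answer exactly that kind of query:

```python
# Hypothetical deep-metadata: each paper carries lists of typed entities
# extracted from its full text (illustrative values, not real Coner output).
papers = [
    {"title": "Paper A", "dataset": {"wikipedia full text"}, "method": {"lda", "word2vec"}},
    {"title": "Paper B", "dataset": {"clueweb12"},           "method": {"bm25"}},
    {"title": "Paper C", "dataset": {"wikipedia full text"}, "method": {"crf"}},
]

# "Which methods are commonly applied to Wikipedia's full text dataset?"
methods = [m for p in papers if "wikipedia full text" in p["dataset"] for m in p["method"]]
print(sorted(methods))  # ['crf', 'lda', 'word2vec']
```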

Coner Workflow

Earlier this year a scientific paper was published that describes an approach to extract entities of different types in a flexible, lightweight and low-cost fashion [1]. The figure above shows an overview of their Named Entity Recognition algorithm (from now on called NER) and how the Coner system is incorporated to boost its performance. I know, at first glance it seems like an overload of information for just one figure, but I will try my best to gently guide you through the steps of this approach!

With NER, in order to train a classification model for a new type of entity, all you have to provide is a list of seed terms (between 5 and 50) that are so-called “gold standard” (every term in that list has been determined to be of that type by a domain expert) for the desired type [1]. The iterative algorithm then applies the following 4 steps a dynamic number of times until performance converges and its optimum is reached:

  1. Generate Training Data: NER uses the seed term list to extract training sentences from a corpus of over 11,000 documents and annotates all occurrences of typed entities in each of these sentences.
  2. Train Model: The annotated training data is formatted and labelled so it can be used to train any state-of-the-art machine learning model (they used a CRF). The resulting model is capable of extracting typed entities from raw text.
  3. Extract Entities: Entities are extracted from all documents in the corpus with the model trained in step 2. The resulting set of entities contains a lot of noise, so further filtering is required to reach meaningful accuracy.
  4. Filter Entities: An ensemble of heuristic filters is applied to determine which entities are relevant. For example, all entities that are stopwords and/or common words in the English language are removed. The final set of entities can then be used as the input seed term list for step 1 of the next iteration of this algorithm.

The steps described above are highly simplified to keep them readable for readers without background knowledge of the domain, so please feel free to read the full paper if you are interested in more details about NER!
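
To illustrate the shape of this loop, here is a minimal Python sketch; the four step functions are passed in as placeholders, and this is in no way the authors’ actual implementation:

```python
def ner_bootstrap(seed_terms, corpus, generate_training_data, train_model,
                  extract_entities, filter_entities, max_iterations=5):
    """Sketch of the iterative NER loop from [1]. The four step functions are
    supplied by the caller as placeholders; not the paper's real implementation."""
    entities = set(seed_terms)
    for _ in range(max_iterations):
        # 1. Generate Training Data: annotate sentences that mention known entities.
        training_sentences = generate_training_data(entities, corpus)
        # 2. Train Model: fit a sequence model (e.g. a CRF) on the annotated sentences.
        model = train_model(training_sentences)
        # 3. Extract Entities: run the trained model over the whole corpus.
        candidates = extract_entities(model, corpus)
        # 4. Filter Entities: heuristic filters (stopwords, common English words, ...).
        filtered = filter_entities(candidates)
        if filtered == entities:  # entity set stopped changing -> converged
            break
        entities = filtered       # the filtered entities seed the next iteration
    return entities
```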

The biggest shortcoming of NER is that the filtering setup is simplistic and based on assumptions about the semantic relatedness and context of extracted entities. In general, machines are much weaker than humans at recognising the meaning of an entity that occurs at a specific place in text written in natural language.

This is the key motivation for Coner! Use human judgement of entity relevance to support and overrule decisions made automatically by machines!

Coner aims to boost the precision of NER’s filtering step (step 4) with three novel Coner pipeline modules (shown on the right side of the figure earlier in this post):

  1. Document Analyser: Selects representative papers from the document corpus based on paper selection criteria such as availability of a PDF, the number of times the publication has been cited, the number of distinct filtered extracted entities and the conference it was published at.
  2. Coner Interactive Document Viewer: Online interactive viewer that visualises the automatically annotated entities and allows users to interact with them by giving feedback on existing annotations or adding new entities.
  3. Human Feedback Analyser: Calculates entity type labels for each entity that received human feedback. An entity is labelled as an entity type when the majority of evaluators rated it as ’relevant’ for that type (a minimal sketch of this rule follows below).

The resulting entity feedback is incorporated in the filter step of the next training iteration of NER to boost model performance.
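
As a rough illustration of the majority-vote rule in the Human Feedback Analyser, here is a minimal Python sketch (the data shapes are assumptions, not the actual Coner implementation):

```python
from collections import Counter

def label_entity(feedback):
    """feedback: list of (entity_type, rating) tuples from evaluators,
    e.g. [("dataset", "relevant"), ("dataset", "irrelevant"), ...].
    Returns the set of types for which a majority rated the entity 'relevant'."""
    relevant_votes, total_votes = Counter(), Counter()
    for entity_type, rating in feedback:
        total_votes[entity_type] += 1
        if rating == "relevant":
            relevant_votes[entity_type] += 1
    return {t for t in total_votes if relevant_votes[t] > total_votes[t] / 2}

# Example: two of three evaluators rate "dataset" as relevant -> labelled as dataset.
print(label_entity([("dataset", "relevant"),
                    ("dataset", "relevant"),
                    ("dataset", "irrelevant")]))  # {'dataset'}
```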

So why OST?

For the past few months, ever since I joined the OST Alpha Phase II challenge in May, I have presented this project as the Coner Interactive Document Viewer, which is in fact only one of the three modules that make up the complete Coner pipeline. This was a conscious decision, because the online viewer is the part that users interact with and that allows for the crowdsourcing of user feedback. The unique challenges of crowdsourcing user feedback have been widely investigated. First, task formulation should be done with fraudulent workers in mind [2, 3]. Also, proper incentivisation mechanisms for truthful evaluation and annotation are essential to ensure feedback quality [4, 5]. There seems, however, to be far less research on token gamification mechanisms with blockchain technology. This is why I was inspired to use OST branded tokens as a gamification method! It’s a new twist on more traditional gamification, where users are rewarded with fiat, coupons, credits or even college brownie points. That approach does not allow for a more advanced reward economy, where a true community of users is formed through OST-powered transactions, like content-creator-to-evaluator rewards, the ability to gift Coner tokens to a friend or fellow researcher you believe in, or an airdrop grant from the Coner company to upload your own document!

Now that you hopefully have a better idea of the big picture of the Coner feedback system, I can finally explain what the name “Coner” means, because it’s actually an acronym for Collaborative Named Entity Recognition = Coner!

Progress of Coner POC

  • Integrated the first version of the smart entity selection mechanism to make human feedback scalable and to maximise the potential information gain from each instance of feedback. Instead of relying on the system’s users to decide for themselves which entities to provide feedback on (users usually opted to give feedback on nearly all entities), the process is actively steered by only selecting entities that were kept by the filtering step and were doubly classified, i.e. recognised as belonging to multiple types by the trained models. Entities almost never genuinely belong to multiple types (e.g. a single entity is not a dataset AND a method), so this is exactly where humans are much better at separating the specific types (see the sketch after this list).
  • Built the per-paper Document CNR Token Pool, where tokens are taken from Content Creators to act as a paper’s feedback budget. Also updated the reward actions so Document Evaluators are rewarded from the budget pool created for each document instead of receiving rewards from the company directly, and extended the ApiClient with API call methods for all new transactions (e.g. gifting tokens, fetching a user’s ledger, the different reward transactions).
  • Added live Material Snackbar notifications when you receive a CNR reward.
  • Set up a React.js component for the CNR User Wallet, to make it reusable across multiple places in the web application.
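
Here is a minimal Python sketch of that selection idea (hypothetical function and data shapes, not the actual Coner code): only entities that survived the filtering step and were assigned more than one type by the trained models are put forward for human feedback.

```python
def select_for_feedback(filtered_entities, predicted_types):
    """filtered_entities: set of entities kept by the NER filtering step.
    predicted_types: dict mapping each entity to the set of types the trained
    models assigned to it (illustrative shapes, not Coner's own data model)."""
    return [e for e in filtered_entities if len(predicted_types.get(e, set())) > 1]

# Example: "wikipedia" was classified as both dataset and method, so it is
# selected for human feedback; "svm" (only classified as method) is not.
print(select_for_feedback(
    {"wikipedia", "svm"},
    {"wikipedia": {"dataset", "method"}, "svm": {"method"}},
))  # -> ['wikipedia']
```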

I had to overcome some challenges with the internal state management of the client-side React.js components and the asynchronous API calls to the OST API v1.1, to ensure the CNR balance stays in sync at all times between the live CNR ticker in the navigation bar (see blog post #2 for more details) and the other places in the application where it is shown.

Time flies! Next week is the final week of the #OSTa3 challenge. I will work on finishing all the features mentioned in the previous blogs and give a comprehensive overview of what I’ve built over the past month!

References

[1] Sepideh Mesbah, Alessandro Bozzon, Christoph Lofi, and Geert-Jan Houben. Long-tail entity extraction with low-cost supervision. https://2018.eswc-conferences.org/paper_8/, 2018.

[2] Carsten Eickhoff and Arjen de Vries. How crowdsourcable is your task. In Proceedings of the workshop on crowdsourcing for search and data mining (CSDM) at the fourth ACM international conference on web search and data mining (WSDM), pages 11–14, 2011.

[3] Tim Finin, Will Murnane, Anand Karandikar, Nicholas Keller, Justin Martineau, and Mark Dredze. Annotating named entities in twitter data with crowdsourcing. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, CSLDAMT ’10, pages 80–88, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.

[4] Ece Kamar and Eric Horvitz. Incentives for truthful reporting in crowdsourcing. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems — Volume 3, AAMAS ’12, pages 1329–1330, Richland, SC, 2012. International Foundation for Autonomous Agents and Multiagent Systems.

[5] Luca de Alfaro, Marco Faella, Vassilis Polychronopoulos, and Michael Shavlovsky. Incentives for truthful evaluations. arXiv preprint arXiv:1608.07886, 2016.


Daniel

MSc. Data Science student at TU Delft | Full-Stack Web Developer | Blockchain enthusiast