Ambiverse - Natural Language Understanding in the Cloud

Linus Kohl
Jun 5, 2019 · 4 min read

While doing performance benchmarks for Named Entity Linking solutions, I stumbled upon a very powerful, recently open-sourced framework called AmbiverseNLU. It was developed by Ambiverse and is based on work previously done at the Max Planck Institute¹. The components it uses are better known: entity recognition from KnowNER², open information extraction using ClausIE³, and AIDA, an entity detection and disambiguation tool⁴.

You can have a look at the demo here.

Ambiverse basically provides two services:

  1. Ambiverse NLU, the natural language understanding suite, which performs Named Entity Linking, and the
  2. Knowledge Graph Service, which provides an interface to YAGO, a separate project from the Max Planck Institute that extracted 10 million entities and 120 million facts about them from Wikipedia, WordNet and GeoNames. More details available here.

For the former you can choose whether to use Apache Cassandra or PostgreSQL as a backend, while the latter uses Neo4j. All components are already dockerized and available in the official GitHub repository. As we required further customization of the containers, I forked the repository and use my own images.

Google Cloud deployment

Before deployment you should adapt the settings in the main configuration file to wire the different components together and choose between different versions of datasets and languages. For testing purposes I would recommend using the smaller ones first.

It is pretty straightforward to get the whole thing up and running: all you need to do is apply the configuration files consecutively and check that services and workloads are created successfully. The Neo4j and Cassandra images download and import large amounts of data, so it takes a while for the services to start.
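
As a rough sketch of what "applying the configuration files consecutively" can look like, the snippet below assumes the configuration files are Kubernetes manifests applied with kubectl. The file names are hypothetical placeholders, not the names used in the repository, and the helper defaults to a dry run that only prints the commands.

```python
import subprocess
from typing import List, Sequence

# Hypothetical manifest names, listed in the order they should be applied.
MANIFESTS = ["00-config.yaml", "10-databases.yaml", "20-services.yaml"]


def build_apply_commands(manifests: Sequence[str]) -> List[List[str]]:
    """Return one `kubectl apply -f <file>` command per manifest, in order."""
    return [["kubectl", "apply", "-f", path] for path in manifests]


def apply_all(manifests: Sequence[str], dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in build_apply_commands(manifests):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first manifest that fails to apply.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    apply_all(MANIFESTS)  # dry run: only prints the commands
```

While the containers import their data, `kubectl get pods --watch` is one way to follow the workloads until they are ready.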

Usage

# pip install ambiverseclient

Now let's simply process a sample text and see the result:
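
A minimal sketch of such a call, using only the Python standard library rather than the wrapper library: it assumes a self-hosted AmbiverseNLU instance whose REST interface accepts a JSON document at `/factextraction/analyze` on localhost and returns a `matches` list. The URL, the request fields and the response keys are assumptions to check against your deployment, and the sample sentence is made up.

```python
import json
import urllib.request

# Assumption: a self-hosted AmbiverseNLU instance reachable at this address.
ANALYZE_URL = "http://localhost:8080/factextraction/analyze"


def build_request(text, doc_id="doc1", language="en", url=ANALYZE_URL):
    """Build the POST request carrying the document to analyze."""
    payload = {"docId": doc_id, "text": text, "language": language}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def analyze(text, **kwargs):
    """Send the text to the service and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(text, **kwargs), timeout=30) as resp:
        return json.load(resp)


def matched_mentions(result):
    """Return (mention text, entity id) pairs, tolerating missing keys."""
    return [
        (m.get("text"), m.get("entity", {}).get("id"))
        for m in result.get("matches", [])
    ]


if __name__ == "__main__":
    sample = "Siemens opens a new research campus in Berlin."  # made-up text
    print(matched_mentions(analyze(sample)))
```

Keeping request building separate from the network call makes it possible to inspect the payload without a running service.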


Performance

This results in mean values of nr_words: 11, nr_matches: 1.46 and a duration of 0.71 s per headline.
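
Summary values of this kind, and a correlation between headline length and duration, could be derived from per-headline records roughly as follows. The records here are made-up illustrative numbers, not the measurements reported in this article, and the field names simply mirror the ones used above.

```python
from math import sqrt
from statistics import mean

# Made-up illustrative records, NOT the measurements from this benchmark.
records = [
    {"nr_words": 6, "nr_matches": 0, "duration": 0.2},
    {"nr_words": 9, "nr_matches": 1, "duration": 0.4},
    {"nr_words": 12, "nr_matches": 2, "duration": 0.6},
    {"nr_words": 15, "nr_matches": 1, "duration": 0.8},
]


def column(rows, key):
    """Extract one field from every record."""
    return [row[key] for row in rows]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Mean of each field over all headlines.
summary = {
    key: mean(column(records, key))
    for key in ("nr_words", "nr_matches", "duration")
}
# Linear relationship between headline length and processing time.
r = pearson(column(records, "nr_words"), column(records, "duration"))

if __name__ == "__main__":
    print(summary, r)
```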

The distribution of how long the processing with Ambiverse took and the correlation between the length of a headline and the duration is as follows:

Tests of precision (presenting only correct entities) and recall (the ability to recognize all relevant entities) are about to follow for our individual case. There is, however, a comparative study by Gagnon et al.⁷ that measured the performance of semantic annotation systems with regard to semantic annotation, NER and keyword detection. The paper uses the AI, IITB⁸ datasets for the evaluation of semantic annotation, the MSNBC dataset for the evaluation of named entity recognition, and the SemEval⁹ and Inspec¹⁰ datasets for the evaluation of keyword extraction. The overall results considering spotting and disambiguation in relation to other solutions from the paper are displayed below.
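
For reference, precision and recall for entity annotations can be computed as in the following minimal sketch. It treats an annotation as correct only if both the mention and the linked entity id match the gold standard exactly, which is one possible matching rule and not necessarily the one used in the paper. The annotations and ids are hypothetical placeholders.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of annotations."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of predicted annotations that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold annotations that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical (mention, entity id) pairs for a single headline.
gold = {("Berlin", "E1"), ("Siemens", "E2"), ("Merkel", "E3")}
predicted = {("Berlin", "E1"), ("Siemens", "E9")}  # second link is wrong

if __name__ == "__main__":
    print(precision_recall(predicted, gold))
```

In this toy case one of the two predictions is correct and one of the three gold entities is found.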

Conclusion

[1]: Lizenz für besseres Textverständnis (License for better text understanding). https://www.mpg.de/11670355/ambiverse-lizenz-yago

[2]: KnowNER: Incremental Multilingual Knowledge in Named Entity Recognition. https://arxiv.org/pdf/1709.03544.pdf

[3]: ClausIE: Clause-Based Open Information Extraction. http://resources.mpi-inf.mpg.de/d5/clausie/clausie-www13.pdf

[4]: Robust Disambiguation of Named Entities in Text. http://aclweb.org/anthology/D11-1072

[5]: Repository containing the Kubernetes configurations. https://github.com/linuskohl/ambiverse-nlu-kubernetes

[6]: Python wrapper library. https://github.com/linuskohl/ambiverse-client

[7]: An Analysis of the Semantic Annotation Task on the Linked Data Cloud, Gagnon et al. (2018). https://arxiv.org/pdf/1811.05549.pdf

[8]: Collective annotation of Wikipedia entities in web text. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, Kulkarni, S., Singh, A., Ramakrishnan, G., & Chakrabarti, S. (2009, June) (pp. 457–466). https://www.cc.gatech.edu/~zha/CSE8801/query-annotation/p457-kulkarni.pdf

[9]: SemEval Portal. https://aclweb.org/aclwiki/SemEval_Portal

[10]: Improved automatic keyword extraction given more linguistic knowledge. In Proceedings of EMNLP 2003, Anette Hulth (pp. 216–223). https://www.aclweb.org/anthology/W03-1028
