Ambiverse - Natural Language Understanding in the Cloud
While benchmarking Named Entity Linking solutions, I stumbled upon a very powerful, only recently open-sourced framework called AmbiverseNLU. It was developed by Ambiverse and is based on work previously done at the Max Planck Institute¹. The components it uses are better known: entity recognition from KnowNER², open information extraction using ClausIE³, and AIDA, an entity detection and disambiguation tool⁴.
You can have a look at the demo here.
Ambiverse basically provides two services:
- Ambiverse NLU, the natural language understanding suite, which does Named Entity Linking, and the
- Knowledge Graph Service, which provides an interface to YAGO, another project from the Max Planck Institute that extracted 10 million entities and 120 million facts about them from Wikipedia, WordNet and GeoNames. More details are available here.
For the former you can choose whether to use Apache Cassandra or PostgreSQL as a backend, while the latter uses Neo4j. All components are already dockerized and available in the official GitHub repository. As we required further customization of the containers, I forked the repository and use my own images.
Google Cloud deployment
For easy deployment on Google Kubernetes Engine I created a repository containing the respective YAML files⁵. For performance reasons I chose Cassandra over PostgreSQL to host the AmbiverseNLU database. You should not try to run the setup on a cluster with tight memory or CPU constraints, or you will very likely run into problems. The AmbiverseNLU deployment in particular needs a lot of memory (more than 16 GB), which is why I use a nodeSelector to assign it to nodes with higher capacity.
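To illustrate the nodeSelector idea, here is a minimal sketch that builds the relevant pod spec fragment as a Python dictionary and prints it as JSON, which kubectl accepts just like YAML. The label key `pool`, the label value `highmem`, the image name and the exact memory request are assumptions made for illustration and are not taken from the repository.

```python
import json


def nlu_pod_spec(node_label_value: str, memory: str = "16Gi") -> dict:
    """Build a pod spec fragment that pins the NLU container to larger nodes."""
    return {
        # Schedule only on nodes carrying this label (the key "pool" is hypothetical).
        "nodeSelector": {"pool": node_label_value},
        "containers": [
            {
                "name": "nlu",
                "image": "example/ambiverse-nlu:latest",  # placeholder image name
                "ports": [{"containerPort": 8080}],
                # The scheduler only places the pod on a node that can satisfy this request.
                "resources": {"requests": {"memory": memory}},
            }
        ],
    }


# Print the fragment so it can be pasted into a Deployment template.
print(json.dumps(nlu_pod_spec("highmem"), indent=2))
```

The nodes of the high-memory pool need to carry the matching label, otherwise the pod stays in the Pending state.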
Before deployment you should adapt the settings in the main configuration file to wire the different components together and choose between different versions of datasets and languages. For testing purposes I would recommend using the smaller ones first.
It is pretty straightforward to get the whole thing up and running: all you need to do is apply the configuration files one after another and check that the services and workloads are created successfully. The Neo4j and Cassandra images download and import large amounts of data, so it takes a while for the services to start.
There will be two services, knowledgegraph and nlu, both listening for connections on port 8080. There is no authorization implemented, so you should use the services only from inside the cluster or secure them by other means. I implemented a simple wrapper library⁶ in Python that also contains models for requests and responses.
You can install it by running:
# pip install ambiverseclient
Now let's process a sample text and see the result:
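As a minimal sketch that does not depend on the wrapper library, the nlu service can also be called with plain HTTP from inside the cluster. The host name `nlu` and port 8080 follow the service description above; the endpoint path `/entitylinking/analyze` and the request fields `text` and `language` are assumptions based on the upstream AmbiverseNLU documentation and may differ in your version.

```python
import json
import urllib.request

# Assumed endpoint; adjust host and path to match your deployment.
NLU_URL = "http://nlu:8080/entitylinking/analyze"


def build_request(text: str, language: str = "en", url: str = NLU_URL) -> urllib.request.Request:
    """Create the POST request carrying the text to analyze as JSON."""
    payload = json.dumps({"text": text, "language": language}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def analyze(text: str, language: str = "en", url: str = NLU_URL, timeout: float = 30.0) -> dict:
    """Send the text to the service and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(text, language, url), timeout=timeout) as resp:
        return json.load(resp)


# Usage (requires network access to the service):
# result = analyze("Jack founded Alibaba with investments from SoftBank and Goldman.")
# print(json.dumps(result, indent=2))
```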
To create a naive benchmark that considers speed only, I used the NewsAPI to fetch the latest 100 headlines and extracted entities using the following script:
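A minimal sketch of such a measurement loop, which is not the exact script used for the numbers reported here. The `link_entities` callable is a hypothetical stand-in for whatever function sends a headline to the nlu service and returns the list of matched entities; fetching the headlines from NewsAPI is left out.

```python
import time
from statistics import mean
from typing import Callable, Iterable, List


def benchmark(headlines: Iterable[str], link_entities: Callable[[str], List]) -> dict:
    """Time entity linking per headline and return the mean of each metric."""
    rows = []
    for headline in headlines:
        start = time.perf_counter()
        matches = link_entities(headline)  # hypothetical call into the nlu service
        duration = time.perf_counter() - start
        # Record word count, number of matched entities and elapsed seconds.
        rows.append((len(headline.split()), len(matches), duration))
    # statistics.mean raises StatisticsError if no headlines were given.
    return {
        "nr_words": mean(r[0] for r in rows),
        "nr_matches": mean(r[1] for r in rows),
        "duration": mean(r[2] for r in rows),
    }
```

Using `time.perf_counter` measures wall-clock time, so the duration includes the network round trip to the service as well as the processing itself.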
This results in mean values of nr_words: 11, nr_matches: 1.46 and a duration of 0.71 seconds per headline.
The distribution of processing times with Ambiverse and the correlation between headline length and duration are as follows:
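To quantify that relationship rather than only plot it, a Pearson correlation coefficient can be computed from the per-headline measurements. The following is a sketch that assumes two equal-length sequences, for example word counts and durations in seconds; it is a generic implementation and not part of the original benchmark.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

A value close to 1 would indicate that longer headlines take proportionally longer to process, while a value near 0 would suggest that the duration is dominated by fixed overhead.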
Tests of precision (returning only correct entities) and recall (the ability to recognize all relevant entities) for our individual use case are about to follow. There is, however, a comparative study by Gagnon et al.⁷ that measured the performance of semantic annotation systems with regard to semantic annotation, NER and keyword detection. The paper uses the AI, IITB⁸ datasets for semantic annotation evaluation, the MSNBC dataset for the evaluation of named entity recognition, and the SemEval⁹ and Inspec¹⁰ datasets for the evaluation of keyword extraction. The overall results for spotting and disambiguation, compared with the other solutions from the paper, are displayed below.
Ambiverse is a great tool and definitely worth a closer look. The next article will be about the Named Entity Classifier for Wikidata (NECKAr) from the University of Heidelberg, so stay tuned.
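For reference, precision and recall for entity linking can be computed from a set of predicted annotations and a set of gold annotations. This is a generic sketch, not the evaluation procedure of Gagnon et al.; representing each annotation as a hashable item such as a (mention, entity id) pair is an assumption made for illustration.

```python
from typing import Hashable, Set, Tuple


def precision_recall(predicted: Set[Hashable], gold: Set[Hashable]) -> Tuple[float, float]:
    """Return (precision, recall) of predicted annotations against gold annotations."""
    true_positives = len(predicted & gold)
    # Precision: share of predicted annotations that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold annotations that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

With three predictions of which two are correct, and four gold annotations in total, this gives a precision of 2/3 and a recall of 1/2.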
¹: Lizenz für besseres Textverständnis (License for better text understanding). https://www.mpg.de/11670355/ambiverse-lizenz-yago
²: KnowNER: Incremental Multilingual Knowledge in Named Entity Recognition. https://arxiv.org/pdf/1709.03544.pdf
³: ClausIE: Clause-Based Open Information Extraction. http://resources.mpi-inf.mpg.de/d5/clausie/clausie-www13.pdf
⁴: Robust Disambiguation of Named Entities in Text. http://aclweb.org/anthology/D11-1072
⁵: Repository containing the Kubernetes configurations. https://github.com/linuskohl/ambiverse-nlu-kubernetes
⁶: Python wrapper library
⁷: An Analysis of the Semantic Annotation Task on the Linked Data Cloud. Gagnon et al. (2018)
⁸: Collective annotation of Wikipedia entities in web text. Kulkarni, S., Singh, A., Ramakrishnan, G., & Chakrabarti, S. (2009). In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 457–466
⁹: SemEval Portal
¹⁰: Improved automatic keyword extraction given more linguistic knowledge. Anette Hulth (2003). In Proceedings of EMNLP 2003, pp. 216–223