This article summarizes the lessons I learned during my thesis. I wrote my thesis at the Department of Journalism and Communication Research in Hannover, Germany and worked together with the Newslab of the German Press Agency (dpa). The thesis was also part of the Newsstream project, in which journalists and developers are working together to create big data tools for journalists.
The subject of my work is comparing the performance of six tools for Named Entity Recognition (NER) and Named Entity Linking (NEL) on 200 German newswire articles.
To compare these tools I created an annotated corpus of 200 dpa texts with recognized and linked entities. To be included, a tool had to recognize entities and link them to a knowledge base such as Wikidata, be able to process German texts, and be accessible through an API. The analyzed tools are:
- Google Cloud Natural Language (GCNL)
- Microsoft Entity Linking Intelligence (MELI)
If you have never heard of Named Entity Recognition, the Wikipedia article gives a good overview. Here are the lessons I learned during my thesis:
1. NER is important and can be used for many things
NER is part of Information Extraction in computational linguistics and plays an important role in many NLP applications. Aaron Swartz's semantic web (or Web 3.0) can only work with NER. The German Press Agency can use NER to enrich its texts with more and better-structured metadata, and to simplify and automate the process of adding metadata for its journalists. Texts with more metadata can be found and organized more easily and offer extra value to customers.
Even in communication science, NER can be used to automate classical content analysis.
2. The creation of an annotated corpus is more difficult than I thought
For the creation of the annotated corpus I didn't start from scratch: I adapted the guidelines used for GermEval 2014 and used a modified version of the annotation tool WebAnno. Even with detailed guidelines (12 pages), it took a while to train the three annotators. On average they needed 12 minutes per text and reached an average reliability of Krippendorff's alpha α = 0.83. This value is acceptable, but it could be improved with better annotator training and by curating the results of several annotators.
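Krippendorff's alpha can be computed directly from the annotators' labels. Here is a minimal sketch for nominal data (my own illustration, not the evaluation script used in the thesis), where each unit is the list of labels the annotators assigned to it:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of per-unit label lists (one label per annotator).
    alpha = 1 - D_o / D_e, with observed and expected disagreement taken
    from the coincidence matrix of ordered label pairs within each unit.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a unit coded by a single annotator carries no pair information
        for c, k in permutations(labels, 2):
            coincidences[(c, k)] += 1 / (m - 1)
    n_c = Counter()
    for (c, _), weight in coincidences.items():
        n_c[c] += weight
    n = sum(n_c.values())
    d_o = sum(w for (c, k), w in coincidences.items() if c != k)
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n - 1)
    return 1.0 if d_e == 0 else 1 - d_o / d_e
```

For perfect agreement (e.g. `[["A", "A"], ["B", "B"]]`) the function returns 1.0; every disagreement pushes the value down, which is exactly what the α = 0.83 above measures across the three annotators.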
3. No tool performs really well
As I expected, the tools perform quite differently. The human annotators found 4,039 entities, while the tools found an average of 5,092. Google Cloud Natural Language and Microsoft Entity Linking Intelligence found fewer entities than the humans did; Textrazor found almost twice as many entities as the annotated corpus contains.
This difference is also reflected in the precision and recall values: tools that find more entities tend to have a higher recall.
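This trade-off is easy to reproduce. A small sketch with made-up character-offset spans (not the thesis data): an over-generating tool gets perfect recall but pays in precision, while a conservative tool shows the opposite pattern.

```python
def precision_recall_f1(gold, pred):
    """Micro precision, recall and F-measure for exact matches
    between a gold set and a predicted set of entity spans."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # entities found by the tool AND present in the corpus
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical (start, end) spans: the "greedy" tool finds everything in the
# gold standard plus two spurious entities; the "conservative" tool misses one.
gold = {(0, 5), (12, 20), (31, 38)}
greedy = {(0, 5), (12, 20), (31, 38), (40, 44), (50, 57)}
conservative = {(0, 5), (12, 20)}
```

Here `precision_recall_f1(gold, greedy)` yields precision 0.6 with recall 1.0, and `precision_recall_f1(gold, conservative)` yields precision 1.0 with recall 2/3.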
No tool reaches results close to the annotated corpus. Ambiverse achieves the best results with an F-measure of 67.57. The tools from the big technology corporations Google and Microsoft especially disappointed me.
4. A recognized entity is not automatically linked correctly
In a second step I checked whether the recognized entities are also linked correctly. Compared with the graph above, the linking scores across all tools and benchmarks are around 10 points lower than for recognition alone.
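The gap between recognition and linking can be illustrated by scoring the same output twice: once on spans alone, and once on spans plus knowledge-base link. A toy sketch with invented spans and links (not data from the thesis):

```python
def f_measure(gold, pred):
    """F-measure for exact matches between two sets of entity tuples."""
    tp = len(set(gold) & set(pred))
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# (start, end, knowledge-base link): the second entity is recognized with the
# correct span, but linked to the wrong (made-up) knowledge-base entry.
gold = [(0, 8, "wd:Hannover"), (24, 27, "wd:dpa")]
pred = [(0, 8, "wd:Hannover"), (24, 27, "wd:DPA_Motors")]

recognition = f_measure([e[:2] for e in gold], [e[:2] for e in pred])  # spans only
linking = f_measure(gold, pred)                                        # spans + link
```

Recognition scores a perfect 1.0 here, while linking drops to 0.5 because one correctly recognized entity points at the wrong knowledge-base entry.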
5. NER works really well for Persons
The tools extract four categories of entities: Persons, Organizations, Locations and Other. An analysis of the benchmarks per category shows that Persons are recognized best, while Locations have the highest recall. The precision for Locations is higher than for Organizations. For entities of the category Other, NER does not work well.
6. NER works the best for sport articles
The analyzed dpa texts are sorted into five sections: Arts & Culture, Economy, Sports, Politics and Miscellaneous. While precision and recall are highest for texts about sports, Miscellaneous has the lowest precision and Arts & Culture the lowest recall.
7. A combination of tools CAN improve the results (a bit)
If you combine tools and only keep the entities that several tools agree on, you can improve the results. Recall is highest for a combination of two tools, while precision is, logically, highest when all six tools overlap, at the price of a very low recall. A combination of four tools yields the highest F-measure of 73.71, which is only somewhat higher than the best F-measure reached by a single tool, Ambiverse (67.57).
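Taking the overlap of several tools can be sketched as voting over entity sets. A minimal example with hypothetical tool outputs (not the thesis data), keeping only the entities that at least `min_votes` tools found:

```python
from collections import Counter

def combine(tool_outputs, min_votes):
    """Return the entities recognized by at least `min_votes` tools."""
    votes = Counter()
    for entities in tool_outputs:
        votes.update(set(entities))  # each tool votes at most once per entity
    return {entity for entity, count in votes.items() if count >= min_votes}

tools = [
    {"Hannover", "dpa", "Merkel"},   # tool A
    {"Hannover", "dpa"},             # tool B
    {"Hannover", "Merkel", "IBM"},   # tool C
]
```

With `min_votes=2` the combination keeps `{"Hannover", "dpa", "Merkel"}`; with `min_votes=3` only `{"Hannover"}` survives. Lowering the threshold favors recall, raising it favors precision, which is exactly the trade-off described above.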
8. NER tools are easy to use
I am not a developer and my Python skills are limited, but this didn't stop me from accessing the tools' APIs. Most of the tools are well documented and often offer their own SDKs in different programming languages. All tools offer free trials, which I could use for my thesis. Integrating NER into an existing system is still complicated, but trying out NER via these APIs is very simple.
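Since every API returns differently shaped JSON, the main practical step is normalizing the responses into one common entity format. A sketch of that idea, with an invented response shape and key names (the real field names differ per tool and have to be looked up in each tool's API documentation):

```python
import json

def normalize(raw_response, name_key, type_key, link_key):
    """Map one tool's JSON entity list to common (name, type, link) tuples.

    The key-name parameters are placeholders: each tool's actual schema
    must be taken from its API documentation.
    """
    data = json.loads(raw_response)
    return [(entity[name_key], entity[type_key], entity.get(link_key))
            for entity in data["entities"]]

# Invented sample response, loosely shaped like an entity-analysis result:
sample = ('{"entities": [{"name": "Hannover", "type": "LOCATION", '
          '"url": "https://www.wikidata.org/wiki/Q1715"}]}')
```

Calling `normalize(sample, "name", "type", "url")` turns the response into `[("Hannover", "LOCATION", "https://www.wikidata.org/wiki/Q1715")]`, and writing one such adapter per tool is enough to feed all six outputs into the same benchmark code.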
It was to be expected that the tools perform worse on German texts than on English ones (see Rizzo, 2014), where tools reach F-measures close to 90. But even compared to the Named Entity extractors built for GermEval 2014, the tools underperform: while the GermEval systems reached an average F-measure of 69.33, the tools tested in my thesis stayed well below that (58.31).

The results are not good enough to automate the tagging of texts with keywords, but they are good enough for recommendations when adding metadata to articles in the newsroom. For further research, the quantity (more texts) and quality (curating several annotators) of the annotated corpus should be improved. A closer look at where the tools deviate from the corpus (different start or end positions, different links for the same word) would help to better understand their strengths and weaknesses. Since big software corporations such as Microsoft, Google, Amazon and IBM are starting to take an interest in NER, we can expect the tools to improve in the future.
I compared NER tools for the German Press Agency (dpa) and found that the six tools perform differently. Ambiverse reached the best overall results. NER works best for the category Persons and for texts about sports, which mention more persons than texts from other subject areas. A combination of several tools improves the results a bit. No tool performs well enough to replace human entity recognition, but NER tools can be used to recommend entities for markup to editors.
If you are interested in the code I used for this thesis, you can find it in this (quite unstructured) repository. The server of the annotation tool WebAnno will stay online for a few more days; if you want the link to the server or have any other questions, please contact me.