Who is in the News!

How often are people mentioned in news articles published online? 
How does this vary over time, and are the mentions of different people correlated?

I tried to answer these questions with a small project and a website. The results can be seen at http://in-the-news.stoeckl.ai/. The site is built with the Python microframework “Dash”, which uses the platform “Plotly” for the interactive charts.

The Data

The data comes from articles published by the Reuters news agency on its website www.reuters.com. At the moment about 70,000–80,000 news articles in English and German are indexed. The German articles run from 2015 to the present, the English ones from 2016. Part of the dataset is available on kaggle.com: https://www.kaggle.com/astoeckl/newsen

The Analysis

For each article, Named Entity Recognition (NER) is carried out with a machine learning algorithm to detect mentions of persons in the text. I used the Python library “spaCy”, as in the following example:

Persons and organisations in the news titles
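
A minimal sketch of such an NER step, assuming spaCy is installed together with one of its pretrained pipelines. The model name `en_core_web_sm` and the helper names `persons_and_orgs` and `analyze_titles` are illustrative choices, not necessarily what the project uses. spaCy's English pipelines label persons as `PERSON`, while its German pipelines use `PER`, so both are accepted here.

```python
def persons_and_orgs(doc, labels=("PERSON", "PER", "ORG")):
    """Return (text, label) pairs for person and organisation entities in a parsed doc."""
    return [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in labels]


def analyze_titles(titles, model_name="en_core_web_sm"):
    """Run a pretrained spaCy pipeline over news titles and collect the entities."""
    import spacy  # imported here so the helper above works without spaCy installed

    nlp = spacy.load(model_name)  # assumed model; it must be downloaded beforehand
    return [persons_and_orgs(doc) for doc in nlp.pipe(titles)]
```

Calling `analyze_titles` with a list of title strings returns one list of entity pairs per title. Which entities are found depends on the model that is loaded.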

This algorithm uses a model that was pretrained on a corpus of Google news articles for English and German. The lists of persons found in the articles are stored in a database and used to calculate the counts.
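
The counting step can be sketched as follows, assuming each article has already been reduced to a publication date and the list of person names found in it. The record layout and the function name `count_mentions_per_day` are hypothetical, since the actual database schema is not described here.

```python
from collections import Counter, defaultdict
from datetime import date


def count_mentions_per_day(articles):
    """Map each person to a Counter of {publication date: number of mentions}.

    `articles` is an iterable of (publication_date, persons) pairs, where
    `persons` lists every person mention found in one article.
    """
    counts = defaultdict(Counter)
    for published, persons in articles:
        for name in persons:
            counts[name][published] += 1
    return counts


# Hypothetical sample records, for illustration only.
articles = [
    (date(2018, 9, 1), ["Angela Merkel", "Donald Trump"]),
    (date(2018, 9, 1), ["Donald Trump"]),
    (date(2018, 9, 2), ["Angela Merkel"]),
]
counts = count_mentions_per_day(articles)
print(counts["Donald Trump"][date(2018, 9, 1)])  # 2
```

Summing one person's counter, for example `sum(counts[name].values())`, gives the overall total that a ranking of the most frequently mentioned persons could be based on.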

The Plots

I show a bar chart of the counts for the most frequently mentioned persons. For up to four of these persons, you can plot the time series of their counts together for a time period you select. For two persons, you can calculate their relation / correlation as a function of time.

Correlation of Persons

“Related Persons” measures how strongly two persons are correlated, in the sense that they are mentioned in the news on the same days. If both have the same counts every day, the correlation is 1. If one person always appears on days when the other does not, they are negatively correlated, with a value near -1. If there is no correlation, the value is near zero.

This measure varies over time, just as the relationship between the two persons may change. We calculate the correlation over a sliding time window of 30 days and plot these values as a function of time.
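
A sketch of this sliding-window calculation in plain Python, assuming the Pearson correlation coefficient is the measure used and that both persons have one count per day over the same range of days. The function names are illustrative. Windows in which one person's count is constant have no defined correlation and are reported as `None`.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None  # a constant series has no defined correlation
    return cov / sqrt(var_x * var_y)


def rolling_correlation(x, y, window=30):
    """Correlation of x and y over every run of `window` consecutive days."""
    if len(x) != len(y):
        raise ValueError("both series must cover the same days")
    return [
        pearson(x[i:i + window], y[i:i + window])
        for i in range(len(x) - window + 1)
    ]


# Hypothetical daily counts for two persons over 40 days.
a = [i % 3 for i in range(40)]
b = [2 - (i % 3) for i in range(40)]  # high exactly when `a` is low
same = rolling_correlation(a, a)      # 11 windows, each close to 1
opposite = rolling_correlation(a, b)  # 11 windows, each close to -1
```

Each entry of the returned list belongs to one 30-day window, so plotting the list against the window dates gives the correlation as a function of time.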

More details can be found in the article https://arxiv.org/abs/1809.06083.