Detecting Satire and Fake News with Machine Learning

Sometimes even humans find it hard to tell whether a news article is real, fake or satire. So I asked myself whether I could train a machine learning model to decide which class (real or satire) a given article belongs to. Websites like https://www.theonion.com publish satirical news every day; together with regular news sites, they can be used to collect training data for this classification problem.

Dataset

I collected large sets of German-language news articles from the websites of news agencies and newspapers, and from satirical news sites, for training and testing the model. In total I collected 63,868 articles from 2008 to 2018 and stored them in a local database.

Database of news articles

Implementation

To train the classifier I used the scikit-learn package with a linear Support Vector Classifier (SVC). The news texts were vectorized with a count vectorizer followed by tf-idf weighting.
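
A minimal sketch of such a pipeline, assuming scikit-learn's CountVectorizer, TfidfTransformer and LinearSVC with their default parameters. The exact classes and settings behind the reported results are not given here, and both the build_classifier helper and the tiny English example texts are made-up placeholders rather than part of the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def build_classifier():
    """Return an untrained text classifier (hypothetical helper)."""
    return Pipeline([
        ("counts", CountVectorizer()),   # raw token counts per article
        ("tfidf", TfidfTransformer()),   # re-weight the counts with tf-idf
        ("svc", LinearSVC()),            # linear support vector classifier
    ])


# Placeholder training data: invented texts, NOT taken from the real corpus.
texts = [
    "Parliament passes the new budget after a long debate",
    "Central bank leaves interest rates unchanged this quarter",
    "Local man wins an argument with his own reflection",
    "Scientists confirm that Mondays are legally optional",
]
labels = ["real", "real", "satire", "satire"]

clf = build_classifier()
clf.fit(texts, labels)

# Classify an unseen (also invented) headline.
prediction = clf.predict(["Government announces a plan to cut taxes"])
print(prediction[0])
```

Wrapping the three steps in a Pipeline means the vectorizer vocabulary and the tf-idf weights are learned from the training texts only and then reused unchanged at prediction time.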

Results

I used 80% of the data to train the classifier and the remaining 20% for testing. On the test set I achieved an accuracy of 0.996, a precision of 0.986, a recall of 0.952 and an F1 score of 0.969. The confusion matrix below shows the distribution of correct and incorrect classifications: only 11 real news articles were classified as satire, while 42 satirical texts were not detected as satire. These are quite good results.

Confusion matrix
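
For reference, all four scores follow directly from the confusion-matrix counts when satire is treated as the positive class: a real article classified as satire is a false positive, and a satirical text that goes undetected is a false negative. The sketch below shows the arithmetic. The helper name metrics_from_counts and the counts in the example call are made up for illustration and are not the counts from this experiment.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 for the positive class.

    tp: satire correctly detected        fp: real news flagged as satire
    fn: satire that was not detected     tn: real news correctly kept as real
    Assumes tp > 0, so none of the denominators is zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative counts only (hypothetical, not the real test set).
scores = metrics_from_counts(tp=90, fp=5, fn=10, tn=895)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Because real news far outnumbers satire in a setup like this, accuracy alone can look very high even when a noticeable share of satirical texts is missed, which is why precision and recall are reported alongside it.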

I think the presented method can be applied to other languages, and I would expect results similar to those for the German news.

Are computers better than humans at detecting satire in texts?

More details can be found in the article https://arxiv.org/abs/1810.00593