Detecting fake news about COVID-19 with Natural Language Processing

Melissa de Beyer · Published in Voice Tech Podcast · Oct 7, 2020

Since the outbreak of COVID-19, we’ve seen a surge in fake news exploiting public fear and uncertainty around the pandemic. The World Health Organization calls this spread of fake news an infodemic.

Fake news is not a new phenomenon, but this infodemic is exacerbated by the global scale of the COVID-19 emergency and the interconnected way information is shared via social media platforms.

“We’re not just fighting an epidemic; we’re fighting an infodemic.”- WHO Director-General

According to a report by the British Center for Countering Digital Hate, 90% of the 649 posts containing misleading information that the organization reported to Twitter and Facebook remained visible online.

In this article, I show how fake news about COVID-19 can be detected using Natural Language Processing (“NLP”) with an accuracy of 91%.

Exploratory data analysis
The dataset consists of 1,164 articles (578 fake articles collected primarily from Facebook and 586 real news articles collected primarily from the New York Times and Harvard Health Publishing). The dataset can be found here.
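
To follow along, the articles can be loaded into a pandas DataFrame. A minimal sketch is below; the file name is an assumption, while the title, text, and label columns are the ones used throughout the rest of the article.

import pandas as pd

# The file name is an assumption; use the CSV downloaded from the link above
news = pd.read_csv('corona_fake.csv')

print(news.shape)                    # expected: 1,164 articles
print(news['label'].value_counts())  # labels "FAKE" and "TRUE"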

Let’s first explore the content of the fake and real articles using word clouds. To create these word clouds, I remove punctuation, convert the text to lowercase, and drop all stop words (commonly used words such as “the”, “an”, and “a”).

import nltk
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from wordcloud import WordCloud

nltk.download('stopwords')

# Remove punctuation and lowercase the text
news['text'] = news['text'].str.replace(r'[^\w\s]', '', regex=True)
news['text'] = news['text'].str.lower()

# Remove English stop words
stop = set(stopwords.words('english'))
news_nostopwords = news.copy()
news_nostopwords['text'] = news_nostopwords['text'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in stop]))

# Word cloud for the fake articles (repeat with label "TRUE" for the real articles)
fake_news = news_nostopwords[news_nostopwords["label"] == "FAKE"]
all_words = ' '.join([text for text in fake_news.text])
wordcloud = WordCloud(width=800, height=500,
                      max_font_size=110,
                      collocations=False, colormap="OrRd").generate(all_words)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Word cloud fake news
Word cloud real news

The word clouds show similarities between fake and real articles in terms of wording, but also clear differences. Fake articles often contain words such as: ‘5G’, ‘Wuhan’, and ‘vitamin’. Real articles more often contain words such as: ‘said’, ‘disease’, ‘could’, ‘public’, and ‘would’.


The distribution of the most frequently used words in fake and real news articles can be visualized using bar charts:

import pandas as pd
import seaborn as sns
from nltk import tokenize

token_space = tokenize.WhitespaceTokenizer()

def counter(text, column_text, quantity):
    # Count word frequencies and plot the most common ones
    all_words = ' '.join([t for t in text[column_text]])
    token_phrase = token_space.tokenize(all_words)
    frequency = nltk.FreqDist(token_phrase)
    word_frequency = pd.DataFrame({"Word": list(frequency.keys()),
                                   "Frequency": list(frequency.values())})
    word_frequency = word_frequency.nlargest(columns="Frequency", n=quantity)
    plt.figure(figsize=(12, 8))
    ax = sns.barplot(data=word_frequency, x="Word", y="Frequency",
                     color=(0.10588, 0.61961, 0.46667))
    ax.set(ylabel="Count")
    plt.xticks(rotation='vertical')
    plt.show()

counter(news_nostopwords[news_nostopwords["label"] == "FAKE"], "text", 25)
counter(news_nostopwords[news_nostopwords["label"] == "TRUE"], "text", 25)
Distribution of popular words in fake articles
Distribution of popular words in real articles

The graph below shows the distribution of article lengths. Although the fake articles are on average slightly longer than the real news articles, the difference is not statistically significant.

import plotly.express as px

# Length of the articles (in words)
def text_length(x):
    if type(x) is str:
        return len(x.split())
    else:
        return 0

news['text_length'] = news['text'].apply(text_length)
nums_text = news.query('text_length > 0')['text_length']  # lengths of non-empty articles

fig = px.histogram(news, x="text_length", color="label",
                   marginal="box",
                   hover_data=news.columns, nbins=100,
                   color_discrete_sequence=px.colors.qualitative.Dark2)
fig.update_layout(title_text='Distribution of article length', template="plotly_white")
fig.show()
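
The significance of the length difference can be checked with a two-sample t-test on the word counts. Below is a minimal sketch, assuming SciPy is available; the choice of Welch’s t-test is an assumption, not necessarily how the difference was originally tested.

from scipy import stats

fake_len = news.loc[news['label'] == 'FAKE', 'text_length']
real_len = news.loc[news['label'] == 'TRUE', 'text_length']

# Welch's t-test does not assume equal variances; a large p-value is
# consistent with the length difference not being statistically significant
t_stat, p_value = stats.ttest_ind(fake_len, real_len, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")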

Modeling
To create a model that detects fake news, I build and test two Natural Language Processing models: a logistic regression model and a random forest model. In natural language processing, logistic regression is the baseline supervised machine learning algorithm for classification. A random forest, as its name implies, consists of a large number of individual decision trees. Each tree in the forest gives a class prediction, and the class with the most votes becomes the model’s prediction.

Not all articles have the same length, and not all words within an article are equally informative for distinguishing fake from real news. To account for this, I use a method called Term Frequency-Inverse Document Frequency (TF-IDF). The term frequency (TF) is the frequency of a specific word divided by the total number of words in the article. TF alone treats all words in the text as equally important. However, certain terms, such as “is”, “of”, and “that”, may appear many times but carry little meaning. Therefore, I weigh down the frequent terms while scaling up the rare ones using the inverse document frequency (IDF), computed as follows:

IDF(t) = log_e(Total number of documents / Number of documents with term t in it).
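
As a small illustration of how these weights behave, scikit-learn’s TfidfVectorizer combines the TF and IDF steps; the toy corpus below is purely an assumption for demonstration and not part of the article’s dataset.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the virus spreads in wuhan",
        "vitamin c cures the virus",
        "officials said the disease could spread"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# Words that occur in every document (e.g. "the") receive a lower weight
# than words that are rare across the corpus
print(vectorizer.get_feature_names_out())  # requires scikit-learn >= 1.0
print(tfidf.toarray().round(2))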

For grammatical reasons, articles contain different forms of a word, such as drive, drives, and driving. There are two methods to account for this in NLP: stemming and lemmatization. Stemming essentially chops off the last letters of a word until a stem is reached. This works well in many cases, but take the example of the word run: if a text contains the words ‘run’, ‘ran’, and ‘runner’, stemming would result in ‘run’, ‘ran’, and ‘run’, leaving the irregular form ‘ran’ untouched. Likewise, the word ‘easily’ becomes ‘easili’ after stemming. Thus, stemming does not handle irregular forms well and can produce stems that are not real words. In contrast to stemming, lemmatization looks beyond simple word reduction and considers a language’s full vocabulary to apply a morphological analysis to words: the lemma of ‘was’ is ‘be’ and the lemma of ‘mice’ is ‘mouse’. Therefore, I prefer lemmatization over stemming.
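
A quick sketch of the difference, assuming nltk and its ‘wordnet’ corpus are available (the example words are only illustrative):

import nltk
nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem('easily'))                  # 'easili' -- not a real word
print(stemmer.stem('running'))                 # 'run'
print(lemmatizer.lemmatize('was', pos='v'))    # 'be'
print(lemmatizer.lemmatize('mice', pos='n'))   # 'mouse'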

# Lemmatization
from nltk.stem import WordNetLemmatizer
from nltk.tokenize.treebank import TreebankWordDetokenizer

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = WordNetLemmatizer()

def tokenizer_lemmatizer_title(title):
    return [lemmatizer.lemmatize(word, pos="v") for word in w_tokenizer.tokenize(title)]

def tokenizer_lemmatizer_text(text):
    return [lemmatizer.lemmatize(word, pos="v") for word in w_tokenizer.tokenize(text)]

df['text_new'] = df.text.apply(tokenizer_lemmatizer_text)
df['title_new'] = df.title.apply(tokenizer_lemmatizer_title)

# Detokenize the lemmatized tokens back into strings and combine title and text
df['title_new'] = df['title_new'].apply(TreebankWordDetokenizer().detokenize)
df['text_new'] = df['text_new'].apply(TreebankWordDetokenizer().detokenize)
df['title_text'] = df['title_new'] + ' ' + df['text_new']
df.head()

To create and test the models, I split the dataset into a training dataset and a testing dataset. The testing dataset will not be used to create the model, but only to test its accuracy on new data provided to the model.
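
A minimal sketch of the split is shown below; the 25% test size and the random seed are assumptions, while title_text and label are the columns prepared above.

from sklearn.model_selection import train_test_split

# title_text (lemmatized title + body) is the input, label is the target;
# the test_size and random_state values here are assumptions
X_train, X_test, y_train, y_test = train_test_split(
    df['title_text'], df['label'], test_size=0.25, random_state=42)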

Logistic regression model
The logistic regression model has an average accuracy of 91.24%, meaning that the model correctly classifies 91.24% of the articles in the testing dataset as either fake or real. The confusion matrices (or error matrices) below show that 277 fake articles were correctly detected as fake by the model, while 24 fake articles were falsely detected as real.

Confusion matrix logistic regression (# articles)
Confusion matrix logistic regression (x100%)
# Vectorizing and applying TF-IDF
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, plot_confusion_matrix

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('model', LogisticRegression())])

# Fitting the model
model = pipe.fit(X_train, y_train)

# Accuracy
prediction = model.predict(X_test)
print("accuracy: {}%".format(round(accuracy_score(y_test, prediction)*100, 2)))

# Confusion matrices (raw counts and normalized)
titles_options = [("Confusion matrix, without normalization", None),
                  ("Normalized confusion matrix", 'true')]
for title, normalize in titles_options:
    disp = plot_confusion_matrix(model, X_test, y_test,
                                 display_labels=['Fake', 'Real'],
                                 cmap=plt.cm.Blues,
                                 normalize=normalize)
    disp.ax_.set_title(title)
    print(title)
    print(disp.confusion_matrix)

plt.show()

Random forest model
The random forest model reaches an accuracy of 89.52%, slightly lower than the logistic regression model. The confusion matrices below show that 271 fake articles were correctly detected as fake by the model, while 30 fake articles were falsely detected as real.

Confusion matrix random forest (# articles)
Confusion matrix random forest (x100%)
from sklearn.ensemble import RandomForestClassifier

# Vectorizing and applying TF-IDF
pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('model', RandomForestClassifier(criterion='entropy',
                                                  max_depth=20,
                                                  random_state=42))])

# Fitting the model
model = pipe.fit(X_train, y_train)

# Accuracy
prediction = model.predict(X_test)
print("accuracy: {}%".format(round(accuracy_score(y_test, prediction)*100, 2)))

Thank you for reading! If you have any questions or comments regarding this article, please feel free to comment below.
