What Can Natural Language Processing Teach Us About Fake News?

Daniel DiNicola
5 min read · Jul 2, 2020


Sourced From: Getty Images and BBC

Fake news is a problem. It radicalizes, it polarizes, and it injects into our discourse points of view that do not exist. Around this time last year, a Pew Research Center poll found that many Americans see made-up news as a bigger problem than racism, terrorism, and illegal immigration.

However, when you take a step back, the issue stems from a multitude of sources. Nation-states engage in the practice to foment chaos in other countries, and political actors push false narratives to increase or decrease voter participation.

Today, the novel coronavirus, or COVID-19, has put a halt to our economy and claimed the lives of thousands around the world. One issue you would hope would be beyond the maleficence of disinformation is a global pandemic. Unfortunately, that is not the case; the World Health Organization has described the phenomenon as an "infodemic."

Amid this infodemic, an online community arose, which I joined to tackle and better understand the differences between fake news and trustworthy news about COVID-19. The rest of this article discusses the work my colleagues and I did to uncover patterns in fake news about COVID-19.

Data Collection

The Poynter Institute for Media Studies is a non-profit journalism school and research organization located in St. Petersburg, Florida. In an attempt to help combat disinformation about the coronavirus, the institution in January set up an alliance uniting more than 100 fact-checkers worldwide to collect fake news stories about the virus.

Sourced From: https://www.poynter.org/ifcn-covid-19-misinformation/?covid_countries=48837&covid_rating=0&covid_fact_checkers=0

With the help of some great computer scientists, our team scraped 983 of these stories and placed them into a data frame. For the real news portion of this project, I used the News API, which lets you restrict a search to specific news organizations and topics. For this endeavor, I limited the sources to mainstream news organizations in the United States and kept the topic to the coronavirus. The result was a data frame of 1,983 rows, 1,000 of which I categorized as "Real" news and 983 as "Fake" news.
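The snippet below is a rough sketch of that News API call using the requests library; the API key, the source list, and the column-building logic are placeholders for illustration rather than the exact code used in the project.

import requests
import pandas as pd
# Hypothetical query against the News API "everything" endpoint.
params = {
    "q": "coronavirus",
    "sources": "cnn,nbc-news,the-washington-post",  # placeholder source list
    "language": "en",
    "pageSize": 100,
    "apiKey": "YOUR_NEWS_API_KEY",  # placeholder key
}
response = requests.get("https://newsapi.org/v2/everything", params=params)
articles = response.json().get("articles", [])
# Build the "Real" half of the data frame from titles and descriptions.
real_df = pd.DataFrame({
    "Text": [f'{a["title"]} {a.get("description") or ""}' for a in articles],
    "Label": "Real",
})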

Word Cloud

It’s not a real NLP project without a word cloud.
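For the curious, a word cloud like the one shown here can be generated with the wordcloud library. This is a minimal sketch, assuming the article text has already been concatenated into a single string called all_text.

from wordcloud import WordCloud
import matplotlib.pyplot as plt
# all_text is assumed to be every article's text joined into one string.
wc = WordCloud(width=800, height=400, background_color="white").generate(all_text)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()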

Pre-Processing

This data set did not require any noise removal, so I ran through some of the basics for textual data:

  1. Removing all punctuation.
  2. Lowercasing all words.
  3. Creating a text column stored as a list of words.
  4. Stemming and storing the result in a new column.
  5. Lemmatizing and storing the result in a new column.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmer = WordNetLemmatizer()
#Create a binary target: Real -> 0, Fake -> 1
df['Label'] = df['Label'].map({'Real': 0, 'Fake': 1})
#Remove all punctuation from our text data
df['Text'] = df['Text'].str.replace(r'[^\w\s]', '', regex=True)
#Make sure all the words are lowercase
df['Text'] = df['Text'].str.lower()
#Turn the text column into a list of words
df['Text_as_list'] = df['Text'].str.split()
#Create our stemmed data
df['stemmed_text'] = df['Text_as_list'].apply(lambda x: [stemmer.stem(y) for y in x])
#Create our lemmed data
df['lemmed_text'] = df['Text_as_list'].apply(lambda x: [lemmer.lemmatize(y) for y in x])

Now that that's done, let's talk about modeling and the results.

Modeling

Pipelines and grids are great tools. Since I want to try both a CountVectorizer (CVEC) and a TfidfVectorizer (TFIDF), setting up a pipeline lets me fit, transform, and predict on the training data and transform the test data without doing each step individually.

Also, the parameters set within the pipeline are broad. I varied the n-gram range across (1,2), (1,3), and (1,4). I let the max_df parameter range from 75% to 95%, since some of those settings exclude words that appear in nearly every document, like "coronavirus," from that particular model's decision. You can see the full pipeline below.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression

#Count-vectorizer + logistic regression pipeline
pipe_c_lr = Pipeline([('cvec', CountVectorizer()),
                      ('lr', LogisticRegression(solver='liblinear'))])
pipe_para_cvec = {
    'cvec__max_features': [100, 500, 1000],
    'cvec__ngram_range': [(1,2), (1,3), (1,4)],
    'cvec__stop_words': ['english', None],
    'cvec__min_df': [2, 5, 10],
    'cvec__max_df': [.75, .80, .85, .95]
}
#TF-IDF + logistic regression pipeline
pipe_t_lr = Pipeline([('tvec', TfidfVectorizer()),
                      ('lr', LogisticRegression(solver='liblinear'))])
pipe_para_tfidf = {
    'tvec__max_features': [100, 500, 1000],
    'tvec__ngram_range': [(1,2), (1,3), (1,4)],
    'tvec__stop_words': ['english', None],
    'tvec__min_df': [2, 5, 10],
    'tvec__max_df': [.75, .80, .85, .95]
}

I iterated over these pipelines with a grid search, which exhaustively tries every parameter combination and keeps the best-scoring one, a structured form of trial and error.
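A sketch of that search with scikit-learn's GridSearchCV, assuming a standard train/test split on the cleaned text and binary label (the split itself is not shown in the original code), could look like this:

from sklearn.model_selection import GridSearchCV, train_test_split
# Assumed split; the column names follow the preprocessing step above.
X_train, X_test, y_train, y_test = train_test_split(
    df['Text'], df['Label'], stratify=df['Label'], random_state=42)
# Exhaustively search the CountVectorizer grid with 5-fold cross-validation.
gs_cvec = GridSearchCV(pipe_c_lr, param_grid=pipe_para_cvec, cv=5, n_jobs=-1)
gs_cvec.fit(X_train, y_train)
print(gs_cvec.best_params_)
print(gs_cvec.score(X_test, y_test))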

Results

For classification problems like this, which I equated to detecting spam emails, one has to ask what the goals are.

Since we are not predicting whether someone has a disease, and our classes are roughly balanced, I wanted to examine a model that minimized false positives and had the highest accuracy score. As expected, those turned out to be two separate models. Nevertheless, when reviewing the results, I decided to look into feature importance for the XGBoost model to see if there was anything to communicate back to the public about the nature of fake and real news.
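To make the false-positive criterion concrete, here is a simple evaluation sketch; best_model is assumed to be one of the fitted estimators from the grid search above, and the labels follow the Real = 0, Fake = 1 mapping from preprocessing.

from sklearn.metrics import accuracy_score, confusion_matrix
# best_model is assumed to be a fitted estimator from the grid search above.
preds = best_model.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
print(f"Accuracy:        {accuracy_score(y_test, preds):.3f}")
print(f"False positives: {fp}  (real stories flagged as fake)")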

Shapley Values and Feature Importance

I want to preface this by saying there is someone out there who can explain Shapley values better than I can. With that said, I still believe that in the world of model explainability this tool is critical. To me, SHAP values quantify the impact of a given feature on a prediction compared with the prediction we would make if that feature took some baseline value.
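As a rough sketch of how a chart like the one below is produced with the shap library: xgb_model is assumed to be the fitted XGBoost classifier, X_train_vec the vectorized training matrix, and feature_names the vocabulary taken from the vectorizer; none of these names come from the original code.

import shap
# xgb_model, X_train_vec, and feature_names are assumed from the modeling step;
# a sparse matrix may need .toarray() before plotting.
explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_train_vec)
# Bar chart of mean |SHAP value| per word, i.e. average impact on output magnitude.
shap.summary_plot(shap_values, X_train_vec, feature_names=feature_names, plot_type="bar")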

XGBoost SHAP Value Chart

What we see above are the features, or words, with the highest average impact on the model's output magnitude. The first word, "financial," has an extremely strong association with a real-news classification.

What does that mean in plain English? Articles containing the word "financial" strongly push the model toward classifying the story as real. Furthermore, terms like "update" and "stimulus" also rank high on the list of words affecting the model's decision making.

What can we learn from this, and how do we use this to inform the public?

Essentially, what this data shows is that if a news story contains keywords like "financial," "update," or "stimulus," then you are more than likely reading a story whose information you can trust.

GitHub Link

Check Out More of My Work Here
