Fake News Detection with Neural Networks
Click here to read this article in Portuguese.
Fake news has been a recurring problem in our post-globalization era of easy access to the internet. To combat it, we must first be able to identify it. This can be achieved using algorithms trained for this purpose.
In this article, I show the steps I took to create a supervised machine-learning model using neural networks to detect whether a news article is reliable or not. This involved exploratory data analysis, data cleaning and preprocessing, text standardization, and the creation of word clouds to get familiar with the most commonly used terms in the articles. I also analyzed the authors with the highest number of articles in the dataset. Next, I performed the preprocessing required by TensorFlow, using tokenization and padding techniques. The model itself is a stacked sequential neural network consisting of Embedding, Conv1D, GlobalMaxPool1D, and Dense layers. After feeding new data into the model, I evaluated its performance using accuracy and recall values, as well as a confusion matrix.
* Note
This is the complete study report, including the code and methodology used. I also published a shorter, more direct version that presents only the main results of this research.
To check the summarized article click here.
Summary
1. About the Project
2. General Objective
2.1. Specific Objectives
3. Obtaining the Data
4. Variable Dictionary
5. Data and Library Importation
6. Exploratory Data Analysis
7. Data Cleaning
7.1. Missing Data
7.2. Duplicate Data
7.3. Delete id attribute
8. Data Processing
8.1. Word standardization
8.2. WordCloud
8.2.1. WordCloud using titles
8.2.2. WordCloud using news text
8.3. Authors’ analysis
8.4. Splitting sets into: training, validation, and testing
9. Performance Assessment Metrics
10. What are Neural Networks?
11. Why TensorFlow?
12. Algorithm Development using TensorFlow
13. Predictions on the test set
14. Results Comparison: Validation x Test
15. Model Deployment
16. Conclusion
1. About the Project
With globalization and technological advancement, news can now travel across the globe in a matter of seconds. The internet has facilitated access to information, but on the flip side, it has also made the dissemination of fake news a lot easier. This problem has become a global challenge, impacting everything from everyday tasks to the international political landscape.
In a broader sense, fake news refers to manipulated news that is entirely untrue or contains a distorted element, fact, or number with the intention of leading the reader to have a different perception of the event. Therefore, the goal of this type of news is to deceive, influence opinion, or create sensationalism around a particular topic.
The consequences are severe. At a personal level, fake news causes confusion, anxiety, and fear, and distorts an individual’s perception of reality. This can lead people to make wrong decisions that harm various aspects of their lives, including their finances, relationships, and even their health.
However, in a broader context, fake news polarizes society, incites gratuitous violence and hatred, and undermines trust in certain institutions. In the political arena, for example, it can impact election outcomes and the democratic stability of a nation.
Therefore, the fight against this type of news is of utmost importance and requires the collaboration of governments, media and technology companies, as well as the general public. We must exercise critical thinking to question the information we receive and assist in identifying and stopping the spread of fake news.
Furthermore, another way to combat this type of news is through the use of technology. With that in mind, this study aims to leverage data science technology to help identify fake news.
2. General Objective
Build a supervised machine-learning model that classifies whether a news article is trustworthy or not.
2.1. Specific Objectives
- Perform an exploratory analysis of the data with the aim of understanding the dataset and extracting insights that can help with subsequent steps.
- Perform data cleaning and processing to prepare it to be properly used in the machine-learning model.
- Create a machine-learning model using neural networks, and assess its performance.
4. Variable Dictionary
Understanding the dataset involves checking the variables available in it so that a proper analysis can be conducted. According to the website’s documentation, the table below was constructed with the variable names and their respective meanings, in alphabetical order:
- author: Name of the author of the news
- id: Unique identification for the news
- label: Target variable that indicates whether the news is fake news or not, using Boolean values: 0 represents reliable news and 1 represents unreliable news (fake news)
- text: News text (may be incomplete)
- title: News title
5. Data and Library Importation
When starting a project, it is necessary to install packages, import the libraries whose functions will be used in the following lines of code, and make the necessary configurations for the code output. The datasets are also imported and saved into specific variables so that they can be used later.
# install additional packages
!pip install tensorflow-addons -q
!pip install scikit-plot -q
# import libraries
import pandas as pd # data manipulation
import numpy as np # array manipulation
import random as rnd # random numbers
import missingno as msno # missing data evaluation
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # statistical data visualization
import string as st # functions to handle string data
import tensorflow as tf # build machine learning models
import scikitplot as skplt # data visualization and machine-learning metrics
from tensorflow import keras # build deep learning models
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator # create wordcloud
from sklearn.model_selection import train_test_split # split into training and test sets
from keras.preprocessing.text import Tokenizer # create tokens
from keras.preprocessing.sequence import pad_sequences # create padding
from keras.optimizers import Adam # optimizer for training neural networks
from sklearn.metrics import confusion_matrix # generate confusion matrix
from sklearn.metrics import accuracy_score, recall_score, precision_score # performance assessment
from sklearn.metrics import classification_report # performance report generation
import warnings # notifications
warnings.filterwarnings('ignore') # set notifications to be ignored
# additional settings
## chart style
plt.style.use('ggplot')
sns.set_style('dark')
## set fixed random seed for reproducible results
seed = 123
tf.random.set_seed(seed)
np.random.seed(seed)
rnd.seed(seed)
# configure the output to show all columns
pd.options.display.max_columns = None
# configure output of figures in 'svg' format (best quality)
%config InlineBackend.figure_format = 'svg'
# import data sets and save them into variables
data_path = "https://www.dropbox.com/scl/fi/7gubsdfvlvtgswsmp2a7d/train.csv?rlkey=cof2cxeyek1zveki3nnkpygcp&dl=1"
df_raw = pd.read_csv(data_path)
6. Exploratory Data Analysis
This is an essential step in data science projects, where the goal is to gain a better understanding of the data by identifying patterns, outliers, potential relationships between variables, and more. In this study, we explored information that was relevant to guide the responses to the objectives mentioned earlier (see General Objective and Specific Objectives).
To achieve this, various techniques and tools will be used as deemed necessary. In this phase, the data scientist becomes a detective in search of things that are not explicitly present in the data frame. As a result, the data will also be plotted in different ways to visualize them better and test initial hypotheses, with the aim of gaining insights that can guide the rest of the project.
First, I generated a visualization of the first 5 and last 5 entries to check the composition of the dataset, and I checked that at the end of it, there were no incorrect records, such as total sums.
# print the 5 first entries
df_raw.head()
# print the 5 last entries
df_raw.tail()
It can be seen that the fields appear to be well filled out and that there are no apparent problems. However, we will continue to investigate more deeply.
The next step is to know the size of this dataset.
# check the data set size
print('Dataset Dimensions')
print('-' * 30)
print('Total records:\t\t {}'.format(df_raw.shape[0]))
print('Total attributes:\t {}'.format(df_raw.shape[1]))
'''
Dataset Dimensions
------------------------------
Total records: 20800
Total attributes: 5
'''
Let’s look a little closer at these 5 variables. The objective will be to understand the type of variable found in these attributes, check missing values, data distribution, outliers, etc.
# generate data frame information
df_raw.info()
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 20800 non-null int64
1 title 20242 non-null object
2 author 18843 non-null object
3 text 20761 non-null object
4 label 20800 non-null int64
dtypes: int64(2), object(3)
memory usage: 812.6+ KB
'''
With the output above, we can verify that there are missing data in the title, author, and text variables. This requires specific treatment.
Furthermore, we can see that id and label are integers (int), while title, author, and text are of string type. Therefore, the attributes are in the correct types, and there is no need for any changes in this regard.
As seen, we have variables with missing values. Let’s examine this in more detail by printing the percentage of missing data for each attribute, organized in descending order, from the highest to the lowest.
# check amount of missing data
print(((df_raw.isnull().sum() / df_raw.shape[0]) * 100).sort_values(ascending=False).round(2))
'''
author 9.41
title 2.68
text 0.19
id 0.00
label 0.00
dtype: float64
'''
At the top of the list, we observe the author attribute with the highest amount of missing data, lacking proper filling for 9.41% of the total dataset. Next are the title and text variables, with 2.68% and 0.19% missing, respectively.
Therefore, in general, it is necessary to address the missing values.
Let’s visualize below the quantity of missing data in each attribute to facilitate understanding of the data quality.
# print chart to check for missing data
msno.bar(df_raw, figsize=(10,4), fontsize=8);
Finally, I will check the balance of the data in the label target variable.
# representation of the amount of 'label' in percentage
print('Total amount of (FALSE): {}'.format(df_raw.label.value_counts()[0]))
print('Total amount of (TRUE): {}'.format(df_raw.label.value_counts()[1]))
print('-' * 30)
print('The total amount of fake news represents {:.2f}% of the dataset.'.format(((df_raw.label.value_counts()[1]) * 100) / df_raw.shape[0]))
'''
Total amount of (FALSE): 10387
Total amount of (TRUE): 10413
------------------------------
The total amount of fake news represents 50.06% of the dataset.
'''
As we have a total of 50.06% of the data set representing fake news, we know that we have a balanced data set and that, therefore, there is no need for specific treatment.
7. Data Cleaning
Since we have missing data, it’s essential to perform proper treatment as they can cause issues during model training. Since we’re dealing with text data, I’ll remove the records with null values.
It’s also necessary to remove duplicate data.
Additionally, I’ll create a new variable, df_clean, to identify the dataset with treated data while keeping the original data intact.
7.1. Missing Data
# delete records with missing data
df_clean = df_raw.dropna()
I’ll check if the deletion was successful:
# check amount of missing data
print(((df_clean.isnull().sum() / df_clean.shape[0]) * 100).sort_values(ascending=False).round(2))
'''
id 0.0
title 0.0
author 0.0
text 0.0
label 0.0
dtype: float64
'''
With this, our data set now has the following dimensionality:
# check the data frame size
print('Dataset Dimensions')
print('-' * 30)
print('Total records:\t {}'.format(df_clean.shape[0]))
print('Total attributes:\t {}'.format(df_clean.shape[1]))
'''
Dataset Dimensions
------------------------------
Total records: 17714
Total attributes: 4
'''
7.2. Duplicate Data
It is also worth checking the presence of duplicate data. To do this, first I will look at the number of unique values in each attribute.
# check number of unique entries
print('Unique Entries')
print('-' * 30)
print('Total records in the dataset:\t {}'.format(df_clean.shape[0]))
print('Unique values in each attribute:')
display(df_clean.nunique())
'''
Unique Entries
------------------------------
Total records in the dataset: 18285
Unique values in each attribute:
id 18285
title 17931
author 3838
text 18017
label 2
dtype: int64
'''
Since we have 18,285 records, it’s possible that the title and text attributes have duplicate values. The author variable is naturally expected to have duplicate values, and those will be retained.
Let’s remove the duplicate data and verify if the action was successful:
# delete duplicate data from the dataset
df_clean = df_clean.drop_duplicates()
# check results
display(df_clean.nunique())
'''
id 18285
title 17931
author 3838
text 18017
label 2
dtype: int64
'''
We can see that there were no changes, meaning nothing was deleted. This does not necessarily mean that there are no duplicate data. With a more specific evaluation, we know that even though the text attribute contains only a part of the news, the chances of it being exactly the same as another are extremely low, as that would constitute plagiarism.
Furthermore, as we observed, the article titles available in the title variable are neither short nor generic. Therefore, we can also rule out the possibility of having legitimately identical titles.
Let’s proceed with the removal of duplicate data by attribute instead of looking at the entire dataset, as done in the previous code. This way, we will ensure the removal of these duplicated pieces of information that may interfere with the algorithm’s construction.
# delete duplicate data from 'title'
df_clean = df_clean[~df_clean.title.duplicated(keep='last')].reset_index(drop=True)
# delete duplicate data from 'text'
df_clean = df_clean[~df_clean.text.duplicated(keep='last')].reset_index(drop=True)
# check unique data to confirm deletion
display(df_clean.nunique())
'''
id 17714
title 17714
author 3813
text 17714
label 2
dtype: int64
'''
From the number of records given by id, we can see that the title and text attributes now have the same number of unique values, that is, 17,714 entries.
7.3. Delete id attribute
We can also delete the id attribute since it does not add relevant data to our model.
# delete 'id' attribute
df_clean.drop(['id'], axis=1, inplace=True)
# check the first 5 entries to check the changes made
df_clean.head()
8. Data Processing
In this project, we are dealing with a natural language processing (NLP) problem, meaning the data is in string format. Therefore, NLP knowledge is necessary to handle the data properly and achieve the desired results.
Once the data is standardized, we can perform some additional analyses, such as WordCloud generation and analysis of categorical data. In this case, we will focus on the authorship of the news articles.
8.1. Word standardization
For this purpose, the clean_text function will be created to standardize the news texts.
# create the 'clean_text' function that receives the 'text' argument
def clean_text(text):
    # split the text into words, using whitespace as the delimiter
    words = str(text).split()
    # convert each word to lowercase and add a trailing space
    words = [i.lower() + " " for i in words]
    # join the words back into a single string, separated by spaces
    words = " ".join(words)
    # remove punctuation from the string, using st.punctuation as the argument
    words = words.translate(words.maketrans('', '', st.punctuation))
    return words
# apply the 'clean_text' function to the attributes: 'title', 'author' and 'text'
df_clean.title = df_clean.title.apply(clean_text)
df_clean.author = df_clean.author.apply(clean_text)
df_clean.text = df_clean.text.apply(clean_text)
# check the first 5 entries to check the changes made
df_clean.head()
8.2. WordCloud
Let’s create a word cloud based on the title and text of the news articles to get an idea of which words are most used.
8.2.1. WordCloud using titles
# create variable based on 'title' attribute
title = df_clean.title
# title treatment
## concatenate words
all_title = ' '.join(s for s in title)
# create variable with list of stopwords
stopwords = set(STOPWORDS)
# instantiate wordcloud
wordcloud = WordCloud(stopwords=stopwords,
background_color='black').generate(all_title)
# print image with result
fig, ax = plt.subplots(figsize=(10, 6))
ax.imshow(wordcloud, interpolation='bilinear')
ax.set_axis_off()
# print WordCloud
plt.imshow(wordcloud)
# save generated image
wordcloud.to_file('news_titles_wordcloud.png')
Note that the most prominent words are: new york, york times, breitbart, trump, donald trump, hillary, hillary clinton.
We can also check the total length, in characters, of the text that was used to create the cloud above:
# print total number of characters used to create the cloud
print('Total characters used: {}'.format(len(all_title)))
'''
Total characters used: 1540706
'''
8.2.2. WordCloud using news text
# create variable based on 'text' attribute
text = df_clean.text
# text treatment
## concatenate words
all_text = ' '.join(s for s in text)
# create variable with list of stopwords
stopwords = set(STOPWORDS)
# instantiate wordcloud
wordcloud = WordCloud(stopwords=stopwords,
background_color='black').generate(all_text)
# print image with result
fig, ax = plt.subplots(figsize=(10, 6))
ax.imshow(wordcloud, interpolation='bilinear')
ax.set_axis_off()
# print WordCloud
plt.imshow(wordcloud)
# save generated image
wordcloud.to_file('news_text_wordcloud.png')
Here, it is clear that the most prominent words are: said, one, people, mr trump, many, now.
Let’s check the total length, in characters, of the text that was used to create the cloud above:
# print total number of characters used to create the cloud
print('Total characters used: {}'.format(len(all_text)))
'''
Total characters used: 97524186
'''
8.3. Authors’ analysis
Let’s look in more detail at the authorship of the news in the dataset under analysis. First, we will check how many authors we have in total, then we will make a ranking with the number of articles published.
# check number of unique entries
print('This set has {} different authors.'.format(len(df_clean.author.unique())))
'''
This set has 3804 different authors.
'''
# ranking of the 10 authors with the highest number of published articles
best_authors = df_clean.author.value_counts()[:10]
print(best_authors)
'''
pam key 243
admin 216
jerome hudson 166
charlie spiering 141
john hayward 140
katherine rodriguez 124
warner todd huston 122
ian hanchett 119
breitbart news 118
daniel nussbaum 112
Name: author, dtype: int64
'''
# extract names of authors and the number of articles published
authors = best_authors.index
article_counts = best_authors.values
# create horizontal bar chart
fig, ax = plt.subplots(figsize=(10, 6))
ax.barh(authors, article_counts, color='#91aac9')
ax.set_xlabel('Number of Articles')
ax.set_ylabel('Author')
# title
ax.text(-60, -1.5, 'Top 10 authors with the highest number of published articles', fontsize=16, color='#004a8f',
fontweight='bold')
# number of articles per author
## pam key
ax.text(230, 0.15, '243', fontsize=12, color='#004a8f')
## admin
ax.text(203, 1.15, '216', fontsize=12, color='#004a8f')
## jerome hudson
ax.text(153, 2.15, '166', fontsize=12, color='#004a8f')
## charlie spiering
ax.text(127, 3.15, '141', fontsize=12, color='#004a8f')
## john hayward
ax.text(127, 4.15, '140', fontsize=12, color='#004a8f')
## katherine rodriguez
ax.text(111, 5.15, '124', fontsize=12, color='#004a8f')
## warner todd huston
ax.text(109, 6.15, '122', fontsize=12, color='#004a8f')
## ian hanchett
ax.text(105, 7.15, '119', fontsize=12, color='#004a8f')
## breitbart news
ax.text(105, 8.15, '118', fontsize=12, color='#004a8f')
## daniel nussbaum
ax.text(100, 9.15, '112', fontsize=12, color='#004a8f')
plt.gca().invert_yaxis() # Invert the y-axis so that the author with the most articles is at the top
plt.show()
8.4. Splitting sets into: training, validation, and testing
Next, we will split the data into training, validation, and test sets. 75% of the dataset will continue to be used for training the model, while the remaining 25% will be divided equally between validation and testing.
# split training, testing and validation data
## stratify= df_clean.label (to split so that classes have the same proportion)
## random_state so that the result is replicable
train, df_temp = train_test_split(df_clean, test_size=0.25, stratify=df_clean.label, shuffle=True, random_state=123)
validation, test = train_test_split(df_temp, test_size=0.5,stratify=df_temp.label, shuffle=True, random_state=123)
# check set sizes
print('The training set has \t{} records'.format(train.shape[0]))
print('The testing set has \t{} records'.format(test.shape[0]))
print('The validation set has \t{} records'.format(validation.shape[0]))
'''
The training set has 13285 records
The testing set has 2215 records
The validation set has 2214 records
'''
9. Performance Assessment Metrics
Before beginning the construction of a machine learning model, it’s important to define how we will evaluate the model, i.e., decide whether it is performing as expected.
In this study, I will use two main metrics: accuracy and recall.
Accuracy is a simpler metric as it provides an overall view of the correct predictions out of the total predictions made. Its formula is given by:
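Accuracy = (TP + TN) / (TP + TN + FP + FN)
where TP, TN, FP, and FN are, respectively, the numbers of true positives, true negatives, false positives, and false negatives (these terms are detailed in the confusion matrix description below).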
However, accuracy is not the best metric to consider when dealing with imbalanced data, for example, in cases like bank fraud and disease detection. In these cases, we expect fraud and the presence of a disease to be the exception rather than the rule. This naturally results in imbalanced data. This might be the case here as well. Despite the dataset appearing balanced, in real life, we typically encounter trustworthy news most of the time. Hence, fake news should be the exception rather than the rule.
In such cases, we often use recall as an evaluation metric, and that’s why we will also look at these values in this study.
Its formula is given by:
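Recall = TP / (TP + FN)
In other words, recall measures the proportion of actual fake news articles that the model correctly identifies as fake.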
Finally, I will also use the confusion matrix to compare the predicted values with the actual values, showing the errors and successes of the model. Its output has 4 different values; a reference layout is shown right after the list below. Each of these values corresponds to:
Model Correct Predictions
- True Positive: It’s fake news and the model classifies it as fake news.
- True Negative: It’s not fake news and the model classifies it as not fake news.
Model Errors
- False Positive: It’s not fake news, but the model classifies it as fake news.
- False Negative: It’s fake news, but the model classifies it as not fake news.
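For reference, the confusion matrices plotted later in this study follow the convention used by scikit-learn and scikit-plot: rows are the actual classes and columns are the predicted classes, with class 0 (reliable) listed first. Schematically:
                        Predicted: reliable (0)    Predicted: fake (1)
Actual: reliable (0)    True Negative (TN)         False Positive (FP)
Actual: fake (1)        False Negative (FN)        True Positive (TP)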
10. What are Neural Networks?
According to SAS, neural networks are “interconnected computing systems that function like the neurons in the human brain. Using algorithms, they can recognize hidden patterns and correlations in raw data, group and classify them, and — over time — continuously learn and improve” (Source: SAS).
While initially the use of neural networks aimed to recreate the functioning of the human brain for problem-solving, over time they have been used to solve specific problems such as speech recognition, text translation, chess-playing, aiding in healthcare diagnostics, computer vision, among others. This led to the creation of various types of neural networks. Here, we will use Convolutional Neural Networks (CNNs), which are best known for image-related problems such as object detection but are also well suited to natural language processing, as in this project.
Let’s look at an example. A node can be compared to a neuron. When stimulated, electrical currents are emitted to activate another node or neuron. These are the synapses. And so on, traversing the entire network or, in our analogy, our brain. At the end of this stimulation, a response is generated.
I bet you’ve stubbed your toe and experienced one of the fastest synapses where you swear at people you don’t even know up to the third generation. Or when you smell that food that only your grandmother knew how to make…
In the same way, this is what we aim to achieve with our model: build neurons, which are our nodes, so that they generate or don’t generate stimulation, and in the end, give us a response. But all of this involves the execution of complex mathematical calculations, where we try to identify patterns and then decide whether there are sufficient reasons to pass the stimulus forward or not.
Look at the image below. It’s a classification model, just like the one being constructed in this project. However, here, the algorithm must respond whether the given image is of a cat or a dog. We provide an image of a dog, which is processed, fed into the model, and generates an output that says whether it’s a cat or a dog. While the image is an illustration of what happens, it helps understand the mentioned concepts.
In a more comprehensive manner, we can say that data is fed into the neural network through the input layer, which then communicates between the hidden layers. It’s in these layers that data is processed and weighted. In other words, data is computed with different coefficients that generate specific results based on the input assigned to the model. In the end, everything is summed up, and the result passes through the activation function. Depending on the result, it will either activate the next node or not, and so on. Each activation or non-activation of a node helps the model determine the final outcome.
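To make the idea of weighting, summing, and activating concrete, below is a minimal sketch of a single artificial neuron in NumPy; the input values and weights are invented purely for illustration:
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))  # squashes any number into the range (0, 1)
x = np.array([0.5, -1.2, 3.0])  # inputs arriving at the node
w = np.array([0.8, 0.1, -0.4])  # weights (coefficients) learned during training
b = 0.2                         # bias term
output = sigmoid(np.dot(w, x) + b)  # weighted sum passed through the activation function
print(output)  # a value close to 1 passes a strong stimulus forward; close to 0 passes almost none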
11. Why TensorFlow?
TensorFlow is an open-source machine learning library developed by the Google Brain Team. Its features, such as working with tensors, handling large volumes of data, and its flexibility in building and training advanced machine learning models using a wide variety of tools, make it one of the best libraries for developing deep learning models.
Among the advantages of using TensorFlow to build neural networks for binary classification of fake news, we can cite:
- Ability to Process Large Volumes of Text: TensorFlow’s capability to handle large volumes of text data aids the analysis and learning over the news content.
- Construction of Complex Models: TensorFlow allows the construction of complex models capable of capturing nuances and patterns in text, which is helpful for distinguishing between fake and true news.
- Parameter Tuning and Model Customization: TensorFlow provides the flexibility to test different parameters, insert multiple layers, and customize the model to optimize accuracy and performance.
12. Algorithm Development using TensorFlow
To develop the best algorithm for the purpose of this project, we first need to perform data preprocessing. This step is important to simplify the text, making it easier to process and use for training. For this, we use two techniques: tokenization and padding.
Tokenization reduces text into smaller units such as groups of words, individual words, characters, punctuation, or symbols. These units are technically called tokens. The image below shows a representation of a text being split into tokens.
From there, each token is assigned a numerical value that represents it, and this information is stored in the form of a dictionary. In other words, it has key-value pairs where the key is the token and the value is its numerical identifier.
The padding step, as the name suggests, aims to standardize the length of sequences. In other words, all sequences will have the same length to facilitate model training. To achieve this, a maximum length is determined, and we specify which value should be inserted in cases where the sequence is shorter than the maximum length specified.
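Before applying these steps to the real dataset, here is a small, self-contained illustration of tokenization and padding; the sentences below and the resulting indices are invented for demonstration only:
# toy illustration of tokenization and padding (not part of the project pipeline)
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
toy_tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
toy_tokenizer.fit_on_texts(['the cat sat on the mat', 'the dog barked'])
print(toy_tokenizer.word_index)  # dictionary mapping each token to its numerical identifier
toy_sequences = toy_tokenizer.texts_to_sequences(['the cat barked', 'the parrot sat'])
print(pad_sequences(toy_sequences, maxlen=5, padding='post'))  # 'parrot' becomes <OOV>; zeros pad shorter sequences to length 5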
# settings
vocab_size = 10000 # maximum number of tokenized words
trunc_type = 'post' # set truncate
pad_type = 'post' # set padding
oov_tok = '<OOV>' # default token for out-of-vocabulary tokenization
# tokenize training data
tokenizer = Tokenizer(num_words = vocab_size,
oov_token = oov_tok)
tokenizer.fit_on_texts(train.text)
# store the training data index in a variable
word_index = tokenizer.word_index
# apply tokenizing and padding
training_sequences = tokenizer.texts_to_sequences(np.array(train.text))
training_padded = pad_sequences(training_sequences,
truncating = trunc_type,
padding = pad_type)
# set maximum padding length
max_length = len(training_padded[0])
# encode the string of validation data
validation_sequences = tokenizer.texts_to_sequences(np.array(validation.text))
# apply padding to validation data
validation_padded = pad_sequences(validation_sequences,
padding = pad_type,
truncating = trunc_type,
maxlen = max_length)
# create arrays with inputs
x_train = np.copy(training_padded)
x_val = np.copy(validation_padded)
y_train = train['label'].values
y_val = validation['label'].values
With the preprocessing steps completed, let’s check some outputs to illustrate what was done in this stage.
First, let’s check the tokenization process. We mentioned earlier that it returns a dictionary with key-value pairs, where the key is the token, and the value is the numerical information corresponding to that token. Let’s verify this information with the code below.
# check the type of variable 'word_index'
print('The variable "word_index" is of type: {}'.format(type(word_index)))
# view the first 10 key-value pairs of 'word_index'
print('\nFirst 10 entries in "word_index":')
for word, index in list(word_index.items())[:10]:
print(word, index)
'''
The variable "word_index" is of type: <class 'dict'>
First 10 entries in "word_index":
<OOV> 1
the 2
to 3
of 4
and 5
a 6
in 7
that 8
is 9
for 10
'''
Above, we confirmed that the variable word_index, where we saved the index from the tokenization process, is indeed a dictionary. Next, we printed the first entries for verification. For example, we now know that the word 'for' received the value 10, while <OOV>, the token reserved for out-of-vocabulary words during tokenization, received the value 1.
As it is a dictionary, we can search for its key to discover its value. Let’s search for the word ‘york’ that appeared in the word cloud.
# find value of key 'york'
value = word_index.get('york', 'word not found')
print('The value of "york" is: {}'.format(value))
'''
The value of "york" is: 178
'''
Let’s now check the training_sequences and training_padded variables, which are the sequences of tokens that were created and then, duly filled in, when necessary.
# check the types of the variables 'training_sequences' and 'training_padded'
print('The variable "training_sequences" is of type: {}'.format(type(training_sequences)))
print('The variable "training_padded" is of type: {}'.format(type(training_padded)))
'''
The variable "training_sequences" is of type: <class 'list'>
The variable "training_padded" is of type: <class 'numpy.ndarray'>
'''
Since we have a list and an array, we can print examples to check their structure:
# 'training_sequences' sample
training_sequences[:2]
Note above that we requested 2 samples, so 2 token sequences are printed. Each one is a list, nested within the outer list.
# 'training_padded' sample
training_padded[:3]
'''
array([[ 2, 1610, 1, ..., 0, 0, 0],
[1579, 1, 323, ..., 0, 0, 0],
[ 6, 3950, 273, ..., 0, 0, 0]], dtype=int32)
'''
Now, we can see a portion of 3 sequences that have been properly padded (hence the ‘0’ values at the end) to make them all the same length. By having the same length, they could be transformed into arrays, which also enhances the performance of building the algorithm. We can print their dimensions:
# check dimension of variable 'training_padded'
print('Dimension of variable "training_padded": {}'.format(training_padded.shape))
'''
Dimension of variable "training_padded": (13285, 24195)
'''
To finish the pre-processing, let’s also check the final size of our training set:
# check the size of the training sets
print('Number of Records')
print('Training Data: \t', len(x_train))
print('Target Variable:', len(y_train))
'''
Number of Records
Training Data: 13285
Target Variable: 13285
'''
Now, with the data properly prepared for use in building the algorithm, let’s create it!
The model was specifically developed for the problem at hand, which is to determine whether a news article is reliable or not. Therefore, we have a binary classification problem. A stacked sequential neural network was used for this purpose.
In total, there are 4 layers:
- Embedding: creates the embedding layer of the model, where the vocabulary size and maximum sequence length are set. It maps tokens with similar meanings to values that are close to each other.
- Conv1D: the convolutional layer, which slides filters over the sequence to better identify local features. It is configured with 16 filters of size 5 and the ReLU (Rectified Linear Unit) activation function.
- GlobalMaxPool1D: the global pooling layer, which keeps only the maximum value of each filter output to further emphasize the strongest features and reduce the output dimension.
- Dense: finally, the dense layer, configured with 1 unit and the sigmoid activation function, which outputs a value between 0 and 1 that is then rounded to classify the news article as reliable or not.
It’s worth noting that a RandomSearch was performed to find the best parameters for our specific model. Since this is a time and computationally intensive process, it was carried out in a separate notebook from this project to make this step more practical and straightforward.
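The search itself is not shown in this notebook. As a reference, below is a minimal sketch of how such a random search could be set up with the keras-tuner library; the library choice, the search ranges, and the number of trials are assumptions for illustration, not the exact configuration used in the separate notebook.
# hypothetical random search sketch using keras-tuner (assumed tool; ranges and trial count are illustrative)
import keras_tuner as kt
def build_model(hp):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, hp.Int('embedding_dim', 32, 256, step=32), input_length=max_length),
        tf.keras.layers.Conv1D(hp.Int('filters', 8, 64, step=8), hp.Choice('kernel_size', [3, 5, 7]), activation='relu'),
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    model.compile(loss='binary_crossentropy',
                  optimizer=Adam(learning_rate=hp.Choice('learning_rate', [1e-2, 1e-3, 1e-4])),
                  metrics=['accuracy'])
    return model
tuner = kt.RandomSearch(build_model, objective='val_accuracy', max_trials=10, seed=123)
tuner.search(x_train, y_train, epochs=2, validation_data=(x_val, y_val))
print(tuner.get_best_hyperparameters(1)[0].values)  # best combination found by the search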
Next, we have the construction, training, and printing of the characteristics of the final model.
# construction of binary classification neural network model
model = tf.keras.Sequential([ # create a layer-stacked sequential neural network model
tf.keras.layers.Embedding(vocab_size, (155), input_length=max_length), # create embedding layer to give close values to tokens that are similar
tf.keras.layers.Conv1D(16, 5, activation='relu'), # create convolution layer for batch learning to better identify features
tf.keras.layers.GlobalMaxPooling1D(), # create global pooling layer to return only the maximum value of each batch to further emphasize characteristics
tf.keras.layers.Dense(1, activation='sigmoid') # create dense layer with one unit and apply activation by sigmoidal function
])
# compile the model
model.compile(loss='binary_crossentropy', # define the loss function
optimizer=Adam(learning_rate=.001), # set optimizer and learning rate
metrics=['accuracy','Recall','Precision','FalseNegatives']) # define metrics for model evaluation during training and testing
# train the model
## verbose=2 to show one message by epoch
## epochs=4 defines the number of epochs that the model will go through the entire training set
history = model.fit(x_train, y_train, verbose = 2, epochs = 4,
validation_data = (x_val, y_val), # validation data
callbacks=[tf.keras.callbacks.EarlyStopping('val_loss', patience=3)]) # stop training when there is no improvement in 'val_loss' for 3 consecutive epochs
'''
Epoch 1/4
416/416 - 105s - loss: 0.2203 - accuracy: 0.9029 - recall: 0.8282 - precision: 0.9304 - false_negatives: 948.0000 - val_loss: 0.0620 - val_accuracy: 0.9783 - val_recall: 0.9750 - val_precision: 0.9729 - val_false_negatives: 23.0000 - 105s/epoch - 252ms/step
Epoch 2/4
416/416 - 72s - loss: 0.0416 - accuracy: 0.9867 - recall: 0.9853 - precision: 0.9827 - false_negatives: 81.0000 - val_loss: 0.0444 - val_accuracy: 0.9833 - val_recall: 0.9717 - val_precision: 0.9878 - val_false_negatives: 26.0000 - 72s/epoch - 172ms/step
Epoch 3/4
416/416 - 60s - loss: 0.0124 - accuracy: 0.9980 - recall: 0.9976 - precision: 0.9976 - false_negatives: 13.0000 - val_loss: 0.0347 - val_accuracy: 0.9860 - val_recall: 0.9880 - val_precision: 0.9785 - val_false_negatives: 11.0000 - 60s/epoch - 144ms/step
Epoch 4/4
416/416 - 54s - loss: 0.0037 - accuracy: 0.9998 - recall: 1.0000 - precision: 0.9995 - false_negatives: 0.0000e+00 - val_loss: 0.0325 - val_accuracy: 0.9864 - val_recall: 0.9848 - val_precision: 0.9826 - val_false_negatives: 14.0000 - 54s/epoch - 131ms/step
'''
In the output above, we have the values of the evaluation metrics for each epoch and the duration of the training. Overall, it is evident that with each epoch, there is an optimization of these metrics. This demonstrates the model’s progression in seeking better performance.
Shortly, we will examine the results by epoch and by evaluation metric, for both the training and validation data. First, I will print the constructed model.
# print model
print(model.summary())
'''
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 24195, 155) 1550000
conv1d (Conv1D) (None, 24191, 16) 12416
global_max_pooling1d (Glob (None, 16) 0
alMaxPooling1D)
dense (Dense) (None, 1) 17
=================================================================
Total params: 1562433 (5.96 MB)
Trainable params: 1562433 (5.96 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None
'''
The results are good and can be subject to a more detailed analysis to determine if there was overfitting or not. For now, we will continue with the educational objective of the project, which is the construction and application of the model itself.
In the code that prints the model, we can see the architecture of the model. It is a sequential model with 4 layers (each line specifies one of the built layers), the output shape of each of them, and the number of parameters that each of them has. This is interesting for understanding how data flows through the model and how parameters are distributed across the layers.
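As a sanity check, the parameter counts in the summary can be reproduced by hand: the Embedding layer has vocab_size × embedding dimension = 10,000 × 155 = 1,550,000 parameters; the Conv1D layer has (kernel size × input channels + 1 bias) × filters = (5 × 155 + 1) × 16 = 12,416; the GlobalMaxPooling1D layer has none; and the Dense layer has 16 weights + 1 bias = 17. Together they add up to the 1,562,433 total parameters reported above.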
Next, I will plot the evaluation metrics in graphs, according to the epoch of the data. This way, we can better understand the model’s evolution. Since there are 5 graphs with a similar structure, I will build a function to maintain a standard format between the graphs and to streamline this step.
# create function to plot the results
def plot_graphs(history, string):
    plt.plot(history.history[string])
    plt.plot(history.history['val_' + string])
    plt.xlabel('Epochs')
    plt.xticks(ticks=history.epoch)
    plt.ylabel(string)
    plt.legend([string, 'val_' + string])
    plt.title(string.upper() + ' vs. Epochs')
    plt.show()
We will evaluate the results, according to the evaluation metric and its performance according to each epoch.
# plot accuracy graph
plot_graphs(history, 'accuracy')
We can observe that the accuracy value during the first training epoch was below 0.9, while the validation data performed much better than that. In the next epoch, there was a significant improvement in the training data, and after that, the training and validation sets became closer, with the validation data performing slightly below the training data.
# plot loss graph
plot_graphs(history, 'loss')
Here, the lower the value, the better. Thus, we see that the first epoch resulted in a higher loss on the training data than on the validation data. However, in the following epochs this adjusts, as accuracy did.
# plot recall graph
plot_graphs(history, 'recall')
Once again, we aim for values closer to 1 to be considered a model with better performance. Just like with the two previous metrics, here we observe the same pattern: an initial epoch with poor training data but improving and converging in the subsequent epochs.
# plot precision graph
plot_graphs(history, 'precision')
In this metric, we noticed a small difference from the previous graphs. It achieves its best results in epochs 2 and 3 and begins to decline in performance in the last epoch.
# plot false negative graph
plot_graphs(history, 'false_negatives')
Finally, for false negatives, we aim for the lowest possible value. When looking at the results for both the training and validation sets, we notice that they improved over the epochs, with the exception of epochs 1 to 2 and epochs 3 to 4 for the validation set. However, these fluctuations are within a healthy margin.
13. Predictions on the test set
Just as was done with the training set, it’s necessary to preprocess the test data using the same steps, including tokenizer and padding. This will allow us to feed this new data into the model we created and assess its real performance.
# handle test data
test_sequences = tokenizer.texts_to_sequences(np.array(test.text))
test_padded = pad_sequences(test_sequences, padding=pad_type, truncating=trunc_type, maxlen = max_length)
Now, we can make predictions.
# make predictions
preds = np.round(model.predict(test_padded))
'''
70/70 [==============================] - 2s 24ms/step
'''
Now let’s check the number of records in the test set and confirm that it matches the number of predictions made by the model; the two values should be the same.
# check number of expected records
print('Number of records in the test set: \t {}'.format(test.shape[0]))
print('Number of predictions made by the model: {}'.format(len(preds)))
'''
Number of records in the test set: 2215
Number of predictions made by the model: 2215
'''
With the test data in hand, we can compare the results obtained from the model with the actual results. This way, we can calculate the evaluation metrics to assess the model’s performance. Below, I will print the values of accuracy, recall, and precision to demonstrate that it’s possible to obtain them individually, if needed.
# calculate evaluation metrics
## accuracy
accuracy = accuracy_score(test['label'].values, preds)
print('Accuracy: \t{:.4f}'.format(accuracy))
## recall
recall = recall_score(test['label'].values, preds)
print('Recall: \t{:.4f}'.format(recall))
## precision
precision = precision_score(test['label'].values, preds)
print('Precision: \t{:.4f}'.format(precision))
'''
Accuracy: 0.9792
Recall: 0.9772
Precision: 0.9729
'''
We can already see from the values above that the model performs well. Now, I will print a more comprehensive report of these metrics with just one line of code, using the classification_report method from the sklearn library.
Additionally, I will plot two graphs, both related to the confusion matrix of the test data, one with normalized values and the other with absolute values from the dataset.
# save predicted data to a variable
y_pred = np.round(model.predict(test_padded))
# print assessment metrics report
print('Assessment Metrics Report'.center(65) + ('\n') + ('-' * 65))
print(classification_report(test['label'], y_pred, digits=4) + ('\n') + ('-' * 15))
# plot charts
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
# normalized confusion matrix
skplt.metrics.plot_confusion_matrix(test['label'], y_pred, normalize=True,
title='Normalized Confusion Matrix',
text_fontsize='large', ax=ax[0])
# confusion matrix
skplt.metrics.plot_confusion_matrix(test['label'], y_pred,
title='Confusion Matrix',
text_fontsize='large', ax=ax[1])
plt.show()
'''
70/70 [==============================] - 1s 20ms/step
Assessment Metrics Report
-----------------------------------------------------------------
precision recall f1-score support
0 0.9837 0.9807 0.9822 1295
1 0.9729 0.9772 0.9751 920
accuracy 0.9792 2215
macro avg 0.9783 0.9789 0.9786 2215
weighted avg 0.9793 0.9792 0.9792 2215
---------------
'''
Above, we can see that the generated report provides more information about the evaluation metrics. And with the graphs, we can visualize the model’s correct predictions through the confusion matrix:
Model’s Correct Predictions
- True Positive: It’s fake news, and the model classified it as fake news, in this case, 899 articles.
- True Negative: It’s not fake news, and the model classified it as not fake news, here there were 1270 articles.
Model’s Errors
- False Positive: It’s not fake news, but the model falsely classified it as fake news, in this case, 25 articles.
- False Negative: It’s fake news, but the model classified it as not fake news; here there were 21 articles.
14. Results Comparison: Validation x Test
In this final section of the study, we will compare the model’s performance to understand how it performs with new data. This way, we will have a clearer view of how this algorithm will behave with real-world data.
First, we need to make predictions on the training and validation data. Then, we will plot the performance report for each of the sets (training, validation, and test). Finally, we will generate the respective confusion matrices.
# save training and validation data predictions in variables
y_pred_train = np.round(model.predict(x_train))
y_pred_val = np.round(model.predict(x_val))
'''
416/416 [==============================] - 9s 22ms/step
70/70 [==============================] - 1s 20ms/step
'''
# print training data evaluation metrics report
print('Evaluation Metrics Report - Training Data'.center(65) + ('\n') + ('-' * 65))
print(classification_report(train['label'], y_pred_train, digits=4) + ('\n') + ('-' * 65))
# print validation data evaluation metrics report
print('Evaluation Metrics Report - Validation Data'.center(65) + ('\n') + ('-' * 65))
print(classification_report(validation['label'], y_pred_val, digits=4) + ('\n') + ('-' * 65))
# print test data evaluation metrics report
print('Evaluation Metrics Report - Test Data'.center(65) + ('\n') + ('-' * 65))
print(classification_report(test['label'], y_pred, digits=4) + ('\n') + ('-' * 65))
'''
Evaluation Metrics Report - Training Data
-----------------------------------------------------------------
precision recall f1-score support
0 1.0000 0.9999 0.9999 7766
1 0.9998 1.0000 0.9999 5519
accuracy 0.9999 13285
macro avg 0.9999 0.9999 0.9999 13285
weighted avg 0.9999 0.9999 0.9999 13285
-----------------------------------------------------------------
Evaluation Metrics Report - Validation Data
-----------------------------------------------------------------
precision recall f1-score support
0 0.9892 0.9876 0.9884 1294
1 0.9826 0.9848 0.9837 920
accuracy 0.9864 2214
macro avg 0.9859 0.9862 0.9861 2214
weighted avg 0.9865 0.9864 0.9865 2214
-----------------------------------------------------------------
Evaluation Metrics Report - Test Data
-----------------------------------------------------------------
precision recall f1-score support
0 0.9837 0.9807 0.9822 1295
1 0.9729 0.9772 0.9751 920
accuracy 0.9792 2215
macro avg 0.9783 0.9789 0.9786 2215
weighted avg 0.9793 0.9792 0.9792 2215
-----------------------------------------------------------------
'''
Above, looking just at accuracy, we see that it is almost 1 on the training data, slightly lower on the validation data, and slightly lower again on the test data.
Once again, this confirms what was seen in the performance reports: on the training data, the performance is excellent; on the validation data, the metrics drop slightly; and on the test data, they drop a little more. This is because the model has already seen the training data, so its near-perfect results there are expected and do not reflect performance on new data. That is not the case for the validation and test data, which is why their metrics are lower.
However, the training data was included here to provide context. What we are primarily interested in is looking at the validation and test data because a significant distortion between them could indicate a problem with the model. This is not the case here; the data are very close. In the validation data, the accuracy was 0.9864, while in the test data, it was 0.9792, indicating that the model is performing well with data it has never seen before.
15. Model Deployment
Before the model can be put into production, it must be saved.
# save the model as TensorFlow
model.save('fakenews_clf.tf')
# save model weights
# classifier_fakenews_weights = model.save_weights('clFakeNews_weights.tf')
Below, I have provided the code to load the model. Remember that it is necessary to import the load_model function from the TensorFlow Keras library.
# load the model
# from tensorflow.keras.models import load_model
# fakenews_model = load_model('fakenews_clf.tf')
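Once reloaded, the model can classify a new article, provided the text goes through the same tokenizer and padding used during training. Below is a minimal sketch assuming the fitted tokenizer, pad_type, trunc_type, and max_length from this notebook are still available (in production they would also need to be saved and reloaded); the headline is invented for illustration.
# classify a new article with the trained model (sketch; assumes 'tokenizer', 'pad_type', 'trunc_type' and 'max_length' are available)
new_text = ['Example headline of an article we want to classify']
new_sequences = tokenizer.texts_to_sequences(new_text)
new_padded = pad_sequences(new_sequences, padding=pad_type, truncating=trunc_type, maxlen=max_length)
probability = model.predict(new_padded)[0][0]  # sigmoid output between 0 and 1
print('Unreliable (fake news)' if probability >= 0.5 else 'Reliable')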
16. Conclusion
This study aimed to create a supervised machine learning model that classifies whether a news article is reliable or not. To achieve this, an exploratory data analysis was conducted to understand the structure and content of the dataset. During this phase, it was observed that the target class data was balanced, and missing data was present.
Data cleaning and preprocessing involved removing missing data, detecting and removing duplicates, removing the id attribute, and standardizing the text.
Following this, the data was prepared to be fed into the machine-learning model. In this stage, tokenization and padding were used to transform the text into tokens and create sequences of the same length. Then, a sequential neural network model was created for binary classification, consisting of 4 layers: embedding (Embedding), convolution (Conv1D), max pooling (GlobalMaxPool1D), and dense (Dense).
With the algorithm created, the test set was used to evaluate the model’s performance. An accuracy of 0.9792 and a recall of 0.9772 were achieved. Therefore, it is capable of distinguishing fake news from real ones and can be a useful tool in combating this type of problem.
Get to know more about this study
This study is available on Google Colab and on GitHub. Just click on the images below to be redirected.