Detecting Fake News using Machine Learning

Gabriel Mayers
Published in Analytics Vidhya
4 min read · Jun 19, 2020

Currently, one of the biggest problems is fake news. It’s almost impossible to stay away from it.

But how can we use Machine Learning to predict what’s Fake News and what’s not?

I worked on this problem over the last 3 days, and in this post, I’ll explain how I built a Machine Learning model able to identify Fake News just from the headline of the news.

Before starting: this was just a personal project! I really think we need more than just the headline to detect what’s Fake News and what’s not! The main idea here is to show how we can use Machine Learning approaches to identify Fake News.

Basically, this project is divided into six steps; let’s take a closer look…

These are going to be our tasks for this project:

  • Take the data to input into our Model;
  • Format the Data;
  • Tokenize the Data;
  • Build Our Model;
  • Train Our Model;
  • Make Predictions;

Let’s get started baby!

Taking The Data

For this project, I used the Fake or Real News Dataset from Kaggle.

This dataset has 2 CSV files, fake and true; we can download them and merge them into one.
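To make the starting point concrete, here’s a minimal loading sketch; the file names True.csv and Fake.csv are an assumption about how the Kaggle dataset is distributed, so adjust the paths to your download:

import pandas as pd

# Assumed file names from the Kaggle dataset; change them if yours differ
true = pd.read_csv('True.csv')
fake = pd.read_csv('Fake.csv')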

But before merging our datasets into one, we need to label them. We’ll use 1 for true and 0 for fake.

To do that, we can just add a column called “label” to each dataset and assign its value as 1 for true or 0 for fake, like the code below:

true['label'] = 1
fake['label'] = 0

Another good practice is to check whether our dataset has NaN values; we can use Seaborn with a heatmap plot to verify it!

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 8))
sns.heatmap(data.isnull(), cmap='viridis')
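If you just want the counts instead of a plot, a quick alternative is:

# Number of missing values in each column
print(data.isnull().sum())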

After doing that, we can merge our datasets into one by using the code below:

# Concatenating the two DataFrames into one:
data = pd.concat([true, fake], axis=0)
data

I made some other changes to the dataset in case I use it again in the future; you can see all the changes in this notebook.
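One change worth making explicit: later we split the data into train and test by row index, so if the true and fake rows are simply stacked on top of each other, that split would be heavily skewed toward one class. A minimal shuffle sketch (the random_state value is arbitrary, and the original notebook may handle this differently):

# Shuffle the rows so true and fake news are mixed before the index-based split
data = data.sample(frac=1, random_state=42).reset_index(drop=True)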

Now we can save our Dataset, like the code below:

# To csv:
data.to_csv('data.csv')

We’ve already finished the first two tasks; now we can start to tokenize our data!

Tokenizing the Data

Basically, when you tokenize the data, you transform words into numbers, the language of Machine Learning models.
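To make the idea concrete, here’s a tiny standalone example (not part of the project code) of what the Keras Tokenizer does:

from tensorflow.keras.preprocessing.text import Tokenizer

toy = Tokenizer()
toy.fit_on_texts(['the cat sat', 'the dog barked'])
print(toy.word_index)   # e.g. {'the': 1, 'cat': 2, 'sat': 3, 'dog': 4, 'barked': 5}
print(toy.texts_to_sequences(['the cat barked']))   # [[1, 2, 5]]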

You can see the tokenizing process in this notebook.

To tokenize our data, we’ll use the TensorFlow preprocessing library. First, we import it:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

We’ll use pad_sequences to make all the vectors of numbers have the same length; you can read more about pad_sequences here.
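Here’s a tiny illustration of what padding does, just for intuition:

from tensorflow.keras.preprocessing.sequence import pad_sequences

print(pad_sequences([[1, 2], [1, 2, 3, 4]], maxlen=4, padding='post'))
# [[1 2 0 0]
#  [1 2 3 4]]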

Now, we can tokenize our data using the code below:

# df is the merged dataset from the previous steps
# (e.g. df = pd.read_csv('data.csv'))
train = df[:30000]
test = df[30000:]

train_sentences = train['title'].tolist()
test_sentences = test['title'].tolist()

# Tokenizer / padding hyperparameters
vocab_size = 5000
embedding_dim = 16
max_length = 500
trunc_type = 'post'
padding_type = 'post'
oov_tok = '<OOV>'

# Build the vocabulary from the training headlines only
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index

# Turn each headline into a padded sequence of word indices
training_sequences = tokenizer.texts_to_sequences(train_sentences)
training_padded = pad_sequences(training_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

testing_sequences = tokenizer.texts_to_sequences(test_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
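A quick sanity check on the result (the exact test-set size depends on how many rows your merged dataset ends up with):

# Each row is one headline, padded/truncated to max_length tokens
print(training_padded.shape)   # (30000, 500)
print(testing_padded.shape)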

Now that we’ve finished tokenizing our data, let’s build our Model.

Building the Model

We’ll use LSTMs to build our Model. LSTMs have the advantage of memory, which is very useful when we’re working with sequential data like speech, text, and stock prices. Basically, we have sequential data whenever the past matters for the prediction!

You can read more about LSTMs in this article.

You can see the model-building process in this notebook.

We’ll use TensorFlow to build our Model!

import tensorflow as tf

model = tf.keras.models.Sequential()

# Learn a dense vector for each word in the vocabulary
model.add(tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length))
# A bidirectional LSTM reads the headline in both directions
model.add(tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(300, dropout=0.3, recurrent_dropout=0.3)
))
# Single sigmoid unit: probability that the headline is true (1) vs fake (0)
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

model.summary()

We’ll compile our Model using the binary cross-entropy loss function and the Adam optimizer:

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

We’ll add an EarlyStopping callback to cut training time by stopping once the validation loss stops improving:

cb = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2)
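If you prefer the model to keep the weights from its best epoch instead of the last one, EarlyStopping also accepts restore_best_weights (optional, not used in the original run):

cb = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)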

And now, it’s time to train baby!

data_model = model.fit(training_padded, train['label'],
                       epochs=50,
                       validation_data=(testing_padded, test['label']),
                       callbacks=[cb])

Let’s see our results:

Not bad…

As you can see, I passed the test set as validation_data, but feel free to use it to make predictions!
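Since making predictions was on our task list but isn’t shown above, here’s a hedged sketch of how you could score a new headline with the trained model and tokenizer (the helper name and example headline are my own):

def predict_headline(headline):
    # Reuse the same tokenizer and padding settings used for training
    seq = tokenizer.texts_to_sequences([headline])
    padded = pad_sequences(seq, maxlen=max_length, padding=padding_type, truncating=trunc_type)
    # Sigmoid output: closer to 1 means "true", closer to 0 means "fake"
    return model.predict(padded)[0][0]

print(predict_headline('Scientists discover water on the moon'))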

Actually, we can improve this Model by changing the hyperparameters or by using the full text of the news instead of just the headline.

But for now, this is nice to have an idea of the approach!

I hope you have enjoyed this practical example!

For now, this is all!

See you next time!


Gabriel Mayers
Analytics Vidhya

Artificial Intelligence Engineer, Science enthusiast, Self-taught and so curious.