Photo by Markus Winkler on Unsplash

Week 5: NLP for fake news detection

A step by step Natural Language Processing tutorial.

Letícia Gerola
6 min read · Jun 9, 2020


Week 5 of Pyrentena and I finally dived into Natural Language Processing! Actually, this is week 6 of Pyrentena… Project 5 was based on a COVID-19 dataset: predicting positive testing for the disease of the year. It was a great project to learn about classification methods and the 'entropy' hyperparameter! I decided not to write an article about it, though — the prediction model was merely for study purposes and not some huge revelation, so I thought it was best to keep a low profile about it. You can check the complete code on my Github and test your own skills on classification methods.

Inspired by the thought that "this might be a bit distorted", I decided that my first NLP project would be about a subject very sensitive to me as a journalist: fake news. Way beyond creepy WhatsApp chain messages, fake news endangers our society and, I very much believe, our democracy as a whole. Welcome to week 5 (6?) of Pyrentena: we are building a fake news detector!

Data Exploration

The dataset is available on Kaggle and is split into two CSVs: fake & true news. Let's take a look:

image 1: true.head()

The fake news CSV looked exactly the same, but containing only fake news. I started by creating our target variable 'True/Fake' and merging the two CSVs into one:

# adding the feature that will be the target variable True/Fake
fake_news['True/Fake'] = 'Fake'
true_news['True/Fake'] = 'True'
# combining the dataframes into a single one using the concat method
total_news = pd.concat([true_news, fake_news])
# merging title and text into a single Article feature
total_news['Article'] = total_news['title'] + total_news['text']

Basic Data Cleaning

A lot of online data is in the form of text (e-mails, articles, documents, etc.) and one of the most exciting applications of ML is learning from those texts! To analyse this data, we use something called a 'bag of words': you count the frequency with which each word appears in your dataset. This allows us to turn each title or text into a frequency count of its words. The downside is that the order in which those words are arranged does not matter, which is kind of a big contradiction when you think about what reading a real text actually involves.
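To make the idea concrete, here is a minimal bag-of-words sketch on two toy sentences (the sentences are my own examples, not from the article's dataset):

```python
from collections import Counter

# two hypothetical documents
docs = ["fake news spreads fast", "real news spreads slowly"]

# tokenize by whitespace and count word frequencies per document:
# each document becomes an unordered bag of word counts
bags = [Counter(doc.split()) for doc in docs]

print(bags[0]["news"])     # → 1
print(bags[0]["spreads"])  # → 1
```

Notice that shuffling the words in either sentence would produce exactly the same bags — that is the "order does not matter" trade-off.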

Some words are what we call 'low information words' — like 'the', 'and', 'you', for example. Almost every English text in the world contains them, so they are not really meaningful for our ML model. These words are called stopwords: words with high frequency but low value — they are just making noise in the dataset. We get a function to filter these stopwords out of the NLTK (Natural Language Toolkit) library.

One trick here is that NLTK can't be used in a notebook right after pip installing it: you need to trigger the download inside the notebook itself, like I did below:

# downloading stopwords from the nltk lib
import nltk
nltk.download('stopwords')
# output
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
True

After downloading, we can import these libraries normally and start using them! I built a function to process our data: get rid of punctuation and apply the stopwords technique.

# build a function to process the text
import string
from nltk.corpus import stopwords

def process_text(s):
    # check each character for punctuation
    no_punc = [char for char in s if char not in string.punctuation]
    # join the characters again
    no_punc = ''.join(no_punc)
    # convert words to lowercase and remove stopwords
    clean_string = [word for word in no_punc.split() if word.lower() not in stopwords.words('english')]
    return clean_string
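To sanity-check the logic without downloading anything, here is the same two-step idea with a tiny hand-picked stopword set standing in for NLTK's English list (both the sentence and the mini-list are my own examples):

```python
import string

# stand-in for stopwords.words('english'): a few common low-information words
MINI_STOPWORDS = {'the', 'did', 'not', 'that', 'a', 'and'}

def process_text_demo(s):
    # same two steps as process_text: strip punctuation, then drop stopwords
    no_punc = ''.join(ch for ch in s if ch not in string.punctuation)
    return [w for w in no_punc.split() if w.lower() not in MINI_STOPWORDS]

print(process_text_demo("The president did not say that!"))
# → ['president', 'say']
```

The punctuation and the low-information words disappear; only the words carrying meaning survive.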

Function ready to go! I created a new feature in the dataset, total_news['Clean Text'], by applying process_text to the 'Article' feature. This is where you need a little patience… Our merged dataset has more than 40 thousand rows! I strongly recommend you build this model on a cloud service such as Google Colab instead of a local Jupyter Notebook. I did it on both and it took an average of 30 minutes on Colab and 3 hours on Jupyter.

# processing the text so we can have a Clean Text feature
# this process might take a while: avg 30 min
total_news['Clean Text'] = total_news['Article'].apply(process_text)

Can you believe this one line of code took all that time to run?! Well, with NLP it's easy to go from data to some sort of big data, so we'd better get used to it! Once it was done, our 'Clean Text' feature was good to go:
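If the runtime bothers you, one likely culprit is that the function calls `stopwords.words('english')` once per word, rebuilding the whole list every time. Caching the list in a `set` once, outside the loop, makes each membership check O(1) and usually speeds things up dramatically. A sketch of the idea (with a small stand-in list so it runs without the NLTK download; in the real notebook you would build the set from `stopwords.words('english')`):

```python
import string

# stand-in list; in the real notebook: stopwords.words('english')
STOPWORD_LIST = ['the', 'and', 'you', 'did', 'not', 'that']

# build the set ONCE, so the lookup inside the loop is O(1)
STOPWORD_SET = set(STOPWORD_LIST)

def process_text_fast(s):
    no_punc = ''.join(ch for ch in s if ch not in string.punctuation)
    return [w for w in no_punc.split() if w.lower() not in STOPWORD_SET]

print(process_text_fast("The president did not say that!"))
# → ['president', 'say']
```

The output is identical to the slow version; only the lookup cost changes.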

image 2: Clean Text

Vectorization

Time to count the frequency of each word using the bag of words technique:

# importing libs
from sklearn.feature_extraction.text import CountVectorizer
# Bag-of-Words (bow) accounts for the frequency of words in the text data
bow_transformer = CountVectorizer(analyzer=process_text).fit(total_news['Clean Text'])
# transforming the articles into a sparse count matrix
news_bow = bow_transformer.transform(total_news['Clean Text'])

Total vocabulary: 39099 words. Let's check the shape of our matrix data and move on to the Tfidf technique!

# printing out the shape of our matrix
print(f'Shape of Sparse Matrix: {news_bow.shape}')
print(f'Amount of Non-Zero occurrences: {news_bow.nnz}')
# output
Shape of Sparse Matrix: (44898, 39099)
Amount of Non-Zero occurrences: 44898

Tfidf: term frequency, inverse document frequency

The Tfidf representation is another text processing step we can apply to our text data. 'Tf' stands for 'term frequency' and 'idf' stands for 'inverse document frequency'. The tf part works just like bag of words: it counts how often a word appears. The idf part, though, gives each word a 'weight' based on how often it occurs across the whole dataset, rating the less frequent words higher. How is this different from bag of words? Well, Tfidf rates words from common to rare, scoring the rare ones higher. This technique allows us to discover interesting cases in a bunch of overlapping texts about the same subject! By surfacing the 'rare' words, Tfidf might reveal the most distinctive information about what's going on in the dataset.
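To see the "rare words score higher" effect concretely, here is TfidfTransformer on a tiny hand-made count matrix (the numbers are my own toy example, not from the article's data; with scikit-learn's default smoothing, the idf weight is ln((1+n)/(1+df)) + 1):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# toy counts: 3 documents x 2 words;
# word 0 appears in every document, word 1 in only one
counts = np.array([[1, 0],
                   [1, 0],
                   [1, 1]])

tfidf = TfidfTransformer().fit(counts)

# the rarer word gets the larger idf weight
print(tfidf.idf_)
```

The everywhere-word gets an idf of exactly 1.0, while the rare word scores higher — exactly the re-weighting described above.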

# applying Tfidf to check for rare words
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer().fit(news_bow)
# checking the shape of the transformed matrix
news_tfidf = tfidf_transformer.transform(news_bow)
print(news_tfidf.shape)
# output
(44898, 39099)

Build and train model

I stopped my text processing here, but there are a few other techniques you can apply to your text data. I suggest you research them all, since each dataset has its own peculiarities! To train our model, I decided to go with a Multinomial Naive Bayes algorithm. Naive Bayes has this name because it doesn't really care about the order of the words, only their frequencies: it 'naively' assumes the features are independent of each other. This algorithm has worked quite well in many real-world situations, famously document classification and spam filtering. It seemed a good choice.

# importing the algorithm to build the model
from sklearn.naive_bayes import MultinomialNB

fakenews_detect_model = MultinomialNB().fit(news_tfidf, total_news['True/Fake'])

After splitting my data with train_test_split, I decided to build a pipeline so all this text processing would happen in an organized way: our train data goes through vectorization with bag of words, followed by the Tfidf method, and then the Naive Bayes classifier.
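The split itself isn't shown in the article. Assuming the raw 'Article' texts and the 'True/Fake' labels are what gets split, and that the variable names match the pipeline call, it would look roughly like this (the test_size and random_state values are my guesses, and the tiny DataFrame is only a stand-in so the sketch runs on its own):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy stand-in for the real total_news dataframe
total_news = pd.DataFrame({
    'Article': ['story one', 'story two', 'story three', 'story four'],
    'True/Fake': ['True', 'Fake', 'True', 'Fake'],
})

# hedged guess at the article's split parameters
news_train, news_test, text_train, text_test = train_test_split(
    total_news['Article'], total_news['True/Fake'],
    test_size=0.3, random_state=42)

print(len(news_train), len(news_test))
```

In the real notebook you would of course split the full 44898-row dataframe, not this toy one.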

# building the pipeline
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=process_text)),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB()),
])
pipeline.fit(news_train, text_train)

Output of our pipeline:

Image 3: pipeline

Beautiful! It was my first time building a pipeline and I'm so proud of it. Time to make some predictions on our test data and evaluate the model.

Evaluating the model

To evaluate our model, I decided to go with classification_report from scikit-learn. It's a great report that gives not only accuracy, which in my opinion is a weak evaluation metric on its own, but also precision, recall and f1-score!

from sklearn.metrics import classification_report

# predicting news on the test dataset
prediction = pipeline.predict(news_test)
# printing out the classification report (true labels first, then predictions)
print(classification_report(text_test, prediction))

Output:

Image 4: classification report

The Multinomial Naive Bayes algorithm performed well on our processed text data, successfully predicting both fake and true news! We achieved high scores for precision, recall and f1-score. I'll talk more about these evaluation metrics in another article and give more detail about when to use each one, but know this: it's a pretty nice result! (Unless we overfitted, god help us.)

You can check the complete code on my Github. The next step is to try NLP on a Brazilian fake news dataset.


Letícia Gerola

Data scientist and journalist. Author of the Data Science blog 'Joguei os Dados'.