Fake News Detection Using Machine Learning

Manthan Bhikadiya 💡 · Published in The Startup · 7 min read · Oct 5, 2020

In this modern world, data is very important: as of 2020, roughly 1.7 megabytes of data were generated every second for every person on Earth. Many technologies are changing the world by exploiting this huge amount of data, and machine learning is one of them. In this project we use it to detect fake news.

Machine Learning

Machine learning is an application of AI that gives systems the ability to learn without being explicitly programmed. A machine learning model works on data and learns from it. This is very different from the traditional approach: instead of a programmer writing the algorithm, in machine learning we feed in the data and the machine generates the algorithm. There are three main types of machine learning:

  1. Supervised learning
  2. Unsupervised learning
  3. Reinforcement learning

Supervised learning means we train our model on labeled examples, so the machine first learns from those examples and then performs the task on unseen data. In this fake news detection project, we use supervised learning.

Check out more here

What is Fake news?

Fake news, in simple terms, is fabricated or misleading information that leads people down the wrong path. Nowadays fake news spreads like wildfire, and people share it without verifying it. This is often done to push certain ideas and is often tied to political agendas.

For media outlets, the ability to attract viewers to their websites is necessary to generate online advertising revenue, which gives sensational fake stories an incentive to spread. That makes detecting fake news all the more necessary.

Workflow :

(Workflow diagram: from raw text to numerical features to a machine learning algorithm. Source: towardsdatascience)

The diagram above shows how the pipeline generates numerical features and feeds them into a machine learning algorithm. In this project, we use machine learning and natural language processing libraries such as NLTK, re (regular expressions), and Scikit-learn.

Natural Language Processing

Machine learning models only work with numerical features, so we have to convert our text data into numerical columns. This preprocessing of text is part of natural language processing (NLP).

In text preprocessing we clean the text by stemming or lemmatizing words, removing stop words, removing special symbols and numbers, and so on. After cleaning the data, we feed the text into a vectorizer, which converts it into numerical features.

Dataset

You can find many datasets for fake news detection on Kaggle and other sites; I downloaded these datasets from Kaggle. There are two datasets, one for fake news and one for true news. The true news dataset contains 21,417 articles and the fake news dataset contains 23,481 articles. Each dataset gets a label column, with 1 for fake news and 0 for true news, and we combine the two datasets using pandas' built-in concat function, as sketched below.

In this dataset there are no missing values; otherwise we would have to drop that information or impute some value.

Our final dataset is balanced, because both categories have approximately the same number of examples.
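Below is a minimal sketch of this step, assuming the standard Kaggle file names Fake.csv and True.csv (adjust them to your download):

```python
import pandas as pd

# Load both datasets (file names assumed from the Kaggle download).
fake_df = pd.read_csv("Fake.csv")
true_df = pd.read_csv("True.csv")

# Add the label column: 1 for fake news, 0 for true news.
fake_df["label"] = 1
true_df["label"] = 0

# Combine both datasets and shuffle the rows.
data = pd.concat([fake_df, true_df], ignore_index=True)
data = data.sample(frac=1, random_state=42).reset_index(drop=True)

print(data.isnull().sum())           # no missing values expected
print(data["label"].value_counts())  # roughly balanced classes
```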

Cleaning Data

We can’t use text data directly because it contains unusable words, special symbols, and more. If we use it without cleaning, it is very hard for the ML algorithm to detect patterns in the text, and it can even cause errors. So we always have to clean the text data first. In this project, we write a function, cleaning_data, that cleans the data (see the sketch after the examples below).

Lemmatization: converting a word or token into its base form.

Examples:

Stay, Stays, Staying, Stayed → Stay

House, Houses, Housing → House

Stop words: words that occur too frequently and are not considered informative.

Examples :

{‘the’, ‘a’, ‘an’, ‘and’, ‘but’, ‘for’, ‘on’, ‘in’, ‘at’ …}
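Here is a minimal sketch of such a cleaning function, assuming NLTK's English stop word list and WordNet lemmatizer, and that the article text lives in a column named text (the original cleaning_data function may differ in its details):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the NLTK resources used below.
nltk.download("stopwords")
nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def cleaning_data(text):
    # Keep letters only, then lowercase.
    text = re.sub(r"[^a-zA-Z]", " ", text).lower()
    # Drop stop words and lemmatize the remaining tokens.
    tokens = [lemmatizer.lemmatize(word)
              for word in text.split()
              if word not in stop_words]
    return " ".join(tokens)

data["text"] = data["text"].apply(cleaning_data)
```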

Split the Data

Splitting the data is an essential step in machine learning: we train our model on the training set and evaluate it on the testing set. We split our data into train and test sets using the train_test_split function from Scikit-learn.

We use 80% of the data for the training set and the remaining 20% for the testing set.
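A minimal sketch of the split, continuing from the cleaned data above (stratifying is an extra choice here that keeps the fake/true ratio the same in both splits):

```python
from sklearn.model_selection import train_test_split

# 80% training, 20% testing; stratify preserves the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    data["text"], data["label"],
    test_size=0.2, random_state=42, stratify=data["label"])
```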

Tfidf Vectorizer

Tfidf-Vectorizer: (Term Frequency × Inverse Document Frequency)

1. Term Frequency (TF): the number of times a word appears in a document divided by the total number of words in that document. Every document has its own term frequencies.

2. Inverse Document Frequency (IDF): the log of the total number of documents divided by the number of documents that contain the word w. Inverse document frequency determines the weight of rare words across all documents in the corpus.

Finally, the TF-IDF score of a word is the product of the two (textbook form; Scikit-learn's implementation adds smoothing terms, so its exact values differ slightly):
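$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log\frac{N}{\mathrm{df}(t)}$$

where tf(t, d) is the term frequency of word t in document d, N is the total number of documents, and df(t) is the number of documents containing t.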

This vectorizer is predefined in the Scikit-learn library, so we can import it directly.

First we create a TfidfVectorizer object with some arguments.

You can check the meaning of the arguments here

Then we fit this vectorizer on the training data and use it to transform both the training and testing data.

Vectorizing the data returns a sparse matrix. If an algorithm needs a dense array, the toarray method does that conversion for us (Scikit-learn's MultinomialNB also accepts the sparse matrix directly).
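A minimal sketch of this step (the argument values are illustrative assumptions, not necessarily those used in the original code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Create the vectorizer; these argument values are illustrative.
vectorizer = TfidfVectorizer(max_features=50000, ngram_range=(1, 2))

# Fit on the training text only, then transform both splits.
# The result is a sparse matrix; call .toarray() only if your
# algorithm needs a dense array (MultinomialNB accepts sparse input).
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
```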

Multinomial Naive Bayes Classifier

Naive Bayes: the Naive Bayes classifier technique is based on Bayes' theorem and is particularly suited to high-dimensional data.

Formula: Bayes' theorem gives the probability of a class A given the observed evidence B:
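$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Here A is the class (fake or true) and B is the observed features of the text.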

Check out more information on Bayes' theorem here

Multinomial Naive Bayes :

It is used for classification when the features are discrete, which makes it very useful in text processing: each text is converted into a vector of word counts. It cannot deal with negative feature values.

It is predefined in the Scikit-learn library, so we can import the class into our project and create a MultinomialNB object.

1. Fit the classifier on our vectorized training data.

2. Once the classifier is fitted on the training set, we can use the predict method to predict results on the test set.
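A minimal sketch, continuing from the vectorized data above:

```python
from sklearn.naive_bayes import MultinomialNB

# Create and fit the classifier on the vectorized training data.
clf = MultinomialNB()
clf.fit(X_train_vec, y_train)

# Predict labels for the unseen test set.
y_pred = clf.predict(X_test_vec)
```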

Classification Metrics

To check how well our model performs, we use some metrics to evaluate it. There are many classification metrics available in Scikit-learn:

  1. Confusion Matrix
  2. Accuracy Score
  3. Precision
  4. Recall
  5. F1-Score

Confusion matrix: this metric shows how many results were predicted correctly and how many were not, broken down by class.

Accuracy score: the number of correct predictions over the total number of predictions.

Details of Precision, Recall and F1-Score

More Details on Classification Metrics
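A minimal sketch of computing these metrics with Scikit-learn, using the test-set predictions from above:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report)

# Overall accuracy and the confusion matrix on the test set.
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# Per-class precision, recall, and F1-score.
print(classification_report(y_test, y_pred))
```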

As you can see, we get very good precision, recall, and F1 scores, so we can say our model performs excellently on unseen data. The accuracy score on the test dataset is 95%, which is very good.

Now let's look at the classification report on the training set.

We also get a very good Accuracy score on the training set.

Accuracy score on train and test set

You can see that both accuracies are nearly equal, which suggests the model is not overfitting, so we can say it performs well.

Save the Model

Once the model performs well on our data, we can save it so that next time we can use it directly. The joblib and pickle libraries are used to save machine learning models. With the following steps, you can save and load your model.

For another method to save and load your model, check out here

Code :
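A minimal sketch using joblib (the file names are illustrative assumptions). Saving the fitted vectorizer alongside the model matters, because new text must be transformed the same way at prediction time:

```python
import joblib

# Save the trained classifier and the fitted vectorizer.
joblib.dump(clf, "model.pkl")
joblib.dump(vectorizer, "vectorizer.pkl")

# Later: load them back and predict on new text.
clf = joblib.load("model.pkl")
vectorizer = joblib.load("vectorizer.pkl")
prediction = clf.predict(vectorizer.transform(["Some news text ..."]))
```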

Summary

Today, we learned to detect fake news with Python. We took a fake and true news dataset, implemented a text cleaning function, applied a TfidfVectorizer, initialized a Multinomial Naive Bayes classifier, and fit our model. We ended up with an accuracy of 95.31% on the test set.

I hope you enjoyed this project.

Connect with me :

Github :

Medium :

LinkedIn :

Final Note :

Thanks for reading! If you enjoyed this article, please hit the clap 👏 button as many times as you can. It would mean a lot and encourage me to keep sharing my knowledge.
