Fake News Detection Using Machine Learning
In the modern world, data is extremely valuable: by 2020, an estimated 1.7 megabytes of data were generated every second for every person on Earth. Many technologies are changing the world by making use of this enormous amount of data. Machine learning is one of them, and we are using it here to detect fake news.
Machine Learning
Machine learning is an application of AI that gives systems the ability to learn without being explicitly programmed. A machine learning model learns from data, which is very different from the traditional approach: instead of hand-writing rules, we feed in data and the machine learns the patterns itself. Machine learning has three types of learning:
- Supervised learning
- Unsupervised learning
- Reinforcement learning
Supervised learning means we train our model on labeled examples: the machine first learns from those examples and then performs the task on unseen data. In this fake news detection project, we are using supervised learning.
What is Fake news?
Fake news, simply put, is false or misleading information that leads people down the wrong path. Nowadays fake news spreads like wildfire, and people share it without verifying it. This is often done to push or impose certain ideas, frequently in service of political agendas.
For media outlets, attracting viewers to their websites is necessary to generate online advertising revenue, which creates an incentive to publish sensational stories. So it is important to be able to detect fake news.
Workflow :
(Workflow diagram. Source: towardsdatascience)
The diagram above shows how this pipeline generates numerical features and feeds them into a machine learning algorithm. In this project, we are using machine learning and natural language processing libraries such as NLTK, re (regular expressions), and scikit-learn.
Natural Language Processing
Machine learning algorithms only work with numerical features, so we have to convert the text data into numerical columns. This text preprocessing step falls under natural language processing.
In text preprocessing we clean the text by stemming, lemmatization, removing stopwords, removing special symbols and numbers, and so on. After cleaning, we feed the text data into a vectorizer, which converts it into numerical features.
Dataset
You can find many datasets for fake news detection on Kaggle and other sites; I downloaded these datasets from Kaggle. There are two datasets, one for fake news and one for true news. The true news dataset contains 21,417 articles and the fake news dataset contains 23,481. Both datasets get a label column in which 1 stands for fake news and 0 for true news, and we combine them using a pandas built-in function.
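A minimal sketch of this step. The Kaggle file names (`Fake.csv`, `True.csv`) and column names are assumptions; tiny in-memory stand-ins are used here so the sketch runs on its own.

```python
import pandas as pd

# In the real project these come from the Kaggle CSVs, e.g.:
#   fake = pd.read_csv("Fake.csv"); true = pd.read_csv("True.csv")
# Tiny stand-in frames are used here so the sketch is self-contained.
fake = pd.DataFrame({"title": ["Aliens run the senate"], "text": ["..."]})
true = pd.DataFrame({"title": ["Parliament passes budget"], "text": ["..."]})

fake["label"] = 1   # 1 -> fake news
true["label"] = 0   # 0 -> true news

# Combine both datasets with pandas' built-in concat, then shuffle
data = pd.concat([fake, true], ignore_index=True)
data = data.sample(frac=1, random_state=42).reset_index(drop=True)
print(data["label"].value_counts())
```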
In this dataset there are no missing values; otherwise, we would have to drop that information or impute some value.
Our final dataset is balanced, because both categories have approximately the same number of examples.
Cleaning Data
We can’t use text data directly because it contains unusable words, special symbols, and more. If we use it directly without cleaning, it is very hard for the ML algorithm to detect patterns in the text, and it can even generate errors. So we always have to clean the text data first. In this project, we write one function, ‘cleaning_data’, which cleans the data.
Lemmatization: converting a word or token into its base form.
Examples :
Stay, Stays, Staying, Stayed → Stay
House, Houses, Housing → House
Stop words: words that occur too frequently and are not considered informative.
Examples :
{‘the’, ‘a’, ‘an’, ‘and’, ‘but’, ‘for’, ‘on’, ‘in’, ‘at’ …}
Split the Data
Splitting the data is an essential step in machine learning. We train our model on the training set and evaluate it on the testing set. We split the data into train and test sets using the train_test_split function from scikit-learn.
We use 80% of the data for the training set and the remaining 20% for the testing set.
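The split can be sketched as follows; the toy texts and the `stratify` argument are illustrative additions, not necessarily part of the original code.

```python
from sklearn.model_selection import train_test_split

# X: cleaned article texts, y: labels (1 = fake, 0 = true); toy data here
X = [f"news article number {i}" for i in range(10)]
y = [1, 0] * 5

# 80% train / 20% test; stratify keeps the class balance in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(len(X_train), len(X_test))
```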
Tfidf Vectorizer
Tfidf-Vectorizer : (Term Frequency * Inverse Document Frequency)
1. Term Frequency: the number of times a word appears in a document divided by the total number of words in the document. Every document has its own term frequencies.
2. Inverse Document Frequency: the log of the total number of documents divided by the number of documents that contain the word. Inverse document frequency determines the weight of rare words across all documents in the corpus.
Finally, the TF-IDF score of a word is the product of these two quantities.
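The formula image is not reproduced here; in the textbook form matching the two definitions above, for a term t and document d in a corpus of N documents:

```latex
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{|\{d : t \in d\}|}, \qquad
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
```

Note that scikit-learn's TfidfVectorizer uses a smoothed variant by default (it adds 1 to the document counts and to the idf), so its numbers differ slightly from this textbook formula.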
This vectorizer is predefined in the scikit-learn library, so we can import it directly.
First, we create a TfidfVectorizer object with some arguments.
You can check the meaning of the arguments in the scikit-learn documentation.
Now we fit this vectorizer on the training dataset only, and then transform both the training and testing datasets with the fitted vectorizer.
After vectorizing, the data is returned as a sparse matrix; to pass it to some machine learning algorithms we have to convert it into a dense array, and the toarray method does that work for us.
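The import/fit/transform steps above can be sketched like this; the toy texts and the specific constructor arguments (`max_features`, `stop_words`) are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["fake news spreads fast", "the budget passed today",
               "fake claims spread online", "the vote passed quietly"]
test_texts = ["fake budget claims"]

# Arguments are illustrative: cap the vocabulary size, drop English stopwords
vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")

# Fit ONLY on the training texts, then transform both splits
X_train_vec = vectorizer.fit_transform(train_texts)
X_test_vec = vectorizer.transform(test_texts)

# The result is a sparse matrix; .toarray() gives a dense numpy array
X_train_arr = X_train_vec.toarray()
print(X_train_arr.shape)
```

Fitting only on the training split matters: fitting on the test texts as well would leak information about unseen data into the vocabulary.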
Multinomial Naive Bayes Classifier
Naive Bayes: the Naive Bayes classifier technique is based on Bayes’ theorem and is particularly suited to high-dimensional data.
Formula :
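The formula image is omitted; Bayes’ theorem for a class y and features x₁…xₙ, together with the “naive” conditional-independence assumption that gives the classifier its name, reads:

```latex
P(y \mid x_1,\ldots,x_n) \;=\; \frac{P(y)\,P(x_1,\ldots,x_n \mid y)}{P(x_1,\ldots,x_n)}
\;\propto\; P(y)\prod_{i=1}^{n} P(x_i \mid y)
```

The classifier picks the class y that maximizes the right-hand side.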
Check out more information on Bayes’ theorem.
Multinomial Naive Bayes :
It is used for classification when the features are in discrete form, which makes it very useful in text processing: each text is converted into a vector of word counts. It cannot deal with negative feature values.
It is predefined in the scikit-learn library, so we can import the class into our project and then create an object of the MultinomialNB class.
1. Fit the classifier on our vectorized training data.
2. Once the classifier is fitted on the training set, we can use the predict method to get predictions on the test set.
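The two steps above can be sketched as follows; the toy texts are illustrative stand-ins for the vectorized dataset built earlier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["fake miracle cure", "senate passes bill",
               "fake celebrity hoax", "court issues ruling"]
y_train = [1, 0, 1, 0]          # 1 = fake, 0 = true
test_texts = ["fake hoax story"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# 1. Fit the classifier on the vectorized training data
clf = MultinomialNB()
clf.fit(X_train, y_train)

# 2. Predict on the test set
y_pred = clf.predict(X_test)
print(y_pred)
```

Note that MultinomialNB accepts the sparse matrix from the vectorizer directly, so the toarray conversion is optional here.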
Classification Metrics
To check how well our model performs, we use metrics to measure its quality. There are many types of classification metrics available in scikit-learn:
- Confusion Matrix
- Accuracy Score
- Precision
- Recall
- F1-Score
Confusion matrix: this metric shows, per class, how many results are correctly predicted and how many are not.
Accuracy score: the number of correct predictions over the total number of predictions.
Details of Precision, Recall and F1-Score
More Details on Classification Metrics
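These metrics can be computed like this; the toy label arrays are illustrative, standing in for the real test-set predictions.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report)

# Toy true labels and predictions (1 = fake, 0 = true)
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]

print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))   # fraction of correct predictions
print(classification_report(y_test, y_pred))
```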
As you can see, we have very good precision, recall, and F1 scores, so we can say our model performs excellently on unseen data. The accuracy score on the test dataset is 95%, which is very good.
Now let’s look at the classification report on the training set.
We also get a very good Accuracy score on the training set.
Accuracy score on train and test set
You can see both accuracies are nearly equal, so we can say our model generalizes well rather than overfitting the training set.
Save the Model
After confirming good performance on the data, we can save our model so that next time we can use it directly. The ‘joblib’ and ‘pickle’ libraries are used to save machine learning models. With the following steps, you can save and load your model.
Another method to save and load your model: check it out here.
Code :
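A joblib sketch of the save/load step; the file names and the tiny training data are illustrative assumptions. Saving the fitted vectorizer alongside the model matters, since new text must be transformed the same way at prediction time.

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["fake miracle cure", "senate passes bill"]
labels = [1, 0]                 # 1 = fake, 0 = true

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = MultinomialNB().fit(X, labels)

# Save both the fitted vectorizer and the model (file names are examples)
joblib.dump(vectorizer, "vectorizer.joblib")
joblib.dump(model, "model.joblib")

# Later: load them back and predict directly on new text
vec = joblib.load("vectorizer.joblib")
clf = joblib.load("model.joblib")
print(clf.predict(vec.transform(["fake miracle story"])))
```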
Summary
Today, we learned to detect fake news with Python. We took a fake and true news dataset, implemented a text cleaning function and a TfidfVectorizer, initialized a Multinomial Naive Bayes classifier, and fit our model. We ended up with an accuracy of 95.31%.
I hope you enjoyed this project.
Final Note :
Thanks for reading! If you enjoyed this article, please hit the clap 👏 button as many times as you can. It would mean a lot and encourage me to keep sharing my knowledge.