Passive Aggressive Algorithm — For big data models

Sriram · Published in Geek Culture · Jun 13, 2021 · 3 min read

Passive-Aggressive algorithms are a family of machine learning algorithms that are widely used in big data applications.

Passive-Aggressive algorithms are generally used for large-scale learning and belong to the family of online-learning algorithms. In online machine learning, the input data arrives in sequential order and the model is updated step by step, as opposed to conventional batch learning, where the entire training dataset is used at once.

This is very useful in situations where there is a huge amount of data and it is computationally infeasible to train on the entire dataset at once because of its sheer size.

An online-learning algorithm receives a training example, updates the classifier, and then throws the example away.
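As a minimal sketch of this pattern (synthetic data and parameter choices are assumptions), scikit-learn exposes it through partial_fit, which updates the model one example (or mini-batch) at a time:

import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

# A minimal sketch of online learning: the model sees each example once,
# updates its weights, and the example is then discarded.
rng = np.random.RandomState(42)
model = PassiveAggressiveClassifier()

for step in range(1000):
    # Simulate one incoming example; in practice this would come from a stream
    x = rng.randn(1, 10)
    y = np.array([int(x[0, 0] > 0)])  # toy label: sign of the first feature

    # classes tells the model the full set of labels it will ever see
    model.partial_fit(x, y, classes=np.array([0, 1]))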

A good example of this is detecting fake news on a social media platform such as Twitter or WhatsApp, where new data is added every second. Reading data from Twitter continuously would produce a huge stream, so an online-learning algorithm is ideal.

Passive-Aggressive algorithms are somewhat similar to the Perceptron in that they do not require a learning rate. However, they do include a regularization parameter. The mathematics of the algorithm is out of the scope of this article, but I suggest watching the excellent video on how the algorithm works by Dr. Victor Lavrenko, linked in the references.

Below is a use case: fake news detection using a Passive-Aggressive classifier in Python.

Reading the Data and DataFrame
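A sketch of this step, assuming the dataset is a CSV file (here called news.csv) with text and label columns, as in the commonly used fake-news dataset:

import pandas as pd

# Read the dataset into a DataFrame (file name and column names are assumptions)
df = pd.read_csv('news.csv')

# Inspect the shape and the first few rows
print(df.shape)
print(df.head())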

Table overview

Getting the labels from the DataFrame and splitting the dataset into training and testing sets.
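A hedged sketch of this step, assuming the label column is named label and the article body is in a text column:

from sklearn.model_selection import train_test_split

# Extract the labels (FAKE / REAL) from the DataFrame
labels = df['label']

# Split the text and the labels into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(
    df['text'], labels, test_size=0.2, random_state=7
)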

Initialize a TfidfVectorizer

Initialize it with English stop words and a maximum document frequency of 0.7 (terms with a higher document frequency will be discarded). Stop words are the most common words in a language, which are filtered out before processing the natural-language data. A TfidfVectorizer turns a collection of raw documents into a matrix of TF-IDF features.
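This corresponds to roughly the following initialization (variable name is an assumption; the parameters match the description above):

from sklearn.feature_extraction.text import TfidfVectorizer

# Ignore English stop words and discard terms appearing in more than 70% of documents
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)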

Now, fit and transform the vectorizer on the training set, and transform the test set with it.
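In code, this step looks roughly like:

# Learn the vocabulary and IDF weights from the training set, then transform it
tfidf_train = tfidf_vectorizer.fit_transform(x_train)

# Only transform the test set, reusing the vocabulary learned from the training data
tfidf_test = tfidf_vectorizer.transform(x_test)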

Initialize a Passive-Aggressive Classifier.

PassiveAggressiveClassifier(max_iter=50)
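Fitting this classifier on the TF-IDF features of the training set might look like the following sketch (max_iter=50 as above; variable names are assumptions):

from sklearn.linear_model import PassiveAggressiveClassifier

# Initialize the classifier with an upper bound of 50 passes over the training data
pac = PassiveAggressiveClassifier(max_iter=50)

# Train on the TF-IDF features of the training set
pac.fit(tfidf_train, y_train)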

Predicting on the TF-IDF-transformed test set.

Calculating the accuracy with accuracy_score() from sklearn.metrics.
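A sketch of the prediction and scoring step:

from sklearn.metrics import accuracy_score

# Predict labels for the TF-IDF features of the test set
y_pred = pac.predict(tfidf_test)

# Compare the predictions against the true labels
score = accuracy_score(y_test, y_pred)
print(f'Accuracy: {round(score * 100, 2)}%')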

Accuracy: 92.66%

Printing out a confusion matrix


Gaining insight into the number of true and false positives and negatives.
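A sketch of this step, assuming the two classes are labelled FAKE and REAL:

from sklearn.metrics import confusion_matrix

# Rows are the true labels, columns the predicted labels, in the order given
confusion_matrix(y_test, y_pred, labels=['FAKE', 'REAL'])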

array([[589,  49],
[ 39, 590]], dtype=int64)

From this algorithm, we get 589 true positives and 590 true negatives.

References

[1] Fake News Detection using Passive Aggressive and TF-IDF Vectorizer — https://www.irjet.net/archives/V7/i9/IRJET-V7I9274.pdf

[2] Passive Aggressive Classifier — https://www.youtube.com/watch?v=TJU8NfDdqNQ
