News article virality prediction
The previous two articles covered information retrieval and some basic code to help you implement web scraping. This article focuses on extracting news articles and predicting their virality using several machine learning models. Virality is measured as the number of shares an article received.
See the links below if you need background on web scraping.
- Information-retrieval Part-1 : Extracting webpages
- Information-retrieval Part-2 : Simplifying the Information
2. Libraries used:
newspaper3k, TextBlob, sklearn, numpy, requests, BeautifulSoup, nltk, pandas, matplotlib, seaborn, catboost, xgboost
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from catboost import CatBoostRegressor
from sklearn.linear_model import RidgeCV
from xgboost import XGBRegressor
from xgboost import plot_importance
from sklearn.metrics import mean_squared_error as RMSE  # caution: returns MSE by default; take the square root to get RMSE
from sklearn.decomposition import PCA
import seaborn as sns
from newspaper import Article
from bs4 import BeautifulSoup
from textblob import TextBlob
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
3. Working with the dataset
- A few precautionary steps are taken first, e.g. formatting the column names and dropping unnecessary columns
- Extracting the important features (below)
The XGBoost library provides a method, plot_importance(), that plots the importance of each feature in the dataset. Following is the plot created for the UCI Online News Popularity dataset.
The top 20 features are extracted, using an importance-score threshold of 600. I managed to calculate 9 of them, as the rest were not described clearly enough to formulate. The following features were calculated:
- n_tokens_title: Number of words in the title
- n_tokens_content: Number of words in the content
- n_unique_tokens: Rate of unique words in the content (#unique words / #words)
- average_token_length: Average length of the words in the content
- avg_positive_polarity: Average polarity of positive words
- n_non_stop_unique_tokens: Rate of unique non-stop words in the content
- num_hrefs: Number of links in the article
- global_subjectivity: Subjectivity of the article content
- global_sentiment_polarity: Text sentiment polarity
Let us take a brief look at the two libraries, newspaper3k and TextBlob, which are used to calculate the above features.
Newspaper is an amazing Python library for extracting & curating articles.
The following code focuses on extracting news from a news article URL and accessing various features like the title of the article, content text etc.
TextBlob is a Python library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
The TextBlob library helps us in calculating the polarity and subjectivity of words and sentences. It is used as follows:
Having done this, the first 6 features can be easily calculated using nltk.tokenize and nltk.corpus.stopwords (code in the GitHub repository).
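To give an idea of the token-based features, here is a self-contained sketch. To keep it runnable without downloading NLTK data, it uses a regex tokenizer and a tiny hand-written stop-word set as stand-ins for word_tokenize and stopwords.words("english"); both stand-ins, the function names, and my reading of n_non_stop_unique_tokens (unique non-stop words / all non-stop words) are assumptions.

```python
import re

# Tiny illustrative stop-word set; nltk's stopwords.words("english") is the fuller source.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "in", "to", "on", "it"}


def tokenize(text):
    """Lowercase word tokens; a regex stand-in for nltk.tokenize.word_tokenize."""
    return re.findall(r"[a-z0-9']+", text.lower())


def token_features(title, content):
    """Compute the word-count based features for one article."""
    tokens = tokenize(content)
    non_stop = [t for t in tokens if t not in STOP_WORDS]
    n = len(tokens)
    return {
        "n_tokens_title": len(tokenize(title)),
        "n_tokens_content": n,
        "n_unique_tokens": len(set(tokens)) / n if n else 0.0,
        "average_token_length": sum(map(len, tokens)) / n if n else 0.0,
        "n_non_stop_unique_tokens": len(set(non_stop)) / len(non_stop) if non_stop else 0.0,
    }


features = token_features("Cat sat", "The cat sat on the mat and the cat slept")
print(features)
```

For this toy sentence there are 10 tokens, 7 of them unique, and 4 unique words among the 5 non-stop words.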
Lastly, num_hrefs is computed with the method below:
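One plausible way to do this, assuming the article's raw HTML is available (for example from newspaper's article.html), is to count the anchor tags that carry an href attribute with BeautifulSoup. The function name count_hrefs is hypothetical.

```python
from bs4 import BeautifulSoup


def count_hrefs(html):
    """Count the <a> tags in the page that have an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("a", href=True))


sample = (
    '<p><a href="/a">one</a> '
    '<a href="https://example.com">two</a> '
    '<a name="anchor-only">no link</a></p>'
)
print(count_hrefs(sample))  # 2
```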
4. Scraping news articles
I scraped news articles from the British Broadcasting Corporation (BBC) website.
The article links are first collected into a single list, and the articles are then scraped using the newspaper3k library.
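The link-collection step could look roughly like the sketch below. The function names, the "/news/" path filter, and the same-site check are assumptions for illustration; the actual filtering rules used for the BBC pages are in the repository.

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def collect_article_links(index_html, base_url, path_prefix="/news/"):
    """Return de-duplicated absolute links on the same site under path_prefix."""
    soup = BeautifulSoup(index_html, "html.parser")
    base_host = urlparse(base_url).netloc
    links = []
    for anchor in soup.find_all("a", href=True):
        url = urljoin(base_url, anchor["href"])   # resolve relative links
        parsed = urlparse(url)
        if parsed.netloc == base_host and parsed.path.startswith(path_prefix):
            if url not in links:
                links.append(url)
    return links


def fetch_index(url):
    """Download an index page and return its HTML; raises on HTTP errors."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text
```

The list returned by collect_article_links(fetch_index(index_url), index_url) can then be passed, one URL at a time, to newspaper3k.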
Finally, the 9 features described in the previous section are calculated for the article texts and a final dataset of news articles’ features is created.
NOTE: The 9 features defined above are sliced from the original UCI dataset to match the features of the news articles’ dataset.
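A short sketch of that slicing step with pandas follows. It assumes the target column is called shares and that the raw column names may carry stray whitespace, which is why they are stripped first; the function name select_features is hypothetical.

```python
import pandas as pd

FEATURES = [
    "n_tokens_title", "n_tokens_content", "n_unique_tokens",
    "average_token_length", "avg_positive_polarity",
    "n_non_stop_unique_tokens", "num_hrefs",
    "global_subjectivity", "global_sentiment_polarity",
]


def select_features(df, target="shares"):
    """Strip whitespace from column names, then return (X, y) for the 9 features."""
    df = df.rename(columns=str.strip)
    return df[FEATURES], df[target]


# Toy frame with padded column names and an extra column, standing in for the real CSV.
raw = pd.DataFrame({" " + name: [0.1, 0.2] for name in FEATURES})
raw[" shares"] = [593, 1200]
raw[" timedelta"] = [731, 731]

X, y = select_features(raw)
print(X.shape, y.tolist())
```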
5. Testing on Machine Learning models
The UCI dataset was split into training and test sets, and 4 different models were trained and tested, as follows:
- XGBoost Regressor
Two versions of XGBoost were used, one using gblinear as the booster and the other using gbtree.
The average result on multiple runs gave RMSE of ~10973 for gblinear and ~11430 for gbtree.
The average result on multiple runs gave RMSE of ~10973.
- RidgeCV Linear Regressor
The average result on multiple runs gave RMSE of ~10952.
The average result, with the parameters tested, was an RMSE~11200.
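The evaluation loop can be sketched as below: each model is fitted on several random train/test splits and its test RMSE is averaged. The helper names, the number of runs, the 80/20 split, and the synthetic data are assumptions for illustration; only two scikit-learn models are included so the snippet stays dependency-light.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


def compare_models(models, X, y, n_runs=3, test_size=0.2):
    """Average each model's test RMSE over n_runs random train/test splits."""
    results = {}
    for name, model in models.items():
        scores = []
        for seed in range(n_runs):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, test_size=test_size, random_state=seed
            )
            model.fit(X_tr, y_tr)
            scores.append(rmse(y_te, model.predict(X_te)))
        results[name] = float(np.mean(scores))
    return results


# Synthetic stand-in data with 9 features (NOT the UCI dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 9))
y_demo = X_demo @ rng.normal(size=9) + rng.normal(scale=0.5, size=300)

models = {
    "RidgeCV": RidgeCV(alphas=(0.1, 1.0, 10.0)),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
}
results = compare_models(models, X_demo, y_demo)
print(results)
```

Because XGBRegressor (for example with booster="gblinear" or booster="gbtree") and CatBoostRegressor expose the same fit/predict interface, they can be added to the models dictionary in the same way.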
The final plot of the evaluations looks like this
RidgeCV had the lowest RMSE among the models, though only by a small margin. The models were then used to predict the number of shares for the scraped news articles. RidgeCV's predictions were, on average, about 10x those of the other models. Since the true share counts were not available, I could not determine which predictions were accurate.
Thanks for your attention!
The links to the repository and the notebook are given below. Feel free to comment with any improvements that could be made! :-)
Link to the GitHub repository : News Virality Prediction
Link to the Kaggle notebook : News Virality Prediction