A few years ago, Kaggle hosted a competition sponsored by StumbleUpon to find out whether it was possible to predict which websites were likely to be consistently liked. StumbleUpon works by taking users to a random website within categories they are interested in. Users can then choose to like, dislike, or abstain from expressing an opinion about the website they’ve been sent to.
Websites that are consistently liked are considered ‘Evergreen’ by StumbleUpon. The competition also supplied additional data, such as the category each website was assigned, various ratios (e.g. image to HTML), and other metrics. This gave a large number of predictors to use in determining whether a website was likely to be considered Evergreen.
Early Data Analysis
Every data project starts with looking at the data. There are many tools for this, and one that has been incredibly helpful in deciding where to start is Tableau.
Throwing data into Tableau and seeing how it looks has become an invaluable first step. In this instance it began with seeing how many Evergreen outcomes there were by category (assigned by alchemyapi).
It was clear that alchemyapi was not the most rigorous of categorizers, and many websites were not assigned a category. Ignoring the unassigned sites, recreation and business both had a high presence, both in their overall counts and in the number of Evergreen sites they contain. People are routinely interested in things they run into often, and recreation and business are important aspects of modern life. Sports and technology did not fare as well, likely due to the constant changes they undergo.
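The same kind of category breakdown can also be produced outside Tableau. The following is a minimal pandas sketch on a toy table, not the competition data. The column names `alchemy_category` and `label` and the `?` placeholder for unassigned sites are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the competition data. Column names and the "?"
# placeholder for unassigned categories are assumptions.
df = pd.DataFrame({
    "alchemy_category": ["recreation", "business", "?", "sports",
                         "recreation", "business", "?", "sports"],
    "label": [1, 1, 0, 0, 1, 0, 1, 0],  # 1 = Evergreen, 0 = not
})

# Ignore sites that were never assigned a category.
assigned = df[df["alchemy_category"] != "?"]

# Count sites, Evergreen sites, and the Evergreen share per category.
summary = (
    assigned.groupby("alchemy_category")["label"]
    .agg(sites="count", evergreen="sum", evergreen_rate="mean")
    .sort_values("evergreen", ascending=False)
)
print(summary)
```

Looking at the rate alongside the raw count helps separate categories that are merely large from those that are disproportionately Evergreen.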
Beyond the categories, it was possible to see whether the number of images or frames present on a site had an impact on its Evergreen status.
From the above, it seems that people are more interested in websites that don’t have many images and have low frame presence. From this cursory Tableau work, some basic questions were formulated. With that done, modeling could begin as soon as the data was cleaned.
Since the outcome is categorical (whether or not the site is Evergreen), the modeling focused on comparing logistic regression (with ridge and lasso regularization) to K-nearest neighbors. Additional tools such as grid search were used to determine the optimal parameters of the model.
Logistic regression is certainly a versatile tool. It can serve as a fantastic classifier and doesn’t run into many of the issues that K-nearest neighbors does, since its classification does not rely on neighboring points. Instead, it is a probability function that makes its prediction from a weighted combination of the predictors.
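To make the "probability function" idea concrete, here is a minimal sketch of how a fitted logistic model turns predictor values into a probability. The function name, weights, and feature values are hypothetical and are not taken from the model in this post.

```python
import math

def evergreen_probability(weights, bias, features):
    """Squash a weighted sum of predictors into a probability in (0, 1)."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical weights for two made-up predictors.
weights = [1.5, -2.0]
p = evergreen_probability(weights, bias=0.0, features=[0.8, 0.1])

# Classify as Evergreen when the probability reaches 0.5.
prediction = int(p >= 0.5)
print(round(p, 3), prediction)
```

Unlike K-nearest neighbors, nothing here depends on which training points happen to sit nearby. Only the learned weights matter at prediction time.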
Both scikit-learn and statsmodels were used for the logistic regression. The two modules serve different purposes and were used together to gain an understanding that is both retrospective (statsmodels) and prospective (scikit-learn).
The detailed statsmodels output of the logistic regression is shown.
The most important information in the statsmodels output is the significance of each variable in the model. The coef and std err columns give the coefficient each variable has in the model and its standard error, respectively.
The ‘z’ column is the z-statistic for the variable, which is the coefficient divided by its standard error, and P>|z| is the probability of seeing a z-statistic at least that extreme by chance if the true coefficient were zero. Low values of P>|z|, as seen for html_ratio, frameTagRatio, spelling_errors_ratio, and numberOfLinks, show that they are likely contributors to the model.
The 95% confidence interval is a range for the coefficient built so that, if the data were resampled and the model refit many times, about 95% of the intervals constructed this way would contain the true coefficient.
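The relationship between these columns can be checked by hand. The sketch below recomputes the z-statistic, the two-sided p-value, and an approximate 95% interval from a coefficient and its standard error, using the normal approximation. The input numbers are hypothetical, and `wald_summary` is a helper name made up for this example.

```python
import math

def wald_summary(coef, std_err, z_crit=1.959964):
    """Return (z, two-sided p-value, 95% interval) for one coefficient."""
    z = coef / std_err
    # Two-sided tail probability of a standard normal beyond |z|.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    interval = (coef - z_crit * std_err, coef + z_crit * std_err)
    return z, p_value, interval

# Hypothetical coefficient and standard error.
z, p, (low, high) = wald_summary(coef=0.5, std_err=0.25)
print(z, round(p, 4), (round(low, 3), round(high, 3)))
```

A coefficient two standard errors from zero lands just under the conventional 0.05 threshold, which is why its interval only narrowly excludes zero.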
The low coefficient of numberOfLinks does, however, suggest that each additional link contributes little, despite its significance. With all of this in mind, the predictive model could be constructed using scikit-learn.
The final scikit-learn model was built by applying a grid search to determine which parameters would predict best, using a standard train-test split.
The grid search evaluates every combination of the supplied search parameters and keeps the model with the best score. That model is then available for reference and for future predictions.
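A minimal sketch of this workflow with scikit-learn's `GridSearchCV` is given here. It uses synthetic data from `make_classification`, and the parameter grid is an assumption chosen to cover lasso (`l1`) and ridge (`l2`) penalties, not the grid used for the model in this post.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the cleaned competition features.
X, y = make_classification(n_samples=400, n_features=8,
                           n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Hypothetical grid: lasso vs ridge, and several regularization strengths.
param_grid = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1.0, 10.0]}

# liblinear supports both l1 and l2 penalties.
search = GridSearchCV(LogisticRegression(solver="liblinear"),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_, search.best_score_)
print(search.score(X_test, y_test))  # accuracy of the refit best model on held-out data
```

Because `refit` defaults to true, `search.best_estimator_` is retrained on the full training split and can be reused directly for later predictions.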
The classification report compares the modeled and actual values of the Evergreen classification. It reports both precision (how many of the selected values are relevant: TruePositive/(TruePositive+FalsePositive)) and recall (how many of the relevant values were selected: TruePositive/(TruePositive+FalseNegative)).
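As a quick check on those two definitions, the sketch below computes precision and recall from a confusion matrix and prints scikit-learn's report for the same labels. The label vectors are made up for illustration and are not model output from this post.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up actual and predicted Evergreen labels.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]

# For binary labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted Evergreen that really are
recall = tp / (tp + fn)     # share of actual Evergreen that were found

print(precision, recall)
print(classification_report(y_true, y_pred))
```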
From above it would seem that the model performs well, but could certainly do with a little bit of tweaking.
Natural Language Processing
Finally, some natural language processing was performed on the data with the help of scikit-learn pipelines. By creating a binary variable for whether or not the website was in the culture and politics category, it was possible to develop a model of which words are most likely to be associated with that category.
Building a pipeline made it very easy to apply several operations to the text at once, so the data could be cleaned and modeled in one pass.
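A pipeline of this kind might look like the following sketch. The choice of `TfidfVectorizer` followed by `LogisticRegression`, the step names, and the toy texts and labels are all assumptions for illustration. The post does not specify which steps were used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy documents. 1 marks the target category, 0 marks everything else.
texts = [
    "city council votes on new museum funding",
    "festival celebrates local folk traditions",
    "parliament debates election reform bill",
    "quarterly earnings beat analyst forecasts",
    "striker scores twice in cup final",
    "new phone ships with faster processor",
]
labels = [1, 1, 1, 0, 0, 0]

# Cleaning (lowercasing, stop-word removal), vectorizing, and modeling in one object.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("clf", LogisticRegression()),
])
pipe.fit(texts, labels)

# Raw text goes in; the pipeline applies every step in order.
proba = pipe.predict_proba(["council debates festival funding"])
print(proba)
```

Keeping the text cleaning inside the pipeline means the exact same transformation is applied at training and prediction time.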
Once the pipeline was fit to the data, the coefficients and the words associated with them could be obtained from the fitted model.
The coefficients and the words were then turned into a dataframe, which allowed for sorting by how influential each word was on the basis of its coefficient. This led to the top twenty-five words for websites in the culture and politics category.
From the above, it would seem that hosting a wedding with raw pumpkin dough on Halloween would be sure to categorize you into culture and politics.