Article Recommendation System Using Python

Nayana Kumari
Web Mining [IS688, Spring 2021]
14 min read · May 9, 2021


The main motivation for me to pick this topic for my post is probably that I learned the value of reading late in life, and once I did, I regretted not starting back in my school days. By reading, I mean books and articles outside my education curriculum. Most of us know the value of reading; for me, it truly changed my viewpoint on many things in life that I had been taught by family and the society around me. It introduced new perspectives, and I started to question norms that I otherwise never would have. Most importantly, when I made reading a daily habit, it also trained my mind to be analytical and to make decisions based on critical thinking.

In earlier years, between 2009 and 2013, I was only interested in reading books; over the past few years, however, I discovered that good reading material also exists on the internet, like scholarly write-ups, Ribbon Farm, etc., and it motivates me.

“Internet is the world’s largest library. It’s just that all the books are on the floor”

I figured that reading articles has its own advantages. For example, articles usually carry the latest information and are much more agile: if there is a breakthrough or a reinvention of something, it shows up in articles far sooner than in books. Another thing that really works for me is reading on topics I assume I am not interested in, only to find out they are actually interesting. I would not do that with a book, as a book demands commitment in time and attention. Articles have often exposed me to new topics, information, and authors.

When I was not sure what to read next, I used to ask reader friends for suggestions on which book to pick, but I often ran into a dead end, as not many shared the habit. Thanks to recommender systems, I never run into this problem anymore. So many services now present us with recommendations based on our interests and preferences without our even asking for them.

Courtesy: Alibaba Tech

Recommender systems analyze readers' past article selections and reading behavior to suggest items they might like to read next. Is that not a great deal for those who are passionate about reading regularly and are often looking for material on the internet?

Speaking of other areas of use: recommender systems are used extensively on e-commerce websites to suggest products based on earlier purchases, on job portals to suggest jobs similar to the postings you applied for, and on movie or music streaming services based on recently watched or heard content. They are almost everywhere, helping both you and the companies by boosting sales and revenue and retaining customers.

There are basically two kinds of recommendation methods that I learned about-

Content-Based recommendation

A content-based recommendation system works by analyzing the similarity among items or users based on their attributes. For users, these could be demographic information such as location, age, etc.; for items, they can be the item name, specifications, category, and so on. For each item, an item profile (essentially a feature vector) is created; for text, this might be the set of important words in the document.
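
To make the idea of an item profile concrete, here is a minimal sketch (the article and the hand-picked stop-word list are made up purely for illustration) that reduces an article to a crude bag-of-words feature vector:

from collections import Counter

# A made-up article with a few attributes (hypothetical values)
article = {
    "title": "Edge computing for IoT",
    "category": "technology",
    "text": "IoT devices push computing to the edge of the network",
}

# Keep only the informative words; the stop-word list is hand-picked here
stop_words = {"the", "of", "to", "for"}
words = [w.lower() for w in article["text"].split() if w.lower() not in stop_words]

# The item profile: a term -> frequency mapping, i.e., a crude feature vector
profile = Counter(words)
print(profile)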

Basically, content-based filtering uses item features to recommend other items similar to what the user likes, based on their previous actions or explicit feedback. It works well for recommending reading lists and articles on platforms like Medium, Reddit, etc. It is also suitable when we don't have enough data about past user behavior, ratings, or how other users perceive the content.

“Digital behavior is just a replication of human behavior.”

Collaborative filtering

Collaborative filtering (CF) works with additional inputs beyond item attributes. It looks at a user's past reactions, behavior, and preferences, and it may also consider the views, opinions, and ratings other users gave a particular piece of content. The algorithm can be made more interesting by making it dynamic: over time, the system can perform feature learning on its own, meaning it starts to learn for itself which features to use.
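
Collaborative filtering is not what I build in this post, but to make the contrast concrete, here is a toy user-based CF sketch over a hypothetical ratings matrix (all values are invented):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical ratings: rows are users, columns are articles, 0 = not read
ratings = np.array([
    [5, 4, 0, 0],   # user 0: the user we recommend for
    [4, 5, 0, 2],   # user 1: similar tastes to user 0
    [1, 0, 5, 4],   # user 2: different tastes
])

user_sim = cosine_similarity(ratings)         # user-to-user similarity
scores = user_sim[0] @ ratings                # weight all ratings by similarity to user 0
scores[ratings[0] > 0] = -1                   # mask articles user 0 already read
print("Recommend article:", scores.argmax())  # article 3, favored by the similar user 1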

In this post, I plan to use the content-based method to create a recommendation system for articles on the web. These articles could be in HTML, video, or rich-text format.

Objective:

Courtesy: Unsplash

I plan to take a dataset that provides the attributes of articles and of the target user; I will explain the attributes later in this post. The idea is to assume the user is reading one of the articles from the dataset and to provide that article as input. The recommender system will then return a list of articles on the same or similar topics that the user might want to read next. This model is beneficial when a user is researching a particular subject and subsequently wants to read similar articles.

Tools:

I plan to use Python as my programming language, with the following set of libraries for this analysis.

import pandas as pd
from IPython.display import display
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

The codebase for this analysis can be found here:

https://github.com/nt27web/RecomendarSystem

Data Analysis and preparation:

I have used the dataset below, which is in CSV format.

The dataset contains a real sample of 12 months of logs (Mar. 2016 to Feb. 2017) from CI&T's internal communication platform (DeskDrop).
It contains about 3k public articles shared on the platform.

CSV link:

https://www.kaggle.com/gspmoreira/articles-sharing-reading-from-cit-deskdrop?select=shared_articles.csv

# Load the data from the CSV file
articles_df = pd.read_csv('shared_articles.csv')
print(articles_df.shape)

Result: 3122 rows, 13 columns. Enough records to use for my analysis.

The dataset has the following columns and values:

timestamp: the time when an event occurred. It is not helpful for our analysis, as we will focus on the items themselves.

print(articles_df['timestamp'].head(5))
Timestamp values

eventType: Article shared or article removed at a particular timestamp.

print(articles_df['eventType'].value_counts())

I will filter out the removed articles, as they will not help the recommendation.

articles_df = articles_df[articles_df['eventType'] == 'CONTENT SHARED']
print(articles_df.shape)

Result: 3047 rows, 13 columns. Not a significant drop from the original dataset after eliminating the removed articles.

contentId: the article ID in numeric format.

print(articles_df['contentId'].head(5))

Since I plan to use the article title as the unique qualifier, this ID has no use in my analysis, so I will remove this column.

authorPersonId: Author Id.

print(articles_df['authorPersonId'].head(5))
#Retrieve unique values
print(len(articles_df['authorPersonId'].unique()))

Result: 252 unique authors. I may use this to build a top 5/10 list of articles per author to enhance the recommendations; a quick look at the distribution is sketched below.
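
A quick way to check how articles are distributed across authors, assuming the top-N idea above (this is exploratory only and not part of the final recommender):

# The ten most prolific authors and their article counts
print(articles_df['authorPersonId'].value_counts().head(10))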

authorSessionId: the session ID of the author. The author might have created articles across different sessions.

print(articles_df['authorSessionId'].head(5))

Since this is contextual, and I focus on attributes rather than transactional details, I will remove this column.

authorUserAgent: the browser the author used.

print(articles_df['authorUserAgent'].tail(5))
#Retrieve unique browsers
print(len(articles_df['authorUserAgent'].unique()))

Result: 115 unique user agents. As this column doesn't contribute any meaningful distribution, I will drop it.

authorRegion: the state/region of the author.

print(articles_df['authorRegion'].tail(5))
print(len(articles_df['authorRegion'].unique()))
print(articles_df['authorRegion'].isnull().sum(axis=0))
print(articles_df['authorRegion'].isna().sum(axis=0))

So, there are 20 records with the region as null, and more than 50% of the total records have 'NaN' in the region column. This makes the column an obvious candidate for removal.

authorCountry: the country of the article's author. The distribution is as follows.

print(articles_df['authorCountry'].tail(5))
print(articles_df['authorCountry'].unique())
print(articles_df['authorCountry'].isnull().sum(axis=0))
print(articles_df['authorCountry'].isna().sum(axis=0))
Country-wise distribution

As we can see, only two countries (Brazil and the USA) contributed most of the articles, and the other countries have an insignificant number. On top of that, roughly as many records as the USA contributed have no country information at all. So, I will remove this column from the analysis.

contentType: the format in which an article is shared.

print(articles_df['contentType'].unique())
print(articles_df['contentType'].isnull().sum(axis=0))

Clearly, there are three formats in which the articles are shared: HTML, video, and rich text. I could use this column to group them, but that would limit the recommendations to a single format. Unless a user has a specific preference (see the sketch below), I will let all three formats appear in the recommendations.
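
If a user did have a format preference, a small helper could restrict the article pool before computing recommendations; filter_by_format is my own illustration, not part of the dataset or the final code.

# Optionally restrict the article pool to one contentType value, e.g. 'HTML'
def filter_by_format(df, fmt=None):
    return df if fmt is None else df[df['contentType'] == fmt]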

url: the URL of the article; we can use this as a reference to navigate to the article.

print(articles_df['url'].head(5))
print(articles_df['url'].isnull().sum(axis=0))
print(articles_df['url'].isna().sum(axis=0))

This column currently has no blank or null values, which is good: if I surface it alongside the recommended article title, users can navigate directly to the article.

title: the title/headline of the article. Let's look at the nature of this column.

print(articles_df['title'].head(5))
print(articles_df['title'].isnull().sum(axis=0))
print(articles_df['title'].isna().sum(axis=0))

The good news here is that no record has a null or blank title. I will use this column both to label the recommendations and as the input to the recommendation system.

text: the content of the article.

print(articles_df['text'].head(5))
print(articles_df['text'].isnull().sum(axis=0))
print(articles_df['text'].isna().sum(axis=0))

This is the most critical column in the analysis, since I am building a content-based recommendation system. I will use this field to create the TF-IDF matrix.

I will explain the TF-IDF method later in this article.

lang: the language in which the article is written. Below is the distribution of articles over languages.

print(articles_df['lang'].unique())
print(articles_df['lang'].isnull().sum(axis=0))
print(articles_df['lang'].isna().sum(axis=0))
Language distribution of Articles

So, the most used languages are English and Portuguese. That makes sense, given that the top contributing countries are the USA and Brazil.

Though people might read in multiple languages, for simplicity I will restrict the language to English.

articles_df = articles_df[articles_df['lang'] == 'en']
print(articles_df.shape)

Now we have about 2.2k records to work with.

Let me summarize: I will start with about 2.2k records, keeping the language as English and the event type as 'CONTENT SHARED.'

Exploratory Data Analysis:

Now that I have the data fields identified and cleansed, I will start the analysis that leads to building the recommendation system.

First, I will create a dataframe using the pandas library.

articles_df = pd.DataFrame(articles_df, columns=['authorPersonId', 'contentType', 'url', 'title', 'text'])

Observe that I have kept only the relevant columns, following the rationale explained in the section above.

Let's try to find the articles that are similar to the one the user is currently reading; that article is the input to my recommendation system. We can compute pairwise cosine similarity and recommend the articles whose scores against the input article are highest. While doing so, the recommender runs into a classic natural-language-processing problem: the articles differ in style and sentence framing, so comparing raw text directly is next to impossible. I will therefore extract features from the text. A feature set acts like an index: it focuses on the most informative words and represents each article in terms of them.

I will first add a column to my dataframe and call it 'soup.' The soup is simply a concatenation of all the feature fields; in this case, it will contain only the values of the 'text' column.

def create_soup(x):
    # For now, the soup is just the article text
    return str(x['text'])

articles_df['soup'] = articles_df.apply(create_soup, axis=1)

I can later add more columns to the soup to incorporate additional features.
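
As a sketch of such an enhancement (not used in the rest of this post), the soup could blend in the title, repeated so it weighs more heavily than the body text; create_soup_v2 is a hypothetical name:

# Hypothetical richer soup: title (repeated for extra weight) plus body text
def create_soup_v2(x):
    return ' '.join([str(x['title'])] * 2 + [str(x['text'])])

# articles_df['soup'] = articles_df.apply(create_soup_v2, axis=1)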

Several methods are available to vectorize the values in the 'soup' column. I will use Term Frequency-Inverse Document Frequency (TF-IDF) vectors for each article. This produces a matrix in which each column represents a word in the corpus vocabulary (all the words that appear in at least one article) and each row represents an article (title).

The TF-IDF score is the frequency of a word occurring in an article, down-weighted by the number of articles in which the word occurs. This reduces the weight of words that appear frequently across all articles and would otherwise dominate the final similarity score.

TF-IDF assigns a weight to each term (word) in a document based on term frequency (TF) and inverse document frequency (IDF):

tf-idf(t, d) = tf(t, d) × idf(t), where idf(t) = log(N / df(t)), N is the total number of articles, and df(t) is the number of articles containing term t.

So if a word occurs often in one article but rarely in the other articles, its TF-IDF value will be high.
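
A toy corpus shows this behavior: 'iot' appears in only one of the three documents below, so within that document it gets a higher weight than 'data', which appears in two (the documents are invented for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

docs = [
    "data science and iot",
    "data pipelines for data engineering",
    "science of reading",
]
vec = TfidfVectorizer(stop_words='english')
m = vec.fit_transform(docs)

# In row 0, 'iot' outweighs 'data' because 'data' also occurs in document 1
print(pd.DataFrame(m.toarray(), columns=vec.get_feature_names()))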

I will use scikit-learn, which provides a built-in TfidfVectorizer class that produces the TF-IDF matrix.

# Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

# Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(articles_df['soup'])

So, I set stop_words='english' to remove English stop words like 'the,' which is consistent with having already filtered the dataset to articles published in English.

Now I have the TF-IDF matrix; let's take a look at it.

# Output the shape of tfidf_matrix
print(tfidf_matrix.shape)

# Array mapping from feature integer indices to feature name.
print(tfidf.get_feature_names()[5000:5010])

Result:

The output shows that about 45k distinct words are shared among the 2.2k articles.

Now that my word vectors are ready, I can compute pairwise similarities among the articles based on their words. Several techniques are available for measuring similarity, but cosine similarity is the most appropriate here: it measures the angle between two vectors rather than just their Euclidean distance when they are plotted in a high-dimensional space.

Cosine Similarity:

As mentioned earlier, it is independent of magnitude and is relatively easy and fast to calculate (especially when used in conjunction with TF-IDF scores). Mathematically, it is defined as follows:

similarity(A, B) = cos(θ) = (A · B) / (‖A‖ × ‖B‖)
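
A quick numeric sanity check of the formula (toy vectors, chosen by hand) shows the two properties that matter here: magnitude is ignored, and vectors with no shared terms score 0.

import numpy as np

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

A = np.array([1.0, 2.0, 0.0])
B = np.array([2.0, 4.0, 0.0])   # same direction as A, twice the magnitude
C = np.array([0.0, 0.0, 3.0])   # orthogonal to A (no shared terms)

print(cos(A, B))  # 1.0 -> identical direction, magnitude ignored
print(cos(A, C))  # 0.0 -> nothing in common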

I will use the cosine_similarity() function of the sklearn library.

cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix, dense_output=True)
display(cosine_sim.shape)
display(cosine_sim)

Result: Cosine similarity vector shape and sample values

So, cosine similarity takes each pair of article vectors and finds the cosine of the angle between them: the smaller the angle, the more similar the articles. Since TF-IDF vectors are non-negative, the values lie between 0 and 1. In this case, the result is a 2.2k × 2.2k matrix with values ranging from 0 to 1.

The similarity matrix is ready. I will now create a reverse map from titles to row indices using the 'title' field, removing duplicate titles if any.

# Reset index of main DataFrame and construct reverse mapping as before
metadata = articles_df.reset_index()
indices = pd.Series(metadata.index, index=metadata['title']).drop_duplicates()
display(indices[:10])

Result:

Now the indices are ready, and I can build the recommender using these maps and matrices. For simplicity, I wrapped it in a function. The function takes the article title as input, along with the indices (mapping titles to their row positions), the cosine similarity matrix, and the prepared dataset.

# Function that takes an article title as input and outputs the most similar articles
def get_recommendations(title, indices, cosine_sim, data):
    # Get the index of the article that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all articles with that article
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the articles based on the similarity scores, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Take the 10 most similar articles, skipping the first entry,
    # which is the input article itself (similarity 1.0)
    sim_scores = sim_scores[1:11]

    # Get the article indices
    article_indices = [i[0] for i in sim_scores]

    # Return the titles of the top 10 most similar articles
    return data['title'].iloc[article_indices]

First, it looks up the row index of the input title. It then enumerates that row of the cosine similarity matrix, pairing every article index with its similarity score against the input article.

The result looks similar to this -

Cosine Similarity enumerated list

Next, it sorts the articles by their similarity scores, highest first.

Then it picks the top 10 most similar articles, skipping the very first entry, which is the input article itself (its self-similarity is always 1).

Finally, it maps those 10 indices back to titles and returns the list.
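
Since the url column was kept for navigation, a small variant of the same function (my own sketch, not part of the original walkthrough) could return each article's URL next to its title:

# Variant that also returns the article URL for direct navigation
def get_recommendations_with_url(title, indices, cosine_sim, data):
    idx = indices[title]
    sim_scores = sorted(enumerate(cosine_sim[idx]), key=lambda x: x[1], reverse=True)[1:11]
    return data[['title', 'url']].iloc[[i[0] for i in sim_scores]]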

Time to test the recommendation system with various input titles. Here, an input title means the user has read or is reading that article, so the recommender suggests articles similar to it.

print(get_recommendations('Intel\'s internal IoT platform for real-time enterprise analytics', indices, cosine_sim, metadata))
Top 10 Recommended articles based on user input title

Notice that all the recommended articles are about IoT, since the input title mentioned IoT. It is also apparent that these articles have IoT in their titles and share similar usage of more than one word. Notice, too, that the variants 'IoT' and 'Internet of Things' have both been detected as similar, which was my intention.

Let’s try with the second title as input.

print(get_recommendations('Google Data Center 360° Tour', indices, cosine_sim, metadata))
Top 10 Recommended articles based on user input title

Notice that it has found articles related to Google and Google I/O. It has also surfaced an article about YouTube that doesn't mention Google at all. So, evidently, the recommender can detect the variants of a company's products. This is encouraging.

I will use one final input to conclude this test.

print(get_recommendations('The Rise And Growth of Ethereum Gets Mainstream Coverage', indices, cosine_sim, metadata))
Top 10 Recommended articles based on user input title

So, maintaining its reputation so far, my recommender has returned a range of articles related to Ethereum and Bitcoin. It has also identified articles discussing policies around these financial instruments across the globe.

Conclusion

So, as it unfolded, using a simple list of articles and a limited set of attributes, I created a recommendation system based on content-based filtering. The system suggests articles the user may want to read or watch next, based on the article they are currently reading or watching. This is particularly helpful when a system starts without any user history. As the system evolves, it should be remodeled around user history, behavior patterns, and so on; the recommended approach then is collaborative filtering, which can also evolve to decide on its own which features to use. At the start, though, content-based filtering is the way to go, as it avoids cold-start problems: it can recommend when users have no history, it needs no first rater, and it can serve users with unique tastes.

Limitations:

The amount of data was insufficient for a more robust and accurate recommender. As you have seen, the similarity between distinct articles topped out at roughly 0.4. A dataset with more articles on more varied topics might yield more relevant recommendations, enhancing the user experience by better meeting expectations. Due to their skewed distributions, columns such as authorCountry and authorPersonId could not be used as features; better-distributed columns might let the recommender suggest articles across more topics, increasing diversity in terms of subject matter.
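
As a quick check of that claim about low scores, one can mask the diagonal of the cosine_sim matrix computed earlier (self-similarity is always 1) and inspect the largest remaining value:

import numpy as np

off_diag = cosine_sim.copy()
np.fill_diagonal(off_diag, 0)   # remove self-similarity
print("max pairwise similarity:", off_diag.max())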
