News Recommendation System

A simplified explanation of steps in building a news recommendation model.

Shanty Shabu
The Startup
4 min read · Dec 21, 2020


Image source: https://images.app.goo.gl/w4tjsZ9kuwMyvcRT9

Online news reading has become extremely popular, as the web gives access to news articles from a huge number of sources around the world. The main aim of our system is to help users find news articles that are interesting to read. Here we explain the 3 steps that lead to a recommendation model.

1. Data Preprocessing

This is one of the major steps in building the recommendation system. It is an integral part of machine learning, as the quality of the data affects how much information our model can derive from it. Therefore, it is critical that we preprocess our data before feeding it into our model.

  1. Imputing the missing values

Different libraries such as pandas, NumPy, and scikit-learn are available for handling missing values. It can be done in different ways:

a. Deleting the entire row when it has more than 70–75% missing values.

b. Assigning a unique category when missing values are found in a categorical feature like gender.
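As a rough sketch of both approaches with pandas, using a made-up toy table (the column names are illustrative, not from a real dataset):

```python
import pandas as pd

# Toy dataset with missing values; column names are illustrative only.
df = pd.DataFrame({
    "title": ["Covid update", None, "Market rally", "Election news"],
    "category": ["health", None, "finance", None],
    "clicks": [120.0, None, 87.0, 45.0],
})

# a. Drop rows with more than ~75% missing values:
#    keep only rows that have at least 25% non-null values.
min_non_null = max(1, int(0.25 * df.shape[1]))
df = df.dropna(thresh=min_non_null)

# b. Assign a unique category where a categorical value is missing.
df["category"] = df["category"].fillna("unknown")
```

Here the all-empty second row is dropped, while rows that are merely missing a category keep their other information.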

2. Fetching only the latest news articles.

Our dataset is quite large, so training our model on the entire dataset is time-consuming. Therefore, we fetch only the latest articles.
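A minimal sketch of such a date filter; the articles and the 30-day window are made-up assumptions for illustration:

```python
from datetime import date, timedelta

# Hypothetical articles with publish dates (field names are illustrative).
articles = [
    {"title": "Old story", "published": date(2020, 1, 5)},
    {"title": "Recent story", "published": date(2020, 12, 10)},
    {"title": "Breaking story", "published": date(2020, 12, 20)},
]

# Keep only articles published within 30 days of a reference date.
reference = date(2020, 12, 21)
latest = [a for a in articles
          if reference - a["published"] <= timedelta(days=30)]
```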

2. Text Preprocessing

Text preprocessing is a core part of natural language processing. It converts raw text into a more digestible form so that the machine learning model can perform better.

  1. Tokenization

Tokenization divides long strings of text into smaller pieces. It is also called text segmentation. The segments can be words, characters, or subwords. For example, consider the sentence "covid cases in India"; word tokenization splits it into the tokens "covid", "cases", "in", and "India".
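A minimal word tokenizer can be sketched in a few lines (production systems typically use libraries such as NLTK or spaCy instead):

```python
import re

def word_tokenize(text):
    # Minimal word tokenizer: extract runs of word characters,
    # discarding whitespace and punctuation.
    return re.findall(r"\w+", text)

tokens = word_tokenize("covid cases in India")
# tokens == ["covid", "cases", "in", "India"]
```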

2. Normalization

Before further processing, the text in the article needs to be normalized. In this step, we convert all the text to the same case, either uppercase or lowercase. This approach includes different steps:

a. Stemming: obtain the stem of a word by stripping its affixes, usually suffixes. Eg: Sleeping → Sleep

b. Lemmatization: reduce a word to its base (canonical) form using vocabulary knowledge rather than simple suffix stripping. Eg: Better → Good
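The normalization steps above can be sketched as follows; the suffix list and lemma lookup are toy stand-ins for real tools such as NLTK's PorterStemmer and WordNetLemmatizer:

```python
def normalize(text):
    # Step 1: case-folding so "News" and "news" are treated as the same word.
    return text.lower()

def stem(word):
    # Toy suffix-stripping stemmer (real systems use e.g. a Porter stemmer).
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Toy lemma lookup; real lemmatizers consult a dictionary such as WordNet.
LEMMAS = {"better": "good", "worse": "bad"}

def lemmatize(word):
    return LEMMAS.get(word, word)
```

For example, `stem(normalize("Sleeping"))` gives "sleep" and `lemmatize(normalize("Better"))` gives "good", matching the examples above.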

3. Item-based similarity between news articles

Generally, we measure similarity on the basis of distance: if the distance between two items is high, their similarity is low, and vice versa. To calculate the similarity of keywords, we have to represent each keyword (which is text) as a d-dimensional vector. Different methods can be used to do this:

a. TF-IDF vectorizer

The concepts of Term Frequency (TF) and Inverse Document Frequency (IDF) are used in content-based recommenders. TF is the frequency of a term within a document, and IDF is the inverse of the term's document frequency across the whole corpus. TF-IDF gives more importance to words that are less frequent in the corpus. Using the TF and IDF values, it assigns a weight to each term or word in the document.
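A dependency-free sketch of this weighting (in practice one would use scikit-learn's TfidfVectorizer); the two toy documents are made up:

```python
import math
from collections import Counter

def tf_idf(corpus):
    # corpus: list of token lists. Returns one {term: weight} dict per document.
    n_docs = len(corpus)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [["covid", "cases", "india"], ["covid", "vaccine", "news"]]
w = tf_idf(docs)
```

Note how "covid", which appears in every document, gets weight 0, while rarer words like "india" get positive weight: exactly the "more importance to less frequent words" behavior described above.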

b. Bag of Words (BOW) method

It represents the occurrences of words within a document. The concept is easiest to understand through an example.

Here are some reviews of news articles:

  • Review 1: This news is very scary and long
  • Review 2: This news is not scary and is short
  • Review 3: This news is informative and good

We will first build a vocabulary from all the unique words in the above three reviews. The vocabulary consists of these 11 words: ‘This’, ‘news’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’, ‘short’, ‘informative’, ‘good’.

We can now take each of these words and count its occurrences in the three reviews above. This gives us 3 vectors for the 3 reviews:

Source: https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/

Vector of Review 1: [1 1 1 1 1 1 1 0 0 0 0]

Vector of Review 2: [1 1 2 0 1 1 0 1 1 0 0]

Vector of Review 3: [1 1 1 0 0 1 0 0 0 1 1]

And that’s the idea behind a Bag of Words model.
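The example above can be reproduced in a few lines of plain Python (libraries such as scikit-learn's CountVectorizer do the same job at scale):

```python
# Bag-of-words sketch for the three reviews above.
reviews = [
    "This news is very scary and long",
    "This news is not scary and is short",
    "This news is informative and good",
]

# Build the vocabulary in order of first appearance (case-folded).
vocab = []
for review in reviews:
    for word in review.lower().split():
        if word not in vocab:
            vocab.append(word)

def vectorize(text, vocab):
    # Count how many times each vocabulary word occurs in the text.
    words = text.lower().split()
    return [words.count(v) for v in vocab]

vectors = [vectorize(r, vocab) for r in reviews]
```

The resulting vocabulary has the 11 words listed above, and each review becomes an 11-dimensional count vector.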

c. Word2Vec

We can prepare our unstructured text data using the Word2Vec method. Word2Vec (W2V) is an algorithm that learns a vector for each word in the corpus. The outcome is a set of vectors in which words with similar meanings end up with similar mathematical representations. To get good results with this method, we need a large corpus. Below are the input and output of Word2Vec:

Source: https://images.app.goo.gl/kmEJPCj8aot9WYxc7

This method can recommend documents containing a given word as well as documents containing words associated with it. In other words, Word2Vec captures semantic similarity, not just syntactic similarity.
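To make the "similar meanings, similar vectors" idea concrete, here is a toy nearest-word lookup over made-up 3-dimensional embeddings; real Word2Vec vectors are learned from a large corpus (e.g. with gensim) and typically have hundreds of dimensions:

```python
import math

# Made-up toy embeddings; real Word2Vec vectors are learned, not hand-written.
embeddings = {
    "covid":   [0.9, 0.1, 0.0],
    "virus":   [0.8, 0.2, 0.1],
    "cricket": [0.0, 0.9, 0.4],
}

def cosine(u, v):
    # Cosine similarity: 1.0 means the vectors point in the same direction.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def most_similar(word):
    # Rank every other vocabulary word by cosine similarity to `word`.
    return max((w for w in embeddings if w != word),
               key=lambda w: cosine(embeddings[word], embeddings[w]))
```

With these toy vectors, `most_similar("covid")` returns "virus", illustrating how a recommender can surface articles about associated words.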

These are some of the text-representation methods used in recommendation systems. Finally, we can evaluate our model using accuracy, precision, and recall.
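As a small illustration of precision and recall for a recommender, with a made-up set of recommended and actually-read articles:

```python
# Hypothetical article IDs: what we recommended vs. what the user actually read.
recommended = {"a1", "a2", "a3", "a4"}
relevant = {"a2", "a4", "a5"}

true_positives = len(recommended & relevant)

# Precision: what fraction of recommendations were relevant.
precision = true_positives / len(recommended)  # 2/4 = 0.5

# Recall: what fraction of relevant articles we managed to recommend.
recall = true_positives / len(relevant)  # 2/3
```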


Shanty Shabu
Passionate about data science and machine learning.