News Application that reads your mind — Part 2

Sayed Athar
Published in Codalyze
5 min read · Jun 4, 2019

Introduction:

Political news plays a crucial role in our daily lives. It helps us decide who we hand our power to, and it keeps us aware of our political surroundings, from international relations to the prices of the groceries we consume every day.

This project collects Indian political news articles from websites and identifies the overall tone of each article. It then gathers similar news articles together and generates a summary.

The project is split into two parts:

Part 1: Sentiment analysis
Part 2: Similarity and Summary generation

Crawlers were used to gather the news from different sources, built with BeautifulSoup (for navigating static websites) and Selenium (for scrolling and clicking through dynamic websites). The data was collected every 24 hours and stored in MongoDB. A minimal crawler sketch follows the list of collected fields below.

The collected data includes:

  • Heading
  • Summary
  • Subheading
  • Time and date
  • Body
  • Image (if any)
  • Author
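
As a rough illustration, here is a minimal BeautifulSoup crawler in the spirit of the above. The URL, CSS selectors, and field names are all hypothetical placeholders; every real site needs its own selectors, and dynamic sites would go through Selenium instead.

```python
import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

URL = "https://example.com/politics-story"  # hypothetical article URL

response = requests.get(URL, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Pull a few of the fields listed above; every selector here is a placeholder.
article = {
    "heading": soup.select_one("h1.headline").get_text(strip=True),
    "subheading": soup.select_one("h2.subheading").get_text(strip=True),
    "body": " ".join(p.get_text(strip=True) for p in soup.select("div.article-body p")),
    "author": soup.select_one("span.author").get_text(strip=True),
}

# Store the scraped article in MongoDB.
MongoClient()["news"]["articles"].insert_one(article)
```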

To start off, we analysed the data for sentiment prediction. The user can then choose how the data is displayed; for example, a user who only wants positive news on a certain subject can filter the data accordingly. Users can also see articles from different websites grouped under one news block. Each block generates a summary from the articles it contains and shows the list of related news articles.

In our previous post, we collected and analysed the news articles to get an intuitive feel for the data. In this part, we will look at the following features of the project:

1. The similarity between news articles
2. Generating Summary
3. Detecting fake news

Similarity
We noticed that reading through 100 articles at a time makes for an undesirable user experience. We found that we could improve it simply by bringing similar content from different sources together in one place, so we decided to merge these articles.
When working with a large collection of text documents, we usually run into questions like these:
1. How similar is document d1 to document d2?
2. Is this document unique, or has it been plagiarized?
3. Who published this document first?
Oftentimes, when scrolling through Google’s news feed, we encounter results similar to this:

Google News search

As you can see, Google has merged similar news articles together. As humans, we check for similarity by reading and interpreting an entire article, or by scanning headlines to get a feel for the news; based on this, we can tell whether two articles are similar. We basically look for a common theme! There are several algorithms that let a machine do the same:
1. Jaccard Similarity
2. Cosine Similarity

Both techniques reduce similarity to a comparable representation: Jaccard similarity measures the overlap between the two documents’ sets of words, while cosine similarity converts each text into a vector and computes the pairwise cosine of the angle between the vectors.
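Here is a minimal sketch of both measures, using scikit-learn for the TF-IDF vectors; the example headlines are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = [
    "Parliament passes the new budget after a lengthy debate",
    "New budget passed in Parliament following a long debate",
    "Monsoon arrives early in Kerala this year",
]

def jaccard(a, b):
    # Jaccard similarity: word-set overlap divided by word-set union.
    x, y = set(a.lower().split()), set(b.lower().split())
    return len(x & y) / len(x | y)

print(jaccard(articles[0], articles[1]))  # similar pair scores high
print(jaccard(articles[0], articles[2]))  # unrelated pair scores near zero

# Cosine similarity: convert each article to a TF-IDF vector,
# then compute the pairwise cosine between the vectors.
vectors = TfidfVectorizer(stop_words="english").fit_transform(articles)
print(cosine_similarity(vectors).round(2))
```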

Generating Summary
Summaries are crisper to read: they give the gist of a news article and help us understand a 300-word article in 50–60 words. Text summarization is an active area of research, and there are various ways to do it. Broadly, it can be divided as follows:

  1. Abstractive Text Summarization
    In this form of summarization, the summary for each news article is generated using advanced NLP techniques. It behaves much like a human writing a summary: first going through the entire article, then composing a summary using keywords taken from the text as well as its own words.
  2. Extractive Text Summarization
    In this form of summarization, the summary is generated by selecting the important sentences and phrases from the text and combining them into a meaningful summary.

We chose Extractive Text Summarization for two reasons:

  • Abstractive text summarization is still an active area of research, and current methods don’t scale up to long articles. Also, the generated sentences may not be up to the mark on a large dataset
  • Extractive text summarization is computationally less expensive

There are many algorithms capable of extractive text summarization, but the one we used is the TextRank algorithm, which is similar to the PageRank algorithm Google uses to rank web pages.

We took the following steps to implement the algorithm:

  1. Concatenate all the text contained in the articles and split it into individual sentences
  2. Find a vector representation for each sentence
  3. Compute the similarity between sentence pairs and store the scores in a matrix
  4. Convert the similarity matrix into a graph, with sentences as vertices and similarity scores as edge weights, for sentence-rank calculation
  5. Finally, take a certain number of top-ranked sentences to form the final summary
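
As a rough illustration, here is a minimal sketch of these steps, using nltk for sentence splitting, scikit-learn for TF-IDF sentence vectors, and networkx for the PageRank computation; it is a simplification of a production pipeline.

```python
import networkx as nx
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download("punkt", quiet=True)

def summarize(text, num_sentences=3):
    # Steps 1-2: split into sentences and compute a vector for each one.
    sentences = nltk.sent_tokenize(text)
    vectors = TfidfVectorizer().fit_transform(sentences)
    # Step 3: pairwise similarity matrix between sentences.
    sim_matrix = cosine_similarity(vectors)
    # Step 4: build a graph (sentences as vertices, similarities as
    # edge weights) and run PageRank to score each sentence.
    scores = nx.pagerank(nx.from_numpy_array(sim_matrix))
    # Step 5: keep the top-ranked sentences, in their original order.
    top = sorted(range(len(sentences)), key=scores.get, reverse=True)[:num_sentences]
    return " ".join(sentences[i] for i in sorted(top))
```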

There are various other algorithms and articles in the literature; you can get them from here.

These are the popular ones:
1. Using sequence-to-sequence encoder-decoder neural network models
2. Using Long Short-Term Memory networks combined with an attention mechanism

Fake News

As the amount of information grows, so does the amount of unreliable and fake news on the internet. Fake news spreads lies and affects lives within a community. To combat this problem, we could tag the news or develop a system that predicts whether a given news article is fake; to do this, we gathered data from a Kaggle dataset.

For text feature extraction we could use Term Frequency-Inverse Document Frequency (TF-IDF) and a Bag of Words vectorizer. For modelling, we can use the Naive Bayes and Logistic Regression models discussed earlier.
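A minimal sketch of such a classifier with scikit-learn is shown below; the CSV filename and the "text"/"label" column names are hypothetical stand-ins for the actual Kaggle dataset.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("fake_news.csv")  # hypothetical filename and columns
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

# TF-IDF features feeding a logistic regression classifier.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
print("Accuracy:", model.score(X_test, y_test))
```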

We could also apply a threshold: only if the predicted probability of an article being fake is 90% or above do we declare it fake.
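Continuing the sketch above, applying that threshold might look like this (assuming class 1 means "fake"):

```python
# Only flag an article as fake when the model is at least 90% confident.
proba_fake = model.predict_proba(X_test)[:, 1]  # assumes class 1 means "fake"
flagged_fake = proba_fake >= 0.90
```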

For training purposes, we need to use cross-validation and hyperparameter tuning to ensure that our model doesn’t overfit or underfit.
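For example, a cross-validated grid search over the pipeline sketched earlier might look like this; the parameter grid is illustrative, not tuned.

```python
from sklearn.model_selection import GridSearchCV

# Search a small, illustrative grid with 5-fold cross-validation.
param_grid = {
    "tfidfvectorizer__ngram_range": [(1, 1), (1, 2)],
    "logisticregression__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(model, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```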

Summary And Conclusions :

  1. This blog focused mainly on Text Summarization and News Similarity for news articles.
  2. We looked at various methods for Text Summarization, as well as methods for measuring the similarity between news articles.
  3. We also looked at datasets for detecting and combating fake news, and at the models we can use to fight it.


I am a machine learning and deep learning enthusiast who routinely reads self-help books. I would like to share my knowledge by writing blogs. Sky is the limit!