News Application that reads your mind — Part 1

Sayed Athar
Codalyze
Jun 4, 2019 · 5 min read

Introduction:

Political news plays a crucial role in our daily lives: it helps us make decisions about who we hand our power to while keeping us aware of our political surroundings, which shape everything from international relations to the prices of the groceries we consume every day.

This project collects Indian political news articles from websites and identifies the overall tone of that article. It gathers all similar news articles together and generates a summary.

The project is split into two parts:

Part 1: Sentiment analysis
Part 2: Similarity and Summary generation

Crawlers were used to gather the news from different sources. We used the following libraries: BeautifulSoup (for navigating through static websites) and Selenium (for scrolling and clicking through dynamic websites). The data was collected every 24 hours and stored in MongoDB.
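The static-site part of the crawler can be sketched as below. This is a minimal illustration, not the production crawler: the CSS selectors (and any URL you pass in) are assumptions, since every news site lays out its pages differently.

```python
import requests
from bs4 import BeautifulSoup

def parse_article(html):
    """Extract the fields we store from one article page."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "heading": soup.select_one("h1").get_text(strip=True),
        "body": " ".join(p.get_text(strip=True) for p in soup.select("p")),
    }

def scrape_article(url):
    """Fetch a page over HTTP, then parse it."""
    return parse_article(requests.get(url, timeout=10).text)
```

For dynamic pages, Selenium would drive a real browser to scroll and click before handing the rendered HTML to the same parser.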

The collected data includes:

  • Heading
  • Summary
  • Subheading
  • Time and date
  • Body
  • Image (if any)
  • Author
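One stored document might look like the sketch below. The field names, sample values, and collection layout are our illustration, not necessarily the exact schema used.

```python
from datetime import datetime, timezone

# One article as a MongoDB-style document, mirroring the fields listed above.
# Field names and values are illustrative assumptions.
article = {
    "heading": "Parliament passes new bill",
    "subheading": "Opposition stages walkout",
    "summary": "A short scraped summary.",
    "published_at": datetime(2019, 5, 28, 9, 30, tzinfo=timezone.utc),
    "body": "Full article text...",
    "image_url": None,  # optional: not every article carries an image
    "author": "Staff Reporter",
}

# With pymongo (and a running server) the insert is one call:
# from pymongo import MongoClient
# MongoClient()["news"]["articles"].insert_one(article)
```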

To start off, we analysed the data for sentiment prediction. The user can then choose how the data is displayed; for example, a user who only wants to see positive news related to a certain subject can filter the data accordingly. Users can also see articles from different websites grouped under one news block. Each block generates a summary based on the articles it contains and shows a list of related news articles.

Problem Statement:

The task is to predict the sentiment of news articles, where sentiment is identified as positive, negative or neutral. Informally, sentiment analysis can be thought of as deriving an opinion from a given sentence or article.

Why sentiment analysis of news articles?

  1. Our goal is to develop a system that would help our users get the news they enjoy reading. The users would be able to choose from a given set of sentiments, depending on their personal preferences.
  2. This would also give political parties the ability to keep track of their image in the media and of public opinion. For example, if a certain political party receives negative coverage, it indicates that the party is doing something wrong, putting it at risk of losing the election unless it takes corrective steps.

Workflow for Sentiment Analysis:

[Figure: Sentiment analysis workflow]

Step 1: Data Acquisition:

  1. For data acquisition, we scraped news websites and pulled over two thousand news articles.
  2. We then manually labelled the articles as ‘Positive’, ‘Negative’ or ‘Neutral’, denoting the polarity of each article.

Step 2: Data Pre-Processing:

Cleaning up data

  1. Remove duplicates so that there is no biased or redundant processing of the data.
  2. Refine the text, i.e. remove HTML tags that might be left over from scraping. We removed such tags using regular expressions.
  3. Stem inflected words and remove stop words. This saves space by reducing the dimensionality of the data and avoids extra processing of words that are unimportant to the analysis of the article.

Stop words: words like ‘and’, ‘for’, ‘but’, etc.
Stemming: words like ‘playing’, ‘plays’ and ‘played’ all refer to the same action, ‘play’.
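The cleaning steps can be sketched in a few lines. The stop-word list here is a tiny illustrative subset, and the suffix rule is a crude stand-in for a real stemmer (a production pipeline would use, e.g., NLTK's full stop-word list and the Porter stemmer):

```python
import re

# Illustrative subset; a real pipeline would use a full stop-word list.
STOP_WORDS = {"and", "for", "but", "the", "a", "an", "of", "to", "in", "is"}

def clean(text):
    text = re.sub(r"<[^>]+>", " ", text)          # strip leftover HTML tags
    tokens = re.findall(r"[a-z]+", text.lower())  # lowercase word tokens
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # naive stemming: maps 'playing'/'plays'/'played' toward 'play'
    return [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]
```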

Step 3: Feature Extraction

The data we have is textual in nature and can’t be used directly for machine learning. Models operate on numbers and treat every data point in a dataset as a vector, so each news article in our dataset is a data point that has to be converted into a vector for further processing. This step of converting raw text into vectors can informally be considered feature extraction. There are two main methods for converting text to vectors:
1. Bag of Words model
2. Term Frequency–Inverse Document Frequency, aka the tf-idf model

The only difference between the two is what goes into the text vector: Bag of Words puts word counts (1s and 0s in the binary variant), while tf-idf puts the tf-idf score of each word for each document. Once the vectors are ready, we can apply our models to them.
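With scikit-learn (our choice here; the post does not name its exact tooling) the two vectorizers differ by one class name:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "government announces new policy",
    "opposition criticises new policy",
]

# Bag of Words: each article becomes a vector of term counts.
bow = CountVectorizer().fit_transform(docs)

# Tf-idf: same vocabulary, but each cell holds a tf-idf score instead.
tfidf = TfidfVectorizer().fit_transform(docs)

print(bow.shape, tfidf.shape)  # both: (n_articles, vocabulary_size)
```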

Step 4: Choosing Machine Learning Models:

We have:

xi = the text of the i-th news article (data point)

yi ∈ {‘Positive’, ‘Negative’, ‘Neutral’} (corresponding label)

We had to train our machine learning model such that, for a given news article, it predicts the corresponding class yi. The three classes are roughly balanced. This is a supervised learning problem, specifically multiclass classification, and many algorithms can handle it. We tried the following:

  1. Logistic Regression
  2. Naive Bayes
  3. Random Forest
  4. XGBoost

We had a small dataset, so we chose not to use neural networks, as they can easily overfit small datasets. However, if the dataset is large enough (typically more than ten thousand data points), sequence models such as LSTMs or GRUs would be a better choice.
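Using scikit-learn (an assumption; the post does not name its tooling), the candidate models can be set up side by side. XGBoost lives in a separate package, so it is shown commented out:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
# from xgboost import XGBClassifier  # separate package: pip install xgboost

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "naive_bayes": MultinomialNB(),        # suits non-negative tf-idf features
    "random_forest": RandomForestClassifier(n_estimators=200),
    # "xgboost": XGBClassifier(),
}
```

Each model is then fit on the tf-idf vectors and compared on held-out accuracy.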

Step 5: Training

  1. We split our data 80:20; that is, 80% of the data points were used for training and the remaining 20% for testing.
  2. For cross-validation and hyperparameter tuning, we used grid search with 5-fold cross-validation.
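These two steps can be sketched as follows. The synthetic data is a stand-in for the real tf-idf vectors, and the logistic-regression C grid is our illustrative choice:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the article vectors; 3 classes as in the post.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_classes=3, random_state=42)

# 80:20 split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Grid search with 5-fold cross-validation over the regularisation strength.
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```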

Step 6: Results

Using different text feature-extraction methods and different machine learning models on our small dataset of nearly three thousand news articles, the maximum accuracy we were able to achieve was 68%.

Improvements and Suggestions:
Although we got satisfactory results, we can improve them further by taking the following steps:

  1. Collecting more data, typically 50k to 100k news articles.
  2. Adding more features, such as the length of news articles.
  3. Using sequence models like LSTMs and GRUs.

I am a machine learning and deep learning enthusiast who routinely reads self-help books. I would like to share my knowledge by writing blogs. The sky is the limit!