One-Stop News

Tirth Patel
SFU Professional Computer Science
Apr 20, 2020


Article By: Tirth Patel, Miral Raval, Utsav Maniar

This blog is written and maintained by students in the Professional Master’s Program in the School of Computing Science at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit sfu.ca/computing/pmp.

One-Stop News is an all-in-one news portal. It gives the user summaries of similar articles fetched from multiple news websites, along with each article’s sentiment, related tags, and category. It also visualizes trending topics.

Motivation and Background:

Who doesn’t read the news? Newspapers date back to the 17th century, but in the 21st century we live in the world of the internet, where people are more inclined towards news websites. Still, anyone who wants to read about a particular topic has to go through different news websites to find similar articles. This is where our portal comes to the rescue. With One-Stop News, the user selects a category and gets a list of similar news articles from different news websites, along with their summaries, tags and sentiment. For now, we cover two news websites (The New York Times and The Guardian). The user saves time by skimming the summaries and tags of the articles in the selected category and, if an article looks interesting, can visit the original website to read it in full. Our portal also classifies trending news articles into categories and generates a word cloud of trending topics, so the user can see which categories and topics are currently trending.

Problem Statement:

Following are the questions that this project addresses:

1. What are the similar articles from other news sources?

2. Can we get a summary of the articles?

3. What is the sentiment of the article?

4. What are the relevant tags?

5. Which category do the trending articles belong to?

6. What are the trending terms?

Why are these questions important?

1. Users need to switch and search between websites to read the same article from multiple sources

2. It is difficult to skim over the whole article and get a summary

3. One needs to read the entire article to know about the sentiment of the article

4. Without the tags of the articles, it is hard to guess what the article is all about

5. Trending articles are generally unclassified

6. To find out which terms are trending, the user needs to go through every trending article

Target Audience:

1. This product is useful for people who would prefer reading news from multiple sources. They would get similar news articles of a category from different resources along with their summary, sentiment and relevant tags

2. The product would be time-saving as one can easily glance over tags and summaries of news articles from different resources in one place

3. They will also get to know which category each trending news article belongs to and get a word cloud of trending topics

Data Science Pipeline:

Data Science Pipeline

1. Web Scraping and Data Storage:

We scraped two news websites, ‘The Guardian’ and ‘The New York Times’. For scraping, we used the BeautifulSoup library and stored the scraped data in Amazon S3.
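The idea of pulling headline text out of a page can be sketched with Python’s standard-library HTML parser. This is only an illustration: the project used BeautifulSoup, and the `<h3>` tag, the `HeadlineParser` class and the sample HTML below are assumptions rather than the markup of either news site.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect the text inside every <h3> tag (a hypothetical headline tag)."""

    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_headline = True
            self.headlines.append("")  # start a new headline buffer

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_headline = False

    def handle_data(self, data):
        # Text inside nested tags (e.g. <em>) is still part of the headline.
        if self.in_headline:
            self.headlines[-1] += data


# Made-up page fragment standing in for a downloaded news page.
sample_html = """
<html><body>
  <h3>Markets rally on rate cut</h3>
  <p>Body text that is not a headline.</p>
  <h3>City approves <em>new</em> transit plan</h3>
</body></html>
"""

parser = HeadlineParser()
parser.feed(sample_html)
headlines = [h.strip() for h in parser.headlines]
print(headlines)
```

In a real scraper the HTML would come from an HTTP request, and the selector would have to match each site’s actual page structure.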

2. Data Cleaning:

We cleaned the data using the NLTK library. The steps included tokenization, stop word removal, removal of special characters and punctuation, and word lemmatization.
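A minimal sketch of these cleaning steps, written with the standard library only so that it stays self-contained. The tiny `STOP_WORDS` set and the `clean_text` helper are assumptions for illustration; the project itself relied on NLTK, and lemmatization is left out here because it needs NLTK’s language resources.

```python
import re

# Tiny illustrative stop-word list; NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "of", "in", "on", "and", "to", "for"}


def clean_text(text):
    """Lowercase, strip punctuation and special characters, tokenize, drop stop words."""
    text = text.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]


tokens = clean_text("The markets rallied, and investors cheered!")
print(tokens)
```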

3. Exploratory Data Analysis (EDA):

EDA is an integral part of any data science project. For our project, we needed to check whether the dataset for classification tasks is balanced or not, as an imbalanced dataset affects the accuracy of the classification model. We also used EDA to compare the average length of articles of each category and removed lengthy articles. Data used for sentiment analysis was also analyzed to remove any anomalies. We used Plotly and Seaborn libraries to visualize the results.

Exploratory Data Analysis

4. Feature Engineering:

Feature engineering is an essential part of building any machine learning model. It is the process of transforming data into relevant features that act as inputs for the model, and good, relevant features boost model performance. We transformed the textual data into Tf-Idf vectors. Tf-Idf is a score that represents the importance of a term within a document relative to the entire corpus.
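To make the score concrete, here is a small sketch of the textbook Tf-Idf formula (term frequency multiplied by the log of inverse document frequency). The `tf_idf` function and the toy documents are assumptions for illustration. Scikit-learn’s `TfidfVectorizer` applies smoothing and normalization by default, so its numbers would differ from these.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized, non-empty document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = [
    ["election", "vote", "party"],
    ["match", "goal", "vote"],
    ["market", "stock", "party"],
]
weights = tf_idf(docs)
print(weights[0])
```

In the first document, "election" appears in only one document of the corpus and therefore scores higher than "vote", which appears in two.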

5. Model Creation:

Model creation is the process of building models for the predictive tasks that we want to perform. In our case, we created models for classifying news headlines into categories, summary generation, sentiment analysis of news articles, and topic modeling. We built machine learning models such as LDA, Random Forests and SVM, using the Python libraries Gensim and Scikit-learn.

6. Results:

We have developed a web application to deploy our project. This web-app is the front end of our project where users can access all the functionalities that our project offers under a single web page. For developing our web app, we have used the Django web framework.

Feature implementation and evaluation:

Following are our features and methods used to implement them:

1. Article Similarity:

Text similarity determines how ‘close’ two pieces of text are, both in surface closeness [lexical similarity] and in meaning [semantic similarity]. Here we used lexical similarity to gather similar articles. The article similarity feature computes the similarity between scraped articles from multiple sources and outputs the matches to the dashboard. This functionality enables users to read about the same category from multiple sources.

Method:

We clean and pre-process the scraped articles using the NLTK library. This involves filtering, tokenization, part-of-speech tagging, lemmatization, removing stop words and so on. We used the Doc2Vec package of the Gensim library for model training, with the BBC News Dataset as training data. Doc2Vec extends the word2vec approach to documents: it encodes a whole document, made up of a list of sentences, as a single vector rather than treating the sentences as ungrouped. The next step is to feed the pre-processed articles into the model to generate a vector for each article. The similarity between these articles is then calculated using cosine similarity, and the most similar articles from each source are returned as the result. There are other measures, such as Euclidean distance, but we used cosine similarity because it measures the angle between two vectors in multi-dimensional space. It focuses on the orientation of the document vectors, whereas Euclidean distance is also sensitive to their magnitude, which tends to grow with document length. Hence, when two documents are oriented closely but their lengths differ a lot, Euclidean distance reports them as less similar than cosine similarity does.
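The difference between the two measures can be seen on a toy example. The three vectors below are made up for illustration and are not Doc2Vec outputs: `long_doc` points in the same direction as `short_doc` but has ten times the magnitude, while `other_doc` has a similar magnitude but a different direction.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two vectors (0.0 means identical)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]  # same direction, ten times the magnitude
other_doc = [2.0, 1.0, 1.0]   # different direction, similar magnitude

print(cosine_similarity(short_doc, long_doc), cosine_similarity(short_doc, other_doc))
print(euclidean_distance(short_doc, long_doc), euclidean_distance(short_doc, other_doc))
```

Cosine similarity ranks `long_doc` as the closer match to `short_doc`, while Euclidean distance ranks `other_doc` as closer, which is the length effect described above.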

Evaluation:

Since this is unsupervised learning, we tested the model manually by feeding it multiple similar articles.

2. Summary Generation:

Text summarization is the problem of creating a short, accurate, and fluent summary of a longer text document. People often get bored while reading long passages of text, and a summary is a useful way to get the gist of an article before diving deep into it. This feature generates a summary of each of the similar scraped articles. A summary can be generated using either an extractive or an abstractive approach. Here we used an extractive approach.

Method:

Extractive text summarization involves selecting phrases and sentences from the source document to make up the new summary. Techniques involve ranking the relevance of phrases in order to choose only those most relevant to the meaning of the source. In our first attempt, we generated summaries using an LSTM model, but the results were unsatisfactory. In our final approach, we used a pre-trained BERT model to generate an extractive summary. This tool uses the Hugging Face PyTorch Transformers library to run extractive summarization. It works by first embedding the sentences and then running a clustering algorithm to find the sentences that are closest to the clusters’ centroids.
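The "closest to the centroid" idea can be sketched without BERT. The example below is a simplified stand-in, not the tool we used: it swaps BERT embeddings for plain word-count vectors and uses a single centroid in place of several clusters. The `summarize` function and the sample text are assumptions for illustration.

```python
import math
import re
from collections import Counter


def summarize(text, n_sentences=1):
    """Pick the sentences whose word-count vectors lie closest to the centroid."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    vectors = [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    vocab = sorted(set().union(*vectors))
    # Centroid = mean count of each vocabulary word across all sentences.
    centroid = {w: sum(v[w] for v in vectors) / len(vectors) for w in vocab}

    def distance(vec):
        return math.sqrt(sum((vec[w] - centroid[w]) ** 2 for w in vocab))

    ranked = sorted(range(len(sentences)), key=lambda i: distance(vectors[i]))
    chosen = sorted(ranked[:n_sentences])  # keep the original sentence order
    return " ".join(sentences[i] for i in chosen)


article = (
    "The council approved the new transit budget. "
    "The transit budget funds new buses. "
    "Critics said parking fees may rise."
)
summary = summarize(article, n_sentences=1)
print(summary)
```

With real sentence embeddings and several clusters, one sentence per cluster would be selected, so the summary would cover more than one theme of the article.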

Evaluation:

To evaluate the generated summaries, we manually compared them with the original articles and found that the approach performed well.

3. Finding Sentiment:

Sentiment analysis is the interpretation and classification of emotions (positive, negative and neutral) within text data using text analysis techniques. Sentiment analysis models detect polarity within a text (e.g. a positive or negative opinion), whether it’s a whole document, paragraph, sentence, or clause. News sources are often positively or negatively inclined towards a topic, so we provide functionality that predicts the sentiment of an article. We classify the given article into five categories: negative, slightly negative, neutral, slightly positive and positive.

Method:

We used the Kaggle movie review dataset to train the machine learning models. EDA was performed on the dataset to remove entries of excess length and to balance the categories. The input texts are pre-processed and converted into vectors. The fitted vectorizer is then saved to disk so that it can be reused when predicting the sentiment of new articles. The vectors are given to the model as input, along with the labels, to train it.
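Saving the fitted vectorizer matters because new articles must be mapped into exactly the same feature space the model was trained on. The sketch below illustrates this with a plain word-count vocabulary and `pickle`. The `fit_vocabulary` and `transform` helpers and the toy reviews are assumptions for illustration, not the project’s actual vectorizer.

```python
import pickle


def fit_vocabulary(docs):
    """Assign a column index to every term seen in the training documents."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab


def transform(doc, vocab):
    """Turn a tokenized document into a count vector; unseen terms are ignored."""
    vec = [0] * len(vocab)
    for tok in doc:
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec


train_docs = [["great", "film"], ["dull", "plot"]]
vocab = fit_vocabulary(train_docs)

# Serialize at training time, restore at prediction time.
blob = pickle.dumps(vocab)
restored = pickle.loads(blob)

vector = transform(["great", "plot", "twist"], restored)
print(vector)
```

The word "twist" never appeared during training, so it has no column and is dropped, while the known words land in the same positions as at training time.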

Evaluation:

We tried two models: Naïve Bayes and Random Forest. Naïve Bayes gave an accuracy of about 57%, while Random Forest reached 68%. We therefore chose Random Forest as our classifier.

4. Topic Modeling:

Topic modeling is an unsupervised machine learning technique that takes a set of documents as input, detects word and phrase patterns within them, and automatically clusters the word groups and characteristics that best describe the set of input documents. In our project, we used topic modeling to extract important words from a given news article. These extracted words give users an idea of the topic that the news article is about. We used the Latent Dirichlet Allocation (LDA) model to extract the words. LDA is an unsupervised machine learning model that takes documents as input and outputs topics, each described by its important words with a probability weight attached to each word.

Method:

We used a subset of the BBC News Dataset, which contains 308 articles in different languages. We filtered the English articles using the ‘langdetect’ library. Then we tokenized the sentences and words of each article using the NLTK library and performed lemmatization on the tokens. Lemmatization is the process of reducing a word to its root form. We also removed stop words, as these words are not important for model building. We tried both bi-gram and tri-gram words as input. After some experimenting, we chose tri-gram input as it gave better results. We fine-tuned the LDA model by experimenting with different parameter values.

Evaluation:

As LDA is an unsupervised machine learning model, we had no labeled data to score it against, so we relied on human judgment to evaluate the model. We printed the top 5 topics predicted by our LDA model for a given document and evaluated how well the model predicted the topics and their related words.

5. Classification:

News article classification aims to assign news articles to pre-defined categories. It is a form of text classification, an application of NLP, and document classification is a supervised machine learning problem. We classified the news articles into five categories: Business, Entertainment, Politics, Sports and Technology. We scrape headlines from two renowned news websites, ‘The Guardian’ and ‘The New York Times’, and then predict the category of each headline using our trained model. We noticed that news websites usually do not give the category of trending news. With the help of our project, users can see the categories of the trending news and choose whether to read a headline based on their interests.

Method:

We used the BBC News Dataset to train our model. It consists of 2,225 documents, each labeled with its category. The dataset contains five categories: Business, Entertainment, Politics, Sports and Technology. First, we performed Exploratory Data Analysis to get an idea of the dataset. We found that the dataset is balanced, as it contains approximately the same number of documents in each category. We also plotted the distribution of the average length of articles per category. This showed that Politics and Technology articles are longer than those in other categories. So, we filtered those two categories by retaining articles of up to 1,000 words and discarding longer ones. We used the Matplotlib and Seaborn libraries to visualize the data.
Before extracting features from the input dataset, we performed a few text cleaning tasks: special character removal, punctuation removal, lemmatization, and stop word removal. We performed all of these using the NLTK library. Then we tokenized each cleaned document into words, again with NLTK. We converted the tokenized words into Tf-Idf vectors, as machine learning models only take numerical data as input. Tf-Idf gives each term a score that represents the importance of that term within the document relative to the entire corpus. We used the Scikit-learn library to generate the Tf-Idf vectors.
For the classification task, we compared the performance of two machine learning models: Support Vector Machines and Random Forests. We performed hyperparameter tuning by defining parameter values for both models and running a randomized search. This identified the best-performing model, which was SVM in our case. Then we performed a grid search to tune the parameters more thoroughly by searching deeper into the hyperparameter space. We performed these tasks using the Scikit-learn library.
The final model predicts the category of a given news article with an accuracy of 95%.

Evaluation:

For the classification task, we used accuracy as the evaluation metric. Accuracy measures the ratio of correct predictions to the total number of predicted instances. In our case, SVM performed best, with a testing accuracy of 94%. We also visualized the confusion matrix for both models for model interpretation purposes.

Confusion matrix for Random Forest (left) and SVM (right)
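For readers unfamiliar with these two tools, the sketch below computes accuracy and a confusion matrix by hand. The labels and predictions are made up for illustration and are not our model’s outputs. Rows of the matrix are true categories and columns are predicted categories.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_matrix(y_true, y_pred, labels):
    """matrix[i][j] = number of items with true label i predicted as label j."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


labels = ["Business", "Politics", "Sports"]
y_true = ["Business", "Politics", "Sports", "Politics", "Sports"]
y_pred = ["Business", "Politics", "Sports", "Business", "Sports"]

acc = accuracy(y_true, y_pred)
matrix = confusion_matrix(y_true, y_pred, labels)
print(acc)
for label, row in zip(labels, matrix):
    print(label, row)
```

Off-diagonal entries show which categories get confused with each other. In this toy case, one Politics article is predicted as Business.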

6. Word Cloud:

A word cloud is an image composed of the words used in a particular text or subject, in which the size of each word indicates its frequency or importance. Here we generate a word cloud from trending news to show which words or terms are trending the most.

Method:

First, we scrape the trending news and pre-process it. The words of all the articles are then combined into a single list, which is visualized using Python’s wordcloud library.
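The quantity behind a word cloud is simply a term count over the combined articles. A minimal sketch, assuming the articles are already tokenized and cleaned; the `term_frequencies` helper and the sample tokens are made up for illustration:

```python
from collections import Counter


def term_frequencies(articles, top_n=2):
    """Combine all tokenized articles into one list and count each term."""
    all_tokens = [tok for article in articles for tok in article]
    return Counter(all_tokens).most_common(top_n)


articles = [
    ["vaccine", "trial", "results"],
    ["vaccine", "rollout", "plan"],
    ["election", "results", "vaccine"],
]
top_terms = term_frequencies(articles, top_n=2)
print(top_terms)
```

A word-cloud renderer then draws each term at a size proportional to its count, so "vaccine" would appear largest here.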

Data Product:

We developed our product using Django, HTML, Bootstrap and CSS. One-Stop News scrapes news articles of the selected category from two websites (The Guardian and The New York Times) and supports the following functionalities:

1. Collecting similar articles

2. Summary Generation

3. Finding Sentiment

4. Tag Generation

5. Classification of trending news

6. Generating word cloud

Data Product Demo

Challenges faced:

Following are the obstacles that we faced during the project:

1. Initially, we wanted to scrape three websites, namely The Guardian, The New York Times and the Daily Mail, but on the Daily Mail site the class of the article elements changes with every update, which makes it difficult to scrape

2. We had to trade off accuracy with processing time for article similarity as it compares several articles at the same time to draw comparisons

3. Integrating all the features and implementing them in a stack was time-consuming

Learning:

We learned a lot through this project, as it was our first NLP project. Some of the things we learned are:

1. Cleaning and pre-processing of textual data

2. Feature engineering on textual data

3. Use of NLP libraries like Gensim and NLTK

4. Tuning of hyperparameters

Future Work and Improvements:

In the future we would like to make the following improvements to our project:

1. We would like to incorporate more than two websites

2. We would like to improve the sentiment classification model

3. We would like to improve the processing time of the similarity comparison

4. We would like to allow scraping of articles for a topic entered by the user, rather than only for categories selected from a dropdown

Summary:

Our final data product is an all-in-one news portal that provides summaries of similar news articles aggregated from two websites (The New York Times and The Guardian), together with their relevant tags and sentiment. It also classifies trending news articles into their relevant categories and produces a word cloud of trending terms.
