News Classification using Machine Learning

Apr 8, 2020

A large amount of information is generated in the world today. Here we will focus on news, more specifically on short news, which comes from many sources (Reuters, CNN, G1, Diário do Nordeste, etc.).

The world has been changing for some years now, and the vast majority of news is currently in digital format. Printed newspapers are losing strength every day (as we can see in the image below), giving way to news on electronic devices. Digital news lets us draw on a wider range of sources: we can read news from several sources at the same time and even specify our tastes (with news recommenders such as Luppar News-Rec and Flipboard).

I can’t help but comment… Beware of Fake News.

THE PROBLEM

These news sources usually classify news into categories (here we will use the term labels), and this can create a BIG PROBLEM: not all the news that these sources produce or receive arrives correctly labeled, and given the large flow of news it is impossible for a human to label each incoming item manually. There is also old news that may not have been properly labeled in the past, in addition to the other aspects mentioned below:

One question that may seem relevant: today all news agencies are organized into editorial sections, and news is usually labeled by subject. In this context, what would justify a News Recommender (news classifier) project? The answer can be explained as follows. First, news generators share no common ontology, nor are the agencies organized in the same way. Second, an editorial section is not the same as a subject. Third, this labeling is usually single-label, while the true classification is known to be multi-label, so a recommender system that aggregates news from multiple sources needs its own classification system. (Source)

THE GOAL

The goal is to create a Machine Learning algorithm that automatically classifies short news into labels: the algorithm receives a news item and reports which label (category) it belongs to.

News example: “A Datafolha survey published by the newspaper Folha de S.Paulo this Wednesday (8) points out that 69% of respondents say they will lose income during the health crisis caused by the coronavirus epidemic in Brazil. The survey also shows that 76% advocate that people stay at home to prevent the virus from spreading.”

Labels: Health, Finance.

THE PROPOSED SOLUTION

In this first phase, the focus was on classifying each news item with just one label (single-label). The second phase (multi-label) is already under study and will be made available soon!
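To make the difference between the two phases concrete, here is a minimal sketch of how the targets could be encoded in each case. The label vocabulary `LABELS` and the helper `to_multi_hot` are hypothetical names used only for illustration; they are not taken from the project.

```python
# Hypothetical label vocabulary, for illustration only.
LABELS = ["Education", "Finance", "Health", "Politics", "Sports", "Technology"]


def to_multi_hot(labels, vocabulary=LABELS):
    """Encode a set of labels as a 0/1 vector with one position per category."""
    wanted = set(labels)
    unknown = wanted - set(vocabulary)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if name in wanted else 0 for name in vocabulary]


# Single-label (phase 1): each news item has exactly one category.
single_label_target = "Health"

# Multi-label (phase 2): the same item may belong to several categories.
multi_label_target = to_multi_hot(["Health", "Finance"])
print(multi_label_target)  # [0, 1, 1, 0, 0, 0]
```

A single-label classifier predicts one value per news item, while a multi-label classifier has to predict the whole 0/1 vector.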

What was used in this first phase:

1. Web Scraping: searching the G1 Notícias site and scraping it, then cleaning and processing the data. The result is the z6News dataset, containing 34,327 news items divided into 6 categories (sports, politics, technology, personal finance, education, and science/nature/health).

More information on Web Scraping, with an example in Google Colab, can be found here!

2. Natural Language Processing (NLP): a field of Artificial Intelligence that gives machines the ability to read, understand and extract meaning from human languages. (Source)

3. Classification Algorithms: SVM, KNN, Decision Tree and Random Forest (we use Python’s Scikit-Learn library).

4. Document Representations: BoW, TF-IDF, Word2Vec and FastText. The last two use the Gensim library and were applied with 2 approaches: Average (traditional) and E2V-IDF. (Source)
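The document representations listed above can be sketched in plain Python. This is only an illustration under stated assumptions: the tokenizer, the smoothed IDF formula and the toy 2-dimensional vectors are invented here, and the IDF-weighted average is a generic scheme that may differ from the actual E2V-IDF formulation, which this article does not spell out.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def bag_of_words(tokens, vocabulary):
    """BoW: raw count of each vocabulary term in the document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def inverse_document_frequency(tokenized_docs):
    """Smoothed IDF: log((1 + N) / (1 + df)) + 1, one common variant."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf(tokens, vocabulary, idf):
    """TF-IDF: term count scaled by the term's IDF weight."""
    counts = Counter(tokens)
    return [counts[term] * idf[term] for term in vocabulary]


def average_embedding(tokens, vectors, weights=None):
    """Average a document's word vectors, optionally weighting each token."""
    known = [t for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token has a vector")
    dim = len(next(iter(vectors.values())))
    w = [1.0 if weights is None else weights[t] for t in known]
    total = sum(w)
    return [sum(wi * vectors[t][d] for wi, t in zip(w, known)) / total
            for d in range(dim)]


docs = ["the team won the match", "the bank raised rates"]
tokenized = [tokenize(d) for d in docs]
vocabulary = sorted({t for doc in tokenized for t in doc})
idf = inverse_document_frequency(tokenized)

# Toy 2-dimensional "embeddings"; real ones would come from Word2Vec or FastText.
toy_vectors = {"the": [0.0, 0.0], "team": [1.0, 0.0], "match": [1.0, 0.0]}

print(bag_of_words(tokenized[0], vocabulary))
print(tfidf(tokenized[0], vocabulary, idf))
print(average_embedding(tokenized[0], toy_vectors))       # plain average
print(average_embedding(tokenized[0], toy_vectors, idf))  # IDF-weighted average
```

In the weighted version, a word such as "the", which appears in every document, pulls the document vector less than rarer words such as "team" or "match".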

Notes: The classification algorithms were combined with the document representations, and the best-evaluated combinations were SVM(RBF)+W2V-IDF and SVM(RBF)+BoW.
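As a rough sketch of what one such combination could look like in Scikit-Learn, the snippet below chains a BoW representation with an SVM using an RBF kernel. The toy corpus and labels are invented for illustration (they are not the z6News data), and the default hyperparameters are an assumption, since the settings used in the project are not given here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy corpus, invented for illustration (not the z6News data).
texts = [
    "team wins the championship final",
    "striker scores twice in the match",
    "coach praises the team after the match",
    "senate votes on the new bill",
    "president signs the bill into law",
    "minister defends the government bill",
]
labels = ["sports", "sports", "sports", "politics", "politics", "politics"]

# BoW representation (word counts) followed by an SVM with an RBF kernel.
model = make_pipeline(CountVectorizer(), SVC(kernel="rbf"))
model.fit(texts, labels)

# The fitted pipeline takes raw text and returns one label per news item.
prediction = model.predict(["the team lost the match"])
print(prediction)
```

Swapping `CountVectorizer` for `TfidfVectorizer` would give an SVM(RBF)+TF-IDF variant with the same code structure.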

As we can see below (using the F1-score metric with 10-fold cross-validation):
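An evaluation of this kind could be set up in Scikit-Learn roughly as follows. This is a sketch under assumptions: the synthetic corpus is invented, the folds are stratified and shuffled with a fixed seed, and `"f1_macro"` is chosen here because the article does not say how the F1-score is averaged across classes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

sports_words = ["team", "match", "goal", "coach", "league"]
politics_words = ["senate", "bill", "vote", "minister", "election"]

# Synthetic toy corpus: 20 short "news" items per class (not the z6News data).
texts, labels = [], []
for i in range(20):
    texts.append(f"{sports_words[i % 5]} {sports_words[(i + 1) % 5]} report {i}")
    labels.append("sports")
    texts.append(f"{politics_words[i % 5]} {politics_words[(i + 1) % 5]} report {i}")
    labels.append("politics")

model = make_pipeline(TfidfVectorizer(), SVC(kernel="rbf"))

# 10 folds, each keeping the class proportions of the full dataset.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# "f1_macro" is an assumption: the article does not say how F1 is averaged.
scores = cross_val_score(model, texts, labels, cv=folds, scoring="f1_macro")
print(len(scores), scores.mean())
```

Because the vectorizer sits inside the pipeline, it is refit on the training part of each fold, so no vocabulary or IDF statistics leak from the held-out fold.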

Results

The best results in this first phase came from the combination SVM(RBF)+W2V-IDF, that is, the SVM classification algorithm together with the Word2Vec document representation using the E2V-IDF approach. With 80% of the dataset used for training and 20% for testing, the results were as follows: