Turkish News Category Classification Tutorial

Published in Kodiks, Jul 14, 2021

In this tutorial, we will train a machine learning model for text classification using the Interpress News Category Dataset, which contains 273K Turkish news articles in 10 categories. The dataset is split into train and test sets: the train set contains 218K articles and the test set contains 54K.

1) Download and preprocess the data

First, we need to install the datasets package to access the data. You can do that with the command below.

$ pip install datasets

Since the Interpress Turkish News Category Dataset is very large, we need a quick configuration step so that the data can be read properly. Run the code below before downloading the dataset.

How to read large dataset
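The original snippet for this step is not reproduced here. One configuration that is often needed when long news articles are stored in CSV files, and purely an assumption about what this step does, is raising Python's CSV field size limit. The helper name `raise_csv_field_limit` is hypothetical.

```python
import csv
import sys


def raise_csv_field_limit():
    """Raise the csv module's per-field size limit as far as the platform allows.

    Assumption: the dataset is read through Python's csv module and some
    articles exceed the default limit of 131072 characters per field.
    """
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            # Some platforms use a smaller C long; shrink the value and retry.
            limit //= 10


applied_limit = raise_csv_field_limit()
print("CSV field size limit:", applied_limit)
```

If the dataset loads without errors on your machine, this step may not be necessary at all.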

We have done our configuration so let’s download the dataset. This step may take some time depending on your internet connection.

Interpress lite dataset download

Let’s start by analyzing the train and test data.

train_categories = df_train.groupby('category')
print("Train Total Categories:", train_categories.ngroups)
print(train_categories.size())

The output for the train data should look like Figure 1.

Now, let’s look at the test data in more detail with the code below (Figure 2).

test_categories = df_test.groupby('category')
print("Test Total Categories:", test_categories.ngroups)
print(test_categories.size())

Let’s visualize the category distribution as a pie chart with the following code, then check the output (Figure 3).

Interpress dataset pie chart visualize

Now let's go a little deeper: it is time for preprocessing. We are going to remove stopwords, blacklisted words, URLs, e-mail addresses, punctuation, numbers, and post-fixes (suffixes). I am not going to share the code function by function; I will just show the entry point here and explain how it works. You can find the source code at the end of the tutorial.

For this problem, we created our own blacklist and stopword list. I also defined my own order of steps: I tried many orderings for the Turkish language and got better results with the one explained here. First I convert the news text to lowercase, then I follow the steps below:

Delete e-mail addresses
Delete URLs
Split the text into a list of sentences on the period (“.”)
Split each sentence into tokens on whitespace (“ ”)
Delete post-fixes (this is specific to the Turkish language)
Delete punctuation
Delete stopwords
Delete blacklisted words
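The steps above can be sketched roughly as follows. This is a minimal illustration and not the implementation from the repository: the word lists are tiny hypothetical placeholders, the regular expressions are simplified, and "post-fix" is assumed to mean the suffix that follows an apostrophe (as in "Sinopharm'a").

```python
import re
import string

# Hypothetical, tiny word lists for illustration only.
STOPWORDS = {"ve", "bir", "de", "da", "için", "ile"}
BLACKLIST = {"haber", "ajansı"}

EMAIL_RE = re.compile(r"\S+@\S+")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
APOSTROPHE_RE = re.compile(r"['’]")
PUNCT_TABLE = str.maketrans("", "", string.punctuation + "“”‘’")


def clean_text(text):
    """Lowercase the text, then apply the cleaning steps in the listed order."""
    text = text.lower()
    text = EMAIL_RE.sub(" ", text)   # e-mails go first, before "." is used as a separator
    text = URL_RE.sub(" ", text)     # then URLs, for the same reason
    tokens = []
    for sentence in text.split("."):                # split into sentences
        for token in sentence.split():              # split into tokens
            token = APOSTROPHE_RE.split(token)[0]   # drop the suffix after an apostrophe
            token = token.translate(PUNCT_TABLE)    # delete punctuation
            token = re.sub(r"\d+", "", token)       # delete numbers
            if token and token not in STOPWORDS and token not in BLACKLIST:
                tokens.append(token)
    return " ".join(tokens)


sample = "Sinopharm'a bağlı enstitü, 79 kişi ve info@example.com https://example.com/a.b adresi."
print(clean_text(sample))
```

Removing e-mail addresses and URLs before splitting on the period matters, because both contain periods and would otherwise be broken into fragments that no longer match the patterns.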

I prefer to see the cleaned data next to the unprocessed data, so I stored the cleaned text in a new column. The output looks like Figure 4.

2) Feature extraction

Classifiers and learning algorithms cannot directly process text documents in their original format, as they expect fixed-size numeric attribute vectors instead of raw text documents of variable length. Therefore, during the preprocessing phase, texts are converted into a more manageable representation.
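To make "fixed-size numeric vectors" concrete, here is a small pure-Python sketch that maps documents of different lengths onto count vectors of the same length. The function names and the two sample sentences are made up for illustration, and raw counts are used only to show the idea of a fixed vocabulary.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map every distinct token in the corpus to a fixed column index."""
    tokens = sorted({token for doc in documents for token in doc.split()})
    return {token: index for index, token in enumerate(tokens)}


def to_count_vector(document, vocabulary):
    """Turn one document into a vector with one count per vocabulary entry."""
    counts = Counter(document.split())
    vector = [0] * len(vocabulary)
    for token, count in counts.items():
        if token in vocabulary:  # tokens unseen at fit time are ignored
            vector[vocabulary[token]] = count
    return vector


docs = ["borsa güne yükselişle başladı", "takım maçı kazandı takım sevindi"]
vocab = build_vocabulary(docs)
vectors = [to_count_vector(doc, vocab) for doc in docs]

for doc, vector in zip(docs, vectors):
    print(len(doc.split()), "tokens ->", len(vector), "features:", vector)
```

Both documents end up with one value per vocabulary entry, regardless of how many words they contain.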

We are going to use an old, common, and accurate approach for this problem: TF-IDF. We used TfidfVectorizer from sklearn, because TF-IDF combined with SVM models gets better accuracy [1]. I also found that somewhat higher max_features values gave me better accuracy. Let's see how the features are extracted in the code below.

TF-IDF word and char vectorizer
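A minimal sketch of word-level and character-level TF-IDF features stacked side by side, assuming scikit-learn and SciPy are installed. The sample texts, the n-gram range, and the max_features values are illustrative assumptions, not the settings used in this tutorial.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the cleaned train news.
train_texts = [
    "borsa güne yükselişle başladı",
    "takım maçı kazandı",
    "aşı mutasyona karşı etkili",
]

# One vectorizer over words, one over character n-grams (values are assumptions).
word_vectorizer = TfidfVectorizer(analyzer="word", max_features=5000)
char_vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 4), max_features=5000)

word_features = word_vectorizer.fit_transform(train_texts)
char_features = char_vectorizer.fit_transform(train_texts)

# Stack the two sparse matrices column-wise: same rows, more columns.
train_features = hstack([word_features, char_features]).tocsr()
print(word_features.shape, char_features.shape, train_features.shape)
```

At prediction time the same fitted vectorizers should be reused with transform rather than fit_transform, so that the columns line up with the ones the model was trained on.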

Here, I fit word-level and character-level vectorizers. Then I transformed the train news with both and stacked the word and char features to increase the number of features. Everything looks good so far. Let's train our model.

3) Train

As I mentioned before, I used an SVM model, but I needed to change some default settings to get better accuracy. The SVM model uses kernel=rbf and gamma=scale by default. When I changed the kernel to linear and gamma to auto, I got better accuracy. Note that in scikit-learn's SVC, gamma only affects the rbf, poly, and sigmoid kernels, so the improvement comes from the linear kernel. Let's look at the code.

TF-IDF and SVM training
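A toy-sized sketch of the training step, assuming scikit-learn is installed. For brevity it uses only word-level TF-IDF on a made-up corpus of four short documents, and the category names are hypothetical, so the result says nothing about the accuracy on the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# Made-up corpus with two hypothetical categories.
train_texts = ["futbol maç gol", "futbol takım gol", "borsa dolar faiz", "borsa banka faiz"]
train_labels = ["spor", "spor", "ekonomi", "ekonomi"]
test_texts = ["gol futbol", "faiz borsa"]
test_labels = ["spor", "ekonomi"]

# Fit the vectorizer on the train texts only, then reuse it for the test texts.
vectorizer = TfidfVectorizer()
train_features = vectorizer.fit_transform(train_texts)
test_features = vectorizer.transform(test_texts)

# Linear kernel as described in the text; gamma is kept for fidelity
# even though the linear kernel does not use it.
model = SVC(kernel="linear", gamma="auto")
model.fit(train_features, train_labels)

predictions = model.predict(test_features)
accuracy = accuracy_score(test_labels, predictions)
print(list(predictions), accuracy)
```

For a corpus of the size used in this tutorial, training SVC with a linear kernel can be slow; that is a general property of the algorithm and not something measured here.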

The accuracy is 93%, measured on the test data.

We are going to look at the classification report and the confusion matrix, which show the discrepancies between predicted and actual labels. First, check the classification report (Figure 5).

Now, let’s see the confusion matrix (Figure 6).
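To show what these two figures summarize, here is a small pure-Python sketch that builds a confusion matrix and derives accuracy, precision, and recall from it. The labels and predictions are invented for illustration and are not results from the model above.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are actual labels, columns are predicted labels."""
    pair_counts = Counter(zip(actual, predicted))
    return [[pair_counts[(a, p)] for p in labels] for a in labels]


def precision_recall(matrix, index):
    """Precision and recall for the class at the given row/column index."""
    true_positives = matrix[index][index]
    predicted_total = sum(row[index] for row in matrix)  # column sum
    actual_total = sum(matrix[index])                    # row sum
    precision = true_positives / predicted_total if predicted_total else 0.0
    recall = true_positives / actual_total if actual_total else 0.0
    return precision, recall


# Invented example data with three hypothetical categories.
labels = ["ekonomi", "spor", "sağlık"]
actual = ["spor", "spor", "spor", "ekonomi", "ekonomi", "sağlık"]
predicted = ["spor", "spor", "ekonomi", "ekonomi", "ekonomi", "sağlık"]

matrix = confusion_matrix(actual, predicted, labels)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(actual)

for label, row in zip(labels, matrix):
    print(label, row)
print("accuracy:", round(accuracy, 3))
```

The diagonal holds the correct predictions; every off-diagonal cell is a discrepancy between an actual category (row) and a predicted one (column).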

4) Prediction

I copied a news article from NTV News [2]. You can see the raw news and the processed news below.

news = r"Çin biyoteknoloji şirketleri China National Pharmaceutical Group (Sinopharm) ve Chongqing Zhifei Biological Products'ın bir yan kuruluşu tarafından geliştirilen iki corona virüs aşının Güney Afrika mutasyonuna karşı bağışıklığı tetiklediği açıklandı. BioRxiv adlı dergide ön baskısı yayımlanan laboratuvar araştırmasında, aşıyı yaptıran kişilerin kan örnekleri analiz edildi. Araştırmacılar, aşının tetiklediği antikorların Güney Afrika varyantına karşı nötrleştirme aktivitesini koruduğunu söyledi. Makale, Sinopharm'a bağlı Pekin Biyolojik Ürünler Enstitüsü, Çin Bilimler Akademisi Mikrobiyoloji Enstitüsü araştırmacıları tarafından yazıldı. AŞILARIN ETKİNLİĞİ DÜŞTÜ Bununla birlikte,  alınan örneklerdeki antikorların mutant virüse karşı aktivitesinin  orijinal virüse göre daha zayıf olduğu ifade edildi. Bilim insanları, aktivite azalmasının aşıların klinik etkililiğine olan etkisi dikkate alınmalıdır açıklamasını yaptı. Ancak, aşılardaki etkinliğin hangi oranda düştüğü belirtilmedi. Öte yandan, yüzde 79 oranında  etkili olduğu açıklanan Sinopharm aşısı Çin'de genel kullanım için onaylandı  ve Birleşik Arap Emirlikleri (BAE) de  dahil olmak üzere diğer birçok ülkede de kullanıllıyor. "
cleaned_news = clean_text(news)
cleaned_news

OUTPUT:
'çin biyoteknoloji şirketleri china national pharmaceutical group sinopharm chongqing zhifei biological products yan kuruluşu geliştirilen corona virüs aşının güney afrika mutasyonuna karşı bağışıklığı tetiklediği açıklandı biorxiv adlı dergide baskısı yayımlanan laboratuvar araştırmasında aşıyı yaptıran kişilerin kan örnekleri analiz edildi araştırmacılar aşının tetiklediği antikorların güney afrika varyantına karşı nötrleştirme aktivitesini koruduğunu söyledi makale sinopharm bağlı pekin biyolojik ürünler enstitüsü çin bilimler akademisi mikrobiyoloji enstitüsü araştırmacıları yazıldı aşilarin etki̇nli̇ği̇ düştü bununla alınan örneklerdeki antikorların mutant virüse karşı aktivitesinin orijinal virüse zayıf ifade edildi bilim insanları aktivite azalmasının aşıların klinik etkililiğine etkisi dikkate alınmalıdır açıklamasını aşılardaki etkinliğin oranda düştüğü belirtilmedi yandan yüzde oranında etkili açıklanan sinopharm aşısı çin genel kullanım onaylandı birleşik arap emirlikleri bae ülkede kullanıllıyor'

Let's transform this news and predict its category.

Prediction

You can access all the code on GitHub!

kodiks/turkish-news-classification (github.com)

If this article was helpful for you, you can follow us on Medium and Twitter. If you have any questions or app ideas you’d like to discuss, feel free to contact us via email.

REFERENCES

[1] Buluz, Başak, Yavuz Kömeçoğlu, and Merve Ayyuce Kizrak. “Voting-Based Multiple Classification Approach for Turkish News Texts.” In 2019 Innovations in Intelligent Systems and Applications Conference (ASYU), pp. 1–6. IEEE.
[2] https://www.ntv.com.tr/saglik/cin-asilari-mutasyonlu-corona-viruse-karsi-bagisikligi-tetikledi,0CB7F1_-9ka-hxgxoDQh5A, cited: 03.02.2021

Written by Serdar Akyol