Multi-Class Text Classification on Khmer News Articles
In an earlier article, we used text classification to identify whether a news article is about a car accident. Here, we cover multi-class classification of news articles by sorting them into eight categories. We look at several machine learning algorithms and how well each performs.
Our domnung.com news portal has eight main categories into which we want to classify articles. Some of the articles that we crawled already have categories assigned by the source sites, but others do not belong to any category, so we use a machine learning algorithm to categorize those. The eight categories are Cambodia, Economy, Entertainment, Health, Life, Sports, Technology, and World.
Data Cleansing
To recap from the previous article, we first crawled many Khmer news sites for the latest articles. Then we segmented the text into words, since Khmer text does not put spaces between words. We will go in depth on word segmentation in later articles. We need to segment text into words so that we can score them to create features for the algorithms to learn from. We use an approach called bag-of-words. You can find more detail about this approach under TF-IDF in our previous article.
Training Labels
In addition to article data, we also need training labels that identify which category each article belongs to. Unlike the car accident task, where we labeled the articles manually, these eight categories were already identified by the original sites. We infer the category from the site's menu link. Some articles are incorrectly classified because of ambiguous content or crawling issues, but the labels generally look pretty good.
We label each article with a category number from 0 to 7, which maps to the category names below.
{'cambodia': 0, 'economy': 1, 'entertainment': 2, 'health': 3, 'life': 4, 'sports': 5, 'technology': 6, 'world': 7}
We use about 10 thousand articles for this experiment. We cap the largest categories at 2 thousand articles each, while some categories, such as health and economy, have very low counts. We just could not find more articles for those categories. Here is the chart of the article count for each category we collected.
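The cap on the largest categories can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the code we ran: the function name cap_per_category, the (text, category) tuple layout, and the toy cap of 2 are assumptions made for the example.

```python
from collections import Counter


def cap_per_category(articles, max_per_category):
    """Keep at most max_per_category articles for each category.

    articles is a list of (text, category) tuples (hypothetical layout).
    """
    kept = []
    counts = Counter()
    for text, category in articles:
        if counts[category] < max_per_category:
            kept.append((text, category))
            counts[category] += 1
    return kept


# Toy data: three sports articles and one economy article.
articles = [
    ("article 1", "sports"),
    ("article 2", "sports"),
    ("article 3", "economy"),
    ("article 4", "sports"),
]
capped = cap_per_category(articles, max_per_category=2)
capped_counts = Counter(category for _, category in capped)
print(capped_counts)
```

Capping only trims the large categories. It does nothing for categories that are already small, which is why economy stays underrepresented.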
Algorithm Setup
We set aside 20% of the articles as a test set. These articles are used only for validating performance. They are not seen during the training step.
The other 80% of the articles are used for training, which gives us 10508 distinct terms (the vocabulary) with a TF-IDF setting of min_df=0.005. The min_df setting shrinks the vocabulary by dropping rarely seen terms: a term is kept only if it appears in at least that fraction of the documents. A higher value, like 0.1, eliminates more terms and gives us a smaller vocabulary. We tuned this number and found that 0.005 gave us a better result.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(min_df=0.005, tokenizer=tokenizersplit,
                        encoding='utf-8')
tokenizersplit is a custom function that splits the text on whitespace, since we segmented the data with spaces. See the previous article for details.
Now we are ready to test different algorithms. The list of algorithms is the same as in the previous articles on text classification.
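To make the effect of min_df concrete, here is a small sketch on a toy corpus. The body of tokenizersplit is an assumption (the real one is in the previous article), the English placeholder tokens stand in for space-segmented Khmer words, and token_pattern=None is only there to silence a scikit-learn warning about the unused default pattern.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def tokenizersplit(text):
    """Split already-segmented text on whitespace (assumed implementation)."""
    return text.split()


# Placeholder tokens stand in for space-segmented Khmer words.
docs = [
    "sports football match goal",
    "sports football team win",
    "economy market price",
    "sports market team",
]


def vocabulary_for(min_df):
    """Fit a TF-IDF vectorizer on docs and return its sorted vocabulary."""
    vectorizer = TfidfVectorizer(min_df=min_df, tokenizer=tokenizersplit,
                                 token_pattern=None)
    vectorizer.fit(docs)
    return sorted(vectorizer.vocabulary_)


# A float min_df is a fraction of documents: 0.5 of 4 documents means a
# term must appear in at least 2 of them to be kept.
small_threshold_vocab = vocabulary_for(0.005)
large_threshold_vocab = vocabulary_for(0.5)
print(len(small_threshold_vocab), small_threshold_vocab)
print(len(large_threshold_vocab), large_threshold_vocab)
```

With the low threshold all nine distinct tokens survive. With the high threshold only the four tokens that occur in at least two documents remain.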
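The comparisons that follow call a helper named train_model, which comes from the previous article and is not reproduced here. As a rough idea of what such a helper might look like, the sketch below fits a classifier, predicts on the validation set, and returns the accuracy. The body and the toy count matrix are assumptions for illustration, not the original implementation.

```python
import numpy as np
from sklearn import metrics, naive_bayes


def train_model(classifier, train_features, train_labels,
                valid_features, valid_labels):
    """Fit the classifier and return its accuracy on the validation set.

    Hypothetical stand-in for the helper used in the article.
    """
    classifier.fit(train_features, train_labels)
    predictions = classifier.predict(valid_features)
    return metrics.accuracy_score(valid_labels, predictions)


# Toy term-count features: class 0 uses the first term, class 1 the second.
train_features = np.array([[3, 0], [2, 0], [0, 3], [0, 2]])
train_labels = np.array([0, 0, 1, 1])
valid_features = np.array([[4, 0], [0, 4]])
valid_labels = np.array([0, 1])

accuracy = train_model(naive_bayes.MultinomialNB(),
                       train_features, train_labels,
                       valid_features, valid_labels)
print(accuracy)
```

Because the helper only relies on fit and predict, the same call works for any scikit-learn style classifier.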
Performance Comparison
We can look at how performance changes with the number of training documents. We vary the training size by changing the test_size value in the train_test_split method. Notice that we use stratify to help with classes that have very little data. This ensures that a proportional number of articles from those classes ends up in both the test and the training set.
from sklearn import model_selection, naive_bayes, preprocessing

test_sizes = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
for i, test_size in enumerate(test_sizes):
    train_x, valid_x, train_y, valid_y = model_selection.train_test_split(
        df['text'], df['cat'],
        test_size=test_size, random_state=1, stratify=df['cat'])

    # label encode the target variable
    encoder = preprocessing.LabelEncoder()
    train_y = encoder.fit_transform(train_y)
    # reuse the mapping learned from the training labels
    valid_y = encoder.transform(valid_y)

    # tfidf is the TfidfVectorizer defined earlier, assumed to be already fitted
    xtrain_tfidf = tfidf.transform(train_x)
    xvalid_tfidf = tfidf.transform(valid_x)

    # Naive Bayes on Word Level TF IDF Vectors
    nb_accuracy = train_model(naive_bayes.MultinomialNB(),
                              xtrain_tfidf, train_y, xvalid_tfidf, valid_y)
    ...
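To see what stratify does in isolation, here is a small sketch with made-up labels. The class sizes (80 and 20) and the 25% test size are arbitrary choices for the example.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy labels: 80 articles of a large class and 20 of a small one.
texts = [f"article {i}" for i in range(100)]
labels = ["sports"] * 80 + ["economy"] * 20

train_x, valid_x, train_y, valid_y = train_test_split(
    texts, labels, test_size=0.25, random_state=1, stratify=labels)

# With stratify, both splits keep the original 80/20 class ratio.
train_counts = Counter(train_y)
valid_counts = Counter(valid_y)
print(train_counts)
print(valid_counts)
```

Without stratify, a small class could by chance end up almost entirely in one of the two splits.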
The result shows that Logistic Regression is still improving at about 9K articles, while XGBoost flattens out at an accuracy of around 0.88 from about 5K articles onward. SVM does not work well here, with an accuracy of only about 0.5.
   train_count        nb        lr       svm        rf       xgb
0       9457.0  0.867745  0.903901  0.512845  0.867745  0.882017
1       8406.0  0.863463  0.897716  0.514748  0.847288  0.879638
2       7355.0  0.862036  0.894703  0.506185  0.842055  0.885506
3       6304.0  0.857992  0.890580  0.504520  0.849429  0.881541
4       5254.0  0.854968  0.886372  0.507804  0.836696  0.881233
5       4203.0  0.849485  0.881998  0.500238  0.831879  0.878668
6       3152.0  0.846656  0.879826  0.499320  0.823953  0.869630
7       2101.0  0.839420  0.873201  0.502795  0.810396  0.863804
8       1050.0  0.810636  0.856101  0.510890  0.783147  0.842461
Algorithms:
nb=Naive Bayes
lr=Logistic Regression
svm=Support Vector Machine
rf=Random Forest
xgb=XGBoost
It looks like we can still slightly improve Logistic Regression's performance by adding more data.
Performance Analysis
Below is the detailed performance of each algorithm on a 25% test split.
NB accuracy: 0.862
LR accuracy: 0.895
SVM accuracy: 0.184
RF accuracy: 0.853
Xgb accuracy: 0.882
So Logistic Regression (LR) is the best performer, with an accuracy of 0.895. XGBoost (Xgb) is the most computationally expensive, yet its performance is slightly worse than Logistic Regression's. XGBoost performed better on the car accident classification in the previous article.
Now, let's focus on the LogisticRegression algorithm. To get a detailed report on it, we train the model from sklearn's linear_model module and print the classification_report from sklearn's metrics module with the following code:
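from sklearn import linear_model, metrics

model = linear_model.LogisticRegression()
model.fit(xtrain_tfidf, train_y)
y_pred = model.predict(xvalid_tfidf)
# encoder.classes_ lists the category names in the same order as the encoded labels
print(metrics.classification_report(valid_y, y_pred,
                                    target_names=encoder.classes_))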
model.fit trains the algorithm on our training data. Then model.predict predicts the labels for the 25% test set that we set aside. Finally, we call classification_report to generate the accuracy metrics by comparing the predictions y_pred with the original labels valid_y.
Here is the result:
               precision    recall  f1-score   support

     cambodia       0.86      0.93      0.89       511
      economy       0.00      0.00      0.00         5
entertainment       0.91      0.96      0.94       513
       health       0.97      0.89      0.93       188
         life       0.78      0.51      0.61       164
       sports       0.93      0.97      0.95       474
   technology       0.89      0.84      0.86       288
        world       0.88      0.90      0.89       484

     accuracy                           0.90      2627
    macro avg       0.78      0.75      0.76      2627
 weighted avg       0.89      0.90      0.89      2627
We got 90% accuracy overall. The entertainment and sports categories scored high, with F1-scores of 0.94 and 0.95 respectively.
Notice that we did not predict any article in the economy category correctly. The training set for this category is very small. A good approach for an imbalanced data set is to get more data for the small category and scale back the others, but we did not go further into this.
We can also look at how the model predicts the data that it was trained on. This gives us some idea of how well it fits the training data and whether it overfits. We do this by simply passing the training data to the predict function. As expected, accuracy on the training set is higher, but only slightly, which suggests that the model does not overfit. Notice that the economy category is predicted incorrectly even in the training set. The algorithm tends to ignore categories with very little data.
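The economy result also explains why the macro average in the test-set report is so much lower than the weighted average. The sketch below recomputes both recall averages from the per-class rows. Because the inputs are the rounded values printed in the report, it reproduces the averages only approximately.

```python
# Per-class recall and support, copied from the test-set report.
recall = {
    "cambodia": 0.93, "economy": 0.00, "entertainment": 0.96,
    "health": 0.89, "life": 0.51, "sports": 0.97,
    "technology": 0.84, "world": 0.90,
}
support = {
    "cambodia": 511, "economy": 5, "entertainment": 513,
    "health": 188, "life": 164, "sports": 474,
    "technology": 288, "world": 484,
}

# Macro average: every class counts equally, however small it is.
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: each class counts in proportion to its support.
total_support = sum(support.values())
weighted_recall = sum(recall[c] * support[c] for c in recall) / total_support

print(round(macro_recall, 2), round(weighted_recall, 2))
```

The five economy articles pull the macro average down by a full eighth of a class, but they barely move the weighted average or the overall accuracy.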
               precision    recall  f1-score   support

     cambodia       0.89      0.95      0.92      1489
      economy       0.00      0.00      0.00        10
entertainment       0.93      0.96      0.95      1487
       health       0.94      0.90      0.92       499
         life       0.91      0.64      0.75       543
       sports       0.94      0.97      0.96      1476
   technology       0.91      0.90      0.90       861
        world       0.91      0.93      0.92      1516

     accuracy                           0.92      7881
    macro avg       0.80      0.78      0.79      7881
 weighted avg       0.92      0.92      0.92      7881
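The train-versus-test comparison above boils down to scoring the same fitted model twice. Here is a minimal sketch of that check. It uses synthetic data from make_classification as a stand-in for our TF-IDF features, so the numbers it prints say nothing about the Khmer news data.

```python
from sklearn import linear_model, metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the TF-IDF features and category labels.
features, labels = make_classification(
    n_samples=400, n_features=20, n_informative=5, n_classes=3,
    random_state=1)
train_x, valid_x, train_y, valid_y = train_test_split(
    features, labels, test_size=0.25, random_state=1, stratify=labels)

model = linear_model.LogisticRegression(max_iter=1000)
model.fit(train_x, train_y)

# Score the same fitted model on the data it saw and on held-out data.
train_accuracy = metrics.accuracy_score(train_y, model.predict(train_x))
valid_accuracy = metrics.accuracy_score(valid_y, model.predict(valid_x))
gap = train_accuracy - valid_accuracy
print(f"train={train_accuracy:.3f} valid={valid_accuracy:.3f} gap={gap:.3f}")
```

A small gap, like the 0.92 versus 0.90 we observed, suggests the model generalizes about as well as it fits. A large gap would point to overfitting.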
Conclusion
The overall performance that we got is 90% accuracy for the Logistic Regression algorithm. This is lower than the 98% accuracy that XGBoost reached in the previous car accident classification. But the car accident classification has to predict only 2 classes (accident or not-accident), where a random guess would get an accuracy of 50%. With 8 different classes, a random guess gets only about 13% accuracy. So it is much more difficult to predict accurately in an 8-class than in a 2-class classification.
With 90% accuracy, Logistic Regression is the algorithm we currently use for our site. As we get more data, we can further improve its performance, as the experiment with different training sizes suggests.
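The random-guess figures are simple arithmetic, sketched below. As an extra reference point that we did not use in the article, the sketch also computes the accuracy of always predicting the largest category, based on the test-set support counts from the report.

```python
def uniform_guess_accuracy(num_classes):
    """Expected accuracy when guessing each class with equal probability."""
    return 1 / num_classes


# Test-set support per category, copied from the classification report.
support = {
    "cambodia": 511, "economy": 5, "entertainment": 513,
    "health": 188, "life": 164, "sports": 474,
    "technology": 288, "world": 484,
}

two_class_baseline = uniform_guess_accuracy(2)
eight_class_baseline = uniform_guess_accuracy(8)
# Always predicting the most common category in the test set.
majority_class_accuracy = max(support.values()) / sum(support.values())

print(two_class_baseline, eight_class_baseline,
      round(majority_class_accuracy, 3))
```

Both baselines are far below the 90% that Logistic Regression reaches.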