Text Processing 1 — Old Fashioned Methods (Bag of Words and TFxIDF)

Prof. Dr. Deniz Kılınç
Deep Learning Turkey
9 min read · May 31, 2018

Conventional Text Processing Methods

Text Data

The amount of unstructured and noisy text data is increasing with each passing day. To apply any machine learning (ML) task to unstructured text, we need to convert the text data to features and the features to vectors.

In this article, I will introduce some Natural Language Processing (NLP) techniques for feature extraction, two models for feature representation, and a topic model, all of which have been used in the fields of information retrieval (IR) and ML for many years. The obtained features can then be employed to construct ML or deep learning (DL) models easily.

Dataset

In this article, we will use the TTC-3600 dataset from the well-known UCI Machine Learning Repository, a collection of databases, domain theories, and data generators used by the machine learning community for the empirical analysis of machine learning algorithms.

Figure 1. UCI Machine Learning Repository

The original dataset consists of a total of 3600 documents, with 600 news articles in each of 6 categories: economy, culture-arts, health, politics, sports, and technology, collected from well-known Turkish news portals and agencies (Hürriyet, Posta, İha, HaberTürk, and Radikal). The documents of the TTC-3600 dataset were gathered between May and July 2015 via Rich Site Summary (RSS) feeds from the corresponding categories of these portals. The samples are in Turkish, whose alphabet is derived from the Latin alphabet and consists of 8 vowels (a, e, ı, i, o, ö, u, ü) and 21 consonants (b, c, ç, d, f, g, ğ, h, j, k, l, m, n, p, r, s, ş, t, v, y, z). You can download the dataset using this link.
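If you want to work with the full corpus rather than the small sample used below, scikit-learn's load_files helper can read it once the archive is extracted. The folder layout (one sub-folder per category holding plain-text files) and the path in this sketch are assumptions, so adjust them to match your local copy.

# TR: TTC-3600 veri kümesini diskten yükleme (varsayımsal dizin yapısı)
# EN: Load the TTC-3600 dataset from disk (hypothetical directory layout)
from sklearn.datasets import load_files

# Assumed path: one sub-folder per category, each containing plain-text news files
corpus = load_files('TTC-3600', encoding='utf-8')
full_docs, full_labels = corpus.data, corpus.target
print(len(full_docs), 'documents,', len(corpus.target_names), 'categories')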

Table 1 shows the 8 sample documents selected from TTC-3600.

Table 1. Sample documents from TTC-3600

Let’s do some coding practice…

In the first step, we will load the required dependencies: re, nltk, pandas, numpy, and matplotlib.

  • NLTK — Natural Language Toolkit is a suite of open source Python modules and data sets supporting research and development in NLP.
  • Re — supports regular expression matching operations.
import re
import nltk
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.options.display.max_colwidth = 8000
nltk.download('stopwords')

Then, we construct a document array named docs which includes sample documents.

# TR: Örnek Türkçe dokümanlar 
# EN: Sample documents in Turkish
docs = ['Açıklama projenin ortaklarından Rus enerji devi Gazprom dan geldi. Yıllık 63 milyar metreküp enerji',
'ilk günündeki 20 yarış heyecanlıydı, 109 puan toplayan Türkiye, 12 ülke arasında 9. oldu ve yarış tamamlandı',
'Cortananın yeni işletim sistemi Windows 10 un önemli bir parçası olduğunu belirten Microsoft ; Google Android ve iOS cihazlarındaki Dijital',
'Teknoloji devi Google, Android in MMM sürümüyle birlikte bir çok sistemsel hatasının düzeltileceğini',
'Siroz hastalığı ile ilgili detaylara dikkat çekerek, sağlıklı bir karaciğere sahip olmak hastalık için',
'Hastalık çoğu kez yıllarca doğru tanı konmaması veya ciddiye alınmaması sebebi ile kısırlaştırıcı etki yapabiliyor, kronik ağrı,',
'ilk 4 etaptan galibiyetle ayrılan 18 yaşındaki Razgatlıoğlu, Almanya daki yarışta 3. sırayı alarak ',
'Helal gıda pazarı sanki 860 milyar doların üzerinde'
]
# TR: Dokümanların sınıfları
# EN: Classes of documents
classes = ['ekonomi', 'spor', 'teknoloji', 'teknoloji', 'saglik', 'saglik', 'spor', 'ekonomi']

After initializing the docs array, we construct a pandas DataFrame, which is a 2-dimensional labeled data structure.

docs = np.array(docs)
df_docs = pd.DataFrame({'Dokuman': docs,
'Sinif': classes})
df_docs = df_docs[['Dokuman', 'Sinif']]
#df_docs
Figure 2. Output of df_docs

Pre-processing

Pre-processing is one of the most important steps in preparing text documents before any ML or DL task. Tokenization, stop-word elimination, and stemming are the most widely used pre-processing methods.

  • To analyze a text document, tokenization must first be performed so that groups of words (tokens) are obtained.
  • All common separators, operators, punctuation marks, and non-printable characters are removed.
  • Then, stop-word filtering is performed to filter out the most frequent but least informative words. Examples: “ama, belki, acaba” (“but, perhaps, wonder”).
  • Finally, stemming and/or lemmatization is applied to obtain the stem of a word, i.e. its morphological root, by removing the suffixes that carry grammatical or lexical information. In this article, we skip the stemming step; a hedged sketch of what it could look like follows this list.
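For completeness, here is a minimal sketch of what an optional stemming step could look like. It assumes the third-party snowballstemmer package is installed; the package and the sample tokens are illustrative assumptions and are not used anywhere else in this article.

# TR: İsteğe bağlı gövdeleme (stemming) örneği
# EN: Optional stemming example (assumes: pip install snowballstemmer)
import snowballstemmer

# Hypothetical helper, not part of the pipeline below
turkish_stemmer = snowballstemmer.stemmer('turkish')
sample_tokens = ['projenin', 'ortaklarından', 'yarışta', 'hastalığı']
# Prints the suffix-stripped roots according to Snowball's Turkish rules
print(turkish_stemmer.stemWords(sample_tokens))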

The pre-processing code block is shown below. The norm_doc function takes a document as input and applies the pre-processing steps mentioned above.

WPT = nltk.WordPunctTokenizer()
stop_word_list = nltk.corpus.stopwords.words('turkish')

def norm_doc(single_doc):
    # TR: Dokümandan belirlenen özel karakterleri ve sayıları at
    # EN: Remove special characters and numbers from the document
    single_doc = re.sub(r" \d+", " ", single_doc)
    pattern = r"[{}]".format(",.;")
    single_doc = re.sub(pattern, "", single_doc)
    # TR: Dokümanı küçük harflere çevir
    # EN: Convert the document to lowercase
    single_doc = single_doc.lower()
    single_doc = single_doc.strip()
    # TR: Dokümanı token'larına ayır
    # EN: Tokenize the document
    tokens = WPT.tokenize(single_doc)
    # TR: Stop-word listesindeki kelimeler hariç al
    # EN: Filter out the stop-words
    filtered_tokens = [token for token in tokens if token not in stop_word_list]
    # TR: Dokümanı tekrar oluştur
    # EN: Reconstruct the document
    single_doc = ' '.join(filtered_tokens)
    return single_doc

norm_docs = np.vectorize(norm_doc)  # like magic :)
normalized_documents = norm_docs(docs)
print(normalized_documents)

Figure 3 shows how the documents change after pre-processing. Punctuation, special characters, and numbers are removed. All words are converted to lowercase and separated by single spaces. Turkish stop-words defined in NLTK are also removed (e.g., the word “sanki” in the last document is filtered out).

Figure 3. Normalized documents

Feature/Term Representation and BoW Model

The Bag of Words (BoW) model is a way of representing text that records the occurrence (e.g., the count) of each term in a document. In this model, the order and sequence of words are not considered.

The Vector Space Model (VSM) is an improved version of BoW in which each text document is represented as a vector and each dimension corresponds to a separate term (word). If a term occurs in the document, its value in the vector is non-zero.
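Before switching to scikit-learn, here is a minimal hand-rolled sketch of the same idea; the toy_docs strings are illustrative fragments, not the documents above. Every distinct term becomes a dimension, and each document becomes a vector of term counts.

# TR: BoW mantığını elle gösteren küçük bir örnek
# EN: A tiny hand-rolled illustration of the BoW idea
from collections import Counter

toy_docs = ['enerji devi gazprom', 'teknoloji devi google']

# Shared vocabulary: one dimension per distinct term
vocabulary = sorted({term for doc in toy_docs for term in doc.split()})

# One count vector per document
for doc in toy_docs:
    counts = Counter(doc.split())
    print(doc, '->', [counts[term] for term in vocabulary])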

In this article, we will apply CountVectorizer, which converts a collection of text documents to a matrix of term counts.

# TR: 1.Terim Sayma Adımları
# EN: 1.Term Counting Steps
from sklearn.feature_extraction.text import CountVectorizer
BoW_Vector = CountVectorizer(min_df = 0., max_df = 1.)
BoW_Matrix = BoW_Vector.fit_transform(normalized_documents)
print (BoW_Matrix)

Some selected parts of the screen output for BoW_Matrix are shown below.

Figure 4. Documents with terms

Figure 4 shows two documents (Doc-0, Doc-1), their terms and corresponding term counts.

The first two rows for Doc-0 specify that the terms with indices 46 and 48 belong to this document and that their term counts are 1. We need to run the following code block to see the values of these terms.

# TR: BoW_Vector içerisindeki tüm öznitelikleri al
# EN: Fetch all features in BoW_Vector
# Note: in scikit-learn >= 1.0 this method is called get_feature_names_out()
features = BoW_Vector.get_feature_names()
print("features[46]:" + features[46])
print("features[48]:" + features[48])

The output is shown below:

features[46]:metreküp
features[48]:milyar

To create a data frame, we convert BoW_Matrix to a dense array and use it as the input of DataFrame (with the term/feature names as column labels).

BoW_Matrix = BoW_Matrix.toarray()
# TR: Doküman - öznitelik matrisini göster
# EN: Show the document-term matrix
BoW_df = pd.DataFrame(BoW_Matrix, columns = features)
BoW_df

Figure 5 shows the screen output of the data frame named BoW_df.

Figure 5. Output of BoW_df

We can also get summary information about the data frame using the info() method.

print(BoW_df.info())

TF x IDF Scoring Model

The problem with raw term counting is that frequently used terms become dominant in the document and start to represent it on their own.

Even when these terms are not very informative, they can overshadow the other terms in the feature set.

To solve this problem we will use another scoring model, “TF x IDF”, which stands for Term Frequency x Inverse Document Frequency. The model uses two metrics in its computation: term frequency (tf) and inverse document frequency (idf). The equations of TF x IDF are as follows:

TF x IDF score for term “i” in document “j” = TF(i, j) * IDF(i)

TF(i, j) = (Term i frequency in document) / (Total terms in document)

IDF(i) = log2(Total documents / documents with term i)
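As a quick sanity check of these formulas, here is a small hand computation with made-up counts. Note that scikit-learn's TfidfVectorizer uses a slightly different, smoothed formulation with a natural logarithm and L2 row normalization by default, so its scores will not match these textbook values exactly.

import math

# Hypothetical counts: the term occurs once in a 10-term document...
tf = 1 / 10
# ...and appears in 2 of the 8 documents in the corpus
idf = math.log2(8 / 2)
print(tf * idf)  # 0.1 * 2.0 = 0.2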

To convert our documents to a matrix of TF-IDF features, we will use TfidfVectorizer.

# TR: 2.TFxIdf Hesaplama Adımları
# EN: 2.TFxIdf Calculation Steps
from sklearn.feature_extraction.text import TfidfVectorizer
Tfidf_Vector = TfidfVectorizer(min_df = 0., max_df = 1., use_idf = True)
Tfidf_Matrix = Tfidf_Vector.fit_transform(normalized_documents)
Tfidf_Matrix = Tfidf_Matrix.toarray()
print(np.round(Tfidf_Matrix, 3))
# TR: Tfidf_Vector içerisindeki tüm öznitelikleri al
# EN: Fetch all features in Tfidf_Vector
features = Tfidf_Vector.get_feature_names()
# TR: Doküman - öznitelik matrisini göster
# EN: Show the document-term matrix
Tfidf_df = pd.DataFrame(np.round(Tfidf_Matrix, 3), columns = features)
Tfidf_df

To see the difference between the term representation (weighting/scoring) models mentioned above, we simply need to interpret the output of Tfidf_df in Figure 6.

For example, consider the document at index 2:

  • The score of the term “android” is 1 under the term-frequency (count) scoring method, while it is 0.210 under the TFxIDF scoring method.
  • The score of the term “yeni” (“new”) is also 1 under the term-frequency scoring method, but 0.251 under the TFxIDF scoring method.
Figure 6. Output of Tfidf_df
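If you want to check these two cells programmatically rather than reading them off Figure 6, something like the following should work with the BoW_df and Tfidf_df data frames built above (assuming both tokens survived pre-processing and therefore appear as columns):

# TR: İki skorlama modelini aynı terimler üzerinde karşılaştır
# EN: Compare the two scoring models on the same terms
terms_to_compare = ['android', 'yeni']
print(BoW_df.loc[2, terms_to_compare])    # raw term counts for document 2
print(Tfidf_df.loc[2, terms_to_compare])  # TF-IDF scores for document 2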

Topic Models

Topics are the key concepts extracted from a corpus of documents and are represented as collections of terms. Topic models are very valuable for summarizing large corpora of text documents; furthermore, they reveal latent patterns in the data. Latent Dirichlet Allocation (LDA) is an example of a topic model and was introduced by David Blei, Andrew Ng, and Michael I. Jordan in 2003. LDA is a generative statistical model that allows sets of observations to be explained by unobserved groups which explain why some parts of the data are similar.

Figure 7. Representation of LDA topic model

In Python, you can apply an LDA model using the gensim and sklearn libraries. In this article, we will use sklearn. Since the LDA model cannot automatically determine the number of topics, number_of_topics (n_components) must be set manually. The other parameters of LDA are documented at this link. As shown in the following code block, the LDA model takes the bag-of-words matrix (BoW_Matrix) as input and decomposes it into smaller matrices.

# LDA: Topic Modeling
from sklearn.decomposition import LatentDirichletAllocation
number_of_topics = 4
BoW_Matrix = BoW_Vector.fit_transform(normalized_documents)
LDA = LatentDirichletAllocation(n_components = number_of_topics,
                                max_iter = 10,
                                learning_offset = 50.,
                                random_state = 0,
                                learning_method = 'online').fit(BoW_Matrix)
features = BoW_Vector.get_feature_names()
for t_id, topic in enumerate(LDA.components_):
    print("Topic %d:" % (t_id))
    print(" ".join([features[i]
                    for i in topic.argsort()[:-number_of_topics - 1:-1]]))

Running the above code block prints 4 topics, each with its top 4 (most related) terms, as shown in Figure 8. Because the number of documents in our dataset is low, some topic groups are not very meaningful.

Figure 8. Topics and their top 4 terms.
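To see which topic each document is assigned to, you can also inspect the document-topic distribution produced by the fitted model. This is a short sketch using the LDA model, BoW_Matrix, and classes objects defined above.

# TR: Her dokümanın konu dağılımı
# EN: Topic distribution of each document
doc_topic = LDA.transform(BoW_Matrix)        # shape: (n_documents, n_topics)
dominant_topic = doc_topic.argmax(axis = 1)  # most probable topic per document
for doc_id, topic_id in enumerate(dominant_topic):
    print("Doc-%d (%s) -> Topic %d" % (doc_id, classes[doc_id], topic_id))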

Conclusion

BoW (term counting, TF-IDF, etc.) and topic models are used in many ML tasks such as text classification and sentiment analysis. They are easy to understand and implement. Moreover, it is fun to work with text using NLP techniques. Despite these advantages, these models have some shortcomings:

  • Document-term matrices are high-dimensional and sparse, so the resulting models can be expensive in both space and time (a quick sparsity check is shown below).
  • They discard the order and sequence of the terms in documents, so we lose semantic information.
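As a rough illustration of the sparsity point above, the following sketch measures the sparsity of the sparse BoW_Matrix produced in the LDA section; on the full TTC-3600 corpus the matrix would be far larger and far sparser.

# TR: Doküman-terim matrisinin seyrekliğini ölç
# EN: Measure the sparsity of the document-term matrix
n_docs, n_terms = BoW_Matrix.shape
sparsity = 1.0 - BoW_Matrix.nnz / float(n_docs * n_terms)
print("%d documents x %d terms, sparsity: %.2f" % (n_docs, n_terms, sparsity))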

In the next article, we will introduce more recent and advanced methods (Doc2Vec, Word2Vec, FastText) for text processing and feature extraction.

Note: You can access the source code and dataset used in this article from my GitHub.


Prof. Dr. Deniz Kılınç
Deep Learning Turkey

Professor at Bakırçay University, with 23 years of industry experience in software engineering and data science. Founder of https://kalybeai.com