Preprocessing Text in Any Language for Deep Learning Language Models

Koyela Chakrabarti
5 min read · Jul 30, 2023


Text preprocessing is an integral part of working with text data in language models. Though we are equipped with various natural language toolkits for preprocessing data, what about preprocessing text written in a language that is not yet supported by those toolkits? This article deals with such a project. I will walk you through an example of a data file written in Bengali and show you how to tokenize that data into words or characters and assign each token a numerical index according to its frequency of occurrence, using Python. Most readers may not understand the language, but the code is the same for any language, and I have purposely picked a language other than English to show how the code works. I have a .txt file of the famous Bengali fairy tale Thakumar Jhuli, which has 28127 words, 10772 of them unique.

Reading the data file into a list

The following function takes the file name as input and returns a list where each item is a line of the book. UTF-8 is used for encoding since it can represent every Unicode character and therefore covers most of the world's languages and scripts.

def readbook(filename):
    # read the whole file as UTF-8 so non-Latin scripts load correctly
    with open(filename, 'r', encoding='utf-8') as b:
        lines = b.readlines()
    return [line.strip() for line in lines]

filename = "THAKURMAR JHULI.txt"
text = readbook(filename)
text[40:45]  # printing lines 40 to 44 of the text loaded

Running this prints lines 40 to 44 of the loaded Bengali text.

Text Tokenization

There are two types of tokenization for text data: character-level and word-level. The tokenise function below takes the list returned by the readbook function defined above along with the token type, and returns a list of tokens. By default the token type is set to word.

def tokenise(lines, token_type='word'):
    assert token_type in ('char', 'word'), 'Unknown token type: ' + token_type
    if token_type == 'word':
        # split each line on whitespace to get word tokens
        return [line.split() for line in lines]
    else:
        # split each line into individual characters
        return [list(line) for line in lines]

word_tokens = tokenise(text)          # returns a list of word type tokens per line
char_tokens = tokenise(text, 'char')  # returns a list of character type tokens per line
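As a quick illustration of the difference (using a hypothetical English line rather than a line from the book), word tokenization splits on whitespace while character tokenization breaks the line into individual characters:

sample = ["the cat sat"]          # hypothetical sample line, not from the book
print(tokenise(sample))           # [['the', 'cat', 'sat']]
print(tokenise(sample, 'char'))   # [['t', 'h', 'e', ' ', 'c', 'a', 't', ' ', 's', 'a', 't']]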

Building the Vocabulary

Next, we need to build a vocabulary over the tokens to assign a numeric value to each, since machine/deep learning algorithms cannot process string data directly. All the unique tokens in the corpus are counted, and each token is assigned a numeric index based on its frequency of occurrence in the reference text. Whenever a token is encountered that does not belong to the corpus, it is mapped to the unknown token <unk>. While building the vocabulary, you can pass user-defined tokens such as <eos> (end of sequence) or <bos> (beginning of sequence) as reserved tokens. The unknown token <unk> is a reserved token that is added to the vocabulary by default. The following code defines the Vocabulary class.

import collections

class Vocabulary():
    def __init__(self, tokens=None, min_freq=0, reserved_tokens=None):
        if tokens is None:
            tokens = []
        if reserved_tokens is None:
            reserved_tokens = []
        # count every token and sort by frequency, most frequent first
        counter = count_corpus(tokens)
        self._token_freq = sorted(counter.items(), key=lambda x: x[1], reverse=True)
        # index 0 is always <unk>, followed by any reserved tokens
        self.idx_to_token = ['<unk>'] + reserved_tokens
        self.token_to_idx = {token: idx for idx, token in enumerate(self.idx_to_token)}
        for token, freq in self._token_freq:
            if freq < min_freq:
                # drop very rare tokens; with min_freq set to 0, nothing is dropped
                break
            if token not in self.token_to_idx:
                # new token appended, so frequent tokens receive the smaller indices
                self.idx_to_token.append(token)
                self.token_to_idx[token] = len(self.idx_to_token) - 1

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        if not isinstance(tokens, (list, tuple)):
            # unknown tokens fall back to the <unk> index
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]

    def to_tokens(self, indices):  # returns token(s) on passing an index or list of indices
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.idx_to_token[index] for index in indices]

    def to_index(self, tokens):  # returns index of the token or list of tokens passed
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx[tokens]
        return [self.token_to_idx[token] for token in tokens]

    @property
    def unk(self):  # the <unk> token is assigned index 0
        return 0

    @property
    def token_freqs(self):
        return self._token_freq

# This function counts the frequency of occurrence of the tokens passed in a 1D or 2D list
def count_corpus(tokens):
    if len(tokens) == 0 or isinstance(tokens[0], list):
        # flatten a 2D list of tokens into a single 1D list
        tokens = [token for line in tokens for token in line]
    return collections.Counter(tokens)

vocab = Vocabulary(word_tokens)
print(list(vocab.token_to_idx.items())[:10])

This prints the first ten (token, index) pairs of the vocabulary: <unk> at index 0, followed by the nine most frequent Bengali words.
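A minimal sanity check, assuming the vocab built above: looking up a known token returns its index, an unseen token falls back to index 0, and to_tokens reverses the mapping.

some_token = vocab.idx_to_token[5]   # pick an arbitrary token from the vocabulary
print(vocab[some_token])             # 5, the index of that token
print(vocab.to_tokens(5))            # the same token back again
print(vocab['xyz-not-in-corpus'])    # 0, unknown tokens map to <unk>
print(len(vocab))                    # total number of unique tokens plus <unk>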

Finally, we pack all the functions above into a single function that builds the corpus and the vocabulary.

def build_corpus(filename):
    lines = readbook(filename)
    tokens = tokenise(lines)      # word-level tokens by default
    vocab = Vocabulary(tokens)
    # map every token in the book to its numeric index
    corpus = [vocab[token] for line in tokens for token in line]
    return corpus, vocab

corpus, vocab = build_corpus(filename)
print(len(corpus))
print(len(vocab))

The output shows a corpus length of 26328 and a vocabulary size of 8529.
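Since the corpus is just a list of indices, you can map any slice of it back to tokens to verify the pipeline. A small sketch using the vocab returned above:

print(corpus[:10])                    # the first ten word indices of the book
print(vocab.to_tokens(corpus[:10]))   # the corresponding Bengali words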

Stop Words Removal

The code for this section depends largely on the vocabulary itself. The numerical index assigned to each token reflects its frequency of occurrence: the higher the frequency, the lower the index. In the following code, the index of each token is fetched and divided by the total number of unique tokens. If the quotient is less than a preset threshold, the token is dropped. This threshold is a hyperparameter whose value depends on how many tokens you want to drop. The code is as follows:

def del_stops(sentences, vocab):
    hyper_param = 6e-4
    # exclude any token that maps to <unk> (index 0)
    sentences = [token for token in sentences if vocab[token] != vocab.unk]
    indices = vocab.to_index(sentences)   # get the list of indices for the input
    counter = list(zip(sentences, indices))
    num_tokens = len(vocab)
    # keep only tokens whose relative index is above the threshold,
    # i.e. drop the most frequent (lowest-index) tokens
    return ([token for token in counter if token[1] / num_tokens > hyper_param], counter)

In this code, the value of the hyperparameter hyper_param is chosen so that the first four tokens apart from <unk> are dropped. I chose a sentence from the text file, and the tokens with the lowest indices are dropped.

As can be observed, the first three tokens of the sentence had indices 3, 2 and 4 respectively, so when the sentence is fed into the function, those three tokens are removed. This function can thus be used to remove stop words by carefully choosing the hyperparameter.
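A minimal sketch of the call, assuming the vocab built earlier; the line index 42 is just an arbitrary example, and which words survive depends on your file:

sentence = text[42].split()    # hypothetical: word tokens of one line of the book
kept, counter = del_stops(sentence, vocab)
print(counter)   # every (token, index) pair of the sentence
print(kept)      # the pairs that survive the frequency threshold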

If you have read this far, I sincerely thank you and hope the article has given you a fair idea of how to do text preprocessing. If you have any doubts or further questions, feel free to ask me.
