Bengali POS (Part-of-Speech) Tagging Using the Indian Corpus

Abu Kaisar
Published in Analytics Vidhya
Aug 30, 2019

Bengali is the fourth most spoken language in the world. In the Indian subcontinent, almost two hundred million people speak Bengali. For such a large community of speakers, it is necessary both to expand research on the Bangla language and to improve the modern AI-based technology that supports it.

Bengali Language Processing

Natural language processing (NLP) is an important subfield of artificial intelligence. It enables interaction between computers and people through natural language, and its techniques can be applied to any language. Modern methods have already been developed for most of the world's major languages, but by comparison Bengali lags somewhat behind.

Some research work has been done and some tools have been developed in the field of Bengali language processing.

List of some recent works

  • Text Summarization.
  • Sentence Generation.
  • Sentence similarity.
  • Text Analysis.
  • Word2Vector.

Natural Language Processing Techniques

Language can be processed with several techniques, for example:

Some NLP techniques

Lemmatization: Lemmatization is the technique of grouping together the different inflected forms of a word so that they can be analysed as a single item.

Stemming: Stemming is the technique of reducing the morphological variants of a word to a common base form (the stem).

POS: POS tagging is the technique of classifying words into their parts of speech.

Word Segmentation: Segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics.
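To make the idea of segmentation concrete, here is a minimal sketch in plain Python. It assumes that Bengali sentences end with the danda (।), a question mark, or an exclamation mark, and that words are separated by whitespace. The function names split_sentences and split_words are made up for this illustration and are not part of NLTK.

```python
import re

# Assumption: a sentence ends with the Bengali danda "।", "?" or "!".
def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation mark."""
    return re.split(r"(?<=[।?!])\s+", text.strip())

def split_words(sentence):
    """Split a sentence into words and standalone punctuation marks."""
    return re.findall(r"[^\s।?!,]+|[।?!,]", sentence)

text = "আমি ভাত খাই। তুমি কোথায় যাও?"
for sentence in split_sentences(text):
    print(split_words(sentence))
```

Real text needs more care (abbreviations, quotes, numbers), so this should be read as an illustration of the concept rather than a production tokenizer.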

Bengali POS tagging

A part of speech is a class of words that have similar grammatical properties. In English, the parts of speech are noun, verb, adjective, adverb, pronoun, preposition, conjunction, interjection, etc., and the Bengali language has similar categories. Here we will discuss how to easily create a part-of-speech tagger for the Bengali language using the Indian corpus.

Generally, a corpus is a large collection of language data. It provides grammarians, lexicographers, and other interested people with better descriptions of a language. The Indian corpus contains Bangla, Hindi, Marathi, and Telugu language data. To work with it, we first import NLTK (Natural Language Toolkit) and then import the Indian corpus from NLTK.

Import tnt from nltk.tag to tag each token in a sentence with its part of speech. TnT() is a statistical tagger based on a second-order Markov model, in which the probability of a tag depends on the two preceding tags. Markov models of this kind are widely used to predict time series and sequences.
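As a rough illustration of what "second-order" means, the sketch below estimates the probability of a tag given the two preceding tags by simple relative frequency. The toy sentences and tag names are hypothetical and are not taken from the corpus tagset. The real TnT tagger also smooths these estimates with lower-order statistics and handles unknown words, which this sketch leaves out.

```python
from collections import Counter

def trigram_transition_probs(tagged_sents):
    """Estimate P(t3 | t1, t2) by relative frequency over tag trigrams."""
    trigram_counts = Counter()
    context_counts = Counter()
    for sent in tagged_sents:
        # Pad with start markers so the first real tag also has a two-tag history.
        tags = ["<s>", "<s>"] + [tag for _, tag in sent]
        for t1, t2, t3 in zip(tags, tags[1:], tags[2:]):
            trigram_counts[(t1, t2, t3)] += 1
            context_counts[(t1, t2)] += 1
    return {
        trigram: n / context_counts[trigram[:2]]
        for trigram, n in trigram_counts.items()
    }

# Hypothetical toy data; the tag names are illustrative only.
toy = [
    [("আমি", "PRON"), ("ভাত", "NOUN"), ("খাই", "VERB")],
    [("সে", "PRON"), ("বই", "NOUN"), ("পড়ে", "VERB")],
    [("সে", "PRON"), ("যায়", "VERB")],
]
probs = trigram_transition_probs(toy)
# Two of the three sentences continue "<s> PRON" with NOUN.
print(probs[("<s>", "PRON", "NOUN")])
```

A tagger built on such a model picks the tag sequence that maximises the product of these transition probabilities and the word-given-tag probabilities.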

import nltk
from nltk.corpus import indian
from nltk.tag import tnt
import string

If the Indian corpus has not been downloaded yet, it must be downloaded first. The punkt tokenizer data also needs to be downloaded.

nltk.download('indian')
nltk.download('punkt')

Next, we store the name of the tagged Bengali file from the Indian corpus (bangla.pos) in the variable tagged_set. We read the Bengali sentences from the corpus into the variable word_set and use a for loop to count and print every sentence in the corpus. The startswith() function checks whether a token begins with an apostrophe ('), so that such tokens and punctuation marks are attached to the preceding word without a space. The training fraction is set to 0.96 because the dataset is small.
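The joining expression used in this step can be tried on its own with a hand-made token list. The helper name join_tokens is hypothetical; the logic mirrors the expression inside the loop.

```python
import string

def join_tokens(tokens):
    """Rebuild a readable sentence from a list of tokens."""
    return "".join(
        # No leading space for punctuation or tokens that start with an apostrophe.
        tok if tok.startswith("'") or tok in string.punctuation else " " + tok
        for tok in tokens
    ).strip()

print(join_tokens(["I", "'m", "here", "."]))
print(join_tokens(["আমি", "ভাত", "খাই", "।"]))
```

Note that string.punctuation only contains ASCII punctuation, so the Bengali danda (।) is treated like an ordinary word and keeps a space in front of it.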

tagged_set = 'bangla.pos'
word_set = indian.sents(tagged_set)
count = 0
for sen in word_set:
    count = count + 1
    # Rebuild the sentence: no space before punctuation or tokens starting with an apostrophe
    sen = "".join([" "+i if not i.startswith("'") and i not in string.punctuation else i for i in sen]).strip()
    print(count, sen)
print('Total sentences in the tagged file are', count)

train_perc = .96

train_rows = int(train_perc*count)
# Index where the test slice starts (the sentence at index train_rows is skipped)
test_rows = train_rows + 1

print('Sentences to be trained', train_rows, 'Sentences to be tested against', count - test_rows)
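The split arithmetic can be checked on a made-up corpus size. The sketch below uses a hypothetical total of 100 sentences (not the real size of bangla.pos) and the hypothetical helper name split_indices.

```python
def split_indices(total, train_fraction):
    """Return (train_rows, test_start) computed as in the snippet above."""
    train_rows = int(train_fraction * total)
    test_start = train_rows + 1
    return train_rows, test_start

# Hypothetical corpus size chosen for easy arithmetic.
total = 100
train_rows, test_start = split_indices(total, 0.96)
sentences = list(range(total))
train, test = sentences[:train_rows], sentences[test_start:]
print(train_rows, test_start, len(train), len(test))
```

With these numbers, 96 sentences go to training and 3 to testing, and the sentence at index 96 falls into neither part. Starting the test slice at train_rows instead would use that sentence as well.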

The train() method trains the TnT tagger on the training sentences. After training, the evaluate() method measures how well the tagger performs on the held-out test sentences. Our evaluation score on the Bengali data was 0.51.

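data = indian.tagged_sents(tagged_set)
train_data = data[:train_rows]   # sentences used for training
test_data = data[test_rows:]     # held-out sentences for evaluation
pos_tagger = tnt.TnT()
pos_tagger.train(train_data)
# Print the score so it is also visible when run as a script
print(pos_tagger.evaluate(test_data))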
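A score such as 0.51 can be read as the fraction of test tokens that received the correct tag. The sketch below shows that calculation on hypothetical gold and predicted tags. It is an illustration of the metric under that assumption, not NLTK's own code, and the helper name tagging_accuracy is made up.

```python
def tagging_accuracy(gold_sents, predicted_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for gold, predicted in zip(gold_sents, predicted_sents):
        for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted):
            correct += gold_tag == predicted_tag
            total += 1
    return correct / total if total else 0.0

# Hypothetical tags; "Unk" stands in for a word the tagger did not recognise.
gold = [[("আমি", "PRON"), ("ভাত", "NOUN"), ("খাই", "VERB")], [("সে", "PRON")]]
pred = [[("আমি", "PRON"), ("ভাত", "Unk"), ("খাই", "VERB")], [("সে", "NOUN")]]
print(tagging_accuracy(gold, pred))  # 2 of the 4 tags match
```

With a small training corpus, many test words are unseen during training, which is one plausible reason for a modest score.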

To test the tagger, the user provides a Bengali text, which is stored in a variable. The text is then split into tokens with word_tokenize(), and the tagger assigns a part of speech to each word.

sentence = input()
tokenized = nltk.word_tokenize(sentence)
print(pos_tagger.tag(tokenized))

Output

The output of the POS tagger

GitHub link: https://github.com/AbuKaisar24/Bengali-Pos-Tagger-Using-Indian-Corpus

References for recent works

[1] Sheikh Abujar, Mahmudul Hasan, “A Comprehensive Text Analysis for Bengali TTS using Unicode”.

[2] Sheikh Abujar, Mahmudul Hasan, MSI Shahin, Sayed Akter Hossain, “A Heuristic Approach of Text Summarization for Bengali Documentation”.

[3] Sanzidul Islam, Sadia Sultana Sharmin Mousumi, Sheikh Abujar and Syed Akhter Hossain, “Sequence-to-sequence Bangla Sentence Generation with LSTM Recurrent Neural Networks”.

[4] Sheikh Abujar, Mahmudul Hasan, Sayed Akter Hossain, “Sentence Similarity Estimation for Text Summarization Using Deep Learning”.
