Bengali POS (Parts-of-Speech) Tagging using the Indian Corpus
Bengali is one of the most widely spoken languages in the world; in the Indian subcontinent alone, almost two hundred million people speak it. To serve this large community, and to expand the field of Bangla language research, modern artificial-intelligence-based technology for the language needs to be developed further.
Bengali Language Processing
Natural language processing (NLP) is an important subfield of Artificial Intelligence. It enables interaction between computers and people using natural language, and in principle any language can be processed with it. Modern NLP methods have been developed extensively for most major languages of the world, but by comparison the Bengali language is a little behind.
Some research work has been done and some tools have been developed in the field of Bengali language processing.
List of some recent works
- Text Summarization.
- Sentence Generation.
- Sentence Similarity.
- Text Analysis.
- Word2Vector.
Natural Language Processing Technique
There are several methods by which language is processed, e.g.
Lemmatization: Lemmatization is the technique of grouping together the different inflected forms of a word so that they can be analysed as a single item.
Stemming: Stemming is the technique of reducing inflected words to their base or root form (the stem).
POS tagging: The technique of classifying words into their parts of speech.
Word Segmentation: Segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics.
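To make the difference between stemming and lemmatization concrete, here is a minimal sketch using NLTK's rule-based PorterStemmer (which needs no extra downloads); the tiny lemma dictionary is a hypothetical stand-in for a real lemmatizer such as NLTK's WordNetLemmatizer, which requires the WordNet data to be downloaded first:

```python
from nltk.stem import PorterStemmer

# Stemming: strip affixes with rules; the result may not be a real word.
stemmer = PorterStemmer()
for word in ["playing", "played", "plays"]:
    print(word, "->", stemmer.stem(word))  # all three reduce to the stem "play"

# Lemmatization: map each inflected form to its dictionary headword.
# (A toy lookup table, for illustration only.)
toy_lemmas = {"better": "good", "went": "go", "plays": "play"}
print(toy_lemmas.get("went", "went"))  # -> "go"
```

Note that stemming can produce non-words (e.g. "analysed" stems to "analys"), while lemmatization always returns a valid dictionary form.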
Bengali POS tagging
A part of speech is a class of words that share similar grammatical properties. In English, the parts of speech are noun, verb, adjective, adverb, pronoun, preposition, conjunction, interjection, etc., and Bengali has a similar set. Here we will discuss how to easily create a Part-of-Speech tagger for the Bengali language using the Indian corpus.
Generally, a corpus is a large collection of text data. It provides grammarians, lexicographers, and other interested researchers with better descriptions of a language. The Indian corpus contains a collection of Bangla, Hindi, Marathi, and Telugu language data. To work with it, we first have to import NLTK (Natural Language Tool Kit), and then import the Indian corpus from NLTK.
We import tnt from nltk.tag to tag each token in a sentence with supplementary information. TnT() is a statistical tagger based on a second-order Markov model, a model commonly used for probability prediction over time series and sequences.
import nltk
from nltk.corpus import indian
from nltk.tag import tnt
import string
If the Indian corpus is not already present locally, it has to be downloaded; the punkt tokenizer models need to be downloaded as well.
nltk.download('indian')
nltk.download('punkt')
Then we store the name of the pre-tagged Bengali file of the Indian corpus (bangla.pos) in a variable (tagged_set) and read its sentences into the variable word_set. A for loop counts all the sentences present in the corpus. The startswith() function checks whether a token starts with the string " ' ", so that spacing is restored correctly when the tokens are joined back into a sentence. We set the training percentage to 0.96 since the dataset is not large.
tagged_set = 'bangla.pos'
word_set = indian.sents(tagged_set)
count = 0
for sen in word_set:
    count = count + 1
    sen = "".join([" " + i if not i.startswith("'") and i not in string.punctuation else i for i in sen]).strip()
    print(count, sen)
print('Total sentences in the tagged file are', count)
train_perc = .96
train_rows = int(train_perc*count)
test_rows = train_rows  # the test slice starts where the training slice ends
print('Sentences to be trained', train_rows, 'Sentences to be tested against', count - train_rows)
The train() method trains the TnT tagger on the training data. After training, the evaluate() method measures the tagger's accuracy on the held-out test data. Our evaluation score was 0.51 on this Bengali data.
data = indian.tagged_sents(tagged_set)
train_data = data[:train_rows]
test_data = data[test_rows:]
pos_tagger = tnt.TnT()
pos_tagger.train(train_data)
pos_tagger.evaluate(test_data)
To test the result, the user provides a Bengali sentence, which is stored in a variable. Then word_tokenize() splits the sentence into tokens, and the tagger assigns a part of speech to each word.
sentence = input()
tokenized = nltk.word_tokenize(sentence)
print(pos_tagger.tag(tokenized))
Output
Github link: https://github.com/AbuKaisar24/Bengali-Pos-Tagger-Using-Indian-Corpus
Reference for recent works
[1] Sheikh Abujar, Mahmudul Hasan, "A Comprehensive Text Analysis for Bengali TTS using Unicode".
[3] Sanzidul Islam, Sadia Sultana Sharmin Mousumi, Sheikh Abujar and Syed Akhter Hossain, "Sequence-to-sequence Bangla Sentence Generation with LSTM Recurrent Neural Networks".