Bengali Word Spelling Correction Using Pre-trained Word2Vec

Abu kaisar
Dec 10, 2019 · 5 min read

Correct spelling is very important in any kind of documentation. Automatic spell checkers are available online for many languages. They help us correct a misspelled word or replace it with the correct word automatically, and they also help find grammatical mistakes and syntax errors. Grammarly is a well-known example of an automatic spell checker. No such automatic spell checker exists for our Bengali language, yet software like Grammarly is badly needed by every Bengali language user.

This article discusses an approach to automatically replacing a misspelled Bengali word with the correct one, as a step toward building an automatic spell checker. The whole procedure depends on word2vec. It uses a pre-trained word2vec file that has a small vocabulary but is effective for this work. A short description of each step is given below.

Library Function

The gensim library is used to load the pre-trained Bengali word2vec file from the local machine.

import gensim

Word2Vec

Word embedding is one of the most significant techniques in natural language processing, where words are mapped to vectors of real numbers. Word embeddings can capture the meaning of a word in a document, semantic and syntactic similarity, and relationships with other words. They have also been widely used in recommender systems and text classification.

‘bnword2vec’ is a pre-trained word2vec file for the Bengali language and ‘.txt’ is the extension of the loaded file.

model = gensim.models.KeyedVectors.load_word2vec_format('bnword2vec.txt')
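To make the file format concrete, here is a minimal sketch of what a word2vec text file contains, parsed with plain Python instead of gensim. The three-word sample and the helper name read_word2vec_text are assumptions made for illustration; the real 'bnword2vec.txt' is much larger, and gensim does this parsing for you.

```python
import io

# Hypothetical three-word file in word2vec text format: the header line holds
# "<vocabulary size> <vector dimension>", then one word and its vector per line.
SAMPLE = """3 2
আমি 0.1 0.2
তুমি 0.3 0.4
বাংলা 0.5 0.6
"""

def read_word2vec_text(handle):
    """Parse word2vec text format into (ordered word list, word -> vector dict)."""
    _vocab_size, dim = (int(x) for x in handle.readline().split())
    order, vectors = [], {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        order.append(parts[0])
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return order, vectors

order, vectors = read_word2vec_text(io.StringIO(SAMPLE))
```

The ordered word list is the part that matters for spelling correction here: the position of a word in the file is what the next section turns into a rank.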

Words Rank

Word rank orders the words by their position in the Word2Vec file. Here, the rank of each word is taken from its index in the Word2Vec vocabulary.

words = holds the list of words from the Word2Vec file, in index order. (In gensim 4.0 and later, the index2word attribute is named index_to_key.)

w_rank = a dictionary that maps each word to its index; it is filled in while the loop runs.

enumerate() = a built-in function that adds a counter to an iterable and returns it as an enumerate object.

WORDS = this variable holds the w_rank dictionary.

words = model.index2word

The len() function returns the number of items in an object.

len(words)
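The ranking described in this section can be sketched as follows. The short word list is a hypothetical stand-in for the vocabulary that gensim exposes. Pre-trained word2vec files typically list words from most to least frequent, so a smaller index usually means a more common word, though that depends on how the file was produced.

```python
# Hypothetical stand-in for model.index2word: the vocabulary in file order.
words = ["আমি", "তুমি", "বাংলা", "ভাষা"]

# Map every word to its position in the vocabulary.
w_rank = {}
for i, word in enumerate(words):
    w_rank[word] = i

# WORDS is the lookup table used by the correction functions.
WORDS = w_rank
```

With this table, checking whether a word is known and fetching its rank are both single dictionary lookups.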

Function

A function is a block of organized, reusable code that performs a single, related action. Functions give an application better modularity and a high degree of code reuse.

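P() = This function returns the negated rank of the given word. It looks the word up in the WORDS dictionary with the get() method, which returns the value for the given key if it is present and a default value (here 0) otherwise.

The syntax of the get() method is dictionary.get(key, default=None).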

def P(word):
    return -WORDS.get(word, 0)

max() = This built-in function returns the largest of the values passed to it; strings are compared lexicographically. When a key function is given, the items are compared by the value the key function returns.

correction() = It returns the candidate word that scores highest under the key P.

def correction(word):
    return max(candidates(word), key=P)
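A small sketch of how P and max() work together. The rank values and the two candidate words are hypothetical; only the shape of P follows the article. Because P negates the rank, the candidate with the smallest index gets the largest score.

```python
# Hypothetical rank table: a lower number means the word appears earlier in the file.
WORDS = {"বাংলা": 5, "বাংলো": 120}

def P(word):
    # Negate the rank so that max() prefers the word with the smallest index.
    return -WORDS.get(word, 0)

# Two hypothetical candidates for the misspelling "বাংল".
candidate_words = {"বাংলা", "বাংলো"}
best = max(candidate_words, key=P)
```

Note that a word missing from the table scores 0, the same as the top-ranked word. That is harmless in this design because candidates() passes only known words to P unless no known word is found, in which case there is a single candidate.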

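candidates() = This function produces the possible corrections for a misspelled word. It returns the first non-empty result among: the word itself if it is known, the known words one edit away, the known words two edits away, and finally the original word unchanged.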

def candidates(word):
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

known() = This function returns the subset of the given words that are present in the WORDS dictionary.

set() = A set is an unordered collection of items. Every element is unique (no duplicates) and must be immutable.

def known(words):
    return set(w for w in words if w in WORDS)
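The fallback chain in candidates() relies on the fact that an empty set is falsy in Python, so `or` moves on to the next option. A minimal sketch, with a hypothetical rank table and a hand-written list standing in for the output of edits1():

```python
# Hypothetical rank table.
WORDS = {"বাংলা": 5, "ভাষা": 9}

def known(words):
    # Keep only the words that exist in the vocabulary.
    return set(w for w in words if w in WORDS)

misspelled = "বাংল"

# The misspelling itself is not in the vocabulary, so this is an empty set.
exact = known([misspelled])

# Hand-written stand-in for edits1(misspelled); only one entry is a known word.
one_edit = known(["বাংলা", "বাংলো", "বাংল"])

# `or` returns the first non-empty value, as in candidates().
chosen = exact or one_edit or [misspelled]

# When nothing is known, the original word comes back unchanged.
fallback = known(["xyz"]) or ["xyz"]
```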

edits1() = This function builds several lists of edited words (deletes, transposes, replaces, inserts) from the input word and returns their union. The set() function removes duplicates, so each candidate appears once.

edits2() = This function returns the words obtained by applying edits1() again to each word produced by edits1().

letters = The Bengali characters used to build the edits. The Bengali script has a total of 9 vowels, each of which is called a ‘স্বরবর্ণ’. It also has 35 consonants, known as ‘ব্যঞ্জনবর্ণ’.

splits = A list of every way to split the word into a left part and a right part.

deletes = A list of the words formed by removing one character, built from the left and right parts in the splits list.

transposes = A list of the words formed by swapping two adjacent characters, built from the splits list.

replaces = A list of the words formed by replacing one character with a character from letters.

inserts = A list of the words formed by inserting one character from letters at any position in the word.

def edits1(word):
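The body of edits1() is not shown above. The following is a minimal sketch of one way to write it, following the well-known edit-distance pattern that the names splits, deletes, transposes, replaces, and inserts suggest. It is an illustration, not the article's original code. The LETTERS string is an assumption: it is derived from the Bengali Unicode block (letters and combining signs) rather than taken from the article.

```python
import unicodedata

# Assumption: use every letter and combining sign in the Bengali Unicode block
# (U+0980 to U+09FF) as the alphabet for replaces and inserts.
LETTERS = "".join(
    chr(cp)
    for cp in range(0x0980, 0x0A00)
    if unicodedata.category(chr(cp)) in ("Lo", "Mn", "Mc")
)

def edits1(word):
    """Return the set of strings one edit away from `word`."""
    # Every way to cut the word into a left and a right part.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # Drop the first character of the right part.
    deletes = [left + right[1:] for left, right in splits if right]
    # Swap the first two characters of the right part.
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    # Replace the first character of the right part with each letter.
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in LETTERS]
    # Insert each letter between the left and right parts.
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    """Lazily yield the strings two edits away from `word`."""
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))
```

These edits operate on Unicode code points, not on visual characters. In Bengali a vowel sign or a hasanta is a separate code point, so a single visible conjunct can be several edits away from its correct form.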

The code is now ready to suggest the correct word. If the user enters an incorrect word in the variable, the corresponding correct word is output. This code is built for single-word spell checking only; a practical spell checker would need to check the spelling of a whole paragraph or document continuously. Some sample output is shown below.

a=input()
[Figure: output demo]
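As a rough sketch of the paragraph-level checking mentioned above, a single-word corrector can be applied to every token of a sentence. The helper correct_text and the lookup-based toy_correction are hypothetical; in practice the correction() function defined earlier would be passed in instead.

```python
def correct_text(text, correct_word):
    """Apply a single-word corrector to every whitespace-separated token."""
    return " ".join(correct_word(token) for token in text.split())

# Hypothetical stand-in for correction(): a fixed lookup table that
# returns unknown tokens unchanged.
FIXES = {"বাংল": "বাংলা"}

def toy_correction(word):
    return FIXES.get(word, word)

corrected = correct_text("আমি বাংল বলি", toy_correction)
```

This simple split on whitespace leaves punctuation such as the danda (।) attached to the neighbouring word, so a real document checker would need a proper tokenizer first.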

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Written by Abu kaisar

Core Researcher at DIU NLP and Machine Learning Research Lab
