Word Embeddings for Indian Languages

Karan Malhotra
Nov 20, 2021 · 4 min read

NLP for Indian Languages

It is estimated that only around 10% of the Indian population speaks English [1]; the remaining 90% speaks regional languages such as Hindi, Marathi, Gujarati, and Punjabi. This multilingual society cannot fully leverage the benefits of AI without state-of-the-art NLP systems for these regional languages.

In this article, I will discuss various word embedding resources created specifically for Indian languages. Before that, let’s look at what a word embedding actually is and at some of its applications.

Figure 1: Different Languages Spoken in India. Credits Wikimedia

Word Embeddings

A word embedding is a representation of a word for text analysis, typically a real-valued vector that encodes the word’s meaning such that words closer together in the vector space are expected to be similar in meaning [2].

Nowadays it is common to use pre-trained word embeddings such as word2vec or GloVe to solve NLP problems rather than training embeddings from scratch. Let’s now look at the options available if one needs such embeddings for Indian languages.
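As a quick illustration of what “closer in the vector space” means, below is a minimal sketch of the cosine similarity that is commonly used to compare embeddings. The three-dimensional vectors are made up purely for illustration; real embeddings are typically 300–400 dimensional.

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: close to 1 means similar direction
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical toy vectors for three words
dance = np.array([0.9, 0.1, 0.3])
song = np.array([0.8, 0.2, 0.4])
tree = np.array([0.1, 0.9, 0.0])

print(cosine_similarity(dance, song))  # high score: related words
print(cosine_similarity(dance, tree))  # low score: unrelated words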

fastText [3]

fastText is a library for pre-trained word embeddings and text classification created by Facebook AI. The models are trained with each word represented as a bag of character n-grams; each n-gram has a vector associated with it, and a word is represented as the sum of these vectors [4]. fastText pre-trained word vectors are available for 294 languages, trained on Wikipedia data, and can be downloaded from here. Below is a code snippet that loads the fastText word embeddings for Hindi.

import fasttext
import fasttext.util

# Load the pre-trained Hindi Wikipedia vectors (wiki.hi.bin, downloaded beforehand)
ft = fasttext.load_model('wiki.hi.bin')
word = "नृत्य"  # "dance"
print("Embedding Shape is {}".format(ft.get_word_vector(word).shape))
print("Nearest Neighbors to {} are:".format(word))
ft.get_nearest_neighbors(word)

Embedding Shape is (300,)

Nearest Neighbors to नृत्य are:

[(0.8913929462432861, 'नृत्य।'),
(0.8440190553665161, 'नृत्यगान'),
(0.8374733924865723, 'नृत्यगीत'),
(0.8336297869682312, 'नृत्यों'),
(0.8265783190727234, 'नृत्यरत'),
(0.7971948385238647, 'नृत्यकला'),
(0.7879464626312256, 'नृत्त'),
(0.7682990431785583, 'नृतक'),
(0.7622954845428467, 'नृत्यरचना'),
(0.7602956295013428, 'नृत्यग्राम')]

The word “नृत्य” means dance in English, and the words nearest to it in the embedding space appear to be similar in meaning (for example, नृत्यगीत means dance song). Most major Indian languages, including Hindi, Marathi, Tamil, and Telugu, are supported by fastText.
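Because the vectors are built from character n-grams, fastText can also assign an embedding to a word that never appeared in the training vocabulary. A minimal sketch continuing from the snippet above, using a made-up word for illustration:

# fastText builds vectors for out-of-vocabulary words from their character n-grams
oov_word = "नृत्यकारिता"  # hypothetical word, unlikely to be in the Wikipedia vocabulary
print(oov_word in ft.get_words())          # probably False
print(ft.get_word_vector(oov_word).shape)  # still (300,)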

iNLTK [5]

iNLTK is an open-source library that provides resources such as pre-trained language models, word embeddings, and sentence embeddings for 13 Indian languages [6]. The word embeddings are obtained from the embedding layer of the pre-trained language model. The code snippet below shows how to load them:

from inltk.inltk import setup, get_embedding_vectors, get_sentence_encoding

setup('hi')  # needed only once; downloads the pre-trained Hindi model

word = "नृत्य"
vectors = get_embedding_vectors(word, 'hi')
print("Embedding Shape is {}".format(vectors.shape))

text = "मैं भाग रहा हूँ"  # "I am running"
encoding = get_sentence_encoding(text, 'hi')
print("Sentence Embedding Shape is {}".format(encoding.shape))

Embedding Shape is (400,)

Sentence Embedding Shape is (400,)

The sentence embeddings are obtained from the encoder outputs of the pre-trained language models and can be loaded in a similar way. Since they come from the encoder outputs, they are contextual in nature. For more details, you can refer to this link.
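Since get_sentence_encoding returns a plain vector, a simple use is to compare two sentences by cosine similarity. Below is a minimal sketch continuing from the snippet above; the second sentence, “मैं दौड़ रहा हूँ” (“I am running”), is my own example, and iNLTK also provides a get_sentence_similarity helper that wraps this kind of comparison.

import numpy as np
from inltk.inltk import get_sentence_encoding

enc1 = get_sentence_encoding("मैं भाग रहा हूँ", 'hi')
enc2 = get_sentence_encoding("मैं दौड़ रहा हूँ", 'hi')
# Cosine similarity between the two 400-dimensional sentence encodings
similarity = np.dot(enc1, enc2) / (np.linalg.norm(enc1) * np.linalg.norm(enc2))
print("Cosine similarity between the sentences is {:.3f}".format(similarity))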

IndicFT [7]

These are fastText models trained on the IndicNLP corpus [8] and are available for 11 Indian languages. You need to download the model binaries from here; they can then be used directly with the fastText library.

import fasttext
import fasttext.util

# Load the IndicFT binary for Hindi, downloaded from the IndicNLP site
ft = fasttext.load_model('indicnlp.ft.hi.300.bin')
word = "नृत्य"
print("Embedding Shape is {}".format(ft.get_word_vector(word).shape))
print("Nearest Neighbors to {} are:".format(word))
ft.get_nearest_neighbors(word)

Embedding Shape is (300,)
Nearest Neighbors to नृत्य are:

[(0.8393551111221313, 'नृत्यों'),
(0.8289133906364441, 'लोकनृत्य'),
(0.7881444096565247, 'लोकनृत्यों'),
(0.7832040786743164, 'लोकनृत्य,'),
(0.7764051556587219, 'नृत्य,'),
(0.7572897672653198, 'नृत्य।'),
(0.7543818354606628, 'नृत्य’'),
(0.7490015625953674, 'नृत्यांगनाओं'),
(0.7466280460357666, 'कत्थक'),
(0.7412323355674744, 'गायन')]

If we compare these nearest neighbours with those from the default fastText model, several words that do not contain the string “नृत्य” but are related to it (such as लोकनृत्य, folk dance, and गायन, singing) now appear in the output. So the IndicFT models seem to capture relationships between words that go beyond shared sub-word strings. For more training details, please refer to [7].
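One way to make this comparison concrete is to check the cosine similarity between नृत्य (dance) and a semantically related word such as गायन (singing) under both models. A minimal sketch, assuming both binaries have already been downloaded:

import numpy as np
import fasttext

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Load the Wikipedia-trained and IndicNLP-trained Hindi models
wiki_ft = fasttext.load_model('wiki.hi.bin')
indic_ft = fasttext.load_model('indicnlp.ft.hi.300.bin')

w1, w2 = "नृत्य", "गायन"  # dance, singing
print("Wiki fastText similarity: {:.3f}".format(cosine(wiki_ft.get_word_vector(w1), wiki_ft.get_word_vector(w2))))
print("IndicFT similarity: {:.3f}".format(cosine(indic_ft.get_word_vector(w1), indic_ft.get_word_vector(w2))))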

MuRIL [9]

MuRIL is a pre-trained BERT-based language model released by Google, trained on 17 Indian languages [10]. The training strategy is similar to that of multilingual BERT with a few modifications. The model can be downloaded from here and loaded with a few lines of code using the HuggingFace library.

## Loading the model
from transformers import AutoModel, AutoTokenizer
import torch

path = 'google/muril-base-cased'
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModel.from_pretrained(path, output_hidden_states=True)

## Embeddings
text = "कोई अच्छी सी फिल्म लगायो"  # "Put on some good movie"
print(tokenizer.convert_ids_to_tokens(tokenizer.encode(text)))
input_encoded = tokenizer.encode_plus(text, return_tensors="pt")
with torch.no_grad():
    states = model(**input_encoded).hidden_states
# Stack the embedding layer + 12 encoder layers into a single tensor
output = torch.stack([states[i] for i in range(len(states))])
output = output.squeeze()
print("Output shape is {}".format(output.shape))

['[CLS]', 'कोई', 'अच्छी', 'सी', 'फिल्म', 'लगा', '##यो', '[SEP]']
Output shape is torch.Size([13, 8, 768])

The first dimension of the output is the number of layers (12 encoder layers plus the embedding layer), the second is the number of tokens, and the third is the hidden size. A sentence embedding can be extracted by averaging over layers and tokens (usually only the last four layers are used, but one can also average over all layers). A contextual word embedding can be extracted by summing the outputs of the word’s sub-word tokens (the input tokens are sub-word units, not words) and averaging over the layers.
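Continuing from the snippet above, here is a minimal sketch of both extractions (averaging only the last four layers is just the common convention mentioned above, not a requirement of MuRIL):

# Sentence embedding: average the last four layers, then average over the tokens
last_four = output[-4:]                                  # shape (4, 8, 768)
sentence_embedding = last_four.mean(dim=0).mean(dim=0)   # shape (768,)
print("Sentence embedding shape is {}".format(sentence_embedding.shape))

# Contextual word embedding for "लगायो", split into ['लगा', '##यो'] (token positions 5 and 6):
# sum the sub-word token outputs and average over the layers
word_embedding = last_four.mean(dim=0)[5:7].sum(dim=0)   # shape (768,)
print("Word embedding shape is {}".format(word_embedding.shape))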

IndicBERT [11]

IndicBERT is released by AI4Bharat as part of IndicNLPSuite [8]. It is a multilingual ALBERT-based model pre-trained on 12 major Indian languages. The model can be downloaded from here and loaded with the HuggingFace library in the same way as MuRIL: the sentence and word embeddings can be extracted with the code above by simply changing the path variable from ‘google/muril-base-cased’ to ‘ai4bharat/indic-bert’.
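For completeness, a minimal sketch of the only change needed; the rest of the MuRIL snippet above works as-is (note that the ALBERT tokenizer requires the sentencepiece package to be installed):

from transformers import AutoModel, AutoTokenizer

path = 'ai4bharat/indic-bert'  # instead of 'google/muril-base-cased'
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModel.from_pretrained(path, output_hidden_states=True)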

References

  1. https://en.wikipedia.org/wiki/Multilingualism_in_India
  2. https://en.wikipedia.org/wiki/Word_embedding
  3. https://fasttext.cc/docs/en/crawl-vectors.html
  4. Enriching Word Vectors with Subword Information
  5. https://inltk.readthedocs.io/en/latest/index.html
  6. iNLTK: Natural Language Toolkit for Indic Languages
  7. https://indicnlp.ai4bharat.org/indicft/
  8. https://indicnlp.ai4bharat.org/home/
  9. https://huggingface.co/google/muril-base-cased
  10. MuRIL: Multilingual Representations for Indian Languages
  11. https://indicnlp.ai4bharat.org/indic-bert/
