Deep Learning Language Model for Telugu Corpus using LSTM

Published in

Analytics Vidhya

3 min readMar 7, 2020

Source: Psiĥedelisto / CC0[Wikimedia Commons]

Language Modeling is one of the most important parts of modern Natural Language Processing (NLP) which is used for Text Generation. There are boundless number of areas in which it can be applied, like: Machine Translation, Spell Correction Speech Recognition, Summarization, Question Answering, Sentiment analysis etc.

Here, we build a language model using a Long Short Term Memory (LSTM) RNN dealing with a corpus which is based on one of the Indian Languages- Telugu. RNNs use networks with hidden layers of memory to predict the next step using the highest probability. RNNs use backpropagation to use the text we input to learn how to generate the next letters or characters as its output.

The LSTM part of the model allows us to build an RNN model with improved learning of long-term dependencies i.e., better memory which facilitates an improved performance for those words that we will generate.

About the Corpus

The entire corpus that has been used in this post is in the form of a text file. You can create your own corpus from various sources like Wikipedia or news articles, etc.

Load Data

You can see, the corpus contains sentences in Telugu. Now, we tokenize it to obtain the unique words.

Encoding

We now encode the tokenized words.

Declaring a function that returns the encoded list:

Now we take a window size of 4 words and try to predict the next word that occurs in a sentence. With a stepahead of 3, we split the sentences into small arrays containing the encoded values of the 4 input words and the next word to be predicted. We append both of them into two different lists for the entire corpus.