Emotion Detection from Hindi Text Corpus Using ULMFiT

Published in Saarthi.ai · Feb 13, 2019

Written by Ankit Singh, Dhairya Patel, and Kaustumbh Jaiswal

Introduction

Deep learning has charged up the space of image recognition and speech processing for some time now.

We are witnessing a similar trend in Natural Language Processing.

Deep learning for NLP was less impressive at first, but with the introduction of techniques like ULMFiT, ELMo, Transformers, BERT, etc., it has become an impact driver, yielding state-of-the-art (SOTA) results on common NLP tasks.

Named entity recognition (NER), part-of-speech (POS) tagging, sentiment analysis, etc., are some of the problems where neural network models have outperformed traditional approaches. The progress in machine translation is perhaps the most remarkable of all.

In this blog we will showcase a ULMFiT model and use it for emotion detection. ULMFiT is a technique for applying transfer learning to text classification tasks.

Let’s begin!

Transfer Learning

Transfer learning is the technique of using weights from a pre-trained deep neural network and tweaking them a bit to suit our application. In other words, it is applying the knowledge of an already trained model to a different but related problem.

Figure 2 (Source: EverythingAi)

It is well suited to applications with small datasets and also reduces computation time.

What is ULMFiT?

ULMFiT stands for Universal Language Model Fine-tuning for Text Classification, a technique introduced by Jeremy Howard and Sebastian Ruder. It is a technique to incorporate transfer learning in NLP tasks.

The USPs of ULMFiT are:

  • Discriminative fine-tuning
  • Slanted triangular learning rates
  • Gradual unfreezing

Discriminative Fine-Tuning

Figure 3 (Source: towardsdatascience)

Different layers of a neural network capture different types of information, so they should be fine-tuned to different extents. Instead of using the same learning rate for all layers of the model, discriminative fine-tuning allows us to tune each layer with a different learning rate.
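With fastai 0.7 (the library we use below), discriminative fine-tuning is expressed by passing one learning rate per layer group instead of a single value. The numbers here are only illustrative; the paper suggests dividing the rate by 2.6 from one group to the next.

import numpy as np

# one learning rate per layer group: smallest for the earliest (most general) layers,
# largest for the task-specific head
lr = 1e-2
lrs = np.array([lr/2.6**4, lr/2.6**3, lr/2.6**2, lr/2.6, lr])

# later passed as-is to fit, e.g. learner.fit(lrs, 1, wds=wd, cycle_len=1)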

Slanted Triangular Learning Rates

Figure 4 (Source: the ULMFiT paper)

The model should quickly converge to a suitable region of the parameter space at the beginning of training and then refine its parameters. Using a constant learning rate throughout training is not the best way to achieve this behaviour. Instead, Slanted Triangular Learning Rates (STLR) increase the learning rate linearly at first and then linearly decay it.
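For concreteness, here is a small stand-alone sketch of the STLR schedule as defined in the ULMFiT paper (cut_frac, ratio and the peak rate below are the paper's defaults, not necessarily our settings):

import math

def stlr(t, T, cut_frac=0.1, ratio=32, lr_max=0.01):
    """Slanted triangular learning rate for iteration t out of T total iterations."""
    cut = math.floor(T * cut_frac)                        # iteration at which the rate peaks
    if t < cut:
        p = t / cut                                       # short linear warm-up
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))    # long linear decay
    return lr_max * (1 + p * (ratio - 1)) / ratio

# the schedule rises to lr_max after 10% of training, then decays towards lr_max / ratio
schedule = [stlr(t, T=1000) for t in range(1000)]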

Gradual Unfreezing

Gradual unfreezing is the concept of unfreezing the layers gradually, which avoids a catastrophic loss of the knowledge possessed by the model. It first unfreezes the top layer and fine-tunes all the unfrozen layers for one epoch. It then unfreezes the next lower frozen layer and repeats, until all the layers have been unfrozen and the model has been fine-tuned to convergence in the last iteration.
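Schematically, the procedure can be written as the little helper below (a sketch only; learner, lrs and wd are placeholders, and the actual calls we use appear in the training sections further down):

def gradually_unfreeze(learner, lrs, wd, n_groups):
    """Unfreeze one more layer group per epoch, starting from the last (task-specific) one."""
    for i in range(1, n_groups + 1):
        learner.freeze_to(-i)                         # only the last i groups are trainable
        learner.fit(lrs, 1, wds=wd, cycle_len=1)      # fine-tune them for one epoch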

For a detailed explanation of ULMFiT, we strongly suggest you go through this paper.

Let’s Code!

Installation

To run the code explained in the subsequent sections, make sure fastai version 0.7 is installed on your system. To install fastai, follow the instructions given here.

from fastai.text import *   # fastai 0.7 star import (also brings numpy as np and pandas as pd into scope)
import html

Getting Started

We start by creating separate folders for the classification and language models.

PATH = Path('')   # path to the data

CLAS_PATH = Path('emotion_hindi_clas/')
CLAS_PATH.mkdir(exist_ok=True)

LM_PATH = Path('emotion_hindi_lm/')
LM_PATH.mkdir(exist_ok=True)

Dataset

The dataset was created manually, as there is no pre-existing dataset for Hindi emotion detection. It comprises 5 labels: Angry, Happy, Neutral, Sad and Excited.

Each entry of the dataset is then converted to a text file, which is stored in a folder named after the class it belongs to. Now, let's load the dataset.
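The loading code below expects a folder-per-class layout, roughly like this (file names are illustrative):

train/
    angry/      0.txt, 1.txt, ...
    excited/    ...
    happy/      ...
    neutral/    ...
    sad/        ...
test/
    angry/ ...  (the same five sub-folders)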

CLASSES = ['angry','excited','happy','neutral','sad']

def get_texts(path):
    texts,labels = [],[]
    for idx,label in enumerate(CLASSES):
        for fname in (path/label).glob('*.*'):
            texts.append(fname.open('r', encoding='utf-8').read())
            labels.append(idx)
    return np.array(texts),np.array(labels)

trn_texts,trn_labels = get_texts(PATH/'train')
val_texts,val_labels = get_texts(PATH/'test')

The get_texts() function loads the data and stores all the texts in trn_texts and val_texts and their respective labels in trn_labels and val_labels.

Data Pre-processing

Now we convert our data into csv format with two columns, labels and text.

col_names = ['labels','text']

df_trn = pd.DataFrame({'text':trn_texts, 'labels':trn_labels}, columns=col_names)
df_val = pd.DataFrame({'text':val_texts, 'labels':val_labels}, columns=col_names)

df_trn.to_csv(CLAS_PATH/'train_hindi.csv', header=False, index=False)
df_val.to_csv(CLAS_PATH/'test_hindi.csv', header=False, index=False)

(CLAS_PATH/'classes_hindi.txt').open('w', encoding='utf8').writelines(f'{o}\n' for o in CLASSES)

We also create a separate set of csvs to train our language model, with all the labels set to 0 (labels are not required to train the language model).

trn_texts,val_texts = sklearn.model_selection.train_test_split(
    np.concatenate([trn_texts,val_texts]), test_size=0.1)

df_trn = pd.DataFrame({'text':trn_texts, 'labels':[0]*len(trn_texts)}, columns=col_names)
df_val = pd.DataFrame({'text':val_texts, 'labels':[0]*len(val_texts)}, columns=col_names)

df_trn.to_csv(LM_PATH/'train_hindi.csv', header=False, index=False)
df_val.to_csv(LM_PATH/'test_hindi.csv', header=False, index=False)

Language Model

We use a language model pre-trained on a Hindi Wikipedia dump corpus. The language model gives our model a general understanding of the language: for example, given an incomplete sentence, it tries to complete the sentence by predicting the next word.

re1 = re.compile(r'  +')   # collapse runs of spaces

def fixup(x):
    x = x.replace('#39;', "'").replace('\\"', '"').replace('#146;', "'")
    return re1.sub(' ', html.unescape(x))

The fixup function cleans up some of the odd characters and HTML artefacts present in the dataset.

def get_texts(df, n_lbls=1):
    labels = df.iloc[:,range(n_lbls)].values.astype(np.int64)
    texts = f'\n{BOS} {FLD} 1 ' + df[n_lbls].astype(str)
    for i in range(n_lbls+1, len(df.columns)):
        texts += f' {FLD} {i-n_lbls} ' + df[i].astype(str)
    texts = list(texts.apply(fixup).values)
    tok = Tokenizer().proc_all_mp(partition_by_cores(texts))
    return tok, list(labels)

The get_texts function applies the fixup function and inserts the xbos and xfld tags (fastai's BOS and FLD markers) to mark the beginning of each text and of each field, respectively, before tokenising the texts.

def get_all(df, n_lbls):
    # df is an iterator of DataFrame chunks (see the pd.read_csv calls with chunksize below)
    tok, labels = [], []
    for i, r in enumerate(df):
        print(i)
        tok_, labels_ = get_texts(r, n_lbls)
        tok += tok_
        labels += labels_
    return tok, labels

The get_all function iterates over the data in chunks, tokenizes each chunk and returns the tokenized texts along with the labels.

# re-read the csvs with header=None (columns: 0 = label, 1 = text) and in chunks for get_all
df_trn = pd.read_csv(LM_PATH/'train_hindi.csv', header=None, chunksize=24000)
df_val = pd.read_csv(LM_PATH/'test_hindi.csv', header=None, chunksize=24000)
tok_trn, trn_labels = get_all(df_trn, 1)
tok_val, val_labels = get_all(df_val, 1)
(LM_PATH/'tmp').mkdir(exist_ok=True)

We count the token frequencies and create a list itos (int-to-string), which maps integer ids back to their tokens; only the most frequent tokens are kept.

freq = collections.Counter(p for o in tok_trn for p in o)   # token frequencies
max_vocab = 60000   # maximum size of the vocabulary
min_freq = 2        # keep only tokens that occur more than min_freq times
itos = [o for o,c in freq.most_common(max_vocab) if c > min_freq]
itos.insert(0, '_pad_')
itos.insert(0, '_unk_')

Also, a dictionary stoi (string-to-int) is required to convert tokens to their integer ids; unknown tokens map to index 0.

stoi = collections.defaultdict(lambda: 0, {v:k for k,v in enumerate(itos)})
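The token lists are then numericalised with stoi to get the arrays trn_lm and val_lm used below; following the fastai imdb notebook, this step looks roughly like:

trn_lm = np.array([[stoi[o] for o in p] for p in tok_trn])
val_lm = np.array([[stoi[o] for o in p] for p in tok_val])
vs = len(itos)   # vocabulary size, used when building the language model below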

Fine Tuning the Language Model

Next, we load the pre-trained language model and fine-tune it on our dataset. We also load the itos file of the pre-trained language model to map our dataset's vocab onto the pre-trained language model's. For example, if खुश maps to 7 in the pre-trained language model, then खुश in our dataset should also map to 7.

# PRE_PATH points to the directory holding the pre-trained Hindi language model;
# PRE_LM_PATH is the path to its weights file
wgts = torch.load(PRE_LM_PATH, map_location=lambda storage, loc: storage)
itos2 = pickle.load((PRE_PATH/'itos_wiki_hindi.pkl').open('rb'))
stoi2 = collections.defaultdict(lambda: -1, {v:k for k,v in enumerate(itos2)})
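The mapping itself amounts to copying each pre-trained embedding row to the position our itos assigns to that word, falling back to the mean embedding for words the pre-trained model has never seen. A rough sketch of this step, mirroring the fastai imdb notebook (treat it as illustrative rather than our exact code):

enc_wgts = to_np(wgts['0.encoder.weight'])   # pre-trained embedding matrix
row_m = enc_wgts.mean(0)                     # mean embedding, used for unseen words

new_w = np.zeros((vs, enc_wgts.shape[1]), dtype=np.float32)
for i, w in enumerate(itos):
    r = stoi2[w]                             # index in the pre-trained vocab, -1 if absent
    new_w[i] = enc_wgts[r] if r >= 0 else row_m

# the encoder, its embedding-dropout copy and the tied decoder all share the new matrix
wgts['0.encoder.weight'] = T(new_w)
wgts['0.encoder_with_dropout.embed.weight'] = T(np.copy(new_w))
wgts['1.decoder.weight'] = T(np.copy(new_w))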

To fine-tune the language model, we create a model data object from our dataset, which is then used to create an instance of the language model learner.
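For reference, typical values for the hyperparameters that appear below, following the fastai imdb notebook (an illustrative assumption rather than exact settings), are:

em_sz, nh, nl = 400, 1150, 3    # embedding size, hidden units per layer, number of LSTM layers
bs, bptt = 52, 70               # batch size and back-propagation-through-time length
opt_fn = partial(optim.Adam, betas=(0.8, 0.99))
wd = 1e-7                       # weight decay
lrs = 1e-3                      # learning rate for language-model fine-tuning
drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15]) * 0.7   # dropouti, dropout, wdrop, dropoute, dropouth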

trn_dl = LanguageModelLoader(np.concatenate(trn_lm), bs, bptt)
val_dl = LanguageModelLoader(np.concatenate(val_lm), bs, bptt)
md = LanguageModelData(PATH, 1, vs, trn_dl, val_dl, bs=bs, bptt=bptt)

learner = md.get_model(opt_fn, em_sz, nh, nl, dropouti=drops[0],
                       dropout=drops[1], wdrop=drops[2],
                       dropoute=drops[3], dropouth=drops[4])
learner.metrics = [accuracy]
learner.model.load_state_dict(wgts)   # load the (vocab-mapped) pre-trained weights

The model is fine-tuned using the gradual unfreezing explained in the sections above.

learner.freeze_to(-1)    # first train only the last layer group
learner.fit(lrs/2, 1, wds=wd, use_clr=(32,2), cycle_len=1)
learner.unfreeze()       # then fine-tune all the layers
learner.fit(lrs, 1, wds=wd, use_clr=(20,10), cycle_len=7)

After fine-tuning, the language model is saved along with its encoder weights which will be used by the classifier.

learner.save('lm_fine_tuned')
learner.save_encoder('lm_enc_fine_tuned')

Classification Model

We begin by pre-processing the data in the same way as we did for the language model, and then build our RNN classifier.
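Roughly, that pre-processing and setup follows the fastai imdb notebook, as sketched below; the variable names and values (bs, c, dps, and the re-use of tok_trn/tok_val for the classification csvs) are illustrative assumptions rather than our exact code:

# tok_trn / tok_val and trn_labels / val_labels are assumed to now hold the tokenised
# texts and labels of the classification csvs in CLAS_PATH, numericalised with the
# language model's stoi so that both models share the same vocabulary
trn_clas = np.array([[stoi[o] for o in p] for p in tok_trn])
val_clas = np.array([[stoi[o] for o in p] for p in tok_val])
trn_labels = np.squeeze(trn_labels)
val_labels = np.squeeze(val_labels)

bs = 48                                               # classifier batch size
c = int(max(trn_labels.max(), val_labels.max())) + 1  # number of classes (5 here)
dps = np.array([0.4, 0.5, 0.05, 0.3, 0.4]) * 0.5      # classifier dropouts

trn_ds = TextDataset(trn_clas, trn_labels)
val_ds = TextDataset(val_clas, val_labels)
trn_samp = SortishSampler(trn_clas, key=lambda x: len(trn_clas[x]), bs=bs//2)
val_samp = SortSampler(val_clas, key=lambda x: len(val_clas[x]))
trn_dl = DataLoader(trn_ds, bs//2, transpose=True, num_workers=1, pad_idx=1, sampler=trn_samp)
val_dl = DataLoader(val_ds, bs, transpose=True, num_workers=1, pad_idx=1, sampler=val_samp)
md = ModelData(PATH, trn_dl, val_dl)

# discriminative learning rates for the classifier (divide by 2.6 between layer groups)
lr = 3e-3
lrm = 2.6
lrs = np.array([lr/lrm**4, lr/lrm**3, lr/lrm**2, lr/lrm, lr])
wd = 1e-7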

# (the function name is spelled 'classifer' in fastai 0.7)
m = get_rnn_classifer(bptt, 20*70, c, vs, emb_sz=em_sz, n_hid=nh,
                      n_layers=nl, pad_token=1, layers=[em_sz*3, 50, c],
                      drops=[dps[4], 0.1], dropouti=dps[0], wdrop=dps[1],
                      dropoute=dps[2], dropouth=dps[3])

# Adam is used as the optimiser
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))

RNN_Learner handles the creation of a learner object from text data with a given bptt.

learn = RNN_Learner(md, TextModel(to_gpu(m)), opt_fn=opt_fn)
learn.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)   # AR/TAR regularisation
learn.clip = .25                                       # gradient clipping
learn.metrics = [accuracy]

We load the encoder weights of the fine-tuned language model and train our classifier on top of them using gradual unfreezing.

learn.load_encoder('lm_enc_fine_tuned')

learn.freeze_to(-1)    # train only the classifier head
learn.fit(lrs, 1, wds=wd, cycle_len=1, use_clr=(8,3))
learn.freeze_to(-2)    # unfreeze one more layer group
learn.fit(lrs, 1, wds=wd, cycle_len=1, use_clr=(8,3))
learn.unfreeze()       # finally fine-tune the whole model
learn.fit(lrs, 1, wds=wd, cycle_len=10, use_clr=(32,10))

The pre-trained Hindi language model and the notebook can be found here.

Results

The model achieved a peak accuracy of 90.26% on the validation set.

End Notes

We hope you found this blog post helpful and have understood the concept of ULMFiT. There are still many things to explore in ULMFiT using the fastai library, and we encourage you to take a look. For a deeper understanding of the code, we suggest you go through the fastai course mentioned in the references section. If you have any doubts or suggestions, please feel free to mention them in the comments section.

Thanks for reading. Happy coding! 👨🏽‍💻😊

References

  1. http://course18.fast.ai/lessons/lesson10.html
  2. Regularizing and Optimizing LSTM Language Models
  3. A disciplined approach to neural network hyper-parameters
