Consumer Complaints (Financial) with pre-trained GloVe

Joseph Roy
Sep 3, 2018 · 6 min read

The Consumer Financial Protection Bureau (CFPB) regulates the offering and provision of consumer financial products and services under the federal consumer financial laws, and educates and empowers consumers to make better-informed financial decisions. The CFPB receives a large number of consumer complaints each day, spanning various financial products. For this project, I used the consumer complaints data and tried to predict the category of a complaint from its consumer complaint narrative.

Website: https://www.consumerfinance.gov/data-research/consumer-complaints/search/?date_received_max=2018-01-09&date_received_min=2017-12-31&from=0&has_narrative=true&searchField=all&searchText=&size=25&sort=created_date_desc

The consumer data can be loaded and inspected like this:
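This is a minimal sketch with pandas; the filename consumer_complaints.csv is an assumption, and the column names follow the CFPB CSV export.

```python
import pandas as pd

# Load the CFPB export (filename is an assumption).
df = pd.read_csv('consumer_complaints.csv')

# Keep only complaints that actually contain a narrative.
df = df.dropna(subset=['Consumer complaint narrative'])
print(df[['Product', 'Sub-product', 'Consumer complaint narrative']].head())
```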

Product categories:

We can see there are 18 financial products across which the CFPB received consumer complaints. I am interested in exploring the ‘Student loan’ data, which can be pulled out as shown below.
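Continuing the sketch above, a quick way to see the per-product counts and extract the student-loan subset (the ‘Student loan’ label follows the CFPB product names):

```python
# Count complaints per product category (18 categories in this extract).
print(df['Product'].value_counts())

# Pull out the student-loan subset; .copy() avoids pandas' chained-assignment
# warning when we add columns later.
student = df[df['Product'] == 'Student loan'].copy()
print(student['Sub-product'].value_counts())  # the 3 sub-products
```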

The student loan data can be further categorized into 3 sub-products, mentioned below:

The sub-products are my target variable, and the ‘consumer complaint narrative’ text is my feature.

First, I cleaned the ‘consumer complaint narrative’ data by removing punctuation, converting the text to lower case, and lemmatizing it.
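A minimal cleaning sketch using NLTK’s WordNet lemmatizer; the helper name clean_text is mine, not taken from the original notebook:

```python
import re

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet', quiet=True)
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    text = text.lower()                    # lower-case
    text = re.sub(r'[^a-z\s]', ' ', text)  # drop punctuation and digits
    # Lemmatize each remaining token.
    return ' '.join(lemmatizer.lemmatize(token) for token in text.split())

student['clean_narrative'] = student['Consumer complaint narrative'].apply(clean_text)
```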

Data Preprocessing for ML

The cleaned text was then tokenized and vectorized using TfidfVectorizer with n-grams in the range (1, 2). I used Logistic Regression and Multinomial Naive Bayes for my predictions.

The cross-validation mean accuracies for Logistic Regression and Multinomial NB were 68.44% and 62.83%, respectively.
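A sketch of how such scores can be obtained; the 5-fold split and stop-word removal are my assumptions, so the exact numbers may differ from the original notebook:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Unigrams and bigrams, TF-IDF weighted.
tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')
X = tfidf.fit_transform(student['clean_narrative'])
y = student['Sub-product']

for model in (LogisticRegression(max_iter=1000), MultinomialNB()):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, round(scores.mean(), 4))
```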

To improve the model further, I used the pre-trained ‘GloVe’ embeddings with an LSTM network. For experimentation, both the pre-trained embeddings and embeddings trained from scratch on the project corpus were considered.

Pre-processing for Deep Learning

For dealing with our training data, we need to know the vocabulary size (the total number of unique words in our corpus).

After counting the unique words, the texts are tokenized, i.e. each word is mapped to an integer index.
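A sketch with the Keras Tokenizer (the variable names are mine):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

texts = student['clean_narrative'].tolist()

tokenizer = Tokenizer()        # no num_words cap: keep the full vocabulary
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)  # words -> integer indices

print('Unique words:', len(tokenizer.word_index))
vocab_size = len(tokenizer.word_index) + 1       # +1 for the padding index 0
```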

Padding and Truncating data:

The Recurrent Neural Network can take sequences of arbitrary length as input, but in order to process a whole batch of data, the sequences need to have the same length. If we simply pad everything to the length of the longest sentence, we waste a lot of memory, so we truncate longer sequences and pad shorter ones.

First we count the number of tokens in all the sequences in the training data.

The average number of tokens in a sequence is:

The max number of tokens we will allow is set to the average plus 2 standard deviations. This covers around 95% of the dataset.

When padding or truncating sequences to a common length, we need to decide whether to do it ‘pre’ or ‘post’. This choice can be important, because it determines whether we throw away the first or the last part of a sequence when truncating, and whether we add zeros at the beginning or at the end when padding. A sketch of these last few steps, with ‘pre’ chosen for both (the choice is an assumption consistent with the discussion here):
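```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

num_tokens = np.array([len(seq) for seq in sequences])
print('Average tokens per sequence:', num_tokens.mean())

# Cap sequence length at the mean plus 2 standard deviations.
max_tokens = int(num_tokens.mean() + 2 * num_tokens.std())
print('Fraction of sequences covered:',
      np.sum(num_tokens <= max_tokens) / len(num_tokens))

# truncating='pre' drops the start of long narratives; padding='pre' adds
# zeros at the beginning of short ones.
X_pad = pad_sequences(sequences, maxlen=max_tokens,
                      padding='pre', truncating='pre')
```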

The ‘target’ variable, i.e. the sub-product, has to be one-hot encoded. We first label-encode and then one-hot encode, so that no ordinal relationship is implied and each category gets equal importance.
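A sketch of the two-step encoding with scikit-learn and Keras:

```python
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

label_encoder = LabelEncoder()
y_int = label_encoder.fit_transform(student['Sub-product'])  # labels -> 0..2
y_onehot = to_categorical(y_int)                             # -> one-hot rows

print(label_encoder.classes_)
print(y_onehot.shape)  # (num_samples, 3)
```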

Embedding

We shall see whether pre-trained embeddings like GloVe, which are trained on billions of words, can improve our accuracy score compared to training our own embedding. We will compare the performance of models using these pre-trained embeddings against a baseline model that doesn’t use any pre-trained embeddings.

How does GloVe work?

GloVe (“Global Vectors for Word Representation”) learns by constructing a co-occurrence matrix (words × contexts) that counts how frequently a word appears in a given context. Since this matrix is gigantic, it is factorized to obtain a lower-dimensional representation for each word. Here, we will use pre-trained word vectors, which can be downloaded from the GloVe website. Vectors of different dimensions (50, 100, 200, 300) trained on Wikipedia and Gigaword data are available. For this example, I have downloaded the 100-dimensional version of the model.
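Concretely, this “factorization” is learned by minimizing a weighted least-squares objective over the co-occurrence counts (this is the objective from the original GloVe paper):

J = Σᵢⱼ f(Xᵢⱼ) (wᵢᵀ w̃ⱼ + bᵢ + b̃ⱼ − log Xᵢⱼ)²

where Xᵢⱼ counts how often word j occurs in the context of word i, wᵢ and w̃ⱼ are the word and context vectors, bᵢ and b̃ⱼ are biases, and f is a weighting function that damps very rare and very frequent co-occurrences.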

Here, we find the words common to our corpus and the pre-trained model. The embeddings for those words are taken from the pre-trained model, since they were trained on a huge corpus.

With those embeddings, we construct our embedding matrix, which will eventually be used as the input to our LSTM model.

The embedding matrix looks like this. The weights are derived from the pre-trained model.

The embedding size is 100 because I used the 100-dimensional model. Our vocabulary size is 16,119.
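A sketch of building that matrix from the downloaded vectors; glove.6B.100d.txt is the 100-dimensional file from the GloVe download page:

```python
import numpy as np

embed_dim = 100  # matches the 100-dimensional GloVe file

# Parse the GloVe text file into a {word: vector} lookup.
glove = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.split()
        glove[parts[0]] = np.asarray(parts[1:], dtype='float32')

# One row per word in our vocabulary; words missing from GloVe stay zero.
embedding_matrix = np.zeros((vocab_size, embed_dim))
for word, index in tokenizer.word_index.items():
    vector = glove.get(word)
    if vector is not None:
        embedding_matrix[index] = vector
```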

With this embedding matrix, we can construct the embedding layer that serves as the input to our LSTM model.

We are using a bidirectional LSTM layer. But how does it work?

Imagine that the LSTM is split into 2 hidden states for each time step. As the sequence of words is fed into the LSTM in the forward direction, the reversed sequence is simultaneously fed into a separate hidden state. You might notice later in the model summary that the output dimension of the LSTM layer has doubled to 120, because 60 dimensions are used for the forward pass and another 60 for the reverse pass.

The greatest advantage of using a bidirectional LSTM is that the backward pass preserves information from the future; by combining the two hidden states, the network can, at any point in time, draw on information from both the past and the future.

Let’s have a look at the model summary:
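A sketch of the network in Keras; the exact layer sizes (60 LSTM units, dropout rate 0.5) are assumptions consistent with the 120-dimensional bidirectional output mentioned above:

```python
from tensorflow.keras.layers import (Bidirectional, Dense, Dropout,
                                     Embedding, LSTM)
from tensorflow.keras.models import Sequential

model = Sequential([
    # Embedding layer initialised with the pre-trained GloVe weights.
    Embedding(vocab_size, embed_dim,
              weights=[embedding_matrix],
              input_length=max_tokens,
              trainable=False),
    Bidirectional(LSTM(60)),         # 60 forward + 60 reverse = 120 outputs
    Dropout(0.5),                    # dropout to reduce overfitting
    Dense(3, activation='softmax'),  # one probability per sub-product
])
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()
```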

To reduce the overfitting problem, we have introduced dropout. The accuracy of this model on the test dataset is around 70.69%.

Baseline model

Further, I wanted to experiment with our baseline model, i.e. without the pre-trained embeddings, instead learning the embeddings from our own text corpus. The training accuracy of this LSTM model seems quite high, but there is clear overfitting, since the training accuracy is quite high compared to the validation accuracy. The validation accuracy is around 70.93%, which is quite similar to the pre-trained model.

Conclusion:

Although I could not find any significant improvement in accuracy from using the pre-trained model, it would be interesting to see whether the word2vec model or another pre-trained model could give better results. Please find the code for this work in my GitHub profile:

https://github.com/joseph10081987/Machine-Learning_new/blob/master/complaints(1.2017-10.2018).ipynb
