Word Embedding

Emily Jaekle
Deep Learning Data 2040
2 min read · May 12, 2018

https://github.com/ejaekle/deeplearning

Based on the notebook from chapter 6.1 of “Deep Learning with Python”, we can create word embeddings: dense vectors that represent words and are very useful for analyzing text data with convolutional neural nets. Word embeddings differ from one-hot encoding because they are learned from the data.
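
As a rough illustration (not the book’s or the repo’s exact code), a Keras Embedding layer maps integer word indices to dense, trainable vectors, whereas a one-hot encoding would give each word a sparse vector as long as the whole vocabulary; the sizes below are just example values:

```python
from tensorflow.keras.layers import Embedding

# Example numbers only: a 5,000-word vocabulary mapped into
# 64-dimensional dense vectors that are learned during training.
embedding_layer = Embedding(input_dim=5000, output_dim=64)

# A one-hot encoding of the same vocabulary would instead need
# 5,000-dimensional sparse vectors with no learned structure.
```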

The data I used for this is about questions asked on Quora and can be found on Kaggle. Quora tries to eliminate duplicate questions on its site, and this dataset contains pairs of questions along with a binary label for whether they are considered the same (1 if they are duplicates, 0 if they are not). Using this dataset I created word embeddings to be used in a convnet.
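
A minimal sketch of loading the data, assuming the Kaggle `train.csv` file with `question1`, `question2`, and `is_duplicate` columns (the file path, and the choice to concatenate the two questions into one text, are my assumptions, not details from the original post):

```python
import pandas as pd

# Assumed path to the Kaggle "Quora Question Pairs" training file.
df = pd.read_csv('train.csv')

# Each row holds a pair of questions and a binary duplicate label.
# Here the two questions are joined into one string per pair (an assumption).
texts = (df['question1'].fillna('') + ' ' + df['question2'].fillna('')).tolist()
labels = df['is_duplicate'].values
```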

Since we have over 400,000 pairs of questions, I chose to use 200,000 for training and 50,000 for validation. The maximum text length is 100 words (so hopefully that will cover the full text of both questions in a pair). Finally, we will only consider the top 5,000 words in the dataset.
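
In code, these choices correspond to a handful of constants (the variable names are mine, mirroring the chapter 6.1 notebook rather than the original repo):

```python
training_samples = 200000   # question pairs used for training
validation_samples = 50000  # question pairs used for validation
maxlen = 100                # cut each combined question pair off after 100 words
max_words = 5000            # only consider the top 5,000 words in the dataset
```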

The dataset has 404,351 pairs of questions along with labels for whether or not they are the same. After tokenizing the data we have 95,603 unique tokens.
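
A sketch of the tokenization and train/validation split, following the pattern from chapter 6.1 (the variable names continue from the snippets above and are assumptions, not the repo’s exact code):

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Fit the tokenizer on the combined question texts, keeping the top 5,000 words.
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(sequences=None or texts)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

# Pad/truncate every sequence to a fixed length of 100.
data = pad_sequences(sequences, maxlen=maxlen)
labels = np.asarray(labels)

# Shuffle, then split into training and validation sets.
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]
```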

Now, because we have a lot of data, I chose to train the model without loading pre-trained word embeddings and without freezing the embedding layer.
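
One way such a model might look, as a sketch rather than the exact architecture from the repo (the embedding dimension and the dense layers are assumptions; see the GitHub link above for the real code):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

embedding_dim = 100  # assumed embedding size, not stated in the post

model = Sequential()
# The Embedding layer is trained from scratch (no pre-trained vectors loaded)
# and is left trainable (not frozen).
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))  # binary output: duplicate or not
model.summary()
```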

Here is my model summary:

I ran this for 5 epochs with a batch size of 1024 (the training call is sketched below) and had the following results.
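
A minimal sketch of the compile and fit step; the post does not state the optimizer or loss, so rmsprop and binary cross-entropy are assumptions (binary cross-entropy is the natural choice for a 0/1 duplicate label):

```python
# Assumed compile settings, not taken from the original repo.
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])

history = model.fit(x_train, y_train,
                    epochs=5,
                    batch_size=1024,
                    validation_data=(x_val, y_val))
```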

Model Output

Plotting the accuracy and loss results (sketched below), we get the following.
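
A sketch of the plotting code, following the standard pattern from the book and assuming the `history` object from the fit call above (the exact metric key, 'acc' vs 'accuracy', depends on how the model was compiled):

```python
import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)

# Training vs validation accuracy.
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

# Training vs validation loss.
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
```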

Accuracy and Loss Plots

While the validation accuracy is not great in this case, it is better than the 50% we would expect from a random guess. In the textbook the accuracy does not get above about 55%, so our overall accuracy of around 62% is actually quite good, especially for text data. The validation loss is not great either, although it does appear to improve towards the end, which is a positive trend.
