Deep Averaging Network in the Universal Sentence Encoder

Aditya Kumar · Published in tech-that-works · Aug 28, 2019

Word embeddings are now the state of the art for downstream NLP tasks such as text classification, sentiment analysis, and sentence similarity, and they give much better results than tf-idf or a count vectorizer. With word embeddings we can measure similarity between words and apply vector operations, so we can easily distinguish between cat, dog, and car: cat and dog end up more similar to each other than either is to car.
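
As a minimal illustration of that intuition, cosine similarity over word vectors ranks cat closer to dog than to car. The toy vectors below are invented for the example, not real pre-trained embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two word vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 4-dimensional vectors, made up purely for illustration.
cat = np.array([0.8, 0.1, 0.6, 0.0])
dog = np.array([0.7, 0.2, 0.5, 0.1])
car = np.array([0.1, 0.9, 0.0, 0.7])

print(cosine_similarity(cat, dog))  # high: related animal words
print(cosine_similarity(cat, car))  # lower: unrelated concepts
```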

But how to obtain vectors for sentences is not immediately obvious. This post explains one of the approaches used in the Universal Sentence Encoder.

Deep Averaging Network (DAN): the idea behind the DAN is described in the paper Deep Unordered Composition Rivals Syntactic Methods for Text Classification.

Word embeddings are low-dimensional vectors in an N-dimensional space that describe individual words. To obtain a vector-space model for sentences or documents, an appropriate composition function is required. A composition function is a mathematical process for combining multiple word vectors into a single vector.
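
A minimal sketch of the simplest unordered composition function, averaging the word vectors of a sentence into one vector (the `embedding_lookup` dict here is a stand-in for a real embedding table):

```python
import numpy as np

def average_composition(tokens, embedding_lookup, dim=300):
    """Compose a sentence vector by averaging its word embeddings."""
    vectors = [embedding_lookup[t] for t in tokens if t in embedding_lookup]
    if not vectors:
        return np.zeros(dim)  # fallback for sentences with no known words
    return np.mean(vectors, axis=0)
```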

Composition functions come in two types:
1. Unordered: treat the input as a bag of word embeddings.
2. Syntactic: take word order and sentence structure into account.
Syntactic functions outperform unordered functions on many tasks, but at the same time they are computationally expensive and require more training time.

The deep unordered model, which obtains near state-of-the-art accuracy on sentence- and document-level tasks with far less training time, works in three steps (a minimal code sketch of these steps follows the list):
(a) take the vector average of the embeddings associated with an input sequence of tokens,
(b) pass that average through one or more feed-forward layers,
(c) perform (linear) classification on the final layer's representation.
The loss function is cross-entropy.
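
Putting the steps together, a DAN-style classifier could be sketched in Keras as below. This is only an illustrative sketch: the vocabulary size, embedding dimension, and layer widths are placeholder choices, not values from the paper or the Universal Sentence Encoder.

```python
import tensorflow as tf

VOCAB_SIZE = 20000   # placeholder vocabulary size
EMBED_DIM = 300      # placeholder embedding dimension
NUM_CLASSES = 2      # e.g. binary sentiment

model = tf.keras.Sequential([
    # (a) look up token embeddings and average them over the sequence
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True),
    tf.keras.layers.GlobalAveragePooling1D(),
    # (b) pass the average through feed-forward layers
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(256, activation="relu"),
    # (c) linear classification on the final representation
    tf.keras.layers.Dense(NUM_CLASSES),
])

# (d) cross-entropy loss on the logits
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
```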

Figure: Deep Averaging Network

Two important observations described in the paper are:
• Accuracy can be improved by using a variant of dropout that randomly drops some of the word embeddings before averaging, i.e. a dropout-inspired regularizer (sketched in code after this list).

• The choice of composition function is not as important as initializing with pre-trained embeddings and using a deep network.
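
Here is a sketch of that word-dropout idea, dropping whole word embeddings with probability p before averaging. It is written in plain NumPy for this post, not taken from the paper's code:

```python
import numpy as np

def word_dropout_average(word_vectors, p=0.3, training=True, rng=None):
    """Average word embeddings, randomly dropping whole words with probability p."""
    rng = rng or np.random.default_rng()
    word_vectors = np.asarray(word_vectors)        # shape: (num_words, dim)
    if training:
        keep = rng.random(len(word_vectors)) >= p  # True = keep this word
        if keep.any():                             # never drop every word
            word_vectors = word_vectors[keep]
    return word_vectors.mean(axis=0)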

The DAN takes the best of both approaches: the training speed of an unordered function and accuracy close to that of syntactic functions.
It needs far less training time, with only slightly lower accuracy, compared to the other approach in the Universal Sentence Encoder, the transformer encoder.

Observations on results:
• Randomly dropping out 30% of words from the vector average is optimal for the quiz-bowl task and gives about 3% higher accuracy, which indicates that p = 0.3 is a good baseline to start with.
• DANs achieve sentiment accuracy comparable to syntactic functions and train in far less time than syntactic models such as the RecNN.
• Two to three layers already give good results on the binary sentiment analysis task, and adding depth improves over the shallow neural bag-of-words (NBOW) model.
• Sometimes word order really matters in NLP. "Man bites dog" and "Dog bites man" are two different sentences, but since we are just averaging the embeddings, that distinction is lost.
• The DAN also performs poorly on sentences with double negatives such as "this movie was not bad", while the DRecNN is slightly better at handling such polarity shifts.

Figure: Negation

• When checking the similarity of the sentences "this is toy dog" and "this is dog toy", the DAN encodings of the two should be identical, since the words are the same and ordering should not matter, yet it turns out they are not the same.

Figure: Textual similarity with DAN

This might be due to the word dropout applied before averaging during the feed-forward pass of the DAN.
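
For reference, this kind of sentence-similarity check can be reproduced roughly as below with the DAN-based Universal Sentence Encoder from TF Hub, following the colab in the references. The module URL and 512-dimensional output are as published on TF Hub; exact similarity values will vary:

```python
import numpy as np
import tensorflow_hub as hub

# Load the Universal Sentence Encoder (DAN variant) from TF Hub.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = ["this is toy dog", "this is dog toy"]
vectors = embed(sentences).numpy()  # shape: (2, 512)

# Cosine similarity between the two sentence embeddings.
a, b = vectors
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(similarity)  # close to, but not exactly, 1.0
```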

The Colab notebook can be accessed here.

References:

  1. Universal Sentence Encoder: https://arxiv.org/pdf/1803.11175.pdf
  2. Deep Unordered Composition Rivals Syntactic Methods for Text Classification: https://people.cs.umass.edu/~miyyer/pubs/2015_acl_dan.pdf
  3. Semantic similarity with the TF Hub Universal Sentence Encoder (Colab): https://github.com/tensorflow/hub/blob/master/examples/colab/semantic_similarity_with_tf_hub_universal_encoder.ipynb
