InferSent Explained

Edward Zhang
6 min read · May 4, 2020

--

Background

Natural Language Processing (NLP) is the processing and analysis of large amounts of natural language data by computers. It gives machines the ability to read, understand, and generate human language.

One particularly popular method to allow computers to perform these tasks is by converting words and sentences into representations of their meaning, part-of-speech (POS), and other properties. Specifically, these word embedding methods encode words into vectors that can be used to improve the performance of several common NLP tasks, such as sentiment analysis, subjectivity classification, and question answering based on text.

Due to the costliness of producing labeled training data, many solutions to NLP tasks rely on the use of transfer learning via these pretrained embeddings. Thus, the usage of word embeddings has become prevalent in deep-learning natural language processing applications.

Word embeddings such as word2vec and GloVe have performed well on many NLP applications. However, further work has found that sentence-level embeddings can match and even outperform word embeddings on transfer tasks. Sentence embeddings such as SkipThought and FastSent were trained using an unsupervised learning approach.

Facebook AI researchers have found that supervised training methods also show promise in generating sentence embeddings that generalize well to a broader range of NLP tasks. This is the topic of their paper titled Supervised Learning of Universal Sentence Representations from Natural Language Inference Data.

The Natural Language Inference Task

While previous encoding models had been trained using unsupervised learning methods, Conneau et al. hypothesized that a model trained on an NLI task would generate embeddings that generalize well to other tasks, since NLI involves high-level reasoning about semantic relationships within sentences.

Natural Language Inference (NLI) is the task of determining whether a “hypothesis” is true (entailment), false (contradiction), or undetermined (neutral), given a “premise”.

Textual entailment can be thought of as a directional relation between two sentences. A sentence A entails another sentence B when the truth of B follows from the truth of A. For example, given the following sentences:

text: A soccer game with multiple males playing.

hypothesis: Some men are playing a sport.

We can say that the text entails the hypothesis, because if the text is true, the hypothesis must also be true.

We can also say that a text contradicts a hypothesis. This can be illustrated in the following example:

text: A black race car starts up in front of a crowd of people.

hypothesis: A man is driving down a lonely road.

Finally, we can also have a case where the text has no relation to the hypothesis. This is exemplified in the following sentences:

text: If you help the needy, God will reward you.

hypothesis: Giving money to a poor man will make you a better person.

For training data, the Stanford Natural Language Inference (SNLI) corpus was used. SNLI is a collection of 570,000 English sentence pairs labeled with entailment, contradiction, or neutral.
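As a rough illustration (not part of the original paper's tooling), an SNLI premise/hypothesis pair can be inspected with the Hugging Face datasets library:

```python
from datasets import load_dataset

# Each SNLI example has a premise, a hypothesis, and a label:
# 0 = entailment, 1 = neutral, 2 = contradiction (-1 marks pairs without a gold label)
snli = load_dataset("snli", split="train")
example = snli[0]
print(example["premise"])
print(example["hypothesis"])
print(example["label"])
```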

Training

The training task is to build a model that takes the premise and hypothesis as inputs and outputs either entailment, contradiction, or neutral. The shared architecture used to train the models separately encodes the premise and hypothesis inputs. These encodings are denoted u and v, respectively. Three matching methods are then leveraged to recognize relations between the premise and hypothesis:

  1. Concatenation of u and v: (u, v)
  2. Element-wise product: u * v
  3. Absolute element-wise difference: |u - v|

The resulting vector is then fed into several fully connected layers, culminating in a 3-way softmax for classification.
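A minimal PyTorch sketch of this feature combination and classifier might look like the following; the layer sizes and module names are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class NLIClassifier(nn.Module):
    """Combines premise/hypothesis encodings u and v and classifies their relation."""

    def __init__(self, enc_dim=2048, hidden_dim=512, n_classes=3):
        super().__init__()
        # Input features are [u, v, |u - v|, u * v]  ->  4 * enc_dim dimensions
        self.mlp = nn.Sequential(
            nn.Linear(4 * enc_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_classes),  # 3-way: entailment / contradiction / neutral
        )

    def forward(self, u, v):
        features = torch.cat([u, v, torch.abs(u - v), u * v], dim=1)
        return self.mlp(features)  # logits; softmax / cross-entropy applied during training
```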

Sentence encoder architectures:

A total of 7 different architectures were evaluated to see which one best captures generically useful encodings for sentences:

  1. Standard recurrent encoders with Long Short-Term Memory (LSTM)
  2. Standard recurrent encoders with Gated Recurrent Unit (GRU)
  3. Concatenation of last hidden states of forward and backward GRU
  4. Bi-directional LSTMs (BiLSTM) with mean pooling
  5. Bi-directional LSTMs (BiLSTM) with max pooling
  6. Self-attentive network
  7. Hierarchical convolutional network

These seven architectures were evaluated based on their ability to capture information that is useful to a broad set of NLP problems. Thus, the encodings generated by these sentence encoder architectures were used in the following tasks:

  1. Binary and multi-class classification: These tasks include sentiment analysis, subjectivity and objectivity classification, and finding opinion polarity, among other tasks.
  2. Entailment and semantic relatedness: The two datasets used (SICK-R and SICK-E) classify the relationship of pairs of sentences based on their semantic similarity and entailment, respectively.
  3. STS14 — semantic textual similarity: This dataset contains pairs of sentences that are human-labeled based on their similarity.
  4. Paraphrase detection: This dataset contains pairs of sentences that are human-labeled according to whether or not they capture a paraphrase/semantic equivalence relationship.
  5. Caption-Image retrieval: The goal in image retrieval is to rank a collection of images according to their relevance to a given caption. The goal in caption retrieval is to rank a collection of captions according to their relevance to a given image.

The empirical results show that the BiLSTM with max pooling achieves the highest performance on these tasks.

BiLSTM with Max Pooling

The BiLSTM with Max Pooling has the following structure:

A Bidirectional LSTM is a model that contains two LSTMs: one reads the input sequence in the original order, and the other reads the input sequence in the reverse order. As shown in the diagram above, each word w is fed into two hidden units, one for the forward LSTM and one for the backward LSTM. The output of these two hidden units is concatenated to produce a fixed-length vector.

Now we have a fixed-length vector for each input word, shown in the diagram as the second layer from the top. To combine these vectors into a single fixed-length output vector u (which will be our final sentence encoding), we take, for each dimension of u, the maximum value across all of the word vectors.
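A minimal PyTorch sketch of such a BiLSTM-max encoder is shown below; the dimensions are illustrative, and padding/packing of variable-length sentences is omitted for brevity:

```python
import torch
import torch.nn as nn

class BiLSTMMaxEncoder(nn.Module):
    """Encodes a sentence by max-pooling over bidirectional LSTM hidden states."""

    def __init__(self, embed_dim=300, hidden_dim=2048):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, word_embeddings):
        # word_embeddings: (batch, seq_len, embed_dim), e.g. pretrained GloVe vectors
        hidden, _ = self.lstm(word_embeddings)   # (batch, seq_len, 2 * hidden_dim)
        # Max over the time dimension: for each of the 2 * hidden_dim features,
        # keep the largest value observed across the words of the sentence.
        u, _ = torch.max(hidden, dim=1)          # (batch, 2 * hidden_dim)
        return u
```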

Results

As mentioned before, the SkipThought sentence encoder was the best performing sentence encoder at the time the paper was written. The BiLSTM-max model consistently outperforms the results obtained by SkipThought. Notably, it does so despite SkipThought being trained on significantly more data (64M sentences versus 570k sentence pairs) and for much longer: the BiLSTM-max was trained in less than a day on a single GPU, whereas the SkipThought network was trained for a month.

Related Works:

Deep contextualized word representations

ELMo is a “deep contextualized word representation” from the researchers behind AllenNLP. ELMo generates word-level embeddings, but it differs from previous approaches in that it also looks at the word’s surrounding sentence to generate the appropriate embedding. Some works show that ELMo performs better on many of the tasks that were used to evaluate InferSent.

Universal Sentence Encoder

A paper titled Universal Sentence Encoder takes an approach similar to that of Conneau et al. to construct a sentence encoder. The model comes in two variants: one uses the Transformer architecture, which relies on the attention mechanism to learn contextual information within a sentence; the other uses a simple deep averaging network, which is faster but less accurate. The models are first trained in an unsupervised manner on data pulled from multiple Web sources, and then further trained on SNLI in a fashion similar to Conneau et al.

Evaluation of sentence embeddings in downstream and linguistic probing tasks

This paper compares several sentence and word embedding methods on multiple tasks such as sentiment analysis, evaluation of opinion polarity, and subjectivity classification. It compares methods such as ELMo, USE, and InferSent (the name of the model from Conneau et al.) on these general tasks. The experimental results show that ELMo performs the best on 5 out of 9 tasks, with USE and InferSent each performing the best on 2 out of 9 tasks.

SentEval: An Evaluation Toolkit for Universal Sentence Representations

SentEval is a toolkit for evaluating the quality of universal sentence representations like InferSent and USE. Since good universal sentence representations must be able to perform well in a wide variety of systems tailored to many different tasks, tools for generating such sentence representations should be tested across multiple domains. SentEval integrates several classification, natural language inference, and semantic similarity tasks.
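As a rough sketch of how SentEval is driven, an encoder is plugged in through a batcher callback; the random embeddings and the 'data' task path below are placeholders standing in for a real encoder such as InferSent and a local copy of the task data:

```python
import numpy as np
import senteval

def prepare(params, samples):
    # Optional hook: build vocabularies or load word vectors from the task's samples.
    return

def batcher(params, batch):
    # batch is a list of tokenized sentences; SentEval expects one embedding per sentence.
    # Here random 2048-d vectors stand in for a real sentence encoder.
    return np.random.rand(len(batch), 2048)

params = {'task_path': 'data', 'usepytorch': True, 'kfold': 5}
se = senteval.engine.SE(params, batcher, prepare)
results = se.eval(['MR', 'SICKEntailment', 'STS14'])
```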

Skip-Thought Vectors

Skip-Thought vectors are sentence encodings trained in an unsupervised manner on text from books. Kiros et al. use an encoder-decoder framework with Gated Recurrent Units (GRUs). Skip-Thought achieved state-of-the-art performance at the time, but InferSent eventually achieved better performance.
