Representing course content with Universal Sentence Encoders

Meltem Tutar
Udemy Tech Blog


At Udemy, we have over 200,000 courses across ~75 languages, and we wanted to build vectors from the course text to represent curricula. We studied and experimented with various natural language processing (NLP) packages that provide pre-trained models to embed text. This article shares our experience building embeddings for a specific use case, in the hope that it will be useful in other domains as well.

Our use cases required us to build embeddings with the following qualities:

  • Ability to adapt to multiple tasks at Udemy. The content embeddings may be used in many tasks such as surfacing courses with similar curricula or as input features in other recommendation or search models, so we wanted to choose something versatile and generalizable.
  • Being language agnostic. Maintaining a separate model for each language would require too much time and effort. Also, other downstream models may require embeddings in different languages to be in the same space.
  • Ability to handle text of varying lengths. Depending on which portion of the course we are interested in, some input data may be several paragraphs, while others are just a single sentence. We wanted an embedding method that could scale to longer texts.
  • Ease of use. We wanted embeddings that could be created in a few lines of code and used without additional training or fine-tuning on our data.

Given these constraints, we identified the Universal Sentence Encoder (USE) family as a good candidate model. We will briefly discuss the architecture of USE models and how they fare on each of these constraints and present snippets of code to generate multilingual USE¹ embeddings using the Spark NLP library. Then we will discuss key takeaways and other models and packages we believe are worth trying.

Adaptability to different tasks

The original architectures of USE presented by Cer et al. consist of an encoder that is shared across multiple downstream tasks. They experimented with two different architectures for the encoder, the Deep Averaging Network (DAN) and the Transformer model. The tasks are diverse, including both self-supervised and supervised problems such as:

  • A self-supervised skip-thought-like task, where the encoder predicts the text surrounding a given sentence.
  • Conversational input-response prediction.
  • Supervised classification on the Stanford Natural Language Inference (SNLI) dataset.

The authors state that the encoding model is designed to be as general purpose as possible and can be used for text classification, semantic similarity, clustering, and other natural language tasks. They applied the model to the Semantic Textual Similarity (STS) benchmark, and as of the writing of this article, it ranks in the top 26.

High-level architecture of USE (from blog post²)

Many other Transformer models are trained in a self-supervised fashion and are not immediately suitable for specific tasks without fine-tuning. For example, ALBERT and BERT are trained using masked language modeling, where masked portions of a sentence are predicted from the surrounding words. The authors of ALBERT note this as a limitation: the model is intended to be fine-tuned on downstream tasks rather than used out of the box. Similarly, in our use case, BERT embeddings without tuning performed worse than USE.
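
To make "BERT embeddings without tuning" concrete, the sketch below mean-pools the token representations of a pre-trained model from the Hugging Face transformers library. The model name and pooling choice are illustrative assumptions, not a record of our exact setup.

```python
# Hedged sketch: an "out of the box" BERT sentence embedding obtained by
# mean-pooling the last hidden states (no fine-tuning involved).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

def mean_pooled_embedding(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state       # (1, seq_len, hidden_dim)
    mask = inputs["attention_mask"].unsqueeze(-1)        # zero out padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (1, hidden_dim)

embedding = mean_pooled_embedding("Learn Python for data analysis.")
```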

You can use SBERT to fine-tune a given Transformer model on your task, and there are already pre-trained models for sentence similarity tasks (e.g., a Hugging Face model with ALBERT embeddings). While these models are close to the state of the art nowadays and likely to perform better than USE, we preferred USE for its ease of use and the efficiency gains it provides.
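
If you go the SBERT route, the sentence-transformers library makes this a few lines of code. The model name below is an illustrative choice from their public model list, not necessarily the one we evaluated.

```python
# Hedged sketch: a pre-trained SBERT model applied to sentence similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode(
    ["Introduction to deep learning", "A beginner's guide to neural networks"],
    convert_to_tensor=True,
)
print(float(util.cos_sim(embeddings[0], embeddings[1])))
```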

In addition, be aware that there are self-supervised sentence encoder models whose authors claim to produce highly generic sentence representations usable directly for downstream tasks (e.g., skip-thought vectors). As mentioned earlier, USE was trained on a skip-thought-like task, along with others. If you have a large corpus of unlabeled data and are interested in training on your own data, this may be a good model family to investigate.

Language agnostic

Yang et al.¹ introduce multilingual USE models that follow a similar philosophy to the one described above, where a shared encoder is trained across multiple supervised and self-supervised tasks. They implement two different architectures for the encoder, a CNN and a Transformer model. The tasks include:

  • Predicting question-answer pairs across multiple languages from websites like Reddit, Stack Overflow and Yahoo Answers. A portion of question-answer pairs are translated with Google Translate to ensure even coverage across languages.
  • Ranking translated text in multiple languages. The translation pairs are mined with a similar approach as in Uszkoreit et al. (2010).
  • The SNLI natural language inference dataset mentioned above. SNLI is only in English, but the authors translate it to additional languages with Google Translate.

As a result, we have a single USE model that can represent 16 languages. With multilingual embeddings, there is some concern that performance in high-resource languages will decrease. However, the authors show that multilingual USE outperforms the previous English-only USE embeddings on English tasks from SentEval. Similarly, on our evaluation tasks, the multilingual USE outperformed the English-only USE and the other out-of-the-box multilingual models, including the BERT-based LaBSE.

Input text of varying lengths

As described previously, multiple architectures can be used to implement encoders in the USE family, with varying efficiencies:

  • The original English-only models by Cer et al.³ come in a Deep Averaging Network (DAN) version and a Transformer version.

  • The multilingual models by Yang et al.¹ come in a CNN version and a Transformer version.

The Transformer architecture has a self-attention module in each layer, whose computation time is known to scale quadratically with input length, O(n²), whereas the DAN encoder consists only of feedforward networks and has O(n) computation time. In the paper, DAN's performance on the evaluation sets is on par with or slightly worse than the Transformer's. On the STS benchmark test set, which is most similar to our task of document similarity, DAN scores lower than the Transformer, at 0.719 versus 0.782. However, on other transfer learning tasks and with varying task-specific models, DAN and the Transformer architecture are comparable.

Example of a layer in the Transformer architecture (from blog post²)

The multilingual USE models are available in CNN and Transformer versions. The authors note that the CNN is more efficient but can show a reduction in performance. They also note that for shorter text sequences, the Transformer's compute time scales closer to linear than quadratic with length, because its larger constant factor dominates. The multilingual CNN performed on par with the Transformer on our evaluation set, so we used the multilingual CNN as the encoder architecture.

Many other state-of-the-art packages limit the length of the input text. For example, when using SBERT to fine-tune Transformer models for document similarity, the chosen underlying Transformer architecture limits the input length and makes run time grow quadratically with sequence length. For instance, if you choose BERT, there is a limit of 512 word pieces per text. No such limit exists for USE, and we have had no issues embedding any kind of text; all executions complete in a reasonable amount of time, even for texts of around 2,000 words.
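
As a quick illustration of the 512 word-piece limit, the snippet below counts word pieces with a Hugging Face tokenizer; the model name and sample text are placeholders.

```python
# Illustrative check of how many word pieces a long course description produces
# (BERT-style models typically cap inputs at 512 word pieces).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
course_text = "This course covers the fundamentals of data engineering. " * 50
n_word_pieces = len(tokenizer.encode(course_text, add_special_tokens=True))
print(n_word_pieces, "word pieces; over the 512 limit:", n_word_pieces > 512)
```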

Ease of Use

USE embeddings are available on TensorFlow Hub and can be used directly in TensorFlow or imported through other NLP packages such as spaCy and Spark NLP. We chose Spark NLP because of our previous familiarity with Spark and its ability to parallelize the computation easily. The downside of Spark NLP is that it is not possible to train the USE model on our data and update the embedding values. However, you could use the fixed USE embeddings as input to another model whose parameters you learn through fine-tuning for a custom task.

In our case, we simply ran the pre-trained multilingual CNN USE model on the course text, with no additional training, to compute embeddings, and calculated cosine similarity between them to discover related courses. For better performance, the USE authors suggest using the arccos function to convert cosine similarity to angular distance. We did not experiment with this, but feel free to do so. Here is a code snippet for computing the embeddings from the text:
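
The snippet below is a minimal, self-contained sketch of that pipeline; the toy DataFrame, the column names, and the specific pretrained model name are illustrative assumptions rather than our production code.

```python
# Hedged sketch: multilingual USE embeddings with Spark NLP.
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import UniversalSentenceEncoder
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Toy DataFrame standing in for course text (the "text" column is assumed).
data = spark.createDataFrame(
    [("Learn the basics of machine learning with Python.",),
     ("Aprende los fundamentos del aprendizaje automático.",)],
    ["text"],
)

# Wrap raw text into Spark NLP's document annotation format.
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

# Pre-trained multilingual USE model; the name below is our best-effort pointer
# to the multilingual variant — check the Spark NLP models hub for current names.
use_embeddings = (
    UniversalSentenceEncoder.pretrained("tfhub_use_multi", "xx")
    .setInputCols(["document"])
    .setOutputCol("sentence_embeddings")
)

pipeline = Pipeline(stages=[document_assembler, use_embeddings])
embedded = pipeline.fit(data).transform(data)

# Each row now carries a fixed-length embedding vector for its text.
embedded.selectExpr("text", "sentence_embeddings.embeddings").show(truncate=80)
```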

Fitting USE embeddings with Spark NLP
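
For the similarity step mentioned above, here is a small sketch (plain NumPy, outside the Spark pipeline) of cosine similarity and the arccos-based angular distance the USE authors recommend; the vectors are placeholders.

```python
# Illustrative similarity computation between two embedding vectors.
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def angular_distance(u: np.ndarray, v: np.ndarray) -> float:
    # arccos of the (clipped) cosine similarity, normalized to [0, 1].
    cos = np.clip(cosine_similarity(u, v), -1.0, 1.0)
    return float(np.arccos(cos) / np.pi)

u, v = np.random.rand(512), np.random.rand(512)  # placeholder embeddings
print(cosine_similarity(u, v), angular_distance(u, v))
```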

Key Takeaways

We hope this article serves as a guide for anyone building embeddings for their own tasks. The key takeaways include:

  • The USE family shares an encoder across multiple downstream NLP tasks, including natural language inference, question-answer datasets, translation ranking, etc.
  • Given the encoder is shared across multiple supervised and self-supervised tasks, USE embeddings are thought to learn general knowledge about language without fine-tuning.
  • Many self-supervised Transformer architectures may learn the semantics of the language but may not perform well out of the box without fine-tuning.
  • If you want to use a self-supervised Transformer architecture and fine-tune it to your task to extract embeddings, check out SBERT. There are also pre-trained models with SBERT for sentence similarity. See this example.
  • USE was introduced in 2018, and some newer models can achieve better performance with fine-tuning (see the ranking in STS benchmark). Nevertheless, we believe USE is a good option for many use cases due to its ease of use and efficiency.

Leave a comment to let us know if you experiment with USE or any of the other methods mentioned above. We would love to hear your observations, comments, or questions.

Also, I’d like to thank my Udemy colleagues, Onur Gungor, Muhammet Poyraz, Akshay Kamath, and Tigran Ishkhanov, for reviewing this blog post and offering valuable suggestions and edits. If you want to be part of an innovative team that puts learning outcomes first, check out our open positions at about.udemy.com/careers.

Resources

  1. Multilingual Universal Sentence Encoder for Semantic Retrieval https://arxiv.org/abs/1907.04307
  2. Universal Sentence Encoder visually explained https://amitness.com/2020/06/universal-sentence-encoder/
  3. Universal Sentence Encoder https://arxiv.org/pdf/1803.11175.pdf
