Finding Similar Sentences Using Wikipedia and TensorFlow Hub

Vineet Kumar
Jul 1, 2018 · 5 min read

Let us suppose that you have access to a corpus of text. You want to build a tool to explore this corpus but do not have access to any labels. Can you leverage an existing (and trained) deep learning model to build such a tool?

In this post, we demonstrate how to build a tool that returns sentences from a corpus similar to a given input sentence. We leverage the powerful Universal Sentence Encoder (USE) and stitch it together with a fast approximate nearest-neighbour indexing library (Annoy) to build our system.

We make the entire code available on GitHub so you can try this on a corpus of your choice. The code also contains details on how to extract sentences from English Wikipedia using gensim. Let us now have a look at the key elements, and later at some code.

Universal Sentence Encoder

In March 2018, Google unveiled a powerful model called the Universal Sentence Encoder (USE). A simple way to think about USE is as a black box (or function) which takes a piece of text (say, a sentence) as input and returns a 512-dimensional vector. This model returns similar vectors for similar sentences, and does not require any training. It is important to understand, though, that this model is generic and might not work directly for a specific domain.

Image Courtesy: https://www.tensorflow.org/hub/modules/google/universal-sentence-encoder/1

Why is this useful? Well, once we can convert a piece of text to a vector, we can compute distances between vectors. To find sentences similar to a query, we can simply compute the distance between the query vector and each of the N sentence vectors in our corpus, and return the closest ones.
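Here is a minimal sketch of that brute-force idea (assuming corpus_vecs is an N x 512 matrix of USE embeddings, one row per sentence):

```python
import numpy as np

def most_similar(query_vec, corpus_vecs, top_k=10):
    """Brute-force search: score the query against all N sentence vectors."""
    # Normalise rows so a dot product equals cosine similarity
    corpus_norm = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    scores = corpus_norm @ query_norm    # one similarity score per sentence: O(N)
    return np.argsort(-scores)[:top_k]   # indices of the top_k most similar sentences
```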

Sounds good, right?

Well, the above approach works fine when N is small (say 1,000 or even 10,000 sentences), but it becomes slow when you have an extremely large corpus with millions of sentences. So, what is the solution? Enter Annoy, a wonderful open-source tool.

Annoy Indexing

Annoy (Approximate Nearest Neighbors Oh Yeah) lets you build a fast index for approximate nearest-neighbour retrieval. At a high level, Annoy enables you to quickly build an index from millions of vectors, and provides quick nearest-neighbour lookups at runtime.

Stitching USE and Annoy

Here is how we can stitch USE and Annoy to build an index of sentences:

Offline Process

  • USE is applied to each sentence in the corpus to obtain a 512-dimensional vector.
  • Each vector is added to an Annoy index, keyed by the unique index of its sentence.
  • This process is done once, and the index is stored on disk.

Online Process

  • The index is loaded from disk when the runtime initializes.
  • USE is applied to the query sentence to obtain a 512-dimensional vector q.
  • Annoy is asked for the nearest neighbours of q; these are indices of sentences in the original corpus.
  • Return the sentences corresponding to those indices.

Examples

Let us see how this works on a few sentences, using English Wikipedia as the corpus.

Example 1

  • Input Sentence: The Japanese sample-return spacecraft Hayabusa2 arrives at the asteroid 162173 Ryugu.
  • Similar Sentences:

Example 2

  • Input Sentence: Saudi Arabia lifts its ban on women driving.
  • Similar Sentences:

Code Part 1: TensorFlow Hub and USE

For details on how to pre-process English Wikipedia to obtain sentences, see the GitHub code.

  • TensorFlow Hub Cache: TensorFlow Hub identifies each model by a URL, and the first use downloads the trained model to a local directory on your system. USE takes about 1 GB. We can set a cache directory so that the download occurs only once, as follows:
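A minimal version of that line (the cache path itself is your choice):

```bash
# Point tensorflow_hub at a persistent cache directory for downloaded modules
export TFHUB_CACHE_DIR=/path/to/tf_hub_cache
```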

Alternatively, you can put the above line in your ~/.bashrc. Do not forget to run source ~/.bashrc afterwards.

  • Embed function: The core method we are interested in is embed, which takes a list or tuple of texts as input. It is okay to pass sentences directly if you only plan to call embed once. If, however, you want to reuse the embed method, consider feeding the sentences through a placeholder: you can then pass a new list of sentences on each call, which ensures that your TensorFlow graph does not grow new nodes for every call to embed and makes your code run faster. An example snippet:
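A minimal sketch of that pattern (this uses the TF1-era API the post was written against; the module URL is the v1 USE module mentioned above):

```python
import tensorflow as tf
import tensorflow_hub as hub

# Download (or load from the cache directory) the v1 USE module
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/1")

# Feeding sentences through a placeholder reuses the same graph nodes on every call
sentence_input = tf.placeholder(dtype=tf.string, shape=[None])
embedding_op = embed(sentence_input)

with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    # Each call can pass a fresh list of sentences without growing the graph
    vectors = session.run(embedding_op,
                          feed_dict={sentence_input: ["Hello, world.", "How are you?"]})
    print(vectors.shape)  # (2, 512)
```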

Code Part 2: Annoy Indexing

Each item added to an Annoy index must be given a unique integer id; the position of the sentence in the corpus works well. Here is how we can build an Annoy index once sentence embeddings are obtained from USE:
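A minimal sketch (vectors is assumed to be the array of per-sentence USE embeddings from above; the tree count is a tunable accuracy/size trade-off):

```python
from annoy import AnnoyIndex

VECTOR_DIM = 512  # dimensionality of USE embeddings
NUM_TREES = 10    # assumption: more trees give better recall but a larger index

annoy_index = AnnoyIndex(VECTOR_DIM, "angular")  # angular distance ~ cosine similarity
for sentence_id, vector in enumerate(vectors):   # one (id, vector) pair per corpus sentence
    annoy_index.add_item(sentence_id, vector)

annoy_index.build(NUM_TREES)       # build once, offline
annoy_index.save("sentences.ann")  # persist the index to disk
```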

Code Part 3: Runtime Querying

Querying mirrors index building: embed the query sentence with USE, then call get_nns_by_vector on the resulting vector. Here is how:
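A minimal sketch, reusing the embed session from Part 1 (corpus_sentences is a hypothetical list mapping each id back to its sentence text):

```python
# Load the prebuilt index; dimensionality and metric must match build time
annoy_index = AnnoyIndex(512, "angular")
annoy_index.load("sentences.ann")  # memory-maps the file, so loading is fast

query = "Saudi Arabia lifts its ban on women driving."
query_vector = session.run(embedding_op, feed_dict={sentence_input: [query]})[0]

# Ask Annoy for the ids of the 10 nearest neighbours of the query vector
neighbour_ids = annoy_index.get_nns_by_vector(query_vector, 10)
similar_sentences = [corpus_sentences[i] for i in neighbour_ids]
```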

Implementation Notes for English Wikipedia

  • Working with English Wikipedia requires about 120 GB of memory and disk space.
  • We used a corpus of about 53M unique sentences. Extracting sentences took about 2 hours using 8 cores.
  • Building the Annoy index takes a further 9 hours or so, and the index itself requires 111 GB.
  • Loading a previously built Annoy index is fast! It takes about 200 seconds to bring up the index (for 53M entries) along with the TensorFlow Hub model.
  • It takes a fraction of a second (0.1–0.5 seconds) to find 10 similar sentences among 53M!
