Finding similar sentences using Wikipedia and Tensorflow Hub.

Vineet Kumar
5 min read · Jul 1, 2018


Let us suppose that you have access to a corpus of text. You want to build a tool to explore this corpus but do not have access to any labels. Can you leverage an existing (and trained) deep learning model to build such a tool?

In this post, we demonstrate how to build a tool that can return similar sentences from a corpus for a given input sentence. We leverage the powerful Universal Sentence Encoder (USE) and stitch it with a fast indexing library (Annoy Indexing) to build our system.

We make the entire code available on GitHub so you can try this on a corpus of your choice. The code also contains details on how to extract sentences from English Wikipedia using gensim. Let us now have a look at the key elements, and later at some code.

Universal Sentence Encoder

Google unveiled a powerful model called the Universal Sentence Encoder (USE) in March 2018. A simple way to think about USE is as a black box (or function) which takes a piece of text (say, a sentence) as input and returns a 512-dimensional vector. The model returns similar vectors for similar sentences and requires no training on your part. It is important to understand, though, that the model is generic and might not work directly for a specific domain.

Image Courtesy: https://www.tensorflow.org/hub/modules/google/universal-sentence-encoder/1

Why is this useful? Well, once we can convert a piece of text to a vector, we can compute distances between vectors. Distance can help us determine which vectors (in this case sentences) are similar!
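For instance, a quick cosine-distance check between two sentence vectors might look like the sketch below. Here vec_a and vec_b stand in for the 512-dimensional USE vectors of two sentences; random vectors are used only to keep the snippet runnable.

```python
import numpy as np

def cosine_distance(u, v):
    # Cosine distance: close to 0 for vectors pointing in similar directions,
    # larger for dissimilar vectors.
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Stand-ins for the 512-dimensional USE vectors of two sentences.
vec_a = np.random.rand(512)
vec_b = np.random.rand(512)
print(cosine_distance(vec_a, vec_b))
```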

Sounds good, right?

Well, the above approach would work fine when the corpus is small (say 1,000 or even 10,000 sentences): we could simply compute the distance between the query vector and every sentence vector in the corpus. It would be slow, however, for an extremely large corpus with millions of sentences, since every query would require millions of distance computations. So, what is the solution? Enter Annoy, a wonderful open-source tool.

Annoy Indexing

Annoy (Approximate Nearest Neighbors Oh Yeah) allows you to build a fast index for approximate nearest-neighbor retrieval. At a high level, Annoy lets you quickly build an index from millions of vectors, and provides quick nearest-neighbor lookups at runtime.

Stitching USE and Annoy

Here is how we can stitch USE and Annoy to build an index of sentences:

Offline Process

  • USE is applied to each sentence in the corpus to obtain a 512-dimensional vector.
  • We add each vector to an Annoy index, using the sentence's position in the corpus as its unique id.
  • This process is done once, and the index is stored on disk.

Online Process

  • The index is loaded from disk at initialization.
  • USE is applied to the query sentence to obtain a 512-dimensional vector q.
  • Annoy is asked to return the nearest neighbours of q. These are the ids of sentences in the original corpus.
  • Return the sentences corresponding to those ids.

Examples

Let us see how this works on English Wikipedia for some example sentences.

Example 1

  • Input Sentence: The Japanese sample-return spacecraft Hayabusa2 arrives at the asteroid 162173 Ryugu.
  • Similar Sentences:
the extraterrestrial ship from deep space enters the solar system and abducts a boater on earth .
after a comet collides with the ship , dart and his crew discover a new planet beyond the orbit of pluto .
the crew are on an expedition on the mysterious planet krop tor , impossibly in orbit around a black hole .
hurtling on into deep space , jupiter 2 crash lands on an unknown planet .
first flyby of pluto , charon , nix , hydra , kerberos , and styx , first up - close images of pluto system , first images of pluto and charon 's surfaces , first spacecraft to explore the kuiper belt .
the fourth , and only , spaceship to return from mars holds an insane crew and a martian " furball " .
drax plans to fire it at earth from space .
voyager 2 sends back images of neptune and its system
young man floats in escape pod after spacecraft explodes in deep space .
crew returning from first manned moon expedition witnesses atomic war break out on earth .

Example 2

  • Input Sentence: Saudi Arabia lifts its ban on women driving.
  • Similar Sentences:
the campaign aims to ban saudi arabia from the olympics until it allows saudi arabian women to take part in sports .
in september 2017 , the saudi arabian government announced that women would receive the right to drive , effective june 2018 .
saudi arabia in 2015 .
furthermore , the saudis are blocking a proposed causeway project between qatar and the uae and a proposed gas pipeline project between qatar and kuwait , because of saudi objections , the kuwaitis are now turning to the iranians for gas .
saudi officials said that , if successful in qualifying , female competitors would be dressed " to preserve their dignity " .
in 2015 , al - waleed was criticised for offering to buy bentley cars for saudi fighter pilots involved in the saudi arabian - led intervention in yemen .
saudi arabia agreed to allow its women athletes to compete in the 2012 olympics for the first time , amidst speculation that the entire saudi team might have been disqualified on grounds of gender discrimination .
* * 280px defense of saudi arabia 1990–1991
after widespread rumors about saudi arabia going to purchase an entire atoll from maldives , saudi arabian embassy in maldives issued a statement against the rumors .
saudi royal family after welcoming the new king salman of saudi arabia , january 27 , 2015

Code Part 1: Tensorflow Hub and USE

For details on how to pre-process English Wikipedia to obtain sentences, look at the GitHub code.

  • Tensorflow Hub Cache: Tensorflow Hub identifies a model by a URL. The trained model is downloaded to a local directory on your system; USE takes about 1 GB. We can set a cache directory so that the download only happens once, as follows:
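The tensorflow_hub library respects the TFHUB_CACHE_DIR environment variable; the cache path below is just a placeholder for a directory of your choice.

```bash
export TFHUB_CACHE_DIR=/path/to/tfhub_cache
```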

Alternatively, you can put the above line in your ~/.bashrc. Do not forget to run source ~/.bashrc afterwards.

  • Embed function: The core method we are interested in is embed. It takes a list or tuple of texts as input. It is okay to pass sentences directly if you only plan to call embed once. If, however, you want to reuse the embed method, consider using a placeholder. You can then pass a new list of sentences for each call. This ensures that your Tensorflow graph does not add new nodes on every call to embed, and it will also make your code run faster! An example snippet:
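A minimal sketch of the placeholder pattern with the TF 1.x API that USE shipped with; the module URL and the sample sentences here are purely illustrative.

```python
import tensorflow as tf
import tensorflow_hub as hub

# Load the Universal Sentence Encoder module (downloaded to TFHUB_CACHE_DIR).
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/1")

# Build the graph once: the placeholder lets us feed a new list of sentences
# on every call without adding new nodes to the graph.
sentences_placeholder = tf.placeholder(tf.string, shape=[None])
embeddings_op = embed(sentences_placeholder)

with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    vectors = session.run(
        embeddings_op,
        feed_dict={sentences_placeholder: ["How old are you?", "What is your age?"]})
    print(vectors.shape)  # (2, 512)
```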

Code Part 2: Annoy Indexing

Each item added to an Annoy index must be given a unique integer id. This can simply be the position of the sentence in the corpus. Here is how we can build an Annoy index once sentence embeddings are obtained from USE:
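A minimal sketch, assuming sentence_vectors holds the USE embeddings computed in Part 1 and that 10 trees is an acceptable speed/accuracy trade-off (more trees give better recall at the cost of a larger index).

```python
from annoy import AnnoyIndex

VECTOR_SIZE = 512   # dimension of USE embeddings
NUM_TREES = 10      # assumed value; tune for your corpus

index = AnnoyIndex(VECTOR_SIZE, metric='angular')

# The position of each sentence in the corpus serves as its unique id.
for sentence_id, vector in enumerate(sentence_vectors):
    index.add_item(sentence_id, vector)

index.build(NUM_TREES)
index.save('wikipedia_sentences.annoy')  # built once, stored on disk
```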

Code Part 3: Runtime Querying

Querying is similar to building the index. We call get_nns_by_vector on the vector obtained from USE for the query sentence. Here is how:
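A sketch of the lookup step. Here embed_query is assumed to wrap the session.run call from Part 1 and return a single 512-dimensional vector, and corpus_sentences is assumed to be the list of sentences the index was built from.

```python
from annoy import AnnoyIndex

VECTOR_SIZE = 512
index = AnnoyIndex(VECTOR_SIZE, metric='angular')
index.load('wikipedia_sentences.annoy')  # memory-maps the index, so loading is fast

# Obtain the USE vector for the query sentence (embed_query wraps session.run).
query_vector = embed_query("Saudi Arabia lifts its ban on women driving.")

# Ids of the 10 nearest sentences in the original corpus.
neighbor_ids = index.get_nns_by_vector(query_vector, 10)
similar_sentences = [corpus_sentences[i] for i in neighbor_ids]
```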

Implementation Notes for English Wikipedia

  • Memory and disk space requirements for working with English Wikipedia are about 120 GB.
  • We used a corpus of about 53M unique sentences. Extracting sentences took about 2 hours using 8 cores.
  • Building the Annoy index takes a further 9 hours or so, and the index itself requires 111 GB.
  • Loading a previously built Annoy index is fast! It takes about 200 seconds to bring up the Annoy index (for 53M entries) along with the Tensorflow Hub model.
  • It takes a fraction of a second (0.1–0.5 seconds) to find 10 similar sentences among the 53M sentences!
