ANNOYingly Simple Sentence Clustering

Sumegha Yadav
Published in Bobble Engineering
Aug 10, 2020
Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings to search for points in space that are close to a given query point.

Making a machine understand the modalities of a language is at the core of almost every NLP task. Understanding patterns in the data and being able to segregate data points on the basis of similarities and differences can help in several ways. Some use cases are:

  • Scaling down massive datasets that contain many unique data points which are nevertheless contextually similar in meaning. For example, in question-answer verification systems, a major problem is repeated answer statements; such data points are effectively duplicates and should be removed based on a threshold of semantic similarity.
  • In classification tasks, and especially while dealing with macaronic languages, a cluster of similar data points can be mapped to a single label by relying on the computed similarity metric.
  • Many natural language applications, sentiment and non-sentiment alike, such as semantic search, summarization, question answering, document classification, and plagiarism detection, depend on sentence similarity.

Macaronic Languages and the peril involved:

While working with macaronic languages, where there is no grammar structure, no defined vocabulary and no basic rules governing sentence formation, one can only rely on the numbers to tell what the data says. That is something AI/ML engineers will always have to keep in mind: feed the data in the language the machine understands, because at the end of the day machines crunch numbers to tell the story hidden in the data.

Let’s begin with an example in Hinglish (Hindi written in Latin characters, with English words mixed in) to understand the difference between semantic and lexical similarity:

Sentence 1 : “Mood kharab hai yaar aaj” (roughly, “My mood is off today, man”)

Sentence 2: “Mood kharab mat kar” (roughly, “Don’t spoil my mood”)

While sentence 1 carries tones of sadness and disappointment, sentence 2 has more of an angry connotation. The two sentences are close in terms of lexical similarity but far apart in terms of semantic similarity.

To make the machine understand the difference between the two and map them to their respective labels, a solution can be devised using what Annoy (Approximate Nearest Neighbors Oh Yeah) has to offer (https://github.com/spotify/annoy). It is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data.

Let’s understand more with a task example:

Say we want to build a model that predicts emojis for the text input given by a user. The data was prepared in the following format:

  • badhai ho 💐
  • many many returns of the day happy birthday 🎂
  • be the way u r always strong personality 💪

One emoji per sentence. The issue arose when sentences with similar context, and in some cases the same sentence with minor changes in wording, were mapped to different emojis. This would confuse the model, so we reframed the task from multi-class classification to multi-label classification. Even then, mapping sentences with similar context to a single emoji cluster was the right way forward.

Strategy Adopted:

The first and foremost step is to get word vectors for the words in our corpus. Remember, it’s a macaronic language, so no pre-trained model will work here.

  • Word Vectors using fastText:

A popular idea in modern machine learning is to represent words by vectors. These vectors capture hidden information about a language, such as word analogies or semantics, and are also used to improve the performance of text classifiers. A fastText model was trained in unsupervised mode on the user chat data we had gathered, with emojis and other special characters removed and one sentence per line. A sketch of this step follows below.
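As a rough sketch of this step (the corpus file name and hyperparameters here are illustrative, not necessarily the exact values used), unsupervised fastText training looks like this:

import fasttext

# chats.txt: the cleaned user chat corpus, one sentence per line,
# with emojis and special characters removed (file name assumed)
model = fasttext.train_unsupervised('chats.txt', model='skipgram', dim=40)
model.save_model('chat_vectors.bin')

The dimension of 40 here simply matches the vector length f = 40 used in the index-building code further below.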

  • Sentence Vectors:

The word vectors were combined into a sentence vector by summing the vectors fetched for a sentence (one vector per word) and dividing by the number of words present, i.e. averaging them; a sketch follows.
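A minimal sketch of this averaging, assuming model is the fastText model trained above (the helper name is ours):

import numpy as np

def sentence_vector(sentence, model):
    # One vector per word, averaged into a single sentence vector
    word_vectors = [model.get_word_vector(word) for word in sentence.split()]
    return np.mean(word_vectors, axis=0)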

  • Clustering similar sentences using ANNOY:

Using Annoy, we formed a forest of trees that stores each sentence, index-wise, by its sentence vector. For a given query sentence, it returns a list of the k most similar sentences and their index positions in the dataset. Based on the angular distance between the query sentence and its neighbours, a threshold was decided, and every sentence in our dataset that satisfied the threshold criterion was collected into a cluster. Each cluster of sentences was then mapped to the 5 most frequent emojis in that cluster (a sketch of this mapping appears after the function below). At times fewer than 5 emojis were found, so a minimum of one and a maximum of 5 emojis ended up mapped to each cluster, depending on the threshold.

  • The function below returns the set of indices of all sentences found similar to the query within an angular distance ≤ threshold. input_df is the dataframe that holds the sentences.

def get_neighbours(index, threshold):
    # threshold: maximum angular distance, to be decided per task;
    # t is the Annoy index built in the snippet below
    k = 50  # initial number of neighbours considered

    sentence_vector = t.get_item_vector(index)
    ids, distances = t.get_nns_by_vector(sentence_vector, k, include_distances=True)

    farthest = distances[-1]  # distance of the last (farthest) neighbour returned

    # If even the farthest neighbour lies within the threshold, there may be
    # more qualifying sentences, so double k and search again.
    while farthest <= threshold and k < t.get_n_items():
        k = 2 * k
        ids, distances = t.get_nns_by_vector(sentence_vector, k, include_distances=True)
        farthest = distances[-1]

    # Keep only the neighbours whose angular distance is <= threshold
    indices = [i for i, d in zip(ids, distances) if d <= threshold]
    return indices
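With a cluster’s indices in hand, mapping it to its most frequent emojis is a simple frequency count. The sketch below assumes input_df has an 'emoji' column holding the emoji originally paired with each sentence (the column name and helper are ours):

from collections import Counter

def map_cluster_to_emojis(indices, input_df, top_n=5):
    # Count emoji frequency across the cluster and keep at most the
    # top 5; a cluster with fewer distinct emojis yields fewer labels.
    emojis = input_df.iloc[indices]['emoji']
    return [emoji for emoji, _ in Counter(emojis).most_common(top_n)]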

The tree index is built with the following Python code (this snippet, indexing random vectors, is adapted from the Annoy README):

from annoy import AnnoyIndex
import random

f = 40  # length of the item vectors that will be indexed
t = AnnoyIndex(f, 'angular')
for i in range(1000):
    v = [random.gauss(0, 1) for z in range(f)]
    t.add_item(i, v)

t.build(10)  # 10 trees
t.save('test.ann')

# ...

u = AnnoyIndex(f, 'angular')
u.load('test.ann')  # super fast, will just mmap the file
print(u.get_nns_by_item(0, 1000))  # will find the 1000 nearest neighbors
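In our case, the items added were sentence vectors rather than random ones. A sketch under the same assumptions as before (model from fastText, input_df with a 'sentence' column, and the sentence_vector helper sketched earlier):

from annoy import AnnoyIndex

f = model.get_dimension()  # vector length, 40 in our setup
t = AnnoyIndex(f, 'angular')
for i, sentence in enumerate(input_df['sentence']):  # column name assumed
    t.add_item(i, sentence_vector(sentence, model))

t.build(10)  # 10 trees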
  • Example:

Query sentence: “happy birthday”. The closer the angular distance is to 0, the better the similarity.

Similar sentences and their respective angular distances:

birthday happy 0.0

happy birthday happy birthday 0.0

birthday happy birthday happy 0.0

happy birthday happy birthday yaar 0.14156264066696167

happy birthday happy birthday hap 0.15268754959106445

happy happy wala birthday birthday 0.16257968544960022

happy birthday maa happy birthday 0.17669659852981567

This entire cluster was mapped to the emoji set [🎂, 😘, 😍, 🙏, 😂].

Similarly:

i’m very very happy today [‘😊’, ‘😍’, ‘😁’, ‘😘’]

so i’m very happy [‘😊’, ‘😍’, ‘😁’, ‘😘’]

i’m so very happy [‘😊’, ‘😍’, ‘😁’, ‘😘’]

yes i m very happy [‘😊’, ‘😍’, ‘😁’, ‘😘’]

yes am very happy today [‘😊’, ‘😍’, ‘😁’, ‘😘’]

i also very happy [‘😊’, ‘😍’, ‘😁’, ‘😘’]

im so very happy [‘😊’, ‘😍’, ‘😁’, ‘😘’]

i am very very hppy [‘😊’, ‘😍’, ‘😁’, ‘😘’]

i am very happy friends [‘😊’, ‘😍’, ‘😁’, ‘😘’]

oh i’m very happy [‘😊’, ‘😍’, ‘😁’, ‘😘’]

i am very very happ [‘😊’, ‘😍’, ‘😁’, ‘😘’]

im very happy kal [‘😊’, ‘😍’, ‘😁’, ‘😘’]

i am very haappy [‘😊’, ‘😍’, ‘😁’, ‘😘’]

sister l am very happy [‘😊’, ‘😍’, ‘😁’, ‘😘’]

today i’m so very happpy [‘😊’, ‘😍’, ‘😁’, ‘😘’]

This entire exercise helped us map sentences with similar context to the same emoji vector. It is important for the model to understand that such a context maps to these 5 emojis, treated not as five different classes but as five different labels of one example. The problem thus becomes one of multi-label classification. Multi-label classification originated from the study of text categorization, where each document may belong to several predefined topics simultaneously. In our case, each text can belong to one or several emojis.
