Unsupervised Recommendations: A Practical Approach

David Beauchamp
Asymptotic Labs
Dec 14, 2020


Vast amounts of unstructured text data are generated every second, from comments on a Reddit thread to medical records to blog posts like this one. How can we make use of all that data? Natural Language Processing (NLP) has been advancing rapidly (GPT-3, BERT, FastText, etc.), and these advances can help us understand our data and put it to work.

Let’s look at a practical example of a free-text based recommendation engine.

Background

I’ll start with a bit of background on a few concepts in NLP. If this isn’t new to you, feel free to skip ahead.

  • A document is any collection of text. This could be any sequence of words: a sentence, a paragraph, a book, etc.
  • A corpus is a collection of documents.
  • A vector is a mathematically convenient representation of a document.
  • A model is an abstract term for a transformation from one document representation (vector) to another.
  • Tokenization is the process of chopping a document up into pieces called tokens. A token is a sequence of characters grouped together as a useful unit for processing. How you tokenize will vary depending on what is needed for the particular use case.
  • An embedding (alternatively vectorization or encoding) is a real-valued vector representation of a tokenized document. A word embedding deals with individual words, the idea being to use signals from the words adjacent to a given word. For larger pieces of text, we can use a sentence embedding, which represents an entire sentence and its semantic information as a single vector. This helps our model capture the context, intention, and other nuances of the whole document. (A toy sketch of both ideas follows this list.)
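To make the last two ideas concrete, here is a toy sketch. The vocabulary and vectors below are made up, not from a trained model; in practice the word vectors come from something like FastText:

import numpy as np

# Made-up word embeddings: each token maps to a small real-valued vector
word_vectors = {
    'love':   np.array([0.9, 0.1, 0.3]),
    'hike':   np.array([0.2, 0.8, 0.5]),
    'coffee': np.array([0.4, 0.6, 0.7]),
}

tokens = ['love', 'hike', 'coffee']  # a tokenized "document"

# A (very) simple sentence embedding: the average of the word vectors.
# uSIF, used later in this post, is a smarter weighted average.
sentence_vector = np.mean([word_vectors[t] for t in tokens], axis=0)
print(sentence_vector)  # one fixed-length vector for the whole document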

Case Study: Dating Recommendations

To illustrate NLP in action, how could we recommend matches for users of our hypothetical “dating app” based on text they’ve written about themselves?

At a high level, the plan is: create a vector representation of each document in our corpus, use these vectors to train a model that learns the transformation, then apply the model to any new document we come across to transform it into a form useful to our needs.

Building Our Inputs

First, we have to do some preprocessing of our data; we will be using the amazing gensim library for this. Let’s assume our users all have text-based profiles with a description and interests they’ve written about themselves.

import pandas as pd
from gensim.parsing.preprocessing import (
    preprocess_string,
    remove_stopwords,
    stem_text,
    strip_short,
    strip_punctuation,
    strip_non_alphanum,
    strip_numeric,
)

users = pd.read_csv('data_app_user_data.csv', names=[
    'user_id',
    'description',
    'interests',
    'match_id'
])

# Tokenize our user features
def process(inp):
    return preprocess_string(inp, filters=[
        remove_stopwords,
        stem_text,
        strip_short,
        strip_punctuation,
        strip_non_alphanum,
        strip_numeric
    ])

# Combine the text fields for each user into one string
user_features = users['description'].fillna('').apply(str).str.cat(
    users['interests'].fillna('').apply(str), sep=' '
)

# Transform and clean the inputs
user_inputs = list(user_features.apply(process).values)

Here we’ve concatenated each user’s text data into one big sentence and tokenized it. These tokenized documents will be the basis of our inputs to the model.
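For example, on a made-up profile string, our process function produces a list of cleaned, stemmed tokens along these lines:

sample = "I love hiking, good coffee, and live music on the weekends!"
print(process(sample))
# Something like: ['love', 'hike', 'good', 'coffee', 'live', 'music', 'weekend']
# (the exact tokens depend on the filter order, stemmer, and stopword list)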

Training

From our tokenized data, we want to produce word embeddings to capture some context in the words and phrases of our users. We have chosen Facebook’s FastText model. Its output will serve as the input to our sentence model.

from gensim import models

# Configure our word vector feature size
model = models.FastText(size=32, window=5, min_count=1)
model.build_vocab(sentences=user_inputs)
model.train(
    sentences=user_inputs,
    total_examples=len(user_inputs),
    epochs=15
)
model.save('./trained_model')

We then build our sentence model. We have chosen a simple sentence embedding technique called SIF (Smooth Inverse Frequency), which computes sentence embeddings as a weighted average of word vectors; specifically, we use its unsupervised variant, uSIF. The fse library has great, performant implementations of sentence embedding algorithms. Using our trained word embeddings, we can train our sentence model completely unsupervised.

import fse

sentence_model = fse.models.uSIF(model)
lookup = fse.IndexedList(user_inputs)
sentence_model.train(lookup)
sentence_model.save('./sentence_model')

Now we have our trained model for use in the recommendation engine!

Recommendation engine

Finally, we can put NLP to work in our “real world” scenario: when a new user signs up and creates their profile, we want to generate recommended matches for them.

Our trained model transforms the new user’s data into the feature vector space that the engine works with:

import numpy as np
import fse

sentence_model = fse.models.uSIF.load('./sentence_model')

def feature_extractor(user):
    tokens = process(' '.join([user.description, user.interests]))
    features = sentence_model.infer([(tokens, 0)])[0]
    # Normalize our feature vector
    return features / np.sqrt(features.dot(features))
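As a quick sketch of how the extractor might be called (the user object and its values here are hypothetical), anything exposing description and interests text works:

from collections import namedtuple

# A hypothetical new user signing up
User = namedtuple('User', ['user_id', 'description', 'interests'])
new_user = User(
    user_id=9999,
    description='Software engineer who loves the outdoors.',
    interests='hiking, climbing, coffee'
)

new_user_vector = feature_extractor(new_user)  # unit-length vector of size 32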

The intuition we will use next is that users who have had successful matches give our feature vector space meaning. If user A successfully matched with user B, the traits encoded in user A’s feature vector correspond in some way to the traits encoded in user B’s feature vector. So for a new user X, we can find previously matched users whose feature vectors are similar to X’s, and then find users whose vectors are similar to those of their matches. Because our model maps every input to a vector, we can use cosine similarity as our measure of similarity.

def get_similarity(feature_vector_a, feature_vector_b):
    # We've normalized all our feature vectors,
    # so cosine similarity reduces to the dot product
    return np.dot(feature_vector_a, feature_vector_b)
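The recommendation function below leans on a few precomputed lookups. Here is a minimal sketch of how they might be built from our users DataFrame, assuming the match_id column records who a user successfully matched with (the exact bookkeeping will depend on your data model):

# Feature vector for every existing user, keyed by user_id.
# itertuples() yields rows with .description and .interests attributes,
# which is all feature_extractor needs.
user_vectors_by_id = {
    row.user_id: feature_extractor(row)
    for row in users.fillna('').itertuples()
}

# Users who already have a successful match, and who they matched with
matched = users[users['match_id'].notna()]
matched_user_ids = list(matched['user_id'])
matched_users_id_by_id = dict(zip(matched['user_id'], matched['match_id']))

# Users still waiting for a match
unmatched_user_ids = list(users.loc[users['match_id'].isna(), 'user_id'])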

Then generating recommendations becomes:

def recommend_users(user_id):
    # User feature vector we are targeting
    target_user_vector = user_vectors_by_id[user_id]

    # Top 10 matched users most similar to them
    similar_users = sorted(
        [(i, get_similarity(target_user_vector, user_vectors_by_id[i]))
         for i in matched_user_ids],
        reverse=True,
        key=lambda x: x[1]
    )[:10]

    recommended_users = dict()
    for matched_id, similarity in similar_users:
        # Get the feature vector of the user they matched with
        target_matched_user_vector = user_vectors_by_id[
            matched_users_id_by_id[matched_id]
        ]
        # Top 10 similar users that have not yet matched
        similar_unmatched = sorted(
            [(i, get_similarity(target_matched_user_vector, user_vectors_by_id[i]))
             for i in unmatched_user_ids],
            reverse=True,
            key=lambda x: x[1]
        )[:10]
        # Add candidates
        for u_id, u_similarity in similar_unmatched:
            recommended_users[u_id] = u_similarity

    # Return top 10 candidates, ordered by similarity scores
    return [
        u[0] for u in sorted(
            recommended_users.items(),
            reverse=True,
            key=lambda x: x[1]
        )[:10]
    ]
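Putting it together for the hypothetical new user from earlier (again a sketch; in a real system the vector store would be updated as part of signup, and you would likely exclude the user from their own candidate pool):

# Register the new user's vector, then fetch their recommendations
user_vectors_by_id[new_user.user_id] = feature_extractor(new_user)
unmatched_user_ids.append(new_user.user_id)

recommendations = recommend_users(new_user.user_id)
print(recommendations)  # up to 10 recommended user ids, most similar first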

We now have a set of potential matches for our new user, and we’ve put the user-generated, unstructured text data to work for us. In production, we would need to closely monitor the performance of our matches and fine-tune the engine.

Conclusion

We took our user data, without any annotation or tagging, and created a recommendation engine based on unsupervised learning and sentence embeddings. NLP has some powerful capabilities, and tools like gensim and fse make it more accessible to use in your own projects.
