Personalized Fishbowl Recommendations with Learned Embeddings: Part 2

Ahmad Khan
Glassdoor Engineering Blog
16 min read · Apr 5, 2022

Introduction

In the previous blog post, we saw how we can use text-based embeddings to recommend posts to users on Fishbowl, a professional networking community recently acquired by Glassdoor, where working professionals can have workplace-related conversations with peers in their industry. On Fishbowl, users can anonymously write what’s on their mind in posts and comment on posts from other anonymous users in what we call “bowls” or “feeds”: collections of posts related to a certain industry or topic.

Previously, we discussed how, in the absence of clear negative signals from clickstream data, we cannot easily use a supervised learning setup to rank items for users. Since our inputs were text based, we can instead use unsupervised methods like Doc2Vec [1] to generate post text embeddings. We treat the text of each post as an individual document and train a Doc2Vec model to generate a post text embedding. For users, we take the average post text embedding of the posts the user liked and treat that as the user embedding. We can also add the user embeddings of other users who liked a post into the post embedding calculation, so we incorporate some “collaborative” notion of what similar users liked rather than relying on pure content similarity. Finally, we compute the cosine similarity between user and post embeddings and use the similarity score to rank posts to recommend to each user.

Ranking via Embeddings Overview
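As a quick refresher, here is a minimal sketch of that ranking step, where post_embeddings is a hypothetical mapping from post id to its Doc2Vec embedding (not our production code):

```python
import numpy as np

def rank_posts_for_user(liked_post_ids, candidate_post_ids, post_embeddings):
    """Rank candidate posts by cosine similarity to a user embedding that is
    simply the mean embedding of the posts the user previously liked."""
    user_emb = np.mean([post_embeddings[p] for p in liked_post_ids], axis=0)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    scores = {p: cosine(user_emb, post_embeddings[p]) for p in candidate_post_ids}
    return sorted(scores, key=scores.get, reverse=True)
```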

However, such an approach has some shortcomings. First, the text of a post is just one of many features we can use for personalization. We can also leverage other information users provide when they sign up, such as their employer, job title, and work city. In addition, the bowl or feed name and its description are important features for posts, since posts in the same feed tend to be about similar topics.

Second, the previous approach trained a Doc2Vec content model from scratch on our Fishbowl corpus of posts. Such a dataset may not be large enough to capture broad semantic knowledge of English, which limits the quality of the trained text embeddings. Finally, instead of purely text-based embeddings, we can also learn embeddings tailored to our custom recommendation task with neural networks.

Transfer Learning for Text Content Embeddings

In NLP, much recent research in Transfer Learning uses high-quality language models pre-trained on much larger corpora like Wikipedia and later fine-tuned for custom tasks on smaller custom datasets. In doing so, we benefit from the pre-existing knowledge of a larger language model and then further refine that model on our custom learning task. For this project, we utilized Transfer Learning by fine-tuning pre-trained GloVe and BERT models on our Fishbowl corpus.

GloVe Based Embeddings

GloVe [2] presumes that ratios of word-word co-occurrence probabilities can encode some form of meaning. It learns word embeddings by factorizing a global word-word co-occurrence matrix, where each entry indicates the number of times a word co-occurs with another word within its context. Words that frequently co-occur, like “Google” and “interviews”, are more likely to be close to each other in the embedding space than words that co-occur less frequently, like “Google” and “running”. The standard 100-dimensional GloVe model available online has already been pre-trained on a large corpus of Wikipedia and Twitter text, which we can then fine-tune further on our custom Fishbowl corpus of post texts and comments.

How can we utilize a pre-trained GloVe model for our recommendation purposes? Recall that we now have 6 text-based features, not just the post text, for generating an embedding representation of posts and users:

  • Post Text: the text of the Fishbowl post. In the case of a user this is an empty string.
  • Company: the company the user claims to work in. In the case of posts this is the company of the user who made the post.
  • Job Title: the user’s job title shared by the user. In the case of posts this is the job title of the user who made the post.
  • City Location: the city the user claims to work in. In the case of posts this is the work city of the user who made the post.
  • Feed Name: the name of the feed or “bowl” the Fishbowl post was made in. In the case of users this is an empty string.
  • Feed Description: the description of the feed or “bowl” the Fishbowl post was made in. In the case of users this is an empty string.

We can pass each of the 6 text features above individually into the GloVe model to get one 100-dimensional embedding per feature. But GloVe gives us embeddings for individual words, not entire phrases or sentences, and our inputs can be phrases or, in the case of post text, even multiple sentences. We need a way to generate an embedding for the entire input, and we do so by simply taking the mean GloVe embedding of all the input words.

GloVe Mean Word Embedding Aggregation (N = length of input text)
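A minimal sketch of this mean-pooling step, assuming glove is a hypothetical dict-like lookup from word to its 100-dimensional vector (e.g. loaded from the published glove.6B.100d.txt file):

```python
import numpy as np

EMB_DIM = 100

def mean_glove_embedding(text, glove):
    """Average the GloVe vectors of all in-vocabulary words in the input text.
    Returns a zero vector if no word is found in the GloVe vocabulary."""
    words = text.lower().split()
    vectors = [glove[w] for w in words if w in glove]
    if not vectors:
        return np.zeros(EMB_DIM)
    return np.mean(vectors, axis=0)
```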

Next, we concatenate the six 100-dimensional mean feature embeddings into one final GloVe content embedding of 600 dimensions. We also tried summing instead of concatenating, but in our case concatenation performed better. If a feature doesn’t apply, like the post text in the case of a user, we concatenate a 100-dimensional zero vector for that feature. This gives us a GloVe content embedding of the same dimension, in the same vector space, for both users and posts.

GloVe Post Content Embedding Logic
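A sketch of the concatenation logic, reusing the mean_glove_embedding helper and EMB_DIM from the previous snippet; the feature names mirror the list above, and missing features contribute a zero vector:

```python
import numpy as np

FEATURES = ["post_text", "company", "job_title", "city", "feed_name", "feed_description"]

def glove_content_embedding(entity, glove):
    """Build a 600-dimensional content embedding for a user or a post by
    concatenating the mean GloVe embedding of each of the 6 text features.
    `entity` is a dict of feature name -> text; inapplicable features are
    absent or empty and fall back to a zero vector."""
    parts = []
    for feature in FEATURES:
        text = entity.get(feature, "") or ""
        parts.append(mean_glove_embedding(text, glove) if text else np.zeros(EMB_DIM))
    return np.concatenate(parts)  # shape: (600,)
```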

One issue with concatenation is that, as the number of features grows, the output content embedding can become very large. To keep the output embedding small, we use Principal Component Analysis (PCA) to reduce the GloVe content embedding to a smaller but still useful 128-dimensional representation. Reducing a long vector to its top principal components (ideally) retains most of the variance in the data. This is not just a computation and memory optimization: as we will see later, it also improves model performance.

So far we have only considered content data for a post. Popularity can be a major indicator of whether someone will like a post: posts that get many reactions encourage others to react as well, and reaction counts can serve as a proxy for post quality. At Fishbowl we measure 5 types of reactions on a post: Like, Helpful, Smart, Funny and Uplifting. In addition, we consider the total reactions, number of comments, and number of shares as measures of post popularity. We scale these 8 reaction features and concatenate them onto the 600-dimensional GloVe content embedding prior to the PCA reduction.

GloVe Content Embedding Logic with Reaction Counts and PCA Reduction
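A sketch of adding the scaled reaction counts and the PCA reduction, assuming a matrix of 600-dimensional content embeddings and an 8-column reaction count matrix; the scaler and PCA here come from scikit-learn and would be fit offline on the full corpus:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def reduce_content_embeddings(content_embs, reaction_counts, out_dim=128):
    """content_embs: (num_posts, 600) GloVe content embeddings.
    reaction_counts: (num_posts, 8) raw counts (5 reaction types, total
    reactions, comments, shares). Scale the counts, concatenate them onto the
    content embeddings, then reduce everything with PCA."""
    scaled_counts = StandardScaler().fit_transform(reaction_counts)
    combined = np.hstack([content_embs, scaled_counts])   # (num_posts, 608)
    pca = PCA(n_components=out_dim)
    reduced = pca.fit_transform(combined)                 # (num_posts, 128)
    return reduced, pca
```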

Similar to what we discussed in the prior blog post, we can next also incorporate information from other similar users or posts. For the post embeddings, we can add in the average of the GloVe user content embeddings of all users who liked the post into the post embedding calculation.

Equation 1: Post Embedding Aggregation Logic (U = user ids who liked the post and M = Number of users who liked the post, and beta = scalar to weight each component in aggregation)

For the user embeddings, we similarly add the GloVe post content embeddings of the posts the user liked into the final user embedding.

Equation 2: User Embedding Aggregation Logic (P = post ids the user liked and N = Number of posts the user liked, and alpha = scalar to weight each component in aggregation)
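The equation images themselves aren’t reproduced here, but based on the descriptions above, a hedged reconstruction of Equations 1 and 2 (the exact weighting in the original figures may differ) looks like this, where c(·) denotes the GloVe content embedding described earlier:

```latex
% Equation 1 (sketch): post embedding, U = users who liked post p, |U| = M
e_{post}(p) = c_{post}(p) + \frac{\beta}{M} \sum_{u \in U} c_{user}(u)

% Equation 2 (sketch): user embedding, P = posts liked by user u, |P| = N
e_{user}(u) = c_{user}(u) + \frac{\alpha}{N} \sum_{p \in P} c_{post}(p)
```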

The output embeddings from Equations 1 and 2 are the final GloVe-based embeddings we compute for users and posts for ranking purposes.

BERT Based Embeddings

We can also follow a similar approach using BERT [3]. But unlike GloVe, BERT gives us multiple contextual embeddings for each word rather than one static word embedding. The base BERT model has 12 attention layers, and each layer’s output can be considered one contextual representation of a word.

This makes generating a single embedding for an input word with BERT tricky. One approach is to sum or concatenate the outputs of all, or the last few, attention layers, though collapsing the contextual representations this way somewhat defeats the purpose of BERT. A better approach could be the Sentence-BERT [4] modeling paradigm, which learns an embedding of the entire sentence instead of individual words. For simplicity, in our case we take the average output of the last 4 attention layers as our word embedding, although Sentence-BERT is something we want to revisit later. The rest of the embedding aggregation logic is the same as in the GloVe approach, where the final content embedding is built from the averaged word embeddings of each input feature, followed by a PCA reduction.
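A sketch of how such an embedding could be extracted with the Hugging Face transformers library; the model name and layer choice here are illustrative, and our production pipeline differs in details:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def bert_mean_embedding(text):
    """Encode the text, average the hidden states of the last 4 layers per
    token, then average over tokens to get one embedding for the input."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states is a tuple: embedding layer output + 12 transformer layers
    last_four = torch.stack(outputs.hidden_states[-4:])   # (4, 1, seq_len, 768)
    token_embs = last_four.mean(dim=0).squeeze(0)         # (seq_len, 768)
    return token_embs.mean(dim=0)                         # (768,)
```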

To generate word embeddings that are custom to our Fishbowl corpus, we first have to fine-tune the BERT model on Fishbowl data. One advantage of BERT, and Transformers in general, is that they are very good at learning representations from longer sequences rather than just nearby context words. To take advantage of this additional representational capacity, during fine-tuning we don’t just pass in the post text as input, as we would when fine-tuning GloVe or Doc2Vec, but instead concatenate the six text inputs as CONCAT(user company, user title, user location, post text, feed_name, feed description) into one long input “document”.

BERT Fine Tuning Setup for Custom Fishbowl Feature Inputs

We use this document as the input to BERT during fine-tuning so that it can attend to all of the other input features as well when learning word representations through the same masked word prediction task.
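A sketch of this fine-tuning setup with the Hugging Face Trainer and masked language modeling; the concatenated “document” mirrors the CONCAT(...) above, while dataset handling is simplified and the `rows` input is a hypothetical list of dicts with the six text fields per post:

```python
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import Dataset

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

def build_document(row):
    # One long "document" per post: user fields + post text + feed fields.
    return " ".join([row["user_company"], row["user_title"], row["user_location"],
                     row["post_text"], row["feed_name"], row["feed_description"]])

def tokenize(batch):
    return tokenizer(batch["document"], truncation=True, max_length=512)

# `rows` is a hypothetical list of per-post dicts containing the six text fields.
dataset = Dataset.from_list([{"document": build_document(r)} for r in rows])
tokenized = dataset.map(tokenize, batched=True, remove_columns=["document"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-fishbowl-mlm", num_train_epochs=3),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```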

Embeddings via Contrastive Learning

Siamese Network Embeddings

The above approaches are premised on the assumption that text feature embeddings are enough to generate good user and post representations. We factor in information from similar users in a very manual fashion in Equations 1 and 2, by adding the content embeddings of other users who liked a post into the final post embedding. This, however, may miss other aspects of user and post similarity. We now instead try to learn embeddings automatically, through learned neural network weights, based on what similar or dissimilar users liked, rather than manually aggregating via Equations 1 and 2.

One particular approach we want to emphasize is learning embedding representations via a contrastive triplet loss [5]. A triplet loss takes a query (anchor) input and computes the Euclidean distance between the anchor and an associated positive instance, and between the anchor and an associated negative instance. It then takes the difference between the negative pair’s distance and the positive pair’s distance. Instead of Euclidean distance we can also use cosine similarity.

Triplet Loss Function (Source: https://lilianweng.github.io/posts/2021-05-31-contrastive/)
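The loss image isn’t reproduced here; the standard triplet loss from FaceNet [5] that the figure refers to can be written, for an anchor a, positive p, negative n, distance function d, and margin m, as:

```latex
\mathcal{L}(a, p, n) = \max\bigl(0,\; d(a, p) - d(a, n) + m\bigr)
```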

The loss is smaller when the positive pair’s distance is low and the negative pair’s distance is high, and larger when the positive pair’s distance is high and the negative pair’s distance is low. Minimizing this loss during training therefore encourages the model to learn similar embedding representations for a query and its positive instances (since that reduces the positive pair’s distance) and dissimilar representations for its negative instances.

Source: FaceNet — A Unified Embedding for Face Recognition and Clustering [5]

We generate the positive and negative pairs implicitly, in a self-supervised manner. We assume that if a user liked a post during our training period, the pair is a “positive” pair. A negative is a randomly selected post from the same feed as the positive post. If the corpus of posts in each feed is large and the average number of posts a user likes is small, then the probability of a sampled negative actually being a false negative is small, so with some confidence we can assume the user did not like the other posts in a batch. Generating negatives on the fly keeps training quick and simple.
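A sketch of on-the-fly triplet generation under these assumptions; the adjacency structures (post_feed, posts_by_feed, liked_set) are hypothetical in-memory lookups, and each feed is assumed to contain many more posts than any one user has liked:

```python
import random

def sample_triplets(liked_pairs, post_feed, posts_by_feed, liked_set):
    """Yield (user, positive_post, negative_post) triplets. The negative is a
    random post from the same feed as the positive that the user did not like."""
    for user, pos_post in liked_pairs:
        feed = post_feed[pos_post]
        neg_post = random.choice(posts_by_feed[feed])
        # Re-sample on the rare chance we hit the positive or a liked post.
        while neg_post == pos_post or (user, neg_post) in liked_set:
            neg_post = random.choice(posts_by_feed[feed])
        yield user, pos_post, neg_post
```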

Siamese Network with GloVe Content Embeddings Input

We have two parallel neural networks with no weight sharing between them, one to learn user embeddings and one to learn post embeddings, in what can be described as a “siamese network” [6]. Each network can be any architecture; in our case each has two linear layers of sizes 256 and 128 with a ReLU activation in between. The output of each network is the embedding representation. For inputs, the precomputed GloVe post content embedding described earlier is passed into the post network, and the GloVe user content embedding into the user network.

We next compute the cosine similarity between the user and post network output embeddings for each positive user-post pair, and likewise for the negative pairs, after which we compute the triplet loss and back-propagate.
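A minimal PyTorch sketch of this setup, with two unshared two-layer towers and a cosine-similarity triplet loss; the input dimension of 128 assumes the PCA-reduced GloVe content embeddings, and the margin value is illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """Two linear layers (256 -> 128) with a ReLU activation in between."""
    def __init__(self, in_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, 128))

    def forward(self, x):
        return self.net(x)

user_tower, post_tower = Tower(), Tower()   # weights are NOT shared
optimizer = torch.optim.Adam(
    list(user_tower.parameters()) + list(post_tower.parameters()), lr=1e-3)

def triplet_loss(user_emb, pos_emb, neg_emb, margin=0.2):
    """Cosine-similarity triplet loss: push positive similarity above negative."""
    pos_sim = F.cosine_similarity(user_emb, pos_emb)
    neg_sim = F.cosine_similarity(user_emb, neg_emb)
    return torch.clamp(neg_sim - pos_sim + margin, min=0).mean()

def train_step(user_feats, pos_post_feats, neg_post_feats):
    """One training step on a batch of precomputed GloVe content embeddings."""
    u = user_tower(user_feats)
    p = post_tower(pos_post_feats)
    n = post_tower(neg_post_feats)
    loss = triplet_loss(u, p, n)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```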

Graph Convolutional Network Embeddings

The above contrastive siamese network can only use the information carried in each batch of user-post pairs. In our previous blog post, we mentioned how a graph structure can carry information from neighbors in a graph. But the problem with Word2vec-style context window methods like DeepWalk is that they only learn embeddings for the input IDs in the training vocabulary. When new users join Fishbowl or new posts are made, there is no way to look up an embedding for them in such dynamic networks.

Instead of learning embeddings for input IDs, we learn a generalized aggregation and encoding function over the IDs’ feature values, so that we can generalize to new, unseen posts or users as long as we have features for them. To do so, we extend the previous contrastive learning setup by further pooling feature information from neighbors in a graph when learning the output embeddings.

Such networks are often referred to as Graph Convolutional Networks (GCN). One such architecture we consider, first published by Pinterest, is PinSage [7]. Following the ideas in PinSage, we consider an undirected bipartite user-post graph with an edge between a user and a post if the user liked the post. To incorporate graph structure, we also pass in feature information from neighbors of an input vertex up to K hops away. But how do we pool feature information from all the K-hop neighbors?

One way is to take the mean of the K-hop neighbors’ feature values. Alternatively, we can parameterize a weight matrix that learns how to pool these features in a weighted fashion; we try both approaches. We also want to keep the input vertex’s own feature information, so we concatenate the input node’s features with the pooled neighborhood features into one longer vector. This is passed into a smaller linear layer with a non-linear ReLU activation, and the output of this layer can be thought of as encoding the final embedding for the input vertex.

To actually learn the weights to generate such embeddings we utilize the same contrastive triplet loss as the siamese network earlier.

The “convolve” operation, which aggregates feature information from a node’s neighbors in the graph. Source: “Graph convolutional neural networks for web-scale recommender systems” [7]
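A simplified sketch of one such convolve step using mean pooling; the learned weighted-pooling variant and the full PinSage machinery are omitted, and the dimensions here are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Convolve(nn.Module):
    """One GCN 'convolve' step: pool the sampled neighbors' features,
    concatenate with the node's own features, then project through a
    linear layer with a ReLU activation."""
    def __init__(self, in_dim=128, out_dim=128):
        super().__init__()
        self.proj = nn.Linear(2 * in_dim, out_dim)

    def forward(self, node_feats, neighbor_feats):
        # node_feats: (batch, in_dim); neighbor_feats: (batch, T, in_dim)
        pooled = neighbor_feats.mean(dim=1)                # mean-pool T sampled neighbors
        combined = torch.cat([node_feats, pooled], dim=1)  # (batch, 2 * in_dim)
        return F.relu(self.proj(combined))                 # embedding for the node
```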

Sampling Neighbors: A node can have a huge number of K-hop neighbors; in the worst case we would pool information from all V vertices in the graph, which is slow and memory intensive. Instead of pooling from all K-hop neighbors, the authors propose sampling up to T neighbors based on their importance to the input node; we determine importance by node degree. This sampling operation controls the memory footprint of the algorithm. It also adds an element of randomness each time the same input is seen during training, which improves generalization. In our implementation we experiment with small values of T, up to 50.
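A sketch of degree-based neighbor sampling; the adjacency and degree structures are hypothetical in-memory lookups:

```python
import numpy as np

def sample_important_neighbors(node, adjacency, degrees, T=50):
    """Sample up to T neighbors of `node`, weighting sampling probability by
    each neighbor's degree as a simple proxy for importance."""
    neighbors = list(adjacency[node])
    if len(neighbors) <= T:
        return neighbors
    weights = np.array([degrees[n] for n in neighbors], dtype=float)
    probs = weights / weights.sum()
    return list(np.random.choice(neighbors, size=T, replace=False, p=probs))
```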

Hard Negatives: We introduce the concept of “hard negatives” during training. We now consider a post a “negative” if it is from one of the bowls the user is subscribed to, is among the top 25 most-liked posts in that bowl, and was never liked by the user. The quality of a negative can be critical in helping the model learn. The previous setup may not provide as strong a contrast, because the user may never have seen the post if they were not part of the bowl, so it may not represent a true negative interaction. We choose harder negatives to minimize such exposure bias. By only considering the top posts of these bowls as potential negatives, we also increase the probability that the user actually saw the post and did not like it, since top posts are more likely to be seen.
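A sketch of selecting hard negatives under these rules; the data structures (subscribed_bowls, top_liked_posts_by_bowl, liked_set) are hypothetical:

```python
import random

def sample_hard_negative(user, subscribed_bowls, top_liked_posts_by_bowl, liked_set):
    """Pick a negative post that is (1) from a bowl the user subscribes to,
    (2) among that bowl's top-25 most-liked posts, (3) never liked by the user."""
    candidates = [
        post
        for bowl in subscribed_bowls[user]
        for post in top_liked_posts_by_bowl[bowl][:25]
        if (user, post) not in liked_set
    ]
    return random.choice(candidates) if candidates else None
```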

Results

We evaluate our models by training them on 12 months of historical data and testing them on the following week of data. For each user in the test period, we generate their embedding from training data only (to avoid leakage), and we compute embeddings for all the test-period posts they were eligible to see. We then rank those posts for each user by the cosine similarity of the user-post embedding pair and treat the top K ranked posts as the posts the user would have been recommended.

We evaluate model performance using metrics like Precision@K and Recall@K. Precision@K is the percentage of the top K recommended posts that the user actually liked during the test period. Recall@K is the percentage of all posts the user liked that appear among the top K recommendations.
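A small sketch of computing these metrics for a single user, where recommended is the cosine-similarity-ranked list of post ids and liked is the set of posts the user actually liked in the test period:

```python
def precision_recall_at_k(recommended, liked, k=10):
    """Precision@K and Recall@K for one user.
    recommended: ranked list of post ids; liked: collection of liked post ids."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(liked))
    precision = hits / k
    recall = hits / len(liked) if liked else 0.0
    return precision, recall
```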

Offline Testing Results for models relative to Doc2Vec Baseline (K=10)

Using the original Doc2Vec model as a baseline, the GloVe embedding model based on the 6 input text features performs best. The two neural network aggregation models both do slightly worse than just using GloVe. Adding the PCA reduction to keep only the top 128 components improves performance for the GloVe-based embedding model, and the GloVe-based model with PCA is the one we shipped for testing in production.

Discussion

We hypothesize that the improved generalization with PCA comes from condensing the embedding vector into a smaller space, removing noise from many less useful feature values. Additionally, many of our inputs and their corresponding embedding values are repetitive (for example, a post that mentions the same company as the user’s employer and the bowl name, or a post asking about the same kind of job as the job title of the user who made it), so many input feature embedding values can be highly correlated.

A point to note about the neural network models: both the Siamese and Graph Convolutional Networks converge very quickly, even with hard negatives, as shown below. We stop training after 3 epochs, since further training leads to severe overfitting. Similarly, increasing the representational complexity of the model, by adding more layers or, in the case of the GCN, increasing K to pool information from more than immediate neighbors, leads to worse performance on test data.

Hyper-parameter Tuning for Neural Network based embeddings

This suggests our data doesn’t warrant an overly complex model; perhaps pooling information from immediate neighbors is enough for good performance. The original GloVe aggregation algorithm does roughly this, though manually, by pooling the embeddings of all users who liked a post into the post embedding calculation and the posts a user liked into the user embedding calculation.

The GloVe-based aggregation model is therefore somewhat similar to the GCN. The major difference is that the GCN has additional learned weights to aggregate neighbor features and encode the output embedding, whereas the GloVe-based model simply combines neighbor features with equal weight. Such reduced model complexity can help improve generalization when overfitting is a risk.

We do however suspect that a better method of sampling negatives may have improved the overall quality of the neural network models as some improvement is observed by switching to harder negatives.

Conclusion

This wraps up our two part discussion of generating embeddings for Fishbowl user-post recommendations in the absence of more implicit user clickstream or impression data.

In this blog post we extended our previous recommendation model based on relatively unsupervised or self supervised techniques by utilizing some key machine learning ideas. We showed how we can add more and varied text features to improve recommendations. We explored the power of Transfer Learning via pre-trained language models like GloVe and BERT, and explored how to automatically learn customized embeddings through Siamese and Graph Convolutional Networks with the help of a Contrastive Loss.

All these ideas can be useful in recommendations when negative labels aren’t implicitly available and a supervised learning setup is hard to establish.

Even if such data is present, embedding similarity can still be very useful as a first-stage ranking and filtering step via K-NN retrieval over a very large volume of candidates. We can then further refine the ranking with more complex supervised learning methods later.

We hope you enjoyed this discussion!

References

  1. Le, Q., & Mikolov, T. (2014, June). Distributed representations of sentences and documents. In International conference on machine learning (pp. 1188–1196). PMLR.
  2. Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. “Glove: Global vectors for word representation.” Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014.
  3. Devlin, Jacob, et al. “Bert: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).
  4. Reimers, Nils, and Iryna Gurevych. “Sentence-bert: Sentence embeddings using siamese bert-networks.” arXiv preprint arXiv:1908.10084 (2019).
  5. Schroff, Florian, Dmitry Kalenichenko, and James Philbin. “Facenet: A unified embedding for face recognition and clustering.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
  6. Koch, Gregory, Richard Zemel, and Ruslan Salakhutdinov. “Siamese neural networks for one-shot image recognition.” ICML deep learning workshop. Vol. 2. 2015.
  7. Ying, Rex, et al. “Graph convolutional neural networks for web-scale recommender systems.” Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 2018.
