Personalized Fishbowl Recommendations with Learned Embeddings: Part 1

Ahmad Khan
Glassdoor Engineering Blog
10 min read · Jan 7, 2022

Glassdoor recently acquired Fishbowl, a professional networking community where working professionals can have workplace-related conversations with peers in their industry.

Fishbowl users can anonymously write posts and see posts from other anonymous users in what we call “bowls”: collections of posts related to a certain workplace, industry, or topic. Bowls can be an effective way to gain insights into workplace topics and conversations, and the anonymous nature of the app can further encourage honest and frank discussion on topics users may otherwise feel uncomfortable discussing. Users can subscribe to different bowls and then see new posts from their subscribed bowls in their main home feed when they open the app.

Fishbowl users create thousands of new posts every day, and with a growing, increasingly active user base that number will keep rising. Surfacing the most interesting content therefore becomes more challenging with scale, and a lack of personalization can detract from the overall user experience. Given the large number of possible posts to recommend and the small number that can be surfaced in the app at any time, we have a classic recommendation system problem.

From Global Rankings to Personalized Recommendations

Shortly after Glassdoor’s acquisition, we decided to use machine learning to personalize the posts recommended to each user. Prior to this, Fishbowl sorted the posts shown in app purely by recency and global popularity.

When the project started, we also did not collect any explicit click data that could have turned this into a classic supervised learning problem (e.g., predicting the probability that a user clicks a post given that they saw it). While we collected data on which posts a user anonymously liked or commented on, we did not collect reliable data on whether they saw or clicked a post. Why does this matter?

To train a supervised model, we can use such implicit labels based on user activity, but we would need not only an implicit positive engagement label (e.g., post liked after the user clicked it) but also an implicit negative engagement label (e.g., post clicked but not liked) so the model can discriminate between positive and negative engagement. In the more straightforward setting where these binary implicit labels exist, we can train a classical supervised ML model (e.g., gradient boosting or logistic regression), interpret its output as the probability of positive engagement, and use those probabilistic scores for ranking.

However, in the absence of click data we lack a strong implicit signal for negative interactions. If a user did not like a post, it may be because they genuinely did not like it or because they never saw it at all, so there is ambiguity in what constitutes a negative training instance. In the absence of clear negative engagement labels, we instead turned to unsupervised embedding methods to generate personalized recommendations for users.

Embeddings for Personalized Recommendations

Our goal is to recommend posts to users that are similar to what they, or users similar to them, liked in the past. If we can project users and posts into a shared vector embedding space, we can compute the similarity between users and posts and use that similarity score for ranking. The key assumption is that similar posts and users cluster together in the embedding space. There are many common similarity metrics (e.g., Euclidean or Manhattan distance), but here we use cosine similarity: the dot product of two vectors normalized by their magnitudes. The dot product tells us how aligned two vectors are (larger when they point in similar directions, smaller when they do not), and normalizing by magnitude keeps the score independent of vector length.
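As a minimal, self-contained illustration (the vectors below are made-up toy values, not real embeddings), cosine similarity can be computed with NumPy:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product normalized by both magnitudes; result lies in [-1, 1].
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

user_vec = np.array([0.2, 0.7, -0.1])  # toy user embedding
post_vec = np.array([0.3, 0.6, 0.05])  # toy post embedding
print(cosine_similarity(user_vec, post_vec))  # ~0.96 => well aligned
```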

But how do we find similar users and posts when posts and users are two distinct entities? Moreover, much of our data is raw text rather than numerical feature representations. For Part 1 of this project we consider only the text of a post as the feature for computing similarity and ranking, so we first need a way to convert post text into numerical representations.

From Text to Embeddings

Embedding methods have become very popular, especially in the NLP domain. LDA [1] is a popular method for representing text as a vector of “topic” probabilities, and Word2Vec [2] was one of the first methods to use neural networks to generate embeddings of words. Doc2Vec [3] later extended that model to generate embeddings for entire sentences and documents.

Word2Vec and Doc2Vec

A quick recap of Word2Vec and Doc2Vec. Word2Vec (SkipGram) learns embeddings in an unsupervised manner: each word is used to predict the other words in its context, where the context of a word is the group of K words around it in a sentence. Words that frequently occur in each other’s context are pushed closer together in the embedding space, while words that rarely co-occur end up more distant. Doc2Vec uses the same paradigm but, instead of learning only word embeddings, also learns an embedding for the entire sentence/document by adding an extra weight matrix for the document id the words come from. This way we are not just predicting the words in a word’s context but also whether they come from a similar document.

Doc2Vec Architecture (Credit: Doc2Vec Paper[3])
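As a rough sketch of how such a model can be trained in practice (using gensim’s Doc2Vec implementation; the corpus, tags, and hyperparameters below are illustrative toy values, not our production setup):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each post's text becomes one tagged document (hypothetical ids).
corpus = [
    TaggedDocument(words="how do you negotiate a raise".split(), tags=["post_1"]),
    TaggedDocument(words="tips for salary negotiation calls".split(), tags=["post_2"]),
    TaggedDocument(words="best coffee spots near the office".split(), tags=["post_3"]),
]

# dm=1 is the distributed-memory variant, which learns a document vector
# alongside the word vectors; dm=0 would give the DBOW variant.
model = Doc2Vec(corpus, vector_size=64, window=5, min_count=1, epochs=40, dm=1)

post_vec = model.dv["post_1"]     # learned post (document) embedding
word_vec = model.wv["negotiate"]  # learned word embedding
```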

Content Doc2Vec Model

We use Doc2Vec to generate such embeddings of posts. Our first model, the content doc2vec model, generates a content (text) embedding for each post by treating each post’s text as a unique document during training.

The content doc2vec model can learn semantic similarities between posts: ideally, posts with similar text will have high similarity scores. However, our goal is to find similarities between users and posts, not just between posts, and a content doc2vec model trained on a corpus of Fishbowl posts yields no embedding for users. Therefore, to compute a user embedding, we take the average content (text) embedding of all the posts the user liked during the training period. This gives us a representation of both users and posts in the same embedding space:

user_embedding(u) = (1/N) · Σ_{p ∈ P} content_embedding(p)

where P = the set of posts the user liked and N = the number of posts the user liked
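In code, this averaging step might look like the following sketch (post_content_embeddings is a hypothetical dict mapping each post id to its content doc2vec vector):

```python
import numpy as np

def user_embedding(liked_post_ids, post_content_embeddings):
    # The user's embedding is the mean of the content embeddings
    # of every post they liked during the training period.
    vectors = [post_content_embeddings[pid] for pid in liked_post_ids]
    return np.mean(vectors, axis=0)
```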

A pure content model captures some notion of semantic similarity between users or between posts, and we can view it through the lens of content-based filtering. But it does not capture what other similar users liked (i.e., collaborative filtering). Semantic similarity is just one of dozens of reasons a user may like a post, many of which have little to do with how similar the post’s content is to posts the user previously liked. A user may, for instance, be more predisposed to liking content from users at their own company or in their own industry. This is where adding some aspect of collaborative filtering can be beneficial.

To incorporate such information, we also average the embeddings of all the users who liked a post into that post’s embedding calculation, in addition to the post’s own learned content (text) embedding from the content doc2vec model, as shown below. To avoid accidental leakage, we exclude the post whose embedding is being calculated from all user embedding calculations.

post_embedding(p) = (1/(M+1)) · ( content_embedding(p) + Σ_{u ∈ U} user_embedding(u) )

where U = the set of users who liked the post and M = the number of users who liked the post
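A sketch of one way to implement this (the equal weighting of the content term and each liker’s embedding is our reading of the formula above; likers and user_embeddings are hypothetical lookups, with the user embeddings already computed excluding the target post):

```python
import numpy as np

def post_embedding(post_id, post_content_embeddings, likers, user_embeddings):
    # Average the post's own content embedding together with the
    # embeddings of all users who liked it.
    vectors = [post_content_embeddings[post_id]]
    vectors += [user_embeddings[uid] for uid in likers[post_id]]
    return np.mean(vectors, axis=0)
```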

Graph Doc2Vec Model

An alternative way to incorporate collaborative information is to view Fishbowl recommendations as a graph learning problem. We can model Fishbowl as a graph whose vertices are the set of all unique user ids and post ids, with an undirected edge between a user id and a post id whenever that user liked that post, as shown below on some hypothetical data. We can then aggregate information from a vertex’s neighbors by traversing this graph.

Fishbowl Content Graph

We extend the idea of Doc2Vec to this graph setting. Each vertex in the graph (a user id or post id) is treated as a unique document id, and its document content (its “words”) is the list of the first 100 vertices (user or post ids) visited from that starting vertex via Breadth First Search. Why BFS?

Because Breadth First Search traverses a graph level by level, it captures a vertex’s immediate neighbors before moving on to more distant descendants, which are of lower importance for a user’s or post’s document representation. Each visited vertex id in this list is treated as a single “word” of the document, and the word embeddings are learned along with the document embedding.

For example, for the hypothetical user id 876ylt in the Fishbowl graph above, the following could be the returned document if we assume a document length of 5:

[5ddeb1b2c7, 987abr48y, 5901088cag, 675abr45, 78564kju]

And for the hypothetical post id 78564kju, the following could be the document with length 5:

[5901088cag, 675abr45, 987abr48y, 8954563, 8796yit]

Running Doc2Vec on this corpus of vertex ids visited during graph traversal from each starting vertex gives us an embedding representation of every vertex in the graph (user id or post id).
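A compact sketch of this pipeline (a hand-rolled BFS over a toy adjacency-list graph feeding gensim’s Doc2Vec; the vertex ids echo the hypothetical example above, and the hyperparameters are illustrative):

```python
from collections import deque
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy likes graph as adjacency lists; vertex ids are hypothetical.
graph = {
    "876ylt": ["5ddeb1b2c7", "987abr48y"],
    "5ddeb1b2c7": ["876ylt", "5901088cag"],
    "987abr48y": ["876ylt", "675abr45"],
    "5901088cag": ["5ddeb1b2c7"],
    "675abr45": ["987abr48y"],
}

def bfs_document(start, graph, max_len=100):
    # Visit vertices level by level; the visited ids (excluding the
    # start vertex itself) become the start vertex's "document".
    visited, order, queue = {start}, [], deque([start])
    while queue and len(order) < max_len:
        for neighbor in graph.get(queue.popleft(), []):
            if neighbor not in visited:
                visited.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order[:max_len]

corpus = [TaggedDocument(words=bfs_document(v, graph), tags=[v]) for v in graph]
model = Doc2Vec(corpus, vector_size=64, min_count=1, epochs=40)
user_vec = model.dv["876ylt"]  # embedding of a user vertex
```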

Graph DeepWalk Model

DeepWalk [4] proposes an alternative graph-based embedding method. DeepWalk initiates random walks of up to a certain length T from each node in a graph. The list of nodes on a random walk from a starting node is treated as a unique “sentence,” and each vertex visited during the walk as a “word.” DeepWalk generates several random walks per node, so we have multiple “sentences” from the same starting node. After generating this corpus of random-walk sentences, DeepWalk trains a Word2Vec SkipGram model where the context of each vertex in a sentence is the k surrounding vertices. This method also yields powerful representations of each node in a graph.
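DeepWalk can be sketched in a few lines on the same kind of adjacency-list graph (again, the walk counts, walk lengths, and hyperparameters are toy values):

```python
import random
from gensim.models import Word2Vec

def random_walk(start, graph, walk_length):
    # One truncated random walk: each step moves to a uniformly
    # random neighbor of the current vertex.
    walk = [start]
    while len(walk) < walk_length:
        neighbors = graph.get(walk[-1], [])
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return walk

def deepwalk_corpus(graph, walks_per_node=10, walk_length=40):
    # Several walks per starting node give multiple "sentences" per node.
    walks = [random_walk(node, graph, walk_length)
             for _ in range(walks_per_node) for node in graph]
    random.shuffle(walks)
    return walks

# Toy graph like the one in the previous sketch; sg=1 selects the
# SkipGram objective used by DeepWalk, with the context window over
# nearby vertices in each walk.
graph = {"a": ["x", "y"], "x": ["a"], "y": ["a", "z"], "z": ["y"]}
model = Word2Vec(deepwalk_corpus(graph), vector_size=64, window=5,
                 min_count=1, sg=1, epochs=20)
node_vec = model.wv["a"]
```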

From Embeddings to Serving Recommendations

Training each of the three models above gives us an embedding representation for every post and user id present during training. For the graph model, we kept the Doc2Vec BFS variant, as it gave slightly better performance than DeepWalk.

We concatenate the user embedding and post embedding from each model into one final hybrid embedding that captures both content similarity and user (collaborative) similarity.

User Embedding Generation
Post Embedding Generation

Each of the content and graph models is deployed via MLflow as a REST API service. To serve recommendations, we are given a list of candidate posts to rank for each user; candidates come only from bowls the user is subscribed to, since users cannot see posts from bowls they have not joined. On average, each user has about 200 new candidate posts to rank per request. We query the graph and content models to compute the cosine similarity between the user’s hybrid embedding and each candidate post’s hybrid embedding, rank all candidates by that similarity, and serve the top-ranked K posts per the client’s request. On average, the whole inference process takes about 9 ms per batch request.
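Put together, the serving-time ranking logic amounts to a cosine-similarity sort, roughly like this sketch (the names and the value of K are illustrative; the real service runs behind MLflow REST endpoints):

```python
import numpy as np

def rank_candidates(user_hybrid_vec, candidate_hybrid_vecs, k=10):
    # Score each candidate post by cosine similarity to the user's
    # hybrid (content + graph) embedding, highest first.
    u = user_hybrid_vec / np.linalg.norm(user_hybrid_vec)
    scores = {pid: float(u @ (v / np.linalg.norm(v)))
              for pid, v in candidate_hybrid_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hybrid embeddings are concatenations of the content and graph vectors:
# user_hybrid_vec = np.concatenate([content_user_vec, graph_user_vec])
```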

Conclusion

In this blog post we saw how, in the absence of labeled data, we can use embedding methods to generate reasonable Fishbowl post recommendations. In the next blog post we will see how to improve on this approach even further.

Here we trained a content model from scratch using Doc2Vec, but our Fishbowl corpus is still relatively small, which may not be enough data to learn all the semantic nuances of the English language. In the next post we will see how we can leverage transfer learning with models like GloVe or BERT, pre-trained on much larger corpora (e.g., Wikipedia), to generate more semantically meaningful embeddings for our text features.

Second, post text is just one signal we can leverage for personalization. We will also consider other text features, such as user title, user company, user location, post feed name, and post feed description, to generate higher-quality embeddings.

Finally, the graph embedding methods described above assume all post and user ids already exist at prediction time. This is an unrealistic assumption: the Fishbowl graph is extremely dynamic, with hundreds of new posts added every day and new users joining the network. New posts are also the ones that get the most engagement on a given day, yet a graph model that did not see a post during training produces no meaningful embedding for it. This makes vertex-id-based graph embeddings a poor fit for real-time inference. In the next blog post we will see how we can circumvent this problem while still utilizing the graph structure, by training on the feature information of vertices instead of the vertex ids themselves to generate high-quality embeddings.

Stay tuned for more!

References

  1. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993–1022.
  2. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  3. Le, Q., & Mikolov, T. (2014, June). Distributed representations of sentences and documents. In International conference on machine learning (pp. 1188–1196). PMLR.
  4. Perozzi, B., Al-Rfou, R., & Skiena, S. (2014, August). Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 701–710).
