Applied Math & ML: A Real Data Science Interview Project

A take-home data science project that I received for a social network…

Shane Austrie
The Urban Nerd
17 min read · Jul 20, 2019


Photo by Chris Liverani on Unsplash

Hey everyone 👋🏾 Today I want to share with you the executive summary for a timed data science interview project that I recently completed. Hopefully this can help new data scientists prepare for interviews that include a take-home data science project. Enjoy!

Notes:

  • This was for a VC-funded social networking startup
  • I have redacted the company name out of respect for their privacy
  • For mental context, you can simply think of the company as Instagram
  • Anywhere I used the company name, I changed it to “[The Company]”
  • The visual formatting of the original executive summary is drastically different (and better), since it was written in Markdown for a Jupyter Notebook
  • Non-screenshot images were added specifically for Medium

Executive Summary

Photo by Georgia de Lotz on Unsplash

“One-Liner”

In 2 to 3 hours, I accomplished topic extraction for posts in order to create latent feature embeddings for both users and posts. This then allowed me to produce post-to-user, user-to-user, and post-to-post recommendations via cosine similarity.

Background

The problem presented to me was to create an unsupervised (i.e. no labels/source of truth) recommender system for [The Company] around users and user posts. One key constraint was to accomplish this within a 3-hour window. Due to that restriction, I decided that the current state of this notebook would focus on building the core of the recommendation system. However, at the end of this executive summary I mention what improvements I would implement if allowed to continue working on this (e.g. hyperparameter tuning, LDA2Vec, etc.).

Photo by adrian on Unsplash

Data Exploration

The dataset provided was a compressed csv containing 590k rows. Each row is a user post from [The Company]’s platform. Each row has the following fields: user_id, post_id, fan_is_private, fan_is_verified, fan_follower_count, fan_following_count, ugc_media_type, ugc_caption, ugc_comments_count, ugc_likes_count, ugc_created_time, ugc_campaign_ugc, ugc_objects, and ugc_styles.

My first step was to find the fields that were the least useful. The way I determined whether a feature was useful was by looking at its variance. A field that has practically the same value for all rows tells us nothing new (unless we're working with a highly unbalanced dataset doing traditional classification, e.g. fraud detection or click prediction).

Upon inspection, most fields could be useful once cleaned of None and NaN values. The one field that had no variance was ugc_media_type (the dataset consisted entirely of image posts), so I decided to drop that field. Additionally, I noticed little variance for ugc_campaign_ugc (whether the user post was associated with a brand's campaign). However, I felt that this field might be useful if I decided to improve this recommendation system (posts associated with a brand's campaign may experience different levels of engagement, and could be analyzed by a specialized machine learning model).

Additionally, when looking for bad/invalid values in other fields (e.g. some rows in the fan_is_private and fan_is_verified columns contained random text, which looked like someone's bio or post caption, instead of boolean values), I noticed a correlation between invalid values. After a bit more exploration, I came to the definite conclusion that a row with invalid inputs in any column also tended to have invalid (NaN) user_id and post_id columns. I decided a good portion of my data cleaning for the core recommendation system could be done just by dropping rows where the user_id or post_id was NaN or None.
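
To make this cleaning step concrete, here's a minimal pandas sketch of the idea described above (the file name is hypothetical, and the actual notebook's code may differ):

```python
import pandas as pd

# Hypothetical file name for the compressed CSV described above
df = pd.read_csv("company_posts.csv.gz", compression="gzip")

# Drop rows whose user_id or post_id is NaN/None; per the exploration above,
# these rows tend to carry invalid values in the other columns too
df = df.dropna(subset=["user_id", "post_id"])

# ugc_media_type had zero variance (all image posts), so it adds no signal
df = df.drop(columns=["ugc_media_type"])
```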

Photo by Frank Wang on Unsplash

Current Model System (LDA + Cosine Similarity)

I decided the best approach for building the core recommendation system under these time constraints was to focus on the core features of a social media post: the post's textual and visual information in relation to the user. In this dataset, the post's textual information is found in the ugc_caption column, and the visual information detected by [The Company]'s proprietary technology is found in the ugc_objects column. After preprocessing the post's textual and visual information, you can determine the different topics a post correlates to.

0) NLP Preprocessing — Cleaning and Transforming Post’s Text

I processed a post's textual information by turning emojis into words that describe the emoji, turning hashtags into words (even handling hard hashtags, e.g. #iamgreatlovingmylife, by iterating through the English dictionary), and removing mentions (e.g. @user_X). The reason I didn't simply remove emojis and hashtags (even hard hashtags) is that these pieces may contain vital information, since the poster felt compelled to use them.
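
A rough sketch of what this emoji/hashtag/mention handling could look like; it assumes the third-party emoji package and nltk's words corpus (the notebook's actual implementation may differ):

```python
import re
import emoji                   # third-party package: pip install emoji
from nltk.corpus import words  # requires nltk.download("words")

ENGLISH = set(w.lower() for w in words.words())

def split_hashtag(tag):
    """Greedily split a 'hard' hashtag like 'iamgreatlovingmylife' into words."""
    out, i = [], 0
    while i < len(tag):
        # take the longest dictionary word starting at position i
        for j in range(len(tag), i, -1):
            if tag[i:j] in ENGLISH:
                out.append(tag[i:j])
                i = j
                break
        else:
            i += 1  # no dictionary word starts here; skip one character
    return out

def clean_caption(text):
    text = emoji.demojize(text, delimiters=(" ", " "))  # e.g. ❤️ -> red_heart
    text = re.sub(r"@\w+", " ", text)                   # remove mentions
    text = re.sub(r"#(\w+)",                            # expand hashtags
                  lambda m: " ".join(split_hashtag(m.group(1).lower())),
                  text)
    return text
```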

The post's visual information was represented as an array of objects. The array could either contain just the value None, or it could list the objects detected in the post's image (e.g. shoes). Several edge cases were discovered along the way. One edge case: even when the array contained objects, None values could still be scattered throughout the array.

An additional edge case was the format in which the visual information was given. The visual information (i.e. the array) was given as a string instead of a regular Python object (e.g. instead of [1,2,3], I was given "[1,2,3]"). If you weren't cognizant while working with this dataset, you might end up storing the string's characters (including the brackets) instead of just the values in the array (e.g. ["[", "s", "h", "o", "e", "s", "]"] instead of ["shoes"]). I assumed this was a standard database practice at [The Company] in order to save database space. I parsed the string using the Python ast module's literal_eval method, in order to turn the string into an object. Note: if you were to mistakenly parse the string using Python's built-in eval function, you could potentially execute malicious Python code.
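
A small sketch of the safe parsing described above:

```python
import ast

raw = '["shoes", None, "handbag"]'  # ugc_objects arrives as a stringified list

objects = ast.literal_eval(raw)     # safe: only parses Python literals
objects = [o for o in objects if o is not None]  # drop scattered None values
print(objects)                      # ['shoes', 'handbag']
```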

Once I had the textual and visual information parsed for each post, I created a new column called shane_bag_of_words containing the string combination of the two fields. I then created training and testing sets. Using the SnowballStemmer package from nltk, I reduced every word in shane_bag_of_words for the training set to its root word. This process is called stemming. We needed to do this because, without it, if two users used the same word with different prefixes or suffixes (e.g. happy and happiness), the model would later treat them as completely separate words.

The reason I used the Snowball algorithm for stemming: there are generally two well-regarded algorithms for stemming, Porter and Lancaster. Lancaster is the computationally fastest algorithm and is very aggressive at finding the root of words, in order to group as many similar words together as possible. However, Lancaster's root version of a word is frequently hard to read when manually analyzing the results, and sometimes it overgroups words (i.e. the words are similar in root, but you wouldn't really group them together as one word). Porter is less aggressive (for better or for worse...) but has a higher computation time. Snowball, often regarded as Porter V2, is the middle ground between Porter and Lancaster. Words reduced to their stem are frequently referred to as tokens in the Natural Language Processing (NLP) community, but for the simplicity of this summary, we will continue referring to them as words.
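
To see the difference between the three stemmers for yourself, here's a quick nltk comparison:

```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

porter, lancaster = PorterStemmer(), LancasterStemmer()
snowball = SnowballStemmer("english")

for w in ["happy", "happiness", "loving", "maximum"]:
    print(w, "->", porter.stem(w), lancaster.stem(w), snowball.stem(w))
# Lancaster is the most aggressive (e.g. 'maximum' -> 'maxim'),
# while Porter and Snowball stay closer to readable roots.
```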

Throughout this process, we're careful to filter out stop words (e.g. "a", "as", "the", etc.), because these are highly popular filler words; if they were utilized, the model might mistakenly think certain posts are similar due to the presence of popular words that have no inherent meaning/distinction.

We then take this processed text and make a dictionary set out of the words, while also keeping track of how often they occur and in which posts. We filter out words that occur in fewer than 7 posts (they happen so infrequently that they're useless for clustering posts together by the words they use), as well as words that occur in over 50% of our posts (those words occur so frequently that we can't use them to divide posts into separate clusters). After this filtering step, we keep only the top 50,000 words for the sake of space and computation time.
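
With gensim (an assumption on my part; it's the usual library for this workflow), the dictionary-building and filtering step looks roughly like this:

```python
from gensim.corpora import Dictionary

# tokenized_posts: a list of token lists, one per post, from the
# cleaning + stemming steps above (hypothetical variable name)
dictionary = Dictionary(tokenized_posts)

# keep words in >= 7 posts and <= 50% of posts, top 50,000 overall
dictionary.filter_extremes(no_below=7, no_above=0.5, keep_n=50000)

# bag-of-words corpus: (token_id, count) pairs per post
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_posts]
```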

1) Latent Dirichlet Allocation (LDA) — Topic Extraction for Posts

This dictionary is then transformed into the bag-of-words format required by the machine learning model we plan to use: Latent Dirichlet Allocation (LDA). LDA is a popular model within the NLP community for extracting/learning topics from a group of documents (e.g. legal documents, news articles, social media posts; basically, any body of text), and then grouping both seen and unseen documents into the topics it previously learned.

Additionally, it gives you a probability of the document belonging to each topic group/cluster (e.g. Document_1 is 60% about topic A, 20% about topic B, and 20% about topic C). This soft/probabilistic clustering approach works significantly better for real-life documents than algorithms that force a single topic (e.g. hard/non-probabilistic clustering algorithms like k-Means), because documents normally have multiple subtopics (e.g. a social media post about parenting may cover both children's food products and children's toys).

For our use case, we use LDA to see what topics a post is about. There are a lot of hyperparameters when creating an LDA model (e.g. the number of topic clusters to create, the number of times to iterate through our bag of words, etc.). The default number of topics is 100, because (generally) the more topics you have, the better your accuracy when measuring how similar two documents are. However, due to my time limit, I chose 10 topics (for computation speed) and left the rest of the hyperparameters at their defaults.
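
Training the model itself, again sketched with gensim under the same assumption:

```python
from gensim.models import LdaModel

# 10 topics for speed instead of the default 100;
# every other hyperparameter is left at its default
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10)

# peek at the learned topics (top 5 words each)
for topic in lda.print_topics(num_topics=10, num_words=5):
    print(topic)
```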

2) Cosine Similarity — Measuring Similarity of Users and Posts

Once the LDA model learns our 10 topics, we can create a 1-dimensional array of size 10 (we'll refer to this array as a vector from now on) for each post. Each post's vector can be thought of as a point on a graph: imagine an X and Y graph, but instead of two dimensions/axes (X and Y), we have ten dimensions/axes (one for each topic).
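
A sketch of turning each post's bag-of-words into that 10-dimensional vector (continuing the gensim assumption; minimum_probability=0 forces LDA to return a weight for every topic):

```python
import numpy as np

def post_embedding(bow, lda, num_topics=10):
    """Turn a post's bag-of-words into a dense 10-dim topic vector."""
    vec = np.zeros(num_topics)
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0):
        vec[topic_id] = prob
    return vec

post_vectors = np.array([post_embedding(bow, lda) for bow in corpus])
```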

These ten dimensions are called latent features in the recommender systems community (and the set of latent features for an individual user or post is called an embedding, i.e. a user's embedding and a post's embedding), because they provide a hidden/better representation of whatever blatant features you're trying to represent. (A good way of remembering this: latent is the opposite of blatant, even though they're spelled in a similar fashion. So we changed useless blatant information about our post into useful latent information.)

Since we can now represent a social media post on a graph, we can tell how similar two posts are based on either how close they are on the graph, or, using more advanced linear algebra, whether the two social media post vectors point in the same general direction. The latter option is called cosine similarity, which is what I chose to use to tell whether two social media posts are similar.

The equation for cosine similarity: similarity = cos(θ) = (A · B) / (‖A‖ ‖B‖)

Cosine similarity is calculated by taking the dot product of the two vectors and dividing by the product of the two vectors' norms (magnitudes). The dot product captures both the magnitudes and the angle between the two vectors (think vectors in physics); dividing by the product of the norms removes the magnitudes, leaving only the angle between the two vectors.
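
In code, the formula is a one-liner:

```python
import numpy as np

def cosine_similarity(a, b):
    # dot product divided by the product of the two vectors' norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```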

Additionally, since we have a way to represent a post via post embeddings, we also have a way to represent users via user embeddings. Though we aren't given which posts a user has engaged with, we do know which posts a user has posted.

Under the hypothesis that users like to see the type of content that they themselves post, we can represent a user as the average of their posts' embeddings. For example, if we have two topics and a user has three posts represented as post embeddings ([[Topic A: 0.9, Topic B: 0.1], [Topic A: 0.8, Topic B: 0.2], [Topic A: 0.95, Topic B: 0.05]]), then the user can be represented as the average of those three post embeddings: ([0.9, 0.1] + [0.8, 0.2] + [0.95, 0.05]) / 3 = [Topic A: 0.88, Topic B: 0.12]. We can then treat the user embedding like a post embedding and compute cosine similarity between the user and all post embeddings to get the top posts they would like (i.e. the posts most similar to what that user usually creates), as well as perform user-to-user recommendations (recommending users who create posts about similar topics). We can also do post-to-post recommendations if [The Company] wanted a discovery section like Instagram's (when you click a post in Instagram's discovery section, you see posts specifically similar to the post you clicked on).
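
Here's a sketch of the user-embedding idea, reusing post_vectors and cosine_similarity from the earlier snippets (posts_by_user, mapping each user_id to the indices of their posts, is a hypothetical structure):

```python
import numpy as np

# a user is the average of their posts' embeddings
user_embedding = {
    uid: post_vectors[idxs].mean(axis=0)
    for uid, idxs in posts_by_user.items()
}

def top_posts_for_user(uid, k=5):
    """Rank all posts by cosine similarity to the user's average embedding."""
    sims = np.array([cosine_similarity(user_embedding[uid], p)
                     for p in post_vectors])
    return np.argsort(sims)[::-1][:k]  # indices of the k most similar posts
```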

Photo by Clint Patterson on Unsplash

Testing

Even though this is an unsupervised algorithm, we still need a way to measure success. This becomes especially important when we want to know whether a change (hyperparameter, new feature, new model, etc.) is an improvement or a detriment. Once we have a success metric, the performance of our previous way of doing things (e.g. an old hyperparameter, feature, model, etc.) is the baseline (the minimum we're trying to beat/improve upon).

Since we have no previous model to compete against in this case, my baseline is simply random chance (i.e. if I made a random guess on which option to take — e.g. if I randomly picked which post to show a user).

One of my initial thoughts on how to test this: I could manually generate a ranked list of top posts for a user. I would then randomly generate latent features (i.e. randomly pick a decimal between 0 and 1 for each latent feature) for the posts in that list as well as for the user, while simultaneously generating the real latent features for the posts and user using our LDA model. We would then compute cosine similarity of the real user latent features against the real latent features of each post, and likewise compute cosine similarity of the random user latent features against the random latent features of each post. Then we'd see whether the real latent features get closer to my manual ranking than the random latent features do.

However, there are several issues with this method. The main one: it's manual and doesn't scale, since for statistical and scientific reasons we can't keep reusing the same exact list.

The second idea I had was to treat this as a binary classification problem (i.e. fraud or not fraud). We select X user_ids and create P distinct pairs of user_ids from the X user_ids, while making sure no user gets paired with themselves. For each pair in P, we compute the cosine similarity of each of the two users against every post created by the two users in the pair; whichever user in the pair has the higher cosine similarity against a post will "own" that post. Our accuracy percentage is based on how many times we correctly classified which user made the post. Our baseline is randomly assigning a user to a post.

One improvement I still need to incorporate into this most recent idea is accounting for the potential bias from users who have more posts, but for time purposes, I'm ignoring it. Additionally, I want to do cross-validation on top of this test, to account for potential skew in the test results caused by which users are randomly selected.
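
A sketch of this pair-based evaluation, reusing the hypothetical structures from the earlier snippets:

```python
import random
from itertools import combinations

def pair_classification_accuracy(user_ids, n_users=10):
    """Classify each post as belonging to whichever user in the pair
    has the higher cosine similarity with it."""
    sample = random.sample(user_ids, n_users)
    correct = total = 0
    for u1, u2 in combinations(sample, 2):  # 10 users -> 45 pairs
        for owner in (u1, u2):
            for idx in posts_by_user[owner]:
                s1 = cosine_similarity(user_embedding[u1], post_vectors[idx])
                s2 = cosine_similarity(user_embedding[u2], post_vectors[idx])
                predicted = u1 if s1 > s2 else u2
                correct += (predicted == owner)
                total += 1
    return correct / total
```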

Results

  • Model: LDA(10 Topics) + Cosine Similarity
  • Test Scenario: 10 random users -> 45 Pairs -> 96,822 Binary Classifications
  • Model Accuracy: 69.06%
  • Baseline/Random Accuracy: 50.12%

Here are some visual result samples. For user_id 700, who tends to talk about family and being a mother (first screenshot), some of the top-5 posts recommended for her were about motherhood and relationships (second and third screenshot).

  • In case you can’t see the screenshot on your device, most of her captions make statements like: “Merry Christmas from our family to yours!” and “And just like that Sophia is 6, 7, 8 months old!”
  • In case you can’t see the screenshot on your device, it states: “Monday has been amazing . A really positive day . Finally had my daughters parents evening and it was amazing . Going in all the right directions and learning so much . Not only am I proud of her but I’m proud of myself because I’m parenting the shiz out of life and I couldn’t be more happy 😊. Bad days certainly happen but today is a brilliant day to be happy 😃 #parentingtheshitoutoflife #happymum #mumwin #borntobehappy #blessed”

This top recommended post shows the power of the recommender system for utilizing detailed captions

  • In case you can’t see the screenshot on your device, it states: “Me & J @jennadbland #mums60”

This top recommended post shows the power of the recommender system for utilizing the keywords and hashtags in a caption, even when the caption is unbelievably short

Photo by Yiran Ding on Unsplash

Improvements

I have a list of improvements I would like to implement, but couldn’t due to time and resources:

  • Hyperparameter tuning
  • Utilizing the posts they choose to engage with, in combination with the posts they create themselves

Using numerical weights, I would play with tuning how much we should account for the posts they choose to engage with vs. the posts they create themselves when creating the user embedding

  • Creating an embedding of a user’s affinity to certain post styles (e.g. flatlay pics, group pics, etc.), i.e. how often they engage with and create posts containing those styles, using the ugc_styles column
  • Optimizing the top posts recommended to a user

We should take into consideration the age of the post, existing engagement of the post, the median engagement of the poster, and the user’s affinity to the style of the post

  • Optimizing the top users recommended to a user

We should take into consideration the last time that the recommended user created a post, the median engagement of the recommended user (especially recent posts), and the user’s affinity to the styles posted by the user being recommended.

  • Utilizing the mentions (i.e. @user_2) in comments and post captions when recommending two users (if it’s not a company/object, two users mentioning each other may reveal a negative or positive connection, which may help us decide whether we need to adjust the user embeddings or simply recommend users similar to the user that was mentioned)
  • Utilizing LDA2Vec: we may be able to get a stronger signal with LDA2Vec than with LDA, because it takes into account the context of words instead of just word frequencies. (Though we’d then need to think about how to combine the textual information with the visual information.)
  • Predicting engagement for a new post for various reasons

We would use the post information (textual & visual information, current engagement, time to be posted, etc.) and user information (representation of their average post, their median engagement, usual creation time of a post, etc.) to predict engagement. This can be done through classical supervised machine learning, but with enough data we can use deep learning in order to take advantage of a deep model’s feature selection and feature extraction capabilities.

Photo by Guilherme Cunha on Unsplash

Productionizing

  • Since this is meant to be a constant live feed for a user, we will mainly have to take an online approach rather than an offline approach (e.g. making a prediction as soon as a web request comes in, rather than making predictions at night and storing them for the next day’s use)
  • The simplest way of productionizing this is to have the LDA model on a Python server that can handle multiple requests at the same time
  • Generate and store the user embeddings any time a new account is created

Since a new account has no posts to base the model on, give them the average user embedding of the whole site, or the average user embedding of their best/most representative community (e.g. gender is a community, city/state/country is a community, etc.; e.g. most users from New York have an embedding like this…)

  • Once embeddings for users are created, we can update them on a nightly basis, update them every time they make a post, and/or update the embedding after every ~10 posts the user engages with
  • When a user would like to see what other users are recommended for them, we compute the recommendations live, serve them, then store them for later use

We should ideally update these any time their embedding gets updated

  • When a user makes a request to see the posts most recommended to them, we should gather all posts, eliminate posts older than a certain creation time, calculate the cosine similarity between the users and these filtered posts, filter out posts below a certain similarity, do a reranking using extra data (e.g. current engagement of post, median engagement of poster, style of post, etc.), serve the results, and store these results for that user.

At any step in this process, if we don’t have a minimum amount of potential posts to show the user, then we should be willing to go further back in creation time (making sure we don’t recalculate posts the user has already seen in the first list)

Once a user has reached the end of a list, we should go back in creation time to get older posts and do calculations (making sure we don’t recalculate posts the user has already seen in the first list)

  • When a user requests to see posts similar to a certain post (and they’re not specifically asking from a pre-specified category, e.g. more posts by this user, this region, this brand, etc.), we should gather all posts, eliminate posts older than a certain creation time, calculate the cosine similarity between the seen post and these filtered posts, filter out posts below a certain similarity, do a reranking using extra data (e.g. current engagement of post, median engagement of poster, style of post, etc.), serve the results, and store these results for that post.

At any step in this process, if we don’t have a minimum amount of potential posts to show the user, then we should be willing to go further back in creation time (making sure we don’t recalculate posts the user has already seen in the first list)

Once a user has reached the end of a list, we should go back in creation time to get older posts and do calculations (making sure we don’t recalculate posts the user has already seen in the first list)

  • We should experiment with online per user Bandit models (e.g. Multi-armed bandit, Contextual Bandit, etc.).

These models are used at Netflix and other Silicon Valley companies for optimizing rankings. Bandit models are given a set of good options (e.g. posts with high cosine similarity), but since our latent features will never fully capture the intricacies of each user (due to data or model limitations), bandit models learn the best posts on a per-user basis through exploration of options and exploitation of the best options. (A minimal sketch of this idea appears after this list.)

  • In a list of posts that a user has seen, we should keep track of the ones the user stayed at the longest and/or engaged with.

We can then use this as a source of truth (i.e. labels for a model), in order to train a model to learn the correct rankings of posts based on a user embedding

NOTE: When I state “store”, it means a combination of storing the data in database for later use and storing it in a cache system like Redis for frequent use.
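
For the bandit bullet above, here's a minimal epsilon-greedy sketch of the per-user idea; the arm/reward framing (posts as arms, engagement as a 0/1 reward) is an assumption for illustration, not the exact production design:

```python
import random
from collections import defaultdict

class PerUserEpsilonGreedy:
    """Epsilon-greedy bandit over one user's candidate posts
    (candidates already pre-filtered by cosine similarity)."""

    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.pulls = defaultdict(int)  # post_id -> times served
        self.wins = defaultdict(int)   # post_id -> times engaged

    def choose(self, candidate_post_ids):
        if random.random() < self.epsilon:        # explore
            return random.choice(candidate_post_ids)
        # exploit: highest observed engagement rate (unseen arms first)
        return max(candidate_post_ids,
                   key=lambda pid: self.wins[pid] / self.pulls[pid]
                   if self.pulls[pid] else float("inf"))

    def update(self, post_id, engaged):
        self.pulls[post_id] += 1
        self.wins[post_id] += int(engaged)
```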

This notebook was created by Shane Austrie — Machine Learning Engineer (Specializing in Personalization and Recommender Systems)

For more articles covering coding, music, dating, and the overall urban nerd lifestyle follow me or The Urban Nerd publication!
