Bag of Words to tackle UGC Search Relevancy at Scale?
A look at the possibility of augmenting search relevance at scale from noisy, sarcastic & perhaps toxic user-generated content (UGC) with the bag-of-words model.
Note: the challenge of tackling relevance at scale with noisy UGC data is largely generalized for the sake of this blog post.

Let's say you are managing the search platform for a site like YouTube, Reddit or perhaps even Facebook, and you would like to incorporate comments data to augment the discoverability of a user post.
Consider an example where a user of the site posted a picture of themselves with the following title:
“Getting married next week. Lost a significant amount of weight since I got engaged.” (post accompanied by a before/after picture of themselves)
A sample set of comments from the user community for that post is shown in the snippet below (this example suffers from only a very mild level of noise).
The usual approach is to index comments from that post as additional searchable metadata (on large-scale platforms there are usually billions of these comments across millions of user posts).
Problem solved? Maybe, or maybe not!
The challenge, though, is that with noisy UGC data like comments it is hard to generate value with the usual algorithmic relevance alone. While YouTube suffers from the toxic nature of its comments, Reddit is largely characterized by the sarcastic & circle-jerk behavior of its users. This presents some unique challenges for relevance at scale with noisy UGC data.
Yet at the same time, these comments contain very valuable bits of relevant terms that can help make a post discoverable for keyword searches that would otherwise likely have missed it.
In short: when indexed as searchable metadata, comments have the potential to cause significantly more false positives than true positives. In other words, they increase recall many times over at the expense of precision and, more importantly, slow down overall search performance.

How do we reduce the noise & extract relevant terms from data like this? We need to solve for increased recall without a loss in precision.
One of the approaches I would like to discuss is the bag-of-words model (this is by no means the only way to solve this).
Bag-of-words is one of the simplest but also most widely used techniques in NLP, and a great approach to start with for any text-based problem. A bag-of-words representation is one in which you count the occurrences of words across all texts and use the counts of the top-N words as N new features.
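For instance, here is a minimal sketch of that representation using scikit-learn's CountVectorizer; the sample comments and the value of N are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

comments = [
    "congrats on the weight loss, you look amazing",
    "what an amazing transformation, congrats",
    "keto or just diet and exercise?",
]

# Keep only the top-N most frequent words across all comments as features.
vectorizer = CountVectorizer(max_features=10)
X = vectorizer.fit_transform(comments)

print(vectorizer.get_feature_names_out())  # the N vocabulary terms
print(X.toarray())                         # per-comment counts of those terms
```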
General processing steps (a rough sketch of these follows the note below):
• cleansing
• stopwords
• tokenization
• stemming
• vectorization
“There are lots of great online resources to read about the bag-of-words model if it is new to you.”
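Here is a minimal sketch of those processing steps in Python, assuming NLTK is available; the cleansing rules and the stopword list are deliberately simplified.

```python
import re
from nltk.stem import PorterStemmer

STOPWORDS = {"the", "a", "an", "and", "or", "is", "you", "i", "on", "of", "so"}
stemmer = PorterStemmer()

def preprocess(comment: str) -> list[str]:
    # Cleansing: lowercase and strip anything that isn't a letter or whitespace.
    cleaned = re.sub(r"[^a-z\s]", " ", comment.lower())
    # Tokenization: split on whitespace.
    tokens = cleaned.split()
    # Stopword removal + stemming.
    return [stemmer.stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("Congrats!! You look SO amazing, what a transformation :)"))
# -> ['congrat', 'look', 'amaz', 'what', 'transform']
```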
In this case we implement a variant of the bag-of-words model; it differs in that, in the vectorization step, we treat each comment as a document in its own right.

In other words, we build a simple streaming in-memory tf-idf index for each post, comprised entirely of the comments related to that post. Once it is complete, for every term in that index we run a query against the index to compute its tf-idf score, and the result is the top-N terms by tf-idf score that are highly relevant to that post.
Building an in-memory, real-time tf-idf based index is relatively easy to achieve on a stack like Apache Solr & Apache Spark.
There are lots of great online resources to learn about classic tf-idf similarity. In this case the tf-idf score is calculated simply as follows:
tfidf(t, d, D) = tf(t, d) · idf(t, D)
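Here is a minimal sketch of this per-post variant using scikit-learn in place of Solr/Spark: each comment of a single post is treated as its own document, and the top-N terms are ranked by their aggregated tf-idf score. The sample comments are, of course, made up.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_terms_for_post(comments: list[str], n: int = 10) -> list[str]:
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(comments)       # rows = comments, columns = terms
    scores = np.asarray(tfidf.sum(axis=0)).ravel()   # aggregate tf-idf score per term
    terms = vectorizer.get_feature_names_out()
    ranked = sorted(zip(terms, scores), key=lambda pair: -pair[1])
    return [term for term, _ in ranked[:n]]

comments = [
    "congrats, amazing weight loss!",
    "what a transformation, perfect wedding gift to yourself",
    "did you do keto? incredible weight loss",
]
print(top_terms_for_post(comments, n=5))
```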
Once we have the top-N terms, we could simply index them as additional searchable metadata for that post, perhaps in a multi-valued field, and weight it according to your overall relevance needs.
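As an example of what that might look like on Solr, here is a sketch using pysolr; the core URL, the document id and the multi-valued field name comment_terms are assumptions for illustration, not a prescribed schema.

```python
import pysolr

# Assumes a Solr core for posts with a multiValued "comment_terms" field in its schema.
solr = pysolr.Solr("http://localhost:8983/solr/posts", always_commit=True)

solr.add([{
    "id": "post-12345",
    "title": "Getting married next week. Lost a significant amount of weight since I got engaged.",
    "comment_terms": ["weight", "loss", "transformation", "wedding", "keto"],  # top-N terms
}])
```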
Example bag-of-words top-N terms by tf-idf score:
Possible Search Matches:
amazing weight loss
wedding transformation
keto weight loss
..{more}..

You'll see some queries listed in the above snippet that become possible thanks to the top terms generated by the bag-of-words model from that particular post's user comments.
This field can be weighted higher or lower compared to other metadata attributes, depending on the overall relevance need.
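For example, with Solr's edismax query parser the comment-derived field could be boosted below the title at query time; the field names and boost values below are illustrative only.

```python
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/posts")

# Boost matches on the title higher than matches on comment-derived terms.
results = solr.search(
    "wedding transformation",
    **{"defType": "edismax", "qf": "title^2.0 comment_terms^0.5"},
)
for doc in results:
    print(doc["id"])
```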
Note: since the bag-of-words model is a decomposition of text that ignores the context of these terms, we won't be able to do phrase matching. (You can force all terms to be present for a document to be considered a match, but it is still not quite the same.)

As with any model, accuracy depends on how you implement it: it has the potential to drop some highly relevant terms while including some less relevant terms, to a varying degree.
There are a number of parameters to perform several heuristic evaluations with (a rough sketch of applying them follows the list):
• minimum-tf
• maximum-idf
• length filter
• engagement filter
• depth filter
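Here is a rough sketch of what those filters could look like; the thresholds and the Comment fields (likes, depth) are assumptions for illustration rather than production values.

```python
from dataclasses import dataclass

@dataclass
class Comment:
    text: str
    likes: int   # engagement signal, e.g. upvotes
    depth: int   # 0 = top-level comment, 1 = reply to a comment, ...

def keep_comment(c: Comment,
                 min_length: int = 10, max_length: int = 500,
                 min_likes: int = 2, max_depth: int = 1) -> bool:
    return (min_length <= len(c.text) <= max_length   # length filter
            and c.likes >= min_likes                   # engagement filter
            and c.depth <= max_depth)                  # depth filter

def keep_term(tf: int, idf: float,
              min_tf: int = 2, max_idf: float = 4.0) -> bool:
    return tf >= min_tf and idf <= max_idf             # minimum-tf / maximum-idf filters

comments = [
    Comment("congrats, amazing weight loss!", likes=12, depth=0),
    Comment("lol", likes=0, depth=3),
]
print([c.text for c in comments if keep_comment(c)])
```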
Beyond the bag-of-words model:
The above is one approach we can explore, but there are more advanced & sophisticated approaches:
• Phrase extraction
• Statistically improbable phrases
• n-gram models
… and more
In a follow-up post, I'll talk about how we implemented this in production and how we measured its performance. Hope you found this post to be a useful read! I'd love to hear your feedback.

