Improving the Stack Overflow search algorithm using Semantic Search and NLP
If you are a person who has ever tried to write a piece of code, you are sure to have come across Stack Overflow (it’s that famous). For those who live under a rock, Stack Overflow is one of the largest Q&A platforms for programmers. Users post questions/doubts and their fellow peers try to provide solutions in the most helpful manner possible. The better an answer, the more votes it gets, which also increases the answerer’s reputation.
Given its popularity, it’s safe to say that there is a buttload of data on there. However, this huge amount of information also makes it difficult to find the solution one is looking for. It’s not that big of an issue for programming veterans and other experienced professionals, because they know the right keywords to get an appropriate answer. For a baby programmer, however, this poses a real problem. For instance, if they need to learn ‘how to make a server’ using Python, it is quite unlikely that they would type ‘Django’ or ‘Flask’ into the search box. This can end up discouraging them from using the platform at all.
Therefore, an optimal search engine is necessary to navigate through this mess. However, as it currently stands, the search engine of Stack Overflow suffers from a few bugs of its own. Let me illustrate with an example:
Let’s say that I’m a beginner in NodeJS and I want to run my app. So I go over to Stack Overflow and type the following: ‘Node — how to run node app.js?’. This is what I get:
However, there does exist a result for ‘Node — how to run app.js?’:
Ideally, it should have returned the result for both the queries, but it didn’t.
‘Houston, we have a problem!’
But why bother, you ask? A major source of income for Stack Overflow is ad revenue, so their goal is to maximize user engagement in order to push more ads and earn more money. Due to the suboptimal performance of their search engine, a user would have difficulty getting their doubts cleared on the website and would instead turn to a more sophisticated search engine like Google.
The problem arises when Google suggests a resource other than Stack Overflow. For every user that leaves their website, they lose money they could potentially have made.
A possible solution
Our objective is for the platform to actually understand the content of what the user is trying to search for, and then return the most similar results based on that.
Natural Language Processing (NLP) has come a long way since its inception in the 20th century. This subfield of Artificial Intelligence has proven to work very well in the past few years due to fast processors and sophisticated model architectures and thus has immense potential for solving various language comprehension tasks.
The solution is broken into 2 subtasks:
- Tag Prediction: Given the user query, we would like to predict which tags the given query best belongs to
- Information retrieval: Based on a user query, return an existing question which most closely resembles the user’s query
To give you a bird’s-eye view, the flow of the solution is as follows:
- Collecting Stack Overflow question/answer data from Google BigQuery
- Pre-processing and Normalizing the collected data
- Filter out only the most common tags (Used 500 tags)
- Training Word Embeddings using Word2Vec
- Training a Tag Classifier using an LSTM model (hint: you cannot use binary cross-entropy loss to train it. Read more to know why)
- Using the trained word embeddings to create sentence embeddings for all questions in our database
- Comparing a query to each sentence and ranking semantically similar results using a measure based on cosine distance.
- Creating a web application using ReactJS and Flask to deploy the trained model
For people with flowchart fetishes, here you go:
Nothing too flashy, right? If you are familiar with each of these concepts, feel free to jump right into the code (it’s well documented as well *codegasm*).
Baby programmers, please keep reading :)
A little side note: I won’t be explaining each and every line of code in the article (the Jupyter notebooks in the repo already take care of that). Instead, I’ll focus on key results and the intuition behind them.
1. Data Collection
To understand and learn from the data, I need to gather Questions and Answers that were posted on Stack Overflow. Thus what I need are the following:
- Title
- Question body
- Answers for that question
- Votes for each answer
Due to the sheer abundance of data on Stack Overflow, and to keep sanity checks manageable, I restricted the data to only “Python”-related questions. However, the entire process is reproducible for other topics as well.
The Google BigQuery dataset includes an archive of Stack Overflow content, including posts, votes, tags, and badges. This dataset is updated to mirror the Stack Overflow content on the Internet Archive and is also available through the Stack Exchange Data Explorer. More info about the dataset is available on the Kaggle Stackoverflow Dataset page.
The required data for our task can be collected through the following SQL Query:
SELECT q.id, q.title, q.body, q.tags, a.body AS answers, a.score
FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
  ON q.id = a.parent_id
WHERE q.tags LIKE '%python%'
LIMIT 500000
This query joins two tables (stackoverflow.posts_questions and stackoverflow.posts_answers) and collects the required data for 500,000 Python-related question/answer rows. Thus, each row contains a question and one of the answers it has. (Note: there may be rows with identical questions but unique answers.)
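If you prefer to pull the results programmatically, here is a minimal sketch using the official BigQuery Python client; the project id is a placeholder and you need BigQuery access configured:
from google.cloud import bigquery
# Placeholder project id; requires google-cloud-bigquery and valid credentials
client = bigquery.Client(project="your-gcp-project")
sql = """
SELECT q.id, q.title, q.body, q.tags, a.body AS answers, a.score
FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
  ON q.id = a.parent_id
WHERE q.tags LIKE '%python%'
LIMIT 500000
"""
df = client.query(sql).to_dataframe()   # materialize the result as a pandas DataFrame
df.to_csv("stackoverflow_python_qa.csv", index=False)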
2. Data Preprocessing and Normalization
Data Preprocessing 101 — Check for missing values:
df.isna().sum()
--------------
id         0
title      0
body       0
tags       0
answers    0
score      0
dtype: int64
As I mentioned before, our dataset may contain rows with identical questions but unique answers. We want to combine these rows into one, summing the votes and concatenating all the answers.
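A minimal sketch of that aggregation with pandas, assuming the column names from the BigQuery export:
import pandas as pd
# Collapse duplicate questions: concatenate their answers and sum their scores
df = (
    df.groupby(['id', 'title', 'body', 'tags'], as_index=False)
      .agg({'answers': ' '.join, 'score': 'sum'})
)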
Max score before: 5440
Max score after: 9730
Now that we have our raw data finally ready, we can move on to cleaning and preprocessing this raw text data. I implemented basic text processing, including the following steps:
- Convert raw text into tokens
- Convert tokens to lower case
- Remove punctuations
- Remove Stopwords
Note: I skipped removal of numeric data since I felt it would remove precious contextual information.
I also skipped a ‘Stemming/Lemmatization’ step because I did not want to alter the domain-specific terms used in our corpus and risk losing precious information.
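A minimal sketch of these cleaning steps, assuming NLTK’s tokenizer and English stopword list (remember: numbers are deliberately kept and no stemming/lemmatization is applied):
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Requires: nltk.download('punkt') and nltk.download('stopwords')
STOPWORDS = set(stopwords.words('english'))
def clean_text(text):
    tokens = word_tokenize(text)                                   # raw text -> tokens
    tokens = [t.lower() for t in tokens]                           # lower case
    tokens = [t for t in tokens if t not in string.punctuation]    # remove punctuation
    tokens = [t for t in tokens if t not in STOPWORDS]             # remove stopwords
    return ' '.join(tokens)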
If you look at the dataset closely, you’ll observe that the raw text for questions and answers comes with the HTML markup it was originally displayed with on Stack Overflow, usually p tags, h1-h6 tags, and code tags. Since I only need the text section of each post, I performed the following steps:
- I constructed a new feature column called ‘post_corpus’ by combining the title, question body, and all the answers (this will be used later to train the Word2Vec embeddings)
- I prepended the title to the question body
- I skipped the ‘code’ sections because they do not offer useful information for our task
- I constructed URLs for each question by appending the question id to ‘https://stackoverflow.com/questions/'
- I constructed 2 features for sentiment using the open-source TextBlob library
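A rough sketch of the markup stripping and sentiment features, assuming BeautifulSoup for the HTML and TextBlob’s polarity/subjectivity as the two sentiment signals (the exact feature definitions are in the notebook):
from bs4 import BeautifulSoup
from textblob import TextBlob
def strip_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    for code in soup.find_all('code'):      # drop code sections entirely
        code.decompose()
    return soup.get_text(separator=' ')
def sentiment_features(text):
    blob = TextBlob(text)
    return blob.sentiment.polarity, blob.sentiment.subjectivity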
Each post has a variable number of different tags. To narrow down the vast choices for a more accurate model, I decided to go with the 20 most common tags. The plan is to keep only the data that contains at least one of these most_common_tags:
['python', 'python-3.x', 'django', 'pandas', 'python-2.7', 'numpy', 'list', 'matplotlib', 'dictionary', 'regex', 'dataframe', 'tkinter', 'string', 'csv', 'flask', 'arrays', 'tensorflow', 'json', 'beautifulsoup', 'selenium']
Alright, just one last quick, boring-as-hell data processing step, then we get to the fun stuff :)
- I created a separate column for the ‘processed_title’ because I wanted to preserve the original titles to serve in the app
- I also normalized the numeric ‘scores’
3. Filter out only the most common tags
Although we did filter out a lot of tags in the Data processing step, there still exist a lot of “stray” tags which may appear only once or twice as compared to thousands of occurrences for other tags. This increases the dimensions of the ground truth data, which is not desirable for our model.
Thus we extract the top 500 tags based on their occurrences. (Given that we have ~140k data points, 500 tags seems to be a good number to experiment with)
Finally, we filtered each post’s tag list to keep only tags from this top-500 set, for better model accuracy.
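A minimal sketch of that filtering, assuming the tags come as the pipe-separated string that BigQuery returns (e.g. 'python|pandas'):
from collections import Counter
# Count every tag occurrence and keep the 500 most frequent ones
tag_counts = Counter(tag for tags in df['tags'] for tag in tags.split('|'))
top_tags = {tag for tag, _ in tag_counts.most_common(500)}
# Keep only the whitelisted tags for each post and drop posts left with none
df['tags'] = df['tags'].apply(lambda tags: [t for t in tags.split('|') if t in top_tags])
df = df[df['tags'].map(len) > 0]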
4. Training Word Embeddings using Word2Vec
For our model to understand the raw text data, we need to vectorize it. Bag of Words and TF-IDF are very common vectorization approaches. However, since I would be using an artificial neural network (an LSTM) as my model, the sparse nature of BoW and TF-IDF would pose a problem. Thus I decided to go for Word Embeddings, which are dense vector representations and therefore perfect for our neural network.
The way people converse on Stack Overflow is very technical, and they use a very specific vocabulary, so it is not a good idea to use pre-trained Word Embeddings (although Google has a lot of good ones) because they are trained on plain English text such as Shakespeare and would not be able to understand the relations between the words in our vocabulary. Thus I decided to train a Word Embeddings model from scratch using Word2Vec.
The post_corpus column in our dataset comes to shine here. Since Word2Vec is an unsupervised model that just requires a corpus to train on, we need to provide it with as much text as possible for it to understand the vocabulary. Thus we use post_corpus to train Word2Vec, because it is a combination of the title, the question, and all the answers for a post.
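A minimal sketch of that training step with gensim (gensim 4.x style; the vector size and window below are illustrative, not the tuned values):
from gensim.models import Word2Vec
# Tokenize the cleaned corpus: one list of tokens per post
sentences = [doc.split() for doc in df['post_corpus']]
# Train embeddings from scratch on the Stack Overflow vocabulary
w2v = Word2Vec(sentences, vector_size=300, window=5, min_count=5, workers=4)
# Sanity-check the learned vector space
print(w2v.wv.most_similar('django', topn=10))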
After successful training, we get the following results:
Terms most similar to "django"
[('flask', 0.5827779173851013), ('project', 0.5168731212615967), ('mezzanine', 0.5122816562652588), ('wagtail', 0.5001770257949829), ('drf', 0.4827461242675781), ('framework', 0.48031285405158997), ('cms', 0.47275760769844055), ('admin', 0.467496395111084), ('database', 0.4659809470176697), ('app', 0.46219539642333984)]
--------------------------------------------------------------------
Terms most similar to "api"
[('apis', 0.6121899485588074), ('webservice', 0.5226354598999023), ('service', 0.49891555309295654), ('framework', 0.4883273243904114), ('postman', 0.47500693798065186), ('webhook', 0.4574393630027771), ('rpc', 0.4385871887207031), ('oauth2', 0.41829735040664673), ('twilio', 0.4138619303703308), ('application', 0.4100519120693207)]
--------------------------------------------------------------------
Terms most similar to "gunicorn"
[('uwsgi', 0.5529206991195679), ('nginx', 0.5103358030319214), ('000080', 0.4971828758716583), ('supervisord', 0.4751521050930023), ('arbiterpy', 0.4701758027076721), ('iis', 0.46567484736442566), ('apache2', 0.45948249101638794), ('web1', 0.45084959268569946), ('fastcgi', 0.43996310234069824), ('supervisor', 0.43604230880737305)]
--------------------------------------------------------------------
Terms most similar to "server"
[('webserver', 0.5781407356262207), ('servers', 0.48877859115600586), ('application', 0.488214373588562), ('app', 0.4767988622188568), ('vps', 0.4679219126701355), ('client', 0.46672070026397705), ('localhost', 0.46468669176101685), ('service', 0.457424521446228), ('apache', 0.4540043771266937), ('nginx', 0.4490607976913452)]
Pretty fricking cool, right?
5. Training a Tag Classifier using an LSTM model
This section deals with our first subtask:
“Given the user query, we would like to predict which tags the given query best belongs to”
Preparing the data for training the model includes:
- One-Hot encoding the ground truth data
- Splitting into training and test sets
- Tokenizing and Padding
- Creating an embedding matrix
(I’ll include the code here; however, for detailed explanations of each section, refer to the Jupyter notebook in the repository.)
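Since the exact code lives in the notebook, here is a rough sketch of those four steps (the sequence length and split ratio are assumptions, not the tuned values):
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# 1. One-hot encode the (multi-label) ground truth tags
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(df['tags'])
# 2. Train/test split
X_train, X_test, y_train, y_test = train_test_split(df['processed_title'], y, test_size=0.2)
# 3. Tokenize and pad the question titles
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
X_train_pad = pad_sequences(tokenizer.texts_to_sequences(X_train), maxlen=100)
X_test_pad = pad_sequences(tokenizer.texts_to_sequences(X_test), maxlen=100)
# 4. Build an embedding matrix from the trained Word2Vec vectors
vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((vocab_size, 300))
for word, i in tokenizer.word_index.items():
    if word in w2v.wv:
        embedding_matrix[i] = w2v.wv[word]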
The model created for this subtask uses LSTM layers right after the Embedding layer, because of their proficiency with sequential and text data. Subsequent Dense layers are added, along with Dropout regularization, to build a robust model.
The secret sauce — THE LOSS FUNCTION: Since we are dealing with a multilabel classification problem here, we cannot train the model using a binary cross-entropy loss. This is because binary cross-entropy loss would push your model to predict one or two tags which are included in the ground truth, and won’t penalize it for missing out on the other ones.
Thus I used log loss applied to each class separately, and then averaged it over the number of classes.
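Putting the two previous paragraphs together, a sketch of what the architecture and loss could look like (layer sizes, dropout rate, and training settings are assumptions; the tuned values are in the notebook):
import tensorflow.keras.backend as K
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.initializers import Constant
def multilabel_log_loss(y_true, y_pred):
    # Per-class log loss, averaged over the number of classes
    y_true = K.cast(y_true, 'float32')
    y_pred = K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon())
    per_class = -(y_true * K.log(y_pred) + (1.0 - y_true) * K.log(1.0 - y_pred))
    return K.mean(per_class, axis=-1)
model = Sequential([
    Embedding(vocab_size, 300, embeddings_initializer=Constant(embedding_matrix), trainable=False),
    LSTM(128),
    Dense(256, activation='relu'),
    Dropout(0.5),
    Dense(y_train.shape[1], activation='sigmoid'),   # one independent probability per tag
])
model.compile(optimizer='adam', loss=multilabel_log_loss)
model.fit(X_train_pad, y_train, validation_data=(X_test_pad, y_test), epochs=5, batch_size=256)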
We can then get the tags for any sentence with the help of the following function:
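The original helper is in the repo; a rough sketch of what it could look like (the 0.5 threshold is an assumption):
def predict_tags(query, threshold=0.5):
    # Clean and vectorize the query exactly like the training titles
    seq = pad_sequences(tokenizer.texts_to_sequences([clean_text(query)]), maxlen=100)
    probs = model.predict(seq)[0]
    # Turn the per-tag probabilities back into tag names
    return mlb.inverse_transform((probs >= threshold).astype(int).reshape(1, -1))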
Test Case: selecting n1d array nd array numpy
--------------------------------------------------------------------
Predicted: [('arrays', 'numpy', 'python')]
Ground Truth: [('numpy', 'python')]
Test Case: taking info file
--------------------------------------------------------------------
Predicted: [('python',)]
Ground Truth: [('python', 'python-2.6', 'python-2.7')]
Test Case: python find txt continue causing fault
--------------------------------------------------------------------
Predicted: [('python',)]
Ground Truth: [('python', 'python-3.x')]
Test Case: fabric rsync read error connection reset peer 104
--------------------------------------------------------------------
Predicted: [('django', 'python')]
Ground Truth: [('django', 'python', 'windows')]
Test Case: fllter pandas dataframe multiple columns
--------------------------------------------------------------------
Predicted: [('dataframe', 'pandas', 'python')]
Ground Truth: [('pandas', 'python')]
It seems good enough for now 😃. Done with this subtask, moving on to the next, let’s gooo~
6. Using the trained word embeddings to create sentence embeddings for all questions in our database
Subtask 2 demands the following:
“Based on a user query, return an existing question which most closely resembles the user’s query”
To compare how similar two sentences are, we can find the ‘distance’ between them. To be able to compute such a distance, the sentences must lie in the same vector space; this is what sentence embeddings give us. One of the most popular distance measures, which I have also used, is the cosine distance. The smaller the cosine distance, the higher the similarity between the two vectors.
To calculate the embedding for an entire sentence, I defined the following function, which averages the embeddings of each valid token, and used it to calculate sentence embeddings for all questions in our database:
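The actual function is in the notebook; a minimal sketch of the idea looks like this (tokens the Word2Vec model never saw are simply skipped, and the column names are assumptions):
import numpy as np
def sentence_embedding(sentence, w2v, dim=300):
    vectors = [w2v.wv[token] for token in sentence.split() if token in w2v.wv]
    if not vectors:
        return np.zeros(dim)          # fall back to a zero vector if no token is known
    return np.mean(vectors, axis=0)   # average the word vectors into one sentence vector
df['question_embedding'] = df['processed_title'].apply(lambda s: sentence_embedding(s, w2v))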
7. Comparing a query to each sentence and ranking semantically similar results using a measure based on cosine distance
Although cosine distance alone is sufficient to give us a good result for this specific task, I decided to define a custom similarity measure. It is given as follows:
where q = user query and t = an existing question
- It considers the cosine distance as a base measure
- It takes into account the popularity of the post, based on the votes it has received from users on Stack Overflow
- It takes into account the overall sentiment of the responses people have made. A positive sentiment suggests that the answers were helpful and that the post is a good one
Running the following code snippet shows the algorithm in action:
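The exact snippet (and the exact weighting of the three signals) lives in the repo; purely as an illustration of combining them, a search step could look like this (the column names and the 0.1 weights are assumptions):
from scipy.spatial.distance import cosine
def similarity(query_vec, row, w_votes=0.1, w_sentiment=0.1):
    cos_sim = 1 - cosine(query_vec, row['question_embedding'])      # base measure
    return cos_sim + w_votes * row['score'] + w_sentiment * row['sentiment_polarity']
def search(query, top_k=10):
    q_vec = sentence_embedding(clean_text(query), w2v)
    scores = df.apply(lambda row: similarity(q_vec, row), axis=1)
    return df.assign(similarity=scores).nlargest(top_k, 'similarity')[['title', 'question_url', 'similarity']]
print(search('Combine lists of lists'))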
For user query “Combine lists of lists”, we get:
The web application simply presents these results through a user interface.
I am not going to discuss how I created the web app in this post; you can check that code out in the repository itself. If you have any doubts or any constructive feedback (toxic haters can go suck a di..), please feel free to leave a response below. I’d love to hear from you 😄
Until then, Ciao~

