Stackoverflow Based Semantic Search Engine

VASHIST NARAYAN Singh
Published in Analytics Vidhya · 10 min read · Mar 4, 2020

If you have written code in your life, you must be familiar with StackOverflow. For those who are not familiar with it, StackOverflow is one of the largest Q&A platforms for programmers. Users post questions or doubts, and their fellow programmers try to provide solutions in the most helpful manner possible. The better an answer, the more votes it gets, which also improves the reputation of the user who posted it.

Owing to its popularity, it’s safe to say that the platform holds a lot of data. However, this huge amount of information also makes it difficult to search for the solution one is looking for. That is not a big issue for experienced programmers, because they know how to optimize a search query with the proper keywords. But for someone who is new to this world, it is not that simple. Therefore, an optimal search engine is necessary to navigate through this mess.

Index

  1. The Task
  2. Business Problem
  3. ML Problem
  4. Approach
  5. Data
  6. Library Import
  7. Data Analysis and processing
  8. Tag predictor model
  9. Creating the Search pipeline
  10. Scope of Improvement
  11. References

1. The Task

Our objective is for the platform to actually understand the content of what the user is trying to search for, and then return the most similar results based on that. To do this, I need to gather questions and answers that were posted on Stack Overflow. Thus what I need is the following:

  • Title
  • Question body
  • Answers for that question
  • Votes for each answer

2. Business problem

StackOverflow has a huge amount of information, which makes it difficult to search for the solution one is looking for. It’s not that big of an issue for programming veterans and other experienced professionals, because they are aware of the correct keywords required to get an appropriate answer. However, for a beginner programmer, this poses a great concern. For instance, if he needs to learn ‘how to make a server’ using Python, it is quite unlikely that he would use the terms ‘Django’ or ‘Flask’ in the search box. Thus there is a need for an optimal search engine.

3. ML Problem

We have to build a semantic-based search engine that uses the StackOverflow data to do the search.

Performance metrics

I am using the micro F1 score as the metric for my tag predictor model. Since it is a multilabel classification model, the micro F1 score is a good metric to choose.
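As a quick illustration (the arrays below are made-up toy labels, not the project’s data), micro F1 can be computed directly with scikit-learn on multilabel indicator arrays:

import numpy as np
from sklearn.metrics import f1_score

# Toy multilabel ground truth and predictions (rows = samples, columns = tags)
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

# Micro-averaging pools true positives, false positives, and false negatives across all tags
print(f1_score(y_true, y_pred, average='micro'))  # 0.75 for this toy example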

Constraints

Since it is a search engine, the latency should be low.

4. Approach

  • First, I will get the data from https://www.kaggle.com/stackoverflow/stackoverflow and do some preprocessing and feature engineering on it.
  • Then I will train a deep learning model to predict the tags of the input query so that I know which category the query belongs to; my metric will be the micro F1 score. I am also using a Word2Vec model for vectorization, since I use cosine similarity along with the number of votes and the sentiment polarity of the answers to select the most voted and positive responses, and for that I need to represent my query in the same vector space.
  • In the end, I will create a small UI to accept the input. When the user enters a new query, I will first process it, then predict the tags from it, and then filter the main StackOverflow data down to the posts related to the predicted tags. After doing this, I restrict my search and embeddings to that filtered corpus only, rank by similarity along with votes and positive sentiment, and return the results to the user.

5. Data

I will use the https://www.kaggle.com/stackoverflow/stackoverflow dataset. It includes an archive of Stack Overflow content, including posts, votes, tags, and badges, and is updated to mirror the Stack Overflow content on the Internet Archive.

6. Library Import

Getting the data

Raw Data from Google BigQuery

Observation

Basically, this query joins two tables (stackoverflow.posts_questions and stackoverflow.posts_answers) and collects the required data for 10,000,000 questions. Thus, each row contains a question and one of the answers it has. (Note: there may be rows with identical questions but unique answers.)
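Purely for illustration, a query of that shape against the public BigQuery Stack Overflow dataset could look roughly like this (the selected columns, the join condition, and the LIMIT are my assumptions, not necessarily the exact query used in the project):

from google.cloud import bigquery

client = bigquery.Client()

# Join questions with their answers; each returned row is one (question, answer) pair.
query = """
SELECT
  q.id    AS question_id,
  q.title AS title,
  q.body  AS question_body,
  q.tags  AS tags,
  a.body  AS answer_body,
  a.score AS answer_score
FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
JOIN `bigquery-public-data.stackoverflow.posts_answers`  AS a
  ON a.parent_id = q.id
LIMIT 10000000
"""

original_data = client.query(query).to_dataframe()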

7. Data analysis and processing

Missing value check

original_data.isna().sum()
Missing value output

Observation

  • It’s good to see that there are no missing values in the dataset.
  • Hence there is no need for data imputation.

Checking for Duplicate in data if any

original_data.duplicated().any()
--------------------------------
True

As mentioned above, the dataset may contain rows with identical questions but unique answers. We want to combine all these different rows into one, while summing up the votes of each answer and concatenating all the answers.
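A minimal pandas sketch of this aggregation, assuming the (question, answer) pairs live in columns named question_id, title, question_body, tags, answer_body, and answer_score (the real column names may differ):

# Collapse the (question, answer) pairs into one row per question:
# all answers are concatenated and their votes are summed.
combined = (
    original_data
    .groupby('question_id', as_index=False)
    .agg({
        'title': 'first',
        'question_body': 'first',
        'tags': 'first',
        'answer_body': ' '.join,   # concatenate all answers of the question
        'answer_score': 'sum',     # total votes across those answers
    })
)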

Data After combining rows with identical questions

Now that we have prepared our raw data, we can move on to cleaning and preprocessing this raw text. I am doing some basic text processing, including the following steps (a small sketch of such a cleaning function follows the list):

  • Tokenization of the text
  • Converting the tokens to lowercase
  • Removing the punctuation from the tokens list
  • Removing the stopwords from the tokens list
  • Tokenization of the code string
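A minimal sketch of such a cleaning function, assuming NLTK for tokenization and stopwords (the project keeps its own preprocess_text utility in a utils notebook):

import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Requires: nltk.download('punkt') and nltk.download('stopwords')
STOPWORDS = set(stopwords.words('english'))

def preprocess_text(text):
    # Tokenize, lowercase, drop punctuation-only tokens and stopwords.
    tokens = word_tokenize(text)
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in string.punctuation]
    tokens = [t for t in tokens if t not in STOPWORDS]
    return ' '.join(tokens)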

If you look at the dataset closely, you’ll observe that the raw text for questions and answers is given along with the HTML markup with which it was originally displayed on StackOverflow. These are usually ‘p’ tags, ‘h1’–‘h6’ tags, and ‘code’ tags. So my approach is as follows (a sketch of these steps follows the list):

  • I constructed a new feature column called ‘post_corpus’ by combining the title, question body, and all the answers.
  • I prepended the title to the question body.
  • I used the ‘code’ sections to extract some features.
  • I constructed URLs for each question by appending the question id to ‘https://stackoverflow.com/questions/'.
  • I constructed two sentiment features using the open-source TextBlob library.
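A sketch of how these pieces could be pulled out of the raw HTML, assuming BeautifulSoup and TextBlob (the column and feature names here are illustrative, not the project’s exact ones):

from bs4 import BeautifulSoup
from textblob import TextBlob

def build_features(row):
    soup = BeautifulSoup(row['question_body'], 'html.parser')

    # Extract the <code> sections separately and keep the remaining text.
    code = ' '.join(tag.get_text() for tag in soup.find_all('code'))
    body_text = soup.get_text()

    # Combine title, question body, and all the answers into one searchable corpus.
    answers_text = BeautifulSoup(row['answer_body'], 'html.parser').get_text()
    post_corpus = f"{row['title']} {body_text} {answers_text}"

    # Question URL plus two simple sentiment features from the answers.
    url = f"https://stackoverflow.com/questions/{row['question_id']}"
    sentiment = TextBlob(answers_text).sentiment   # namedtuple (polarity, subjectivity)

    return post_corpus, code, url, sentiment.polarity, sentiment.subjectivity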

Each post has a variable number of different tags. In order to narrow down the vast choices for a more accurate model, I decided to go with the 100 most common tags. The plan is to keep only the data that contains at least one of these most_common_tags.
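A sketch of that filtering step, assuming the tags column is a ‘|’-separated string as in the BigQuery export:

from collections import Counter

# Count every tag occurrence and keep the 100 most common ones.
tag_counts = Counter(tag for tags in data['tags'] for tag in tags.split('|'))
most_common_tags = {tag for tag, _ in tag_counts.most_common(100)}

# Keep only the rows that contain at least one of the common tags.
has_common_tag = data['tags'].apply(
    lambda tags: any(tag in most_common_tags for tag in tags.split('|'))
)
data = data[has_common_tag]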

The last few tasks in the preprocessing step are as follows:

  • I created a separate column for the ‘processed_title’ because I wanted to preserve the original titles in order to serve them in the final output.
  • I am also normalizing the numeric ‘score’ values.
  • I also created a separate column for the ‘code’ because I want to use it for the tag prediction.

8. Tag predictor model

Importing library

Filter out only the most common tags

Although we did filter out a lot of tags in the data preprocessing step, there still exist a lot of ‘stray’ tags which may appear only once or twice, as compared to thousands of occurrences for other tags. This increases the dimensions of the ground truth data, which is not desirable for our model.

Thus I extracted the top 1000 tags based on their occurrences. Finally, I altered the tag data to only include tags from these 1000, for better model accuracy.

Training Word Embeddings using Word2Vec

In order for our model to understand the raw text data, we need to vectorize it. Bag of Words (BoW) and TF-IDF are very common approaches for vectorizing. However, since I will be using an artificial neural network as my model, the sparse nature of BoW and TF-IDF would pose a problem. Thus I decided to go for word embeddings, which are dense vector representations and thus perfect for our neural network.

The way people converse on StackOverflow is very technical and they use a very specific vocabulary, so it is not a good idea to use pre-trained word embeddings: they are trained on plain English text and would not be able to capture the relations between the words in our vocabulary. Thus I decided to train a word embeddings model from scratch using Word2Vec.
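Training such a model with gensim is straightforward; a minimal sketch (the vector size, window, and minimum count below are assumptions, using the gensim 4.x API):

from gensim.models import Word2Vec

# Each document is the token list of one cleaned post corpus.
sentences = [doc.split() for doc in data['post_corpus']]

w2v_model = Word2Vec(
    sentences,
    vector_size=300,   # dimensionality of the word vectors
    window=5,
    min_count=5,       # ignore very rare tokens
    workers=4,
)

# Inspect the learned neighbourhood of a term.
print(w2v_model.wv.most_similar('python'))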

Thus after successful training, we get the following results:

Terms most similar to "python"
[('python3', 0.6342744827270508), ('perl', 0.5586778521537781), ('bash', 0.5048258304595947), ('python36', 0.4876842796802521), ('pyhton', 0.482374906539917), ('python27', 0.4808519780635834), ('lua', 0.4791009724140167), ('ruby', 0.4676322937011719), ('python2', 0.46161943674087524), ('matlab', 0.46039116382598877)]
--------------------------------------------------------------------
Terms most similar to "Node"
[('nodes', 0.6288038492202759), ('firstnode', 0.5410162210464478), ('nodetext', 0.5174486041069031), ('new_node', 0.5148501992225647), ('node1', 0.5093669891357422), ('nodedata', 0.4883999228477478), ('newnode', 0.481780469417572), ('nodename', 0.47551754117012024), ('node2', 0.4689573049545288), ('node3', 0.45593389868736267)]
--------------------------------------------------------------------
Terms most similar to "java"
[('scala', 0.4836808145046234), ('groovy', 0.46810388565063477), ('c', 0.4596169888973236), ('python', 0.44173622131347656), ('jdk7', 0.4374290406703949), ('kotlin', 0.4267594814300537), ('jdk', 0.42633625864982605), ('eclipse', 0.4209193289279938), ('netbean', 0.41293397545814514), ('jdk6', 0.40997016429901123)]
--------------------------------------------------------------------
Terms most similar to "server"
[('sever', 0.6860402822494507), ('client', 0.666864812374115), ('webserver', 0.5926132798194885), ('servers', 0.5850958824157715), ('machine', 0.5267062187194824), ('service', 0.5109481811523438), ('remote', 0.5007282495498657), ('backend', 0.4988163709640503), ('remotely', 0.4766204357147217), ('locally', 0.47090765833854675)]

Training neural models

Preparing the data for training the model includes:

  1. One-Hot encoding the data
  2. Splitting into training and test data
  3. Tokenizing and padding
  4. Creating an embedding matrix

Before splitting, I combined the post_corpus and code columns, because I wanted to include the code section of the question as well: it also contains information that can be used for predicting the tag.
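A condensed sketch of these preparation steps, assuming Keras and scikit-learn (the sequence length, split ratio, and column names are assumptions):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 300

# 1. One-hot encode the tag lists (multilabel targets).
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(data['tags'].str.split('|'))

# Combine the post corpus with the code text before splitting.
texts = (data['post_corpus'] + ' ' + data['code']).values

# 2. Split into training and test data.
X_train_txt, X_test_txt, y_train, y_test = train_test_split(texts, y, test_size=0.2)

# 3. Tokenize and pad to a fixed length.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train_txt)
X_train = pad_sequences(tokenizer.texts_to_sequences(X_train_txt), maxlen=MAX_LEN)
X_test = pad_sequences(tokenizer.texts_to_sequences(X_test_txt), maxlen=MAX_LEN)

# 4. Build the embedding matrix from the trained Word2Vec vectors.
vocab_size = len(tokenizer.word_index) + 1
embedding_dim = w2v_model.wv.vector_size
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, idx in tokenizer.word_index.items():
    if word in w2v_model.wv:
        embedding_matrix[idx] = w2v_model.wv[word]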

I am training three models here, a GRU, a CNN, and an LSTM, and based on the micro F1, micro recall, and micro precision results of all three I choose the final model.
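For reference, here is a minimal Keras sketch of a GRU-based classifier of this kind; the layer sizes and training settings are assumptions, not the exact architectures of the three models shown below:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dense
from tensorflow.keras.initializers import Constant

n_tags = y_train.shape[1]   # number of target tags (e.g. 1000)

model = Sequential([
    # Embedding layer initialised with the pretrained Word2Vec vectors and kept frozen.
    Embedding(vocab_size, embedding_dim,
              embeddings_initializer=Constant(embedding_matrix),
              trainable=False),
    GRU(128),
    Dense(n_tags, activation='sigmoid'),   # sigmoid for independent multilabel outputs
])

model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(X_train, y_train, validation_data=(X_test, y_test),
          epochs=5, batch_size=256)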

Model 1

Model 2

Model 3

Model Comparison


Observation

  • I trained three neural network models: one GRU-based, one CNN-based, and one LSTM-based.
  • The best recall, precision, and F1 values were obtained with the GRU model, so I chose this model.

9. Creating the Search pipeline

So far, we have trained the model that will predict the tags of the user’s input search query.

Now in this section, I will show you how we are going to use this predicted tag, along with some other simple but useful tricks and cosine similarity, to get the most similar results for the input search query.

To know how similar two sentences are, we can compute the distance between them. In order to be able to compute such a distance, the sentences must live in the same vector space. This is done through sentence embeddings.

Observation

  • Firstly, I loaded the data, the trained model, the title embeddings, and the Word2Vec model.
  • Then I defined the necessary functions like the loss, evaluation metrics, vectorization, etc.
  • After that, I calculated the TF-IDF of all the titles in the data.
  • After all this, I defined a function named ‘searchresult’ which takes two parameters: the input search string and the number of search results to return in the output.
  • Inside the function, I process the input search query and then predict its tags. Once the tags are predicted, I filter the main data down to the rows having the predicted tags. I also calculate the TF-IDF of the input query and then compute the cosine similarity between the input search query and all the filtered data titles.
  • After calculating the cosine similarity, I add in the overall score, the sentiment polarity, and the TF-IDF of the input search, each multiplied by an appropriate weight.
  • In the end, I return a dictionary of the fields I want to display in the search output (a simplified sketch of this function follows the list).
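Putting the pieces together, a simplified sketch of such a searchresult function could look like this; the helper predict_tags, the tfidf_weights dictionary, the column names, and the blending weights are all assumptions standing in for the real code in search.ipynb:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def sentence_vector(text, w2v_model, tfidf_weights):
    # TF-IDF-weighted average of the Word2Vec vectors of the tokens in `text`.
    vecs, weights = [], []
    for token in text.split():
        if token in w2v_model.wv:
            vecs.append(w2v_model.wv[token])
            weights.append(tfidf_weights.get(token, 1.0))
    if not vecs:
        return np.zeros(w2v_model.wv.vector_size)
    return np.average(vecs, axis=0, weights=weights)

def searchresult(query, n_results=10):
    query = preprocess_text(query)

    # 1. Predict the tags of the query and keep only the posts with a matching tag.
    predicted_tags = predict_tags(query)   # hypothetical wrapper around the trained GRU model
    subset = data[data['tags'].apply(
        lambda tags: any(t in tags.split('|') for t in predicted_tags))]

    # 2. Cosine similarity between the query vector and the filtered title embeddings.
    q_vec = sentence_vector(query, w2v_model, tfidf_weights).reshape(1, -1)
    title_vecs = np.vstack(subset['title_embedding'].values)
    sims = cosine_similarity(q_vec, title_vecs).ravel()

    # 3. Blend similarity with the normalized votes and the answer sentiment.
    final_score = 0.7 * sims + 0.2 * subset['norm_score'].values \
                  + 0.1 * subset['sentiment_polarity'].values

    # 4. Return the fields we want to display for the top results.
    top = subset.assign(score=final_score).nlargest(n_results, 'score')
    return top[['title', 'url', 'score']].to_dict('records')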

GUI for the search engine

A search engine should look like a search bar, not like exposed code, so in this section I will show you how I built a simple interface so that the user does not have to dig into code to enter a query.

You can find the code of the GUI below:

Search engine GUI

For the GUI I have used ‘ipywidgets’. It is a very simple GUI with nothing fancy in it: it contains a title, two input boxes, and one search button.

  • I read the two input strings from the text boxes: the search query and the number of results to print.
  • After that, I preprocess the input query using a function named preprocess_text, which I have written in the utils notebook to keep the project more organized.
  • preprocess_text returns the processed text, and then I pass this processed text and the number of results to print as arguments to the ‘searchresult’ method, which I have defined in a separate notebook named search.ipynb; there the operations explained above happen and the most similar results are retrieved.
  • And then I display that as the search result (a rough sketch of this widget wiring follows the list).
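The author’s actual GUI code is embedded above as ‘Search engine GUI’; purely as a rough sketch of the same wiring (preprocess_text and searchresult are the functions described earlier):

import ipywidgets as widgets
from IPython.display import display, HTML

query_box = widgets.Text(description='Query:')
count_box = widgets.IntText(description='Results:', value=5)
search_btn = widgets.Button(description='Search')
output = widgets.Output()

def on_search_clicked(_):
    with output:
        output.clear_output()
        processed = preprocess_text(query_box.value)
        # Render each result title as a link to its StackOverflow question.
        for result in searchresult(processed, count_box.value):
            display(HTML(f"<a href='{result['url']}'>{result['title']}</a>"))

search_btn.on_click(on_search_clicked)
display(widgets.VBox([query_box, count_box, search_btn, output]))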

You can see a demo of the search engine in the video below:

Search engine demo video

10. Scope of Improvement

As every model has scope for improvement, so does this project. You are free to make changes and try other models with different architectures to improve the search results.
