Search engine based on Stack Overflow questions : Machine learning case study

Photo by cottonbro from Pexels

Let’s take a quick overview of Stack Overflow before we dive deep into the project itself. Stack Overflow is one of the largest Q&A platforms for computer programmers. People post questions and queries associated with a wide range of topics (mostly related to computer programming), and fellow users try to resolve those queries in the most helpful manner.

SECTION 1 : Brief overview
1. Business problem : The need for a search engine
2. Dataset
2.1. The process flow
2.2. High level overview
3. Exploratory data analysis and data pre-processing
-----------------------------------------------------
SECTION 2 : The attack plan

4. Modelling : The tag predictor
4.1. A TAG predictor model
4.2. Train-test split : time based splitting
4.3. Modelling : GRU based encoder-decoder seq2seq model
4.4. Model embedding : Word2Vec
4.5. Multi-label target problem
5. LDA (Latent Dirichlet Allocation) : Topic modelling
6. Okapi BM25 score : the simplest searching technique
7. Sentence embedding : BERT
8. Sentence embedding : Universal sentence encoder
-----------------------------------------------------

SECTION 3 : Productionizing the solution
9. Entire pipeline deployment on a remote server
9.1. A Cloud platform
9.2. Web App using Flask
-----------------------------------------------------
SECTION 4 : Results and conclusion

10. Results and conclusion
10.1 Final Results : BERT
10.2. Final Results : USE
10.3. Final Inferences

1. Business problem : The need for a search engine

‘StackOverflow’ is sitting on a huge amount of information contributed by the public, for the public. Nonetheless, such a large amount of data also makes it very difficult to retrieve the exact solution one is looking for. It becomes the primary duty of ‘StackOverflow’ (or any such facility provider) to serve the exact solution users are querying for in the search bar on its website. Otherwise, the worst case scenario is this: even if ‘StackOverflow’ has the exact solution a user is looking for, it will not be able to serve it, simply because of a weak search-engine mechanism or no searching mechanism at all. In simple words, out of a terribly huge amount of data, a search engine helps in finding exactly what users are looking for. One can understand the need for a ‘searching mechanism’ by imagining Google not being there, or Amazon having no product search bar.

Actual search results on stackoverflow.com for the query : ‘how to reverse a linked list in python’

One line problem statement : “To build a ranking mechanism on the basis of ‘StackOverflow’ Questions-Answers data, which returns a set of related Questions for a provided search-query.”

2. Dataset :

‘StackOverflow’ has a giant number of questions asked by various specialized or general communities, along with the answers and comments associated with each question. Usually, the answers and comments are contributed by experts or community enthusiasts. Basically, it is a public Q-A forum. Stack Exchange has made a sample of its large ‘Question-Answer’ data publicly available for research and development in the data community, which can be found at : https://archive.org/details/stackexchange

The data from the following StackExchange properties is used :

1. cs.stackexchange.com.7z
2. datascience.stackexchange.com.7z
3. ai.stackexchange.com.7z
4. stats.stackexchange.com.7z
The combined data in a pandas dataframe
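The post does not reproduce its loading code, so here is a minimal sketch of how the extracted dumps could be read into a single dataframe. The folder names, the 80-column subset and the helper name are assumptions, not the author's exact code:

```python
import pandas as pd
import xml.etree.ElementTree as ET

def posts_to_df(path):
    """Parse a Stack Exchange Posts.xml dump into a dataframe of questions."""
    rows = []
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "row" and elem.get("PostTypeId") == "1":   # PostTypeId 1 = question
            rows.append({
                "Id": elem.get("Id"),
                "CreationDate": elem.get("CreationDate"),
                "Title": elem.get("Title"),
                "Body": elem.get("Body"),
                "Tags": elem.get("Tags"),
            })
        elem.clear()
    return pd.DataFrame(rows)

# each .7z archive extracts to a folder containing Posts.xml (folder names are illustrative)
dfs = [posts_to_df(f"{site}/Posts.xml") for site in ("cs", "datascience", "ai", "stats")]
df = pd.concat(dfs, ignore_index=True)
print(df.shape)
```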

The process flow :

The flow of the solution is mainly broken into two parts :

  1. Cutting down all the available questions into a smaller chunk of candidate questions.
  2. Ranking the results based on semantic similarity. In order to finalize the ranking system with specific scoring mechanisms, the text documents need to be converted into high-dimensional vector representations. There are several pretrained sentence embeddings which are popular for their semantic results, e.g., BERT and USE.

High level Overview :

  • Acquisition of appropriate data from Archive.org.
  • Exploratory data analysis and data pre-processing.
  • Training a tag-predictor model | gathering the chunk of questions associated with the predicted ‘tag’.
  • LDA topic modelling | gathering the chunk of questions associated with the predicted ‘topic’.
  • BM25 results | gathering the chunk of questions returned by BM25.
  • Combining all three chunks of questions.
  • Sentence embedding : BERT/Universal Sentence Encoder (vectorization).
  • Ranking the results based on the metric ‘cosine_distance’ (a rough sketch of how these steps fit together follows below).
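To make the overview concrete, here is a rough, hypothetical sketch of the end-to-end ranking step. Every helper used here (tag_predictor_chunk, lda_chunk, bm25_chunk, embed, questions) is a placeholder for a component described in the following sections, not the author's actual code:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def rank_questions(query, top_k=10):
    """Tie the three candidate chunks together and rank them (all helpers are hypothetical)."""
    # 1. cut the corpus down to a small candidate set
    candidates = list(set(tag_predictor_chunk(query))
                      | set(lda_chunk(query))
                      | set(bm25_chunk(query)))

    # 2. embed the query and the candidate questions (embed() returns a 2-D array of vectors)
    query_vec = embed([query])
    cand_vecs = embed([questions[i] for i in candidates])

    # 3. rank by cosine similarity, i.e. smallest cosine distance first
    sims = cosine_similarity(query_vec, cand_vecs).ravel()
    order = np.argsort(-sims)[:top_k]
    return [(candidates[i], float(sims[i])) for i in order]
```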

3. ‘Exploratory data analysis’ and ‘data pre-processing’ :

Let’s take a look at all the available features and check for missing values.

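The original code screenshot is not reproduced here; an equivalent check with pandas would look roughly like this (column names are assumptions based on the Posts dump):

```python
# df is the combined questions dataframe built earlier
print(df.columns.tolist())            # available features
print(df.isnull().sum())              # missing values per column
print(df["Tags"].isnull().mean())     # fraction of questions without any tag
```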

Our dataset has three prominent features on which we build the searching mechanism: Question-Title, Question-Body and Question-Tags. Each question can have one to five tags along with the question title and its descriptive text. For the purpose of this case study I chose to work with 216k unique questions only.

Now we are ready to move on to the data cleaning and further data processing steps…

  1. remove html tags and html urls, replace html comparison operators.
  2. remove latex.
  3. convert everything to lowercase.
  4. decontraction (expand contractions).
  5. remove all special characters.
  6. remove extra white-spaces.

All Stack Exchange properties are nothing but web pages, and web-based data usually contains html tags and html comparison operators. These tags and special symbols do not contribute any information to the end task, hence I decided to remove them. If you observe closely, some data points are closely related to the mathematics domain; for this experiment I chose to remove LaTeX and formulae too. It is an empirically proven result from previous state-of-the-art NLP experiments that applying ‘decontractions’ tends to give better results.
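The exact regexes used in the notebook are not shown in the post; the sketch below is one plausible implementation of the six steps (the contraction map is a small illustrative subset):

```python
import re

CONTRACTIONS = {"won't": "will not", "can't": "can not", "n't": " not",
                "'re": " are", "'ll": " will", "'ve": " have", "'s": " is"}

def clean_text(text):
    text = re.sub(r"<[^>]+>", " ", text)                       # 1. html tags
    text = re.sub(r"https?://\S+", " ", text)                  #    html urls
    text = text.replace("&lt;", " less than ").replace("&gt;", " greater than ")  # comparison operators
    text = re.sub(r"\$[^$]*\$", " ", text)                     # 2. latex between $ ... $
    text = text.lower()                                        # 3. lowercase
    for pattern, replacement in CONTRACTIONS.items():          # 4. decontraction
        text = text.replace(pattern, replacement)
    text = re.sub(r"[^a-z0-9\s]", " ", text)                   # 5. special characters
    return re.sub(r"\s+", " ", text).strip()                   # 6. extra white-spaces

df["clean_title"] = df["Title"].fillna("").apply(clean_text)
df["clean_body"] = df["Body"].fillna("").apply(clean_text)
```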

Wordclouds : tags | title | body

Since StackExchange hosts questions and answers across a range of technical subjects, the community uses its own set of words, i.e., its own vocabulary. To avoid losing that subject-specific, crucial information, I avoided using lemmatization/stemming.

Full EDA with inferences can be found here on my GitHub profile.

4. A TAG Predictor model :

A tag can be considered a kind of topic/subject for any posted question, and it is given by the user (the publisher) himself. Sometimes tags are system generated too. It is important to note that our problem is not only a multi-class problem, it is a multi-label problem, i.e., each datapoint can have one or more tags.

Some tags in the dataset are far more frequent than others; since highly frequent target tags would dominate the model, I chose a threshold value of 7000 and removed tags occurring more than 7000 times in the dataset. Also, some tags are very infrequent, so the model could not have learnt anything from such target labels; hence I removed the tags occurring fewer than 500 times in the dataset.
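A minimal sketch of this frequency filtering, assuming a pre-parsed "tag_list" column holding each question's tags as a Python list:

```python
from collections import Counter

tag_counts = Counter(tag for tags in df["tag_list"] for tag in tags)

# keep only tags in the 500-7000 occurrence band described above
kept_tags = {tag for tag, count in tag_counts.items() if 500 <= count <= 7000}

df["filtered_tags"] = df["tag_list"].apply(lambda tags: [t for t in tags if t in kept_tags])
df = df[df["filtered_tags"].map(len) > 0]     # drop questions left without any usable tag
```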

TRAIN_TEST_SPLIT : Time based splitting

The questions and answers on public Q-A platforms like Stack Overflow usually evolve over time. If we did not consider the dependency of the data on time, the generalization accuracy (accuracy on unseen data) would no longer stay close to the cross-validation accuracy in the production environment. Whenever a timestamp is available and the data is time dependent, it is often the best choice to split the dataset based on time rather than using a random split.
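A time-based split boils down to sorting by the creation date and cutting at an index rather than sampling at random; the 80/20 ratio below is illustrative:

```python
# sort by the question creation date, then hold out the most recent ~20% as the test set
df = df.sort_values("CreationDate").reset_index(drop=True)

split_index = int(len(df) * 0.8)
train_df = df.iloc[:split_index]
test_df = df.iloc[split_index:]
```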

Modelling : GRU based Encoder-decoder seq2seq model

GRU based encoder-decoder model
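The post shows this model only as a screenshot. The sketch below is a minimal Keras version of a GRU based encoder-decoder in which the decoder emits one tag per timestep; the vocabulary sizes and dimensions are assumptions, not the author's exact architecture:

```python
from tensorflow.keras.layers import Input, Embedding, GRU, Dense
from tensorflow.keras.models import Model

vocab_size = 50000      # question-text vocabulary size (assumption)
tag_vocab_size = 500    # tag vocabulary incl. start/end tokens (assumption)
emb_dim, units = 300, 256

# Encoder: embeds the question tokens and summarizes them into a single state vector
enc_inputs = Input(shape=(None,), name="question_tokens")
enc_emb = Embedding(vocab_size, emb_dim, mask_zero=True)(enc_inputs)
_, enc_state = GRU(units, return_state=True)(enc_emb)

# Decoder: generates one tag per timestep, conditioned on the encoder state
dec_inputs = Input(shape=(None,), name="tag_tokens")
dec_emb = Embedding(tag_vocab_size, emb_dim, mask_zero=True)(dec_inputs)
dec_out, _ = GRU(units, return_sequences=True, return_state=True)(dec_emb, initial_state=enc_state)
dec_probs = Dense(tag_vocab_size, activation="softmax")(dec_out)

model = Model([enc_inputs, dec_inputs], dec_probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```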

Model embedding : Word2Vec embedding

For the encoder-decoder model I trained custom W2V embeddings on the StackExchange data. The Word2Vec embedding technique needs a huge amount of textual data to learn good quality word vectors. To gather enough data, I decided to merge ‘clean_title + clean_body + clean_comments’ together and train the w2v embeddings on that. Also, I used skip-gram W2V instead of CBOW, since skip-gram tends to provide better results on smaller amounts of data. The manually trained w2v embedding gave good results :

Manual W2V results.
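Training a skip-gram Word2Vec model with gensim looks roughly like this (merged_clean_text is an assumed list of the merged ‘clean_title + clean_body + clean_comments’ strings; parameter names follow gensim 4.x):

```python
from gensim.models import Word2Vec

sentences = [doc.split() for doc in merged_clean_text]

w2v = Word2Vec(
    sentences=sentences,
    vector_size=300,   # embedding dimensionality
    window=5,
    min_count=5,       # ignore very rare tokens
    sg=1,              # 1 = skip-gram, 0 = CBOW
    workers=4,
    epochs=10,
)

# sanity check: nearest neighbours of a domain word
print(w2v.wv.most_similar("gradient", topn=5))
```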

Multi-label target problem :

In deep learning modelling there are many ways to handle multi-label target features; a few of them are listed below :

  1. A Dense output layer with sigmoid activation units and the ‘binary_crossentropy’ loss function.
  2. A ‘recurrent neural network (RNN) + Dense’ output layer, treating each time_step as a target label, with the ‘categorical_crossentropy’ loss function.

For the current project I tried both methods. In the end the 2nd method was employed as it gave better results; for contrast, a minimal sketch of the 1st option is shown below.
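A minimal sketch of the first option, assuming a fixed-length question vector as input (the tag count and layer sizes are illustrative):

```python
from tensorflow.keras import layers, models

n_tags = 400   # size of the filtered tag vocabulary (assumption)

clf = models.Sequential([
    layers.Input(shape=(300,)),                    # e.g. an averaged w2v vector of the question
    layers.Dense(256, activation="relu"),
    layers.Dense(n_tags, activation="sigmoid"),    # one independent probability per tag
])
clf.compile(optimizer="adam", loss="binary_crossentropy")
```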

Note : For the complete data handling and tweaking with code — visit my github_profile.

5. LDA (Latent Dirichlet allocation) : Topic Modelling

There are many approaches to topic modelling, and Latent Dirichlet Allocation is the most popular technique among them. LDA can be viewed as a probabilistic factorization of the document-word frequency matrix: it takes a word-frequency matrix as input and tries to factorize it into two further matrices, a document-topic matrix and a topic-word matrix, which give a topic label for each document in the corpus.

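With gensim, fitting LDA on the cleaned questions and assigning a topic to a new query looks roughly like this (docs is an assumed list of token lists; the topic count and filtering thresholds are illustrative):

```python
from gensim import corpora
from gensim.models import LdaModel

dictionary = corpora.Dictionary(docs)
dictionary.filter_extremes(no_below=20, no_above=0.5)   # trim very rare and very common words
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(bow_corpus, num_topics=20, id2word=dictionary, passes=5, random_state=42)

# dominant topics for a new (cleaned, tokenized) query
query_bow = dictionary.doc2bow("how to reverse a linked list in python".split())
print(sorted(lda.get_document_topics(query_bow), key=lambda t: -t[1])[:3])
```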

6. Okapi BM25 Score :

Okapi BM25 is one of the simplest search-engine ranking functions, introduced in the 1990s for information retrieval. The formulation of BM25 is very similar to TF-IDF. BM25 is such an effective ranking algorithm that it is also used in the very popular Elasticsearch software.

Okapi BM25 scoring formula (reference : Wikipedia)
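In practice, a BM25 candidate chunk can be produced with the rank_bm25 library; clean_questions is an assumed list of the cleaned question texts:

```python
from rank_bm25 import BM25Okapi

tokenized_corpus = [doc.split() for doc in clean_questions]
bm25 = BM25Okapi(tokenized_corpus)

query = "how to reverse a linked list in python".split()
scores = bm25.get_scores(query)                                 # one BM25 score per document
candidate_chunk = bm25.get_top_n(query, clean_questions, n=50)  # top candidates for re-ranking
```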

7. Sentence embedding : BERT

BERT (Bidirectional Encoder Representations from Transformers) is a language-representation technique based on deep neural networks. The biggest advantage of BERT over most previously available embedding techniques (except w2v, but that is a word embedding) is ‘transfer learning’ and ‘sentence embeddings’. In 2014, VGG16 made its place in the computer vision community because of its ‘transfer learning’ advantage, something that was not possible in the natural language processing space at the early stages. There are two existing strategies for applying pre-trained language representations: ‘feature-based’ and ‘fine-tuning’. Since the dataset has mathematics and computer-science specific documents, ‘fine-tuning’ would most likely help semantic searching, but because of the computational expense I am limiting myself to ‘feature-based’ embeddings only. This model has been pre-trained for English on Wikipedia and the BooksCorpus. The BERT pretrained model architecture comes in two options :

I. BASE : (L=12, H=768, A=12, Total Parameters=110M)

II. LARGE : (L=24, H=1024, A=16, Total Parameters=340M)

where : L = number of transformer blocks | H = hidden layer size | A = number of self-attention heads
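The post does not show its exact BERT pipeline; one common feature-based approach is to mean-pool the last hidden layer of a pretrained BERT BASE model via the Hugging Face transformers library. The model name and the pooling choice here are assumptions:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
bert.eval()

def bert_embed(sentences):
    enc = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state             # shape: (batch, seq_len, 768)
    mask = enc["attention_mask"].unsqueeze(-1).float()     # ignore padding positions while pooling
    return (hidden * mask).sum(1) / mask.sum(1)            # mean-pooled sentence vectors

vecs = bert_embed(["how to reverse a linked list in python"])
print(vecs.shape)    # torch.Size([1, 768])
```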

8. Sentence embedding : Universal sentence encoder

As the name (Universal Sentence Encoder) itself suggests, ‘USE’ takes sentences as input and gives high-dimensional vector representations of the input text. The input can be variable-length English text and the output is a 512-dimensional vector. The universal-sentence-encoder model has two variants: one is trained with a deep averaging network (DAN) encoder, and the other is trained with a Transformer. In the paper the authors mention that the Transformer-based USE tends to give better results than the DAN, but it comes at a price in terms of computational resources and run-time complexity. The USE model is trained on sources such as Wikipedia, web news, the Stanford Natural Language Inference (SNLI) corpus, web question-answer pages and discussion forums. The authors explain that the USE model was created with unsupervised NLP tasks in mind, such as transfer learning using sentence embeddings, hence it makes complete sense to use USE in our semantic search task.

Sentence similarity scores using embeddings from the universal sentence encoder.
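A minimal sketch of embedding and ranking with the DAN variant of USE from TensorFlow Hub (candidate_titles is an assumed list of question titles; the Transformer variant lives at the ‘universal-sentence-encoder-large’ URL):

```python
import numpy as np
import tensorflow_hub as hub

use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

query_vec = use(["how to reverse a linked list in python"]).numpy()
doc_vecs = use(candidate_titles).numpy()

# cosine similarity between the query and every candidate question
scores = (doc_vecs @ query_vec.T).ravel() / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
top_10 = np.argsort(-scores)[:10]
```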

9. Entire pipeline deployment on a remote server :

Running model after deployment on a remote server

A Cloud platform :

To serve the purpose of the current project, I deployed the entire pipeline on Google Cloud Platform (GCP). Launching a virtual machine with Google Cloud Platform or Amazon Web Services (AWS) is quite easy. I used a very basic virtual machine instance with the following specs :

  1. OS image : Linux for deep learning
  2. 8 GB RAM
  3. 2 CPUs
  4. 50 GB HDD

Web App using Flask :

To develop a web app with Python and HTML, Flask is an easy and lightweight open-source framework. I have posted the web-app deployment code required for any virtual machine on my GitHub; feel free to visit my profile.
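The Flask wrapper around the ranking pipeline can be as small as the sketch below; the template names and the rank_questions helper (sketched earlier) are assumptions:

```python
from flask import Flask, request, render_template

app = Flask(__name__)

@app.route("/")
def home():
    return render_template("index.html")            # a simple page with a search box (assumed template)

@app.route("/search", methods=["POST"])
def search():
    query = request.form["query"]
    results = rank_questions(query, top_k=10)       # the ranking pipeline sketched earlier
    return render_template("results.html", results=results)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```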

10.1 Final Results : BERT

BERT Results : 1
BERT Results : 2
BERT Results : 3

10.2. Final Results : USE

USE Results : 1
USE Results : 2
USE Results : 3

10.3. Final Inferences :

  1. In our case the USE embeddings outperformed BERT. The searching mechanism with USE vectors gives great results, with a clear ‘semantic relationship’ between the query and the returned questions.
  2. The purpose of this case study is to build a simple search engine with LOW latency. By using the right data structures in our mechanism, we manage to return results in under 800–900 milliseconds on a normal 8 GB machine.

Please find complete code with documentation : https://github.com/vispute/StackOverflow_semantic_search_engine

Feel free to connect with me on LinkedIn : https://www.linkedin.com/in/akshay-vispute-a34bb5136/

Future work :

1. Scale up to a few million datapoints.

2. Pretrain/fine-tune the BERT and USE sentence embeddings on StackExchange data or on the downstream task.

3. A tutorial for ‘entire pipeline deployment on a remote web server like GCP, AWS or Azure’.

References :

1. BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding — https://arxiv.org/pdf/1810.04805.pdf

2. USE : Universal Sentence Encoder : https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46808.pdf

3. Topic Modelling — LDA (Latent Dirichlet Allocation) — LDA paper : http://link.springer.com/chapter/10.1007%2F978-3-642-13657-3_43

4. Improvements to BM25 and Language Models Examined : http://www.cs.otago.ac.nz/homepages/andrew/papers/2014-2.pdf
