Building a simple Stack Overflow search engine to predict semantically similar posts for a given query post using machine learning

Gayathri s
Published in Analytics Vidhya
15 min read · Oct 19, 2019


Stack Overflow is an obvious lifesaver for many developers who use it on a day-to-day basis (saying this as I am a developer myself :) ). It is a question-and-answer site for professional and enthusiast programmers. It features questions and answers on a wide range of topics in computer programming, and nowadays it is no longer limited to computer programming alone but covers questions across a much wider spectrum.

A typical Stack Overflow question-and-answer page

That was a quick introduction to what Stack Overflow is before we dive deep into the project.

About the project

The Stack Overflow website also has a particular section on every question-and-answer page.

Posts closely related to the post on the current page

It lists all the questions related to the current question, which helps the user navigate from the current page to a more suitable question page if the current one doesn’t cater to their needs. The project aims to predict these related posts for a given query post; for example, for the question “load data txt from pandas” it predicts the similar posts shown in the picture above.

Since we are building this as a search engine, in addition to the semantic relevance of the predicted posts with respect to the query post there are additional constraints to be satisfied:

  1. Low latency: the prediction time should be short
  2. Scalability: it must keep working even when the volume of data being searched grows tremendously

In short, the problem is to build a search engine over Stack Overflow questions whose results capture semantic meaning, with a scalable architecture that returns results in very little time.

Approach

1. Data Collection

Data Link: https://meta.stackexchange.com/questions/138356/how-do-i-download-stack-overflows-data

The above link contains the data dumps for the various Stack Exchange sites, of which only the following posts archives are downloaded:

cs.meta.stackexchange.com.7z
cs.stackexchange.com.7z
datascience.meta.stackexchange.com.7z
datascience.stackexchange.com.7z
ai.meta.stackexchange.com.7z
ai.stackexchange.com.7z
computergraphics.meta.stackexchange.com.7z
computergraphics.stackexchange.com.7z

The downloaded archives contain questions and answers on four major topics: computer science, data science, AI and computer graphics. Uncompressing each archive yields a number of files, of which only the Posts.xml file in each directory is kept, since that file alone contains the post text.

Each Posts.xml file is renamed according to its topic and stored in the XML Files directory:

XML Files
-> DataScienceMeta_Posts.xml
-> ComputerGraphicsMeta_Posts.xml
-> AI_Posts.xml
-> CSMeta_Posts.xml
-> AIMeta_Posts.xml
-> CS_Posts.xml
-> ComputerGraphics_Posts.xml
-> DataScience_Posts.xml

2. Data Preprocessing

The obtained XML files contain a lot more data, such as the post Id, comments and upvotes, in addition to the post text, all in XML format. Hence it is necessary to parse each XML file and extract only the post-text attribute of every element in the file.

import csv
import xml.etree.ElementTree as ET

# specifying the fields for the csv file
fields = ['Id', 'Text', 'Topic']

def parseXML(xmlfile, start_count):
    # create element tree object
    print("File", xmlfile)
    tree = ET.parse(xmlfile)
    # the topic is encoded in the file name, e.g. "XML Files/CS_Posts.xml" -> "CS"
    topic = xmlfile.split("/")[1].split("_")[0]
    # get root element
    root = tree.getroot()
    # create empty list for the post items
    newsitems = []
    count = start_count
    # iterate over every <row> element and keep only the post body and topic
    for each_row in root.iter("row"):
        news = {}
        news["Id"] = count
        news["Text"] = each_row.attrib["Body"]
        news["Topic"] = topic
        count = count + 1
        newsitems.append(news)
    print("len", len(newsitems))
    # return the list of post items
    return newsitems

Each XML file is parsed to obtain the text of every post and its corresponding topic, which is taken from the file name, and these are stored along with a generated Id field.

def savetoCSV(newsitems, filename):
    # writing to csv file
    with open(filename, 'w') as csvfile:
        # creating a csv dict writer object
        writer = csv.DictWriter(csvfile, fieldnames=fields)
        # writing headers (field names)
        writer.writeheader()
        # writing data rows
        writer.writerows(newsitems)

Each result is stored as a CSV file with the same name in a directory named CSV Files.
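The driver loop below iterates over a list called postfiles that is not shown in the original snippet. Assuming the XML Files layout above, a minimal way to build it would be:

import os

# Hypothetical: gather the renamed Posts.xml files so the loop below can iterate over them
postfiles = [f for f in os.listdir("XML Files") if f.endswith("_Posts.xml")]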

# specifying the fields for the csv file
fields = ['Id', 'Text', 'Topic']
start_count = 0
for each_file in postfiles:
    print(each_file)
    # parse the xml file
    newsitems = parseXML("XML Files/" + each_file, start_count)
    csv_filename = each_file.split('.')[0] + ".csv"
    print("csv_filename", csv_filename)
    # store the parsed posts in a csv file
    savetoCSV(newsitems, "CSV Files/" + csv_filename)
    start_count = len(newsitems) + start_count

The CSV Files directory looks like

CSV Files
-> DataScienceMeta_Posts.csv
-> ComputerGraphicsMeta_Posts.csv
-> AI_Posts.csv
-> CSMeta_Posts.csv
-> AIMeta_Posts.csv
-> CS_Posts.csv
-> ComputerGraphics_Posts.csv
-> DataScience_Posts.csv

All of the files in CSV Files are then parsed into data frames, which are merged together and saved as a pickle file; this is the final data file we will use for further processing. The structure of the final pickle file is a single data frame with the Id, Text and Topic columns.
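The merge step itself is not shown in the post; a minimal sketch might look like the following (the output file name is a placeholder):

import os
import pandas as pd

# Read every per-topic CSV, concatenate into one DataFrame and persist it as a pickle
csv_dir = "CSV Files"
frames = [pd.read_csv(os.path.join(csv_dir, f)) for f in os.listdir(csv_dir) if f.endswith(".csv")]
all_posts = pd.concat(frames, ignore_index=True)
all_posts.to_pickle("questions_text.pkl")  # placeholder file name
print(all_posts.shape, all_posts.columns.tolist())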

3. Data Cleaning

Since the text is scraped from a website, it typically tends to contain a lot of HTML entities, and hence it must be put through an extensive cleaning step.

The various data-cleaning treatments applied are:

  1. Removal of HTML tags
  2. Removal of URLs
  3. Removal of punctuation (#, &, *), stop words, etc.

Another key and distinctive cleaning step is:

4) Removal of programming code

for(i=0;i<n;i++) { print(i) }

Since this is predominantly a computer-science programming question-and-answer portal, most question and answer posts tend to contain programming code, which has a lot of punctuation and a unique structure that would lose its meaning if subjected to the usual preprocessing.

Hence the programming code is removed first and then feature engineered separately to obtain a good prediction.
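The exact cleaning rules are not listed in the post, but a minimal sketch of the steps above might look like this, assuming BeautifulSoup and the NLTK stop-word list (nltk.download('stopwords') is required once):

import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))

def clean_post(html_text):
    """Strip code blocks, HTML tags, URLs, punctuation and stop words from one post."""
    soup = BeautifulSoup(html_text, "html.parser")
    # Pull the <code> blocks out first so they can be featurized separately later
    code_parts = []
    for code_tag in soup.find_all("code"):
        code_parts.append(code_tag.get_text())
        code_tag.decompose()
    text = soup.get_text(separator=" ")
    text = re.sub(r"https?://\S+", " ", text)      # remove URLs
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)    # remove punctuation
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    return " ".join(words), " ".join(code_parts)

Returning the code text separately keeps it available for the code-cluster feature described in the next section.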

4. Feature Engineering

Feature engineering is one of the most important and crucial steps in solving any data science problem. Good feature engineering can help the model a lot in its predictions. Here I have come up with three new features:

  1. Topic
  2. Programming code
  3. Cites the same link

4a) Topic

The topic of each post is used as a feature, since posts on the same topic tend to be more similar to each other than posts on different topics. The topic feature is one-hot encoded, as it is categorical.
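A minimal sketch of the encoding, assuming the merged data frame from the preprocessing step (the pickle file name is a placeholder), is shown below; the resulting topics array is the kind of matrix the topic-similarity helper in the search-engine code later operates on.

import pandas as pd

# One-hot encode the Topic column of the merged posts data frame
posts = pd.read_pickle("questions_text.pkl")  # placeholder file name
topics = pd.get_dummies(posts["Topic"], prefix="topic").values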

4b) Programming code

Though the programming code in each post is removed during cleaning, it provides very important information that must not be lost. Hence the programming code is feature engineered separately so that we can extract more information from it. This is done by taking the code portion of each post separately, vectorizing it, and clustering the code vectors with k-means. In this way, we obtain some beautiful insights about the programming code, as the cluster labels below illustrate.

Clusters found include: database code, TensorFlow code, Python dataframe code, Java code, kNN code, NumPy functions, sklearn model train/test code, LSTM code, Keras deep-learning code, float data, regular expressions, C++ input/output, Python classes, data structures, double datatype, and Python data structures.
As we can see, the code is clustered into relevant types. A total of 24 clusters is obtained, and posts without code are put into a separate 25th cluster. The code cluster thus becomes a categorical feature for similar-post prediction, which is also one-hot encoded.
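The clustering step itself is not shown; a minimal sketch, under the assumption that the extracted code of each post is held in a list called code_texts (empty string for posts without code), could be:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# code_texts: assumed list with the extracted code of each post ("" if the post has no code)
has_code = [i for i, code in enumerate(code_texts) if code.strip()]
code_vectors = TfidfVectorizer(max_features=5000).fit_transform(
    [code_texts[i] for i in has_code])

kmeans = KMeans(n_clusters=24, random_state=42).fit(code_vectors)

# Posts without code go into a separate 25th bucket (cluster id 24)
code_cluster_list = np.full(len(code_texts), 24)
code_cluster_list[has_code] = kmeans.labels_

The per-post cluster id (code_cluster_list) is what the code-similarity helper in the search-engine code works with; it can additionally be one-hot encoded as described above.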

4c) Does_cite_the_same_url

Similar to programming code, URLs are another feature that requires special featurization. Since a URL also has a special structure, the normal cleaning treatment would make it lose its meaning. URLs can be feature engineered separately by checking whether two posts cite the same URL; if they do, then both may well refer to a similar thing.

On analysis, we find that nearly 31% of the cited URLs are repeated ones. Hence there is a decent probability that a citation is repeated, and this can be used to find similar posts.
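The repository presumably has its own version of this feature; the sketch below is only a guess at what it might look like, with the helper named get_cites_similarity to match the call in the search-engine code further down:

import re

URL_RE = re.compile(r"https?://[^\s\"'<>]+")

def extract_urls(raw_post):
    """Return the set of URLs cited in a raw (pre-cleaning) post."""
    return set(URL_RE.findall(raw_post))

# post_urls would be precomputed once over the raw posts, e.g.
# post_urls = [extract_urls(p) for p in raw_posts]

def get_cites_similarity(ind1, ind2):
    """Number of URLs the two posts have in common."""
    return len(post_urls[ind1] & post_urls[ind2])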

4d) Document Embeddings

For vectorization of the preprocessed post text, five types of embeddings are used:

  1. Glove Vectors embedding
  2. TF-IDF Weighted Glove vectors embedding
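Neither averaging scheme is spelled out in the post, so here is a minimal sketch of both, assuming a preloaded glove dictionary that maps words to vectors and an idf dictionary taken from a fitted TfidfVectorizer (both names are placeholders):

import numpy as np

def glove_avg(text, glove, dim=300):
    """Plain average of the GloVe vectors of the words in a post."""
    vecs = [glove[w] for w in text.split() if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def tfidf_weighted_glove(text, glove, idf, dim=300):
    """TF-IDF weighted average: rarer words contribute more to the document vector."""
    total, weight_sum = np.zeros(dim), 0.0
    for w in text.split():
        if w in glove and w in idf:
            total += glove[w] * idf[w]
            weight_sum += idf[w]
    return total / weight_sum if weight_sum else total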

Along with these, a technique called Yet Another Keyword Extractor (YAKE) is used to extract keywords from the post text before vectorizing it.

YAKE (Yet Another Keyword Extractor) is a nice unsupervised approach for extracting keywords from text. It has been shown to outperform a number of unsupervised methods, as well as a supervised one, on collections of different sizes, languages and domains.

To learn more about YAKE, look at this page: https://github.com/LIAAD/yake

Hence YAKE is used to first extract the keywords from each post’s text, and the embeddings are then applied to those keywords:

3) YAKE based Keyword Extraction + Glove Vectors embedding

4) YAKE based Keyword Extraction + TF-IDF Weighted Glove vectors embedding

import numpy as np
import yake
from nltk.tokenize import RegexpTokenizer

def tokenization(keywords_text):
    tokenizer = RegexpTokenizer(r'\w+')
    return tokenizer.tokenize(keywords_text)

# Converting each post's text into its corresponding keyword set
simple_kwextractor = yake.KeywordExtractor()
questions_keywords_text = []
for i in range(preprocessed_text.shape[0]):
    if i % 10 == 0:
        print(i, "vectors finished")
    keyword_simple = simple_kwextractor.extract_keywords(preprocessed_text[i])
    keyword_text = ""
    for each_kw in keyword_simple:
        keyword_text += each_kw[0] + " "
    questions_keywords_text.append(set(tokenization(keyword_text)))
questions_keywords_sets = np.array(questions_keywords_text)

5) BERT Embedding

BERT, which stands for Bidirectional Encoder Representations from Transformers, was published by Google and is currently the state-of-the-art approach for many NLP tasks. BERT is a Transformer-based model built on self-attention; it is pre-trained by predicting randomly masked tokens from both their left and right context rather than only from the previous tokens. BERT-based embeddings may prove useful for this project to extract more meaningful information from the post text.

import numpy as np
from bert_embedding import BertEmbedding

def avg_of_token_embedding(result):
    # average all token embeddings of a post into a single 768-d document vector
    document_embeddings = np.zeros(768)
    count = 0
    for i in range(len(result)):
        for j in range(len(result[i][1])):
            document_embeddings = document_embeddings + result[i][1][j]
            count = count + 1
    return document_embeddings / max(count, 1)

bert_embedding = BertEmbedding()
bert_document_embeddings = []
for i in range(preprocessed_text.shape[0]):
    token_embedding = bert_embedding([preprocessed_text[i]], 'sum')
    bert_embedding_vector = avg_of_token_embedding(token_embedding)
    bert_document_embeddings.append(bert_embedding_vector)
bert_document_embeddings = np.array(bert_document_embeddings)

5. Building a simple search engine

This is the most crucial part of the solution, as the search engine should be scalable and have low latency. The similarity between two posts is calculated in the following way:

Post_Similarity(i, j) = cosine_similarity(post_vector(i), post_vector(j))
                      + similarity(topic_vector(i), topic_vector(j))
                      + similarity(code_cluster(i), code_cluster(j))
                      + number of common URLs cited by posts i and j

Based on the above formula, the similarity between two posts is computed. Cosine similarity is used for the post-text vectors, since euclidean distance suffers from the curse of dimensionality in high-dimensional spaces. For the topic vector and the code cluster, the similarity is computed as the inverse of the euclidean distance (with a small alpha added to avoid division by zero):

               similarity = 1 / (euclidean distance + alpha)

To build the search engine, an efficient map-reduce-like solution is implemented in which 8 processes are spawned to run independently.

For a given query post, each process handles one-eighth of the posts and returns a list of length 10 containing the highest-similarity posts among those it computed. The 10-length lists returned by all the processes are then combined and sorted to get the overall top 10. This simple map-reduce-like technique keeps the latency within 20 seconds.

Code for the search engine:

import multiprocessing
import numpy as np
import pandas as pd
from datetime import datetime
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

alpha = 0.001   # small constant to avoid division by zero
k = 10          # number of similar posts kept per chunk

# returns topic-wise similarity of two posts (inverse of euclidean distance);
# `topics` is the one-hot topic matrix built in the feature-engineering step
def get_topic_similarity(ind1, ind2):
    dist = euclidean_distances(topics[ind1].reshape(1, -1),
                               topics[ind2].reshape(1, -1))[0][0]
    return 1 / (dist + alpha)

# returns code-wise similarity of two posts (inverse of euclidean distance);
# `code_cluster_list` holds the code-cluster id of every post
def get_code_similarity(ind1, ind2):
    dist = euclidean_distances(code_cluster_list[ind1].reshape(-1, 1),
                               code_cluster_list[ind2].reshape(-1, 1))[0][0]
    return 1 / (dist + alpha)

# keeps top_k_similar_docs sorted in descending order of similarity by inserting
# the new document at its place and shifting the rest down by one
def insert_into_top_k_similar_docs(top_k_similar_docs, vec_similarity, doc_id):
    doc_dict = dict()
    doc_dict["doc_id"] = doc_id
    doc_dict["similarity"] = vec_similarity
    should_insert = False
    for i in range(k + 1):
        if i == k:
            return
        if vec_similarity > top_k_similar_docs[i]['similarity']:
            should_insert = True
            break

    if should_insert:
        docs_place = i
        temp_list = []
        for i in range(docs_place, k - 1):
            temp_list.append(top_k_similar_docs[i])
        ind = 0
        for i in range(docs_place + 1, k):
            top_k_similar_docs[i] = temp_list[ind]
            ind = ind + 1
        top_k_similar_docs[docs_place] = doc_dict
# Finds the top 10 similar posts for a given query post among the
# posts in the [start, end) range
def find_vector_similarity(preprocessed_text_vector, test_query_point, start, end,
                           topic_weight=1, code_weight=1, cites_weight=1):
    top_k_similar_docs = []
    for i in range(k):
        doc_dict = dict()
        doc_dict['doc_id'] = 0
        doc_dict['similarity'] = -1
        top_k_similar_docs.append(doc_dict)

    for i in range(start, end):
        if i >= preprocessed_text_vector.shape[0]:
            break
        # weighted sum of text, topic, code and cites similarities
        vec_similarity = cosine_similarity(preprocessed_text_vector[i].reshape(1, -1),
                                           preprocessed_text_vector[test_query_point].reshape(1, -1))[0][0]
        topicwise_similarity = get_topic_similarity(i, test_query_point) * topic_weight
        codewise_similarity = get_code_similarity(i, test_query_point) * code_weight
        cites_similarity = get_cites_similarity(i, test_query_point)
        if cites_similarity:
            cites_similarity = cites_similarity * cites_weight
            vec_similarity = vec_similarity + topicwise_similarity + codewise_similarity + cites_similarity
        else:
            vec_similarity = vec_similarity + topicwise_similarity + codewise_similarity
        insert_into_top_k_similar_docs(top_k_similar_docs, vec_similarity, i)
    return top_k_similar_docs
def simple_search_engine(preprocessed_text_vector, test_sample,
                         topic_weight=1, code_weight=1, cites_weight=1):
    posts_text = pd.read_pickle('Preprocessed_questions_text.pkl')['Text'].values
    for each_test_point in test_sample:
        list_of_top_k_docs = []
        start_time = datetime.now()
        print("Query post", each_test_point, ":", posts_text[each_test_point])
        print("#" * 100)
        # map step: split the corpus into 8 chunks, one per worker process
        start = 0
        step = preprocessed_text_vector.shape[0] // 8 + 1
        input_list = []
        while start < preprocessed_text_vector.shape[0]:
            input_list.append([preprocessed_text_vector, each_test_point, start, start + step,
                               topic_weight, code_weight, cites_weight])
            start = start + step
        pool = multiprocessing.Pool(processes=8)
        tasks = []
        for each in input_list:
            tasks.append(pool.apply_async(find_vector_similarity, each))
        pool.close()
        pool.join()
        # gather the per-chunk top-10 lists from every worker
        for each in tasks:
            each.wait()
            result = each.get()
            list_of_top_k_docs.extend(result)
        # reduce step: sort all candidates by similarity and keep the global top 10
        list_of_top_k_docs.sort(key=lambda doc: doc['similarity'], reverse=True)
        list_of_top_k_docs = list_of_top_k_docs[0:k]
        total_time = (datetime.now() - start_time).total_seconds()
        print("Total time taken:", total_time, "seconds")
        print("The most similar", k, "posts are:")
        for each in list_of_top_k_docs:
            print(posts_text[each['doc_id']])
            print("#" * 100)
        print("*" * 200)
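As a hypothetical usage example (the embedding matrix name and the query indices below are placeholders), the engine can be invoked for a couple of sample posts like this:

# glove_document_vectors: any of the five document-embedding matrices built above
simple_search_engine(glove_document_vectors, test_sample=[25, 1040],
                     topic_weight=1, code_weight=1, cites_weight=1)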

6. Comparing the performance of the five embedding techniques

Since we don’t have an explicit performance metric to evaluate the quality of the predictions, we have to evaluate them by manually reading the 10 posts predicted by the model. Based on that, the performance of the five embeddings used is as follows:
  1. In similar-post prediction using GloVe vectors and TF-IDF weighted GloVe vectors, the first query post is about an NP-complete problem, the Hamiltonian cycle. We can observe that, using GloVe and TF-IDF weighted GloVe vectors, the engine predicts posts related to the theory of computation, time complexity and some NP-completeness posts.

An example prediction by plain GloVe vectors:

Query Posts:
Hint: Hamiltonian Cycle is $NP$-Complete.
Sample predicted posts:1) <p>Hint: This is just the well-known NP-complete problem PARTITION.</p>2) <p>Is there a notion of a context-free complete language (in the analogous sense to a <span class="math-container">$NP$</span>-complete language)?</p>3) <p>For example;</p>
<p>Is {0,1,{a,b,c},d,e} a valid alphabet to form a language over and is it usable in any context?</p>
4) <p>How Decision procedure and Decidable problem are different from each other?
Both are having solutions in yes and no form. Is any else solution for both? </p>

Example predictions by TF-IDF weighted GloVe vectors:

Query Posts:
Hint: Hamiltonian Cycle is $NP$-Complete.
Sample predicted posts:1) <p>Hint: This is just the well-known NP-complete problem PARTITION.</p>2) <p>Is there a notion of a context-free complete language (in the analogous sense to a <span class="math-container">$NP$</span>-complete language)?</p>3) <p>Can an ambiguous context-free grammar be converted into Chomsky normal form? I think the answer is yes.</p>4) <p>Time complexity, like any other nontrivial property of programs, is undecidable: there can be no algorithm to compute it, in Python or anything else.</p>

2) Predictions based on keyword-extracted GloVe vectors seem to be a little more refined than the predictions made using plain GloVe vectors.

Example predictions by keyword-extracted GloVe vectors:

Query Posts:
Hint: Hamiltonian Cycle is $NP$-Complete.
Sample predicted posts:1) <p>Is there a notion of a context-free complete language (in the analogous sense to a <span class="math-container">$NP$</span>-complete language)?</p>2) <p>Does it exist an algorithm to find a median of table with a complexity nlog(n) Thank</p>3) <p>2^n=O(3^n) : This is true or it is false if n>=0 or if n>=1
since 2^n may or not be element of O(3^n)
I need a hint to figure the problem</p>
4) <p>Yes and yes. Look at this graph:</p>

<ul>
<li>$V=\{1,2,3,4\}$</li>
<li>$E=\{\{1,2\},\{2,3\},\{3,4\},\{4,1\}\}$</li>
</ul>

<p>There is neither articulation point nor bridge. But if you delete any node/edge...</p>

Example predictions by keyword-extracted TF-IDF weighted GloVe vectors:

Query Posts:
Hint: Hamiltonian Cycle is $NP$-Complete.
Sample predicted posts:1) <p>Is there a notion of a context-free complete language (in the analogous sense to a <span class="math-container">$NP$</span>-complete language)?</p>

2) <p>Does it exist an algorithm to find a median of table with a complexity nlog(n)
Thank</p>

3) <p>2^n=O(3^n) : This is true or it is false if n>=0 or if n>=1
since 2^n may or not be element of O(3^n)
I need a hint to figure the problem</p>

4) <p>Yes and yes. Look at this graph:</p>

<ul>
<li>$V=\{1,2,3,4\}$</li>
<li>$E=\{\{1,2\},\{2,3\},\{3,4\},\{4,1\}\}$</li>
</ul>

<p>There is neither articulation point nor bridge. But if you delete any node/edge...</p>

3) Predictions based on the BERT embeddings seem to be much more refined than those based on the keyword-extracted GloVe and keyword-extracted TF-IDF weighted GloVe vectors: since the query post is about an NP-complete problem, more posts on NP-complete problems are predicted than anything else. Also, for the second (SVM) query post it predicts more posts about SVMs than about any other topic.

An example prediction by BERT embeddings:

Query Posts:
Hint: Hamiltonian Cycle is $NP$-Complete.
Sample predicted posts:1) <p>Yes, this is decidable, because you can do an exhaustive search of all possible paths. There is no need to look at any paths that repeat a vertex, since the "detour" could be skipped. But the length of any non-repetitive path is bounded by the size of the graph, which is finite, and so there are only finitely many such paths, which can be checked one by one.</p>

<p>What is not decidable is the following: given an infinite graph $G$ and two vertices $a$ and $b$, decide whether there is a path from $a$ to $b$. This is not decidable if the graph is given as an oracle and is also not decidable if the graph is given via a program that computes it. </p>
2) <blockquote>
<p>What is the time complexity of finding the diameter of a graph
$G=(V,E)$?</p>

<ul>
<li>${O}(|V|^2)$</li>
<li>${O}(|V|^2+|V| \cdot |E|)$</li>
<li>${O}(|V|^2\cdot |E|)$</li>
<li>${O}(|V|\cdot |E|^2)$</li>
</ul>
</blockquote>

<p>The diameter of a graph $G$ is the maximum of the set of shortest path distances between all pairs of vertices in a graph.</p>

<p>I have no idea what to do about it, I need a complete analysis on how to solve a problem like this.</p>
3) <p>When searching graphs, there are two easy algorithms: <strong>breadth-first</strong> and <strong>depth-first</strong> (Usually done by adding all adjactent graph nodes to a queue (breadth-first) or stack (depth-first)).</p>

<p>Now, are there any advantages of one over another?</p>

<p>The ones I could think of:</p>

<ul>
<li>If you expect your data to be pretty far down inside the graph, <em>depth-first</em> might find it earlier, as you are going down into the deeper parts of the graph very fast.</li>
<li>Conversely, if you expect your data to be pretty far up in the graph, <em>breadth-first</em> might give the result earlier.</li>
</ul>

<p>Is there anything I have missed or does it mostly come down to personal preference?</p>
4) <p>Breadth-first and depth-first certainly have the same worst-case behaviour (the desired node is the last one found). I suspect this is also true for averave-case if you don't have information about your graphs.</p>

<p>One nice bonus of breadth-first search is that it finds shortest paths (in the sense of fewest edges) which may or may not be of interest.</p>

<p>If your average node rank (number of neighbours) is high relative to the number of nodes (i.e. the graph is dense), breadth-first will have huge queues while depth-first will have small stacks. In sparse graphs, the situation is reversed. Therefore, if memory is a limiting factor the shape of the graph at hand may have to inform your choice of search strategy.</p>

4) Thus, predictions based on BERT embeddings appear more refined than those based on keyword-extracted text embeddings, which in turn perform better than the plain GloVe vector embeddings. We can also observe that the TF-IDF weighted GloVe vectors do almost the same job in terms of prediction quality; hence the TF-IDF weighting doesn’t add any noticeable improvement to the prediction performance.

7. References

  1. YAKE (Yet Another Keyword Extractor): https://github.com/LIAAD/yake
  2. BERT embeddings: https://github.com/imgarylai/bert-embedding

To view the entire work, please visit my GitHub repository: https://github.com/gayathriabhi/StackOverflow-Search-Engine

My LinkedIn profile: https://www.linkedin.com/in/gayathri-s-a90322126/
