Stack Overflow Search Engine

Parv Gupta
The Startup
Published in
12 min readOct 27, 2020

GitHub link of the project. LinkedIn profile.

Overview

It is a necessary problem for any business to give their user the most relevant results for any search, so it became a necessary problem to solve for any organization, so we will be solving this problem using some state of the art machine learning and software engineering approaches by using the data of the stack overflow to give relevant search results to the user. One of the major key points which should be taken into the consideration is the latency requirement for the above problem, the user will not spend more than half a second to get the results so the latency requirement for this problem would be less than 0.5 seconds. So the objective of the search engine is to give the most relevant results from the available answered questions to the user using the similarity-based techniques in a computationally optimized manner. So from the design standpoint, we will be building a q/a search engine which on the run time takes the query question as text and give the most similar answers by using the text and the semantic similarity of the query as results to the user in less than 0.5 seconds.

Index

Mapping the Business Problem

  1. Preprocessing and collection of data
  2. Modeling with Universal-Sentence-Encoder Model
  • Loading the Universal-Sentence-encoder Model
  • Connecting to the Elastic Search Container
  • Defining the Structure of the Elastic Search Database
  • Indexing of the data to the elastic search
  • Defining the Search Function
  • Give Structure to the Output
  • Results

3. Modelling Using the Bert

  • Loading the Bert Model
  • Defining the Structure of the Elastic Search Database
  • Indexing of the data into the elastic search*
  • Results using Bert model

4. Comparison

5. Conclusion

6. Deployment of Model

7. Model in Action

8. Future Works

9. References

10. Profile

Mapping the problem

Given a query by the user we have to return the top 10 similar results to the user on the bases of semantic as well as text-based similarity.

We will vectorize the given query into 2 vectors:

  • For the Semantic Vector, we will be using the encoder model to generate the vector embeddings.
  • For the text Vector, we can use the Simple BOW to vectorize the Text.

then we could compute the Similarity using the Cosine Similarity.

and combine the score using the weighted sum of semantic similarity and text-based similarity.

We will be using the Elastic Search to get the search results which will be running in a container of our docker.

Datalink:https://archive.org/download/stackexchange

We will be using the Elastic Search Which will be containerized in a Docker Instance.

We will be executing the series of commands to pull and run the elastic search image from the docker repository.

commands:

  • docker pull docker.elastic.co/elasticsearch/elasticsearch:7.9.3
  • docker run -p 9200:9200 -p 9200:9300 -e “discovery.type=single-node” docker.elastic.co/elasticsearch/elasticsearch:7.9.3

Importing Some Libraries

1.1 ) Some Preprocessing of the Data

We will be doing some preprocessing like removing the HTML tags, converting the text into lower case, expanding the contractions of the text

Below is the Code Snippet

1.2) Collecting the Data from different Files.

We will be picking the 5 most important categories for our work.

math.meta.stackexchange.com
ai.stackexchange.com
datascience.stackexchange.com
data/cs.stackexchange.com
softwareengineering.stackexchange.com

As the data is in the form of XML files so we will be converting the data into a python dictionary.

Each Of the posts in the XML file contains many attributes but only some of them will be of our use.

Below are the useful attributes of the XML post-

  • PostTypeId: This tells us whether the thread contains a question or answer (1 for the question and 2 for the answer)
  • title: It contains the text which is the title of the thread
  • Body : It contains question if PostTypeId is 1 it contains answer if PostTypeId is 2
  • ParentId: For answers thread, it contains the id of their question

We will be creating a unique id for each of the thread we create for example if the thread belongs to ‘ai’ and id is 454 then the unique id we create will be ai_454 for the thread, 1 thread will be containing the unique id, title, question, answers.

below is the code to convert each of the posts in the XML to our thread.

2.) Modelling Using the Universal-sentence-encoder

2.1 ) Loading the Universal-Sentence-encoder Model

  • We will be using the Universal-sentence-encoder to generate our embeddings.
  • The universal sentence encoder model is the state of the art NLP model which encodes sentences into embeddings.
  • the model takes a sequence or sentence as input and returns 512-dimensional data for each row.

2.2) Connecting to the Elastic Search Container

s we will be using a dockerized container on which our elastic search is running so we need to connect with our container using a port number.

Below is the python code to connect to Elastic Search

2.3) Defining the Structure of the Elastic Search Database

As for the elastic search container we need to create a database and we have to define a structure of a database

  • we will be having a text as well as a dense 512-dimensional vector so we will be defining our structure accordingly.

2.4) Indexing of the data to the elastic search

We will feed our data to elastic search

  • will be combining the whole ‘title’ + ‘question’ + ‘answers’ and name it as total_text
  • we will be creating feeding the total_text to our encoder model and get the 512 dims data
  • in the end, we will feed the total_text and total_text encoded vector to our database

2.5) Defining the Search Function

Below is the Search Function which takes a query text and returns the top 10 similar results on the basis of text-based and semantic similarity.

  • We will be using the cosine similarity to generate similarity.
  • As for the weighted sum of text similarity scores and semantic similarity score we will be giving higher weight to the semantic similarity.

2.6) To Give Structure to the Output

As the output returned by the above function is slightly unstructured so we will be using the below code snippet to generate the structured output

Results( Top 3)

---------------------------Answer No:1------------------------------title :  why should leaf nodes in a red-black tree be black?  question : from the property of red-black trees we know that:  all leaves (nil) are black. (all leaves are same color as the root.)(comren et al "introduction to algorithms")  but what is the reason that we should enforce them as black, even though they're nill's?    subanswer 1 : take a uncolored leaf node, now you can color it as either red or black. if you colored it as red then you may have chance that your immediate ancestor is also red which is contradicting(according to basic principle). if you color it as black then no problem even though the immediate ancestor is red. and also no change in the number of black nodes from root to leaf paths(i.e every path get +1). this may be the possible reason behind that. 
subanswer 2 : it's simply a part of the definition of a red-black tree. it is also necessary to maintain one of the other rules associated with red-black trees: if a node is red, then both its children are black.
--------------------------Answer No:2-------------------------------title : red black tree clarification
question : i am quite new to red-black trees, and therefore i am having a bit of difficult time trying to understand them.one of the properties of the red-black tree is that every red vertex must have two black children, meaning one cannot have two consecutive red vertices.can someone explain to me why this is an essential property? what are the benefits of having such a property?
subanswer 1 : the properties of a red black tree allow insertion, deletion and search in $o(log(n))$. i guess you can find a prove online somewhere. when an element is inserted or deleted. a fix-up is done, it involves rotations in a tree, so the properties still hold. this ensures the tree is quitebalanced at all times and search is quite fast at all times.
subanswer 2 : if you check the proof height-balancedness of red-black trees, you'll see that we essentially analyse the black-height $h_b$ of the tree which, by another important invariant, is the number of black nodes from the root to any leaf.the property you cite then gives us the right half of$\qquad\displaystyle h_b(t) \leq h(t) \leq 2h_b(t)$,which allows us to carry over bounds on black-height to usual height.
-------------------------Answer No:3--------------------------------title : root color of a black red tree
question : it is required that, in a black red tree, the color for the root is always black. however, wikipedia argues that this rule can be omitted as a red root can always be changed to black but not vice versa. i get the first half, that a red root can be changed to black at any time, but in what circumstance is it impossible to change a black node to red?for instance, consider the branch: black--red--black--red--black. we can always change it to red--black--black--red--black, since a black node does not need to have red children.
subanswer 1 : the root colour of a red-black tree carries no significance. in fact, you can save memory by not encoding it. didactically, it is meaningful to talk about a root's colour to illustrate what it means to be red or black, because it is a special case because it has no parent which is going to count it when it evaluates the sixth restriction in wikipedia's list (that the path from a particular node to a leaf should contain the same number of black nodes).as for your more general question about when changing a black node to a red one is allowed: a set of nodes can be repainted if afterwards, the black-height criterion is still satisfied. for example, if a row of the tree is saturated (if it is at depth $k$, there are $2^{k}$ nodes), then you can colour that row black. you may not colour only part of a (saturated or not) red row black. another example: in the tree $1:red \rightarrow\{ 2:black, 3:black\rightarrow\{4:red, 5:red\}\}$, you can flip the colours of $\{3,4,5\}$ and still have a legal rb tree. this is true generally for subtrees of even height and alternating colours. maybe you can think of more examples.
.
.
.
.
.

3.) Modeling Using the Bert

Above are the results of the Universal-Encoder model now we will try using the bert model.

3.1 ) Loading the Bert Model

  • We will be using the Bert Model to generate our embeddings.
  • Universal sentence encoder model is a Transformer based state of the art NLP model which encodes sentences into embeddings.
  • the model takes a sequence or sentence as input and returns 1024 dimensional data for each row.

3.2 ) Defining the Structure of the Elastic Search Database

As the model generates 1024 dims embeddings so we have to change the structure of the elastic search database.

3.3) Indexing of the data into the elastic search

  • we will be having a text as well as a dense 1024 dimensional vector for each row so we will be defining our structure accordingly.

All the other code snippets will be the same as the first model.

Results using Bert model(Top 3)

-----------------------------Answer No:1----------------------------title :  why should leaf nodes in a red-black tree be black?  question : from the property of red-black trees we know that:  all leaves (nil) are black. (all leaves are same color as the root.)(comren et al "introduction to algorithms")  but what is the reason that we should enforce them as black, even though they're nill's?    subanswer 1 : take a uncolored leaf node, now you can color it as either red or black. if you colored it as red then you may have chance that your immediate ancestor is also red which is contradicting(according to basic principle). if you color it as black then no problem even though the immediate ancestor is red. and also no change in the number of black nodes from root to leaf paths(i.e every path get +1). this may be the possible reason behind that. 
subanswer 2 : it's simply a part of the definition of a red-black tree. it is also necessary to maintain one of the other rules associated with red-black trees: if a node is red, then both its children are black.
---------------------------Answer No:2------------------------------title : red black tree clarification question : i am quite new to red-black trees, and therefore i am having a bit of difficult time trying to understand them.one of the properties of the red-black tree is that every red vertex must have two black children, meaning one cannot have two consecutive red vertices.can someone explain to me why this is an essential property? what are the benefits of having such a property?
subanswer 1 : the properties of a red black tree allow insertion, deletion and search in $o(log(n))$. i guess you can find a prove online somewhere. when an element is inserted or deleted. a fix-up is done, it involves rotations in a tree, so the properties still hold. this ensures the tree is quitebalanced at all times and search is quite fast at all times.
subanswer 2 : if you check the proof height-balancedness of red-black trees, you'll see that we essentially analyse the black-height $h_b$ of the tree which, by another important invariant, is the number of black nodes from the root to any leaf.the property you cite then gives us the right half of$\qquad\displaystyle h_b(t) \leq h(t) \leq 2h_b(t)$,which allows us to carry over bounds on black-height to usual height.
-------------------------Answer No:3-------------------------------title : root color of a black red tree
question : it is required that, in a black red tree, the color for the root is always black. however, wikipedia argues that this rule can be omitted as a red root can always be changed to black but not vice versa. i get the first half, that a red root can be changed to black at any time, but in what circumstance is it impossible to change a black node to red?for instance, consider the branch: black--red--black--red--black. we can always change it to red--black--black--red--black, since a black node does not need to have red children.
subanswer 1 : the root colour of a red-black tree carries no significance. in fact, you can save memory by not encoding it. didactically, it is meaningful to talk about a root's colour to illustrate what it means to be red or black, because it is a special case because it has no parent which is going to count it when it evaluates the sixth restriction in wikipedia's list (that the path from a particular node to a leaf should contain the same number of black nodes).as for your more general question about when changing a black node to a red one is allowed: a set of nodes can be repainted if afterwards, the black-height criterion is still satisfied. for example, if a row of the tree is saturated (if it is at depth $k$, there are $2^{k}$ nodes), then you can colour that row black. you may not colour only part of a (saturated or not) red row black. another example: in the tree $1:red \rightarrow\{ 2:black, 3:black\rightarrow\{4:red, 5:red\}\}$, you can flip the colours of $\{3,4,5\}$ and still have a legal rb tree. this is true generally for subtrees of even height and alternating colours. maybe you can think of more examples.
.
.
.
.
.

4.)Comparison

We have seen above that both the models results are quite similar but there is a slight difference in the latency of both the models, as the universal encoder model takes less time to generate the results

5.)Conclusion

We will be selecting the universal sentence encoder model for the deployment as it takes less time to compute the results.

6.)Deploying the Model

Till now we have done everything from the experimental point of view but now we have to deploy with clear and concise form, we will be using the flask API to deploy our model locally.

7.)Model in Action

8.)Future Works

  • Feeding more data to our Model to increase the quality of results.
  • Training our Encoder model with programming codes.
  • Apply a different weighting scheme for combining the scores.

9.)References

10.)Profile

--

--