Search Engine in Python from scratch

Dhaval Taunk
Published in Analytics Vidhya
7 min read · Oct 10, 2021

In this post, I will be going through all the details of building a search engine from scratch using the Wikipedia dump (approximately 84GB in size). I will walk through the step-by-step process of creating a primary index over the data and a secondary index, as well as how to implement search functionality that returns results in minimal time. I will divide the post into four parts, viz. Requirements, Wiki dump details, Indexing, and Searching. So tighten your seat belts, and let's get started.

1. Requirements

Python3

NLTK stopwords

PyStemmer

2. Wiki dump details

For creating the search engine in this post, I will be using the Wikipedia dump for the English language, which is approximately 84GB in size. One can download the data from the link given below:

https://dumps.wikimedia.org/enwiki/20210720/enwiki-20210720-pages-articles-multistream.xml.bz2

You will need to download and extract the dump. Alternatively, you can work on the compressed data as well; I will be explaining everything on the extracted dump.

3. Creating the Index for Wiki dump

(i) Parsing the XML dump

First of all, you will need to parse the XML and extract the necessary data. There are several parsers available in Python for this, some of them being:

SAX

Etree

DOM

Here I will be using the SAX parser to parse the XML; you can try out the other parsers as well. The code below shows how to use the SAX parser to parse the data.
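
What follows is a minimal sketch of such a handler. Only the xml.sax calls are from the standard library; the handler class and the process_page hook are illustrative names for passing each parsed page on to the indexer.

import xml.sax

class WikiHandler(xml.sax.handler.ContentHandler):
    """Collects the title and text of every <page> element in the dump."""

    def __init__(self):
        super().__init__()
        self.current_tag = ''
        self.title = ''
        self.text = ''
        self.num_pages = 0                      # doubles as our own docID counter

    def startElement(self, name, attrs):
        self.current_tag = name
        if name == 'page':
            self.title, self.text = '', ''

    def characters(self, content):
        # characters() may be called several times per element, so append
        if self.current_tag == 'title':
            self.title += content
        elif self.current_tag == 'text':
            self.text += content

    def endElement(self, name):
        self.current_tag = ''
        if name == 'page':
            # hypothetical hook: pass the page on to the indexing pipeline
            process_page(self.num_pages, self.title, self.text)
            self.num_pages += 1

parser = xml.sax.make_parser()
parser.setContentHandler(WikiHandler())
parser.parse('enwiki-20210720-pages-articles-multistream.xml')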

I am using only two fields from the XML in the above code, i.e. the title and the text. I assign my own document IDs from the num_pages counter. How the title and the text are used is explained in the sections below.

(ii) How to preprocess text

This is an essential step, as it ensures we are not adding unnecessary terms to the index; otherwise, the index size will blow up. Mainly, I will tokenise the text, remove stopwords, remove HTML tags, remove non-ASCII characters, and stem the tokens. This is shown in the code below.
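
A minimal sketch of such a preprocessing function, assuming the NLTK stopword list has already been downloaded and using PyStemmer for stemming:

import re
import Stemmer                                  # PyStemmer
from nltk.corpus import stopwords

stemmer = Stemmer.Stemmer('english')
stop_words = set(stopwords.words('english'))    # requires nltk.download('stopwords')

def preprocess(text):
    """Clean raw wiki text and return a list of stemmed tokens."""
    text = text.lower()
    text = re.sub(r'<[^>]*>', ' ', text)        # strip HTML tags
    text = re.sub(r'[^\x00-\x7f]', ' ', text)   # drop non-ASCII characters
    tokens = re.findall(r'[a-z0-9]+', text)     # tokenise on alphanumeric runs
    tokens = [t for t in tokens if t not in stop_words]
    return stemmer.stemWords(tokens)            # PyStemmer stems a whole list at once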

(iii) Extract different fields

There are different possible fields on which we can run query searches. I will be using six fields: title, body, category, infobox, links, and references. Using these fields, one can search generic queries or field-specific queries.
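
A rough sketch of pulling these fields out of the raw wiki markup is shown below; the regexes are simplified approximations (real infobox parsing, for instance, needs proper brace matching), and the field split here is illustrative rather than the exact original code.

import re

def extract_fields(title, text):
    """Split a raw wiki page into the six searchable fields (approximate regexes)."""
    categories = re.findall(r'\[\[category:(.*?)\]\]', text, flags=re.IGNORECASE)
    infoboxes  = re.findall(r'\{\{infobox(.*?)\}\}', text, flags=re.IGNORECASE | re.DOTALL)
    links      = re.findall(r'==\s*external links\s*==(.*?)(?:\n\n|$)', text,
                            flags=re.IGNORECASE | re.DOTALL)
    references = re.findall(r'<ref[^>]*>(.*?)</ref>', text, flags=re.IGNORECASE | re.DOTALL)
    return {
        'title': title,
        'body': text,                           # the full text doubles as the body field here
        'category': ' '.join(categories),
        'infobox': ' '.join(infoboxes),
        'link': ' '.join(links),
        'reference': ' '.join(references),
    }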

(iv) Creating an intermediate index

Creating the final index directly would be a heavy task and could blow up memory as well. Therefore, we will first create intermediate indexes over sections of the data and then merge them to build the final index. We will be using the SPIMI (single-pass in-memory indexing) approach; you can read more about it using the link below.

The below code shows how to create an intermediate index.
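
A sketch of the idea is given here, assuming each page arrives as a docID plus a mapping from field letters to that field's preprocessed token list; the block size is an illustrative number, not the original setting.

from collections import defaultdict

BLOCK_SIZE = 20000                  # pages per intermediate block (illustrative)

index = defaultdict(dict)           # token -> {docID: 't3b6i2r1'-style posting}
block_num = 0

def add_to_index(doc_id, fields):
    """fields maps a field letter ('t', 'b', 'c', 'i', 'l', 'r') to its token list."""
    counts = defaultdict(lambda: defaultdict(int))
    for letter, tokens in fields.items():
        for token in tokens:
            counts[token][letter] += 1
    for token, per_field in counts.items():
        posting = ''.join(f'{letter}{freq}' for letter, freq in per_field.items())
        index[token][doc_id] = posting

def write_block():
    """Dump the in-memory dictionary to a sorted intermediate file (one SPIMI block)."""
    global index, block_num
    with open(f'index_{block_num}.txt', 'w') as f:
        for token in sorted(index):
            postings = ';'.join(f'{d}:{p}' for d, p in sorted(index[token].items()))
            f.write(f'{token}-{postings};\n')
    index = defaultdict(dict)
    block_num += 1

write_block() is called after every BLOCK_SIZE pages, so memory never has to hold more than one block of postings at a time.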

Here I am processing the text token by token, adding it to a dictionary, and then writing it to intermediate files. The format of the index files now looks like this:

apple-2314:t3b6i2r1;6432:t5c8b3i1;

In the above example, "apple" is a token. After the hyphen, each entry is separated by ";". The value before ":" is the docID of the document, and the string that follows encodes the frequency of the token in each field.

E.g., t3b6i2r1 means the token appears 3 times in the title, 6 times in the body, 2 times in the infobox, and 1 time in the references.

(v) Merging the intermediate index

Now that we are done writing the intermediate indexes, we need to merge them, because a token's information will often be split across multiple files. Merging is required to create the final index. The code below does this:
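
Here is a sketch of the merge, assuming the intermediate blocks were written sorted by token (as above) so that a heap-based k-way merge can stream them without loading everything into memory; the letters t, b, c, i, l, r map to the six field-wise output files.

import heapq
import re
from glob import glob

def write_field_lines(token, postings, field_files):
    """Split merged postings like '2314:t3b6i2r1;...' into per-field 'docID:count' entries."""
    per_field = {letter: [] for letter in 'tbcilr'}
    for pair in postings.strip(';').split(';'):
        doc_id, counts = pair.split(':')
        for letter, freq in re.findall(r'([a-z])(\d+)', counts):
            per_field[letter].append(f'{doc_id}:{freq}')
    for letter, entries in per_field.items():
        if entries:
            field_files[letter].write(f'{token}-' + ';'.join(entries) + ';\n')

def merge_blocks(out_prefix='final'):
    """K-way merge of the sorted intermediate blocks into field-wise final files."""
    files = [open(path) for path in sorted(glob('index_*.txt'))]
    field_files = {letter: open(f'{out_prefix}_{letter}.txt', 'w') for letter in 'tbcilr'}

    def push_next(i):
        line = files[i].readline()
        if line:
            token, postings = line.rstrip('\n').split('-', 1)
            heapq.heappush(heap, (token, i, postings))

    heap = []
    for i in range(len(files)):
        push_next(i)

    while heap:
        token, i, merged = heapq.heappop(heap)
        push_next(i)
        # collect the same token from every other block that contains it
        while heap and heap[0][0] == token:
            _, j, more = heapq.heappop(heap)
            merged += more
            push_next(j)
        write_field_lines(token, merged, field_files)

    for fh in files + list(field_files.values()):
        fh.close()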

The format of the final Index looks like the below example:-

apple-2141:5;1232:1;5432:78;

In the above example, "apple" is a token. After the hyphen, each pair is separated by ";". The value before ":" is the docID of the document, and the value after it is the frequency of the token in that document for the given field. Unlike the intermediate indexes, the final index is split into separate field-wise files.

In this way, the final index will look similar to the example shown above. If you want to see the entire code, you can visit the GitHub link given at the end of this blog and try out any other approaches you would like to experiment with.

(vi) Secondary Index

You can create a secondary index as and when required. It speeds up searching by keeping, for every token, pointers into the final index files. I will be using the format below for the secondary index; one can try out other possible forms as well.

apple-563-1-3-4--2-4-6-

According to the above format, 'apple' is the token. The value 563 denotes the frequency of 'apple' in the entire Wikipedia dump. The next number, 1, indicates the file number in which the token is present; if the token appears in any field, it will only ever be in that particular file number for that field. All the values after that are optional. The details of the field values are given below:

As mentioned above, I will be using the fields title, body, category, infobox, link, and reference. The values between the '-' separators denote the line numbers in the corresponding final index files. The example below shows it:

'apple-563-1-3-4--2-4-6-'.split('-') ---> ['apple', '563', '1', '3', '4', '', '2', '4', '6', '']

Here the fourth element ('3') indicates that 'apple' appears at line number 3 of title file number 1. Similarly, '4' shows that it is present at line number 4 of the body file; it does not appear in the category file, since that entry is an empty string (''); it appears at line number 2 of the infobox file, at line number 4 of the link file, and at line number 6 of the reference file. One thing common to all the fields: if the token is present for any field, it will be present in that particular file number only, nowhere else.
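
A small sketch of reading such an entry back, assuming the six fields are stored in the order title, body, category, infobox, link, reference:

FIELDS = ['title', 'body', 'category', 'infobox', 'link', 'reference']

def parse_secondary_entry(entry):
    """Turn 'apple-563-1-3-4--2-4-6-' into the token, its total frequency,
    the file number, and a per-field line-number lookup."""
    parts = entry.split('-')
    token, total_freq, file_num = parts[0], int(parts[1]), int(parts[2])
    lines = {}
    for field, value in zip(FIELDS, parts[3:9]):
        if value:                               # an empty string means: not in this field
            lines[field] = int(value)
    return token, total_freq, file_num, lines

token, freq, file_num, lines = parse_secondary_entry('apple-563-1-3-4--2-4-6-')
# lines == {'title': 3, 'body': 4, 'infobox': 2, 'link': 4, 'reference': 6}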

4. Implementing Search functionality

One of the crucial things required for search functionality is a ranking function that orders documents according to their relevance. But before that, a few other things are needed, which I will explain below:

(i) Preprocessing the query

The preprocessing is the same as what we did during the indexing phase; applying the same steps to the raw query gives us the final preprocessed query.

(ii) Identify query type

This is one of the more useful steps, as it tells us whether we need to run a simple query, a field query, or both. So basically, a query can be of 3 types:

Type 1:- world war II
Type 2:- t:world cup i:2012
Type 3:- Sachin Tendulkar t:world cup i:2012

We can identify the type of query using the code below:
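
Here is a sketch of this step, assuming the field prefixes t, b, c, i, l, r (title, body, category, infobox, link, reference) are used in the query syntax shown above; the regexes are an illustration rather than the exact original code.

import re

def identify_query(query):
    """Split a raw query into a plain part and field-specific parts, and report its type."""
    # field-specific chunks such as 't:world cup' or 'i:2012'
    field_parts = dict(re.findall(r'\b([tbcilr]):(.*?)(?=\s+[tbcilr]:|$)', query))
    # everything before the first field prefix is the plain part
    plain_part = re.split(r'\b[tbcilr]:', query)[0].strip()
    if field_parts and plain_part:
        return 3, plain_part, field_parts       # Type 3: mixed query
    if field_parts:
        return 2, plain_part, field_parts       # Type 2: field query only
    return 1, plain_part, field_parts           # Type 1: simple query

identify_query('Sachin Tendulkar t:world cup i:2012')
# -> (3, 'Sachin Tendulkar', {'t': 'world cup', 'i': '2012'})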

(iii) Ranking functionality

The ranking function is used to order the results so that the most relevant ones appear on top. Usually, tf-idf is the metric used to rank documents.

So what is tf-idf? It is composed of two terms TF and IDF.

TF (Term Frequency):- It tells us the frequency of a term in a document.

IDF (Inverse Document Frequency):- Many common words occur with high frequency across a large number of documents, which makes them less useful for telling documents apart. IDF is used to reduce the effect of such words. It is the log of the total number of documents divided by the number of documents that contain the word.

So the final tf-idf value is the product of tf and idf values.

tf-idf = tf * idf
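
A minimal sketch of scoring documents with this formula, assuming the postings for each query token have already been read from the final index into a token -> {docID: term frequency} mapping:

import math
from collections import defaultdict

def rank(query_tokens, postings, total_docs):
    """Score documents by summing tf * idf over the query tokens and
    return them ordered by decreasing relevance."""
    scores = defaultdict(float)
    for token in query_tokens:
        docs = postings.get(token, {})
        if not docs:
            continue                              # token absent from the index
        idf = math.log10(total_docs / len(docs))  # len(docs) = document frequency
        for doc_id, tf in docs.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)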

(iv) Variants of tf-idf

(Table of tf-idf weighting variants.) Source:- https://nlp.stanford.edu/IR-book/pdf/06vect.pdf

One can use any of the above tf-idf variants; the choice depends on your implementation and requirements.

Happy reading…….

