Web Search Engine for Computer Science

Crawl Wikipedia webpages for Computer Science and use them to build a small web search engine

Evan
NavePnow
6 min read · May 9, 2020


Photo by Christian Wiediger on Unsplash

Description

In this project, I first crawl Wikipedia webpages related to Computer Science. These fetched webpages are then used to build a small web search engine using several methods covered in CS3245.

Crawling

The whole crawling pipeline is shown in Fig. 1, which was originally posted in Lecture 12: Crawling and Link Analysis.

Figure 1 Crawling architecture

To build a small search engine for Computer Science with high webpage quality, I set https://en.wikipedia.org/wiki/Outline_of_computer_science as the root link for crawling.

After fetching and parsing a webpage, we check its categories to decide whether it is related to computer science. At the bottom of each wiki page, there is a Categories section that can easily be selected to filter out pages unrelated to computer science.

Figure 2 Categories

Based on that, I set some keywords for filtering webpages, taken from shortened forms of the category names; a sketch of this check is shown below.
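As a rough illustration, the category check might look like the following sketch (assuming BeautifulSoup is used for parsing; the keyword list and function name are illustrative, not the exact ones in the project):

```python
from bs4 import BeautifulSoup

# Illustrative keyword list; the real filter uses shortened category names.
CS_KEYWORDS = {"comput", "algorithm", "software", "program", "data"}

def is_cs_page(html):
    """Check the Categories box at the bottom of a wiki page for CS-related keywords."""
    soup = BeautifulSoup(html, "html.parser")
    catlinks = soup.find("div", id="mw-normal-catlinks")  # the "Categories" section
    if catlinks is None:
        return False
    categories = [a.get_text().lower() for a in catlinks.find_all("a")]
    return any(kw in cat for cat in categories for kw in CS_KEYWORDS)
```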

After that, we also check whether the content of the webpage has already appeared among the previously fetched webpages. The goal is to avoid crawling duplicated webpages and to reduce the size of the crawled data.

There are two scenarios of duplicated content:

  1. different URLs but exactly the same content

  2. different URLs but almost the same content

Figure 3 Webpage screenshot
Figure 4 Webpage screenshot

The webpages contain many hyperlinks; we use these links both to build the A matrix in the PageRank part and to drive further crawling. Since we only care about wiki webpages, we filter out links to other hosts and drop disallowed links according to Wikipedia's robots.txt.
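A minimal sketch of this link-filtering step, assuming the standard-library urllib is used (the function name is illustrative):

```python
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

# Load Wikipedia's robots.txt once and reuse it for every candidate link.
robots = RobotFileParser("https://en.wikipedia.org/robots.txt")
robots.read()

def keep_link(base_url, href):
    """Keep only same-host /wiki/ links that robots.txt allows."""
    url = urljoin(base_url, href)
    parsed = urlparse(url)
    if parsed.netloc != "en.wikipedia.org" or not parsed.path.startswith("/wiki/"):
        return False
    return robots.can_fetch("*", url)
```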

With all the actions above, we save webpage.url, webpage.title, webpage.content, and webpage.outlinks (hyperlinks) to a file, which will be processed in the indexer part.

Used Tech

  1. Simhash

To implement the content check, I use an external library called SimHash. The key idea of SimHash is to combine the weighted hash values of each term into a fixed-length fingerprint for each document.
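A small sketch of how the check can be wired up with the simhash package on PyPI (the threshold of 3 bits is an assumption, not necessarily the project's actual setting):

```python
from simhash import Simhash  # pip install simhash

DUPLICATE_THRESHOLD = 3  # max Hamming distance treated as a near-duplicate (assumed)

def is_duplicate(content, seen_fingerprints):
    """Compare a page's 64-bit fingerprint against those of previously crawled pages."""
    fingerprint = Simhash(content)
    for seen in seen_fingerprints:
        if fingerprint.distance(seen) <= DUPLICATE_THRESHOLD:
            return True
    seen_fingerprints.append(fingerprint)
    return False
```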

PageRank

With the URLs and hyperlinks fetched in the crawling part, we can easily build the A matrix as a Markov chain and refine it with teleporting to avoid dead ends and spider traps.

Figure 5 Matrix A

Fig. 5 was originally posted in Lecture 12: Crawling and Link Analysis.

After that, we compute array a with the PageRank algorithm to represent the probability of each webpage. These probabilities are used in the searcher part.
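A minimal sketch of this computation with power iteration, assuming NumPy; the damping factor 0.85 and the convergence tolerance are assumptions, not necessarily the values used in the project:

```python
import numpy as np

def pagerank(A, alpha=0.85, tol=1e-8):
    """Compute array a by power iteration on the teleport-adjusted matrix."""
    n = A.shape[0]
    A = A.astype(float).copy()
    # Dead ends: pages with no out-links jump uniformly to every page.
    A[A.sum(axis=1) == 0] = 1.0 / n
    # Teleporting with probability 1 - alpha breaks spider traps.
    M = alpha * A + (1 - alpha) / n
    a = np.full(n, 1.0 / n)
    while True:
        a_next = a @ M
        if np.abs(a_next - a).sum() < tol:
            return a_next
        a = a_next
```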

Indexer

The structure for storing the data is shown in Fig. 6.

Figure 6 Indexer structure

For each fetched webpage, we remove all punctuation from the title and content, stem the words, and convert all characters to lower case. We store the doc_id, term frequency, and positions in the postings (positions will be used for phrasal queries). To build the A matrix, we take the hyperlinks of each webpage and set the entries corresponding to the doc_id and the indices of its hyperlinks.

After processing the webpages, we refine the A matrix and calculate array a, and then save everything (array a, dictionary, and postings) to file.
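A simplified sketch of the indexing step (assuming NLTK's PorterStemmer; the in-memory layout here is illustrative rather than the project's exact on-disk format):

```python
import re
from collections import defaultdict
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# postings[term][doc_id] -> {"tf": term frequency, "positions": [token offsets]}
postings = defaultdict(dict)

def index_document(doc_id, text):
    """Lowercase, strip punctuation, stem, and record positional postings."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    for position, token in enumerate(tokens):
        term = stemmer.stem(token)
        entry = postings[term].setdefault(doc_id, {"tf": 0, "positions": []})
        entry["tf"] += 1
        entry["positions"].append(position)
```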

Used Tech

  1. zone and field
Figure 7 Title

As we can see, in each webpage the title carries important information about the page. Thus, indexing titles separately is also important when users want to search for something like “find docs with merchant in the title zone and matching the query gentle rain”.
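One common way to implement this (an assumption about the exact implementation here) is to encode the zone into the dictionary key, so that title terms and body terms get separate postings lists:

```python
def zoned_term(term, zone):
    """Encode the zone in the term itself, e.g. 'merchant.title' vs 'merchant.body'."""
    return f"{term}.{zone}"

# A query such as "merchant in the title zone" then only consults the
# postings list of 'merchant.title', while ordinary terms use the '.body' zone.
title_terms = [zoned_term(t, "title") for t in ["merchant"]]
body_terms = [zoned_term(t, "body") for t in ["gentle", "rain"]]
```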

As for fields, I don't see any useful information such as date_published or court in these pages, so I don't process them.

Searcher

Given the user’s input as a query, we tokenize it first. With the previously stored dictionary.txt and postings.txt, we construct the query vector and document vectors using the vector space model. After calculating the similarity between each document and the query, we normalize the similarities and combine them with the values in array a. In short, I treat the normalized similarity as a probability and take the average of the two. Finally, all related webpages are shown in sorted order.
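A compact sketch of this ranking step (the exact tf-idf weighting and the equal-weight average are assumptions based on the description above):

```python
import math
from collections import defaultdict

def search(query_terms, postings, pagerank_a, n_docs):
    """Score documents with tf-idf, normalize, and average with array a."""
    scores = defaultdict(float)
    for term in query_terms:
        if term not in postings:
            continue
        idf = math.log10(n_docs / len(postings[term]))
        for doc_id, entry in postings[term].items():
            scores[doc_id] += (1 + math.log10(entry["tf"])) * idf
    total = sum(scores.values()) or 1.0
    # Treat the normalized similarity as a probability and average it with array a.
    ranked = [((score / total + pagerank_a[doc_id]) / 2, doc_id)
              for doc_id, score in scores.items()]
    return sorted(ranked, reverse=True)
```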

Used Tech

  1. query expansion

In query expansion, additional query terms are added based on synonyms. After tokenizing the query in the searcher part, I call the _expand() function to add synonyms.

Within the _expand() function, for each token we call wordnet.synsets(token) to get a set of synonyms. To keep the answers precise, we check each synonym against the dictionary and only add it to the query if it exists there.

Given the query computer science, after expansion, the result is calculate computer science skill.
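A sketch of what _expand() might look like with NLTK's WordNet interface (the details are assumed from the description above):

```python
from nltk.corpus import wordnet  # requires nltk.download('wordnet') beforehand

def _expand(tokens, dictionary):
    """Add WordNet synonyms, keeping only those already present in the dictionary."""
    expanded = list(tokens)
    for token in tokens:
        for synset in wordnet.synsets(token):
            for lemma in synset.lemma_names():
                synonym = lemma.replace("_", " ").lower()
                if synonym in dictionary and synonym not in expanded:
                    expanded.append(synonym)
    return expanded
```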

  2. relevance feedback

In this part, we use the Rocchio algorithm. Assuming the top 5 documents in the original result are highly relevant to the query, we use their document vectors to modify the query vector and search again.

Figure 8 Relevance Feedback

As you can see in Fig. 8, if the relevant docs are 0 and 5, the result changes so that the top 2 results are 0 and 5. The scores of all documents also change.
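A minimal sketch of the query update, assuming NumPy vectors; the alpha and beta weights are typical textbook values rather than the ones used here, and the non-relevant term is dropped since only the top documents are assumed relevant:

```python
import numpy as np

def rocchio(query_vec, doc_vectors, top_k=5, alpha=1.0, beta=0.75):
    """Shift the query vector toward the centroid of the top-k pseudo-relevant docs."""
    centroid = np.mean(doc_vectors[:top_k], axis=0)
    return alpha * query_vec + beta * centroid
```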

Limitation

  1. Limited crawled data

In this mini-project, I only crawl 2000 webpages about Computer Science. Given the size of Wikipedia, plenty of CS-related pages remain unprocessed. For example, my dataset does not even contain the wiki pages for “Information Retrieval” or “Computer Science”, so the results for such queries may not be very convincing.

  2. Unsatisfactory A matrix

To build a good A matrix, every fetched webpage should contribute as many hyperlinks as possible. However, when I check the A matrix, only 127 webpages have non-zero rows, which means that all the remaining pages are dead ends. Because of this, the values in array a are not very meaningful.

Author

👤 Evan: Crawling, PageRank, Indexer, query expansion

👤 Rulin: Searcher, relevance feedback

Reference

Implementation

Source code

  • Crawl wiki webpages

parameter setting (a usage sketch follows this list)

  1. url: root link for crawling
  2. Crawl(url, 'robots.txt'): robots.txt of Wikipedia
  3. crawl.crawl('./webpage', num=2000): folder for saved webpages and the number of webpages to crawl
  • Index
  1. Documents to be indexed are stored in directory-of-documents
  2. dictionary-file: This file will store array a, the average document length, doc info, and the dictionary
  3. postings-file: This file will store the tf, doc_id, and positions of each term.
  • Search
  1. dictionary-file: This file is generated by index.py and contains array a, the average document length, doc info, and the dictionary.
  2. postings-file: This file is generated by index.py and contains the tf, doc_id, and positions of each term.
  3. query-file: This file contains several queries to be tested.
  4. output-file-of-results: This file contains the results and corresponding URLs for the queries in queries.txt
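Putting the crawl parameters above together, a hypothetical invocation could look like this (the Crawl class and its import path are assumptions based on the parameter list):

```python
from crawler import Crawl  # module and class name assumed from the parameter list above

url = "https://en.wikipedia.org/wiki/Outline_of_computer_science"  # root link
crawl = Crawl(url, "robots.txt")        # obey Wikipedia's robots.txt
crawl.crawl("./webpage", num=2000)      # save up to 2000 webpages under ./webpage
```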

Advanced option
