Web Search Engine for Computer Science
Crawl Wikipedia webpages for Computer Science and use them to build a small web search engine
Description
In this project, I first crawl Wikipedia webpages related to Computer Science. These fetched webpages are then used to build a small web search engine using multiple methods covered in CS3245.
Crawling
The whole crawling pipeline is shown in Fig. 1, which was originally posted in Lecture 12: Crawling and Link Analysis.
To make a small search engine for Computer Science with high webpage quality, I set https://en.wikipedia.org/wiki/Outline_of_computer_science as the root link for crawling.
After fetching and parsing a webpage, we check whether its category is related to computer science. At the bottom of each wiki page there is a Categories tag that can easily be selected for filtering out non-computer-science-related pages.
Based on that, I set some keywords for filtering webpages. Here are some shortened category words:
category_list = ['info', 'com', 'tech', 'sci', 'dev', 'alg', 'learn', 'eng', 'math', 'oper', 'data']
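The filtering step can be sketched as follows. This is a hypothetical reconstruction (the helper name `is_cs_related` is mine, not from the project): a page is kept only if one of its Categories entries contains a keyword from `category_list`.

```python
def is_cs_related(categories, category_list):
    """Keep a page only if at least one of its category names contains
    one of the shortened keywords (case-insensitive substring match)."""
    return any(key in cat.lower()
               for cat in categories
               for key in category_list)
```

For example, `is_cs_related(['Computer science'], category_list)` matches on the `'com'` and `'sci'` keywords, while a page categorized only under `'French cuisine'` is rejected.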
After that, we also check whether the content of the webpage has already been seen among previously fetched webpages. The goal is to avoid crawling duplicated webpages and to reduce the size of the crawled data.
There are two scenarios of duplicated content:
1. different URLs but exactly the same content
2. different URLs but almost the same content, e.g.
   - https://en.wikipedia.org/wiki/Template:Web-stub
   - https://en.wikipedia.org/wiki/Template:KosciuskoCountyIN-geo-stub
The webpages contain many hyperlinks: not only do we use these links to build the A matrix in the PageRank part, but they are also used for further crawling. Since we only care about wiki webpages, we filter out links to other hosts and drop disallowed links according to the robots.txt rules.
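The link filter can be sketched like this. The helper name `keep_link` and the `rule_checker` callback are my own illustration (the actual project consults Wikipedia's robots.txt; the callback could be, e.g., `RobotFileParser.can_fetch` from `urllib.robotparser`):

```python
from urllib.parse import urlparse

def keep_link(url, rule_checker=None):
    """Keep only links on the English-Wikipedia host; optionally consult
    a robots.txt rule checker before accepting the link."""
    parsed = urlparse(url)
    if parsed.netloc != 'en.wikipedia.org':
        return False                      # other hosts are out of scope
    if rule_checker is not None and not rule_checker(url):
        return False                      # disallowed by robots.txt rules
    return True
```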
With all the actions above, we save `webpage.url`, `webpage.title`, `webpage.content`, and `webpage.outlinks` (hyperlinks) to a file, which will be processed in the indexer part.
Used Tech
- Simhash
To implement the content check, I use an external library called SimHash. The key idea of SimHash is to convert the weights of each term into hash values and obtain a fixed-length fingerprint for each document.
PageRank
With the fetched URLs and their hyperlinks from the crawling part, we can easily build the A matrix using Markov chains and refine it with Teleporting to avoid Dead Ends and Spider Traps.
Fig. 5 was originally posted in Lecture 12: Crawling and Link Analysis.
After that, we calculate the array `a` using the PageRank algorithm to represent the probability of each webpage. These probabilities are used in the searcher part.
a = array([[8.30168216e-05, 8.30168216e-05, 1.57926651e-04, ... 3.82857116e-04, 3.82857116e-04, 4.64916368e-04]])
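A minimal power-iteration sketch of this step, assuming a damping factor of 0.85 (a conventional default, not necessarily the one used in the project; the function name and the `links` adjacency format are my own):

```python
import numpy as np

def pagerank(links, n, alpha=0.85, iters=100):
    """Sketch of PageRank with teleporting. links[i] lists the pages that
    page i points to; dead ends (no out-links) teleport uniformly."""
    A = np.full((n, n), 1.0 / n)              # dead-end rows: pure teleport
    for i, outs in links.items():
        if outs:
            A[i] = (1 - alpha) / n            # teleporting mass
            for j in outs:
                A[i][j] += alpha / len(outs)  # follow an out-link
    a = np.full(n, 1.0 / n)                   # start from uniform distribution
    for _ in range(iters):
        a = a @ A                             # one Markov-chain step
    return a
```

Because every row of the refined A matrix sums to 1, the resulting `a` is a probability distribution over the crawled pages.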
Indexer
The structure for storing data is shown in Fig. 6.
For each fetched webpage, we remove all punctuation from the title and content, stem the words, and convert all characters to lower case. We store the doc_id, term frequency, and positions in the postings (positions will be used for phrasal queries). To build the A matrix, we fetch the hyperlinks of each webpage and assign values according to the doc_id and the index of each hyperlink.
After processing the webpages, we refine the A matrix and calculate the array `a`. Then we save everything (array `a`, dictionary, and postings) to file.
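The positional-postings part of the indexer can be sketched as follows (a simplified illustration: stemming is omitted, and the function name `index_doc` is mine, not from the project):

```python
import re
from collections import defaultdict

def index_doc(doc_id, text, postings):
    """Lowercase, strip punctuation, then record a (doc_id, tf, positions)
    entry per term so phrasal queries can be answered later."""
    tokens = re.findall(r'[a-z0-9]+', text.lower())
    positions = defaultdict(list)
    for pos, term in enumerate(tokens):
        positions[term].append(pos)
    for term, pos_list in positions.items():
        postings[term].append((doc_id, len(pos_list), pos_list))

postings = defaultdict(list)
index_doc(1, "Computer science, the science of computing.", postings)
```

Here the term `science` ends up with the entry `(1, 2, [1, 3])`: document 1, term frequency 2, at token positions 1 and 3.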
Used Tech
- zone and field
As we can see, in each webpage the title carries the most important information about the page. Thus, indexing titles is also important when users want to search for something like "find docs with merchant in the title zone and matching the query gentle rain".
As for the field part, I don't see any useful information such as date_published or court, so I don't process fields.
Searcher
Given the user's input as a query, we tokenize it first. With the dictionary.txt and postings.txt stored before, we construct the query vector and document vectors using the vector space model. After calculating the similarity between each document and the query, we normalize these scores and combine them with the values in array `a`. In general, I treat the normalized similarity as a probability and average it with the PageRank probability. Finally, all related webpages are shown in sorted order.
Used Tech
- query expansion
In query expansion, additional input terms are added based on synonyms. After tokenizing the query in the searcher part, I call the `_expand()` function to add synonyms.
Within `_expand()`, for each token we call `wordnet.synsets(token)` to get a set of synonyms. To get more precise answers, we check every synonym against the dictionary: only synonyms that exist in the dictionary are added to the query.
Given the query `computer science`, the result after expansion is `calculate computer science skill`.
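The expansion logic can be sketched as follows. In the actual project the synonyms come from NLTK's `wordnet.synsets(token)`; here a small stub lookup table stands in for WordNet so the sketch runs without the corpus (the `SYNONYMS` entries are illustrative, not the real WordNet output):

```python
# Stand-in for wordnet.synsets(token); the real project queries WordNet.
SYNONYMS = {'computer': ['calculate'], 'science': ['skill']}

def expand(tokens, dictionary):
    """Append each synonym that also exists in the index dictionary,
    so expansion never introduces un-indexed terms."""
    expanded = list(tokens)
    for token in tokens:
        for syn in SYNONYMS.get(token, []):
            if syn in dictionary and syn not in expanded:
                expanded.append(syn)
    return expanded
```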
- relevance feedback
In this part, we use the Rocchio algorithm. Assuming that the top 5 documents in the original result are highly relevant to the query, we use these document vectors to modify the query vector and search again.
As you can see in Fig. 8, if the relevant docs are 0 and 5, the result changes and the top 2 results become 0 and 5. The scores of all documents change as well.
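One Rocchio update can be sketched like this, with sparse `{term: weight}` vectors. The `alpha`/`beta` values are conventional defaults, not necessarily the ones used in the project (the negative-feedback term is dropped, since only assumed-relevant documents are used here):

```python
def rocchio(query_vec, relevant_docs, alpha=1.0, beta=0.75):
    """Move the query toward the centroid of the documents assumed
    relevant (here, the top 5 of the first result list)."""
    centroid = {}
    for doc in relevant_docs:
        for term, w in doc.items():
            centroid[term] = centroid.get(term, 0.0) + w / len(relevant_docs)
    terms = set(query_vec) | set(centroid)
    return {t: alpha * query_vec.get(t, 0.0) + beta * centroid.get(t, 0.0)
            for t in terms}
```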
Limitation
- Less crawling data
In this mini-project, I only crawl 2000 webpages about Computer Science. Given how large the Wikipedia corpus is, plenty of CS-related pages remain unprocessed. For example, my dataset does not even include the wiki pages for "Information Retrieval" or "Computer Science". Therefore, the results for such queries may not be very convincing.
- Unsatisfactory A matrix
To build a good A matrix, every fetched webpage should contain as many hyperlinks as possible. However, when I check the A matrix, only 127 webpages have values, which means all the remaining pages are dead ends. With this drawback, the values in array `a` are not very meaningful.
Author
👤 Evan — Crawling, PageRank, Indexer, query expansion
👤 Rulin — Searcher, relevance feedback
- Github: @XJDKC
Reference
Implementation
Source code
- Crawl wiki webpages
python3 crawling.py

Parameter setting:

```python
url = "https://en.wikipedia.org/wiki/Outline_of_computer_science"
crawl = Crawl(url, 'robots.txt')
crawl.get_rules()
crawl.crawl('./webpage', num=2000)
```

- `url`: root link for crawling
- `Crawl(url, 'robots.txt')`: robots.txt of Wikipedia
- `crawl.crawl('./webpage', num=2000)`: folder for saved webpages and number of webpages to crawl
- Index
python3 index.py -i directory-of-documents -d dictionary-file -p postings-file

- documents to be indexed are stored in `directory-of-documents`
- `dictionary-file`: this file will store `array a`, `average length of total docs`, `info of docs`, and `dictionary`
- `postings-file`: this file will store the `tf`, `doc_id`, and `position` of each term
- Search
python3 search.py -d dictionary-file -p postings-file -q query-file -o output-file-of-results

- `dictionary-file`: generated by index.py; contains `array a`, `average length of total docs`, `info of docs`, and `dictionary`
- `postings-file`: generated by index.py; contains the `tf`, `doc_id`, and `position` of each term
- `query-file`: contains several queries to be tested
- `output-file-of-results`: contains the results and corresponding URLs for the queries in queries.txt
Advanced options
- `-e`: enable query expansion
- `-f`: disable relevance feedback
- `-s`: enable printing scores