Web Search Engine for Computer Science
Crawl Wikipedia webpages for Computer Science and use them to build a small web search engine
Description
In this project, I first crawl Wikipedia webpages related to Computer Science. These fetched webpages are then used to build a small web search engine using multiple methods covered in CS3245.
Crawling
The whole crawling pipeline is shown in Fig. 1, which was originally posted in Lecture 12: Crawling and Link Analysis.
To make a small search engine for Computer Science with high webpage quality, I set https://en.wikipedia.org/wiki/Outline_of_computer_science as the root link for crawling.
After fetching and parsing a webpage, we check whether its category is related to computer science. At the bottom of each wiki page there is a Categories tag that can easily be selected for filtering out non-computer-science-related pages.
Based on that, I set some keywords for filtering webpages. Here are some shortened category words:
category_list = ['info', 'com', 'tech', 'sci', 'dev', 'alg', 'learn', 'eng', 'math', 'oper', 'data']
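The filtering step can be sketched as follows. This is a hypothetical reconstruction (the helper name `is_cs_related` is mine, not from the project): a page is kept only if one of its Categories entries contains a keyword from `category_list`.

```python
def is_cs_related(categories, category_list):
    """Keep a page only if at least one of its category names contains
    one of the shortened keywords (case-insensitive substring match)."""
    return any(key in cat.lower()
               for cat in categories
               for key in category_list)
```

For example, `is_cs_related(['Computer science'], category_list)` matches on the `'com'` and `'sci'` keywords, while a page categorized only under `'French cuisine'` is rejected.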
After that, we also check whether the content of the webpage has already been seen among previously fetched webpages. The goal is to avoid crawling duplicated webpages and to reduce the size of the crawled data.
There are two scenarios of duplicated content:
1. different URLs but exactly the same content
2. different URLs but almost the same content, e.g.
   - https://en.wikipedia.org/wiki/Template:Web-stub
   - https://en.wikipedia.org/wiki/Template:KosciuskoCountyIN-geo-stub
The webpages contain many hyperlinks: not only do we use these links to build the A matrix in the PageRank part, but they are also used for further crawling. Since we only care about wiki webpages, we filter out links to other hosts and drop disallowed links according to the robots.txt rules.
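The link filter can be sketched like this. The helper name `keep_link` and the `rule_checker` callback are my own illustration (the actual project consults Wikipedia's robots.txt; the callback could be, e.g., `RobotFileParser.can_fetch` from `urllib.robotparser`):

```python
from urllib.parse import urlparse

def keep_link(url, rule_checker=None):
    """Keep only links on the English-Wikipedia host; optionally consult
    a robots.txt rule checker before accepting the link."""
    parsed = urlparse(url)
    if parsed.netloc != 'en.wikipedia.org':
        return False                      # other hosts are out of scope
    if rule_checker is not None and not rule_checker(url):
        return False                      # disallowed by robots.txt rules
    return True
```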
With all the actions above, we save `webpage.url`, `webpage.title`, `webpage.content`, and `webpage.outlinks` (hyperlinks) to a file, which will be processed in the indexer part.
Used Tech
- Simhash
To implement the content check, I use an external library called SimHash. The key idea of SimHash is to convert the weights of each term into hash values and obtain a fixed-length fingerprint for each document.
PageRank
With the fetched URLs and their hyperlinks from the crawling part, we can easily build the A matrix using Markov chains and refine it with Teleporting to avoid Dead Ends and Spider Traps.
Fig. 5 was originally posted in Lecture 12: Crawling and Link Analysis.
After that, we calculate the array `a` using the PageRank algorithm to represent the probability of each webpage. These probabilities are used in the searcher part.
a = array([[8.30168216e-05, 8.30168216e-05, 1.57926651e-04, ... 3.82857116e-04, 3.82857116e-04, 4.64916368e-04]])
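A minimal power-iteration sketch of this step, assuming a damping factor of 0.85 (a conventional default, not necessarily the one used in the project; the function name and the `links` adjacency format are my own):

```python
import numpy as np

def pagerank(links, n, alpha=0.85, iters=100):
    """Sketch of PageRank with teleporting. links[i] lists the pages that
    page i points to; dead ends (no out-links) teleport uniformly."""
    A = np.full((n, n), 1.0 / n)              # dead-end rows: pure teleport
    for i, outs in links.items():
        if outs:
            A[i] = (1 - alpha) / n            # teleporting mass
            for j in outs:
                A[i][j] += alpha / len(outs)  # follow an out-link
    a = np.full(n, 1.0 / n)                   # start from uniform distribution
    for _ in range(iters):
        a = a @ A                             # one Markov-chain step
    return a
```

Because every row of the refined A matrix sums to 1, the resulting `a` is a probability distribution over the crawled pages.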
Indexer
The structure for storing data is shown in Fig. 6.
For each fetched webpage, we remove all punctuation from the title and content, stem the words, and convert all characters to lower case. We store the doc_id, term frequency, and positions in the postings (positions will be used for phrasal queries). To build the A matrix, we fetch the hyperlinks of each webpage and assign values according to the doc_id and the index of each hyperlink.
After processing the webpages, we refine the A matrix and calculate the array `a`. Then we save everything (array `a`, dictionary, and postings) to file.
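The positional-postings part of the indexer can be sketched as follows (a simplified illustration: stemming is omitted, and the function name `index_doc` is mine, not from the project):

```python
import re
from collections import defaultdict

def index_doc(doc_id, text, postings):
    """Lowercase, strip punctuation, then record a (doc_id, tf, positions)
    entry per term so phrasal queries can be answered later."""
    tokens = re.findall(r'[a-z0-9]+', text.lower())
    positions = defaultdict(list)
    for pos, term in enumerate(tokens):
        positions[term].append(pos)
    for term, pos_list in positions.items():
        postings[term].append((doc_id, len(pos_list), pos_list))

postings = defaultdict(list)
index_doc(1, "Computer science, the science of computing.", postings)
```

Here the term `science` ends up with the entry `(1, 2, [1, 3])`: document 1, term frequency 2, at token positions 1 and 3.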
Used Tech
- zone and field
As we can see, in each webpage the title carries the most important information about the page. Thus, indexing titles is also important when users want to search for something like "find docs with merchant in the title zone and matching the query gentle rain".
As for the field part, I don't see any useful information such as date_published or court, so I don't process fields.
Searcher
Given the user's input as a query, we tokenize it first. With the dictionary.txt and postings.txt stored before, we construct the query vector and document vectors using the vector space model. After calculating the similarity between each document and the query, we normalize these scores and combine them with the values in array `a`. In general, I treat the normalized similarity as a probability and average it with the PageRank probability. Finally, all related webpages are shown in sorted order.
Used Tech
- query expansion
In query expansion, additional input terms are added based on synonyms. After tokenizing the query in the searcher part, I call the `_expand()` function to add synonyms.
Within `_expand()`, for each token we call `wordnet.synsets(token)` to get a set of synonyms. To get more precise answers, we check every synonym against the dictionary: only synonyms that exist in the dictionary are added to the query.
Given the query `computer science`, the result after expansion is `calculate computer science skill`.
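The expansion logic can be sketched as follows. In the actual project the synonyms come from NLTK's `wordnet.synsets(token)`; here a small stub lookup table stands in for WordNet so the sketch runs without the corpus (the `SYNONYMS` entries are illustrative, not the real WordNet output):

```python
# Stand-in for wordnet.synsets(token); the real project queries WordNet.
SYNONYMS = {'computer': ['calculate'], 'science': ['skill']}

def expand(tokens, dictionary):
    """Append each synonym that also exists in the index dictionary,
    so expansion never introduces un-indexed terms."""
    expanded = list(tokens)
    for token in tokens:
        for syn in SYNONYMS.get(token, []):
            if syn in dictionary and syn not in expanded:
                expanded.append(syn)
    return expanded
```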
- relevance feedback
In this part, we use the Rocchio algorithm. Assuming that the top 5 documents in the original result are highly relevant to the query, we use these document vectors to modify the query vector and search again.
As you can see in Fig. 8, if the relevant docs are 0 and 5, the result changes and the top 2 results become 0 and 5. The scores of all documents change as well.
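One Rocchio update can be sketched like this, with sparse `{term: weight}` vectors. The `alpha`/`beta` values are conventional defaults, not necessarily the ones used in the project (the negative-feedback term is dropped, since only assumed-relevant documents are used here):

```python
def rocchio(query_vec, relevant_docs, alpha=1.0, beta=0.75):
    """Move the query toward the centroid of the documents assumed
    relevant (here, the top 5 of the first result list)."""
    centroid = {}
    for doc in relevant_docs:
        for term, w in doc.items():
            centroid[term] = centroid.get(term, 0.0) + w / len(relevant_docs)
    terms = set(query_vec) | set(centroid)
    return {t: alpha * query_vec.get(t, 0.0) + beta * centroid.get(t, 0.0)
            for t in terms}
```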
Limitation
- Less crawling data
In this mini-project, I only crawl 2000 webpages about Computer Science. Given how large the Wikipedia corpus is, plenty of CS-related pages remain unprocessed. For example, my dataset does not even include the wiki pages for "Information Retrieval" or "Computer Science". Therefore, the results for such queries may not be very convincing.
- Unsatisfactory A matrix
To build a good A matrix, every fetched webpage should contain as many hyperlinks as possible. However, when I check the A matrix, only 127 webpages have values, which means all the remaining pages are dead ends. With this drawback, the values in array `a` are not very meaningful.
Author
👤 Evan — Crawling, PageRank, Indexer, query expansion
👤 Rulin — Searcher, relevance feedback
- Github: @XJDKC
Reference
Implementation
Source code
- Crawl wiki webpages
python3 crawling.py

Parameter setting:

```python
url = "https://en.wikipedia.org/wiki/Outline_of_computer_science"
crawl = Crawl(url, 'robots.txt')
crawl.get_rules()
crawl.crawl('./webpage', num=2000)
```

- `url`: root link for crawling
- `Crawl(url, 'robots.txt')`: robots.txt of Wikipedia
- `crawl.crawl('./webpage', num=2000)`: folder for saved webpages and number of webpages to crawl
- Index
python3 index.py -i directory-of-documents -d dictionary-file -p postings-file

- documents to be indexed are stored in `directory-of-documents`
- `dictionary-file`: this file will store `array a`, `average length of total docs`, `info of docs`, and `dictionary`
- `postings-file`: this file will store the `tf`, `doc_id`, and `position` of each term
- Search
python3 search.py -d dictionary-file -p postings-file -q query-file -o output-file-of-results

- `dictionary-file`: generated by index.py; contains `array a`, `average length of total docs`, `info of docs`, and `dictionary`
- `postings-file`: generated by index.py; contains the `tf`, `doc_id`, and `position` of each term
- `query-file`: contains several queries to be tested
- `output-file-of-results`: contains the results and corresponding URLs for the queries in queries.txt
Advanced options
- `-e`: enable query expansion
- `-f`: disable relevance feedback
- `-s`: enable printing scores