Creating a Wikipedia Crawler Using Python
Task: recursively follow the first link of each Wikipedia article until we reach the Philosophy page (https://en.wikipedia.org/wiki/Wikipedia_talk:Getting_to_Philosophy), (http://www.huffingtonpost.in/entry/wikipedia-philosophy_n_1093460)
We will fetch the HTML using the Python Requests module.
Using Python to get HTML
Refer: Requests
1) First, go to the command line and install requests with pip:
$ pip3 install requests
Then, in the Python interpreter:
>>> import requests
>>> response = requests.get('https://en.wikipedia.org/wiki/Napoleon')
>>> print(response.text)
>>> print(type(response.text))
(Replace the URL with a page of your own choice; requests.get downloads the whole HTML document, and response.text holds it as a string.)
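If you want the request to fail loudly on HTTP errors, a minimal sketch is to wrap it in a helper (the function name fetch_html is my own; raise_for_status() is part of the Requests API and raises requests.HTTPError for 4xx/5xx responses):
import requests
def fetch_html(url):
    # Fetch a page; raise requests.HTTPError on 4xx/5xx responses
    response = requests.get(url)
    response.raise_for_status()
    return response.text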
2) Now use the Beautiful Soup module of Python to parse the HTML code.
Run this on the command line to install the Beautiful Soup module:
$ pip3 install beautifulsoup4
Beautiful Soup is a Python library for pulling data out of HTML and XML files. To parse the content, we simply create a BeautifulSoup object from it; that gives us a soup object for the page we fetched.
>>> from bs4 import BeautifulSoup
>>> html = response.text
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.p.a
soup.p finds the first paragraph (the p tag in the HTML), and .a returns the first 'a' (anchor) tag inside it. (This assumes you know basic HTML and CSS.)
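Putting steps 1 and 2 together, a minimal sketch might look like this (the function name first_anchor is my own):
import requests
from bs4 import BeautifulSoup
def first_anchor(url):
    # Download the page and return the first <a> tag of its first paragraph
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html.parser')
    return soup.p.a
print(first_anchor('https://en.wikipedia.org/wiki/Napoleon'))
Note that soup.p.a is None when the first paragraph has no link, so real code should check for that (the full crawler below does).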
3) So far we have learned how to make an HTTP request and parse the result using the Beautiful Soup module.
Moving back to our task: crawl Wikipedia pages until we reach the Philosophy page or the number of visits exceeds 25.
IMP: If we hit Wikipedia pages one after another in a tight while loop, Wikipedia's server may block the program. Slow things down so as to not hammer Wikipedia's servers; for that we will sleep for 2 seconds between requests.
soup.find_all(<tag>, recursive=False)
find_all returns a list of matching tags; with recursive=False it only considers direct children instead of searching the whole subtree.
time.sleep(2) will be used to pause the program between requests.
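To see what recursive=False changes, here is a tiny self-contained example (the sample HTML is made up for illustration):
from bs4 import BeautifulSoup
doc = BeautifulSoup("<div><p>direct</p><section><p>nested</p></section></div>", "html.parser")
div = doc.div
print(len(div.find_all("p")))                   # 2: searches the whole subtree
print(len(div.find_all("p", recursive=False)))  # 1: only direct children of the div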
import time
import urllib.parse

import bs4
import requests

start_url = "https://en.wikipedia.org/wiki/Special:Random"
target_url = "https://en.wikipedia.org/wiki/Narendra_Modi"

def find_first(url):
    response = requests.get(url)
    html = response.text
    soup = bs4.BeautifulSoup(html, "html.parser")
    # The article body sits inside div#mw-content-text > div.mw-parser-output
    content_div = soup.find(id="mw-content-text").find(class_="mw-parser-output")
    article_link = None
    # Find all the direct children of content_div that are paragraphs
    for element in content_div.find_all("p", recursive=False):
        if element.find("a", recursive=False):
            article_link = element.find("a", recursive=False).get('href')
            break
    if not article_link:
        return
    # hrefs are relative, e.g. "/wiki/France"; turn them into absolute URLs
    first_link = urllib.parse.urljoin('https://en.wikipedia.org/', article_link)
    return first_link

def continue_crawl(search_history, target_url, max_steps=25):
    if search_history[-1] == target_url:
        print("We've found the target article!")
        return False
    elif len(search_history) > max_steps:
        print("The search has gone on suspiciously long, aborting search!")
        return False
    elif search_history[-1] in search_history[:-1]:
        print("We've arrived at an article we've already seen, aborting search!")
        return False
    else:
        return True

article_chain = [start_url]
while continue_crawl(article_chain, target_url):
    print(article_chain[-1])
    first_link = find_first(article_chain[-1])
    if not first_link:
        print("We've arrived at an article with no links, aborting search!")
        break
    article_chain.append(first_link)
    time.sleep(2)  # slow down so as to not hammer Wikipedia's servers
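Save the script (the filename wiki_crawler.py is just a suggestion) and run it from the command line:
$ python3 wiki_crawler.py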
OUTPUT: run the code and watch the chain of visited article URLs print out, one per step, until the crawl reaches the target or aborts.