Abhishek Jain · Published in Tech Insider · Feb 12, 2018 · 3 min read


Wikipedia Crawler

Creating Wikipedia Crawler Using Python

Task: recursively follow the first link of each Wikipedia article until you reach the Philosophy page (https://en.wikipedia.org/wiki/Wikipedia_talk:Getting_to_Philosophy, http://www.huffingtonpost.in/entry/wikipedia-philosophy_n_1093460).

We will fetch the HTML using Python's Requests module.

Using Python to get HTML

Refer: Requests

1) First, go to the command line and install Requests with pip:

$ pip3 install requests

>>> import requests
>>> response = requests.get('https://en.wikipedia.org/wiki/Napoleon')
>>> print(response.text)
>>> print(type(response.text))

(Replace the URL with a page of your own choice; the request will download the whole HTML page and store it as a string.)
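As a quick sanity check (a minimal sketch using only Requests' documented response attributes), you can confirm the request succeeded before using the body:

>>> print(response.status_code)   # 200 means the request succeeded
>>> print(response.ok)            # True for any non-error status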

2) Now use the Beautiful Soup module of Python to parse the HTML code:

$ pip3 install beautifulsoup4

Run this on the command line to install the Beautiful Soup module.

Beautiful Soup is a Python library for pulling data out of HTML and XML files. To parse the content, we simply create a BeautifulSoup object for it; that gives us a soup object for the content of the URL we passed in.

>>> from bs4 import BeautifulSoup
>>> html = response.text
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.p.a

This returns the first paragraph (the p tag in the HTML) and, from it, the first 'a' tag (this assumes you know basic HTML and CSS).
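For example, from that first anchor tag you can pull out the link target and the visible text (a small sketch continuing the Napoleon example above):

>>> first_anchor = soup.p.a           # same as soup.find('p').find('a')
>>> print(first_anchor.get('href'))   # the relative link, e.g. '/wiki/...'
>>> print(first_anchor.get_text())    # the anchor's visible text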

3) So far we have learned to make an HTTP request and parse the response using the Beautiful Soup module.

Moving back to our task: crawl from one Wikipedia page to the next until you reach the target URL page or the number of visits exceeds 25.

IMP: If we hit Wikipedia pages one after another in a tight while loop, Wikipedia's servers may block the program. Slow things down so as not to hammer Wikipedia's servers; for that we will sleep for 2 seconds between requests.

soup.find_all(<tag>, recursive=False)

This gives a list of all matching tags; the recursive=False parameter restricts the search to direct children instead of all descendants.

time.sleep(2) will be used to pause the program between requests.
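To see what recursive=False actually changes, here is a tiny self-contained sketch (the HTML snippet is made up for illustration):

>>> from bs4 import BeautifulSoup
>>> div = BeautifulSoup("<div><p>direct</p><section><p>nested</p></section></div>", "html.parser").div
>>> [p.get_text() for p in div.find_all("p")]
['direct', 'nested']
>>> [p.get_text() for p in div.find_all("p", recursive=False)]
['direct']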

import time
import urllib.parse

import bs4
import requests


start_url = "https://en.wikipedia.org/wiki/Special:Random"
target_url = "https://en.wikipedia.org/wiki/Narendra_Modi"

def find_first(url):
    response = requests.get(url)
    html = response.text
    soup = bs4.BeautifulSoup(html, "html.parser")

    # The article body of every Wikipedia page sits inside this container div
    content_div = soup.find(id="mw-content-text").find(class_="mw-parser-output")

    article_link = None

    # Find all the direct children of content_div that are paragraphs
    for element in content_div.find_all("p", recursive=False):
        # Take the first direct-child <a> tag inside the paragraph, if any
        if element.find("a", recursive=False):
            article_link = element.find("a", recursive=False).get('href')
            break

    if not article_link:
        return

    # hrefs on Wikipedia are relative (e.g. "/wiki/France"), so make them absolute
    first_link = urllib.parse.urljoin('https://en.wikipedia.org/', article_link)

    return first_link

def continue_crawl(search_history, target_url, max_steps=25):
    if search_history[-1] == target_url:
        print("We've found the target article!")
        return False
    elif len(search_history) > max_steps:
        print("The search has gone on suspiciously long, aborting search!")
        return False
    elif search_history[-1] in search_history[:-1]:
        print("We've arrived at an article we've already seen, aborting search!")
        return False
    else:
        return True

article_chain = [start_url]

while continue_crawl(article_chain, target_url):
    print(article_chain[-1])

    first_link = find_first(article_chain[-1])
    if not first_link:
        print("We've arrived at an article with no links, aborting search!")
        break

    article_chain.append(first_link)

    time.sleep(2)  # slow down so we don't hammer Wikipedia's servers
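As a quick usage check, you can exercise continue_crawl on its own with toy histories (the article URLs below are made-up placeholders, not real crawl output):

a = "https://en.wikipedia.org/wiki/A"
b = "https://en.wikipedia.org/wiki/B"

print(continue_crawl([a, target_url], target_url))  # False: target reached
print(continue_crawl([a, b, a], target_url))        # False: loop detected
print(continue_crawl([a, b], target_url))           # True: keep crawling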

OUTPUT: run this code and watch the chain of article URLs print out. (The GIF on Wikipedia's Getting to Philosophy page animates the same idea.)

Credits and References :

  1. Udacity Machine Learning Foundation Nanodegree
  2. http://docs.python-requests.org/en/master/
  3. http://www.pythonforbeginners.com/beautifulsoup/beautifulsoup-4-python
  4. https://en.wikipedia.org/wiki/Wikipedia:Getting_to_Philosophy
  5. http://www.huffingtonpost.in/entry/wikipedia-philosophy_n_1093460
