Creating a Wikipedia Crawler Using Python
Task: recursively follow the first link of each Wikipedia article until we reach the Philosophy page (https://en.wikipedia.org/wiki/Wikipedia_talk:Getting_to_Philosophy), (http://www.huffingtonpost.in/entry/wikipedia-philosophy_n_1093460)
We will fetch the HTML using the Python Requests module.
Using Python to get HTML
Refer: Requests
1) First, go to the command line and install requests with pip:
$ pip3 install requests
Then, in the Python interpreter:
>>> import requests
>>> response = requests.get('https://en.wikipedia.org/wiki/Napoleon')
>>> print(response.text)
>>> print(type(response.text))
(Replace the URL with a page of your own choice; requests.get downloads the whole HTML document, and response.text holds it as a string.)
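If you want the request to fail loudly on HTTP errors, a minimal sketch is to wrap it in a helper (the function name fetch_html is my own; raise_for_status() is part of the Requests API and raises requests.HTTPError for 4xx/5xx responses):
import requests
def fetch_html(url):
    # Fetch a page; raise requests.HTTPError on 4xx/5xx responses
    response = requests.get(url)
    response.raise_for_status()
    return response.text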
2) Now use the Beautiful Soup module of Python to parse the HTML code.
Run this on the command line to install the Beautiful Soup module:
$ pip3 install beautifulsoup4
Beautiful Soup is a Python library for pulling data out of HTML and XML files. To parse the content, we simply create a BeautifulSoup object from it; that gives us a soup object for the page we fetched.
>>> from bs4 import BeautifulSoup
>>> html = response.text
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.p.a
soup.p finds the first paragraph (the p tag in the HTML), and .a returns the first 'a' (anchor) tag inside it. (This assumes you know basic HTML and CSS.)
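Putting steps 1 and 2 together, a minimal sketch might look like this (the function name first_anchor is my own):
import requests
from bs4 import BeautifulSoup
def first_anchor(url):
    # Download the page and return the first <a> tag of its first paragraph
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html.parser')
    return soup.p.a
print(first_anchor('https://en.wikipedia.org/wiki/Napoleon'))
Note that soup.p.a is None when the first paragraph has no link, so real code should check for that (the full crawler below does).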
3) So far we have learned how to make an HTTP request and parse the result using the Beautiful Soup module.
Moving back to our task: crawl Wikipedia pages until we reach the Philosophy page or the number of visits exceeds 25.
IMP: If we hit Wikipedia pages one after another in a tight while loop, Wikipedia's server may block the program. Slow things down so as to not hammer Wikipedia's servers; for that we will sleep for 2 seconds between requests.
soup.find_all(<tag>, recursive=False)
find_all returns a list of matching tags; with recursive=False it only considers direct children instead of searching the whole subtree.
time.sleep(2) will be used to pause the program between requests.
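To see what recursive=False changes, here is a tiny self-contained example (the sample HTML is made up for illustration):
from bs4 import BeautifulSoup
doc = BeautifulSoup("<div><p>direct</p><section><p>nested</p></section></div>", "html.parser")
div = doc.div
print(len(div.find_all("p")))                   # 2: searches the whole subtree
print(len(div.find_all("p", recursive=False)))  # 1: only direct children of the div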
import time
import urllib.parse

import bs4
import requests

start_url = "https://en.wikipedia.org/wiki/Special:Random"
target_url = "https://en.wikipedia.org/wiki/Narendra_Modi"

def find_first(url):
    response = requests.get(url)
    html = response.text
    soup = bs4.BeautifulSoup(html, "html.parser")
    # The article body sits inside div#mw-content-text > div.mw-parser-output
    content_div = soup.find(id="mw-content-text").find(class_="mw-parser-output")
    article_link = None
    # Find all the direct children of content_div that are paragraphs
    for element in content_div.find_all("p", recursive=False):
        if element.find("a", recursive=False):
            article_link = element.find("a", recursive=False).get('href')
            break
    if not article_link:
        return
    # hrefs are relative, e.g. "/wiki/France"; turn them into absolute URLs
    first_link = urllib.parse.urljoin('https://en.wikipedia.org/', article_link)
    return first_link

def continue_crawl(search_history, target_url, max_steps=25):
    if search_history[-1] == target_url:
        print("We've found the target article!")
        return False
    elif len(search_history) > max_steps:
        print("The search has gone on suspiciously long, aborting search!")
        return False
    elif search_history[-1] in search_history[:-1]:
        print("We've arrived at an article we've already seen, aborting search!")
        return False
    else:
        return True

article_chain = [start_url]
while continue_crawl(article_chain, target_url):
    print(article_chain[-1])
    first_link = find_first(article_chain[-1])
    if not first_link:
        print("We've arrived at an article with no links, aborting search!")
        break
    article_chain.append(first_link)
    time.sleep(2)  # slow down so as to not hammer Wikipedia's servers
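Save the script (the filename wiki_crawler.py is just a suggestion) and run it from the command line:
$ python3 wiki_crawler.py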
OUTPUT: run the code and watch the chain of visited article URLs print out, one per step, until the crawl reaches the target or aborts.