HTML Scraping 101

Published in

quaintitative

1 min read · Jul 28, 2018

[Post is also available at quaintitative.com]

More often than not, easy access to data via an API is not possible. Scraping the webpage might then be the only practical way to get at the data. Doing this in Python is fairly straightforward with the help of a couple of libraries and a basic understanding of HTML.

We first import the following libraries:

Beautiful Soup (bs4), to help us tease the good stuff out of HTML and XML
requests, to help us make HTTP requests to specific webpages

import bs4       # Beautiful Soup, for parsing HTML
import requests  # for making HTTP requests

Next, specify the webpage you would like to scrape data from. Here we use Wikipedia's category of lists of visual art topics.

websource = ['https://en.wikipedia.org/wiki/Category:Lists_of_visual_art_topics']

Some string manipulation first. I need the Wikipedia address later, so simply split the string at '/wiki' and take the first item that is returned.

domain = websource[0].split("/wiki")[0]

You will get this.

'https://en.wikipedia.org'
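Splitting on '/wiki' works for this address, but it relies on that exact path. As a minimal sketch of an alternative (not part of the original notebook), the standard library's urlsplit can pull out the scheme and host regardless of what the path looks like:

```python
from urllib.parse import urlsplit

websource = ['https://en.wikipedia.org/wiki/Category:Lists_of_visual_art_topics']

# Break the URL into its components and keep only scheme + host
parts = urlsplit(websource[0])
domain = f"{parts.scheme}://{parts.netloc}"

print(domain)  # https://en.wikipedia.org
```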

Next, we get the content of the page in websource.

html = requests.get(websource[0]).content
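The one-liner above assumes the request succeeds. A slightly more defensive sketch might look like the following; the helper name fetch_html, the 10-second timeout, and the injectable get parameter are my own illustrative choices, not something from the original notebook:

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the raw bytes of the page at `url` (hypothetical helper).

    `get` defaults to requests.get; it is a parameter only so the
    function can be exercised without touching the network.
    """
    response = get(url, timeout=timeout)
    # Raise requests.HTTPError on a 4xx/5xx status instead of
    # silently parsing an error page
    response.raise_for_status()
    return response.content
```

With this in place, `html = fetch_html(websource[0])` would stand in for the line above.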

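Then we parse it with Beautiful Soup, using the 'html5lib' parser (a separate package that has to be installed alongside bs4). This lets us get at all the links that carry an href attribute using the findAll function. In current versions of Beautiful Soup the same function is named find_all, and findAll is kept as an older alias.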

soup = bs4.BeautifulSoup(html, 'html5lib')
links = set(soup.findAll('a', href=True))
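To see what this step returns without hitting the network, here is a small sketch run on a made-up HTML snippet. The snippet and its hrefs are hypothetical, and I use the built-in 'html.parser' instead of 'html5lib' so that no extra package is needed:

```python
import bs4

# Hypothetical stand-in for a downloaded page
sample_html = """
<html><body>
  <a href="/wiki/List_of_mathematical_artists">Mathematical artists</a>
  <a href="/wiki/List_of_sculptors">Sculptors</a>
  <a name="top">An anchor without an href</a>
</body></html>
"""

soup = bs4.BeautifulSoup(sample_html, 'html.parser')

# href=True keeps only <a> tags that actually have an href attribute;
# collecting the href strings in a set removes duplicates
hrefs = sorted({a['href'] for a in soup.find_all('a', href=True)})

print(hrefs)
```

The third tag is skipped because it has no href, which is why looking up link['href'] later on is safe.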

Next, we loop over the links, pick out any whose href contains 'mathematical', and fetch and parse the page it points to. This is the page with the list of mathematical artists.

for link in links:
    # matching hrefs are site-relative ('/wiki/...'), so prepend the domain
    if 'mathematical' in link['href']:
        page = requests.get(domain + link['href']).content
        clean_page = bs4.BeautifulSoup(page, 'html5lib')
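Concatenating the domain and the href works when the href starts with '/'. As a hedged alternative sketch, the standard library's urljoin resolves an href against the page it came from. The href values below are hypothetical examples rather than the actual links on the Wikipedia page:

```python
from urllib.parse import urljoin

base_url = 'https://en.wikipedia.org/wiki/Category:Lists_of_visual_art_topics'

# Hypothetical hrefs, as they might be pulled out of the parsed page
hrefs = ['/wiki/List_of_mathematical_artists', '/wiki/List_of_sculptors', '#top']

# Keep only hrefs mentioning 'mathematical' and turn them into full URLs
targets = [urljoin(base_url, h) for h in hrefs if 'mathematical' in h.lower()]

print(targets)  # ['https://en.wikipedia.org/wiki/List_of_mathematical_artists']
```

Collecting the matches in a list also keeps every matching page, whereas the loop above overwrites page and clean_page on each match.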

The Jupyter notebook with the code is here

Written by playgrdstar