HTML Scraping 101
[Post is also available at quaintitative.com]
More often than not, easy access to data via an API is not possible. Scraping the webpage might then be the only practical way to get at the data. Doing this in Python is fairly straightforward, with the help of some libraries, and a basic understanding of HTML.
We first import the following libraries -
- Beautiful Soup (bs4), to help us tease the good stuff out from HTML and XML
- requests, to help us make HTTP requests to specific webpages
import bs4
import requests
Next, specify the webpage you would like to scrape data from, say the list of visual art topics on Wikipedia.
websource = ['https://en.wikipedia.org/wiki/Category:Lists_of_visual_art_topics']
Some string manipulation first. I need the Wikipedia domain later, so simply split the string at '/wiki' and take the first item that is returned.
domain = websource[0].split("/wiki")[0]
You will get this.
'https://en.wikipedia.org'
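Splitting on '/wiki' works for this particular URL. As an alternative sketch, the standard library's urllib.parse can pull out the scheme and host without depending on what the path looks like. The variable name parts is only illustrative.

```python
from urllib.parse import urlsplit

websource = ['https://en.wikipedia.org/wiki/Category:Lists_of_visual_art_topics']

# Break the URL into its components (scheme, netloc, path, ...)
parts = urlsplit(websource[0])

# Rebuild just the scheme and host, dropping the path
domain = f"{parts.scheme}://{parts.netloc}"
print(domain)
```

For this URL both approaches give the same string, so which one to use is mostly a matter of taste.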
Next, we get the content of the page in websource.
html = requests.get(websource[0]).content
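The one-liner above assumes the request succeeds. A slightly more defensive sketch, assuming you would rather fail early on an HTTP error or a hung connection, might look like this. The helper name fetch_html and the 10-second timeout are my own choices for illustration, not something from the notebook.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the raw bytes of a page, raising on HTTP errors.

    fetch_html is a hypothetical helper; the timeout value is an assumption.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.content
```

With this in place, html = fetch_html(websource[0]) could stand in for the line above.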
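Then we parse it with BeautifulSoup, which lets us get at all the links using the findAll function (an older alias of find_all). Note that the 'html5lib' parser used here is a separate package and needs to be installed alongside bs4.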
soup = bs4.BeautifulSoup(html, 'html5lib')
links = set(soup.findAll('a', href=True))
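To see what href=True does without hitting the network, here is a small sketch on a hand-written HTML snippet. The snippet and the variable names are made up for illustration, and I use Python's built-in 'html.parser' instead of 'html5lib' only so that it runs without the extra install.

```python
import bs4

# A tiny hand-written snippet standing in for a downloaded page
sample_html = """
<ul>
  <li><a href="/wiki/List_of_mathematical_artists">Mathematical artists</a></li>
  <li><a href="/wiki/List_of_sculptors">Sculptors</a></li>
  <li><a name="top">No href here</a></li>
</ul>
"""

# 'html.parser' ships with Python, so no html5lib is needed for this demo
soup = bs4.BeautifulSoup(sample_html, 'html.parser')

# href=True keeps only <a> tags that actually carry an href attribute
hrefs = sorted(a['href'] for a in soup.find_all('a', href=True))
print(hrefs)
```

The third anchor has no href, so it is left out of the result.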
Next, we look through the links for one whose href contains 'mathematical', which leads to the page with the list of mathematical artists.
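for link in links:
    # Only follow links whose href mentions 'mathematical'
    if 'mathematical' in link['href']:
        # The href is site-relative, so prepend the domain before requesting it
        page = requests.get(domain+link['href']).content
        clean_page = bs4.BeautifulSoup(page, 'html5lib')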
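Concatenating domain and href works when the href is site-relative, as Wikipedia's '/wiki/...' links are. As a hedged alternative, urllib.parse.urljoin also copes with hrefs that are already absolute. The sample hrefs below are invented for illustration, and this sketch collects every match rather than keeping only the last one.

```python
from urllib.parse import urljoin

base = 'https://en.wikipedia.org/wiki/Category:Lists_of_visual_art_topics'

# Hypothetical hrefs of the kind found on a page
hrefs = [
    '/wiki/List_of_mathematical_artists',
    '/wiki/List_of_sculptors',
    'https://example.org/mathematical',
]

# Keep the matching hrefs and resolve each one against the base URL
targets = [urljoin(base, h) for h in hrefs if 'mathematical' in h]
for target in targets:
    print(target)
```

Each entry in targets can then be passed to requests.get in the same way as in the loop above.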
The Jupyter notebook with the code is here