Tackling Forms During a Web Scrape
Continuing my search for data available on the web, I came across a new problem: what if you need to enter a password or fill out a search query before a site takes you to the data you need? There is a Python library called mechanize that can handle this for you. The GitHub page for this library is available here. I’m sure there are more applications for this library than what I used it for, but it solved the problems I ran into while querying a site that had a built-in form.
In my latest search for new datasets I decided it would be cool to scrape the Internet Broadway Database (IBDB)… yes, a dataset of every single Broadway musical IS cool! IBDB provides records of productions from the beginnings of New York theatre until today.
Below is the search query page. I wanted a list of all the Musicals, so I checked Musical and hit search.
This was the output:
Whoopee!! I got what I wanted, but notice the URL: between the query search page and the output, it stayed the same. While I was visually looking at what I wanted, when I tried to request this page in my web scraper I would only get the HTML for the query page.
MECHANIZE TO THE RESCUE!!
If we take a look at the HTML of the search query page, we find a <form> tag. The elements inside this tag give you the key/value pairs you need to fill out the form with mechanize.
Different <div> tags in the form hold the input elements for each box of the form. In my case I wanted to check the box ‘Musical’. Below is the HTML that corresponds to this check box. Note the attributes ‘name’ and ‘value’.
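As a rough illustration of where those two attributes live, here is a minimal sketch that lists the type, name and value of every input in a form. The markup is a made-up stand-in, not IBDB's actual HTML: only ProdTypeMusical and true come from the real form, and the other field names are assumptions. It uses Python's built-in html.parser so it runs without any extra installs.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for the real search form. Only the
# 'ProdTypeMusical' name and 'true' value come from the article; the
# other inputs are invented for illustration.
SAMPLE_FORM = """
<form method="post">
  <div><input type="checkbox" name="ProdTypeMusical" value="true"> Musical</div>
  <div><input type="checkbox" name="ProdTypePlay" value="true"> Play</div>
  <div><input type="text" name="Keyword" value=""></div>
</form>
"""


class InputCollector(HTMLParser):
    """Collect (type, name, value) for every <input> tag seen."""

    def __init__(self):
        super().__init__()
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attributes = dict(attrs)
            self.inputs.append(
                (attributes.get("type"), attributes.get("name"), attributes.get("value"))
            )


collector = InputCollector()
collector.feed(SAMPLE_FORM)
for input_type, name, value in collector.inputs:
    print(input_type, name, value)
```

The name is what you use as the key when filling in the form, and the value is what you assign to it.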
With this information I wrote the following code, using mechanize and BeautifulSoup, to return the HTML that I could then parse for the list items.
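from bs4 import BeautifulSoup
from mechanize import Browser

# Placeholder: replace with the URL of the search page that contains the form.
url = 'URL_OF_THE_SEARCH_PAGE'
br = Browser()
br.open(url)
br.select_form(nr=0)  # assumption: the search form is the first form on the page
br['ProdTypeMusical'] = ['true']  # check the 'Musical' box
response = br.submit().read()
soup = BeautifulSoup(response, 'lxml')
items = soup.find('body').find_all('li')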
In this code a Browser object is created, and then the URL that contains the form to be filled out is requested with the open method. There are multiple ways to select the form. If there are multiple forms on a page, the form may have a name, similar to how tables have specific names. If the form has a name, you can pass it to select_form through the name argument.
If the form has no name, you can select it with the ‘nr’ argument instead. This is the sequence number of the form on the page (where 0 is the first), just like normal indexing. The last thing you add are the parts of the form you want the code to fill in. The syntax is similar to setting key/value pairs in a dictionary: the key is the name and the value is the value we got from the HTML earlier. For a check box, the value is a list containing a single string. After all of the form parts that need to be filled in are set, submit the form, read in the response, and pass it through BeautifulSoup to extract the list items. Now the list items are ready to be parsed for data.
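To recap the two ways of selecting a form, here is a small sketch that wraps both in one function. Note that choose_form is a hypothetical helper I made up for illustration, not part of mechanize; it only relies on the select_form method with its name and nr arguments, and it assumes br is a mechanize Browser that has already opened the page.

```python
def choose_form(br, form_name=None, index=0):
    """Select a form on the page currently loaded in `br`.

    `br` is assumed to be a mechanize.Browser that has already opened a page.
    `choose_form` itself is a hypothetical helper, not part of mechanize.
    """
    if form_name is not None:
        # Select by the form's name attribute.
        br.select_form(name=form_name)
    else:
        # Fall back to the form's position on the page; 0 is the first form.
        br.select_form(nr=index)


# Example usage, assuming `br` has already opened the search page:
#   choose_form(br, form_name="search")   # "search" is a made-up form name
#   choose_form(br)                       # first form on the page
#   br["ProdTypeMusical"] = ["true"]      # then fill in fields as usual
```

Selecting by name is a bit more robust if the site later adds another form above the one you want, while the index works even when the form has no name at all.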
My current notebook can be viewed on my GitHub. If you have any sites with interesting forms, please share in the comments below. Enjoy and Happy Scraping!