No API? No Worries.
--
When Munzi Codes and I first made the Bechdelerator during a hackathon a while ago, we needed to get movie scripts so we could analyze them.
Unfortunately, we soon discovered that there was only one real source for them, and it was a bit of a pain to work with. Here is a screenshot of the site the scripts came from:
There was no API, no easy way to predict what the links to the scripts would be, and no apparent way for users to quickly get a script for a specific movie.
Since we had a limited amount of time, we focused on what would be most impressive for the hackathon: analyzing the movie scripts, making a D3 graph representing conversations between characters in the movie, and predicting whether the movie passed the Bechdel test.
The only way for people to use our site was to manually copy and paste a script from the source website. This was clearly a less-than-ideal solution, and even our own friends and family did not want to use the site.
When we finally decided that the copying and pasting needed to end, we looked for a way to get the information directly from IMSDb. We quickly settled on Cheerio to scrape the data we needed from the site and the Request library to make HTTP requests to it.
Cheerio is awesome because it allows you to essentially write jQuery and use CSS selectors to get content from a website.
For example, here is part of the HTML from the IMSDb website:
To get the data we need, after installing and requiring the Cheerio and Request node modules (you can install them easily with
npm install cheerio
npm install request
), we can do things like this:
The code above loads all of the HTML from IMSDb using the Request library. Next, we need the table that contains all of the links to movie scripts. Since we can see from the page's HTML that this table has a child heading with the text “All Movie Scripts”, we can use
$("h1:contains(\"All Movie Scripts\")").parent();
to get the table element.
Next, since every movie script link has ‘/Movie Scripts’ in its URL, we can use
$('a[href^="/Movie Scripts"]')
to get an array-like object containing every <a>
element that links to a movie script. Note that although this object is array-like, it is not an actual array, which is why we call
Array.prototype.slice.call
on it to convert it into a real array before using array methods like .map
on it.
Now that we have the list of all of the movie script links, we can render a page that uses this data with a template engine (we used Swig).
Once a user selects a script from our list, we use that script's URL and the Cheerio library to get the actual text of the script, which we then analyze with our algorithm.
Cheerio makes scraping HTML pages very simple, since you can use CSS selectors (just as you do in jQuery) to select the page elements you want. It is a great option when working with data sources that have no API or other easy way to access the information you need. I definitely recommend checking it out and adding it to your next project.
Originally published at www.seemaullal.com on June 13, 2015.