Web Scraping using Pitchfork’s “Best New Music”

Stephaniecaress · 9 min read · Aug 14, 2020


Pitchfork’s “Best New Music” header

In a world of targeted ads and constant recommendations, it sometimes feels like I consume media and buy products passively. It is easy to float through the day without noticing who is recommending what you watch, buy, and listen to. As much as I love getting an amazing suggestion on Spotify, I try to be an active seeker of music rather than a receptacle for my own listening habits. Each week I make a “To Listen To” playlist of albums I’ve found from online publications, friends, social media, Shazaming at the grocery store, etc. This blog post explains how you can use Python and web scraping to automate the first option, gathering from online publications, and put all of the information into an easily digestible format. I walk through the process using Pitchfork’s “Best New Music” page as my example, and you can check out the finished notebook here.

Web scraping allows you to download the contents of a webpage and sift through the results. There are many ways to apply this, but I’ve found it most useful when I want to consolidate information that would otherwise take a long time to retrieve manually. For example, when I wanted to create a visual from past Indy 500 race data, I realized gathering all the historical stats dating back to 1911 would be time consuming. Using web scraping, I was able to visit dozens of pages, gather the relevant data points, and organize them into a neat table within seconds.

When I set out to scrape a page, I usually follow the workflow below:

  1. Check for API and site limitations
  2. Decide what content to retrieve
  3. Retrieve information and figure out HTML elements
  4. Store in the desired format

I’ll go into each step in more detail below.

1. Check for API and site limitations

Before you start scraping, you will want to see if there’s an API and whether the site has any limitations. An API (Application Programming Interface) is a more formal way to retrieve the data. It is usually the better alternative because the information is organized and companies often provide documentation with instructions for using their API.

I think of it like this: an API is like a meal kit, with all the pre-measured ingredients set out and instructions to follow. APIs have specific endpoints that tell you exactly what information is stored there. Some even provide examples of what the data will look like when you retrieve it. But this is not to say it will be easy with no bumps along the way (I have messed up many meal kits in my time!). Web scraping is like making a recipe from scratch. It is more open ended and requires more preparation. You need to buy the ingredients, then clean, measure, and cut them all. When you run into issues, it will likely require more creative troubleshooting to get to the solution. Simply Googling “website name API” should bring it up if one exists. But this post is about web scraping, so of course when I checked, there was no official API for Pitchfork.

Before scraping, we need to see if the website has any limitations. You can find this out by adding /robots.txt to the base URL.

https://www.pitchfork.com/robots.txt

Here, you will see a list of users and the permissions the website allows for each. In the case of Pitchfork, you and I fall under the user agent “*” (asterisk means all). According to this, anything in the root directory “/” and not specifically called out is fair game.

Some sites also include limitations on how many requests can be sent. If you exceed this limit, it may raise flags on their end and result in you being blocked. Additionally, many robots.txt files list sitemaps, which are especially helpful if you are trying to scrape every page a site has. You can use the time module in Python to help stay under any request limits (i.e. time.sleep() can delay and space out your requests).
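A minimal sketch of that pacing might look like the snippet below. The URLs and the two-second pause are just placeholders, not limits Pitchfork actually publishes.

```python
import time
import requests

# Placeholder URLs -- swap in the pages you actually want to scrape
urls = [
    "https://pitchfork.com/best/high-scoring-albums/",
    "https://pitchfork.com/best/high-scoring-tracks/",
]

for url in urls:
    response = requests.get(url)
    # ... do something with the response ...
    time.sleep(2)  # pause between requests so you don't hammer the server
```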

Some sites do not have a “robots” page. I’m not sure what the official guidance is in that case, but I try to stay away unless I am confident scraping is allowed.

2. Decide what content to retrieve

My purpose for creating this scraper is to grab all the “Best New Music” recommendations from Pitchfork and put it into a table. In my final table, each row would be an album or track with the corresponding details filling out each column. After brainstorming what information I would want for each entry, here is my wishlist of data to grab:

  • Artist
  • Title of work
  • Link to Pitchfork’s full review
  • Preview of review
  • Pitchfork Author
  • Pitchfork Rating
  • Whether it is an album or a track
  • Genre
  • Artwork

I would also love to add a link to Spotify so I can go listen directly from the table, but since this isn’t included on the Pitchfork site, it will require adding an additional source like the Spotify API. I won’t cover that in this post, but the finished repo has a separate notebook where I do just that.

3. Retrieve information and figure out HTML elements

Now that we have a wishlist of data to grab, we’ll want to see where exactly this information lives on the page. First, let’s discuss how the requests and beautifulsoup4 packages work. You can start by installing them (if you haven’t already) and importing them. We’ll also import pandas so the scraped data can be organized into a table.

Imports for scraping
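The imports boil down to something like this (install requests, beautifulsoup4, and pandas with pip if you don’t already have them):

```python
import requests                # downloads the raw contents of a page
from bs4 import BeautifulSoup  # parses the HTML so we can search it
import pandas as pd            # organizes the scraped data into a table
```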

The requests module allows you to gather the underlying code from a page. The .get() method sends a request to a website’s server which then sends back information. All you need to do is add the URL as a string in the parentheses. As mentioned before, be cautious of any rate limits as sending too many requests in a short amount of time could look suspicious to them and result in your IP address being blocked.

Checking the status code will let you know if this request was successful or not.

Using the Requests package
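In code, that looks roughly like the following. I’m pointing at Pitchfork’s “Best New Music” landing page here; double-check the exact URL in your browser and swap in whatever page you’re scraping.

```python
# The "Best New Music" page -- confirm the exact URL in your browser
url = "https://pitchfork.com/best/"
response = requests.get(url)

print(response.status_code)  # 200 means the request was successful
```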

We know our request went through but to see the actual contents of what was returned, add .text to the object. In the screenshot below you can see the text we received from the Pitchfork page. Just looking at the first thousand characters, it is a bit overwhelming and not very digestible.

This is where BeautifulSoup comes in! Converting this request object into a BeautifulSoup object allows us to search the text for specific HTML elements.

Converting the text into a BeautifulSoup object
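The conversion is one line. I’m using Python’s built-in html.parser here, though other parsers like lxml work too if installed.

```python
# The raw HTML is hard to read on its own
print(response.text[:1000])

# Parsing it with BeautifulSoup lets us search for specific elements
soup = BeautifulSoup(response.text, "html.parser")
```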

If you are unfamiliar with HTML, I would suggest checking out this overview. For the purposes of web scraping, we just need to know that most of the content lives within some sort of HTML element. Most elements have an opening and closing tag with the displayed text in between (i.e. <p>My paragraph</p> is a paragraph element with the text “My paragraph”).

So if we know the information we want is stored in an h3 tag, we can use the .find() (returns the first h3) or .find_all() (returns a list of all the h3s) methods to easily retrieve it. Adding .text to a single object returns just the text of that tag.

Using find and find_all to grab text
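A quick sketch of both methods:

```python
# .find() returns the first matching tag (or None if nothing matches)
first_h3 = soup.find("h3")
print(first_h3.text)

# .find_all() returns a list of every matching tag
all_h3s = soup.find_all("h3")
print(len(all_h3s))
```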

Pinpointing which HTML elements to gather is crucial for extracting the relevant information from all the text you scraped. Keep in mind the text contains everything visible on the page, like headlines, paragraphs, and tables, but also elements that are not displayed, like links to images and other parts of the site. There are a couple of ways to identify where the information you want is stored on the page. One method is to use the “Inspect” feature in your web browser.

Pulling up Inspect by right-clicking

This will bring up the underlying code of the page so we can see what HTML tags are used for a specific piece of information.

Inspect window on Google Chrome

Another method is to search through your BeautifulSoup object to see where the text is. I usually employ both of these options simultaneously and mix trial and error with patience until I successfully identify which tags contain the data I want.

In pages that have hundreds or thousands of tags, it may be difficult to extract certain key pieces while weeding out the others. There may be dozens of h3 tags but only one that is important to you. Targeting specific attributes helps you filter. Attributes appear in the opening HTML tag and help group or identify certain elements (i.e. <p class="my_class">My paragraph</p> contains the class attribute my_class). Class attributes can be applied to multiple elements, while ID attributes are unique to a single element. The code below shows how you can search for a tag and specify the class.

Narrowing your search by specifying class
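For example (the class name here is made up for illustration; use Inspect to find the real one):

```python
# Only the h3 tags that carry a specific class; note the trailing underscore
# on class_ because "class" is a reserved word in Python
titles = soup.find_all("h3", class_="review__title")  # hypothetical class name
```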

On the other end of the spectrum, you can broaden your search and include different tags by passing them in as a list. This also applies if you want to look for multiple attributes.

Broadening your search by passing in a list of tags
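Both variations look like this (again, the class names are placeholders):

```python
# Match any of several tags in a single search
headers = soup.find_all(["h1", "h2", "h3"])

# A list also works for attribute values (class names here are hypothetical)
links = soup.find_all("a", class_=["review__link", "track__link"])
```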

Similar to how .text shows the text, you can isolate the different attributes of an element by adding .attrs and specifying which attribute in square brackets. This is especially helpful when you want to grab the link from an anchor tag like in the code below.

Retrieving attributes of an element
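For example, pulling the URL out of the first anchor tag on the page:

```python
# Grab the first anchor tag on the page and pull out its link
link = soup.find("a")

print(link.attrs)          # dictionary of all the tag's attributes
print(link.attrs["href"])  # just the URL itself
print(link["href"])        # shorthand for the same thing
```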

For the Pitchfork page, I found that each album or track entry was contained in a div tag that included all of the important information. Using .find_all() and specifying certain classes, I created a list called “reviews” where each item was a div section that would need to be parsed further.

Creating a list of Pitchfork entries
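In outline, that step looks like the snippet below. The class name is a stand-in for the one Pitchfork actually uses, which you can find with Inspect (or in the notebook).

```python
# One div per album or track; the class name here is a stand-in
reviews = soup.find_all("div", class_="review")
print(len(reviews))
```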

From there, I was able to break down the div sections further to see where details like “artist” and “title” were stored. Here are a few examples:

Select code of the specific data to grab
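Roughly, parsing a single entry looks like this. The tag and class names are approximations of what Inspect shows, not guaranteed to match Pitchfork’s current markup.

```python
review = reviews[0]  # start with a single entry

# Tag and class names below are approximations -- confirm them with Inspect
artist = review.find("ul", class_="artist-list").text
title = review.find("h2", class_="review__title-album").text
link = "https://pitchfork.com" + review.find("a", class_="review__link")["href"]
```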

The complete breakdown is available on my GitHub. I also realized that the first three entries were formatted differently from the rest, so the repo accounts for those slight differences.

Some of the information is not on the main “Best New Music” page but rather on the actual review page, which is a separate link. This means I did one big scrape of the main page, then additional requests for each review page.
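A sketch of that second round of requests, reusing the link pulled from each entry (the rating is one example of a detail that may only live on the full review page, and the class name below is hypothetical):

```python
# Follow the review link for details that aren't on the main page
detail_response = requests.get(link)
detail_soup = BeautifulSoup(detail_response.text, "html.parser")

# e.g. rating = detail_soup.find("span", class_="score").text  # hypothetical class
time.sleep(1)  # space out these extra requests
```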

My last step is to bundle all of this into a function that loops through to gather information for each entry, then cleans and exports the data.

4. Store in the desired format

Once I found where each piece of information lived on the page, I combined all the steps and made them flexible enough to work in a loop. Here’s an example of how I changed some of the code from above:

Generalizing the code from a single list item to work for all of them

As each album or track is scraped, the information is stored in a dictionary.

Dictionary for each entry
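Assuming each value has already been parsed out of the review div the same way as artist and title above, the dictionary mirrors the wishlist from step 2:

```python
# One dictionary per album or track -- keys mirror the wishlist from step 2
entry = {
    "artist": artist,
    "title": title,
    "link": link,
    "preview": preview,
    "author": author,
    "rating": rating,
    "type": work_type,     # album or track
    "genre": genre,
    "artwork": artwork_url,
}
```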

When the dictionary is complete, it is added to a list. The final product is a list of dictionaries where each list item is a different album or track from the Pitchfork website. A list of dictionaries is also the perfect format to convert into a DataFrame.

Storing the dictionaries in a list then converting to a DataFrame
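Put together, the loop ends up shaped something like this; the parsing details are elided and the class names are stand-ins, with the full version in the notebook.

```python
entries = []

for review in reviews:
    entry = {
        # class names are stand-ins, as above; add the rest of the wishlist fields here
        "artist": review.find("ul", class_="artist-list").text,
        "title": review.find("h2", class_="review__title-album").text,
    }
    entries.append(entry)

# A list of dictionaries converts straight into a DataFrame
df = pd.DataFrame(entries)
df.head()
```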

Just like that, I have my listening list for the week, and it took less than six seconds to gather.

Running the final function and a preview of the returned table

Here are some other ideas for how to expand this project further:

  • Add additional sources, such as NPR, Complex, and Stereogum, to get more recommendations
  • Integrate with Google Sheets to have a living document that expands each week
  • Utilize the Spotify API to build yourself a playlist

I hope you enjoyed reading this and learned something new. If you have any suggestions, questions, or ideas for other sites to scrape, I would love to hear them!
