Pitchfork reviews and scores: a quick analysis tool in Python and Php
Pitchfork is the most reliable source for music reviews that you can find online. Its website has a great UX and is full of free and interesting information.
As an intensive user, I am missing a very basic list page where I can find, search and filter just the album reviews with their scores (expressed in decimals).
Hence I thought it would be useful to geeky music fans like me to store somewhere those scores, and display/search them together with some basic album information.
Storing scores and albums metadata (genre, label, year, cover) for some years could also led to some interesting data analysis, so I decided to write a simple Python script to read information from Pitchfork’s album reviews Rss and store the content in a Mysql database.
Once I got the import script up and running, I also decided to build a basic web page in Php, so that everybody could read and search throughout the album list.
Here is the final result: http://www.lamorbidamacchina.com/pitchforkscores/
Here is the repository on github:
Why Python and Php together?
If you are asking yourself why I used this mixed Python/Php system, here are some answers:
a. Both Python and Php are available on most common server configurations
b. I wanted to learn some more about Python
c. Python has a wonderful library called BeautifulSoup which makes ridiculously simple to parse data in complex html pages
d. I still prefer Php for rapid frontend development
Part One: creating a table to store data in Mysql
This part is easy :)
Here’s the structure of the database table to store all the data we want from Pitchfork. As you can see, I decided to save basic data like title, author and score, but also more specific stuff like album art, genre, creator (of the review), label and year. Basically everything you can see in a Pitchfork review web page, except the review itself.
To have an idea of the data we are going to store, here’s an example of a typical review on Pitchfork: https://pitchfork.com/reviews/albums/14257-this-is-happening/
We don’t need any other table for this project, so the database part is done.
Part Two: Importing data with Python
Now, let’s create a file named import_albums.py, located in a /batch folder in the project. This file will be called once per hour by a cron job set on my server, and will do the following:
- it reads Pitchfork’s album reviews Rss (here)
- it parses each album review, and check if it has already been saved in the database, in a table called pitchfork_albums
- if not, it writes the new review in a new record
- since the Rss file doesn’t contain all the data I need, especially the final score of the album, it parses the html page of the actual review on pitchfork.com to retrieve data such as score, genre, label, cover, etc.
- it updates each record with this data taken from the review web page
To make this work, we need to import some Python module.
urllib2 is needed to access Pitchfork’s Rss
MySQLdb is needed, obviously, to connect to the database
ElementTree is the module we use to read the xml content of Pitchfork’s Rss
BeautifulSoup helps us to retrieve content from Pitchfork’s html web pages (reviews)
Please note that you could have to install some of these modules on your system: this is a straightforward process if you can use pip or some other python modules manager, a bit more painful if you are doing it by yourself. In any case, I am not a sysadmin and I managed to do it quite quickly.
Reading a remote file and copy it to your server is pretty easy with Python and urllib2: here is how I get a copy of the rss and save it in a file called albums.xml, in the same directory of my python file.
Now I can parse the xml file I’ve just created, using ElementTree features. In this case I loop every item element in albums.xml to get the data I need from each review (title, description, link to the review’s web page, unique id, publication date). I also format date in some more “db friendly” form.
Once I’ve got all the data I need from the Rss, I can start connecting to database, check if the album review is already stored in there, and if it’s not, to insert a new record in the table
Among the data I collected from the rss file, the link variable contains the url of each review, on Pitchfork’s website. We can parse this html web page to retrieve some data which is not exposed in the Rss: not only the score (we know music cannot be reduced to a number, but we want it!:), but album art, author of the review, music genre, label, and album year too.
I also get title and author of the album in two distinct fields (stored in variables named h1 and h2), since the Rss feed has one field for both of them.
To achieve this task, I use Beautiful Soup module, whose select method is so similar to the well known and easy to use jQuery selectors logic. After a quick analysis of the html page, I can get all the data I need by pointing to some distinct html elements, identified by their class or id attribute.
Before saving this data in the same record I have just inserted, I want to print out title, author and score in the output of the file, so I can later email it to me when the job is executed.
Last part of the script updates the record we’ve just created, with the additional information grabbed with BeautifulSoup. (before you ask, yes, I could’ve done just an insert…)
Once the script is tested and working, I want to recall it automatically from my server. Pitchfork usually publishes new reviews once or twice per day, setting a cronjob every hour is more than enough.
To setup a cronjob on VirtualMin, I followed the steps listed here: https://www.virtualmin.com/documentation/tutorial/how-to-setup-a-cron-job
At this point, I am able to see my pichfork_reviews database table filled with the first batch of data.
Part three: show the scores in a webpage
Creating a webpage to show each album with its score and basic data is trivial. I decided to create a Data Object Access Class in a file called dao.php
In this class you can find methods for reading all the stored reviews, for searching by genre, year and score, for reading all the genres and all the albums’ years of publications
A simple index.php file connects to database, initiates a list of all music genres and albums’ years stored in the table to generate two dynamic dropdown menus, and then creates the list of the reviews, with their basic data and score. The method that generates the reviews’ list can accept 3 parameters (genre, year and score range), for a basic search engines.
That’s it! Now I have a clear view on Pitchfork album reviews, searchable by genre, year and score.
As time goes by, I should have enough data to analyze score bias affected by genre, best years, and so on. Of course, when Picthfork will redesign its html templates, some fix will be needed to my import script.
Next steps could be developing a free text search and a simple pagination (right now only the last 100 reviews are displayed).