I was then inspired to do a little web scraping to get the data on all these great movies, and find some good ones to watch!
Looking at the website and trying to understand the parameters in the URI, I saw that there was an argument start=1, which became start=51 when I went to the next page. I then searched the page for a term that only the movie listings would contain; it counted 50 matches, so I knew to expect 50 movies per page. That also told me that start was the parameter I would have to increment, in steps of 50, to crawl through each page.
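That pagination can be generated with a simple step of 50. A minimal sketch (the base URL below is an assumption for illustration, not necessarily the exact one I used):

```python
# Step the start parameter by 50 to get each page's URL.
# The base URL here is an assumption for illustration.
base = "https://www.imdb.com/search/title/?groups=top_1000&start={}"

urls = [base.format(start) for start in range(1, 201, 50)]
# start takes the values 1, 51, 101, 151 for the first four pages
```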
I then initialized my variables and set the pages to cycle through the parameters mentioned above. Unfortunately, as my comment says, after 10,000 movies the pattern changes to after= followed by a string of gibberish that is unique to each subsequent page. I suspect this is an anti-web-scraping tactic.
Next, I set up my for loop, which has two layers. Here is that whole loop:
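A minimal sketch of such a two-layer loop, with a stub standing in for the real request-and-parse step (the stub, its field names, and its placeholder values are assumptions):

```python
import pandas as pd

# One array per variable of interest.
titles, years, imdb_scores = [], [], []

def fetch_page(start):
    """Stub for the real download-and-parse step. In the actual scraper this
    would request the page at start=<start> and pull each field out of the
    HTML; the placeholder values below are assumptions for illustration."""
    return [{"title": f"Movie {start + i}", "year": 1990 + i % 30, "imdb": 7.0}
            for i in range(50)]

# Outer loop: step the start parameter in increments of 50 ("clicking" pages).
for start in range(1, 10001, 50):
    # Inner loop: one pass per movie on the page, filling each array.
    for movie in fetch_page(start):
        titles.append(movie["title"])
        years.append(movie["year"])
        imdb_scores.append(movie["imdb"])

# 200 pages x 50 movies per page = 10,000 rows.
df = pd.DataFrame({"title": titles, "year": years, "imdb": imdb_scores})
```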
For the page defined by the current iteration of the outer loop, the inner loop cycles through each variable of interest for each of the 50 movies on that page, storing every element in the array associated with its variable.
Once the inner loop is done, control returns to the outer loop, which changes the URI argument, thereby “clicking” to the next page. And the cycle repeats.
The result is one array per variable, each holding that variable’s value for every one of the 10,000 movies I was able to gather.
At the end I put them into a dataframe and it was off to do some data cleaning!
With my newly created dataframe, it was time to check everything out and make sure nothing was weird before the exploratory data analysis (EDA), which will come in part II of this article.
The first thing I noticed is that the year variable is wrapped in parentheses. We don’t need those, we want an integer! The four digits always sit between the fifth-to-last and the last character of the string, so a little index slicing will do the trick. Then I call
.astype to turn the result into an integer.
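On a hypothetical sample column (the real one came from the scrape), that looks like:

```python
import pandas as pd

year = pd.Series(["(1994)", "(2008)", "(2019)"])  # hypothetical sample values

# Slice out the four digits sitting just inside the parentheses, then cast.
year = year.str[-5:-1].astype(int)
```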
I noticed that the IMDB score is on a scale of 10 and the metascore on a scale of 100, so I needed to standardize the IMDB score onto the 0–100 scale, storing it in a new column I called ‘n_imdb’.
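Multiplying by 10 puts the score on the metascore’s 0–100 scale. A sketch on hypothetical values (the column names are assumptions):

```python
import pandas as pd

df = pd.DataFrame({"imdb": [9.0, 7.5, 8.25]})  # hypothetical sample scores

# New column on the same 0-100 scale as the metascore.
df["n_imdb"] = df["imdb"] * 10
```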
Runtime was a string of the form ‘number min’, so I needed to pull out just the number, removing the space and ‘min’.
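One way to do that, assuming the ‘ min’ suffix is consistent across rows:

```python
import pandas as pd

runtime = pd.Series(["142 min", "99 min"])  # hypothetical sample values

# Drop the ' min' suffix and keep the number as an integer.
runtime_min = runtime.str.replace(" min", "", regex=False).astype(int)
```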
I didn’t need the original runtime feature anymore, so I dropped it and moved on to fixing the ‘genre’ column, which had a ‘\n’ before the string that needed to be removed.
Wrapping up the data cleaning, I turned ‘n_imdb’ into integer type and noticed that the ‘genre’ column still had issues, namely 17 whitespace characters after each string, so I used
.rstrip() on it and then split on the commas, because I wanted each row of ‘genre’ to hold a list so I could work with individual elements, since each entry was a list of three different genres. I thought that granularity would be useful for EDA.
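On a hypothetical sample, that genre cleanup chain might look like this (assuming the separator is a comma followed by a space):

```python
import pandas as pd

# Hypothetical strings mimicking the scraped genre column:
# a leading newline and 17 trailing whitespace characters.
genre = pd.Series(["\nAction, Crime, Drama" + " " * 17,
                   "\nDrama" + " " * 17])

genre = (genre
         .str.lstrip("\n")    # remove the leading newline
         .str.rstrip()        # remove the trailing whitespace
         .str.split(", "))    # one list of genres per row
```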
Here’s the full data cleaning code:
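A sketch of what it may have looked like, pieced together from the steps above (the column names, sample rows, and output filename are assumptions):

```python
import pandas as pd

# Hypothetical rows standing in for the scraped dataframe.
df = pd.DataFrame({
    "year": ["(1994)", "(2008)"],
    "imdb": [9.0, 7.5],
    "metascore": [82, 70],
    "runtime": ["142 min", "99 min"],
    "genre": ["\nDrama" + " " * 17, "\nAction, Crime, Drama" + " " * 17],
})

df["year"] = df["year"].str[-5:-1].astype(int)          # strip the parentheses
df["n_imdb"] = (df["imdb"] * 10).astype(int)            # IMDB score on a 0-100 scale
df["runtime_min"] = df["runtime"].str.replace(" min", "", regex=False).astype(int)
df = df.drop(columns=["runtime"])                       # original runtime no longer needed
df["genre"] = (df["genre"]
               .str.lstrip("\n")                        # leading newline
               .str.rstrip()                            # trailing whitespace
               .str.split(", "))                        # lists of genres per row

df.to_csv("movies_clean.csv", index=False)              # filename is an assumption
```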
You’ll notice that I wrote the dataframe to a csv at the end. That’s because this was all one .py file; my goal was to run it on an EC2 instance on AWS to let Amazon do the heavy lifting. By my calculations, the scrape would have taken 48 minutes on my measly computer, and I didn’t feel like waiting.