The Rising Cost of Data: Web Scraping with Python

Michael Seman
8 min read · Jul 10, 2023


Not even Justin Fields can outrun the rising cost of data

Around a year ago, I performed sentiment analysis on tweets about my favorite baseball team, the Chicago Cubs. I enjoyed the project so much that I ended up writing an article about the process. My research was nowhere near complete, and it is something I have always wanted to revisit. The problem is that the tweets I previously had free access to now come at a cost: Twitter has put its API behind a paywall, and Reddit has been all over the news lately for doing the very same thing. In the world we live in, nothing is free. “Free apps” harvest our data and fight for our eyeballs and clicks. Our personal data is now a commodity that is priced and traded. As a data analyst trying to do research, this presents a problem. Because storing data and providing access to it is not cheap, more and more companies are charging for that access. This is the story of the hard lessons I learned while trying to collect data via web scraping for a recent research project.

It started out so simply…

In January of 2023, I had a very clear idea in my head for a research project: I wanted to create a model to predict where a college player would be selected in the NFL draft. I naively thought this would be a relatively easy task to undertake and my model would be something I could use to help analyze the upcoming draft in April. What a fool I was.

My data collection goal was to obtain the NFL draft results from the past twenty years. I then aimed to gather as much personal information as possible (height, weight, age, birthplace, etc.), along with the college football statistics, for every player drafted in that span. I was pointed to an application called Octoparse, which automates the web scraping process with no coding knowledge required. I spent a day or two learning the app and setting up my workflow, and I was ready to scrape. After looking through dozens of NFL and college football websites, I decided on Sports Reference. This site is the gold standard for statistics on all major sports. I ran into issues with Octoparse, though: it would usually scrape just fine, but would randomly return no results for certain years. So I pivoted to an app called Parsehub.

Parsehub is probably the simplest web scraping application around, even if it can be a bit cumbersome. I cannot recommend it enough for beginner analysts looking to perform simple web scraping without knowledge of Python. It took a day to learn the application and set up a workflow. Then I was able to scrape all twenty years of NFL draft results with ease, including links to each individual player’s stats page.

An example of a draft results page scraped using Parsehub

The next step was to set up a new workflow in Parsehub to loop through each player’s college stats and personal information webpages and grab the relevant information. This is where I began to learn my lesson on the cost of data. The free edition of Parsehub is amazing, but it had one big limitation that hit me hard: it could only scrape 200 pages per run. Each year’s draft consists of approximately 250 players, each of whom had two pages to scrape. At roughly 500 pages per year over twenty years, that is about 10,000 pages, or around 50 separate runs, and as a poor college student, I could not afford the paid edition of the app.

Web Scraping with Python

I knew what I had to do: I had to write Python code to scrape the pages myself. I had some previous experience with web scraping and the Beautiful Soup library, but that was just single-page, single-table scraping. This project was going to require scraping approximately 10,000 pages. Ever the optimist, I started refamiliarizing myself with how to scrape. I revisited a wonderful article written by Martin Breuss and watched an extremely informative and easy-to-follow video by Tech With Tim. Then I simply dove in.
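To give a sense of the kind of single-table scraping I was starting from, here is a minimal sketch using requests and Beautiful Soup. The URL and the table id below are placeholders for illustration, not the actual pages I scraped:

import requests
from bs4 import BeautifulSoup

# Placeholder URL and table id for illustration only
url = "https://www.example.com/players/some-player.html"

r = requests.get(url)
r.raise_for_status()  # stop early if the request was rejected

soup = BeautifulSoup(r.content, "html.parser")
table = soup.find("table", {"id": "stats"})  # hypothetical table id

rows = []
for tr in table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    if cells:
        rows.append(cells)

print(rows[:3])  # peek at the first few parsed rows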

An example of the personal information page scraped

It took a lot of trial and error, but I was quickly able to scrape a single page of a player’s personal information. As always, each new page brought new exceptions that I did not initially spot, but eventually I had a nice loop of code that should have been able to scrape all ~5,000 pages. It was time to let my code loose and grab the data I needed. Everything seemed to be working fine, but after a few hundred pages my code started returning nothing. The pages were being accessed, but nothing was being scraped. I assumed it was an error in my code and spent hours trying to debug. Finally, I repeated a few simple lines of code that I always used when first accessing a page:

import requests
from bs4 import BeautifulSoup

r = requests.get(url)  # url is the page currently being scraped
page = BeautifulSoup(r.content, 'html.parser')
print(page.prettify())  # dump the raw HTML to see what actually came back

and my folly was revealed…

Go Directly to Jail. Do Not Pass Go.

Because I had never scraped more than a single page, I had never encountered something like this before. How innocent I was back then to think I could scrape data from the web without a cost. Of course I could not send request after request to the pro-football-reference website without them stopping me! I delved further to find out exactly what was going on and was met with this explanation:

What did ESPN do to sports-reference?

I went to their website to figure out their exact limits and rules. There, I found out the exact crimes I had committed that had gotten me sent to jail.
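Knowing what I know now, a simple status check inside the loop would have caught the block the moment it started, instead of hours later. Here is a minimal guard, assuming the site signals the block with a standard HTTP 429 “Too Many Requests” response (that specific code is my assumption):

import requests

r = requests.get(url)
if r.status_code == 429:
    # HTTP 429 = Too Many Requests: the server is telling you to slow down
    print("Rate limited: stop and wait before sending more requests")
elif r.status_code != 200:
    print(f"Unexpected status {r.status_code}: the page was not served normally")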

Hosting that much data is expensive, so of course they need to limit access

I noticed that the site mentioned they received a lot of student requests for data, so I thought maybe I could explore that avenue… That was until I read this line, “We will not fulfill any requests for data for custom downloads, unless you are prepared to pay a minimum of $1,000 for any such request.” As I mentioned before, I was a poor college student at this time. There was no way I was going to be able to pay that type of fee. I started looking for other sites to scrape and even began writing code when the simple solution to this problem came to me: a time delay.

import time
time.sleep(7) # put this line at the end of your loop to delay 7 seconds

By adding a time delay to my code, I had my loop “sleep” for a set number of seconds before running again. This way I did not violate the site’s rules for use, and I was able to scrape away at the simple cost of A LOT of time. A scraping run that would probably have taken only an hour or two would now take nearly 9 hours. Of course, I did not do this all in one go. I limited myself to a few years of data scraped per day, during off-peak traffic hours. It took some time, but I finally had good, real data to work with.
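Putting it all together, the loop looked roughly like the sketch below. The URLs, the table id, and the 7-second delay are illustrative stand-ins rather than my exact code:

import time
import requests
from bs4 import BeautifulSoup

# Hypothetical list of player page URLs gathered from the draft results
player_urls = [
    "https://www.example.com/players/player-one.html",
    "https://www.example.com/players/player-two.html",
]

results = []
for url in player_urls:
    r = requests.get(url)
    if r.status_code != 200:
        print(f"Skipping {url}: got status {r.status_code}")
        continue

    soup = BeautifulSoup(r.content, "html.parser")
    table = soup.find("table", {"id": "stats"})  # hypothetical table id
    if table is not None:
        rows = [
            [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
            for tr in table.find_all("tr")
        ]
        results.append((url, rows))

    time.sleep(7)  # pause between requests so the site is not flooded

print(f"Scraped {len(results)} pages")

With a delay like that, the sleep time dominates the runtime, which is why I spread the scraping across several days.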

Conclusion??

I scraped 20 years’ worth of personal information about all the players drafted into the NFL. I had their height, weight, hometown, college and high school attended, and position, as well as all the draft results from all those years. This was to be the jumping-off point for my research. Unfortunately, I stumbled upon another restriction that brought my work to a halt. The following was not allowed: “copy or use any material or Content from the Site, including without limitation any statistics, data, text, graphics, or images, for purposes of training, fine-tuning, prompting, or instructing artificial intelligence models or technologies in any manner, including without limitation for purposes of (i) generating answers, text, scores, statistics, notes, graphics, images, or any other output; or (ii) supporting machine learning methods used to predict, classify, label, or score inputs into the models.” This is exactly what I had set out to do. All my work, all the data I collected, was useless. I could not train a model with this data without violating the site’s terms of service. Game over.

Turn your failures into successes: My NFL Draft Dashboard

A Lesson Learned

The simple lesson learned from my experience is this: all data has a cost. The days of being able to access and share data without having to pay are pretty much gone. I do not fault sites for putting limits on access to their data, because, as I have mentioned, storing and maintaining access to data is expensive. Businesses are meant to make money, not give away their most precious commodity for free. While more and more companies are putting their data behind a paywall, your personal data is being traded and sold like stocks. Looking back, I should have done a lot more research before simply jumping in and trying to scrape the data I wanted. I am not one to give up, though. I found other sites where I was able to obtain my data without violating any rules or terms of service, and since I had already learned how to write Python code to scrape the web, I was set, but that is a story for another day. The data I initially obtained did not go to waste, either. My first love when it comes to data is visualization, so I used the personal information I scraped to create one of my favorite Tableau dashboards I have ever made. Feel free to visit it here, on my Tableau Public site. Thank you for taking the time to read my story; hopefully you can learn from my mistakes. I’ll leave you with a screenshot of something that I could not help but add to my visualization work.

A little Easter egg for all my fellow Chicago Bears fans. Justin Fields has a great profile picture.

