A Guide to Web Scraping and Data Analysis with Python

This is a walkthrough of the tech stack that I use to gather and analyze sports data, though it should apply just as well to any other data on the internet. I’ve used this stack to publish a couple of articles, Dispelling Eagles Injury Myths and Is the NFL serious about concussions?, using data from Pro Football Reference. I’ve also published the code in my nfldata GitHub project.

Libraries and frameworks used in this stack

Why should you care?

⚠️ WARNING This guide assumes that you are comfortable using the command line and Python. If you’re not, then I have some recommendations for what to do first.

TL;DR

How I scrape and analyze data from the web

Setting up your project

‼️ INFO If you’re interested, you can learn more about Python virtual environments, and how they’re used on the Python website.

Once you’ve installed Pipenv, go to an empty directory, and set up your project with these commands:

pipenv --three
pipenv shell

The instructions would be similar if you use Poetry instead.
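
For reference, the rough equivalent with Poetry (exact commands vary a bit between Poetry versions) would be:

poetry init
poetry shell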

Gathering the data

pipenv install scrapy
scrapy startproject <name>

This should scaffold out a new project to get you started. Scrapy works by defining Spiders, which crawl a website and emit Items. These Items are data that you can process any way you see fit. To learn how to create a Spider, go through Scrapy’s tutorial first, and then follow up on other pages of the documentation as needed.

At a high level, Scrapy Spiders start with a URL or set of URLs, and send requests to the URL(s) to get some HTML. From here, you can define CSS selectors or XPath expressions to search for specific data in the HTML. Then, you can choose to yield an Item containing some data of interest, or a new URL to continue crawling the website.

A high level overview of how Spiders crawl the web
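
To make that flow concrete, here’s a rough sketch of what a small Spider might look like; the name, start URL, and CSS selectors are placeholders rather than anything from my actual project:

import scrapy

class PlayersSpider(scrapy.Spider):
    name = 'players'
    start_urls = ['https://www.example.com/players']  # placeholder URL

    def parse(self, response):
        # Yield one item per table row of interest
        for row in response.css('table tbody tr'):
            yield {
                'name': row.css('td a::text').get(),
            }

        # Follow the pagination link, if there is one, to keep crawling
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)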

Throttle your spiders

Don’t let your spiders get out of control (source: https://xkcd.com/427/)

Scrapy ships with an AutoThrottle extension that adjusts your crawl rate based on how the target server is responding, so you don’t hammer the site you’re scraping. The setup is pretty simple: just enable it in settings.py and set the target concurrency:

AUTOTHROTTLE_ENABLED = True
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 3.0

Set your User-Agent
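
Scrapy identifies itself to websites through the USER_AGENT setting in settings.py. I’d suggest setting it to something that identifies your scraper and gives the site owner a way to contact you; the value below is just an example:

USER_AGENT = 'nfldata scraper (your-email@example.com)'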

Retry failed requests

DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddlewares.retry.RetryMiddleware': 820,
}
RETRY_TIMES = 3

‼️ INFO The number associated with the Middleware simply sets the order in which the Middleware is executed. Middlewares with a higher number are executed after those with a lower number.

Cache your responses

HTTPCACHE_ENABLED = True
# Cache pages for one week (604800 seconds)
HTTPCACHE_EXPIRATION_SECS = 604800
HTTPCACHE_DIR = 'httpcache'
# Don't cache pages that throw an error
HTTPCACHE_IGNORE_HTTP_CODES = [503, 504, 505, 500, 400, 401, 402, 403, 404]

⚠️ WARNING The default HTTP cache policy used by Scrapy just caches every page until it expires, which is great for testing. But if you want to deploy your Spider to production, you probably want to use the RFC2616 policy instead.
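
Switching policies is a one-line change in settings.py:

HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.RFC2616Policy'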

Render pages with Javascript

Some pages only load their content by executing Javascript, which means the plain HTML that Scrapy downloads may be missing the data you care about. Luckily, there is a way to get the HTML after the Javascript files have been executed. Scrapinghub, which maintains Scrapy, also maintains a Javascript rendering service called Splash. To integrate it with Scrapy, there is a library called scrapy-splash. The scrapy-splash library exposes a Downloader Middleware, which interfaces with a Splash server running on your computer.

In order to run the Splash server, you first need to install Docker. Then, you can easily run a Splash server on port 8050:

sudo docker run -p 8050:8050 scrapinghub/splash --max-timeout 300

Now, there are a couple of steps to integrate Splash and Scrapy properly. First, you need to set up the Downloader middlewares:

DOWNLOADER_MIDDLEWARES = {
'scrapy_splash.SplashCookiesMiddleware': 723,
'scrapy_splash.SplashMiddleware': 725,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
'scrapy.downloadermiddlewares.retry.RetryMiddleware': 820,
}

Then, in order to make sure your spiders know where Splash is running, point to it in your settings.py:

SPLASH_URL = 'http://localhost:8050'

Next, make sure that Scrapy’s duplicate request filtering is compatible with Splash:

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

Finally, make sure that Scrapy’s HTTP cache can handle Splash responses:

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

Now you’re all set to scrape any website, without fear of missing data that only shows up after Javascript runs.
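
To actually send a page through Splash, a spider yields a SplashRequest instead of a plain scrapy.Request. Here’s a minimal sketch; the spider name, URL, and wait time are placeholders:

import scrapy
from scrapy_splash import SplashRequest

class RenderedSpider(scrapy.Spider):
    name = 'rendered'  # placeholder name

    def start_requests(self):
        # Ask Splash to render the page, waiting briefly for scripts to finish
        yield SplashRequest(
            'https://www.example.com/stats',  # placeholder URL
            callback=self.parse,
            args={'wait': 2},
        )

    def parse(self, response):
        # response.text now contains the HTML after Javascript has executed
        self.logger.info('Rendered page length: %d', len(response.text))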

Storing the data

‼️ INFO If you’ve never used SQL databases before, it’s worth learning how relational databases work, and how to best query them with SQL in depth. But for a quick overview of SQLite, you can try the SQLite Tutorial website.

Now, you’ve got a Spider producing items, and you have a SQLite file to store all of them. Let’s talk about how to get those items into SQLite. Conceptually, I think of each Item class as corresponding to a table. So, if I’m scraping every player in NFL history, I’ll make a Player item. Then in SQLite I’ll create a players table. Each Item class corresponds to a SQLite table, and each instance of that Item class is a row in the SQLite table.

I’ve found this general template useful for defining Scrapy items that will be persisted in SQLite:

import scrapy

class Player(scrapy.Item):
    # define fields here with scrapy.Field()

    @staticmethod
    def sql_create(database):
        database.execute('''
            CREATE TABLE IF NOT EXISTS
            players (
                -- column definitions
            )
        ''')

    def sql_insert(self, database):
        database.execute('''
            INSERT OR REPLACE INTO
            players (
                -- column names
            ) VALUES (
                -- column values
            )
        ''')

The sql_create method creates a table to hold that class of items, and the sql_insert method inserts the specific instance of that item into the table.

The advantage of using SQLite is not just that it stores your data; you could do that by writing the items to a file. SQLite helps you organize your data, not just store it. Having your data organized in tables gives you a lot of flexibility in your analyses. Best of all, SQLite is supported out of the box by Python.

Now that your items are set up, you can create a Scrapy Pipeline. Pipelines perform post-processing on items once they’ve been scraped by a spider. You can create a Pipeline that creates all the necessary tables first, and then inserts each item into SQLite as it comes in, as in the sketch below.
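
As an illustration, a minimal pipeline along those lines might look like this; the database file name and the import path for the Player item are my own placeholders, not something from the original project:

import sqlite3

from myproject.items import Player  # placeholder import path for the item defined above

class SQLitePipeline:
    def open_spider(self, spider):
        # Create (or open) the database file and make sure the tables exist
        self.database = sqlite3.connect('nfldata.sqlite')  # placeholder file name
        Player.sql_create(self.database)

    def process_item(self, item, spider):
        # Insert each scraped item into its table as it arrives
        item.sql_insert(self.database)
        return item

    def close_spider(self, spider):
        # Persist everything once the crawl finishes
        self.database.commit()
        self.database.close()

Don’t forget to enable the pipeline in settings.py, for example with ITEM_PIPELINES = {'myproject.pipelines.SQLitePipeline': 300} (again, the module path depends on your project).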

Analyzing the data

pipenv install jupyter
jupyter notebook
A screenshot of one of my Jupyter notebooks.

As you can see, there are cells where you can enter Python code to perform some work, and output cells that render the result of each operation.

‼️ INFO The nice thing about running Jupyter from your Pipenv project is that the notebooks also have access to all of the libraries you install with Pipenv.

Querying the data

pipenv install pandas
A screenshot of how Pandas renders DataFrames.
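
Pandas can load query results straight out of SQLite into a DataFrame. Here’s a minimal sketch, assuming the players table from earlier and the placeholder nfldata.sqlite file name I used in the pipeline example:

import sqlite3
import pandas as pd

# Open the SQLite file produced by the Scrapy pipeline (file name is a placeholder)
connection = sqlite3.connect('nfldata.sqlite')

# Run a SQL query and load the results into a DataFrame
players = pd.read_sql_query('SELECT * FROM players', connection)
players.head()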

Visualizing the data

pipenv install matplotlib seaborn
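
As a quick illustration, a plotting cell in the notebook might look roughly like this, assuming the players DataFrame from the previous section has a position column (both names are my own placeholders):

import matplotlib.pyplot as plt
import seaborn as sns

# Count how many scraped players there are at each position
sns.countplot(x='position', data=players)
plt.title('Players by position')
plt.show()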

Parting thoughts

  1. Scrape data from websites, even if they require Javascript to render.
  2. Create a pipeline to organize and store your data in a SQLite database.
  3. Query and visualize the data in a Jupyter notebook.

Hope you found this guide helpful, and feel free to use my nfldata project as a reference.

See the original article on my website: https://pradyothkukkapalli.com/tech/web-scraping-stack/
