Sourcing, Scraping, Storing, Scheduling Data (Oh My!)

Nicholas Marey
6 min read · Feb 22, 2019


Photo by Chris Ried on Unsplash

Intro

Baseball is a unique sport in that the situations a player may face can actually be counted: there are 24 base-out states (8 base-occupancy combinations times 3 out counts), 9 innings, and 9 players, so every situation can be accounted for. This makes baseball the perfect sport for the application of statistics. And the only thing better than performing analysis is delivering the results through insightful data visualizations. That is what I aim to accomplish. But before any analysis or visualization can take place, a data pipeline must be established. What follows are my efforts to build that pipeline.

Sourcing

Major League Baseball has led the way in advanced analytics among professional sports. One major step was the PITCHf/x system, created by Sportvision and first used in the 2006 MLB playoffs. Since then it has been installed in every major league stadium and is used to track pitch speed and trajectory. From this information new insights into the game have become possible, and the best part is that it is all publicly available!

I first started working toward sabermetric analysis after finding Bill Petti’s website, where he wrote about building a Statcast database in R (https://billpetti.github.io/2018-02-19-build-statcast-database-rstats/). There are a number of Python scrapers on GitHub specifically related to PITCHf/x, but I wanted to recreate what Bill Petti had coded in R, this time in Python.

After reading Bill’s article I knew he was sourcing his data from baseballsavant.com, and I knew the URL that was supplying the data. Still, I made use of the inspect feature in my web browser and found that when I attempted to download a CSV file containing the data, a unique URL appeared under the network tab. After a quick inspection, I could see how Bill modified the URL so it would download the data for only a specific day. Sourcing completed!

Baseball Savant
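The exact query string is long, but the piece that matters is the date window. A minimal sketch of the pattern I found, with the parameter names taken from my notes on the network tab (so treat them as assumptions rather than the definitive endpoint), looks like this:

```python
# Sketch of the Baseball Savant CSV endpoint for a single day's data.
# game_date_gt / game_date_lt bound the date window; the rest of the real
# query string is omitted here for brevity.
base_url = (
    "https://baseballsavant.mlb.com/statcast_search/csv?all=true&type=details"
    "&game_date_gt={date}&game_date_lt={date}"
)

url = base_url.format(date="2019-04-01")  # any date in YYYY-MM-DD format
```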

Scraping

Data, data everywhere, but how do I get it? This is where scraping comes into play. With Python it is possible to gather a large amount of data relatively quickly. In my case I did not need tools like BeautifulSoup: since Baseball Savant delivers the data as a CSV file, I only had to use the pandas read_csv function and pass it the URL where the CSV is located. Within that URL, only one item needed to change daily: the date.

Initially I tried to grab all of the data for a specific day in a single request, but unfortunately the request timed out. It was therefore necessary to limit each request by the venue where the game was played. To accomplish this in a timely fashion, I wrote a dictionary that stored the name of each venue and used a for loop to pass in each new venue, appending each new data frame to my existing data frame. And to be nice to the website, I made sure to include a pause within the function. The result of all this effort is a function that collects the data for all games played on a specific day and outputs a pandas data frame. All that was left was to move the data frame to a database and to schedule the automatic downloading of the data. A sketch of the scraper can be found below.
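The venue dictionary below is abbreviated to a few parks, and the hfTeam team filter is my assumption about the query parameter, so treat this as an outline rather than a drop-in replacement:

```python
import time
import pandas as pd

# Abbreviated venue-to-team mapping; the real dictionary covers all 30 parks.
VENUES = {
    "Fenway Park": "BOS",
    "Yankee Stadium": "NYY",
    "Wrigley Field": "CHC",
}

# Same single-day URL pattern as above, with an assumed team filter appended.
BASE_URL = (
    "https://baseballsavant.mlb.com/statcast_search/csv?all=true&type=details"
    "&game_date_gt={date}&game_date_lt={date}&hfTeam={team}%7C"
)

def scrape_day(date):
    """Collect Statcast pitches for every venue on a given day (YYYY-MM-DD)."""
    frames = []
    for venue, team in VENUES.items():
        url = BASE_URL.format(date=date, team=team)
        df = pd.read_csv(url, low_memory=False)
        if not df.empty:
            df["venue"] = venue          # remember which park the rows came from
            frames.append(df)
        time.sleep(5)                    # be nice to the website between requests
    # DataFrame.append is deprecated, so concatenate the per-venue frames instead
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```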

Storing

Now that the data had been successfully scraped and placed in a pandas data frame, I needed to decide how to store it. I was at a critical juncture: I could go down the traditional structured-database route (MySQL) or the unstructured route (MongoDB). Ultimately I went with MongoDB because my schema may change, either from a change to MLB Statcast or because additional columns may be added. Furthermore, in the event that my data moves to the cloud (S3), the MongoDB structure most closely resembles that of S3, making the transition easier.

After installing MongoDB on my local machine, I used PyMongo to move the data. Installation was a breeze with conda install pymongo at the command prompt, and moving the data over to the database was just as easy after reading a tutorial in the documentation. A sketch of the data transfer and storage code can be found below.
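This is a minimal sketch assuming a local MongoDB instance on the default port; the database and collection names (statcast, pitches) are placeholders I chose for illustration:

```python
from pymongo import MongoClient

def store_day(df, db_name="statcast", collection_name="pitches"):
    """Insert one day's worth of scraped pitches into a local MongoDB instance."""
    client = MongoClient("localhost", 27017)   # default local install
    collection = client[db_name][collection_name]
    records = df.to_dict("records")            # one document per pitch
    if records:
        collection.insert_many(records)
    client.close()
```

Chaining scrape_day and store_day into a single script gives the scheduler one file to run each night.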

Scheduling

After having sourced, scraped, and stored the data, it was time to schedule this process. I was aware of two options: Luigi or Windows Task Scheduler. I went with the latter because it was quick and easy. Going the Luigi route would have stretched my timeline due to the implementation time required, and time was not something I had.

Windows Task Scheduler from Penn State GIS

After a quick Google search I found a very informative tutorial by the GIS department at Penn State. Only a few steps were required to automate the data retrieval and storage process. First I had to call the file path of the Python executable. Then I had to call the Python script, which is essentially the scraping code and storage code above combined. Finally, I had to save the file with a .bat extension and give Windows Task Scheduler the file path of the batch file and a time when I wanted the program to execute. Example code can be found below.
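The batch file itself is only a couple of lines; the paths below are placeholders for wherever your Python executable and the combined scrape-and-store script actually live:

```
@echo off
REM Placeholder paths: point these at your own Python interpreter and script
"C:\Users\<you>\Anaconda3\python.exe" "C:\statcast\daily_scrape.py"
```

In Task Scheduler, the action simply runs this .bat file at the chosen time each day.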

Next Steps

Pitch location

As the second act of the project commences, there are a few critical tasks to accomplish. First is checking data quality: looking for outliers, determining how many values are missing, and making sure there are no duplicate rows. This will help add clarity to the negative vertical distance values in the pitch location plot, since those values may represent a ball that was spiked in the dirt, or may reflect Statcast measuring distance below a hitter's waist.
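A quick first pass at those checks might look like the sketch below, assuming the data comes from the scrape_day function above and that plate_z is the Statcast column for vertical pitch location:

```python
# Pull one day of data with the scrape_day sketch from earlier
df = scrape_day("2019-04-01")

print(df.isna().sum().sort_values(ascending=False).head(10))  # most-missing columns
print(df.duplicated().sum())                                  # exact duplicate rows
print(df.describe())                                          # quick scan for outliers

# Rows with a negative vertical location; plate_z is the Statcast column for
# vertical pitch location in feet relative to home plate
print(df.loc[df["plate_z"] < 0, ["pitch_type", "plate_z", "description"]].head())
```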

Once the data has been deemed to be of good quality, it will be time to start looking for insight. One area of focus will be using a clustering method to see whether there are patterns in how pitchers attack a hitter; measuring how close each pitch lands to its cluster centroid may also provide insight into a pitcher's accuracy. None of this analysis will be of value unless the insights can be conveyed concisely, so another decision will be which visualization library to use: seaborn, bokeh, or plotly.
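As a sketch of that idea, a simple k-means pass over pitch locations (plate_x and plate_z, with an arbitrary five clusters rather than a tuned choice) would look roughly like this:

```python
import numpy as np
from sklearn.cluster import KMeans

# plate_x / plate_z are the Statcast horizontal and vertical location columns
locations = df[["plate_x", "plate_z"]].dropna().to_numpy()
kmeans = KMeans(n_clusters=5, random_state=0).fit(locations)

# Distance from each pitch to its assigned centroid as a rough accuracy measure
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
distances = np.linalg.norm(locations - assigned_centroids, axis=1)
print(distances.mean())
```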

The final area to look at will be whether it is possible to put a 95 percent confidence interval on how a player will perform in their next game. I will explore using a decision tree, a random forest, and Bayesian analysis. This information would be useful in a daily fantasy contest or for evaluating a daily gambling line.
