The Trials and Tribulations of Learning Python: My Experience Writing Functions, Using Beautiful Soup, and Working with pyMongo
A short recap: last month I left off the Chicago Python project having successfully found and retrieved MLB Statcast data, downloaded and installed MongoDB, and scheduled when my data retrieval would take place!
However, a couple of things were amiss. In my function to download the Statcast data there was a rather long list of hard-coded statements that converted specific variables to numeric data. That has since been corrected in an updated function where I use a list containing the variable names and the apply function to coerce them to numeric. What once took over twenty lines now takes one list in a config file and one apply call. Now that’s what I call an improvement.
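The list-plus-apply pattern can be sketched roughly as follows. This is a minimal illustration rather than my exact function: it assumes the data lives in a pandas dataframe, the NUMERIC_COLUMNS list stands in for the one kept in the config file, and the helper name coerce_numeric and the sample values are made up for the example.

```python
import pandas as pd

# Hypothetical stand-in for the list kept in the config file.
NUMERIC_COLUMNS = ["release_speed", "plate_x", "plate_z"]


def coerce_numeric(df, columns):
    """Return a copy of df with the given columns converted to numeric."""
    out = df.copy()
    # errors="coerce" turns values that cannot be parsed into NaN
    out[columns] = out[columns].apply(pd.to_numeric, errors="coerce")
    return out


# Made-up sample rows where every value arrives as text.
raw = pd.DataFrame({
    "pitch_type": ["FF", "SL"],
    "release_speed": ["95.1", "null"],
    "plate_x": ["0.25", "-1.1"],
    "plate_z": ["2.4", "3.0"],
})
clean = coerce_numeric(raw, NUMERIC_COLUMNS)
print(clean.dtypes)
```

Keeping the column names in one list means a new numeric variable only needs a one-line change in the config rather than another conversion statement.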
Now for what will occur in this week’s episode of… “So You Want to Learn Python”. This time around I started writing functions to move my data to my MongoDB, query my database, and use Altair to create a scatter plot depicting pitch locations. Unfortunately, I ran out of time to write a function to scrape FanGraphs, but I was successful in using Beautiful Soup 4.4 to scrape more data. I am also changing up the format this week: instead of describing what I did and then showing code, I am writing more of an explanation, so I can come back to this post when I no doubt forget how to write a function, query MongoDB, or create an Altair graph.
Writing Python Functions
Warning: what follows is a brief description of functions, followed by code that is not complete but is significantly better than what I had before… which was no functions.
Functions are magical pieces of code that help readability and divide the work into useful chunks. A Python function definition is made up of the def keyword, the function name, a parenthesized list of arguments, and a colon, followed by an indented block of code.
def function_name(argument1, argument2, etc):
In Python, def is the keyword that starts the definition and function_name is the name used to call the function later. A variable can be passed to a function and a function can even return a value. Wow, that is awesome. I’d like to take a moment to thank Larry Page and Sergey Brin; without those two I’d be relegated to a library, combing through volumes of text to learn what took minutes on Google. Now back to the action… to call a function, all a user has to do is type the function name and pass the required arguments.
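To make those pieces concrete, here is a small example of defining a function, passing it arguments, and using the value it returns. The function name and the numbers are invented purely for illustration.

```python
def batting_average(hits, at_bats):
    """Return hits divided by at-bats, rounded to three decimals."""
    if at_bats == 0:
        # Avoid dividing by zero for a player with no at-bats
        return 0.0
    return round(hits / at_bats, 3)


# Call the function by name and pass the required arguments.
average = batting_average(27, 90)
print(average)
```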
Each language has its own syntax. For example, I can write a function in SAS by typing the following:
%macro <macro name>(<macro variable(s)>);
<macro code>
%mend;
Or I could write a function in R by writing:
<function name> <- function(<function variable>) {
< function code >
return(<return value>)
}
As I was writing and beginning to use functions more, I had a thought: what if someone uses the function incorrectly and gets an error? Is it possible to write a custom error message? At this moment my mentor came to the rescue and steered me down the path of using a try and except block for custom error messages. After a quick Google search (thank you again, Larry and Sergey) I found a wonderful link ( https://realpython.com/python-exceptions/ ) where I learned a little more than what my mentor had told me during our weekly meetings. The article covers the differences between exceptions and syntax errors, raising an exception, the try and except block, and including an else clause with the try and except block. I recommend giving the article a read if you have the chance.
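A small sketch of how try, except, and else fit together to produce a custom error message is shown below. The function name to_float and the wording of the message are made up for illustration; this is not code from the article or from my project.

```python
def to_float(value):
    """Convert value to a float, raising a readable error if it cannot be."""
    try:
        number = float(value)
    except (TypeError, ValueError) as err:
        # Re-raise with a custom message, keeping the original error attached
        raise ValueError(f"Expected a numeric value, got {value!r}") from err
    else:
        # Runs only when the try block raised nothing
        return number


try:
    to_float("fastball")
except ValueError as err:
    message = str(err)
    print(message)
```

Catching only the specific exceptions that float() can raise keeps unrelated bugs from being hidden behind the custom message.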
Armed with my newly acquired knowledge, I wrote functions to move data to my database and to create a scatter plot showing pitch locations grouped by pitch type.
Altair is a wonderful library that was recommended by my mentor. It is responsible for the pitch location chart and is very similar to ggplot2 in R. Furthermore, it was created by Jake VanderPlas, who was very helpful on Stack Overflow and answered my question on how to include the box that represents the strike zone. Altair, thus far, is my go-to library for data visualization in Python. Below you can find my code to create the above plot, with the try and except block as a catch-all. This is an area where I will need to go back and revise my custom error handling.
Using Beautiful Soup
As I was looking at my current data, I realized that in order to achieve my goal of using a decision tree to predict daily fantasy points for a particular player, I would need a different view of the data, since my current view was very granular. So, I headed on over to my favorite baseball website, FanGraphs! They have it all there: summary statistics, new sabermetrics, it’s a cornucopia of information. But how would I get it? That’s where Beautiful Soup came into play. After inspecting the web page I saw that the layout was organized by tags and thought Beautiful Soup would be perfect here. After reviewing their robots.txt policy I saw that they allowed scraping and began to learn a new skill set.
After about an hour of unsuccessful trial and error I knew this part would be difficult. So, I did what any enterprising individual who wanted to cut down on their learning curve would do: I did some R&D, aka searched to see if anyone else had done this so I could adapt their code to my needs. Thankfully, someone had… thank you conorkcobin, without your GitHub this would have been harder.
Now with a firm base of R&D work done, I began to attempt to scrape data on Mike Trout.
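The general idea of walking a page's tags with Beautiful Soup can be sketched as below. This is only an illustration under assumptions: the HTML snippet, its table id, and its numbers are made up and do not reflect the real structure or contents of any FanGraphs page, and parse_stats_table is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded stats page.
html = """
<table id="stats">
  <tr><th>Season</th><th>HR</th><th>AVG</th></tr>
  <tr><td>2001</td><td>10</td><td>.250</td></tr>
  <tr><td>2002</td><td>12</td><td>.275</td></tr>
</table>
"""


def parse_stats_table(page_html, table_id):
    """Return the table's rows as a list of dicts keyed by header text."""
    soup = BeautifulSoup(page_html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        raise ValueError(f"No table with id {table_id!r} found")
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has no <td> cells, so it is skipped
            rows.append(dict(zip(headers, cells)))
    return rows


rows = parse_stats_table(html, "stats")
print(rows)
```

On a real page the same function would be fed the downloaded HTML, with the table id taken from inspecting the page in the browser.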
Querying MongoDB through pyMongo
After restarting my computer I could no longer connect to my MongoDB through pyMongo… what gives?! Well, as I am a novice, I completely overlooked the fact that I needed to start my database. A quick Google search taught me that I needed to execute the following.
"C:\Program Files\MongoDB\Server\4.0\bin\mongod.exe"
After successfully starting the database and connecting to it through pyMongo, I could begin the task of querying it. Knowing that a collection consists of documents, and that each document is roughly equivalent to a record in a SQL table, I began to structure my query in more familiar language. I knew I wanted to select all columns from my collection pitch_level and, to limit my results, only documents from a single game, where game_pk = 529412. What normally would have looked something along the lines of
SELECT
*
FROM
pitch_level
WHERE
game_pk = 529412
could more succinctly be written as:
import pandas as pd
results_df = pd.DataFrame(list(mydb.mycol.find({"game_pk": 529412})))
In the code above, not only was I able to run a query based on the value of the unique game identifier, but by wrapping the cursor that find() returns in pd.DataFrame I was able to store the results in a pandas dataframe! Talk about easy!
But what if I wanted to find documents that had a value GREATER THAN or LESS THAN a specific value? Well, in MongoDB that is as easy as
mydb.mycol.find({"<column>": {"$gt": <value>}})
mydb.mycol.find({"<column>": {"$lt": <value>}})
As my favorite office supply store says, “that was easy”. But what if I wanted to return results that fell between two values? Say no more, all that needs to be done is:
# Between 2 dates
my_db.my_collection.find({"<column>": {"$gte": <value1>, "$lt": <value2>}})
But now what if I wanted to find a document with either this value OR that value? Say no more because pyMongo has you covered with:
my_db.my_collection.find({"$or": [{"<column>": <value1>}, {"<column>": <value2>}]})
How about if I wanted to find a document with values IN a particular group? PyMongo has that too:
my_db.my_collection.find({"<column>":{"$in":[<value1>,<value2>]}})
How about if I wanted to find a document with a value NOT IN a particular group? Yup, pyMongo has that too:
my_db.my_collection.find({"<column>":{"$nin":[<value1>,<value2>]}})
By combining my newfound querying knowledge with function writing, I was able to compose a general querying function:
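A minimal sketch of what such a helper could look like is shown below. The names build_filter and run_query, the short operator labels, and the overall design are assumptions made for illustration; only the filter documents themselves follow the MongoDB operators covered above.

```python
def build_filter(column, op, value):
    """Build a MongoDB filter document for one column.

    op is one of: "eq", "gt", "lt", "between", "in", "nin".
    For "between", value is a (low, high) pair; low is inclusive, high exclusive.
    """
    if op == "eq":
        return {column: value}
    if op in ("gt", "lt", "in", "nin"):
        # Maps directly onto the $gt, $lt, $in, and $nin operators
        return {column: {f"${op}": value}}
    if op == "between":
        low, high = value
        return {column: {"$gte": low, "$lt": high}}
    raise ValueError(f"Unsupported operator: {op!r}")


def run_query(collection, column, op, value):
    """Run the filter against a collection and return the matches as a list."""
    return list(collection.find(build_filter(column, op, value)))


print(build_filter("game_pk", "eq", 529412))
print(build_filter("release_speed", "between", (90, 95)))
```

With a live connection, a call such as run_query(mydb.pitch_level, "game_pk", "eq", 529412) would return the documents for that game, and the resulting list can be passed to pd.DataFrame if a dataframe is needed.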
Next Steps
The project is definitely coming along, but there are a couple of critical things that need to be addressed. I need to clean up the function that moves the data to MongoDB. Currently I am renaming columns in that function, and that data cleaning needs to be taken care of outside of the function or in another function. I also need to put the FanGraphs scraper in a function and determine a methodology to scrape only the players who have played a game on the day the function runs automatically. I also need to better implement the custom error messages in the functions. Then I can start to look at ways to predict daily fantasy points. Additionally, I need to put a little more time into creating a 2-D density plot that shows the likelihood of a pitch being called a strike given where it crosses the plate. This likelihood has been calculated through a generalized additive model. The plot is useful when looking at whether a pitcher was getting favorable calls or whether the catcher is good at framing pitches. However, none of this will be useful unless there is a way to deliver all of my analysis. Thus, a Flask app will have to be coded in order to show the graphs that will convey all of this information. I hope you join us next week on “Oh My God There Is So Much Work Left To DO!!!!”.