I made a Python program to graph a baseball player’s batting average over the course of their career.

Paul Leiva
5 min readAug 25, 2020

--

On my current journey of learning some data science tactics, I decided to compose a side project where I pooled data from a website and graphed it. I decided to go with Baseball Reference because it seemed interesting enough to work with and possibly worth starting with for expanding on the project further. I decided to call this project BAvGraph (“Batting Average Graphed”).

I already have most of the Java fundamentals down, so I decided to go with my next-best language, which would be Python. Rather than using an API (which are somewhat hard to find for Baseball Reference), I decided to use some Python libraries to scape the web page without needing a browser page open.

These are the libraries that I used for this program. The sys` library is the only one that comes default. You will need to install the others.

The first thing we do in this program is prompt for the player name. The program takes the player’s “on-field” name (for lack of a better term). Players with initials as their names (ex: J.D. Martinez) are not always found. Pitchers also will not be found all the time. So, focus on entering in position players only. You can enter quit to quit the program. Due to the nature of the URLs provided by Baseball Reference, the name that the user input is then sliced into a first and a last name (an array of length 2).

Then, we format the URL appropriately to start the scraping. As you can see in the page for Paul Molitor, the site uses a universal convention in making URLs by making use of the players last initial, then the first 5 letter of the player’s last name (or less if it is not that long), and the first 2 letters of the player’s first name (no player has less than that many letters in their first name), and then followed by a 01.shtml. So, this URL is formed from the player name that was inputted in lines 18 and 19. Thanks to Python’s easy use of Lists, this parsing process is possible in just a few lines. Line 21 makes use of the request, which receives the HTML response. BeautifulSoup is then used in line 23 to parse the response source code for the HTML page. We then fetch the player’s name on the page in lines 25–26. This will display in the final graph to ensure you have the right player. If your graph is titled with another name, then Baseball Reference must have some overlapping or similar names to the player that was entered.

The table has the id equal to batting_standard and if you go to the player’s page, you will see this table displayed with the title of Standard Batting. For all position players, this should be the first table on their page. They look like the table below.

Of concern to us is the Year column, and the BA column, but I also added the team, league, and home run count in the data list, potentially to expand on in a future project.

Next in lines 28, the program fetches the table on the page. Line 29 then fetches all the rows in the table. Then, the program makes 4 variables in lines 31–34: (1) prevYear, which is used to prevent more than one season from appearing in our Lists, (2) data a list to hold specific cell values, (3) yrs a List to hold the years/seasons that the player played, and (4) avgs a list of the averages the player had every year.

Now with our variables, we iterate through each of the rows. We first fetch the Year , Team, League, BA (the batting average), and the HR count. If the Year for the row is this is the same year as the previous year that was retrieved (in the row above the one at hand), then we skip over this row. The reason why this happens is because some players can be traded in a season and so the site will display a row for each team they played for in a season, but we are only concerned with the first row where the year changes. This is because in the case of where a player was traded during the season, their first row with a different year displayed will be their cumulative, and the team will be displayed as TOT to indicate the row is cumulative for that year and season. The prevYear variable will only change if the current year is different from the current value of prevYear.

We then take our data and append to each of our lists respectively.

Lastly, use matplotlib to generate a table that will look like the one below.

Then, to do another search, you would click the button to close the window of the scatter plot (the red ‘x’ in the upper left on Mac devices) and you can enter another player name.

Also, if there is an error in the program, you should just be prompted to enter a new player.

Thanks to the documentation for Beautiful Soup and the documentation for matplotlib for making this possible. Some other helpful web pages and tutorials were used too.

Public repository here.

Play ball!

--

--