The Good, The Bad, and The Unlucky: A Look at the Chicago Cubs’ End to 2018
About a month ago, MLB released a dataset showing detailed information regarding every MLB pitch from the 2015 season until the end of last season.
While baseball season is still a few months away, owners and general managers are working to find key statistics so that they can best assess their team’s performance from the previous years and find ways to improve for 2019. Using OmniSci Immerse, I created a dashboard that gave detailed insights for each player and team over the past four seasons. I’ll describe below the process of loading the data into the platform and the steps in making my charts. Soon enough, you’ll be able to discover your own insights, create your own graphs and perform your own sabermetrics for your favorite baseball team.
Data Preparation
I found the MLB dataset on kaggle.com, and I was able to download the four tables of atbats.csv, player_names.csv, pitches.csv, and games.csv straight from the website to my computer. Using OmniSci cloud, uploading data has never been easier. First, go to the “Data Manager” tab and then click the blue “Import Data” button in the top right corner. On Immerse, I chose the “Import data from local file” option, and first selected pitches.csv to upload. Once the file was done being uploaded, I clicked “Preview.” An important change that I made to data file initially was that I changed the datatype of the at-bat ID, or “ab_id,” to BIGINT instead of a float. This way, I can easily join the data sets into one table on columns that exist in multiple tables. I followed the same process for at_bats.csv, and changed the ab_id, batter_id and pitcher_id variables to BIGINT. For player_names.csv and games.csv, and I changed player ID and game ID to BIGINTs too.
The next step I took involved merging all four tables into a single table, to take advantage of the cross-filtering features that the Immerse offers. I inner joined the four tables using the SQL Editor to create a new table that I called mlb_data.
Creating an Immerse Dashboard
As a lifelong Chicago Cubs fan, I was particularly disappointed last year by our lackluster season, where we lost our lead in our division in the last month of the season and missed the playoffs in a win-or-go-home game against the Brewers. Using OmniSci Immerse, I was able to create a comprehensive dashboard to provide insight into what exactly went wrong. Did we underperform in certain areas that can be uncovered through data analytics? Was the competition playing unusually well against our pitching? Maybe it was a lot of bad luck?
To solve these questions, the first chart I set up was a timeline of games throughout the past four seasons. I went to my dashboard and clicked “Add Chart” in the top right corner. I then selected a Line Graph, and set mlb_data as my source. I set the x-axis to date, which originated from the games.csv file that was put into mlb_data. I binned the games by week, and then set my y-axis as a custom SQL entry for number of games. What showed up was a timeline of every individual game, from Opening Day April 5, 2015 to game 162 on September 30, 2018. This chart gives me the ability to crossfilter the dashboard on certain seasons or segments of seasons that are of particular interest to me.
With this chart, I have the ability to filter the entire dataset to only show records of pitches, at-bats and games of the last month of the 2018 season. By clicking and dragging on the timeline of records, I can alter the view to show only the records from September 1 to September 30 in 2018. In a matter of milliseconds, we can now see all the records of data that fit this criteria — slimming down over 2.8 million records into around 120,000 records.
The next few charts that I created were a histogram of the number of times that each “event” of an at-bat happened — such as number of strikeouts, walks, singles, home runs, and any other possibility at the plate for the batter. Using the custom SQL entry option, I created a few more tables outlining batting average, slugging percentage, and on-base percentage so that I could see statistics for both the league and individuals in an easy-to-analyze window. Keep in mind that I can still use the timeline graph to analyze these statistics over time, to see how all of these metrics are affected by different months or seasons!
One statistic that I thought could be helpful in analyzing why the Cubs did so poorly the last month of the season was BABIP, or batting average of balls in play. This statistic is used by the sabermetrics community as well as those who play fantasy baseball. BABIP is unique in that it assesses a player’s batting average of only hits that are able to be fielded by the defense. Singles, doubles, and triples are included and not home runs. Strikeouts and walks are not included either. The number gives statisticians the ability to see how often a player gets on base after making contact with the ball, and conversely for pitchers, how often they give up a hit when the batter puts the ball in play. Lastly, BABIP is sometimes viewed as the “luck factor” for a player. If the Cubs’ batters have an unexpectedly low batting average for a month, then it may be worth looking at their BABIP to see if that has also dropped below league average (approximately .300). Simply put, a low BABIP could indicate that players are unlucky because even when they hit the ball, it is being fielded for an out.
I created the BABIP chart in my dashboard using my own SQL queries, and then cross-filtered the data to only show Cubs batters from the last month of the season. With these filters, I was able to see that their top nine hitters had a BABIP of .313, which is actually above the league average. Their batting average and slugging percentages were .254 and .384 respectively. I realized that the Cubs woes probably were not due to bad luck, but that left the possibility that some other teams maybe had more luck.
I then took a look at the data for the Milwaukee Brewers, who were the team to go 20–7 in September to steal the division lead from the Cubs, and effectively kick them out of the playoffs. They had a team BABIP of .292, however their batting average was 22 points lower at .270. So even though their total average was much lower, when the Brewers put the ball in play, they had a higher chance of getting on base. Taking a deeper look, I created a pie chart that displayed players’ first and last names along with their number of plate appearances. I selected the Brewers’ top three hitters in Christian Yelich, Lorenzo Cain, and Jesus Aguilar to see how the other statistics would change. The three had a combined BABIP of .348 while their combined BA was much lower at .315.
So maybe the Brewers were a little luckier than normal. Looking throughout the whole season, I saw that the trio of Yelich, Cain, and Aguilar actually had BABIPs consistently higher than their normal batting averages. Meanwhile, the season’s league averages for BABIP and BA were .296 and .248. Could it be that the unusually high BABIP had a correlation with the higher batting averages for Milwaukee? That somehow they had more luck that the balls they made contact with would end up as hits?
We Want Data!
I only chose a few statistics to look at in this case, but there are hundreds of different searches that someone can use to analyze the MLB. When looking at sabermetrics, you are generally given a sheet full of numbers and data, or you have to individually google search each statistic for a player or team. When I put the data into OmniSci, I had the freedom to create and visualize any statistic into an easily digestible format.
Looking to the future, I hope that the MLB releases even more data for analysts and statisticians to work with. With modern technology, MLB has the ability to place trackers in every single baseball used in a game. If this data ever becomes publicly available, OmniSci would be able to use the heatmap and linemap charts to visualize the trajectory of every single pitch and hit! In the meantime, I encourage everyone to use OmniSci to find insights in large datasets in a matter of a few clicks.