Las Vegas AI; A Baseball Predictor

Zeb Gordon
Bucknell AI & CogSci
Nov 8, 2019

By Jarek Gozdieski, Weiwei Gu, and Zeb Gordon

Introduction:

We have constructed an artificial intelligence agent that predicts the outcome of baseball games. The predictions are based on recent historical team statistics and on the individual statistics of the pitchers who will be playing in that game. The agent is built on a neural network that weights each variable according to how much it influences the outcome. For the matchup it analyzes, the network outputs the confidence that a specific team will win.

Our short-term goal is to have our agent predict a single game or a series, since baseball teams usually play three or four games in a row against the same opponent. Our long-term goal is to predict season-long records, which teams will make the playoffs, and who will win the World Series. We also plan to incorporate external variables, including weather, injuries, and which team is playing at its home stadium.

We foresee our agent being used in sports betting, where people could use it as a better reference for which team to bet on. It could also be used within baseball organizations, which could use it to see which team statistics they need to address in order to win games and improve their record. This could help them negotiate contracts more efficiently and find the right players, and the right mix of players, for their team.

Our agent needs to adapt in several ways. It needs to adapt to the teams playing the specific game it is analyzing, as well as to the data of those teams. When we extend the agent toward our long-term goal, it will also need to adapt to external variables, including weather, injuries, and any trades a team makes. The agent should also adapt to the recent record of each team. Athletic teams, especially in baseball, tend to do better when they have a lot of momentum, which seems especially evident in longer series such as the postseason. The agent needs to be versatile enough to address these variables and problems.

Analytics in Baseball:

We chose baseball for our AI agent because it has become the leading sport for statistical analysis. There is a wealth of data for every position and situation, whether it be pitcher-batter matchups, batting statistics, fielding statistics, or pitching and bullpen statistics.

Our largest inspiration, and what helped us recognize this phenomenon, was the movie Moneyball. The movie is based on the true story of the 2002 Oakland Athletics baseball team. It takes an in-depth look at Billy Beane, the general manager of the team at the time, who realizes that the conventional way of analyzing players and constructing a team is wrong. He partners with Peter Brand, an Ivy League graduate, to rebuild the team on a limited budget. Their approach relied on the individual statistics of players. This method revolutionized the way baseball is managed and the way players are analyzed, as the 2002 Athletics outperformed all expectations while having one of the smallest budgets in the major leagues (Ebert, 2011).

This revolution still has monumental effects nearly 20 years later. Data analytics has changed the game completely. As more and more data on baseball and its players becomes available, more coaches and organizations are using it to manage games and decide which players to acquire. The data goes as far as predicting the career performance and potential of younger players and rookies (“Analytics in Baseball: How More Data Is Changing the Game”, 2019). This is a departure from the traditional ways teams approached baseball, player contracts, and scouting future talent. Traditionally, teams searched for veteran players to add to their roster, but analytics has challenged this approach. Analytics values youth and the production that can come from young players more than what veterans can contribute to the team (“Analytics in Baseball: How More Data Is Changing the Game”, 2019).

Data analytics has gained so much popularity that companies have begun to devote resources to it. Amazon Web Services has created a platform called Statcast AI, which is used to help decide when to steal and when to send in a relief pitcher, among other statistics. Similar to our AI agent, the Statcast AI platform uses historical data to gain insight into these statistics. However, AWS draws on all 143 years of baseball statistics to produce its insights more efficiently and faster than a person analyzing the data could.

Neural Network:

The core of our AI agent is a neural network. A neural network is a set of algorithms designed to find patterns in data, and once trained it can be used to predict outcomes from the data used to construct it. Neural networks have gained popularity across industries in a large variety of applications. Several existing use cases apply neural networks in ways similar to ours.

One unique application of predictive neural networks that we discovered was predicting swimming performance. The purpose of this model was to investigate “the factors which are able to explain the performance in the 200 meters individual medley and 400 meters front crawl events in young swimmers” (Silva et al. 2007). The model was trained using data from high-level male and female swimmers, including their physical characteristics and athleticism. This included measures of strength and flexibility during exercises on land as well as in the pool, incorporating multiple variables to reach a more accurate conclusion. The network was a Multilayer Perceptron, a feed-forward neural network, with three neurons in a single hidden layer. This research showcases how neural networks can be an effective tool for performance modeling and talent identification in swimming and in sports generally. The agent was able to construct highly realistic models for predicting swimming performance (Silva et al. 2007). This research demonstrated the power of neural networks and their potential in sports, and it gave us a framework that inspired us to use neural networks for our project. Not only could we use neural networks to predict the outcomes of games and season-long records, but we could also extend the network to produce scouting analysis on which players are most likely to turn into superstars and which are not. In effect, neural networks could automate the methodology of Billy Beane and the 2002 Oakland Athletics.

Another application we discovered used neural networks to predict short-term traffic patterns. This work was done by Brian Smith and Michael Demetsky. The goal of the project was to create an agent that could use real-time data along with other data to monitor patterns and to predict and understand traffic flows. Such a model can handle non-linear, complex problems. They used a backpropagation neural network, which they believed was “more responsive to dynamic conditions” and prevented “overprediction” more than other models. Throughout their research and implementation, they also highlighted the practicality of the “propagation model’s ability to run in a parallel computing environment”. Smith and Demetsky demonstrated how this type of model can be most useful for real-time analysis (Smith & Demetsky, 1993). Although real-time analysis is not exactly what we proposed for our agent, it is good to consider the benefits and flaws of each model. Furthermore, a future step could be using our agent in the middle of a game, where it would need to develop a prediction from the real-time data of that game.

These applications widened our knowledge of how we could leverage neural networks for an efficient and accurate approach to our baseball game prediction agent. Using either a backpropagation model or a Multilayer Perceptron model could take our agent a step further: incorporating more variables into the prediction and using the model to do more than predict outcomes, by providing a reason for each prediction and highlighting players who could change the outcomes of these games. The complexity of such projects exceeds what we could develop in the time frame of our project, but they could be used to enhance our agent to solve more complex baseball problems.

For the objective of predicting baseball game outcomes, season-long records, and World Series champions, we implemented a neural network model that used a sequential learning algorithm. This type of model has been applied to forecasting and prediction by S. Rajasekaran, K. Thiruvenkatasamy, and Tsong-Lin Lee. They created a sequential learning model that “uses one hidden neuron to predict the current tidal level using the previous four levels quite accurately” (Rajasekaran et al. 2005). Sequential learning helped this agent accurately predict hourly tide data for a month. The model is capable of learning tidal-level data from a small number of observations and predicting the level accurately and efficiently (Rajasekaran et al. 2005). This application is the main inspiration for our approach to predicting baseball outcomes. It supports our choice of neural networks, and specifically sequential learning, to evaluate our historical baseball data and output the winners of specific games and series.
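The idea of predicting the current level from the previous four can be pictured as a sliding window over a time series. The sketch below only shows how such input and target pairs might be built; it is not the model from Rajasekaran et al., and the function name `make_windows` and the tidal values are made up for illustration.

```python
from typing import List, Tuple


def make_windows(series: List[float], window: int = 4) -> List[Tuple[List[float], float]]:
    """Pair each run of `window` consecutive values with the value that follows it."""
    pairs = []
    for i in range(len(series) - window):
        # Inputs are the previous `window` observations; the target is the next one.
        pairs.append((series[i:i + window], series[i + window]))
    return pairs


# Hypothetical hourly tidal levels in metres (illustrative numbers, not real data).
levels = [1.2, 1.5, 1.9, 2.1, 2.0, 1.7, 1.3]

for inputs, target in make_windows(levels):
    print(inputs, "->", target)
```

Each printed pair would be one training example for a forecasting model: four past observations as input and the following observation as the expected output.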

Our Agent; Keras

To implement the neural network, we decided to use Keras. Keras is an open-source neural network library written in Python that allows quick implementation of deep neural networks. We chose Keras because it is a very user-friendly library. It provides the standard building blocks of a neural network, such as layers, activation functions, and optimizers. We can also find documentation for every Keras function online (“Keras: The Python Deep Learning Library”, 2019), so we could learn and apply Keras in a very short time. We used the Sequential model from Keras to implement our game-predicting agent.
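As a rough illustration of what a Keras Sequential model looks like, here is a minimal sketch. It is not our actual network: the layer sizes, optimizer, loss, number of epochs, and the random stand-in data are all assumptions chosen for the example, and it assumes Keras is available through TensorFlow.

```python
import numpy as np
from tensorflow import keras

# Assumption: five input features per game; eight hidden units is an arbitrary choice.
NUM_FEATURES = 5

model = keras.Sequential([
    keras.Input(shape=(NUM_FEATURES,)),
    keras.layers.Dense(8, activation="relu"),
    # A sigmoid on the last layer keeps the output between 0 and 1.
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Random stand-in data: 32 hypothetical games with made-up features and win/loss labels.
rng = np.random.default_rng(0)
x = rng.uniform(0.5, 1.5, size=(32, NUM_FEATURES)).astype("float32")
y = rng.integers(0, 2, size=(32, 1)).astype("float32")

model.fit(x, y, epochs=2, batch_size=8, verbose=0)

# One confidence value per game for the first four games.
confidence = model.predict(x[:4], verbose=0)
print(confidence.ravel())
```

Because the data here is random, the printed confidences carry no meaning; the point is only the shape of the workflow: define the layers, compile, fit, and predict.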

Ethics

There are a plethora of ethical dilemmas that need to be addressed. Our agent could raise ethical issues both in how it grows and in the environments where we propose it would be useful. Applying artificial agents to real-world applications brings many ethical problems.

One of the environments where we foresee our agent being used is sports betting. Multiple states have been legalizing sports betting, and phone applications have made it more accessible than ever. With all of this access, an agent like ours could encourage people to become addicted to betting. People with gambling problems were strongly associated with sports betting and engaged more often in sports-related betting (Russell et al. 2019). Providing a resource like ours, and having users expect it to give them all of the right answers, could send a person down a very irresponsible and addictive path. Sports betting is a significant driver of gambling problems. There is tremendous risk in enticing people to gamble on sports, which could financially and psychologically harm many people (Russell et al. 2019). These risks will need to be addressed if our agent ever goes public. It will be vital that users understand the risks involved with sports betting and gambling addiction.

Another potential ethical issue is the risk of violating data protection laws. The other environment where we plan for our agent to be used is within baseball organizations, where we see it potentially being developed into something much more advanced than what we created. The data our agent currently uses is all public data on past baseball statistics, and this type of analysis of public information does not appear to violate data protection laws. However, there are cases where sports data analytics can. For example, using data analytics on medical information from several sources to prevent injuries among AC Milan soccer players can raise data protection concerns (Schwartz, 2010).

Conclusion

We obtained the statistics of each team from the 2010 to 2018 seasons and the result of every regular season game from two data sets. To train our agent, for each regular season game we found the batting average, on-base percentage, steals, strikeouts, and fielding errors of the two opposing teams. We then calculated the ratio between the two teams for each statistic to obtain a list, and we fed that list to the agent as the input, with the result of the game as the expected output. We applied ReLU as the activation function; note that ReLU is unbounded above, so an output meant to be read as a confidence between 0 and 1 would normally also need a bounded output activation such as a sigmoid.
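The ratio inputs described above could be built along the following lines. This is a sketch under assumptions: the dictionary keys, the helper name `ratio_features`, and the season totals are hypothetical and do not come from our data sets. The two small activation functions are included only to show the range of values each one can produce.

```python
import math
from typing import Dict, List

# Hypothetical key names for the five statistics used as inputs.
STATS = ["batting_avg", "on_base_pct", "steals", "strikeouts", "fielding_errors"]


def ratio_features(team_a: Dict[str, float], team_b: Dict[str, float]) -> List[float]:
    """Return team_a / team_b for each statistic (season totals assumed nonzero)."""
    return [team_a[stat] / team_b[stat] for stat in STATS]


def relu(z: float) -> float:
    """ReLU: zero for negative inputs, unchanged (and unbounded) for positive ones."""
    return max(0.0, z)


def sigmoid(z: float) -> float:
    """Sigmoid: squashes any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Made-up season statistics for two teams.
home = {"batting_avg": 0.260, "on_base_pct": 0.330, "steals": 90,
        "strikeouts": 1300, "fielding_errors": 85}
away = {"batting_avg": 0.250, "on_base_pct": 0.320, "steals": 75,
        "strikeouts": 1400, "fielding_errors": 100}

features = ratio_features(home, away)
print(features)

# For the same raw score, ReLU can exceed 1 while sigmoid stays inside (0, 1).
print(relu(2.5), sigmoid(2.5))
```

A ratio above 1 means the first team has the larger value for that statistic. Whether larger is better depends on the statistic (for strikeouts and fielding errors it is not), which is something the network would have to learn from the game results.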

In terms of results, our agent reached an accuracy of up to 60%, which does not meet our expectations but shows that it adapted to some extent. We attribute the inadequate performance to our limited understanding of the data. In a baseball game, many factors matter beyond the five statistics we chose to train our agent. We need a better understanding of baseball statistics to improve its performance.
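For reference, an accuracy figure like this can be computed by turning each confidence into a win or loss call and counting how many calls match the real results. The sketch below assumes a 0.5 cutoff and uses made-up confidences and outcomes; the function name `accuracy` is ours for illustration.

```python
from typing import List


def accuracy(confidences: List[float], outcomes: List[int], threshold: float = 0.5) -> float:
    """Fraction of games where the thresholded confidence matches the actual result."""
    if not outcomes or len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must be non-empty and the same length")
    # A confidence at or above the threshold counts as predicting a win (1).
    correct = sum(int(c >= threshold) == o for c, o in zip(confidences, outcomes))
    return correct / len(outcomes)


# Made-up confidences and actual results (1 = win, 0 = loss) for five games.
confidences = [0.71, 0.45, 0.58, 0.30, 0.66]
outcomes = [1, 0, 0, 0, 1]

print(accuracy(confidences, outcomes))  # 4 of 5 calls are correct -> 0.8
```

An accuracy should always be read against a simple baseline, such as always picking the home team, since a coin flip already gets about half of the games right.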

References

Ebert, R. (2011, September 21). Moneyball movie review & film summary (2011): Roger Ebert. Retrieved November 5, 2019, from https://www.rogerebert.com/reviews/moneyball-2011.

GitHub. (2019). Keras: The Python Deep Learning library. Retrieved November 5, 2019, from https://keras.io/.

Knowledge@Wharton. (2019, February 21). Analytics in Baseball: How More Data Is Changing the Game. Retrieved November 5, 2019, from https://knowledge.wharton.upenn.edu/article/analytics-in-baseball/.

Rajasekaran, S., Thiruvenkatasamy, K., & Lee, T.-L. (2005, May 4). Tidal level forecasting using functional and sequential learning neural networks. Retrieved November 5, 2019, from https://www.sciencedirect.com/science/article/pii/S0307904X05000491#!

Russell, A. M. T., Hing, N., & Browne, M. (2019, March 28). Risk Factors for Gambling Problems Specifically Associated with Sports Betting. Retrieved November 5, 2019, from https://link.springer.com/article/10.1007/s10899-019-09848-x.

Schwartz, P. M. (2010). Data Protection Law and the Ethical Use of Analytics. Retrieved November 5, 2019, from http://www.informationpolicycentre.com/uploads/5/7/1/0/57104281/data_protection_law_and_the_ethical_use_of_analytics__paul_schwartzwhite_paper_2010_.pdf.

Silva, A. J., Costa, A. M., Oliveira, P. M., Reis, V. M., Saavedra, J., Perl, J., … Marinho, D. A. (2007, March 1). The use of neural network technology to model swimming performance. Retrieved November 5, 2019, from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3778687/.

Smith, B., & Demetsky, M. (1993, November 30). Short-Term Traffic Flow Prediction: Neural Network Approach. Retrieved November 5, 2019, from https://trid.trb.org/view/424677.
