Markov Chain Baseball Models

Published in

Analytics Vidhya

3 min read · Dec 31, 2019

Markov Chain Baseball Models: A Home Run in the Sabermetrics Community

Baseball is a game that analysts can break down by the events that occur in each half inning. These events can be modeled as a finite Markov chain with 24 discrete states, plus a 25th state that marks the end of the half inning. The 24 states cover every in-game combination of outs and base runners, from no outs with the bases empty to two outs with the bases loaded.

From there, analysts can use the transition probabilities from state to state to represent the events that occur each half inning. Using these transition probabilities, analysts can perform full game simulations and compute expected run totals for both teams. The Markov chain is also effective for providing strategic advice to managers, such as lineup optimization and the best batting order, and for helping front offices evaluate team performance, including the influence of a particular player or the effect a trade or free agent signing can have on the roster.

I’m sure you must now be wondering: “How do I build these Markov chain models?”

Fortunately, R and Python have you covered. R offers a package named markovchain and Python offers a package named discreteMarkovChain, both of which can be used to perform these baseball-related operations.

Analysts must install these packages in their preferred programming language and must also have access to baseball game play-by-play data to be able to build these models.

Bryce Harper is really fired up about these Markov chain baseball models!

To dive further into this subject, we can look at a baseball game as if it were a set of transitions that occur due to each player’s plate appearance. When a player comes up to bat, he finds himself in one of 24 possible situations. Each of first, second, and third base is either occupied or empty, which gives eight runner configurations, and there can be zero, one, or two outs (8 × 3 = 24). The third out ends the half inning, which is the 25th situation. Each plate appearance can result in runners advancing, a change in the number of outs, or both. Because the set of situations is finite, the team will find itself in exactly one of those 25 situations at every plate appearance.
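
As a minimal sketch of the state space described above, the 24 situations plus the end-of-inning state can be enumerated in a few lines of Python. The tuple encoding and the `label` helper are illustrative choices, not part of any particular package.

```python
from itertools import product

# Each base is either empty (0) or occupied (1); outs are 0, 1, or 2.
BASES = list(product((0, 1), repeat=3))            # (first, second, third)
STATES = [(bases, outs) for outs in range(3) for bases in BASES]
END = "3 outs"                                     # absorbing 25th state
ALL_STATES = STATES + [END]

def label(state):
    """Readable label such as '1_3 2 out' (naming scheme is illustrative)."""
    if state == END:
        return END
    bases, outs = state
    runners = "".join(str(i + 1) if on else "_" for i, on in enumerate(bases))
    return f"{runners} {outs} out"

print(len(STATES), len(ALL_STATES))
print(label(((1, 0, 1), 2)))
```

Eight runner configurations times three out counts gives the 24 transient states, and the single absorbing state brings the total to 25.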

Next, two-way frequency tables of the state transitions are created, with one row per starting state and one column per ending state. These tables are converted into transition probability matrices, which define the Markov chain. We can then calculate the runs scored on each move from one state to another and loop through all possible states to calculate how many runs are expected to score by the end of the inning, given an initial state.
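
The pipeline above can be sketched in plain Python. Everything numeric here is an assumption for illustration: the outcome counts per 100 plate appearances are made up (a real model would tabulate them from play-by-play data), and the runner-advancement rules are deliberately simplified (runners hold on outs, no double plays or sacrifice flies). The runs on a transition follow from bookkeeping: every batter ends up as a runner, an out, or a run.

```python
from collections import defaultdict
from itertools import product

END = "END"  # absorbing 25th state (three outs)
STATES = [(b, o) for o in range(3) for b in product((0, 1), repeat=3)]

# Hypothetical outcomes per 100 plate appearances (illustration only, NOT real data).
EVENT_COUNTS = {"out": 68, "walk": 9, "single": 15, "double": 5, "triple": 1, "homer": 2}

def advance(state, event):
    """Next state under simplified, assumed runner-advancement rules."""
    (f, s, t), outs = state
    if event == "out":
        return END if outs == 2 else ((f, s, t), outs + 1)
    if event == "walk":  # only forced runners move up
        return ((1, 1 if f else s, 1 if (f and s) else t), outs)
    if event == "single":
        return ((1, f, s), outs)
    if event == "double":
        return ((0, 1, f), outs)
    if event == "triple":
        return ((0, 0, 1), outs)
    return ((0, 0, 0), outs)  # homer

def runs_scored(before, after):
    """Runs implied by a transition: each batter becomes a runner, an out, or a run."""
    if after == END:
        return 0  # assumption: no run scores on the third out
    (b0, o0), (b1, o1) = before, after
    return (sum(b0) + o0 + 1) - (sum(b1) + o1)

# 1) Two-way frequency table: counts[before][after]
counts = defaultdict(lambda: defaultdict(int))
for state in STATES:
    for event, n in EVENT_COUNTS.items():
        counts[state][advance(state, event)] += n

# 2) Row-normalize the counts into transition probabilities
P = {s: {a: n / sum(row.values()) for a, n in row.items()} for s, row in counts.items()}

# 3) Expected runs to the end of the half inning, by repeated sweeps over all states
E = {s: 0.0 for s in STATES}
E[END] = 0.0
for _ in range(10_000):
    delta = 0.0
    for s in STATES:
        new = sum(p * (runs_scored(s, a) + E[a]) for a, p in P[s].items())
        delta = max(delta, abs(new - E[s]))
        E[s] = new
    if delta < 1e-12:  # stop once no state's value changes
        break

print(round(E[((0, 0, 0), 0)], 3))  # expected runs from bases empty, no outs
```

The sweep in step 3 is a plain iterative solve of the expected-runs equations. With a matrix library, the same values could be obtained in one step from the fundamental matrix of the absorbing chain.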

Once we obtain the expected run totals for each team, we then build transition matrices for each individual starting player on both teams. After these matrices are built, we start and complete the game simulation process.

While no simulation can be perfectly accurate or completely predictive of the results of a game, running many simulations does give an idea of the game results we should expect (10,000 simulations seem to lend stronger predictive power). Teams should not rely on this method as their only way of predicting a win or making roster decisions; rather, it works best as a supplement to the other tools and resources at their disposal. Still, based on research and personal experience, the Markov chain method has plenty of benefits in baseball:
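
A Monte Carlo version of this idea can be sketched with the standard library alone. The per-batter outcome weights below are hypothetical, the advancement rules are simplified assumptions (runners hold on outs), and tied games are simply dropped rather than played into extra innings.

```python
import random

EVENTS = ("out", "walk", "single", "double", "triple", "homer")

def play(bases, event):
    """Return (new_bases, runs) for a non-out event under simplified rules."""
    f, s, t = bases
    if event == "walk":  # only forced runners move up
        return (1, 1 if f else s, 1 if (f and s) else t), int(f and s and t)
    if event == "single":
        return (1, f, s), t
    if event == "double":
        return (0, 1, f), s + t
    if event == "triple":
        return (0, 0, 1), f + s + t
    return (0, 0, 0), f + s + t + 1  # homer

def simulate_game_runs(lineup, rng, innings=9):
    """Runs scored by one team; `lineup` holds one weight tuple per batter."""
    runs, batter = 0, 0
    for _ in range(innings):
        bases, outs = (0, 0, 0), 0
        while outs < 3:
            weights = lineup[batter % len(lineup)]
            event = rng.choices(EVENTS, weights=weights)[0]
            batter += 1  # the batting order carries over between innings
            if event == "out":
                outs += 1
            else:
                bases, scored = play(bases, event)
                runs += scored
    return runs

def win_probability(home, away, n_games=10_000, seed=0):
    """Share of decided simulated games won by `home` (ties are dropped)."""
    rng = random.Random(seed)
    wins = ties = 0
    for _ in range(n_games):
        h = simulate_game_runs(home, rng)
        a = simulate_game_runs(away, rng)
        wins += h > a
        ties += h == a
    return wins / (n_games - ties)

# Hypothetical weights per batter, in the order of EVENTS (NOT real player data).
home = [(66, 10, 16, 5, 1, 2)] * 9
away = [(70, 8, 15, 4, 1, 2)] * 9
print(win_probability(home, away))
```

Giving each lineup slot its own weight tuple is the simulation counterpart of building a separate transition matrix for each starting player.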

The batting order that produces the highest expected number of runs in a game is also the order that produces the greatest expected number of wins for that team. In order to find the batting order for a set of nine players which maximizes the expected number of runs produced, one must test all 9! = 362,880 possible lineups.
Optimal batting orders can expect to win approximately 4 more games per 162-game season than the worst batting orders.
We can quantify the influence of a trade on the number of games a team wins.
We can evaluate the effect that adding free agents has on the expected number of wins per season.
We can use the calculated distribution of runs scored in each inning to assist with in-game managerial decisions.
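
The exhaustive lineup search mentioned in the first point can be sketched as follows. The player names, on-base rates, and per-slot weights are hypothetical, and `score` is only a cheap stand-in objective; in a real analysis it would be replaced by the expected runs computed from the Markov chain for that batting order.

```python
from itertools import permutations
from math import factorial

# Hypothetical on-base rates for nine players (illustration only).
OBP = {"P1": 0.38, "P2": 0.36, "P3": 0.35, "P4": 0.34, "P5": 0.33,
       "P6": 0.32, "P7": 0.31, "P8": 0.30, "P9": 0.28}

# Hypothetical weights: earlier lineup slots come to bat slightly more often.
SLOT_WEIGHTS = [4.8, 4.7, 4.6, 4.5, 4.4, 4.3, 4.2, 4.1, 4.0]

def score(lineup):
    """Stand-in objective; swap in the Markov chain expected runs here."""
    return sum(w * OBP[p] for w, p in zip(SLOT_WEIGHTS, lineup))

print(factorial(9))                      # number of batting orders to test
best = max(permutations(OBP), key=score)  # brute force over every order
print(best)
```

Because every one of the 362,880 orders is scored, the cost of the search is dominated by how expensive a single evaluation is, which is why a fast expected-runs calculation matters.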

Bryce Harper hopes you enjoy building these Markov chain baseball models that will inevitably predict the Phillies as your 2020 World Series Champions!

Written by Michael Pallante