Statistics applications on sports: the baseball case

Mirco t
Published in Pills of BSDSA
7 min read, Mar 25, 2024

If you are an American, you probably hold an iconic memory: the first time you watched a baseball game. The warmth of that sunny afternoon, the excited chatter, and the unmistakable smell of snacks and hot dogs are rooted in the childhood of thousands of people. In fact, MLB is ranked as the second-most popular of the major North American sports leagues.

In this article, we explain a model that predicts players’ batting outcomes using MAMSE (Minimum Averaged Mean Squared Error) weights, and we use this methodology to estimate commonly used baseball batting metrics. The approach improves inference for each player by borrowing data from the other members of the league. Finally, the estimates are compared with empirical data from the 2018 MLB season.

Before delving into the matter, it is worth understanding the practical benefits of applying data analysis to sports. For that we turn to Europe, where Liverpool FC has demonstrated the advantages of strategically evaluating players’ injury histories, in-game metrics, and off-the-ball metrics. This led to major improvements in the club’s player acquisitions and sales, and the resulting profits were reinvested in roles that proved key to the club’s success.

Batting metrics and the multinomial distribution

The first thing to understand is that batting outcomes can be divided into discrete categories. With the K possible outcomes ordered as in Table A, xij denotes the number of plate appearances of the ith batter that ended in outcome j. As a result, the joint distribution of the counts for the K discrete categories for batter i is given by xi = (xi1, xi2, …, xiK)^t ~ multinomial (ni, pi), where ni denotes the number of plate appearances and pi = (pi1, pi2, …, piK)^t the probability of each outcome, with pi1 + pi2 + … + piK = 1.

Before computing the metrics, we define plate appearances (PA) as the sum of all possible outcomes, at-bats (AB) as the number of the batter’s turns against a pitcher that end in one of SO, GO, AO, S, D, T, or HR (AB = SO+GO+AO+S+D+T+HR), and total bases (TB) as the sum of hits weighted by the number of bases each is worth (TB = 1*S+2*D+3*T+4*HR).

We are interested in assessing the following batting metrics:

- Batting average: BA = (S+D+T+HR)/AB. The number of hits over the number of at-bats,

- On-base percentage: OBP = (S+D+T+HR+BB+HBP)/(PA-SH-SF). The rate at which the batter reaches base,

- Slugging percentage: SLG = TB/AB. The average total bases per at-bat,

- Weighted on-base average: wOBA = (0.69*BB+0.72*HBP+0.89*S+1.27*D+1.62*T+2.10*HR)/PA. A linear combination of the outcomes, each weighted by its expected contribution to producing runs. The weights were computed from data provided by the FanGraphs website for the 2018 season.
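The four metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `batting_metrics`, the dictionary layout, and the example counts are hypothetical. The abbreviations are read with their standard baseball meaning (S single, D double, T triple, HR home run, BB walk, HBP hit by pitch, SH sacrifice hit, SF sacrifice fly, SO strikeout, GO ground out, AO air out), which is an assumption since Table A is not reproduced here.

```python
# wOBA coefficients exactly as quoted in the text above.
WOBA_WEIGHTS = {"BB": 0.69, "HBP": 0.72, "S": 0.89, "D": 1.27, "T": 1.62, "HR": 2.10}
# The 11 outcome categories; the ordering here is arbitrary (hypothetical).
OUTCOMES = ("S", "D", "T", "HR", "BB", "HBP", "SH", "SF", "SO", "GO", "AO")


def batting_metrics(counts):
    """Return PA, AB, BA, OBP, SLG and wOBA from a dict of outcome counts."""
    c = {k: counts.get(k, 0) for k in OUTCOMES}
    pa = sum(c.values())                      # plate appearances: all outcomes
    hits = c["S"] + c["D"] + c["T"] + c["HR"]
    ab = c["SO"] + c["GO"] + c["AO"] + hits   # at-bats
    tb = 1 * c["S"] + 2 * c["D"] + 3 * c["T"] + 4 * c["HR"]  # total bases
    woba_num = sum(w * c[k] for k, w in WOBA_WEIGHTS.items())
    return {
        "PA": pa,
        "AB": ab,
        "BA": hits / ab,
        "OBP": (hits + c["BB"] + c["HBP"]) / (pa - c["SH"] - c["SF"]),
        "SLG": tb / ab,
        "wOBA": woba_num / pa,
    }


# Invented counts for illustration only (not a real player).
example = {"S": 100, "D": 30, "T": 5, "HR": 25, "BB": 60, "HBP": 5,
           "SH": 0, "SF": 5, "SO": 120, "GO": 130, "AO": 120}
metrics = batting_metrics(example)
print({k: round(v, 3) for k, v in metrics.items()})
```

The sketch does not guard against a batter with zero at-bats; with the 25 plate appearance threshold used later in the article this is unlikely to matter, but a production version would need to handle it.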

Table A. Batting outcomes

Maximum weighted likelihood estimation

In this approach, the choice of the weights is critical: well-chosen weights reduce the variance relative to the traditional maximum likelihood estimator (MLE), at the cost of some additional bias.

Let X1, …, Xm be data from m different populations with probability density functions f1(·; θ1), …, fm(·; θm), where Xi = (Xi1, …, Xini)^t is the sample of size ni from population i. Suppose we want to make inference about θ1, an unknown vector of parameters of population 1, using the other m-1 populations.

The weighted likelihood is

where w = (w1, …, wm)^t is the vector of weights. In this way, inference on the first population can draw on the relevant information in the other populations, and the weights can be chosen to reflect how much the populations have in common.

As a consequence, the weighted log-likelihood is

and the maximum weighted likelihood estimator is the value of θ1 that maximizes WL(θ1; x).
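For orientation, a standard way to write these three quantities is sketched below. It assumes independent observations and the usual weighted likelihood form in which every observation from sample i is raised to the power wi; the symbols n_i (size of sample i) and x_{ij} (the jth observation in sample i) are notation assumed here.

```latex
\begin{align*}
\mathrm{WL}(\theta_1; x) &= \prod_{i=1}^{m} \prod_{j=1}^{n_i} f_1(x_{ij}; \theta_1)^{w_i}, \\
\log \mathrm{WL}(\theta_1; x) &= \sum_{i=1}^{m} w_i \sum_{j=1}^{n_i} \log f_1(x_{ij}; \theta_1), \\
\hat{\theta}_1 &= \arg\max_{\theta_1} \mathrm{WL}(\theta_1; x).
\end{align*}
```

Setting w = (1, 0, …, 0)^t recovers the ordinary likelihood of population 1, so the MLE is a special case.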

Minimum averaged mean squared error weights

Let Fi be the cumulative distribution function of the ith population, and let F^i denote the empirical distribution function based on the sample from population i. The weighted empirical distribution is F^w(x) = w1 F^1(x) + … + wm F^m(x),

with wi >= 0 and w1 + … + wm = 1.

To select the weights, we minimize the discrepancy between F^w and F1 using the minimum averaged mean squared error (MAMSE) weights, which are computed by minimizing the objective function
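A sketch of this objective as it is usually stated in the MAMSE literature is given below. The sample sizes n_i and the plug-in variance term are assumed notation, and the exact form should be checked against the cited paper.

```latex
\begin{align*}
P(w) &= \int \Big[ \big\{ \hat{F}_1(x) - \hat{F}_w(x) \big\}^2
        + \sum_{i=1}^{m} w_i^2 \, \widehat{\mathrm{var}}\big\{ \hat{F}_i(x) \big\} \Big] \, d\hat{F}_1(x), \\
\widehat{\mathrm{var}}\big\{ \hat{F}_i(x) \big\} &= \frac{\hat{F}_i(x)\,\{1 - \hat{F}_i(x)\}}{n_i},
\qquad w_i \ge 0, \quad \sum_{i=1}^{m} w_i = 1.
\end{align*}
```

The first term penalizes bias (how far the weighted mixture drifts from the empirical distribution of the target population), and the second penalizes variance, so minimizing the sum trades one against the other.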

Shrinkage estimation of multinomial probabilities

The outcomes for the ith batter are xi = (xi1, xi2, …, xiK)^t ~ multinomial (ni, pi), where xij is the number of times outcome j occurred and pi = (pi1, …, piK)^t is the vector of the relative probabilities.

The weighted likelihood for estimating pi is

where wi = (wi1, …, wim)^t are the weights assigned to each batter for the inference.

Assuming a Dirichlet prior pi ~ Dirichlet (α = (α1, …, αK)^t), with a Bayesian approach the posterior distribution becomes P(pi|xi, α, w) ∝

where

are the new re-weighted counts for all j. As a result,

The Bayes estimator of pij is

where

Here, tj is a global target, while tij is the shrinkage target, which uses the information from all the other batters. We can rewrite the estimator formula to make the three weights explicit.
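One decomposition consistent with the definitions above can be written as follows. It assumes that the re-weighted counts are the weighted sums of all batters' counts and that the Bayes estimator is the posterior mean of the resulting Dirichlet distribution; the symbols x^*_{ij}, n^*_i, \alpha_0 and \lambda_{i1}, \lambda_{i2}, \lambda_{i3} are introduced here for illustration.

```latex
\begin{align*}
x^*_{ij} &= \sum_{l=1}^{m} w_{il}\, x_{lj}, \qquad
n^*_i = \sum_{j=1}^{K} x^*_{ij}, \qquad
\alpha_0 = \sum_{j=1}^{K} \alpha_j, \\
\hat{p}_{ij} &= \frac{x^*_{ij} + \alpha_j}{n^*_i + \alpha_0}
 = \lambda_{i1}\, \frac{x_{ij}}{n_i} + \lambda_{i2}\, t_{ij} + \lambda_{i3}\, t_j, \\
t_{ij} &= \frac{\sum_{l \ne i} w_{il}\, x_{lj}}{\sum_{l \ne i} w_{il}\, n_l}, \qquad
t_j = \frac{\alpha_j}{\alpha_0}, \\
\lambda_{i1} &= \frac{w_{ii}\, n_i}{n^*_i + \alpha_0}, \qquad
\lambda_{i2} = \frac{\sum_{l \ne i} w_{il}\, n_l}{n^*_i + \alpha_0}, \qquad
\lambda_{i3} = \frac{\alpha_0}{n^*_i + \alpha_0}.
\end{align*}
```

The three weights are non-negative and sum to one, so the estimate is a convex combination of the batter's own MLE, the player-specific shrinkage target, and the global target.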

In the implementation of our weighted likelihood method, we have used the MAMSE weights.
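A numerical sketch of the shrinkage estimator may help. It assumes the re-weighted count for outcome j is the weighted sum of all batters' counts, and that the Bayes estimate is the posterior mean (re-weighted count plus αj, divided by the total re-weighted count plus the sum of the α's). The function name, the toy counts, the weights, and the α values are all invented for illustration; in particular, the weights are not actual MAMSE weights.

```python
def shrinkage_estimate(counts, weights, alpha, i):
    """Posterior-mean estimate of batter i's outcome probabilities.

    counts[l][j] : times outcome j occurred for batter l
    weights[l]   : weight given to batter l when inferring batter i
    alpha[j]     : Dirichlet concentration parameter for outcome j
    Returns (direct, combined, lam): the estimate computed directly, the
    same estimate rebuilt from its three components, and the three weights.
    """
    m, K = len(counts), len(alpha)
    n = [sum(row) for row in counts]          # plate appearances per batter
    alpha0 = sum(alpha)

    # Re-weighted counts: weighted sum over all batters, outcome by outcome.
    x_star = [sum(weights[l] * counts[l][j] for l in range(m)) for j in range(K)]
    denom = sum(x_star) + alpha0
    direct = [(x_star[j] + alpha[j]) / denom for j in range(K)]

    # Three-part decomposition: own MLE, shrinkage target, global target.
    own_mass = weights[i] * n[i]
    other_mass = sum(weights[l] * n[l] for l in range(m) if l != i)
    lam = (own_mass / denom, other_mass / denom, alpha0 / denom)
    mle = [counts[i][j] / n[i] for j in range(K)]
    if other_mass > 0:
        t_i = [sum(weights[l] * counts[l][j] for l in range(m) if l != i) / other_mass
               for j in range(K)]
    else:
        t_i = [0.0] * K                        # no information borrowed
    t = [a / alpha0 for a in alpha]
    combined = [lam[0] * mle[j] + lam[1] * t_i[j] + lam[2] * t[j] for j in range(K)]
    return direct, combined, lam


# Toy data: 3 batters, 3 outcome categories (invented numbers).
toy_counts = [[30, 10, 10], [20, 20, 10], [5, 5, 40]]
toy_weights = [0.6, 0.3, 0.1]
toy_alpha = [2.0, 1.0, 1.0]
direct, combined, lam = shrinkage_estimate(toy_counts, toy_weights, toy_alpha, i=0)
print([round(p, 4) for p in direct], [round(w, 4) for w in lam])
```

The direct and decomposed versions agree up to floating-point error, which is a quick way to check that the three weights have been derived correctly.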

Data analysis

The model we have studied was built with m=556 players with at least 25 plate appearances in the 2018 season. Table B shows the outcomes of the best 30 batters according to the public ESPN rankings, Table C provides the corresponding metrics, and Table D displays the overall league proportions.

Table B. Counts for each category of batting outcome for the top 30 batters in the MLB for the 2018 season
Table C. Empirical batting metrics for the top 30 batters in the MLB for the 2018 season
Table D. Overall league proportion for the 11 outcomes

Moreover, we cluster batters in groups of 10 with small dissimilarities (that is, high similarity) in order to calculate the MAMSE weights. The dissimilarity between batters i and l is the Euclidean distance between their outcome probabilities pi and pl: d(pi, pl) = sqrt((pi1 − pl1)^2 + … + (piK − plK)^2).

At this point, the batters are clustered using D, the matrix of these dissimilarity measures, starting from the pairs with the lowest values. For each player i we select a fixed number of batters to cluster with, in order to improve inference on pi. We use 9, since another paper has demonstrated that this number needs to be lower than K − 1. We then calculate the new re-weighted counts for each batter based on the weighted likelihood approach using MAMSE weights.
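The distance-based selection step can be sketched as follows. This is a simplified illustration under two assumptions: the unknown probabilities are replaced by each batter's empirical outcome proportions, and the cluster for batter i is simply the `size` batters with the smallest distance to i. The function names and the toy counts are hypothetical.

```python
import math


def proportions(counts):
    """Empirical outcome proportions, used as a stand-in for the probabilities."""
    n = sum(counts)
    return [x / n for x in counts]


def euclidean(p, q):
    """Euclidean distance between two probability vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def dissimilarity_matrix(count_rows):
    """Matrix D whose (i, l) entry is the distance between batters i and l."""
    probs = [proportions(row) for row in count_rows]
    m = len(probs)
    return [[euclidean(probs[i], probs[l]) for l in range(m)] for i in range(m)]


def nearest_batters(D, i, size=9):
    """Indices of the `size` batters most similar to batter i (excluding i)."""
    others = [l for l in range(len(D)) if l != i]
    others.sort(key=lambda l: D[i][l])
    return others[:size]


# Toy data: 4 batters, 3 outcome categories (invented numbers).
toy_rows = [[50, 30, 20], [48, 32, 20], [10, 10, 80], [55, 25, 20]]
D = dissimilarity_matrix(toy_rows)
print(nearest_batters(D, 0, size=2))
```

With the default `size=9`, each batter is grouped with nine others, which gives the clusters of 10 described above.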

Table E. Estimates of the concentration parameters

Figure A shows two histograms that represent the relative importance of the weights. As expected, for the first 15 batters, who have a higher number of plate appearances, the weight on the MLE is consistently much larger than the weight on the player-specific shrinkage target. For the last 15, instead, the weights on the outcome-specific target and the shrinkage target are far more significant.

Figure A. Weight comparison for the top 15 (left) and last 15 (right) batters

A similar conclusion can be drawn observing the scatterplot (figure B).

Figure B. Weight representation for the EDC for all the batters

Discussion

To sum up, the weighted likelihood approach that we have seen improves the estimation of batter-specific metrics by using data from other batters. This advantage can be observed in Table F, where the players with the highest numbers of plate appearances have very precise estimates. When the sample size is insufficient, however, the estimates match the raw season metrics only weakly and are closer to the career results. Three exceptions must be highlighted: N. Orf, J. Davis, and M. Viloria were in their first season.

Table F. Comparison of raw season metrics, estimated metrics using EDC, and career metrics for the 10 batters with the fewest plate appearances and the top 5 batters

Sources

Wickramasinghe, L., Leblanc, A., & Muthukumarana, S. (2021). Model-based estimation of baseball batting metrics. Journal of Applied Statistics, 48(10), 1775–1797. https://doi.org/10.1080/02664763.2020.1775792

Dougramaji, M. (2023, October 6). Data Analytics in football: How LFC Used Data to Gain the Edge. https://rockborne.com/graduates/blog/data-analytics-in-football-lfc/
