Cricket Science — Average vs Strike Rate (Part III: Applying BEREX)

May 14, 2022

Here, I apply the BEREX (BErnoulli Run EXpectation) model to real-world cricket data. Read Part I to learn about the idea, what motivated it, and its mathematical derivation. In Part II, I implemented the model in code and explored its theoretical predictions. Part I is heavy on math, while Part II uses visualizations to understand BEREX.

If there is one TL;DR takeaway you need from the two articles: BEREX calculates how many runs a team composed entirely of one player would score (or concede) against an average opposition of their era.

ODIs, 2010 to present

We begin by considering all ODIs in the period 2010 to present between any two ICC Full Members. This includes 12 teams. I now have a ball-by-ball database that I can query for data, and I intend to share it in the near future. If you’re looking to replicate this analysis, you could also obtain the summary data from a source like ESPNCricinfo’s Statsguru.

I have filtered the dataset to batsmen with at least 60 innings and bowlers with at least 60 wickets. This avoids ‘noisy’ data, such as averages inflated by a very small number of innings. Why do I not use innings for bowlers as well? Because the number of wickets is the denominator when calculating bowling average (akin to innings for batsmen), so it is the appropriate measure of whether the sample is large enough.
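The qualification filter described above can be sketched in a few lines of Python. This is only an illustration: the `PlayerSummary` record, the function names, and the placeholder players are all hypothetical, and the real analysis would read these counts from the ball-by-ball database or from Statsguru.

```python
from dataclasses import dataclass


@dataclass
class PlayerSummary:
    """Hypothetical per-player summary row (not the author's schema)."""
    name: str
    innings: int  # innings batted in the period
    wickets: int  # wickets taken in the period


def qualifying_batsmen(players, min_innings=60):
    """Keep players with enough innings for a stable batting average."""
    return [p for p in players if p.innings >= min_innings]


def qualifying_bowlers(players, min_wickets=60):
    """Keep players with enough wickets for a stable bowling average."""
    return [p for p in players if p.wickets >= min_wickets]


# Placeholder rows, invented purely to exercise the filter.
players = [
    PlayerSummary("Player A", innings=120, wickets=3),
    PlayerSummary("Player B", innings=45, wickets=95),
    PlayerSummary("Player C", innings=80, wickets=61),
]

batsmen = qualifying_batsmen(players)
bowlers = qualifying_bowlers(players)
```

The same two functions cover the T20I cutoffs by passing `min_innings=30` and `min_wickets=30`.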

The BEREX function returns a mean and standard deviation. I have included those in the tables below. The batsmen are sorted by runs:

The bowlers are sorted by wickets:

Recall that a higher BEREX mean is better for batsmen but worse for bowlers. Take your time to browse through the data. We’ll talk about possible metrics of quality below. But first…

T20Is, 2017 to 2022

Let’s repeat the exercise for T20 Internationals. I’m using a shorter 5-year period (2017-01-01 to 2022-01-01) since this format is evolving more rapidly. It’s still limited to matches between two ICC Full Members, but I’ve reduced the cutoffs to 30 innings for batsmen and 30 wickets for bowlers. This still ends up excluding some notable players like Chris Gayle and Lasith Malinga.

Batsmen:

Bowlers:

Z-scores and Percentiles

As I mentioned in the introductory post, this blog series will also be an opportunity to learn statistics. These ideas are foundational to good data science. That’s what this section is about.

In the earlier parts, we learned that BEREX acts as a simple cricket innings simulator. It can use a Bernoulli process to generate a sample score. Creating thousands of such simulated scores gives us the sample mean and standard deviation of the distribution of scores. We bypassed this by calculating the two numbers analytically, but the results from the two approaches should converge.
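To make the simulation route concrete, here is a minimal Monte Carlo sketch. It is a simplified stand-in and not the BEREX derivation from Part I. It assumes that every ball is an independent Bernoulli trial for a wicket with probability `(strike_rate / 100) / average`, that each ball faced adds the expected `strike_rate / 100` runs, and that an innings ends after 300 balls or 10 wickets. The function names and the example inputs are hypothetical.

```python
import random
import statistics


def simulate_innings(average, strike_rate, max_balls=300, max_wickets=10, rng=random):
    """Simulate one innings for a team of identical batsmen (simplified model)."""
    runs_per_ball = strike_rate / 100
    # Runs per ball divided by runs per dismissal gives dismissals per ball.
    p_wicket = runs_per_ball / average
    balls = 0
    wickets = 0
    while balls < max_balls and wickets < max_wickets:
        balls += 1
        if rng.random() < p_wicket:  # Bernoulli trial: does a wicket fall?
            wickets += 1
    # Assumption: every ball faced contributes the expected runs per ball.
    return runs_per_ball * balls


def simulated_mean_sd(average, strike_rate, n_sims=2000, seed=0):
    """Estimate the mean and standard deviation of simulated innings totals."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    scores = [simulate_innings(average, strike_rate, rng=rng) for _ in range(n_sims)]
    return statistics.fmean(scores), statistics.pstdev(scores)


# Illustrative inputs only: average 50, strike rate 90 in a 50-over innings.
mean, sd = simulated_mean_sd(average=50, strike_rate=90)
```

Raising `n_sims` tightens the estimates, which is the sense in which a simulation and an analytical calculation of the same model should agree.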

An innings total is the sum of many ball-by-ball outcomes, so by the Central Limit Theorem we can treat it as approximately normally distributed, with BerexMean and BerexSD as the mean and standard deviation of that distribution. This in turn implies that we can use Z-scores to calculate scores at a desired percentile.

For example, the 20th percentile of a normal distribution corresponds to a z-score of about -0.84, so the 20th-percentile score is BerexMean − 0.84 × BerexSD. In other words, the BEREX model expects the ‘team of player clones’ to score less than that only 20% of the time.
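The percentile calculation can be reproduced with Python's standard library. The sketch below uses `statistics.NormalDist`; the helper name `berex_percentile_score` and the example mean and standard deviation are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist


def berex_percentile_score(berex_mean, berex_sd, percentile):
    """Score at the given percentile of a normal distribution with the
    given mean and standard deviation (percentile is between 0 and 100)."""
    return NormalDist(mu=berex_mean, sigma=berex_sd).inv_cdf(percentile / 100)


# Z-score of the 20th percentile of the standard normal distribution.
z_20 = NormalDist().inv_cdf(0.20)

# Illustrative inputs only: a BEREX mean of 250 and standard deviation of 40.
score_20 = berex_percentile_score(250, 40, 20)
```

Here `z_20` is about -0.84, and `score_20` equals `250 + z_20 * 40`, which is the mean-plus-z-times-SD form used in the text.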

BEREX Rankings

BEREX has a very specific interpretation but does not produce an obvious one-dimensional ranking system of players. We can generate a ranking system from it, but it will necessarily introduce some subjectivity or arbitrariness in the assessment. This isn’t a flaw in the model; it is a consequence of mathematics. It is for this reason that I’m generally reluctant to create rankings that claim to be a definitive measure of quality — it’s simply too reductive.

For example, we can rank players by their BEREX means (descending for batsmen and ascending for bowlers). However, this arbitrarily chooses the 50th percentile of BEREX scores as our metric. Again, I’m not saying such a choice is bad, only that we should pick the percentile that meets our needs. I have found that low percentiles often align with widespread opinion of player quality. Let’s see some results.

Ranked by BEREX mean (50th %ile)

ODI Batsmen by BEREX 50%ile (2010–2022; Qual min 60 inns)

ODI Bowlers by BEREX 50%ile (2010–2022; Qual min 60 wkts)

T20I Batsmen by BEREX 50%ile (2017–2022; Qual min 30 inns)

T20I Bowlers by BEREX 50%ile (2017–2022; Qual min 30 wkts)

Ranked by BEREX 10%ile

The above rankings may be controversial. For example, where is Virat Kohli? He’d be ranked 9th in the list by BEREX mean. This is unsurprising; ranking by 50%ile ignores the standard deviations. A lower BerexSD value predicts more consistent BEREX scores, and the above does not reward players for that.

Ranking by 10th percentile can address this. The numbers can be read as ‘this player will match or exceed this BEREX score in 9 of 10 innings’. As always, the order is descending for batsmen and ascending for bowlers (so for bowlers, it’s technically the 90th percentile of scores).
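The effect of the percentile choice on the ordering can be shown with a short sketch. The players and their BEREX means and standard deviations below are invented placeholders, not values from the tables, and the function names are hypothetical. Bowlers are sorted ascending on the mirrored percentile, matching the note above that their 10%ile ranking is really the 90th percentile of scores.

```python
from statistics import NormalDist


def percentile_score(mean, sd, percentile):
    """Score at the given percentile (0 to 100) of a normal distribution."""
    return NormalDist(mean, sd).inv_cdf(percentile / 100)


def rank_batsmen(players, percentile=10):
    """Sort (name, berex_mean, berex_sd) tuples, best batsman first."""
    return sorted(
        players,
        key=lambda p: percentile_score(p[1], p[2], percentile),
        reverse=True,  # higher scores are better for batsmen
    )


def rank_bowlers(players, percentile=10):
    """Sort (name, berex_mean, berex_sd) tuples, best bowler first."""
    # Lower is better for bowlers, so their bad case is the upper tail.
    return sorted(
        players,
        key=lambda p: percentile_score(p[1], p[2], 100 - percentile),
    )


# Invented numbers: A has the higher mean, B is the more consistent.
batsmen = [("Batsman A", 300.0, 60.0), ("Batsman B", 290.0, 30.0)]
by_mean = rank_batsmen(batsmen, percentile=50)
by_p10 = rank_batsmen(batsmen, percentile=10)

# Invented numbers: A concedes fewer on average, B is the more consistent.
bowlers = [("Bowler A", 200.0, 50.0), ("Bowler B", 210.0, 20.0)]
bowlers_by_p10 = rank_bowlers(bowlers, percentile=10)
```

With these placeholder numbers, Batsman A leads when ranked by mean, but Batsman B leads at the 10th percentile because the smaller standard deviation outweighs the 10-run gap in means.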

ODI Batsmen by BEREX 10%ile (2010–2022; Qual min 60 inns)

ODI Bowlers by BEREX 10%ile (2010–2022; Qual min 60 wkts)

T20I Batsmen by BEREX 10%ile (2017–2022; Qual min 30 inns)

T20I Bowlers by BEREX 10%ile (2017–2022; Qual min 30 wkts)

Copyright AFP, Getty — Top ranked players in each category by BEREX 10%ile

Summary

It is easy to apply BEREX to real-world cricket data. All we need is a player's Average and their Strike Rate (for batsmen) or Economy Rate (for bowlers). This simplicity is very powerful. It also makes it important to carefully select our datasets (based on time period, innings/wicket minimums, etc.). The choices we make impact the story told by our analysis. Indeed, this principle applies to most practical applications of data science.

This concludes our three-part series on BEREX. In Part I, we introduced BEREX and derived its mathematics. In Part II, we implemented it in code and created contour plots to draw theoretical insights. Here in Part III, we applied BEREX to real-world player data.