Evaluating Players & Performances Judiciously — T20

Evidence weighted normalized evaluation of players & performances based on situations & conditions

Published in Boundary Line, Jun 8, 2020 (12 min read)

We often see player comparisons & evaluations where a Rohit Sharma who opens in T20Is & comes in to bat in the top 4 in the IPL, gets compared to an Andre Russell who comes in to bat in the mid to late middle order in almost every T20 competition in existence.

Most of us recognize that these are somewhat frivolous comparisons and so, we try to use contextual information and slice and dice the data to come up with inferences. But these inferences are often based on small sample sizes. They are made just as confidently from a 100 ball sample as from a 500 ball sample. More critically, they are used to compare players who have faced 100 balls with players who have faced 500.

With larger sample sizes (# balls faced), we can be much more confident about our estimates & inferences
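To put rough numbers on this, here is a minimal sketch of how the uncertainty around an estimated RPO shrinks with balls faced. The per-ball run distribution is made up for illustration (it is not fitted to any data), and treating deliveries as independent with a normal-approximation interval is a simplifying assumption.

```python
import math

# Hypothetical per-ball run distribution for a batsman (illustrative only).
RUN_PROBS = {0: 0.35, 1: 0.40, 2: 0.07, 4: 0.12, 6: 0.06}

def rpo_standard_error(balls_faced, run_probs=RUN_PROBS):
    """Standard error of an RPO estimate built from `balls_faced` deliveries."""
    mean = sum(r * p for r, p in run_probs.items())
    var = sum(p * (r - mean) ** 2 for r, p in run_probs.items())
    # Standard error of mean runs per ball, scaled to a 6-ball over.
    return 6 * math.sqrt(var / balls_faced)

for n in (100, 500):
    half_width = 1.96 * rpo_standard_error(n)
    print(f"{n} balls: RPO estimate is good to roughly +/- {half_width:.2f}")
```

Under these assumptions, going from 100 to 500 balls narrows the interval by a factor of the square root of 5, which is why the two samples should not be read with the same confidence.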

Not only do these inferences have non-trivial reliability issues as illustrated above, but they also hurt credibility. Players, coaches & administrators can’t be expected to have a background or training in interpreting stats. It is a core responsibility of the analysts on the team to interpret and communicate these uncertainties and nuances and/or weave them into their models. Not doing this results in more inferences being invalidated on the field, which in turn devalues the use of data science in cricket & pushes the widespread adoption of data driven decision making in the sport further out.

Additionally, even with good sample sizes, an innings by a number 3 in the Blast coming in at 2 for 1 in the 1st over at Southampton can’t be compared to an innings where he comes in at 70 for 1 in the 8th over at Northampton.

Here, I am going to try and address most of these issues and concerns and try to come up with a robust way of comparing players and performances.

Core Ideas & Considerations

I will focus on two key metrics for batsmen in T20 cricket: Runs Per Over (RPO) & Length of Innings. RPO is analogous to what is traditionally called Strike Rate (SR). SR is the number of runs scored per 100 deliveries. RPO is the number of runs scored per 6 deliveries, so RPO = SR × 6 / 100. I used RPO since that is more intuitive for T20 cricket, where the total innings length is only a little over 100 balls. My models are built from individual deliveries, so changing this up to go all the way down to ball level is trivial as well, i.e. this isn’t worth over-rotating on. To account for innings length, I have developed a metric which I call survival factor. The details of this metric are available below.
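Since the two rates differ only by a constant factor, converting between them is a one-liner. The function names here are my own, purely for illustration.

```python
def rpo_from_strike_rate(strike_rate):
    """Convert runs per 100 balls (SR) to runs per 6 balls (RPO)."""
    return strike_rate * 6 / 100

def strike_rate_from_rpo(rpo):
    """Convert runs per 6 balls (RPO) back to runs per 100 balls (SR)."""
    return rpo * 100 / 6

print(rpo_from_strike_rate(150.0))  # 9.0
```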

For bowlers, I have stuck to the RPO metric here. I will be looking into wickets and their impact in a subsequent effort.

The core idea behind my model is to a) evaluate relative to the average in the same context and b) account for sample size. These two goals pull against each other: the more contextual information you use, the fewer samples you have. Only about 14 venues out of the 129 that have hosted T20s since Apr-2017 have hosted more than 20 games. The majority of these have been domestic competitions like the IPL, the Blast, & the BBL. Wankhede, for example, has hosted only two T20Is. Of the 24 IPL matches that it has hosted, there were only 8 where the first wicket fell after the power play while batting first.

Here is how I defined context:

With this definition, the number of combinations we have of venue, competition & match situation is huge, and the sample size of each combination is extremely small. Additionally, there is information that can be leveraged across these dimensions. While runs scored at a given venue may differ by competition, if we adjust for competition, there is still a lot of information about the venue that we can glean by looking at data across competitions. Similarly, if we adjust for venue, we can gain a lot of information about competitions by looking across venues. A lot of match situations are also pretty similar and have information that can be leveraged by other situations. A situation with a first innings score of 5/1 in the second over is not that different from a 12/1 in the third over.

The Model

The model that I built accounts for these considerations and avoids overfitting to slices with small sample sizes by making sure that it generalizes well, i.e. that it fits unseen data that was not used to build the model as well as possible. I help the model out in this endeavor by 1) leveraging the methodology I wrote about here and using groups of similar venues instead of using each of the 129 venues separately, 2) using a more compressed representation of match situation information (this is not very dissimilar from how a smaller image sent via a messaging app still looks the same on your phone screen as its larger, original version, and in some cases, can even be used to reproduce a larger image pretty close to the original), and 3) using the batsman’s position, where I keep the individual position only for the middle order (between 3 and 7) and categorize everyone else as “opener” or “tail”.
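The position grouping can be sketched in a few lines. I am assuming here that positions 1 and 2 map to “opener” and positions 8 to 11 map to “tail”; the function name is hypothetical.

```python
def position_bucket(batting_position):
    """Map a batting position (1-11) to the coarser category used as a feature."""
    if not 1 <= batting_position <= 11:
        raise ValueError("batting position must be between 1 and 11")
    if batting_position <= 2:
        return "opener"  # assumption: positions 1 and 2
    if batting_position <= 7:
        return str(batting_position)  # middle order keeps its own position
    return "tail"  # assumption: positions 8 to 11

print([position_bucket(p) for p in range(1, 12)])
```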

There’s a wee bit more to it

I’ve been calling out “the” model so far, but what I’ve built is actually a series of models. First, a model accounting for the contextual considerations above predicts the probability of each of the four outcomes of a delivery: i) dot ball ii) boundary (4s & 6s) iii) run (1, 2, 3) iv) wicket. Another model uses the structure learned by the first model, as well as its output, and computes the expected runs that the delivery would yield if a wicket did not fall on that delivery. A third model uses the wicket probabilities generated by the first model to compute, per innings, the cumulative probability of a batsman surviving all the deliveries that they do survive.
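To make the second step concrete, here is a deliberately simplified sketch. The actual second model is learned from data; this version just combines the four outcome probabilities with fixed, hypothetical average runs per scoring category, and all numbers are illustrative.

```python
# Hypothetical average runs within each non-wicket category (illustrative values).
MEAN_RUNS = {"dot": 0.0, "run": 1.2, "boundary": 4.6}

def expected_runs_no_wicket(probs, mean_runs=MEAN_RUNS):
    """Expected runs off one delivery, given that no wicket falls on it.

    `probs` maps the four outcomes (dot, run, boundary, wicket) to
    probabilities that sum to 1.
    """
    if abs(sum(probs.values()) - 1.0) > 1e-9:
        raise ValueError("outcome probabilities must sum to 1")
    p_no_wicket = 1.0 - probs["wicket"]
    total = sum(probs[k] * mean_runs[k] for k in mean_runs)
    # Renormalize over the three non-wicket outcomes.
    return total / p_no_wicket

powerplay_ball = {"dot": 0.40, "run": 0.38, "boundary": 0.17, "wicket": 0.05}
print(round(expected_runs_no_wicket(powerplay_ball), 3))
```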

The little bit of additional complexity isn’t in vain

Here is how this model performs compared to a naive model that uses the average runs scored per ball at each batting position to estimate expected runs per delivery. Each data point here is an average across match situations for a batting position, competition and venue. We can clearly see the benefit that additional contextual information brings.

The survival probability model also works well. Here, it is shown in aggregate for different batting positions.

Still not quite there

Now that we have context adjusted expected runs per delivery, we can calculate the average difference in RPO per batsman/bowler from the average batsman/bowler under the same context. However, the sample size issue rears its head again here. If we just use the raw output of the model at this stage, the top 5 batsmen based on RPO are as shown below.
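The raw relative RPO is just the average gap between actual and expected runs per ball, scaled to an over. A minimal sketch, assuming each delivery arrives as a hypothetical (actual_runs, expected_runs) pair:

```python
def relative_rpo(deliveries):
    """Average (actual - expected) runs per ball, scaled to a 6-ball over.

    `deliveries` is a list of (actual_runs, expected_runs) pairs, where the
    expected value comes from the context model for that delivery.
    """
    if not deliveries:
        raise ValueError("need at least one delivery")
    diff = sum(actual - expected for actual, expected in deliveries)
    return 6 * diff / len(deliveries)

# Six made-up deliveries: 12 runs scored against 7.5 expected.
sample = [(4, 1.4), (0, 1.1), (1, 1.2), (6, 1.5), (1, 1.3), (0, 1.0)]
print(round(relative_rpo(sample), 2))
```

A six-ball sample like this one produces an extreme value, which is exactly the kind of number that dominates a raw leaderboard.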

This is clearly not useful. We need to adjust our confidence in the relative RPO numbers based on the number of balls faced by a batsman. The more data we have for a particular player, the more confident we can be in the estimates. If a batsman has played fewer balls, we conservatively say that they are likely to be closer to the average and need more noteworthy performances in either direction before we can be confident that they are really better or worse than the average.
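One common way to express this kind of conservatism is shrinkage toward the average, sketched below. This is an illustration of the idea rather than the exact weighting used in my model; `prior_strength` is a hypothetical tuning constant.

```python
def evidence_weighted(raw_value, n_balls, par=0.0, prior_strength=300):
    """Shrink a raw relative metric toward `par` according to sample size.

    `prior_strength` is a hypothetical tuning constant: the number of balls
    at which the raw value and par get equal weight.
    """
    weight = n_balls / (n_balls + prior_strength)
    return par + weight * (raw_value - par)

# A huge raw number from 12 balls versus a solid one from 1500 balls.
print(round(evidence_weighted(6.0, 12), 2))
print(round(evidence_weighted(2.0, 1500), 2))
```

With few balls the estimate stays close to par no matter how spectacular the raw number is; with many balls it converges to the raw number.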

And, voila!

After applying this model, batsmen with more evidence, for whose numbers we can associate higher confidence, bubble up in the top 5.

We use a similar approach to aggregate the probability of survival across multiple innings as well. There, the volume is in terms of number of innings instead of number of balls.

The Metrics

We have developed two (Edit: three. I have since added Evidence Weighted Relative Wickets Per Over.) metrics here:

Evidence Weighted Relative RPO (EWR-RPO): This is a measure of how many runs more or less a batsman scores in one over relative to the average batsman coming in at the same position, in the same competition, on a similar ground, under similar circumstances. For bowlers, this translates to runs conceded in one over relative to the average bowler bowling to batsmen coming in at the same position, in the same competition, on a similar ground, under similar circumstances. For batsmen, a higher value of EWR-RPO is better, while for bowlers a lower value is better. For bowlers, it might also make sense to multiply this by 4 (the maximum number of overs a bowler can bowl in a T20 innings) to get a more intuitive measure, but I am not doing that here. 0 is the par value for both and represents the average bowler/batsman.
Evidence Weighted Cumulative survival probability of average batsman or Evidence Weighted Survival Factor (EWSF): This is a measure of how likely it is for an average batsman coming in at the same position, in the same competition, on a similar ground and facing deliveries under similar circumstances to have gotten out by this time (the ball before he loses his wicket, if he gets out in that innings, or the last ball that he plays, for innings where he remains unbeaten). EWSF ranges from 0 to 1, although 1 is almost unattainable. A higher EWSF indicates that a batsman plays longer than average innings. Using this for bowlers is tricky because although a higher EWSF indicates a well set batsman, it also indicates a batsman who has played longer than others and hence, may be more likely to get out. I also didn’t see the value of using this for bowlers at this time, given that I plan to introduce match impact related metrics in a follow up.
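For a single innings, the survival factor described above follows directly from the per-ball wicket probabilities. A minimal sketch, before any evidence weighting across innings, with made-up probabilities:

```python
import math

def survival_factor(wicket_probs):
    """Chance that an average batsman would be out within these deliveries.

    `wicket_probs` holds the model's per-ball wicket probability for each
    delivery the batsman survived in one innings.
    """
    p_survive_all = math.prod(1.0 - p for p in wicket_probs)
    return 1.0 - p_survive_all

short_stay = [0.05] * 6   # survived one over
long_stay = [0.05] * 40   # survived forty balls
print(round(survival_factor(short_stay), 3), round(survival_factor(long_stay), 3))
```

The value climbs toward 1 as the stay gets longer but never quite reaches it, which matches the range described for EWSF.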

Let’s take a look at what these metrics look like in tandem for batting:

We are clearly able to see the value of some of the usual suspects here in the top right periphery of the chart, where players that play longer innings and score at a fast rate reside. However, there are some interesting insights that emerge as well. We see that some of the bigger names among openers that show up here play more of an anchoring role than a short blaze. This is somewhat in line with my findings on how teams that slowly ramp up without losing wickets in the power play, and that use this platform in the middle overs, do better. L Ronchi wasn’t one of these openers. Munsey isn’t either. Both Ronchi & Munsey are consistently ahead of their peer openers in their respective leagues in both scoring rate and longevity of innings. Another interesting player for me was Rahkeem Cornwall, who doesn’t quite play the long innings that some of the top openers play, but he consistently scores fast and has an above average time spent at the crease. His bowling EWR-RPO is at -0.135, which is only at the 30th percentile level. The top bowlers based on EWR-RPO are as follows:

Bottomless Usage Potential

The basic approach that I have described here can be extended in a variety of ways. Since our original model is built for expectations per delivery, we can build a whole set of these follow up models adjusting for confidence based on sample size. We can have additional models per batsman against bowler type (spin/pace), bowler against batsman type (LHB/RHB), player per competition, player per position etc. (even matchups, although I am a little more cautious there because of small sample sizes). The list is almost endless and depends on the analyses that we want this to power.

Naive averages are sufficient for overall splits due to larger sample sizes

For some cases, where there are enough samples that can be aggregated for fairly confident estimates e.g. bowler type, simple aggregations can work.

But most cases where we are comparing players need the additional step to account for the sample sizes. The below table illustrates this using the case of evaluating batsmen’s RPO against spin.

Evidence weighting is needed at the player level & leveraging higher level signals is helpful

The evidence weighted estimates of Relative-RPO are more robust than the naive average. Moreover, if we use only the performance data of each batsman against spin, we have smaller sample sizes and so our evidence weighted algorithm is less confident about attributing a larger Relative-RPO number. Using additional signal from a batsman’s overall performance (for the same sample size, we can be more confident in a batsman’s numbers against spin if they have also done well against pace) lets us be confident about the naive average numbers for larger sample sizes while being pessimistic about the estimates with smaller sample sizes. Finally, adding the signal from overall performances against spin lets us rein in over-aggressive estimates a bit more by requiring more evidence for a batsman’s score to be farther away from the overall Relative-RPO for spin.
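One way to picture how these higher level signals come in is as a prior that the player-level estimate gets pulled toward. The sketch below is an illustration of that idea, not the exact algorithm; `prior_strength` and `blend` are hypothetical constants and the numbers are made up.

```python
def shrink(raw, n, prior_mean, prior_strength):
    """Blend a raw estimate with a prior, weighted by sample size."""
    return (n * raw + prior_strength * prior_mean) / (n + prior_strength)

def vs_spin_estimate(raw_vs_spin, balls_vs_spin, batsman_overall,
                     spin_overall, prior_strength=200, blend=0.5):
    """Relative RPO against spin, shrunk toward higher level signals.

    The prior mixes the batsman's overall relative RPO with the overall
    relative RPO against spin. `prior_strength` and `blend` are hypothetical.
    """
    prior = blend * batsman_overall + (1 - blend) * spin_overall
    return shrink(raw_vs_spin, balls_vs_spin, prior, prior_strength)

# Same raw number against spin, very different amounts of evidence.
print(round(vs_spin_estimate(3.0, 30, batsman_overall=1.0, spin_overall=-0.2), 2))
print(round(vs_spin_estimate(3.0, 900, batsman_overall=1.0, spin_overall=-0.2), 2))
```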

No free lunch

Before parting, let’s briefly acknowledge a couple of loopholes that we have left or created here and how I plan to address them.

I’ve introduced metrics to address how fast batsmen score runs, the longevity of their stay and bowler economy rates here. However, we haven’t arrived at a similar evidence weighted relative metric for measuring the wicket taking ability of a bowler (Edit: I have, since writing this, added EWR-WPO, which is the evidence weighted relative wickets per over metric for bowlers. The relative aspect of this comes from a comparison against the probability of a wicket falling in a given scenario. This probability is based only on the scenario/context explained above, and doesn’t take into account ball outcome history). While relative RPO (emphasis on relative) for batsmen and bowlers generally shouldn’t be a bad proxy for impact on the match result, wickets taken or lost in different contexts have different impacts on the match result. I’ve shown before how boundary rate in different phases of the game impacts match outcomes. It is likely that relative RPO does not provide the complete picture of a player’s contribution towards the result of a game. To address this, I will be working on a complementary match impact model to evaluate players based on how ball outcomes influence the results of matches.

Secondly, by taking into account available evidence for the metrics here, we’ve made it hard to evaluate players for whom we have less evidence because of smaller sample sizes. This could apply to new players, but also to players who may have played fewer deliveries in a particular context e.g. against left arm bowlers, across competitions, in a specific position etc. Moreover, since our metrics evaluate relative to the expectations in a given context, players’ performances get evaluated for that specific context. This doesn’t say much directly about how a player would fare in other contexts. A player like Munsey, who has done exceedingly well in associate T20Is, shows up at the top, but there is still a question mark on how he would fare in the IPL.

For these cases, my approach would be to find similar players who have a non-trivial sample size both in the context where the player of interest has one and in the desired context, and use their metrics to do a projection for the player of interest. For example, if player X has played in the CPL & PSL, but not in the IPL, and we want to evaluate how this player might do in the IPL, I’d take players from the CPL & PSL that have similar numbers to player X within some threshold of similarity. Of these players Y1, Y2…Yn, I’d take those who’ve also played in the IPL and take a similarity weighted average of how they have done in the IPL to estimate how player X might do. A similar approach could also be applied in time to evaluate how new players would fare in upcoming years.
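A minimal sketch of such a projection is below. The distance measure, the similarity weighting, the threshold and all the numbers are hypothetical choices made purely for illustration.

```python
import math

def project_metric(target, candidates, max_distance=0.5):
    """Similarity weighted projection of a metric into a new competition.

    `target` maps source competitions to player X's metric there.
    `candidates` is a list of (metrics_in_source_competitions, metric_in_target)
    pairs for players who have played in both. The threshold is hypothetical.
    """
    weighted_sum = total_weight = 0.0
    for source_metrics, target_metric in candidates:
        shared = target.keys() & source_metrics.keys()
        if not shared:
            continue
        # Root mean squared gap across the competitions both have played in.
        distance = math.sqrt(
            sum((target[c] - source_metrics[c]) ** 2 for c in shared) / len(shared)
        )
        if distance > max_distance:
            continue  # not similar enough to player X
        weight = 1.0 / (1.0 + distance)
        weighted_sum += weight * target_metric
        total_weight += weight
    return weighted_sum / total_weight if total_weight else None

player_x = {"CPL": 1.0, "PSL": 0.8}
peers = [
    ({"CPL": 1.1, "PSL": 0.7}, 0.6),    # similar, IPL metric 0.6
    ({"CPL": 0.9, "PSL": 0.9}, 0.4),    # similar, IPL metric 0.4
    ({"CPL": -0.5, "PSL": -0.4}, -0.3), # too different, ignored
]
print(round(project_metric(player_x, peers), 2))
```

Returning None when no peer is similar enough keeps the projection from silently falling back on an unrelated set of players.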

p.s. I specifically use “he” here because the data that I used was from men’s T20s. The approach applies to women’s cricket as well and I will be prioritizing looking at women’s data in the future.

If you enjoyed this piece, check out more of my work at Boundary Line and follow along here & on twitter @amol_desai

I can be reached on twitter or via email or Linkedin