Test Cricket Trends: A Data Overview with Python

Sarang Manjrekar
Published in Analytics Vidhya
8 min read · Jul 10, 2021

Cricket was brought to India by the British, and today it’s the most followed game in the subcontinent. Test cricket is nearly 150 years old and, like any other game, it has evolved through the ages to become one of the most followed modern sports.

The game was once the sport of the elite and hence called the gentleman’s game. Today we have fascinating stories of cricketers like MS Dhoni, a wicketkeeper-captain hailing from a Tier II city like Ranchi, and T. Natarajan, who rose from an even humbler background. There are numerous other stories of small-town boys taking centre stage, grabbing the attention of a cricket-crazy nation and turning into superstars.

Even with so much change in the landscape, cricket is still a game played on a 22-yard pitch by two teams of 11 players each, each trying to outsmart the other in the three facets of the game: batting, bowling and fielding.

Test cricket is widely considered the purest and most challenging format of the game, by players and critics alike.

The history of Test cricket takes us back to 1877, when the first Test match was played between England and Australia at the famous Melbourne Cricket Ground.

As the saying goes: “Numbers don’t lie”. The scorecard of a cricket match gives a fair glimpse into how it unfolded. Hence I set out to scrape the data from ESPNcricinfo, a one-stop resource for scorecards of all the matches played to date.

By web scraping this data, I’ve created a match-detail master dataset covering over 2,400 Test matches played across nearly 150 years of Test cricket.

We’ll be exploring trends across various facets of this multidisciplinary game.

Test Match: Most Challenging Format of the Game

Starting with web scraping from the ESPNcricinfo website:

I’ve written Python functions to scrape data at three levels:

  1. Get the Cricinfo match IDs of all Test matches
  2. Get the relevant scorecard data for each Test match
  3. Transform the extracted data into meaningful features
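
The third step is the easiest to illustrate without touching the website. The sketch below is not the code from the repository. It assumes a scraped innings arrives as a dictionary with hypothetical keys start_date, runs and overs, and derives the decade and the run rate from it. The one subtlety is that scorecards write overs as overs.balls, so 95.5 means 95 overs and 5 balls, not 95 and a half overs.

```python
from datetime import date


def overs_to_balls(overs: str) -> int:
    """Convert scorecard notation such as '90.3' (90 overs, 3 balls) to balls.

    Assumes six-ball overs; some early Tests used four- or eight-ball overs.
    """
    whole, _, part = overs.partition(".")
    return int(whole) * 6 + (int(part) if part else 0)


def transform_innings(raw: dict) -> dict:
    """Add derived features to one raw innings record (hypothetical schema)."""
    start = date.fromisoformat(raw["start_date"])
    balls = overs_to_balls(raw["overs"])
    return {
        **raw,
        "decade": start.year // 10 * 10,
        "balls": balls,
        # Runs per over, computed from balls so partial overs count correctly.
        "run_rate": round(raw["runs"] * 6 / balls, 2) if balls else None,
    }


# Made-up innings, for illustration only.
example = transform_innings({"start_date": "2021-02-05", "runs": 337, "overs": "95.5"})
print(example["decade"], example["balls"], example["run_rate"])
```

Keeping the ball count as a separate feature makes later aggregation simpler, since balls can be summed across innings while run rates cannot.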

Themes to Explore in the Dataset:

I.A Spread of the Game to the Indian Subcontinent

When Test cricket was born in the 19th century, the subcontinent was very much under foreign rule. Even though the oldest stadium in India, the Eden Gardens of Kolkata, was established in 1864, the first Test match there was held only in 1934.

Until the early 1900s, England, Australia and South Africa were the only prominent countries where professional cricket was played at the highest level. British colonies in Asia gained independence around the mid 20th century and were finding an identity of their own on the world stage, be it in the political sphere or the sporting one.

Wild & Crazy Fans in Subcontinent

Today the game is in great demand in the subcontinent, and the zeal and enthusiasm of cricket lovers in this part of the world is unparalleled. No wonder a huge chunk of world cricket is now played here.

The pandas DataFrame created for this purpose is match_stats_df. Visualizing the continent of each host country:
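
A minimal sketch of the aggregation behind such a chart follows. It assumes match_stats_df has one row per match with columns named start_year and host_continent; those column names are my assumption for illustration, and the toy rows are made up.

```python
import pandas as pd

# Toy stand-in for match_stats_df; column names and rows are assumptions.
match_stats_df = pd.DataFrame(
    {
        "start_year": [1877, 1882, 1934, 1936, 1999, 2004, 2008],
        "host_continent": ["Oceania", "Europe", "Asia", "Europe", "Asia", "Asia", "Africa"],
    }
)

# Bucket each match into its decade, then count matches per continent per decade.
match_stats_df["decade"] = match_stats_df["start_year"] // 10 * 10
matches_by_continent = pd.crosstab(
    match_stats_df["decade"], match_stats_df["host_continent"]
)

# Share of each decade's matches hosted by each continent.
share_by_continent = matches_by_continent.div(
    matches_by_continent.sum(axis=1), axis=0
)
print(matches_by_continent)
```

With matplotlib installed, calling matches_by_continent.plot(kind="bar", stacked=True) is one way to turn the table into a chart.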

Let’s look at the country-wise data over the decades as well:

Glimpses of world events from the graph:

  1. Dips in total matches hosted can be seen in the 1910s and 1940s, which were the World War years.
  2. The line for South Africa (pink) stalls abruptly around the 1960s and emerges again in the 1990s. The apartheid years kept South Africa out of international cricket, and the team was allowed back only after the anti-apartheid movement succeeded.
  3. The line for Bangladesh (orange) shows small blips in the 1950s and 1960s, then disappears until the 1990s. The early blips are matches held in what was then East Pakistan. Bangladesh as an independent nation made its Test debut only in 2000.
  4. The number of matches hosted in Pakistan falls around the 1990s and further in the 2000s. This can be attributed to security concerns, which reduced the number of teams willing to tour Pakistan, and the terrorist attack on the Sri Lankan team in 2009 was the final nail in the coffin. This in turn brought about the rise of the UAE as a prominent Test venue on the global cricket map.
  5. All said and done, the Meccas of England and Australia still remain the hotspots of Test cricket around the globe.

I.B Hot Spots of Test Cricket

Let’s take a look at the Meccas of Test cricket: venues that have hosted a minimum of 50 Test matches.

The graph doesn’t feature any Indian venue. The primary reason could be the rapid development of Tier II cities as Test venues, resulting in over 15 Test venues in the country today. By comparison, England, Australia and South Africa rely on only around 4–5 main Test venues each, so each venue hosts more matches.
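
The venue filter described above can be sketched with a simple count. The column name venue is an assumption, and the toy counts are made up, so the example uses a threshold of 3 where the analysis uses 50.

```python
import pandas as pd

# Toy rows; "venue" is an assumed column name and the counts are made up.
match_stats_df = pd.DataFrame(
    {"venue": ["Lord's"] * 4 + ["MCG"] * 3 + ["Eden Gardens"] * 2 + ["Galle"]}
)


def busiest_venues(df: pd.DataFrame, min_matches: int) -> pd.Series:
    """Return venues that hosted at least `min_matches`, busiest first."""
    counts = df["venue"].value_counts()
    return counts[counts >= min_matches]


# The article's threshold is 50; the toy data uses 3 to keep the example small.
hot_spots = busiest_venues(match_stats_df, min_matches=3)
print(hot_spots)
```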

II. Batsmanship: Trends in Test Matches

A. Steady Rise in Scoring Rates in Test Matches

Well into the first half of the 20th century, a run rate of about 2–2.2 runs per over (RPO) is the predominant feature. Around the 1970s, the rate shifts towards 2.5 RPO. Apart from better bat manufacturing, the main cause this could be attributed to is the introduction of the shorter format of the game, One Day Internationals, in 1971.

The next upgrade to batting run rates arrives with the shortest format of the game, T20, in the first decade of the 21st century.

In the 2000s and 2010s, the run rate can clearly be seen shifting towards a mean of about 3.5 RPO. This has allowed more and more Test matches in the modern era to reach a win/loss outcome, with fewer draws and more excitement for fans.
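
One way to compute a run rate per decade is to divide total runs by total overs bowled in that decade, rather than averaging per-innings run rates, so that short innings do not carry the same weight as long ones. The sketch below shows that choice on made-up numbers; the column names are assumptions, and it is not necessarily how the charts in this article were produced.

```python
import pandas as pd

# Made-up innings totals; column names are assumptions.
innings_df = pd.DataFrame(
    {
        "decade": [1950, 1950, 2000, 2000],
        "runs": [220, 330, 350, 420],
        "balls": [600, 900, 600, 720],
    }
)

totals = innings_df.groupby("decade")[["runs", "balls"]].sum()
# Runs per over for the decade as a whole: total runs over total six-ball overs.
totals["rpo"] = totals["runs"] * 6 / totals["balls"]
print(totals["rpo"].round(2))
```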

B. Runs Scored by an Individual in a Test Innings: Power Law Distribution

Every batsman stepping onto a cricket field aims to have at least one Test century under his belt, if not more. Assuming on average 30 batters get a chance to bat per match, across 2,400+ matches that is roughly 30 × 2,400 = 72,000 opportunities for batters to showcase their skills. Yet centuries are rare, and on average only 1–2 batters per match reach the magical three-figure mark. Let’s take a look at what the rest go through.

Strangely enough, the third quartile across the decades is less than 40. In other words, scoring 40+ runs in a Test innings puts you in the top 25%. Isn’t it odd, then, that fans are greedy and won’t settle for anything less than a century from their batting favourites?
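
To make the quartile reading concrete, here is a small sketch on a made-up list of individual scores, using only the standard library. The numbers are invented to mimic the heavy right skew of real scores and are not taken from the dataset.

```python
from statistics import quantiles

# Made-up individual scores from a handful of innings, for illustration only.
scores = [0, 1, 4, 7, 10, 13, 18, 22, 29, 38, 56, 112]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median) and Q3.
q1, median, q3 = quantiles(scores, n=4)

share_40_plus = sum(s >= 40 for s in scores) / len(scores)
share_centuries = sum(s >= 100 for s in scores) / len(scores)

print(f"Q3 = {q3}, 40+ share = {share_40_plus:.0%}, centuries = {share_centuries:.0%}")
```

In this toy sample Q3 falls below 40, so fewer than a quarter of the scores are 40 or more, which is the same pattern the box plots show.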

C. Gradually Becoming a Batsman’s Game?

With bats getting better at hitting the ball harder, and certain rules alleged to favour batters over bowlers, the game is often said to be tilting towards batting. Although a safe first-innings score depends on a lot of factors, a total of 400+ can be considered a safe bet under most conditions.

Let’s look for evidence of this in the data:

A. There is a steady rise in 400+ totals for all the teams from the 1880s to the 2000s.

B. Australia’s dominance in the 2000s shows up as a massive tower of more than 70 totals of 400+.

C. Almost all teams have fewer 400+ totals in the 2010–2020 decade than in 2000–2010. This may be attributed largely to:

1. Less Test cricket being played, with the advent of T20 cricket as the more popular format.

2. Batters getting more impatient. Adventurous shot making, less application at the crease and the urge to score faster may be producing fewer 400+ totals even for teams like Australia and England.

The exceptions are New Zealand and Bangladesh, the only teams with more 400+ totals in the last decade than in the one before. Both are emerging powers of world cricket.
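
Counting 400+ totals per team and decade is a short groupby. The sketch below uses made-up totals and assumed column names. It also reports the share of innings that reached 400, since a raw count partly reflects how many innings a team played in that decade.

```python
import pandas as pd

# Made-up innings totals; column names are assumptions.
innings_df = pd.DataFrame(
    {
        "team": ["Australia", "Australia", "Australia", "India", "India", "New Zealand"],
        "decade": [2000, 2000, 2010, 2000, 2010, 2010],
        "total": [556, 401, 398, 523, 400, 615],
    }
)

# Flag each innings, then count and average the flag per team and decade.
innings_df["is_400_plus"] = innings_df["total"] >= 400
summary = innings_df.groupby(["decade", "team"])["is_400_plus"].agg(
    count_400_plus="sum",   # number of 400+ totals
    share_400_plus="mean",  # fraction of innings that reached 400
)
print(summary)
```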

D. Contribution of the lower order to a Team’s Total

Irrespective of how many specialist batsmen you have in your batting order, they always say, “tail should wag”. The lower order (batters no. 7 to no. 11) generally consists of all-rounders and bowlers, who happen to be less skilled with the bat.

Often, when the top 6 batters fail to add substantial runs for their team, all eyes turn to the tail.

The tail is rarely reliable, and it delivers runs less often than hoped. Let’s quantify its contribution.

The highest third quartiles (Q3) are seen for South Africa, Bangladesh and Zimbabwe, suggesting that their lower orders contribute more to their team totals than the lower orders of other teams.

  • Skipping Ireland, as the nation played its inaugural Test only in 2018.
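
One way to quantify the tail’s contribution is the share of an innings’ runs scored by batters 7 to 11, summarised by its third quartile per team. The sketch below assumes one row per batter per innings with hypothetical column names, uses made-up numbers, and ignores extras, so the innings total here is simply the sum of the batters’ runs.

```python
import pandas as pd

# One row per batter per innings; schema and numbers are made up for illustration.
batting_df = pd.DataFrame(
    {
        "innings_id": [1] * 4 + [2] * 4,
        "team": ["India"] * 4 + ["South Africa"] * 4,
        "position": [1, 4, 7, 10, 2, 5, 8, 11],
        "runs": [80, 100, 15, 5, 30, 50, 60, 60],
    }
)

# Keep runs only for positions 7 to 11; everyone else contributes 0 to the tail.
batting_df["lower_runs"] = batting_df["runs"].where(batting_df["position"] >= 7, 0)

per_innings = batting_df.groupby(["team", "innings_id"])[["runs", "lower_runs"]].sum()
per_innings["lower_share"] = per_innings["lower_runs"] / per_innings["runs"]

# Third quartile of the lower-order share, per team.
q3_by_team = per_innings.groupby(level="team")["lower_share"].quantile(0.75)
print(q3_by_team)
```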

I will be posting more analysis of this dataset in upcoming articles, focused on other facets of the game that we all love. Stay tuned.

Thanks for reading! Share it with cricket buffs.

GitHub repo: https://github.com/Sarrae1406/cricket_analytics
