Using GPT-4o to extract cricket data

Using the latest LLMs to understand cricket data

A serious cricket enthusiast is often in search of valuable insights arising from the sport. But such data in the public domain is largely restricted to a few companies and agencies, who either support teams with proprietary data or do not provide broader access to the general public. Hence there is a need for open-source mechanisms that capture cricket data which fans can use for their own analyses.

Finding the right datasets

Although Cricinfo and Cricbuzz are the industry standards for quick updates to the sport's millions of fans worldwide, extracting data from these websites is a chore, usually requiring web crawlers that may run into rate limits and blocking. One open-source website I have used a lot is Cricsheet, which is updated regularly and offers data in a variety of formats. In this article, I share how to leverage GPT-4o to quickly extract and summarize DRS data in Test cricket.

If you are interested in my other articles on DRS, please see the links below.

https://medium.com/kishan19/quantifying-the-impact-of-umpires-call-f00f7d1c80e0

https://medium.com/kishan19/decoding-the-drs-performance-of-fielding-teams-a6133920f476

Building parsers to gain insights

The first step is to identify what insights you wish to gather from the raw data. In Cricsheet, data about reviews taken by cricket teams can be found in the JSON format.

JSON format containing review data for a given test
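
For reference, this is roughly how a reviewed delivery looks in text form: each delivery that was reviewed carries a nested 'review' object. The example below is taken from the full match JSON shown in Step 3; an optional 'umpires_call' field may also appear on some reviews.

{'batter': 'BKG Mendis',
 'bowler': 'Taijul Islam',
 'non_striker': 'FDM Karunaratne',
 'review': {'by': 'Bangladesh',
            'umpire': 'RJ Tucker',
            'batter': 'BKG Mendis',
            'decision': 'struck down',
            'type': 'wicket'},
 'runs': {'batter': 0, 'extras': 0, 'total': 0}}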

Step 1: Download the data onto Google Colab runtime

The first step is to download information about all Tests, available as a zip file from Cricsheet (the most up-to-date archive is at https://cricsheet.org/downloads/tests_male_json.zip). The snippet below shows how to extract the files into a local folder. The code in this article was run in Google Colab; I have posted a video on how to set up Google Colab below.

Video for getting started with Google Colab

import os
from pathlib import Path

data_path = "temp/"
cricket_format = "test"

# URL of the Cricsheet zip archive for male Test matches (see the README excerpt below)
test_cricsheet_files = "https://cricsheet.org/downloads/tests_male_json.zip"

# Create the local folders that will hold the extracted JSON files
Path(data_path).mkdir(parents=True, exist_ok=True)
cricsheet_file_storage_path = os.path.join(data_path, cricket_format)
Path(cricsheet_file_storage_path).mkdir(parents=True, exist_ok=True)

# Pick the URL for the chosen format (e.g. "test" -> test_cricsheet_files)
cricsheet_file_path = "{0}_cricsheet_files".format(cricket_format)
cricsheet_url = globals()[cricsheet_file_path]

## Download the zip file, extract it, and remove the archive once all files are extracted
! wget -q "$cricsheet_url"
zip_file = cricsheet_url.split('/')[-1]
! unzip -o -qq "$zip_file" -d "$cricsheet_file_storage_path"
! rm "$zip_file"

This extracts a JSON file for each of the 805 Test matches (the dataset size at the time of writing), along with a README.txt file. A sample of the README file is shown below.

Snippet from the README.txt file in the Cricsheet Test match data
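
As a quick sanity check, you can count the extracted match files. A minimal sketch, assuming the archive was unpacked into cricsheet_file_storage_path as in the snippet above:

import glob
import os

# Each match is stored as <match_id>.json inside the extracted folder
json_files = glob.glob(os.path.join(cricsheet_file_storage_path, "*.json"))
print(len(json_files))  # 805 male Tests at the time of writing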

Step 2: Summarize match metadata into a table

I want to summarize the data in this README file into a neat table. For that, I leverage GPT-4o from OpenAI to help build a Python function that reads the README.txt file and parses it into a tidy DataFrame.

  1. Head to https://chatgpt.com/
  2. Enter the following prompt in the chat window, pasting in sample text from the README.txt file.
  3. The prompt given to GPT-4o is below.
extract all rows of the readme file below after "The 4th field is the gender of the players involved in the match. The
5th field is the id of the match, and the remainder of the line shows the
teams involved in the match." into a table.

This zip archive contains data files from Cricsheet in JSON format. This
archive contains 805 male Test matches.


The JSON data files contained in this zip file are version 1.0.0, and 1.1.0
files. You can learn about the structure of these files at
https://cricsheet.org/format/json/


You can find the available downloads at https://cricsheet.org/downloads/, and
you can find the most up-to-date version of this zip file at
https://cricsheet.org/downloads/tests_male_json.zip


The matches contained in this zip archive are listed below. The first field is
the start date of the match (for test matches or other multi-day matches), or
the actual date (for all other types of match). The second is the type of
teams involved, whether 'club', or 'international'. The third is the type of
match, either Test, ODI, ODM, T20, IT20, MDM, or a club competition code (such
as IPL). The 4th field is the gender of the players involved in the match. The
5th field is the id of the match, and the remainder of the line shows the
teams involved in the match.


2024-03-30 - international - Test - male - 1419831 - Sri Lanka vs Bangladesh
2024-03-22 - international - Test - male - 1419830 - Sri Lanka vs Bangladesh
2024-03-08 - international - Test - male - 1388227 - New Zealand vs Australia
2024-03-07 - international - Test - male - 1389403 - England vs India

This helps produce a function like the one below, to which you can pass the README.txt file path and get the data extracted into a neat table.

Output of the above prompt from GPT-4o

import pandas as pd

# Function to read and parse the README.txt file
def parse_readme_file(file_path):
    with open(file_path, 'r') as file:
        lines = file.readlines()

    # Find the start of the relevant section
    start_index = 0
    for i, line in enumerate(lines):
        if "The 4th field is the gender of the players involved in the match." in line:
            start_index = i + 3  # The actual match data starts 3 lines after this line
            break

    # Extract and parse the relevant lines
    parsed_data = []
    for line in lines[start_index:]:
        if line.strip():  # Ignore empty lines
            fields = line.strip().split(' - ')
            start_date, team_type, match_type, gender, match_id, teams = fields
            team1, team2 = teams.split(' vs ')
            parsed_data.append([start_date, match_type, match_id, team1, team2])

    # Create a DataFrame from the parsed data
    df = pd.DataFrame(parsed_data, columns=['matchDate', 'Format', 'matchID', 'Team1', 'Team2'])

    return df

# File path to the README.txt file
file_path = os.path.join(cricsheet_file_storage_path, 'README.txt')

# Parse the README.txt file and get the DataFrame
cricsheet_match_df = parse_readme_file(file_path)

# Display the DataFrame
cricsheet_match_df

Step 3: Extract all review fields onto a table

Each match file is a JSON file. Leverage GPT-4o again to understand its structure and extract the relevant review fields. Once the basic skeleton of the function is obtained, we can tweak it to extract the exact fields we need.
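
To get a sample of the match JSON to paste into the prompt, you can first load a single match file into a dictionary. A minimal sketch (the match ID used here is just one example taken from the README listing above):

import json
import os

# Load one match file from the extracted archive to inspect its structure
sample_match_id = "1419831"  # Sri Lanka vs Bangladesh, 2024-03-30
with open(os.path.join(cricsheet_file_storage_path, "{0}.json".format(sample_match_id))) as f:
    match_data = json.load(f)

# Top-level keys are 'meta', 'info', and 'innings'
print(match_data.keys())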

PROMPT to GPT-4o below.

extract the review field below into a neat dataframe  in pandas python

{'meta': {'data_version': '1.1.0', 'created': '2024-04-03', 'revision': 1},
'info': {'balls_per_over': 6,
'city': 'Chattogram',
'dates': ['2024-03-30',
'2024-03-31',
'2024-04-01',
'2024-04-02',
'2024-04-03'],
'event': {'name': 'Sri Lanka tour of Bangladesh', 'match_number': 2},
'gender': 'male',
'match_type': 'Test',
'match_type_number': 2537,
'officials': {'match_referees': ['JJ Crowe'],
'reserve_umpires': ['Gazi Sohel'],
'tv_umpires': ['CB Gaffaney'],
'umpires': ['RK Illingworth', 'RJ Tucker']},
'outcome': {'winner': 'Sri Lanka', 'by': {'runs': 192}},
'player_of_match': ['PHKD Mendis'],
'players': {'Sri Lanka': ['KNM Fernando',
'FDM Karunaratne',
'BKG Mendis',
'AD Mathews',
'LD Chandimal',
'DM de Silva',
'PHKD Mendis',
'NGRP Jayasuriya',
'MVT Fernando',
'CBRLS Kumara',
'AM Fernando'],
'Bangladesh': ['Mahmudul Hasan Joy',
'Zakir Hasan',
'Taijul Islam',
'Nazmul Hossain Shanto',
'Mominul Haque',
'Shakib Al Hasan',
'Liton Das',
'Shahadat Hossain',
'Mehedi Hasan Miraz',
'Khaled Ahmed',
'Hasan Mahmud']},
'registry': {'people': {'AD Mathews': '896d78ad',
'AM Fernando': 'de3d549a',
'BKG Mendis': '5d1e7582',
'CB Gaffaney': 'd5ac41d8',
'CBRLS Kumara': '4751caa3',
'DM de Silva': '7d608e12',
'FDM Karunaratne': '400e2d4d',
'Gazi Sohel': 'f0df3ed7',
'Hasan Mahmud': '80ad39ac',
'JJ Crowe': '2e760301',
'KNM Fernando': 'd2a2bb1a',
'Khaled Ahmed': '5dbdecf1',
'LD Chandimal': '8dd02a98',
'Liton Das': '0404d43c',
'MVT Fernando': '5eef2118',
'Mahmudul Hasan Joy': '4b51bd51',
'Mehedi Hasan Miraz': 'c58b0108',
'Mominul Haque': 'de0a3209',
'NGRP Jayasuriya': '127bc5ca',
'Nazmul Hossain Shanto': '1274d7ab',
'PHKD Mendis': '08548b13',
'RJ Tucker': '38810cfc',
'RK Illingworth': '20672246',
'Shahadat Hossain': '69e23303',
'Shakib Al Hasan': '7dc35884',
'Taijul Islam': '9d1d34bd',
'Zakir Hasan': 'e3e09091'}},
'season': '2023/24',
'team_type': 'international',
'teams': ['Sri Lanka', 'Bangladesh'],
'toss': {'decision': 'bat', 'winner': 'Sri Lanka'},
'venue': 'Zahur Ahmed Chowdhury Stadium, Chattogram'},
'innings': [{'team': 'Sri Lanka',
'overs': [{'over': 0,
'deliveries': [{'batter': 'KNM Fernando',
'bowler': 'Khaled Ahmed',
'non_striker': 'FDM Karunaratne',
'runs': {'batter': 0, 'extras': 0, 'total': 0}},
{'batter': 'KNM Fernando',
'bowler': 'Khaled Ahmed',
'non_striker': 'FDM Karunaratne',
'runs': {'batter': 0, 'extras': 0, 'total': 0}},
{'batter': 'KNM Fernando',
'bowler': 'Khaled Ahmed',
'non_striker': 'FDM Karunaratne',
'runs': {'batter': 0, 'extras': 0, 'total': 0}},
{'batter': 'KNM Fernando',
'bowler': 'Khaled Ahmed',
'non_striker': 'FDM Karunaratne',
'runs': {'batter': 0, 'extras': 0, 'total': 0}},
{'batter': 'BKG Mendis',
'bowler': 'Taijul Islam',
'non_striker': 'FDM Karunaratne',
'review': {'by': 'Bangladesh',
'umpire': 'RJ Tucker',
'batter': 'BKG Mendis',
'decision': 'struck down',
'type': 'wicket'},
'runs': {'batter': 0, 'extras': 0, 'total': 0}},

The output of the prompt from GPT-4o is below.

# Function to extract review data and create a DataFrame
def extract_reviews(match_data):
    reviews = []
    for inning_num, inning in enumerate(match_data.get('innings', [])):
        for over in inning.get('overs', []):
            for delivery in over.get('deliveries', []):
                if 'review' in delivery:
                    review = delivery['review']
                    review_data = {
                        'team': inning.get('team', ''),
                        'innings_in_match': int(inning_num + 1),
                        'over': over.get('over', ''),
                        'batter': delivery.get('batter', ''),
                        'bowler': delivery.get('bowler', ''),
                        'non_striker': delivery.get('non_striker', ''),
                        'review_by': review.get('by', ''),
                        'umpire': review.get('umpire', ''),
                        'review_batter': review.get('batter', ''),
                        'umpires_call': review.get('umpires_call', ''),
                        'decision': review.get('decision', ''),
                        # 'review_type': review.get('type', ''),
                        'review_type': delivery['wickets'][0].get('kind') if 'wickets' in delivery else '',
                        # 'runs_batter': delivery.get('runs', {}).get('batter', ''),
                        # 'runs_extras': delivery.get('runs', {}).get('extras', ''),
                        # 'runs_total': delivery.get('runs', {}).get('total', ''),
                        'venue': match_data['info'].get('venue', ''),
                        'city': match_data['info'].get('city', ''),
                        'season': match_data['info'].get('season', '')
                    }
                    reviews.append(review_data)

    # Create DataFrame
    df = pd.DataFrame(reviews)
    return df

# Extract review data and create DataFrame
df_reviews = extract_reviews(match_data)

df_reviews

Step 4: Collect data for all DRS instances across all test matches

Consolidate all data into a table.

review_df = pd.DataFrame()

# Loop over every match ID from the README table, parse its JSON file, and stack the reviews
for match_id in list(set(cricsheet_match_df['matchID'])):
    match_data = json.load(open(os.path.join(cricsheet_file_storage_path, "{0}.json".format(match_id))))
    df_reviews = extract_reviews(match_data)
    review_df = pd.concat([df_reviews, review_df])

Final consolidated table of all DRS instances in Test cricket

Possibilities with such data

In the next article, I shall outline how to build a natural language query system that lets you interact with the data in this table.

For example, if I ask “How many DRS events happened at Barbados?”, I get the answer that there were 29 instances of DRS taken in Tests played at Barbados.

GPT-4o understanding a cricket-based question and answering it correctly
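
Even before building that query layer, such questions can be sanity-checked directly on the consolidated table with pandas. A minimal sketch, assuming the Cricsheet venue strings for those matches mention "Barbados" or "Bridgetown" (the exact venue naming may vary across seasons):

# Count DRS instances at venues in Barbados (e.g. Kensington Oval, Bridgetown)
barbados_reviews = review_df[
    review_df['venue'].str.contains('Barbados|Bridgetown', case=False, na=False, regex=True)
]
print(len(barbados_reviews))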

More on this in the next post. Till then, happy reading.

About me

This work is a reflection of my learning journey as a senior data scientist who works with numbers on a day-to-day basis. I plan to make it a fun, light introduction to analytics using the theme of cricket, targeted at entry-level cricket enthusiasts looking to get their feet wet in the field of analytics and AI. I am also learning along the way, so bear with me if you find any obvious flaws; after all, we are all students of life and cricket.

All lessons will have a set of video lectures, documentation in the form of a Medium blog, and associated code that can be downloaded and run.

The entire codebase used in this blog can be found at https://gist.github.com/kishan19/400b1fcf66899956411fae9877032f07

Click on “Open in Colab” to run the notebook in Google Colab.

I will upload a YouTube video on my channel Cricket x Data soon as well :)

Cheers,

Kishan
