Data Science for DOTA2 (PART 1 — Data Collection)

Alleria
10 min read · May 11, 2023


Keywords: Opendota API call, Python Requests, BeautifulSoup, Scrapy

Wings won the TI6 championship at Seattle's KeyArena

It has been almost 2 years since TI10, 5 years since TI8, and 7 years since TI6, where Wings won the last championship for China. Last year, when Ame decided to take a break and Somnus (Maybe) announced his retirement, I thought my great memories with LGD would fade away. Now that Somnus has returned, together with those amazing summer legends, I feel encouraged. Before the end of my summer break, I'll try to apply what I have learned to a small project: Data Science for DOTA2.

The first part covers data collection: using the OpenDota API to get match and player information, using requests to fetch a simple page and parsing it with BeautifulSoup, and using Scrapy to download detailed information for heroes. In part 2, I plan to analyze the data and build a prediction model for the draft. In part 3, I will try to build a recommendation system so that players can better understand the heroes and, hopefully, the system can help new players.

1. OpenDota API

OpenDota offers DOTA2-related data. You can log in with your Steam account and explore matches. Here is a description of the free tier and the premium one:

Description

I think it is as cheap as DOTA2 :) Here is an example showing how to use requests to get data from the API:

# get all the heroes' information
import requests
import pandas as pd

pd.set_option('display.max_columns', None)
opendota_api_heroes_url = 'https://api.opendota.com/api/heroes'

data = requests.get(opendota_api_heroes_url, timeout=3)
data_json = data.json()

en_hero_df = pd.DataFrame(data_json)
en_hero_df['en_name'] = en_hero_df['name'].apply(lambda x: x.replace('npc_dota_hero_', ''))
en_hero_df = en_hero_df.rename(columns={'localized_name': 'official_name', 'id': 'hero_id',
                                        'name': 'ingame_name'})
en_hero_df.to_csv('dota2_heroes.csv', index=False, header=True)
en_hero_df.head()
Result
# get the win and lose info of an account.
# this is Ame's account I think
def get_api_json(url, loop=True, proxy=None):
    try:
        return requests.get(url, headers={'User-Agent': 'Chrome'}, timeout=3, proxies=proxy).json()
    except requests.exceptions.RequestException as e:
        print(e)
        return get_api_json(url, loop, proxy) if loop is True else None

get_api_json("http://api.opendota.com/api/players/148351321/wl", loop=True)
# result: {'win': 769, 'lose': 842}
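Note that with loop=True this helper retries forever if the endpoint keeps failing. If you prefer a capped number of attempts, a small variant (the name get_api_json_bounded is just illustrative) could look like this:

import time

def get_api_json_bounded(url, max_retries=5, proxy=None):
    # try at most max_retries times, pausing briefly between attempts
    for attempt in range(max_retries):
        try:
            return requests.get(url, headers={'User-Agent': 'Chrome'},
                                timeout=3, proxies=proxy).json()
        except requests.exceptions.RequestException as e:
            print(f"attempt {attempt + 1} failed: {e}")
            time.sleep(2)
    return None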

Next, we can make an API call to get the information of a match.

# this is a match between PSG.LGD and Team Spirit
match_json = get_api_json("https://api.opendota.com/api/matches/6227419633", loop=True)
match_json['players'][0]['name']
# result: 萧瑟

Next, I build a dictionary to map hero ids to official names:

hero_id_to_name = dict(zip(en_hero_df['hero_id'].values, en_hero_df['official_name'].values))
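For example, with the match_json fetched above, the first player's hero name can be looked up directly:

first_player = match_json['players'][0]
hero_id_to_name[first_player['hero_id']]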

To get huge amounts of data, we can use the website's Explorer. You can use the basic version to select attributes in the UI, or you can switch to SQL queries (it is backed by PostgreSQL, check this). I first get all the match ids and their start times, then I download the detailed information.

# SELECT

# matches.match_id,
# matches.start_time

# FROM matches
# JOIN match_patch using(match_id)

# WHERE matches.start_time >= extract(epoch from timestamp '2021-01-01T07:00:00.000Z')
# AND matches.start_time <= extract(epoch from timestamp '2023-05-09T07:00:00.000Z')
# ORDER BY matches.start_time DESC
# LIMIT 100000

match_id_api = 'https://api.opendota.com/api/explorer?sql=SELECT%0A%20%20%20%20%20%20%20%20%0Amatches.match_id%2C%0Amatches.start_time%0A%0AFROM%20matches%0AJOIN%20match_patch%20using(match_id)%0A%0AWHERE%20matches.start_time%20%3E%3D%20extract(epoch%20from%20timestamp%20%272021-01-01T07%3A00%3A00.000Z%27)%0AAND%20matches.start_time%20%3C%3D%20extract(epoch%20from%20timestamp%20%272023-05-09T07%3A00%3A00.000Z%27)%0AORDER%20BY%20matches.start_time%20DESC%0ALIMIT%20100000'
match_data = requests.get(match_id_api).json()
match_ids = [row['match_id'] for row in match_data['rows']]
print(f"Altogether we have {len(match_ids)} matches.")
# Altogether we have xxx matches.
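Instead of hand-encoding the SQL into the URL, you can also let requests do the URL encoding. A small sketch equivalent to the hardcoded URL above:

sql = """
SELECT
matches.match_id,
matches.start_time
FROM matches
JOIN match_patch using(match_id)
WHERE matches.start_time >= extract(epoch from timestamp '2021-01-01T07:00:00.000Z')
AND matches.start_time <= extract(epoch from timestamp '2023-05-09T07:00:00.000Z')
ORDER BY matches.start_time DESC
LIMIT 100000
"""

# requests URL-encodes the sql parameter for us
match_data = requests.get('https://api.opendota.com/api/explorer', params={'sql': sql}).json()
match_ids = [row['match_id'] for row in match_data['rows']]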

I manually define a list of the game-performance features I'm interested in:

attributes_list= ['region',
'isRadiant',
'win',
'lose',
'total_gold',
'total_xp',
'kills_per_min',
'kda',
'abandons',
'neutral_kills',
'tower_kills',
'courier_kills',
'lane_kills',
'hero_kills',
'observer_kills',
'sentry_kills',
'roshan_kills',
'necronomicon_kills',
'ancient_kills',
'buyback_count',
'observer_uses',
'sentry_uses']

Next, I write a loop to get all the information I need for future work. If you want to use the premium tier, register your credit card on OpenDota and append "?api_key=YOUR_API" to each URL, as in the small sketch below.
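For example, a tiny helper that appends the key to any OpenDota URL might look like this (with_api_key and OPENDOTA_API_KEY are just illustrative names; leave the key as None on the free tier):

OPENDOTA_API_KEY = None  # set to your key string if you use the premium tier

def with_api_key(url):
    # append ?api_key=... (or &api_key=... if the URL already has a query string)
    if not OPENDOTA_API_KEY:
        return url
    sep = '&' if '?' in url else '?'
    return f"{url}{sep}api_key={OPENDOTA_API_KEY}"

# e.g. get_api_json(with_api_key(f"https://api.opendota.com/api/matches/{match_id}"))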

from datetime import datetime
import time

# https://www.opendota.com/api-keys
def get_hero_position_and_role(row_dict_list, player_dict_list, match_id, matches_list):
    # Add ?api_key=YOUR_API after {match_id} to accelerate the API call
    # Refer to the OpenDota website https://www.opendota.com/api-keys
    match_json = get_api_json(f"https://api.opendota.com/api/matches/{match_id}", loop=True)
    matches_list.append(match_json)

    # I have tested the following attributes and generally there will be no error
    row_dicts = []
    win_camp = None
    start_time, duration, game_mode = None, None, None
    radiant_score, dire_score = None, None
    try:
        if match_json["radiant_win"]:
            win_camp = 'radiant'
        else:
            win_camp = 'dire'
        if match_json['game_mode']:  # e.g. mode 1: all pick, mode 2: captains mode
            game_mode = match_json['game_mode']
        else:
            game_mode = None
        start_time = match_json["start_time"]
        start_time = datetime.fromtimestamp(start_time).strftime('%y-%m-%d')
        duration = match_json["duration"]
        radiant_score = match_json['radiant_score']
        dire_score = match_json['dire_score']
    except Exception:
        print(f'unknown result id: {match_id}')

    if 'players' in match_json:
        for p in match_json['players']:
            row_dict = {}
            row_dict['match_id'] = match_id
            row_dict['camp'] = 'radiant' if p['isRadiant'] else 'dire'
            row_dict['hero_id'] = p['hero_id']
            row_dict['hero_name'] = hero_id_to_name[p['hero_id']]
            row_dict['player'] = p['name'] if 'name' in p else None
            row_dict['player_id'] = p['account_id']
            if 'lane_role' in p:
                row_dict["lane_role"] = {1: "safe", 2: "mid", 3: "off", 4: 'other'}[p["lane_role"]]
            else:
                row_dict["lane_role"] = None
            # lh_t is the last-hit timeline; index 6 is the count at minute 6
            if p["lh_t"] and len(p["lh_t"]) > 6:
                row_dict["6_min_last_hits"] = p["lh_t"][6]
            else:
                row_dict["6_min_last_hits"] = None
            purchase_ward_observer = p["purchase_ward_observer"] if "purchase_ward_observer" in p else 0
            purchase_ward_sentry = p["purchase_ward_sentry"] if "purchase_ward_sentry" in p else 0
            row_dict["purchase_ward_count"] = purchase_ward_observer + purchase_ward_sentry
            row_dict['gold_per_min'] = p['gold_per_min']
            row_dict['hero_damage'] = p['hero_damage']
            row_dict['tower_damage'] = p['tower_damage']
            if 'rank_tier' in p:
                row_dict['rank_tier'] = p['rank_tier']
            for a in attributes_list:
                row_dict[a] = p[a] if a in p else None

            row_dicts.append(row_dict)
            player_dict_list.append(row_dict)

        w_df = pd.DataFrame(row_dicts)
        # the 6 heroes with the most last hits at 6 minutes are treated as the cores
        sorted_core_df = w_df.sort_values("6_min_last_hits", ascending=False).head(6)
        # alternative idea: sorted_ward_df = w_df.sort_values('purchase_ward_count', ascending=False).head(4)
        # (the 4 heroes with the most ward purchases --> supports TAT)

        def get_role(r):
            # define the cores first (cores have more last hits), then the supports
            if r.lane_role == 'safe':
                return 'carry' if r.hero_id in sorted_core_df.hero_id.tolist() else 'hard_support'
            elif r.lane_role == 'off':
                return 'offlane' if r.hero_id in sorted_core_df.hero_id.tolist() else 'soft_support'
            elif r.lane_role == 'mid':
                return 'mid' if r.hero_id in sorted_core_df.hero_id.tolist() else 'Unknown'

        w_df['role'] = w_df.apply(lambda r: get_role(r), axis=1)

        radiant_df = w_df[w_df['camp'] == 'radiant']
        dire_df = w_df[w_df['camp'] == 'dire']

        hero_positions_radiant = {radiant_df.hero_name.tolist()[i]: [radiant_df.role.tolist()[i], radiant_df.player_id.tolist()[i]]
                                  for i in range(len(radiant_df))}
        hero_positions_dire = {dire_df.hero_name.tolist()[i]: [dire_df.role.tolist()[i], dire_df.player_id.tolist()[i]]
                               for i in range(len(dire_df))}

        res_dicts = {}
        res_dicts['match_id'] = match_id
        res_dicts['start_time'] = start_time
        res_dicts['duration'] = duration
        res_dicts['game_mode'] = game_mode
        res_dicts['win_camp'] = win_camp
        res_dicts['radiant_hero'] = hero_positions_radiant
        res_dicts['dire_hero'] = hero_positions_dire
        res_dicts['radiant_score'] = radiant_score
        res_dicts['dire_score'] = dire_score

        row_dict_list.append(res_dicts)
    else:
        print(f"Problem id: {match_id}")

    print(f"current length: {len(row_dict_list)}")
    return row_dict_list, player_dict_list, matches_list

team_dict_list = []
player_dict_list = []
matches_list = []
count = 1
for match_id in match_ids[:3]:
    if count % 1200 == 0:
        time.sleep(65)  # free tier: 60 calls per minute
    count += 1
    team_dict_list, player_dict_list, matches_list = get_hero_position_and_role(
        team_dict_list, player_dict_list, match_id, matches_list)

# team_dict_list[0]
# {'match_id': 7146139538,
# 'start_time': '23-05-08',
# 'duration': 1409,
# 'game_mode': 2,
# 'win_camp': 'dire',
# 'radiant_hero': {'Death Prophet': ['mid', 1459205302],
# 'Jakiro': ['hard_support', 1459031599],
# 'Underlord': ['offlane', 1423366323],
# 'Gyrocopter': ['soft_support', 1458610209],
# 'Naga Siren': ['carry', 1459124273]},
# 'dire_hero': {'Queen of Pain': ['mid', 1526386891],
# 'Disruptor': ['hard_support', 1513040880],
# 'Templar Assassin': ['carry', 1427100407],
# 'Pudge': ['soft_support', 1517400698],
# 'Mars': ['offlane', 1529282248]},
# 'radiant_score': 14,
# 'dire_score': 34}
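Downloading tens of thousands of matches can take hours, so it may be worth dumping intermediate results every so often. A minimal sketch you could call inside the loop above (the checkpoint file names are just placeholders):

import json

def checkpoint(team_dict_list, player_dict_list, matches_list, tag=''):
    # save whatever has been collected so far, so a crash doesn't lose everything
    with open(f'matches_checkpoint{tag}.json', 'w') as f:
        json.dump(matches_list, f)
    pd.DataFrame(team_dict_list).to_csv(f'team_checkpoint{tag}.csv', index=False)
    pd.DataFrame(player_dict_list).to_csv(f'player_checkpoint{tag}.csv', index=False)

# e.g. inside the loop: if count % 1000 == 0: checkpoint(team_dict_list, player_dict_list, matches_list, tag=str(count))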

Now you can save the data locally.

import json

# Combine dictionaries into a list
dict_list = matches_list
len(dict_list)

# Write the list of dictionaries to a JSON file (optional)
with open('data.json', 'w') as f:
    json.dump(dict_list, f)

team_df = pd.DataFrame(team_dict_list)
team_df.to_csv('team.csv', index=False, header=True)

player_df = pd.DataFrame(player_dict_list)
# player_df.head(10)
player_df.to_csv('player.csv', index=False, header=True)

I can share the data I have downloaded. If you would like it, please email me, since the dataset is very large. Later I will try to upload it to a drive link, which will be added here once it is ready.

2. DOTA2 wiki

The DOTA2 wiki is a great source, and I'm grateful to all its contributors. Now I will show you how to download hero attributes using requests and BeautifulSoup. I checked https://dota2.fandom.com/robots.txt and did not find anything that forbids fetching hero information. But if it is not allowed, please let me know and I will delete this immediately.
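If you want to check this programmatically, Python's standard library ships a robots.txt parser; a small sketch:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser('https://dota2.fandom.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'https://dota2.fandom.com/wiki/Table_of_hero_attributes'))
# True means the page is not disallowed for generic crawlers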

Here I basically get the hero attributes from the base URL.

import requests
from bs4 import BeautifulSoup
# reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

urls = ['https://dota2.fandom.com/wiki/Table_of_hero_attributes']
r = requests.get(urls[0])
html_doc = r.content
soup = BeautifulSoup(html_doc, 'html.parser')

attr_list = []

def get_attributes(url, attr_list):
    r = requests.get(url)
    html_doc = r.content
    soup = BeautifulSoup(html_doc, 'html.parser')

    for tr in soup.tbody.find_all('tr')[1:]:
        # use the hero name from the URL for a future merge with other data
        hero_name = tr.a.get('href').split('/')[-1]
        tds = tr.find_all('td')
        attr_dict = {}
        attr_dict['hero'] = hero_name
        attr_dict['base_strength'] = float(tds[2].string)
        attr_dict['add_strength'] = float(tds[3].string)
        attr_dict['lvl30_strength'] = float(tds[4].string)
        attr_dict['base_agility'] = float(tds[5].string)
        attr_dict['add_agility'] = float(tds[6].string)
        attr_dict['lvl30_agility'] = float(tds[7].string)
        attr_dict['base_intelligence'] = float(tds[8].string)
        attr_dict['add_intelligence'] = float(tds[9].string)
        attr_dict['lvl30_intelligence'] = float(tds[10].string)
        attr_list.append(attr_dict)

for url in urls:
    get_attributes(url, attr_list)

print(f"Altogether we have {len(attr_list)} heroes.")
# Altogether we have 124 heroes.

attr_list[2]
# {'hero': 'Ancient_Apparition',
# 'base_strength': 20.0,
# 'add_strength': 1.9,
# 'lvl30_strength': 75.1,
# 'base_agility': 20.0,
# 'add_agility': 2.2,
# 'lvl30_agility': 83.8,
# 'base_intelligence': 23.0,
# 'add_intelligence': 3.4,
# 'lvl30_intelligence': 121.6}

Save the data:

attribute_df = pd.DataFrame(attr_list)
attribute_df.to_csv('hero_attributes.csv', index=False, header=True)
attribute_df
Hero_attributes Result

Next, let's use Scrapy to crawl hero details (URLs, bios, counter relationships, and pros/cons).

To use Scrapy, you need to install it (pip install scrapy). If you follow the official tutorial, you will be able to create a project quickly. There are two important commands.

a. scrapy shell 'https://dota2.fandom.com/wiki/Heroes'

Run this in your terminal. It drops you into the Scrapy shell, where you can freely test CSS/XPath selectors.
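For example, you can try the same selectors that the spider below uses and inspect what they return (the shell exposes response for the fetched page):

# inside the shell started with: scrapy shell 'https://dota2.fandom.com/wiki/Heroes'
rows = response.css('table>tbody>tr')          # the rows of the hero grid
cells = rows.css('td>div>div:first-child')     # one block per hero
cells.css('div>a::attr(title)').getall()[:5]   # first few hero names
cells.css('div>a::attr(href)').get()           # a relative link such as '/wiki/<Hero_name>'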

b. scrapy crawl Heroes -O heroes_url.json

Run this command to crawl the data; you will get a heroes_url.json file containing what you crawled. Remember, "Heroes" here is the "name" attribute of your spider class.

I'll show the scripts below. They should be placed inside the spiders/ folder, and they are simple ones.

# get heroes, links and attributes
import scrapy

class HeroesSpider(scrapy.Spider):
    name = "Heroes"  # pay attention to the name here

    def start_requests(self):
        urls = [
            "https://dota2.fandom.com/wiki/Heroes",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        count = 0

        for heroes in response.css('table>tbody>tr'):
            for hero in heroes.css('td>div>div:first-child'):
                count += 1
                print(hero.css('div>a::attr(title)').get())
                print(count)
                # the heroes appear grouped by primary attribute, so the running
                # count tells us which attribute block we are in
                if count <= 31:
                    attribute = 'Strength'
                elif count <= 62:
                    attribute = 'Agility'
                elif count <= 93:
                    attribute = 'Intelligence'
                else:
                    attribute = 'Universal'

                yield {
                    "hero": hero.css('div>a::attr(href)').get().split('/')[-1],
                    "link": 'https://dota2.fandom.com' + hero.css('div>a::attr(href)').get(),
                    'attribute': attribute,
                }

You need to run the above spider first so that you have the URLs for the following steps.
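Each record in heroes_url.json has the three fields yielded above, so the later spiders simply load the file and build their start URLs from the link field, for example:

import json

with open('heroes_url.json') as f:
    heroes = json.load(f)

print(heroes[0].keys())   # dict_keys(['hero', 'link', 'attribute'])
counter_urls = [h['link'] + '/Counters' for h in heroes]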

# get the heroes and their bios
import scrapy
import json


class HeroBioSpider(scrapy.Spider):
    name = "HeroBio"

    def start_requests(self):
        with open('heroes_url.json') as f:
            urls = [hero['link'] for hero in json.load(f)]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        hero = response.url.split('/')[-1]

        bios = ''
        for bio in response.css('div#heroBio>*'):
            bios += '.\n'.join(bio.css('::text').getall())

        yield {
            "hero": hero,
            "bio": bios,
        }
# get the heroes and their counter relationships
import scrapy
import json


class HeroCounterSpider(scrapy.Spider):
    name = "HeroCounter"

    def start_requests(self):
        with open('heroes_url.json') as f:
            urls = [hero['link'] + '/Counters' for hero in json.load(f)]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        counters_list = []
        hero_name = response.url.split('/')[-2]

        # count the hero entries and record the running count at each section
        # heading, which marks where the next section of the page starts
        count = 0
        for hero in response.css('div.mw-parser-output>*'):
            if hero.css('div>b>a'):
                count += 1
            if hero.css('h2>span'):
                counters_list.append(count)
            elif hero.xpath('.//h2/span[@id="Good_against..."]'):
                counters_list.append(count)
            elif hero.xpath('.//h2/span[@id="Works well with..."]'):
                counters_list.append(count)

        heros_list = response.css('div.mw-parser-output>div>b>a::text').getall()

        bad_against = heros_list[:counters_list[1]]
        good_against = heros_list[counters_list[1]:counters_list[2]]
        work_well_with = heros_list[counters_list[2]:]

        yield {
            "hero": hero_name,
            "bad_against": bad_against,
            "good_against": good_against,
            "work_well_with": work_well_with,
        }
# get the heroes and their pros and cons
import scrapy
import json


class HeroProConSpider(scrapy.Spider):
    name = "HeroProCon"

    def start_requests(self):
        with open('heroes_url.json') as f:
            urls = [hero['link'] + '/Guide' for hero in json.load(f)]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        hero_name = response.url.split('/')[-2]

        pros_css = response.xpath('.//table[tbody/tr/th[contains(text(),"Playstyle")]]//tr[@valign="top"]/td[1]/ul/li')
        pros = '\n'.join([pro.css('::text').get() for pro in pros_css])

        cons_css = response.xpath('.//table[tbody/tr/th[contains(text(),"Playstyle")]]//tr[@valign="top"]/td[2]/ul/li')
        cons = '\n'.join([con.css('::text').get() for con in cons_css])

        yield {
            "hero": hero_name,
            "pros": pros,
            "cons": cons,
        }

These are very basic applications of Scrapy; you can explore further on your own. The project folder stores the crawled JSON files.

And this is the end of PART 1. If you want to merge all the data on a key such as the hero name, here is the code for your reference:

import pandas as pd
import os

pd.set_option('display.max_columns', None)
# my scrapy project is named "heroes"
hero_detail_path = './heroes/heroes/heroes'

pro_con_path = os.path.join(hero_detail_path, 'hero_pro_con.json')
bio_path = os.path.join(hero_detail_path, 'heroes_bios.json')
counter_path = os.path.join(hero_detail_path, 'heroes_counter.json')
url_path = os.path.join(hero_detail_path, 'heroes_url.json')

basic_hero_df = pd.read_csv('dota2_heroes.csv')
attr_df = pd.read_csv('hero_attributes.csv')
hero_pro_con_df = pd.read_json(pro_con_path)
hero_bio_df = pd.read_json(bio_path)
hero_counter_df = pd.read_json(counter_path)
hero_url_df = pd.read_json(url_path)

# normalize Nature's Prophet (its wiki name contains a URL-encoded apostrophe)
hero_pro_con_df.replace('Nature%27s_Prophet', "Nature's_Prophet", inplace=True)
attr_df.replace('Nature%27s_Prophet', "Nature's_Prophet", inplace=True)
hero_bio_df.replace('Nature%27s_Prophet', "Nature's_Prophet", inplace=True)
hero_counter_df.replace('Nature%27s_Prophet', "Nature's_Prophet", inplace=True)
hero_url_df.replace('Nature%27s_Prophet', "Nature's_Prophet", inplace=True)

basic_hero_df.drop(columns=['ingame_name','en_name'], inplace=True)
basic_hero_df['name_id'] = basic_hero_df['official_name'].apply(lambda x:x.lower().replace('-','').replace('_','').replace(' ',''))
attr_df['name_id'] = attr_df['hero'].apply(lambda x:x.lower().replace('-','').replace('_','').replace(' ',''))
hero_pro_con_df['name_id'] = hero_pro_con_df['hero'].apply(lambda x:x.lower().replace('-','').replace('_','').replace(' ',''))
hero_bio_df['name_id'] = hero_bio_df['hero'].apply(lambda x:x.lower().replace('-','').replace('_','').replace(' ',''))
hero_counter_df['name_id'] = hero_counter_df['hero'].apply(lambda x:x.lower().replace('-','').replace('_','').replace(' ',''))
hero_url_df['name_id'] = hero_url_df['hero'].apply(lambda x:x.lower().replace('-','').replace('_','').replace(' ',''))

basic_hero_df = basic_hero_df.merge(hero_pro_con_df, on='name_id').drop(columns=['hero'])
basic_hero_df = basic_hero_df.merge(attr_df, on='name_id').drop(columns=['hero'])
basic_hero_df = basic_hero_df.merge(hero_bio_df, on='name_id').drop(columns=['hero'])
basic_hero_df = basic_hero_df.merge(hero_counter_df, on='name_id').drop(columns=['hero'])
basic_hero_df = basic_hero_df.merge(hero_url_df, on='name_id').drop(columns=['hero'])

basic_hero_df.to_csv('all_heroes.csv', index=False, header=True)
basic_hero_df.head()
Sample Output

This is all for the data collection part. There are a lot of things not included, such as the details of hero mechanics; I will explain them later in PART 2. Thank you for your time! I hope you enjoy data science and DOTA2 :).

It’s so cute to be here.

Please let me know if there are any questions.

Jiayou (keep it up)!
