Data Science for DOTA2 (PART 1 — Data Collection)

Alleria
10 min read · May 11, 2023


Keywords: Opendota API call, Python Requests, BeautifulSoup, Scrapy

Wings won the TI6 championship at Seattle's KeyArena

It has been almost 2 years since TI10, 5 years since TI8, and 7 years since TI6, where Wings won the last championship for China. Last year, when Ame decided to take a break and Somnus (Maybe) announced his retirement, I thought my great memories with LGD would fade away. Now that Somnus has returned, together with those amazing summer legends, I feel encouraged. Before the end of my summer break, I'll try to apply what I have learned to a small project: Data Science for DOTA2.

The first part covers data collection: using the OpenDota API to get match and player information, using requests to fetch a simple page and parsing it with BeautifulSoup, and using Scrapy to download detailed information for heroes. In part 2, I plan to analyze the data and build a prediction model for the draft. In part 3, I will try to build a recommendation system so that players can better understand the heroes and, hopefully, the system can help new players.

1. OpenDota API

OpenDota offers DOTA2-related data. You can log in with your Steam account and explore matches. Here is a description of the free tier and the premium one:

Description

I think it is as cheap as DOTA2 :) Here is an example showing how to use requests to get data from the API:

# get all the heroes' information
import requests
import pandas as pd

pd.set_option('display.max_columns', None)
opendota_api_heroes_url = 'https://api.opendota.com/api/heroes'

data = requests.get(opendota_api_heroes_url, timeout=3)
data_json = data.json()

en_hero_df = pd.DataFrame(data_json)
en_hero_df['en_name'] = en_hero_df['name'].apply(lambda x: x.replace('npc_dota_hero_', ''))
en_hero_df = en_hero_df.rename(columns={'localized_name': 'official_name', 'id': 'hero_id',
                                        'name': 'ingame_name'})
en_hero_df.to_csv('dota2_heroes.csv', index=False, header=True)
en_hero_df.head()
Result
# get the win and lose info of an account.
# this is Ame's account I think
def get_api_json(url, loop=True, proxy=None):
    try:
        return requests.get(url, headers={'User-Agent': 'Chrome'}, timeout=3, proxies=proxy).json()
    except requests.exceptions.RequestException as e:
        print(e)
        return get_api_json(url, loop, proxy) if loop is True else None

get_api_json("http://api.opendota.com/api/players/148351321/wl", loop=True)
# result: {'win': 769, 'lose': 842}
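Note that with loop=True this helper retries forever if the endpoint keeps failing. If you prefer a capped number of attempts, a small variant (the name get_api_json_bounded is just illustrative) could look like this:

import time

def get_api_json_bounded(url, max_retries=5, proxy=None):
    # try at most max_retries times, pausing briefly between attempts
    for attempt in range(max_retries):
        try:
            return requests.get(url, headers={'User-Agent': 'Chrome'},
                                timeout=3, proxies=proxy).json()
        except requests.exceptions.RequestException as e:
            print(f"attempt {attempt + 1} failed: {e}")
            time.sleep(2)
    return None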

Next, we can make an API call to get the information of a match.

# this is a match between PSG.LGD and Team Spirit
match_json = get_api_json("https://api.opendota.com/api/matches/6227419633", loop=True)
match_json['players'][0]['name']
# result: 萧瑟

Next, I build a dictionary to map hero ids to official names:

hero_id_to_name = dict(zip(en_hero_df['hero_id'].values, en_hero_df['official_name'].values))
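For example, with the match_json fetched above, the first player's hero name can be looked up directly:

first_player = match_json['players'][0]
hero_id_to_name[first_player['hero_id']]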

To get huge amounts of data, we can use the website's Explorer. You can use the basic version to select attributes in the UI, or you can switch to SQL queries (it is backed by PostgreSQL, check this). I first get all the match ids and their start times, then I download the detailed information.

# SELECT

# matches.match_id,
# matches.start_time

# FROM matches
# JOIN match_patch using(match_id)

# WHERE matches.start_time >= extract(epoch from timestamp '2021-01-01T07:00:00.000Z')
# AND matches.start_time <= extract(epoch from timestamp '2023-05-09T07:00:00.000Z')
# ORDER BY matches.start_time DESC
# LIMIT 100000

match_id_api = 'https://api.opendota.com/api/explorer?sql=SELECT%0A%20%20%20%20%20%20%20%20%0Amatches.match_id%2C%0Amatches.start_time%0A%0AFROM%20matches%0AJOIN%20match_patch%20using(match_id)%0A%0AWHERE%20matches.start_time%20%3E%3D%20extract(epoch%20from%20timestamp%20%272021-01-01T07%3A00%3A00.000Z%27)%0AAND%20matches.start_time%20%3C%3D%20extract(epoch%20from%20timestamp%20%272023-05-09T07%3A00%3A00.000Z%27)%0AORDER%20BY%20matches.start_time%20DESC%0ALIMIT%20100000'
match_data = requests.get(match_id_api).json()
match_ids = [row['match_id'] for row in match_data['rows']]
print(f"Altogether we have {len(match_ids)} matches.")
# Altogether we have xxx matches.
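Instead of hand-encoding the SQL into the URL, you can also let requests do the URL encoding. A small sketch equivalent to the hardcoded URL above:

sql = """
SELECT
matches.match_id,
matches.start_time
FROM matches
JOIN match_patch using(match_id)
WHERE matches.start_time >= extract(epoch from timestamp '2021-01-01T07:00:00.000Z')
AND matches.start_time <= extract(epoch from timestamp '2023-05-09T07:00:00.000Z')
ORDER BY matches.start_time DESC
LIMIT 100000
"""

# requests URL-encodes the sql parameter for us
match_data = requests.get('https://api.opendota.com/api/explorer', params={'sql': sql}).json()
match_ids = [row['match_id'] for row in match_data['rows']]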

I manually define a list of the game-performance features I'm interested in:

attributes_list= ['region',
'isRadiant',
'win',
'lose',
'total_gold',
'total_xp',
'kills_per_min',
'kda',
'abandons',
'neutral_kills',
'tower_kills',
'courier_kills',
'lane_kills',
'hero_kills',
'observer_kills',
'sentry_kills',
'roshan_kills',
'necronomicon_kills',
'ancient_kills',
'buyback_count',
'observer_uses',
'sentry_uses']

Next, I write a loop to get all the information I need for future work. If you want to use the premium tier, register your credit card on OpenDota and append "?api_key=YOUR_API" to each URL, as in the small sketch below.
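For example, a tiny helper that appends the key to any OpenDota URL might look like this (with_api_key and OPENDOTA_API_KEY are just illustrative names; leave the key as None on the free tier):

OPENDOTA_API_KEY = None  # set to your key string if you use the premium tier

def with_api_key(url):
    # append ?api_key=... (or &api_key=... if the URL already has a query string)
    if not OPENDOTA_API_KEY:
        return url
    sep = '&' if '?' in url else '?'
    return f"{url}{sep}api_key={OPENDOTA_API_KEY}"

# e.g. get_api_json(with_api_key(f"https://api.opendota.com/api/matches/{match_id}"))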

from datetime import datetime
import time

# https://www.opendota.com/api-keys
def get_hero_position_and_role(row_dict_list, player_dict_list, match_id, matches_list):
    # Add ?api_key=YOUR_API after {match_id} to accelerate the API call
    # Refer to the OpenDota website https://www.opendota.com/api-keys
    match_json = get_api_json(f"https://api.opendota.com/api/matches/{match_id}", loop=True)
    matches_list.append(match_json)

    # I have tested the following attributes and generally there will be no error
    row_dicts = []
    win_camp = None
    start_time, duration, game_mode = None, None, None
    radiant_score, dire_score = None, None
    try:
        if match_json["radiant_win"]:
            win_camp = 'radiant'
        else:
            win_camp = 'dire'
        if match_json['game_mode']:  # e.g. mode 1: all pick, mode 2: captains mode
            game_mode = match_json['game_mode']
        else:
            game_mode = None
        start_time = match_json["start_time"]
        start_time = datetime.fromtimestamp(start_time).strftime('%y-%m-%d')
        duration = match_json["duration"]
        radiant_score = match_json['radiant_score']
        dire_score = match_json['dire_score']
    except Exception:
        print(f'unknown result id: {match_id}')

    if 'players' in match_json:
        for p in match_json['players']:
            row_dict = {}
            row_dict['match_id'] = match_id
            row_dict['camp'] = 'radiant' if p['isRadiant'] else 'dire'
            row_dict['hero_id'] = p['hero_id']
            row_dict['hero_name'] = hero_id_to_name[p['hero_id']]
            row_dict['player'] = p['name'] if 'name' in p else None
            row_dict['player_id'] = p['account_id']
            if 'lane_role' in p:
                row_dict["lane_role"] = {1: "safe", 2: "mid", 3: "off", 4: 'other'}[p["lane_role"]]
            else:
                row_dict["lane_role"] = None
            # lh_t is the last-hit timeline; index 6 is the count at minute 6
            if p["lh_t"] and len(p["lh_t"]) > 6:
                row_dict["6_min_last_hits"] = p["lh_t"][6]
            else:
                row_dict["6_min_last_hits"] = None
            purchase_ward_observer = p["purchase_ward_observer"] if "purchase_ward_observer" in p else 0
            purchase_ward_sentry = p["purchase_ward_sentry"] if "purchase_ward_sentry" in p else 0
            row_dict["purchase_ward_count"] = purchase_ward_observer + purchase_ward_sentry
            row_dict['gold_per_min'] = p['gold_per_min']
            row_dict['hero_damage'] = p['hero_damage']
            row_dict['tower_damage'] = p['tower_damage']
            if 'rank_tier' in p:
                row_dict['rank_tier'] = p['rank_tier']
            for a in attributes_list:
                row_dict[a] = p[a] if a in p else None

            row_dicts.append(row_dict)
            player_dict_list.append(row_dict)

        w_df = pd.DataFrame(row_dicts)
        # the 6 heroes with the most last hits at 6 minutes are treated as the cores
        sorted_core_df = w_df.sort_values("6_min_last_hits", ascending=False).head(6)
        # alternative idea: sorted_ward_df = w_df.sort_values('purchase_ward_count', ascending=False).head(4)
        # (the 4 heroes with the most ward purchases --> supports TAT)

        def get_role(r):
            # define the cores first (cores have more last hits), then the supports
            if r.lane_role == 'safe':
                return 'carry' if r.hero_id in sorted_core_df.hero_id.tolist() else 'hard_support'
            elif r.lane_role == 'off':
                return 'offlane' if r.hero_id in sorted_core_df.hero_id.tolist() else 'soft_support'
            elif r.lane_role == 'mid':
                return 'mid' if r.hero_id in sorted_core_df.hero_id.tolist() else 'Unknown'

        w_df['role'] = w_df.apply(lambda r: get_role(r), axis=1)

        radiant_df = w_df[w_df['camp'] == 'radiant']
        dire_df = w_df[w_df['camp'] == 'dire']

        hero_positions_radiant = {radiant_df.hero_name.tolist()[i]: [radiant_df.role.tolist()[i], radiant_df.player_id.tolist()[i]]
                                  for i in range(len(radiant_df))}
        hero_positions_dire = {dire_df.hero_name.tolist()[i]: [dire_df.role.tolist()[i], dire_df.player_id.tolist()[i]]
                               for i in range(len(dire_df))}

        res_dicts = {}
        res_dicts['match_id'] = match_id
        res_dicts['start_time'] = start_time
        res_dicts['duration'] = duration
        res_dicts['game_mode'] = game_mode
        res_dicts['win_camp'] = win_camp
        res_dicts['radiant_hero'] = hero_positions_radiant
        res_dicts['dire_hero'] = hero_positions_dire
        res_dicts['radiant_score'] = radiant_score
        res_dicts['dire_score'] = dire_score

        row_dict_list.append(res_dicts)
    else:
        print(f"Problem id: {match_id}")

    print(f"current length: {len(row_dict_list)}")
    return row_dict_list, player_dict_list, matches_list

team_dict_list = []
player_dict_list = []
matches_list = []
count = 1
for match_id in match_ids[:3]:
    if count % 1200 == 0:
        time.sleep(65)  # free tier: 60 calls per minute
    count += 1
    team_dict_list, player_dict_list, matches_list = get_hero_position_and_role(
        team_dict_list, player_dict_list, match_id, matches_list)

# team_dict_list[0]
# {'match_id': 7146139538,
# 'start_time': '23-05-08',
# 'duration': 1409,
# 'game_mode': 2,
# 'win_camp': 'dire',
# 'radiant_hero': {'Death Prophet': ['mid', 1459205302],
# 'Jakiro': ['hard_support', 1459031599],
# 'Underlord': ['offlane', 1423366323],
# 'Gyrocopter': ['soft_support', 1458610209],
# 'Naga Siren': ['carry', 1459124273]},
# 'dire_hero': {'Queen of Pain': ['mid', 1526386891],
# 'Disruptor': ['hard_support', 1513040880],
# 'Templar Assassin': ['carry', 1427100407],
# 'Pudge': ['soft_support', 1517400698],
# 'Mars': ['offlane', 1529282248]},
# 'radiant_score': 14,
# 'dire_score': 34}
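Downloading tens of thousands of matches can take hours, so it may be worth dumping intermediate results every so often. A minimal sketch you could call inside the loop above (the checkpoint file names are just placeholders):

import json

def checkpoint(team_dict_list, player_dict_list, matches_list, tag=''):
    # save whatever has been collected so far, so a crash doesn't lose everything
    with open(f'matches_checkpoint{tag}.json', 'w') as f:
        json.dump(matches_list, f)
    pd.DataFrame(team_dict_list).to_csv(f'team_checkpoint{tag}.csv', index=False)
    pd.DataFrame(player_dict_list).to_csv(f'player_checkpoint{tag}.csv', index=False)

# e.g. inside the loop: if count % 1000 == 0: checkpoint(team_dict_list, player_dict_list, matches_list, tag=str(count))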

Now you can save the data locally.

import json

# Combine dictionaries into a list
dict_list = matches_list
len(dict_list)

# Write the list of dictionaries to a JSON file (optional)
with open('data.json', 'w') as f:
    json.dump(dict_list, f)

team_df = pd.DataFrame(team_dict_list)
team_df.to_csv('team.csv', index=False, header=True)

player_df = pd.DataFrame(player_dict_list)
# player_df.head(10)
player_df.to_csv('player.csv', index=False, header=True)

I can share the data I have downloaded. If you would like it, please email me, since the dataset is very large. Later I will try to upload it to a drive link, which will be added here once it is ready.

2. DOTA2 wiki

The DOTA2 wiki is a great source, and I'm grateful to all its contributors. Now I will show you how to download hero attributes using requests and BeautifulSoup. I checked https://dota2.fandom.com/robots.txt and did not find anything that forbids fetching hero information. But if it is not allowed, please let me know and I will delete this immediately.
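If you want to check this programmatically, Python's standard library ships a robots.txt parser; a small sketch:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser('https://dota2.fandom.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'https://dota2.fandom.com/wiki/Table_of_hero_attributes'))
# True means the page is not disallowed for generic crawlers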

Here I basically get the hero attributes from the base URL.

import requests
from bs4 import BeautifulSoup
# reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

urls = ['https://dota2.fandom.com/wiki/Table_of_hero_attributes']
r = requests.get(urls[0])
html_doc = r.content
soup = BeautifulSoup(html_doc, 'html.parser')

attr_list = []

def get_attributes(url, attr_list):
    r = requests.get(url)
    html_doc = r.content
    soup = BeautifulSoup(html_doc, 'html.parser')

    for tr in soup.tbody.find_all('tr')[1:]:
        # use the hero name from the URL for a future merge with other data
        hero_name = tr.a.get('href').split('/')[-1]
        tds = tr.find_all('td')
        attr_dict = {}
        attr_dict['hero'] = hero_name
        attr_dict['base_strength'] = float(tds[2].string)
        attr_dict['add_strength'] = float(tds[3].string)
        attr_dict['lvl30_strength'] = float(tds[4].string)
        attr_dict['base_agility'] = float(tds[5].string)
        attr_dict['add_agility'] = float(tds[6].string)
        attr_dict['lvl30_agility'] = float(tds[7].string)
        attr_dict['base_intelligence'] = float(tds[8].string)
        attr_dict['add_intelligence'] = float(tds[9].string)
        attr_dict['lvl30_intelligence'] = float(tds[10].string)
        attr_list.append(attr_dict)

for url in urls:
    get_attributes(url, attr_list)

print(f"Altogether we have {len(attr_list)} heroes.")
# Altogether we have 124 heroes.

attr_list[2]
# {'hero': 'Ancient_Apparition',
# 'base_strength': 20.0,
# 'add_strength': 1.9,
# 'lvl30_strength': 75.1,
# 'base_agility': 20.0,
# 'add_agility': 2.2,
# 'lvl30_agility': 83.8,
# 'base_intelligence': 23.0,
# 'add_intelligence': 3.4,
# 'lvl30_intelligence': 121.6}

Save the data:

attribute_df = pd.DataFrame(attr_list)
attribute_df.to_csv('hero_attributes.csv', index=False, header=True)
attribute_df
Hero_attributes Result

Next, let's use Scrapy to crawl hero details (URLs, bios, counter relationships, and pros/cons).

To use Scrapy, you need to install it (pip install scrapy). If you follow the official tutorial, you will be able to create a project quickly. There are two important commands.

a. scrapy shell 'https://dota2.fandom.com/wiki/Heroes'

Run this in your terminal. It drops you into the Scrapy shell, where you can freely test CSS/XPath selectors.
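For example, you can try the same selectors that the spider below uses and inspect what they return (the shell exposes response for the fetched page):

# inside the shell started with: scrapy shell 'https://dota2.fandom.com/wiki/Heroes'
rows = response.css('table>tbody>tr')          # the rows of the hero grid
cells = rows.css('td>div>div:first-child')     # one block per hero
cells.css('div>a::attr(title)').getall()[:5]   # first few hero names
cells.css('div>a::attr(href)').get()           # a relative link such as '/wiki/<Hero_name>'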

b. scrapy crawl Heroes -O heroes_url.json

Run this command to crawl the data; you will get a heroes_url.json file containing what you crawled. Remember, "Heroes" here is the "name" attribute of your spider class.

I'll show the scripts below. They should be placed inside the spiders/ folder, and they are simple ones.

# get heroes, links and attributes
import scrapy

class HeroesSpider(scrapy.Spider):
    name = "Heroes"  # pay attention to the name here

    def start_requests(self):
        urls = [
            "https://dota2.fandom.com/wiki/Heroes",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        count = 0

        for heroes in response.css('table>tbody>tr'):
            for hero in heroes.css('td>div>div:first-child'):
                count += 1
                print(hero.css('div>a::attr(title)').get())
                print(count)
                # the heroes appear grouped by primary attribute, so the running
                # count tells us which attribute block we are in
                if count <= 31:
                    attribute = 'Strength'
                elif count <= 62:
                    attribute = 'Agility'
                elif count <= 93:
                    attribute = 'Intelligence'
                else:
                    attribute = 'Universal'

                yield {
                    "hero": hero.css('div>a::attr(href)').get().split('/')[-1],
                    "link": 'https://dota2.fandom.com' + hero.css('div>a::attr(href)').get(),
                    'attribute': attribute,
                }

You need to run the above spider first so that you have the URLs for the following steps.
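Each record in heroes_url.json has the three fields yielded above, so the later spiders simply load the file and build their start URLs from the link field, for example:

import json

with open('heroes_url.json') as f:
    heroes = json.load(f)

print(heroes[0].keys())   # dict_keys(['hero', 'link', 'attribute'])
counter_urls = [h['link'] + '/Counters' for h in heroes]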

# get the heroes and their bios
import scrapy
import json


class HeroBioSpider(scrapy.Spider):
    name = "HeroBio"

    def start_requests(self):
        with open('heroes_url.json') as f:
            urls = [hero['link'] for hero in json.load(f)]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        hero = response.url.split('/')[-1]

        bios = ''
        for bio in response.css('div#heroBio>*'):
            bios += '.\n'.join(bio.css('::text').getall())

        yield {
            "hero": hero,
            "bio": bios,
        }
# get the heroes and their counter relationships
import scrapy
import json


class HeroCounterSpider(scrapy.Spider):
    name = "HeroCounter"

    def start_requests(self):
        with open('heroes_url.json') as f:
            urls = [hero['link'] + '/Counters' for hero in json.load(f)]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        counters_list = []
        hero_name = response.url.split('/')[-2]

        # count the hero entries and record the running count at each section
        # heading, which marks where the next section of the page starts
        count = 0
        for hero in response.css('div.mw-parser-output>*'):
            if hero.css('div>b>a'):
                count += 1
            if hero.css('h2>span'):
                counters_list.append(count)
            elif hero.xpath('.//h2/span[@id="Good_against..."]'):
                counters_list.append(count)
            elif hero.xpath('.//h2/span[@id="Works well with..."]'):
                counters_list.append(count)

        heros_list = response.css('div.mw-parser-output>div>b>a::text').getall()

        bad_against = heros_list[:counters_list[1]]
        good_against = heros_list[counters_list[1]:counters_list[2]]
        work_well_with = heros_list[counters_list[2]:]

        yield {
            "hero": hero_name,
            "bad_against": bad_against,
            "good_against": good_against,
            "work_well_with": work_well_with,
        }
# get the heroes and their pros and cons
import scrapy
import json


class HeroProConSpider(scrapy.Spider):
    name = "HeroProCon"

    def start_requests(self):
        with open('heroes_url.json') as f:
            urls = [hero['link'] + '/Guide' for hero in json.load(f)]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        hero_name = response.url.split('/')[-2]

        pros_css = response.xpath('.//table[tbody/tr/th[contains(text(),"Playstyle")]]//tr[@valign="top"]/td[1]/ul/li')
        pros = '\n'.join([pro.css('::text').get() for pro in pros_css])

        cons_css = response.xpath('.//table[tbody/tr/th[contains(text(),"Playstyle")]]//tr[@valign="top"]/td[2]/ul/li')
        cons = '\n'.join([con.css('::text').get() for con in cons_css])

        yield {
            "hero": hero_name,
            "pros": pros,
            "cons": cons,
        }

These are very basic applications of Scrapy; you can explore further on your own. The project folder stores the crawled JSON files.

And this is the end of PART 1. If you want to merge all the data on a key such as the hero name, here is the code for your reference:

import pandas as pd
import os

pd.set_option('display.max_columns', None)
# my scrapy project is named "heroes"
hero_detail_path = './heroes/heroes/heroes'

pro_con_path = os.path.join(hero_detail_path, 'hero_pro_con.json')
bio_path = os.path.join(hero_detail_path, 'heroes_bios.json')
counter_path = os.path.join(hero_detail_path, 'heroes_counter.json')
url_path = os.path.join(hero_detail_path, 'heroes_url.json')

basic_hero_df = pd.read_csv('dota2_heroes.csv')
attr_df = pd.read_csv('hero_attributes.csv')
hero_pro_con_df = pd.read_json(pro_con_path)
hero_bio_df = pd.read_json(bio_path)
hero_counter_df = pd.read_json(counter_path)
hero_url_df = pd.read_json(url_path)

# normalize Nature's Prophet (its wiki name contains a URL-encoded apostrophe)
hero_pro_con_df.replace('Nature%27s_Prophet', "Nature's_Prophet", inplace=True)
attr_df.replace('Nature%27s_Prophet', "Nature's_Prophet", inplace=True)
hero_bio_df.replace('Nature%27s_Prophet', "Nature's_Prophet", inplace=True)
hero_counter_df.replace('Nature%27s_Prophet', "Nature's_Prophet", inplace=True)
hero_url_df.replace('Nature%27s_Prophet', "Nature's_Prophet", inplace=True)

basic_hero_df.drop(columns=['ingame_name','en_name'], inplace=True)
basic_hero_df['name_id'] = basic_hero_df['official_name'].apply(lambda x:x.lower().replace('-','').replace('_','').replace(' ',''))
attr_df['name_id'] = attr_df['hero'].apply(lambda x:x.lower().replace('-','').replace('_','').replace(' ',''))
hero_pro_con_df['name_id'] = hero_pro_con_df['hero'].apply(lambda x:x.lower().replace('-','').replace('_','').replace(' ',''))
hero_bio_df['name_id'] = hero_bio_df['hero'].apply(lambda x:x.lower().replace('-','').replace('_','').replace(' ',''))
hero_counter_df['name_id'] = hero_counter_df['hero'].apply(lambda x:x.lower().replace('-','').replace('_','').replace(' ',''))
hero_url_df['name_id'] = hero_url_df['hero'].apply(lambda x:x.lower().replace('-','').replace('_','').replace(' ',''))

basic_hero_df = basic_hero_df.merge(hero_pro_con_df, on='name_id').drop(columns=['hero'])
basic_hero_df = basic_hero_df.merge(attr_df, on='name_id').drop(columns=['hero'])
basic_hero_df = basic_hero_df.merge(hero_bio_df, on='name_id').drop(columns=['hero'])
basic_hero_df = basic_hero_df.merge(hero_counter_df, on='name_id').drop(columns=['hero'])
basic_hero_df = basic_hero_df.merge(hero_url_df, on='name_id').drop(columns=['hero'])

basic_hero_df.to_csv('all_heroes.csv', index=False, header=True)
basic_hero_df.head()
Sample Output

This is all for the data collection part. There are a lot of things not included, such as the details of hero mechanics; I will explain them later in PART 2. Thank you for your time! I hope you enjoy data science and DOTA2 :).

It’s so cute to be here.

Please let me know if there are any questions.

Jiayou (keep it up)!
