Using Python to build a statistical model to predict the winner of a professional game of Dota

Bert Miller
Jul 30, 2017 · 15 min read

I began playing Dota almost 5 years ago. I don’t have much time to play now, but I occasionally watch competitive matches. I’m passionate about data science and statistics, and I wondered if I could marry the two. The project was to see if I could build a model that would predict the winner of a competitive Dota match using data on the teams’ past performance. Hopefully, I would be able to give some good predictions on who will win TI.

So you don’t have to look through this entire page, I was able to get 62% of games right overall by looking at the difference between average stats like GPM, XPM, hero damage, etc.

This could probably be improved by looking at series matchups as opposed to individual games within a series. I think this is pretty easy to see: if a best of 3 goes to three games and you correctly predict the series winner but assume they will win every game, you’ll only get 2 of the 3 games (about 66%) right.
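To make that arithmetic concrete, here is a tiny sketch. The function name is mine, and it assumes you always pick the eventual series winner for every game in the series.

```python
def game_accuracy_if_series_winner_picked(winner_games, loser_games):
    """Fraction of individual games predicted correctly when the eventual
    series winner is picked for every game of the series."""
    total_games = winner_games + loser_games
    return winner_games / total_games


# Best of 3 that ends 2-0: every game prediction is right.
print(game_accuracy_if_series_winner_picked(2, 0))

# Best of 3 that goes the distance (2-1): only two of three are right.
print(round(game_accuracy_if_series_winner_picked(2, 1), 3))
```

So even a perfect series-level prediction caps out at roughly 67% game-level accuracy whenever the series goes to three games.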

However, what’s interesting is that when my model predicts that the probability of winning is above 75%, it is correct about 80% of the time. There’s clear tension here, as ideally I should be predicting an 80% probability of winning and being correct 80% of the time, but for a simple logistic regression I’m happy with this result.

Anyway, hope you enjoy.

The Idea

The basic premise was that the better a team is, the better its average stats per game will be. Moreover, I’m assuming that when there is a disparity in some average stat (say, one team has much higher average last hits than another), the team with the better stat has a higher probability of winning the game. The way that I thought to capture this was to look at the difference in average stats. To put it in mathematical terms, my model looks like this:

p(radiant win) = f(β_1*d_assists + β_2*d_denies + … + β_(n-1)*d_tower_damage + β_n*d_xpm)

Where each β_i is a coefficient that we’ll estimate, and d_assists is the radiant team’s average assists minus the dire team’s average assists (and likewise for the other stats).

(For statistically inclined readers, I’m using logistic regression with a binary outcome)

So, what do I actually need to make this model?

  1. A list of teams that I’m looking at
  2. A list of games they have played
  3. Data from these games

I began by simply googling ‘dota 2 professional match data’, ‘dota 2 data’, etc and ended up at datdota. My first impression was that I could use their team performances, but I decided that this wasn’t the right solution for me because it doesn’t include some stats that I could get by pulling match data, like tower and hero damage as well as hero healing. Instead, I chose to grab a lot of matches, get data about them, use that to get the teams’ average stats, and then use the matches from before to train my model. Moreover, I thought that 1000 matches might not be enough for meaningful analysis. To get match data I used this dota 2 API, which proved to be extremely useful.

Step one

The first thing I needed was a list of teams. Dota 2’s API doesn’t take text names like “EG” or “Cloud 9”; instead each team has an ID. If my code is going to be scalable at all I need these IDs. After a bit of toying around I settled on this: I downloaded a CSV of 1000 previous premium matches from datdota. What I’m really interested in here is the list of 1000 match IDs, which we can use to pull team IDs! With the CSV in hand I fired up Python. I’m not going to go through how to use the Dota 2 API in depth, but essentially, you need to get an API key from Steam and then learn the structure of the data it sends back to you. Here’s the actual code I used:

import dota2api
import pprint
import pandas as pd
import numpy as np
pd.set_option('display.expand_frame_repr', False)

api = dota2api.Initialise("STEAM_API_KEY")
matches_data = pd.read_csv("data\\matches_premium.csv", encoding = "ISO-8859-1")

# Start with a single empty row; it gets dropped at the end
dict_info = {'Team_ID': [np.nan], 'Team_name': [np.nan]}
match_info = pd.DataFrame(dict_info)
for i in range(0, len(matches_data)):
    try:
        current_match = matches_data['Match ID'][i]
        print(current_match)
        match = api.get_match_details(match_id = current_match)

        # Fall back to an empty string when the API leaves a field out
        try:
            radiant_current = match['radiant_team_id']
        except:
            radiant_current = ""
        try:
            radiant_name = match['radiant_name']
        except:
            radiant_name = ""
        temp_info1 = {'Team_ID': radiant_current, 'Team_name': radiant_name}
        # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
        match_info = pd.concat([match_info, pd.DataFrame([temp_info1])], ignore_index=True)
        try:
            dire_current = match['dire_team_id']
        except:
            dire_current = ""
        try:
            dire_name = match['dire_name']
        except:
            dire_name = ""
        temp_info2 = {'Team_ID': dire_current, 'Team_name': dire_name}
        match_info = pd.concat([match_info, pd.DataFrame([temp_info2])], ignore_index=True)
    except:
        print("Something went wrong")
match_info = match_info.dropna()
match_info_nodups = match_info.drop_duplicates()
match_info_nodups.to_csv("data\\match_info_data.csv", index=False)

What I’m doing here is iterating through the CSV of 1000 matches we just downloaded and getting the match data for each and every one. I create a new dataframe called match_info and add both the radiant and dire teams’ names and IDs. Sometimes the API has trouble getting a particular match’s data back to you, and sometimes there are blanks for the team name or ID. There are better ways to handle this than I did, but I just wanted to get a list of team IDs and focus on getting data and crunching numbers. Once the loop finishes, I drop the empty row(s) and duplicates and write the result to a CSV. The first rows of your data will look like this…
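As one example of a tidier way to handle the missing fields, here is a minimal sketch using dict.get. It assumes the match details behave like a plain dictionary with the same keys used in the code above; the helper name team_rows and the sample response are made up for illustration.

```python
def team_rows(match):
    """Return one {'Team_ID', 'Team_name'} row per side of a match dict.

    Missing keys come back as None instead of raising KeyError.
    """
    rows = []
    for side in ("radiant", "dire"):
        team_id = match.get(side + "_team_id")
        team_name = match.get(side + "_name")
        rows.append({"Team_ID": team_id, "Team_name": team_name})
    return rows


# Hypothetical, hand-written response with the dire team ID missing.
sample = {
    "radiant_team_id": 15,
    "radiant_name": "LGD-GAMING",
    "dire_name": "Vici Gaming",
}
rows = team_rows(sample)

# Keep only rows where both the ID and the name are present.
complete = [r for r in rows if r["Team_ID"] is not None and r["Team_name"]]
print(complete)
```

This replaces the four try/except blocks with one loop, and incomplete rows can be filtered out before they ever reach the dataframe.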

Team_ID    Team_name
15         LGD-GAMING
2626685    EHOME.KEEN
726228     Vici Gaming
4372042    Team Freedom
2512249    Digital Chaos
3          compLexity Gaming
3715574    SG e-sports team
3477208    Midas Club Elite
..............................

Some data editing is necessary here. Certain teams (Mineski and Infamous) changed their names a few times but kept the same ID. Unfortunately dropping duplicates doesn’t catch this, because the rows differ by name, so I went into my CSV and manually deleted those entries.
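The manual cleanup could also be scripted. Here is a rough sketch in plain Python that keeps one row per Team_ID, with the last name seen winning. The function name and the sample rows are made up for illustration.

```python
def one_name_per_id(rows):
    """Keep a single row per Team_ID; the last name seen wins."""
    latest = {}
    for row in rows:
        latest[row["Team_ID"]] = row["Team_name"]
    return [{"Team_ID": tid, "Team_name": name} for tid, name in latest.items()]


# Made-up IDs and names: team 1 was renamed at some point.
rows = [
    {"Team_ID": 1, "Team_name": "Old Name"},
    {"Team_ID": 2, "Team_name": "Other Team"},
    {"Team_ID": 1, "Team_name": "New Name"},
]
print(one_name_per_id(rows))
```

In pandas, I believe drop_duplicates(subset='Team_ID', keep='last') does the same job on the dataframe directly.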

Great! We have a list of teams and their IDs. In total I have 103. Let’s get some matches.

Step two

I turned again to datdota for help. They have a great head to head performance tool that lists match IDs, but it would take a huge amount of time to go through this manually for every team. Instead, we’re gonna use Python to do this for us by pulling the table details from the page’s HTML. I used this and this to learn how to pull data from HTML tables using Python.

If you’re interested in how this works, here’s a rundown: I load all of the teams and IDs we just saved. I start with the 1st entry and grab the match IDs it shares with the 2nd entry. Then I do the 1st and 3rd, etc, all the way to the 103rd. Then I start again with the 2nd entry and get the match IDs with the 3rd entry, then the 2nd and 4th, etc, and repeat for every combination of teams in our list of 103 teams. This comes out to be 5253 combinations! Anyway, here’s the code:

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
def getMatchIds(team_a, team_b, match_ids):
    try:
        datdota = 'https://www.datdota.com/teams/head-to-head?team-a=' + team_a + '&team-b=' + team_b + '&tier=1&tier=2&valve-event=does-not-matter&patch=7.06&patch=7.05&patch=7.04&patch=7.03&patch=7.02&patch=7.01&patch=7.00&patch=6.88&patch=6.87&patch=6.86&patch=6.85&patch=6.84&patch=6.83&patch=6.82&patch=6.81&patch=6.80&patch=6.79&patch=6.78&winner=either&after=01%2F01%2F2015&before=25%2F07%2F2017&duration=0%3B200'
        r = requests.get(datdota)
        soup = BeautifulSoup(r.text, 'lxml')
        table = soup.find(class_ ='table table-striped table-bordered table-hover data-table')
        # Skip the first two rows of the table; the match ID is in the second column
        for row in table.find_all('tr')[2:]:
            col = row.find_all('td')
            temp_id = col[1].text.strip()
            print(temp_id)
            temp_df = {'match_id': temp_id}
            # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
            match_ids = pd.concat([match_ids, pd.DataFrame([temp_df])], ignore_index=True)
        return match_ids
    except:
        # No table for this pair (or the request failed): return what we have
        return match_ids

dict1 = {'match_id': [np.nan]}
match_ids = pd.DataFrame(dict1)
# match_ids = getMatchIds(team_a, team_b, match_ids)

team_ids = pd.read_csv("data\\match_info_data.csv", encoding = "ISO-8859-1")
# print(team_ids)

# Visit every unordered pair of teams exactly once
for i in range(0, len(team_ids)):
    for x in range (i + 1, len(team_ids)):
        team_a = team_ids.loc[i, 'Team_ID']
        team_b = team_ids.loc[x, 'Team_ID']
        match_ids = getMatchIds(str(team_a), str(team_b), match_ids)
match_ids = match_ids.dropna(how ='all')
print(match_ids)
match_ids.to_csv("data\\match_ids.csv", index = False)

This returned 4916 matches for me after 11.5 minutes. Awesome. Onto the next step.
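As a sanity check on the pairing logic, the nested loop visits every unordered pair of teams once. A short sketch with itertools shows the same enumeration and confirms the count for 103 teams; the IDs here are stand-ins, not the real ones from the CSV.

```python
from itertools import combinations
from math import comb

team_ids = list(range(103))  # stand-in IDs; the real ones come from the CSV

# Same pairs as the nested loop: (i, x) with x > i.
pairs = list(combinations(team_ids, 2))

print(len(pairs))
print(comb(103, 2))
```

Both numbers come out to 5253, matching the count above.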

Step three

Time to get data about the matches! I’m using the dota2api I linked earlier and I highly suggest that you spend time just tinkering around with this before you try anything serious. It took me a bit before I figured out a system that worked for me. Here’s a rundown of what I’m doing:

I load all of the match IDs from the CSV we just generated and call the dota2api to get the match details for each of them. It responds with a whole lot of data, and from that I’m grabbing the variables below. One thing to note is that each team statistic is the sum over the five players on the team, not a per-player average.

# MATCH
first_blood = 0
match_duration = 0
match_winner = 0 #<-- 1 if radiant wins, 0 if dire wins
# RADIANT
radiant_assists = 0
radiant_denies = 0
radiant_gpm = 0
radiant_healing = 0
radiant_hero_damage = 0
radiant_kills = 0
radiant_last_hits = 0
radiant_name = ""
radiant_total_levels = 0
radiant_tower_damage = 0
radiant_xpm = 0
radiant_barracks = 0 # <-- status of the barracks
# DIRE
dire_assists = 0
dire_denies = 0
dire_gpm = 0
dire_healing = 0
dire_hero_damage = 0
dire_kills = 0
dire_last_hits = 0
dire_name = ""
dire_total_levels = 0
dire_tower_damage = 0
dire_xpm = 0
dire_barracks = 0 # <-- status of the barracks
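To show what “sum over the five players” means in practice, here is a compact sketch. The player keys are the ones the API returns in my code below. The STAT_KEYS mapping, the team_totals helper and the fake players are made up for illustration.

```python
# Column name on the left, per-player key in the API response on the right.
STAT_KEYS = {
    "assists": "assists",
    "denies": "denies",
    "gpm": "gold_per_min",
    "healing": "hero_healing",
    "hero_damage": "hero_damage",
    "kills": "kills",
    "last_hits": "last_hits",
    "total_levels": "level",
    "tower_damage": "tower_damage",
    "xpm": "xp_per_min",
}


def team_totals(players, prefix):
    """Sum each stat over one team's players and prefix the column names."""
    return {
        f"{prefix}_{column}": sum(p[key] for p in players)
        for column, key in STAT_KEYS.items()
    }


# Two made-up players with every stat set to 1 and 2 respectively.
fake_players = [{key: n for key in STAT_KEYS.values()} for n in (1, 2)]
totals = team_totals(fake_players, "radiant")
print(totals["radiant_gpm"])
```

With a real response you would pass match['players'][:5] for radiant and match['players'][5:] for dire, which is how my loop below splits them. Dividing each total by the number of players would give a per-player average instead.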

I’m grabbing a lot of data and a lot of matches, so this will take a while. Here’s my actual code:

import dota2api
import pprint
import pandas as pd
import numpy as np
pd.set_option('display.expand_frame_repr', False)

api = dota2api.Initialise("STEAM_API_KEY")
matches_data = pd.read_csv("data\\match_ids.csv")

# Add one empty column per stat we are going to fill in
matches_data['Radiant'] = ''
matches_data['Dire'] = ''
matches_data['radiant_assists'] = 0
matches_data['radiant_barracks'] = 0
matches_data['radiant_denies'] = 0
matches_data['radiant_gpm'] = 0
matches_data['radiant_healing'] = 0
matches_data['radiant_hero_damage'] = 0
matches_data['radiant_kills'] = 0
matches_data['radiant_last_hits'] = 0
matches_data['radiant_total_levels'] = 0
matches_data['radiant_tower_damage'] = 0
matches_data['radiant_tower_status'] = 0
matches_data['radiant_xpm'] = 0
matches_data['dire_assists'] = 0
matches_data['dire_barracks'] = 0
matches_data['dire_denies'] = 0
matches_data['dire_gpm'] = 0
matches_data['dire_healing'] = 0
matches_data['dire_hero_damage'] = 0
matches_data['dire_kills'] = 0
matches_data['dire_last_hits'] = 0
matches_data['dire_total_levels'] = 0
matches_data['dire_tower_damage'] = 0
matches_data['dire_tower_status'] = 0
matches_data['dire_xpm'] = 0
matches_data['first_blood'] = 0
matches_data['match_duration'] = 0
matches_data['radiant_winner'] = 0
for i in range(0, len(matches_data)):
    # MATCH
    first_blood = 0
    match_duration = 0
    match_winner = 0
    # RADIANT
    radiant_assists = 0
    radiant_denies = 0
    radiant_gpm = 0
    radiant_healing = 0
    radiant_hero_damage = 0
    radiant_kills = 0
    radiant_last_hits = 0
    radiant_name = ""
    radiant_total_levels = 0
    radiant_tower_damage = 0
    radiant_xpm = 0 #26905 xp for 25
    radiant_barracks = 0
    # DIRE
    dire_assists = 0
    dire_denies = 0
    dire_gpm = 0
    dire_healing = 0
    dire_hero_damage = 0
    dire_kills = 0
    dire_last_hits = 0
    dire_name = ""
    dire_total_levels = 0
    dire_tower_damage = 0
    dire_xpm = 0
    dire_barracks = 0
    current_match = matches_data['match_id'][i]
    print(current_match)
    try:
        match = api.get_match_details(match_id = current_match)
        # Players 0-4 are summed as radiant
        for x in range(0, 5):
            radiant_assists += match['players'][x]['assists']
            radiant_denies += match['players'][x]['denies']
            radiant_gpm += match['players'][x]['gold_per_min']
            radiant_healing += match['players'][x]['hero_healing']
            radiant_hero_damage += match['players'][x]['hero_damage']
            radiant_kills += match['players'][x]['kills']
            radiant_last_hits += match['players'][x]['last_hits']
            radiant_total_levels += match['players'][x]['level']
            radiant_tower_damage += match['players'][x]['tower_damage']
            radiant_xpm += match['players'][x]['xp_per_min']
        radiant_barracks = match['barracks_status_radiant']
        radiant_name = match['radiant_name']
        radiant_tower_status = match['tower_status_radiant']
        matches_data.loc[i, 'radiant_assists'] = radiant_assists
        matches_data.loc[i, 'radiant_barracks'] = radiant_barracks
        matches_data.loc[i, 'radiant_denies'] = radiant_denies
        matches_data.loc[i, 'radiant_gpm'] = radiant_gpm
        matches_data.loc[i, 'radiant_healing'] = radiant_healing
        matches_data.loc[i, 'radiant_hero_damage'] = radiant_hero_damage
        matches_data.loc[i, 'radiant_kills'] = radiant_kills
        matches_data.loc[i, 'radiant_last_hits'] = radiant_last_hits
        matches_data.loc[i, 'Radiant'] = radiant_name
        matches_data.loc[i, 'radiant_total_levels'] = radiant_total_levels
        matches_data.loc[i, 'radiant_tower_status'] = radiant_tower_status
        matches_data.loc[i, 'radiant_tower_damage'] = radiant_tower_damage
        matches_data.loc[i, 'radiant_xpm'] = radiant_xpm
        # Players 5-9 are summed as dire
        for x in range(5, 10):
            dire_assists += match['players'][x]['assists']
            dire_denies += match['players'][x]['denies']
            dire_gpm += match['players'][x]['gold_per_min']
            dire_healing += match['players'][x]['hero_healing']
            dire_hero_damage += match['players'][x]['hero_damage']
            dire_kills += match['players'][x]['kills']
            dire_last_hits += match['players'][x]['last_hits']
            dire_total_levels += match['players'][x]['level']
            dire_tower_damage += match['players'][x]['tower_damage']
            dire_xpm += match['players'][x]['xp_per_min']
        dire_barracks = match['barracks_status_dire']
        dire_name = match['dire_name']
        dire_tower_status = match['tower_status_dire']
        matches_data.loc[i, 'dire_assists'] = dire_assists
        matches_data.loc[i, 'dire_barracks'] = dire_barracks
        matches_data.loc[i, 'dire_denies'] = dire_denies
        matches_data.loc[i, 'dire_gpm'] = dire_gpm
        matches_data.loc[i, 'dire_healing'] = dire_healing
        matches_data.loc[i, 'dire_hero_damage'] = dire_hero_damage
        matches_data.loc[i, 'dire_kills'] = dire_kills
        matches_data.loc[i, 'dire_last_hits'] = dire_last_hits
        matches_data.loc[i, 'Dire'] = dire_name
        matches_data.loc[i, 'dire_total_levels'] = dire_total_levels
        matches_data.loc[i, 'dire_tower_damage'] = dire_tower_damage
        matches_data.loc[i, 'dire_tower_status'] = dire_tower_status
        matches_data.loc[i, 'dire_xpm'] = dire_xpm
        first_blood = match['first_blood_time']
        match_duration = match['duration']
        if (match['radiant_win'] == True):
            match_winner = 1
        else:
            match_winner = 0
        matches_data.loc[i, 'first_blood'] = first_blood
        matches_data.loc[i, 'match_duration'] = match_duration
        matches_data.loc[i, 'radiant_winner'] = match_winner
    except:
        print("Couldn't get data for ", current_match)
pprint.pprint(matches_data.head(10))
matches_data.to_csv("data\\full_data.csv", index=False)

It took me 35 minutes to complete this and after removing entries with blanks for team name I got 4558 matches. Not bad.


Part 2

What do we have left? A few things. Here’s the plan of attack:

  1. Get team averages
  2. Find the difference between the two teams’ average stats for each match
  3. See if we can predict who won a game of Dota based off that info

Simple enough. Let’s begin. Team averages are remarkably easy to get because pandas has a handy function called pivot_table, essentially the same as a pivot table in Excel. Again, I’m not going to get into the details of how this works, but here’s my code:

import pprint
import pandas as pd
import numpy as np
pd.set_option('display.expand_frame_repr', False)

# Load the same file twice: one copy becomes the radiant rows, the other the dire rows
matches_data_radiant = pd.read_csv("data\\full_data.csv", encoding = "ISO-8859-1")
matches_data_radiant.insert(3, 'is_radiant', 1)
matches_data_dire = pd.read_csv("data\\full_data.csv", encoding = "ISO-8859-1")
matches_data_dire.insert(3, 'is_radiant', 0)
del matches_data_radiant['Dire']
del matches_data_radiant['dire_assists']
del matches_data_radiant['dire_barracks']
del matches_data_radiant['dire_denies']
del matches_data_radiant['dire_gpm']
del matches_data_radiant['dire_healing']
del matches_data_radiant['dire_hero_damage']
del matches_data_radiant['dire_kills']
del matches_data_radiant['dire_last_hits']
del matches_data_radiant['dire_total_levels']
del matches_data_radiant['dire_tower_damage']
del matches_data_radiant['dire_tower_status']
del matches_data_radiant['dire_xpm']
del matches_data_dire['Radiant']
del matches_data_dire['radiant_assists']
del matches_data_dire['radiant_barracks']
del matches_data_dire['radiant_denies']
del matches_data_dire['radiant_gpm']
del matches_data_dire['radiant_healing']
del matches_data_dire['radiant_hero_damage']
del matches_data_dire['radiant_kills']
del matches_data_dire['radiant_last_hits']
del matches_data_dire['radiant_total_levels']
del matches_data_dire['radiant_tower_damage']
del matches_data_dire['radiant_tower_status']
del matches_data_dire['radiant_xpm']
# Give both halves the same column names so they can be stacked
matches_data_radiant = matches_data_radiant.rename(columns = {'Radiant' : 'team', 'radiant_assists' : 'assists', 'radiant_barracks' : 'barracks', 'radiant_denies' : 'denies', 'radiant_gpm' : 'gpm', 'radiant_healing' : 'healing', 'radiant_hero_damage' : 'hero_damage','radiant_kills' : 'kills', 'radiant_last_hits' : 'last_hits', 'radiant_total_levels' : 'total_levels', 'radiant_tower_damage' : 'tower_damage', 'radiant_tower_status' : 'tower_status', 'radiant_xpm' : 'xpm'})
matches_data_dire = matches_data_dire.rename(columns = {'Dire' : 'team', 'dire_assists' : 'assists', 'dire_barracks' : 'barracks', 'dire_denies' : 'denies', 'dire_gpm' : 'gpm', 'dire_healing' : 'healing', 'dire_hero_damage' : 'hero_damage','dire_kills' : 'kills', 'dire_last_hits' : 'last_hits', 'dire_total_levels' : 'total_levels', 'dire_tower_damage' : 'tower_damage', 'dire_tower_status' : 'tower_status', 'dire_xpm' : 'xpm'})

matches_data_both = pd.concat([matches_data_radiant, matches_data_dire])

# One row per team, averaging every remaining column
matches_pivot = matches_data_both.pivot_table(index = ['team'], margins = True , aggfunc = np.mean)
del matches_pivot['match_id']
del matches_pivot['barracks']
del matches_pivot['tower_status']
matches_pivot = matches_pivot.drop(['All'])
# pprint.pprint(matches_pivot)
matches_pivot.to_csv("data\\team_averages.csv")

The output looks like this

Great! On to the next step.

Finding the difference between teams’ average stats

We’re going to take the matches data from earlier and look up each team’s stats in the table we just generated. The dire team’s averages are subtracted from the radiant team’s and added to a new dataframe, which we’ll save to a CSV and use to make predictions. I started by, as always, loading some CSVs from before into Python. I then make a new dataframe called matchup_data which pulls the match_id, team names, and whether radiant won or not. A function to look up the teams’ stats is defined using a large map that points each team name to its row in our team_averages file. Then, I subtract their stats and add them to matchup_data. Here’s the code:

import pprint
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score , confusion_matrix
pd.set_option('display.expand_frame_repr', False)

team_data = pd.read_csv("data\\team_averages.csv", encoding = "ISO-8859-1")
matches_data = pd.read_csv("data\\full_data.csv", encoding = "ISO-8859-1")
# .copy() so the new diff_ columns are added to an independent dataframe, not a view
matchup_data = matches_data[['match_id', 'Radiant', 'Dire', 'radiant_winner']].copy()
matchup_data['diff_assists'] = 0
matchup_data['diff_denies'] = 0
matchup_data['diff_first_blood'] = 0
matchup_data['diff_gpm'] = 0
matchup_data['diff_healing'] = 0
matchup_data['diff_hero_damage'] = 0
matchup_data['diff_kills'] = 0
matchup_data['diff_last_hits'] = 0
matchup_data['diff_match_duration'] = 0
matchup_data['diff_total_levels'] = 0
matchup_data['diff_tower_damage'] = 0
matchup_data['diff_xpm'] = 0
def find_team_difference(team1, team2, i):
    # Maps each team name to its row index in team_averages.csv
    team_map = {'4 protect five' : 0, '458 Production' : 1, 'ANARCHY' : 2, 'Alliance' : 3, 'Beware.MCSFi' : 4, 'Brodie' : 5, 'CDEC Gaming' : 6, 'CRYPTEX' : 7, 'Click And Seach' : 8, 'Cloud 9' : 9, 'Clutch Gamers' : 10, 'Colombia DotA' : 11, 'Comanche' : 12, 'Cool Beans' : 13, 'Crescendo' : 14, 'Cyber Anji' : 15, 'Danish Bears' : 16, 'Digital Chaos' : 17, 'Double Dimension' : 18, 'EHOME' : 19, 'EHOME.KEEN' : 20, 'Effect' : 21, 'Elements Pro Gaming' : 22, 'Elite Wolves ' : 23, 'Elite Wolves D2' : 24, 'Evil Geniuses' : 25, 'Execration' : 26, 'FILLER PICK' : 27, 'FTD.a' : 28, 'Faceless' : 29, 'FireDota' : 30, 'FlipSid3 Tactics' : 31, 'Fnatic' : 32, 'Gambit Esports' : 33, 'Geek Fam' : 34, 'Guess' : 35, 'Happy Feet' : 36, 'Harlie is Back!' : 37, 'Infamous -_^' : 38, 'Infamous ¯\_(?)_/¯' : 39, 'Infamous?' : 40, 'Infamous´' : 41, 'Intelligence Quotient xD' : 42, 'Invictus Gaming' : 43, 'LGD-GAMING' : 44, 'LGD.Forever Young' : 45, 'M19' : 46, 'MAD KINGS-' : 47, 'MVP.Revolution' : 48, 'Midas Club Elite' : 49, 'Mineski' : 50, 'Mineski.GGNetwork' : 51, 'Moogle' : 52, 'Natus Vincere' : 53, 'Newbee' : 54, 'Ninjas in Pyjamas' : 55, 'OG Dota2' : 56, 'OverPower' : 57, 'PENTA Sports' : 58, 'Pacific Blue' : 59, 'Pacific Red' : 60, 'Planet Dog' : 61, 'Planet Odd' : 62, 'Prodota GaminG' : 63, 'Prodota Gaming' : 64, 'RRnatics' : 65, 'Rex Regum QEON' : 66, 'SG e-sports' : 67, 'SG e-sports team' : 68, 'STARS e-Sports' : 69, 'Signature.Dota2' : 70, 'Skatemasters' : 71, 'Skyville!' : 72, 'StarGameRage' : 73, 'TEAM MAX' : 74, 'THUNDERBIRDS' : 75, 'TNC Pro Team' : 76, 'TORO GAMING' : 77, 'Team Bazaar' : 78, 'Team EVOS' : 79, 'Team Empire' : 80, 'Team Freedom' : 81, 'Team Liquid' : 82, 'Team Maven' : 83, 'Team NP' : 84, 'Team Random' : 85, 'Team Red' : 86, 'Team Secret' : 87, 'Team Singularity' : 88, 'Team VGJ' : 89, 'Team. Spirit' : 90, 'The Imperial' : 91, 'Trust' : 92, 'Union Gaming Bolivia' : 93, 'Unknown Team' : 94, 'Vega Squadron' : 95, 'Vegetables Esports Club' : 96, 'Vici Gaming' : 97, 'Vici Gaming Potential' : 98, 'Virtus.pro' : 99, 'WarriorsGaming.Unity' : 100, 'WarriorsGaming.Youth' : 101, 'Wheel Whreck While Whistling' : 102, 'Zenith' : 103, 'anthrax' : 104, 'compLexity Gaming' : 105, 'iG.Vitality' : 106, 'is GG' : 107, 'mousesports' : 108}
    team1_ix = team_map[team1]
    team2_ix = team_map[team2]
    # team1 is radiant, team2 is dire: each difference is radiant minus dire
    matchup_data.loc[i, 'diff_assists'] = team_data.loc[team1_ix, 'assists'] - team_data.loc[team2_ix, 'assists']
    matchup_data.loc[i, 'diff_denies'] = team_data.loc[team1_ix, 'denies'] - team_data.loc[team2_ix, 'denies']
    matchup_data.loc[i, 'diff_first_blood'] = team_data.loc[team1_ix, 'first_blood'] - team_data.loc[team2_ix, 'first_blood']
    matchup_data.loc[i, 'diff_gpm'] = team_data.loc[team1_ix, 'gpm'] - team_data.loc[team2_ix, 'gpm']
    matchup_data.loc[i, 'diff_healing'] = team_data.loc[team1_ix, 'healing'] - team_data.loc[team2_ix, 'healing']
    matchup_data.loc[i, 'diff_hero_damage'] = team_data.loc[team1_ix, 'hero_damage'] - team_data.loc[team2_ix, 'hero_damage']
    matchup_data.loc[i, 'diff_kills'] = team_data.loc[team1_ix, 'kills'] - team_data.loc[team2_ix, 'kills']
    matchup_data.loc[i, 'diff_last_hits'] = team_data.loc[team1_ix, 'last_hits'] - team_data.loc[team2_ix, 'last_hits']
    matchup_data.loc[i, 'diff_match_duration'] = team_data.loc[team1_ix, 'match_duration'] - team_data.loc[team2_ix, 'match_duration']
    matchup_data.loc[i, 'diff_total_levels'] = team_data.loc[team1_ix, 'total_levels'] - team_data.loc[team2_ix, 'total_levels']
    matchup_data.loc[i, 'diff_tower_damage'] = team_data.loc[team1_ix, 'tower_damage'] - team_data.loc[team2_ix, 'tower_damage']
    matchup_data.loc[i, 'diff_xpm'] = team_data.loc[team1_ix, 'xpm'] - team_data.loc[team2_ix, 'xpm']

for x in range(0, len(matchup_data)):
    print(x, " of ", len(matchup_data))
    try:
        find_team_difference(matchup_data.loc[x,'Radiant'], matchup_data.loc[x, 'Dire'], x)
    except:
        print("Not able to find ", x)
matchup_data.to_csv("data\\matchup_data.csv", index = False)

This outputs a table that looks like this (but 4557 rows long!)
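The hard-coded team map works but is brittle. As a rough alternative, here is a sketch that keys the averages by team name so no index map is needed. The function name, the placeholder teams and the numbers are all made up, and this is not the code I actually ran.

```python
def diff_features(radiant, dire, averages, stats):
    """Radiant average minus dire average for each stat, keyed as diff_<stat>."""
    return {
        f"diff_{stat}": averages[radiant][stat] - averages[dire][stat]
        for stat in stats
    }


# Made-up averages for two placeholder teams.
averages = {
    "Team A": {"gpm": 2500.0, "kills": 28.0},
    "Team B": {"gpm": 2300.0, "kills": 30.0},
}

features = diff_features("Team A", "Team B", averages, ["gpm", "kills"])
print(features)
```

With pandas, reading team_averages.csv with the team column as the index should allow the same name-based lookup through .loc.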

Finally, on to predictions

Things are pretty straightforward from here. I’m using logistic regression from sklearn. I load in our matchup data, split it into train and test sets, feed it into a logistic regression model, and ask it for an accuracy score. Here’s the code:

import pprint
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score , confusion_matrix
matchup_data = pd.read_csv("data\\matchup_data.csv", encoding = "ISO-8859-1")

# Features are the twelve stat differences; the target is whether radiant won
x_data = matchup_data[['diff_assists', 'diff_denies', 'diff_first_blood', 'diff_gpm', 'diff_healing', 'diff_hero_damage', 'diff_kills', 'diff_last_hits', 'diff_match_duration', 'diff_total_levels', 'diff_tower_damage', 'diff_xpm']]
y_data = matchup_data.radiant_winner
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size = 0.33, random_state = 12)
model = LogisticRegression()
model.fit(x_train, y_train)
prediction = dict()
prediction['Logistic'] = model.predict(x_test)
print('Log: ',accuracy_score(y_test, prediction['Logistic']))
print(model.coef_)
conf_mat_logist = confusion_matrix(y_test, prediction['Logistic'])
print('Logist \n', conf_mat_logist)

Which gives us…

An accuracy of 62% isn’t bad! Judging by the confusion matrix, it looks like we do slightly worse at predicting radiant wins than radiant losses. The model coefficients sort of represent how impactful each variable was. Ordered from largest to smallest, here they are:

Variable               Coefficient
diff_assists            0.02543
diff_kills              0.02189
diff_xpm                0.00432
diff_match_duration     0.00287
diff_gpm                0.00041
diff_tower_damage       0.00022
diff_first_blood        0.00019
diff_healing           -0.00002
diff_hero_damage       -0.00003
diff_last_hits         -0.00090
diff_denies            -0.01385
diff_total_levels      -0.11222

Keep in mind the scale of these variables as well. There is far more variation in tower damage and GPM than there is in kills, so a small coefficient on a large-scale variable can still matter a lot. When taking that into account it seems that GPM, hero damage, and tower damage are the best predictors of which team will win.
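One rough way to put the coefficients on a comparable footing is to multiply each one by the standard deviation of its variable, which gives the effect of a one-standard-deviation change. The sketch below uses two coefficients from the table, but the feature values are made up, so the output is purely illustrative.

```python
from statistics import pstdev


def scaled_effects(coefficients, columns):
    """Coefficient times the feature's standard deviation, per feature."""
    return {
        name: coefficients[name] * pstdev(values)
        for name, values in columns.items()
    }


# Two coefficients taken from the table above.
coefficients = {"diff_kills": 0.02189, "diff_gpm": 0.00041}

# Made-up feature values, NOT real data: GPM differences vary far more than kills.
columns = {
    "diff_kills": [-2.0, 0.0, 2.0],
    "diff_gpm": [-300.0, 0.0, 300.0],
}

effects = scaled_effects(coefficients, columns)
print(effects)
```

With these invented spreads the GPM effect ends up larger than the kills effect even though its raw coefficient is much smaller. Standardizing the features before fitting would be another way to get at the same thing.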

These coefficients can be used to get the actual probability of a team winning in a matchup. I used logistic regression, so the formula is…

p(radiant_win) = 1 / (1 + e^-(0.02543*diff_assists + 0.02189*diff_kills + … - 0.11222*diff_total_levels))

Not very pretty.
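In code it reads a little better. This is a minimal sketch of the formula with only two of the twelve coefficients filled in. The function name is mine, and the intercept argument is an assumption: sklearn fits an intercept by default (stored in model.intercept_), which the formula above leaves out.

```python
import math


def radiant_win_probability(diffs, coefficients, intercept=0.0):
    """Logistic function applied to the weighted sum of stat differences."""
    z = intercept + sum(coefficients[name] * diffs[name] for name in coefficients)
    return 1.0 / (1.0 + math.exp(-z))


# Two of the twelve coefficients from the table; the rest are omitted here.
coefficients = {"diff_assists": 0.02543, "diff_kills": 0.02189}

# Evenly matched teams: every difference is zero.
even_match = {"diff_assists": 0.0, "diff_kills": 0.0}
print(radiant_win_probability(even_match, coefficients))

# Hypothetical matchup where radiant averages more assists and kills.
favoured = {"diff_assists": 10.0, "diff_kills": 5.0}
print(radiant_win_probability(favoured, coefficients))
```

With all differences at zero and no intercept this gives exactly 0.5. The model’s predict_proba method returns the same kind of probability directly.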

If we open up our matchup data CSV in Excel we can look at the predicted probability of a win for each match. Here’s what it looks like, sorted from highest to lowest probability of winning. The column ‘correct’ checks whether our prediction was above 50% and radiant won, or our prediction was below 50% and radiant lost.

If I look only at games where one team (either radiant or dire) has a 75% or higher predicted probability of winning, my model has an 82% accuracy. This means that I’m underestimating the favored team’s probability of winning by a bit. In an ideal world these metrics would match up. But given that this is the first draft of my model I’m happy with my results.
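I did this check in a spreadsheet, but it can be sketched in a few lines of Python. The function name, the threshold handling and the sample numbers are made up for illustration.

```python
def confident_accuracy(probabilities, radiant_won, threshold=0.75):
    """Accuracy over games where either side is favoured at >= threshold.

    probabilities are predicted radiant win probabilities; radiant_won is 1/0.
    Returns None when no game clears the threshold.
    """
    hits = 0
    total = 0
    for p, won in zip(probabilities, radiant_won):
        if p >= threshold or p <= 1.0 - threshold:
            total += 1
            predicted_radiant = p >= 0.5
            hits += int(predicted_radiant == bool(won))
    return hits / total if total else None


# Made-up predictions and outcomes (1 = radiant won).
probs = [0.80, 0.90, 0.20, 0.55, 0.10]
wins = [1, 0, 0, 1, 0]
print(confident_accuracy(probs, wins))
```

In this toy data four games clear the 75% bar and three of them are called correctly, so the function returns 0.75. Comparing that number with the average predicted probability in the same bucket is a quick calibration check.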


This could be improved in a number of ways as well. The obvious improvement (which I’ll get to work on) is predicting the winner of a series as opposed to individual games. Past that, some extra data could prove helpful. I grabbed data on the status of barracks and towers in each game but didn’t do anything with them. A ‘towers destroyed’ and ‘barracks destroyed’ data point could be cool. Match_duration could be split up into winning duration and losing duration. First blood right now is just a statistic of when first blood takes place on average, but doesn’t specify which team got first blood. Maybe including how often a team is radiant/dire would be useful as well. There are a few outliers here or there in my dataset as well. Eliminating those should be helpful.
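For the series idea, one simple starting point is to turn a per-game win probability into a best-of-3 win probability. This sketch assumes the games are independent and that the same probability applies to every game, which ignores things like side selection and drafts.

```python
def best_of_three_win_probability(p):
    """Chance of taking a best of 3 given a per-game win probability p.

    Assumes games are independent and p is the same in every game.
    """
    # Win 2-0, or split the first two games (two orders) and win the third.
    return p * p + 2 * p * p * (1 - p)


print(best_of_three_win_probability(0.5))
print(round(best_of_three_win_probability(0.6), 3))
```

An even matchup stays at 0.5, while a 60% per-game favourite becomes roughly a 65% series favourite, so a series format stretches whatever edge the model finds.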
