Counter-Strike match result prediction

Jeferson Machado Santos
Analytics Vidhya


Part 1: web scraping the data

Some friends of mine enjoy spending whole nights playing Counter-Strike on their computers. The game is also played by professional e-sports players and has well-organized tournaments and matches whose data are available online. When I was starting my studies in data science, it seemed like a good idea to develop a machine learning model that predicts the result of matches based on player statistics.

According to its Wikipedia page, Counter-Strike is an objective-based, multiplayer first-person shooter. Two opposing teams — the Terrorists and the Counter Terrorists — compete in game modes to complete objectives, such as securing a location to plant or defuse a bomb and rescuing or guarding hostages.

HLTV.org is a website that gathers match results, tournaments, historical statistics of players and teams, and schedules for future matches, as well as live broadcasts of matches. So, this will be the data source for our data science project.

This is the first of a series of three posts about gathering the data, developing the machine learning model, and deploying it. In this post I describe how I collected historical data on players and matches by web scraping the hltv.org website with Python.

In the results section of hltv.org, we can find results of matches played as far back as 2012. When we click on a match link, we see data such as the date of the match, the final result, the teams, the time, and the maps played, as well as the individual statistics of the players in that match. Accessing each player's link, we can see general statistics from that player's match history. So, the idea of this web scraping project is to access each match from the results page, collect data from the match, then access the statistics of each player who played that match and collect their historical statistics up to the day before the date of the match. From a machine learning perspective, data up to the day before the match is what would actually be available to predict a future match. By the end of this project, we should have a matches database and a players database structured like this:

Matches database
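In outline, the matches database will have one row per match and the players database one row per player per match, with the following columns (as assembled in the DataFrames later in this post):

# columns of the matches database (one row per match)
matches_columns = ['Date', 'Team1', 'Team2', 'Final Result 1', 'Final Result 2',
                   'Tournament', 'Link Stats']

# columns of the players database (one row per player per match)
players_columns = ['Date', 'Team1', 'Team2', 'Final Result 1', 'Final Result 2',
                   'Tournament', 'Player Team', 'Player', 'KD', 'ADR', 'KAST',
                   'Rating', 'Map', 'Overall Kills', 'Overall Deaths',
                   'Overall Kill / Death', 'Overall Kill / Round',
                   'Overall Rounds with Kills', 'Overall Kill - Death Diff',
                   'Opening Total Kills', 'Opening Total Deaths', 'Opening Kill Ratio',
                   'Opening Kill rating', 'Opening Team win percent after 1st kill',
                   'Opening 1st kill in won rounds']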

This project was developed in a Jupyter notebook, and if you don't want to read this whole post, you can download the complete notebook from my GitHub repository here. Now we will go through the sections of code used to scrape our data.

from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

We start by importing the resources and libraries that will be necessary. Webdriver allows us to open a browser and control it from our Python code. With BeautifulSoup we can read the content of a webpage controlled by Webdriver and iterate through its content. We will also need pandas to deal with dataframes.

driver = webdriver.Chrome()  # requires ChromeDriver installed and available on the PATH

date=[]
team1=[]
team2=[]
finalResult1=[]
finalResult2=[]
tournament=[]
linkStatsList=[]
playersPlayers=[]
kdPlayers=[]
adrPlayers=[]
kastPlayers=[]
ratingPlayers=[]
datePlayers=[]
team1Players=[]
team2Players=[]
finalResult1Players=[]
finalResult2Players=[]
tournamentPlayers=[]
playerteamPlayers=[]
mapPlayers = []
overallKillsPlayers = []
overallDeathsPlayers = []
overallKill_DeathPlayers = []
overallKill_RoundPlayers = []
overallRoundsWithKillsPlayers = []
overallKillDeathDiffPlayers = []
openingTotalKillsPlayers = []
openingTotalDeathsPlayers = []
openingKillRatioPlayers = []
openingKillRatingPlayers = []
openingTeamWinPercentAfterFirstKillPlayers = []
openingFirstKillInWonRoundsPlayers = []
months = {
    'January': '01',
    'February': '02',
    'March': '03',
    'April': '04',
    'May': '05',
    'June': '06',
    'July': '07',
    'August': '08',
    'September': '09',
    'October': '10',
    'November': '11',
    'December': '12'
}

Before starting the actual web scraping, we prepare everything that will be necessary for the process. We start the Webdriver tool by setting it to use Google Chrome and assign it to a variable called 'driver'. Next, we create empty lists for all the columns we want in our final dataframes, one for each piece of data we want to collect. As the code passes through the pages, the collected data will be appended to these lists. By the end of the process, they will be combined to form dataframes. Every list whose name ends with 'Players' refers to data that will be collected from the players' statistics pages, while the others refer to the matches played. We also create a dictionary mapping month names to their numbers, since we will need it to read the date of a match and then search for statistics up to the previous day, which uses months in numeric format.

page = 0
while page <= 99:
    matchesLinks = []
    driver.get('https://www.hltv.org/results?offset=' + str(page))
    content = driver.page_source
    soup = BeautifulSoup(content)
    for div in soup.findAll('div', attrs={'class': 'results'}):
        for a in div.findAll('a', attrs={'class': 'a-reset'}):
            link = a['href']
            matchesLinks.append(link)

    # ... all the code shown in the rest of this post also runs inside this loop ...

    page += 100

We start by defining a variable 'page' and assigning it the value zero. Since there are many result pages, with data going back to 2012, we can iterate through these pages with a while loop, adding 100 to the page variable at each iteration. In this case, we make the loop stop when page is greater than 99 because we want to scrape data only from the first results page. All the code from now on goes inside this while loop. First, a list called 'matchesLinks' is created; it will store the links to each match page to be accessed later. We use our 'driver' variable and its 'get' method to open the hltv website in Google Chrome. The address is formed by concatenating 'https://www.hltv.org/results?offset=' and the value of the page variable converted to a string. So, each time we add 100 to our page variable, we open a different results page to be scraped.

We then store the page content in a variable 'content' and read it with BeautifulSoup, storing the result in a variable called 'soup'. By analyzing the page with the browser's inspector, we see the results are inside a container with the class 'results', and each link has the class 'a-reset'. So, our code finds all the divs with the class 'results' in the content stored in 'soup'; for each one it finds all links with the class 'a-reset', collects the 'href' attribute of each 'a' element, which is the actual link, and appends it to the list 'matchesLinks'.

for link in matchesLinks:
    if link[:8] == "/matches":
        url = 'https://www.hltv.org/' + link
        driver.get(url)
        content = driver.page_source
        soup = BeautifulSoup(content)
        for div in soup.findAll('div', attrs={'class': 'match-page'}):
            pageDate = div.find('div', attrs={'class': 'date'})
            pageTournament = div.find('div', attrs={'class': 'event text-ellipsis'})
            date.append(pageDate.text)
            tournament.append(pageTournament.text)
        for div in soup.findAll('div', attrs={'class': 'team1-gradient'}):
            pageTeam1 = div.find('div', attrs={'class': 'teamName'})
            pageResult1 = div.find('div', attrs={'class': ['won', 'lost', 'tie']})
            team1.append(pageTeam1.text)
            finalResult1.append(pageResult1.text)
        for div in soup.findAll('div', attrs={'class': 'team2-gradient'}):
            pageTeam2 = div.find('div', attrs={'class': 'teamName'})
            pageResult2 = div.find('div', attrs={'class': ['won', 'lost', 'tie']})
            team2.append(pageTeam2.text)
            finalResult2.append(pageResult2.text)

From now on, we iterate through the links stored in matchesLinks, open each one of them using Webdriver and BeautifulSoup as in the first step, and collect the data we need from the page. Our code finds a div with class 'match-page', where we can find the match date and tournament. These data are appended to the date and tournament lists. The same process is followed in the next lines to collect the teams playing the match and the final results.

for div in soup.findAll('div', attrs={'id': 'all-content'}):
    team = pageTeam1.text
    for table in div.findAll(class_='table totalstats'):
        rows = table.find_all('tr')[1:]
        for row in rows:
            cell = [i.text for i in row.find_all('td')]
            playersPlayers.append(cell[0].split('\n')[2])
            kdPlayers.append(cell[1])
            adrPlayers.append(cell[2])
            kastPlayers.append(cell[3])
            ratingPlayers.append(cell[4])
            datePlayers.append(pageDate.text)
            team1Players.append(pageTeam1.text)
            team2Players.append(pageTeam2.text)
            finalResult1Players.append(pageResult1.text)
            finalResult2Players.append(pageResult2.text)
            tournamentPlayers.append(pageTournament.text)
            playerteamPlayers.append(team)
            mapPlayers.append(maps[j])  # 'maps' and 'j' come from a part of the notebook (not shown in this excerpt) that tracks the maps played
        team = pageTeam2.text  # the second totalstats table belongs to team 2

Now that we have collected general information about the match, we can dive into the table with player statistics for this match. The process is the same; the only difference is how we filter by class, using BeautifulSoup's class_ keyword argument (since class is a reserved word in Python) as shown in the code above. After that, we append each player's statistics to its corresponding list.

for divl in soup.findAll('div', attrs={'class': 'small-padding stats-detailed-stats'}):
    for a in divl.findAll('a'):
        link_stats = a['href']
        break

url = 'https://www.hltv.org/' + link_stats
linkStatsList.append(url)
driver.get(url)
content = driver.page_source
soup = BeautifulSoup(content)

After gathering all the data of each player for this specific match, we start to look for each player's historical data prior to the date of the match. On each match page there is a link called 'Detailed stats' where we can find links to the players' historical statistics. The code above finds this link on the match page and directs the controlled browser to it, so the code can start to collect this data in the next steps.

for table in soup.findAll(class_='stats-table'):
    rows = table.find_all('tr')[1:]
    for row in rows:
        stats_auxiliary = {}
        link_player = [i['href'] for i in row.find_all('a')]
        dateStats = pageDate.text
        # e.g. '14th of October 2020' -> ['14th', 'of', 'October', '2020']
        dateSplit = dateStats.split(' ')
        year = dateSplit[-1]
        month = months[dateSplit[-2]]
        if len(dateSplit[0]) == 3:
            # single-digit day such as '3rd'
            toInt = int(dateSplit[0][0])
            day_aux = toInt - 1
            day = '0' + str(day_aux)
        else:
            # two-digit day such as '14th'
            toInt = int(dateSplit[0][0:2])
            day_aux = toInt - 1
            day = str(day_aux)
        url = ('https://www.hltv.org' + link_player[0][:15] + 'individual/' + link_player[0][15:]
               + '?startDate=2013-01-01&endDate=' + year + '-' + month + '-' + day)

On the 'Detailed stats' page there is a new table with links to each player's historical statistics. We follow the same process of finding the link on each line. The difference here is that we provide a start date and an end date to filter the statistics period of each player. The start date will always be 2013-01-01, while the end date will be the day before the match. To build the end date, we must manipulate the match date collected in the previous steps of the code. We split it on spaces to separate day, month, and year, and then work on it to get the date in YYYY-MM-DD format.
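As a side note, the same conversion could be done more compactly with Python's datetime module. Here is a minimal sketch, reusing the months dictionary defined at the top of the notebook and assuming the match date reads like '14th of October 2020'; the helper name is only illustrative and is not part of the original notebook:

from datetime import datetime, timedelta

def end_date_for(match_date_text):
    # hypothetical helper: turn an HLTV date such as '14th of October 2020'
    # into the previous day in YYYY-MM-DD format
    day, _, month, year = match_date_text.split(' ')
    day = ''.join(ch for ch in day if ch.isdigit())  # drop 'st', 'nd', 'rd', 'th'
    match_day = datetime(int(year), int(months[month]), int(day))
    return (match_day - timedelta(days=1)).strftime('%Y-%m-%d')

# end_date_for('1st of March 2020') -> '2020-02-29'

Unlike the simple "day minus one" arithmetic, timedelta also handles matches played on the first day of a month.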

driver.get(url)
content = driver.page_source
soup = BeautifulSoup(content)
for divpl in soup.findAll('div', attrs={'class': 'standard-box'}):
    for divst in divpl.findAll('div', attrs={'class': 'stats-row'}):
        stat = []
        for span in divst.findAll('span'):
            if span.text != 'K - D diff.':
                stat.append(span.text)
        stats_auxiliary[stat[0]] = stat[1]

overallKillsPlayers.append(stats_auxiliary['Kills'])
overallDeathsPlayers.append(stats_auxiliary['Deaths'])
overallKill_DeathPlayers.append(stats_auxiliary['Kill / Death'])
overallKill_RoundPlayers.append(stats_auxiliary['Kill / Round'])
overallRoundsWithKillsPlayers.append(stats_auxiliary['Rounds with kills'])
overallKillDeathDiffPlayers.append(stats_auxiliary['Kill - Death difference'])
openingTotalKillsPlayers.append(stats_auxiliary['Total opening kills'])
openingTotalDeathsPlayers.append(stats_auxiliary['Total opening deaths'])
openingKillRatioPlayers.append(stats_auxiliary['Opening kill ratio'])
openingKillRatingPlayers.append(stats_auxiliary['Opening kill rating'])
openingTeamWinPercentAfterFirstKillPlayers.append(stats_auxiliary['Team win percent after first kill'])
openingFirstKillInWonRoundsPlayers.append(stats_auxiliary['First kill in won rounds'])

Once the code has accessed each player's historical statistics, we follow the same process of appending the data to the lists we created at the beginning of the code.

players_auxdf = pd.DataFrame({'Date': datePlayers, 'Team1': team1Players, 'Team2': team2Players,
                              'Final Result 1': finalResult1Players, 'Final Result 2': finalResult2Players,
                              'Tournament': tournamentPlayers, 'Player Team': playerteamPlayers,
                              'Player': playersPlayers, 'KD': kdPlayers, 'ADR': adrPlayers,
                              'KAST': kastPlayers, 'Rating': ratingPlayers, 'Map': mapPlayers,
                              'Overall Kills': overallKillsPlayers, 'Overall Deaths': overallDeathsPlayers,
                              'Overall Kill / Death': overallKill_DeathPlayers,
                              'Overall Kill / Round': overallKill_RoundPlayers,
                              'Overall Rounds with Kills': overallRoundsWithKillsPlayers,
                              'Overall Kill - Death Diff': overallKillDeathDiffPlayers,
                              'Opening Total Kills': openingTotalKillsPlayers,
                              'Opening Total Deaths': openingTotalDeathsPlayers,
                              'Opening Kill Ratio': openingKillRatioPlayers,
                              'Opening Kill rating': openingKillRatingPlayers,
                              'Opening Team win percent after 1st kill': openingTeamWinPercentAfterFirstKillPlayers,
                              'Opening 1st kill in won rounds': openingFirstKillInWonRoundsPlayers})

playersdf = pd.concat([playersdf, players_auxdf])

playersPlayers=[]
kdPlayers=[]
adrPlayers=[]
kastPlayers=[]
ratingPlayers=[]
datePlayers=[]
team1Players=[]
team2Players=[]
finalResult1Players=[]
finalResult2Players=[]
tournamentPlayers=[]
playerteamPlayers=[]
mapPlayers = []
overallKillsPlayers = []
overallDeathsPlayers = []
overallKill_DeathPlayers = []
overallKill_RoundPlayers = []
overallRoundsWithKillsPlayers = []
overallKillDeathDiffPlayers = []
openingTotalKillsPlayers = []
openingTotalDeathsPlayers = []
openingKillRatioPlayers = []
openingKillRatingPlayers = []
openingTeamWinPercentAfterFirstKillPlayers = []
openingFirstKillInWonRoundsPlayers = []

After collecting all the data from the players involved in the match, the code creates a DataFrame called 'players_auxdf', uniting all the lists created, and then concatenates it with the general 'playersdf' DataFrame, which is built up over the loops. Afterwards, we can reset all the lists, since the data is already stored in the DataFrame, and move on to collect data from the players of the next match.
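One detail not shown in the snippets above: for the first concatenation to work, 'playersdf' has to exist already, so presumably it is initialized as an empty DataFrame before the while loop, something like:

# assumed initialization, placed before the while loop, so that
# pd.concat has an existing DataFrame to append to on the first match
playersdf = pd.DataFrame()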

All this code loops through all the matches on one results page, according to our list 'matchesLinks'.

df=pd.DataFrame({'Date':date,'Team1':team1,'Team2':team2,'Final Result 1':finalResult1,'Final Result 2':finalResult2,'Tournament':tournament,'Link Stats':linkStatsList})
df.to_csv('csMatches_nd_'+str(page)+'.csv',index=False)

date=[]
team1=[]
team2=[]
finalResult1=[]
finalResult2=[]
tournament=[]
linkStatsList=[]

playersdf.to_csv('csplayers_nd_'+str(page)+'.csv',index=False)

playersdf = playersdf[0:0]

page+=100

Finally, we store the lists with general match data in a DataFrame and then save it to a .csv file for later use. We also save the players DataFrame to a .csv file. The code then resets all the lists and the 'playersdf' DataFrame so it can restart the job on a new page. The dataframes could also have been saved to another form of storage, such as a SQL database, instead of the .csv files used here.
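As a rough sketch of that alternative, the same DataFrames could be appended to a local SQLite database with pandas' to_sql; the database and table names here are only illustrative:

import sqlite3

# hypothetical alternative to the .csv files: append each page's data
# to tables in a local SQLite database
conn = sqlite3.connect('csgo_stats.db')
df.to_sql('matches', conn, if_exists='append', index=False)
playersdf.to_sql('players', conn, if_exists='append', index=False)
conn.close()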

This concludes the web scraping step of this project. We can now move on to the next step and analyze the data we collected in order to build a machine learning model that predicts the winner of Counter-Strike matches. The project continues in the next post.
