Counter Strike match result prediction

Jeferson Machado Santos
Analytics Vidhya
8 min read · May 13, 2020


Part 2: Machine Learning model

This is the second in a series of 3 posts about developing a machine learning model to predict the result of online Counter Strike matches. The first article described how we collected historical match data and player performance data from hltv.org, and also explained how Counter Strike itself works. You can read the beginning of this story here.

In this second article we will build a machine learning model to predict the result of Counter Strike matches based on the data we collected in the previous article.

By the end of the previous article, two databases with historical data had been generated: one for matches and one for the performance of each player from each team in these matches. Here is the structure of each:

Matches database
Players database

Our objective is to predict the winning team of a match. Taking a look at the databases, we can see that the matches database does not hold much information besides the teams that played and the final result. Since the final result is what we intend to predict, we will need to work on the players performance database, which has information available before the match. If you do not want to read the whole article, you can check the full jupyter notebook of this project on my github, here.

Taking a look at this database, we can see the ‘KD’ (kill / death) column and the ‘ADR’ column. KD shows the number of kills and deaths each player had in that match, and ADR is the difference between the two. In theory, the team with the most kills should win the match. So we will cross both databases to sum the total kill difference of each team’s players in each match and check whether it correlates with the winning team.

We start by importing the pandas library to deal with dataframes and loading the files we saved as .csv at the end of the first post into variables named ‘matches_raw’ and ‘players_raw’.
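A minimal sketch of this step (the .csv file names are assumptions; adjust them to whatever you saved in the first post):

```python
import pandas as pd

# Load the historical data scraped in the first post
# (file names are placeholders)
matches_raw = pd.read_csv('matches.csv')
players_raw = pd.read_csv('players.csv')
```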

It’s necessary to cross information between both databases in order to sum the kill difference (ADR) of each player for each team. We will create a new column in each database concatenating the information from the columns ‘Date’, ‘Team 1’, ‘Team 2’, ‘Final Result 1’, ‘Final Result 2’ and ‘tournament’. So, when we need to find the players from a specific match, we can look for the rows in the players database that have the same value in this column as in the matches database.
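A sketch of how this key column can be built, assuming both dataframes carry these exact column names from the scraper:

```python
# Columns that together identify a single match
key_cols = ['Date', 'Team 1', 'Team 2', 'Final Result 1',
            'Final Result 2', 'tournament']

# Concatenate them into a single 'Match' key on both databases
matches_raw['Match'] = matches_raw[key_cols].astype(str).agg(' '.join, axis=1)
players_raw['Match'] = players_raw[key_cols].astype(str).agg(' '.join, axis=1)
```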

Now that we have a column to cross both databases, we can sum the kill difference of the players on each side of every match. For each row of the matches database, we filter the players database for rows with the same ‘Match’ value, sum the ADR of the players from team 1 and append it to a list, and then repeat the same steps for team 2. Finally, we create columns in the matches database with the total kill difference of each team.
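A sketch of that loop (the ‘Total ADR’ column names are my own):

```python
total_adr_team1, total_adr_team2 = [], []

for _, match in matches_raw.iterrows():
    # All player rows belonging to this match
    players = players_raw[players_raw['Match'] == match['Match']]
    # Sum the kill difference of each side separately
    total_adr_team1.append(
        players.loc[players['Player Team'] == match['Team 1'], 'ADR'].sum())
    total_adr_team2.append(
        players.loc[players['Player Team'] == match['Team 2'], 'ADR'].sum())

matches_raw['Total ADR 1'] = total_adr_team1
matches_raw['Total ADR 2'] = total_adr_team2
```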

We will also add a column called ‘Team1 victory’ to easily, and numerically, check which team won the match. We go through all the rows of ‘matches_raw’ and compare ‘Final Result 1’ and ‘Final Result 2’, the final score of the match. If team 1’s score is higher than team 2’s, the column gets the value 1; if it is lower, it gets 0; and draws get the value 2.
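A sketch of this labeling step, assuming the final result columns are already numeric:

```python
team1_victory = []

for _, match in matches_raw.iterrows():
    if match['Final Result 1'] > match['Final Result 2']:
        team1_victory.append(1)   # team 1 won
    elif match['Final Result 1'] < match['Final Result 2']:
        team1_victory.append(0)   # team 2 won
    else:
        team1_victory.append(2)   # draw

matches_raw['Team1 victory'] = team1_victory
```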

Now we can confirm our initial hypothesis that teams with a higher total kill difference in a match were the winners. As we can see in the distribution plots below, in most matches where team 1 is the winner it has a positive total kill difference, and the team 2 plot shows the same effect on the opposite side.
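A plot along these lines can be produced with seaborn, for example (a sketch, not the exact plot from the notebook):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of team 1's total kill difference, split by outcome
sns.kdeplot(matches_raw.loc[matches_raw['Team1 victory'] == 1, 'Total ADR 1'],
            label='Team 1 won')
sns.kdeplot(matches_raw.loc[matches_raw['Team1 victory'] == 0, 'Total ADR 1'],
            label='Team 1 lost')
plt.xlabel('Total kill difference of team 1')
plt.legend()
plt.show()
```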

To accomplish our objective of predicting the result of Counter Strike matches, we will build a model that predicts the kill difference (ADR) of each player in a match. With this data, we can sum all the kill differences for each team, and predict the winner team. From now on, we will work mainly on the players database to build our model.

We assign the players_raw database to train_X, since it will be the dataset used to train the model. Then we create an opposing team column, since the players database only carries the ‘Player Team’ information. This information will be important for the model, since players can perform better or worse against specific teams. The code goes through all the rows of train_X and checks whether the value in ‘Player Team’ equals the value in the ‘Team 1’ column. If it does, the opposing team is team 2; otherwise it is team 1.
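A sketch of that loop:

```python
train_X = players_raw.copy()

# The opposing team is whichever of the two match teams
# the player does not belong to
opposing_team = []
for _, row in train_X.iterrows():
    if row['Player Team'] == row['Team 1']:
        opposing_team.append(row['Team 2'])
    else:
        opposing_team.append(row['Team 1'])

train_X['Opposing Team'] = opposing_team
```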

With the train_X.info() command we can see the details of all the columns in our database. The non-null counts show that we don’t have missing data, so we can skip the imputation steps of data cleaning. However, some data will not be useful for the model. The date is not relevant to the match result, nor are ‘Team 1’ and ‘Team 2’, since the team information is already in the ‘Player Team’ and ‘Opposing Team’ columns. The final results are only available after a match is played, so we cannot use them to predict the outcome. The kill difference (ADR) is what we are trying to predict, so we keep it out of the features and assign it to the variable train_y, which will be the target used to train the model. KD, Rating and Map also refer to the already played match and will not be available at prediction time. Basically, we keep the historical performance data plus the player, their team and the opposing team.
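A sketch of this split (the exact list of dropped columns is an assumption based on the description above):

```python
# The kill difference is the prediction target
train_y = train_X['ADR']

# Drop the target, everything only known after the match is played,
# and columns already represented elsewhere
train_X = train_X.drop(columns=['ADR', 'KD', 'Rating', 'Map', 'Date',
                                'Team 1', 'Team 2', 'Final Result 1',
                                'Final Result 2', 'Match'])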

Now that we keep only the columns we want in our model, we will focus on the columns whose datatype is ‘object’ and convert them to numeric so the model can perform better. ‘Opening Team win percent after 1st kill’ and ‘Opening 1st kill in won rounds’ are objects only because the scraped data kept a ‘%’ character at the end. So we will remove this character and convert these columns to numeric percentage values.

We go through all the rows of our training database, strip the trailing ‘%’ character of each value and convert it to float, then write the cleaned columns back to the database in place of the old ones. Now we can see that both columns have the float format.
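A sketch of that conversion:

```python
pct_cols = ['Opening Team win percent after 1st kill',
            'Opening 1st kill in won rounds']

for col in pct_cols:
    cleaned = []
    for value in train_X[col]:
        # Drop the trailing '%' and convert to float
        cleaned.append(float(value[:-1]))
    train_X[col] = cleaned
```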

Now we will work on the remaining object data: ‘Player’, ‘Player Team’ and ‘Opposing Team’, all of them names. To deal with them we can use encoders, which assign a numerical code to each distinct value, in this case each name, found in the dataset.
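One way to do this is scikit-learn’s LabelEncoder, fitted per column (a sketch; keeping one encoder per column lets us reuse the same mapping on new data later):

```python
from sklearn.preprocessing import LabelEncoder

encoders = {}
for col in ['Player', 'Player Team', 'Opposing Team']:
    # Each distinct name gets its own integer code
    encoders[col] = LabelEncoder()
    train_X[col] = encoders[col].fit_transform(train_X[col])
```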

Now we have all the columns from the database with numerical values and we can start with the model itself. Remember that our objective is to build a model that predicts the kill difference based on the data which we just cleaned and adjusted.

In order to evaluate the models, we create a function to which we pass a model and which calculates the mean absolute error over different splits of train and validation datasets, using the ‘cross_val_score’ function. In the cells below we will evaluate two algorithms: a Random Forest regressor and an XGBoost regressor.
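A sketch of such a function:

```python
from sklearn.model_selection import cross_val_score

def score(model, X=train_X, y=train_y):
    # cross_val_score returns the negated MAE, so flip the sign back
    maes = cross_val_score(model, X, y,
                           scoring='neg_mean_absolute_error', cv=5)
    return -maes.mean()
```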

We also used a for loop to vary the n_estimators of the model passed to the score function, as sketched below. As the results show, XGBoost has a much better average error, so we will continue with it. The next step is to find the best parameters for our XGBoost algorithm. To do that, we follow the process described in this post. I will not go through all the steps, but after tuning, our final model gives us an average error of 7.37.
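A sketch of that comparison loop (the n_estimators values are illustrative):

```python
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

for n in [50, 100, 200, 500]:
    print(n,
          score(RandomForestRegressor(n_estimators=n, random_state=0)),
          score(XGBRegressor(n_estimators=n, random_state=0)))
```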

The last optimization step is to select, from all the columns we are using to train the model, the best features to keep.
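One way to do this, for example, is scikit-learn’s SelectFromModel (a sketch; it may differ from the exact method used in the notebook):

```python
from sklearn.feature_selection import SelectFromModel

# Keep only the features whose importance clears the default threshold
selector = SelectFromModel(XGBRegressor(n_estimators=100, random_state=0))
selector.fit(train_X, train_y)
selected_features = train_X.columns[selector.get_support()]
print(selected_features)
```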

In this case, all our columns were kept. Finally, we can run our score function one last time to find the best n_estimators for our model with these features and parameters. With the best n_estimators defined, we can do the final training of the model and save it with the pickle library for later use.
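A sketch of the final training and saving steps (the n_estimators value and file name are placeholders):

```python
import pickle

# Final training on the full dataset with the tuned parameters
final_model = XGBRegressor(n_estimators=200, random_state=0)
final_model.fit(train_X, train_y)

# Persist the trained model for later use
with open('cs_model.pkl', 'wb') as f:
    pickle.dump(final_model, f)
```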

So now we have a model that can predict the kill difference of each player of each team in a Counter Strike match. If we sum the predicted kill differences of all the players of each team in a match, the team with the higher predicted total is the predicted winner. In order to test the model, I collected a new database of matches, not involved in this training, using the web scraping code from the first post. After applying all the same preparations to this database, I ran the model and compared the results with the real outcomes of the matches.
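A sketch of that aggregation, assuming new_players is the freshly scraped players database after the same cleaning and encoding steps:

```python
# Predict each player's kill difference on the new data
test_X = new_players[train_X.columns]
new_players['Predicted ADR'] = final_model.predict(test_X)

# Sum the predictions per match and per team; the side with the
# higher predicted total is the predicted winner
predicted_totals = new_players.groupby(
    ['Match', 'Player Team'])['Predicted ADR'].sum()
```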

Even on a totally new database, the great majority of the predictions were correct. Also, when the difference between the two teams’ predicted total kill differences is higher than 25, the rate of correct predictions is much higher.

Here we accomplish the objective of this model: predicting the outcome of Counter Strike matches based on historical data from hltv.org. In the next post, I will share how to build a web scraping algorithm that collects data on future matches yet to be played, as well as how to adjust these files and deploy them so they can run outside the jupyter notebook whenever we want to predict matches.
