PyTorch neural networks to predict match results in soccer championships — Part I

André Luiz França Batista
11 min read · Jun 10, 2019


This two-part tutorial will show you how to build a neural network with Python and PyTorch to predict match results in soccer championships. In this first part you will learn how to collect and prepare the data from the datasets using Python and the Pandas library. In the second part you will learn how to create and set up the neural network models using Python and the PyTorch platform.

About the data

The data used in the proposed approach are football match results obtained from the Brazilian League Championship, seasons 2010 (BLC 2010), 2011 (BLC 2011) and 2012 (BLC 2012). Having three separate datasets lets us evaluate the predictive power of the proposed method. Each championship was played on a points system by 20 teams in a double round-robin: each team faced every other team twice, once in its own stadium and once in the opponent's. In this manner there were a total of 38 rounds with 10 matches played in each, producing a total of 380 matches per championship.
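To make that arithmetic explicit, here is a quick sketch (the numbers come straight from the description above):

n_teams = 20
rounds = 2 * (n_teams - 1)         # 38 rounds in a double round-robin
matches_per_round = n_teams // 2   # 10 matches per round
print(rounds * matches_per_round)  # 380 matches per championship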

Where is the data from?

These datasets were collected from the UOL Sports website (BLC 2010, BLC 2011, BLC 2012) and converted into three CSV files (one for each season). These CSV files will be used throughout this tutorial and can be found here.

The dataset format

Take a look at the datasets in the CSV files.

Each CSV file contains one header row and 760 rows of data. Each row contains 50 features, which are: “athlete_id”, “athlete_name”, “round”, “team_id”, “team_name”, “assistances”, “receivedBalls”, “recoveredBalls”, “lostBalls”, “yellowCards”, “redCards”, “cards”, “crossBalls”, “receivedCrossBalls”, “missedCrossBalls”, “receivedCrossBallsPercent”, “defenses”, “sucessfulTackles”, “unsucessfulTackles”, “tackles”, “sucessfulDribles”, “unsucessfulDribles”, “dribles”, “givenCorners”, “receivedCorners”, “receivedFouls”, “committedFouls”, “goodFinishes”, “postFinishes”, “badFinishes”, “finishes”, “goals”, “ownGoals”, “offsides”, “longPasses”, “sucessfulLongPasses”, “unsucessfulLongPasses”, “sucessfulPasses”, “unsucessfulPasses”, “passes”, “sucessfulPassesPercent”, “timePlayed”, “switchField”, “backPasses”, “nickname”, “substitute”, “substituted”, “win”, “draw”, “defeat”.

We will not use all 50 features available in the dataset. Instead, we will use 27 features, following the methodology of the paper:

Martins, Rodrigo G., et al. “Exploring polynomial classifier to predict match results in football championships.” Expert Systems with Applications 83 (2017): 79–93. Access here.

For this tutorial we will use Python as the programming language. To collect and prepare the dataset we will use the Pandas library. For the neural network models (training and testing) we will use the PyTorch platform.

Collect and prepare the dataset

First things first: we need to download the CSV files and save them. All the CSV files can be found here. In this tutorial I use only the dataset from BLC 2010; feel free to try it with the CSV files from the other seasons.

Input:

import pandas

filepath = "data/training_2010.csv"
df = pandas.read_csv(filepath)
print(df) # print to check

Output:

     athlete_id  athlete_name  round  ...  win  draw  defeat
0             0           NaN      1  ...    0     1       0
1             0           NaN      1  ...    0     1       0
2             0           NaN      1  ...    1     0       0
3             0           NaN      1  ...    0     0       1
4             0           NaN      1  ...    0     1       0
5             0           NaN      1  ...    0     1       0

The Pandas library is a great choice for working with data. If you are not familiar with Pandas, I strongly suggest you read this post here from Ted Petrou.

In the code above we've just loaded all the data from the CSV file into a Pandas dataframe (df). We now have one big dataframe with all the data, and we need to select the desired data and extract it from this dataframe. As explained before, the dataset contains 760 rows of data and each row contains 50 features. Each row represents one team's stats for one specific round. To prepare the data for the PyTorch models we need to follow this roadmap:

  • Select and extract the features that we want;
  • Separate the data for training;
  • Separate the data for testing;
  • Normalize the data between 0 and 1;
  • Shuffle the training and test sets;
  • Split the training and test sets into input and output;
  • Convert both output sets.

This roadmap can be useful for preparing your own data for PyTorch models. In this tutorial we follow it with the BLC 2010 dataset, but it can be used with other datasets as well.

Select and extract the features

As stated before, the dataset contains 50 features, but we will not use them all. For this tutorial we will use the 27 features used in the paper mentioned earlier.

Input:

extract = [5, 6, 7, 8, 9, 10, 13, 14, 16, 17, 18, 20, 21, 23, 24, 25, 26, 27, 29, 30, 31, 32, 33, 35, 36, 37, 38, 47, 48, 49]
df = df.iloc[:, extract] # df with the desired columns (features)
print(df) # print to check

Output:

     assistances  receivedBalls  ...  win  draw  defeat
0              5            127  ...    0     1       0
1              4            319  ...    0     1       0
2              5            297  ...    1     0       0
3              2            172  ...    0     0       1
4              0            202  ...    0     1       0
5              0            149  ...    0     1       0
..           ...            ...  ...  ...   ...     ...
758            5            105  ...    1     0       0
759            3            131  ...    0     0       1

[760 rows x 30 columns]

In the code above we've created a list with the (0-based) column indices of the features we want. Then we use the Pandas iloc indexer to extract the desired columns from the original dataframe. Feel free to try extracting other features. Important note: the last three indices [47, 48, 49] are essential because they store the match results. A by-name alternative is sketched below.
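If you prefer labels over positions, the same selection can be done by column name; a minimal sketch, using the names from the CSV header listed earlier:

#same selection by column label instead of position (a sketch;
#the labels come from the dataset header shown above)
columns = ["assistances", "receivedBalls", "recoveredBalls", "lostBalls",
           "yellowCards", "redCards", "receivedCrossBalls", "missedCrossBalls",
           "defenses", "sucessfulTackles", "unsucessfulTackles",
           "sucessfulDribles", "unsucessfulDribles", "givenCorners",
           "receivedCorners", "receivedFouls", "committedFouls",
           "goodFinishes", "badFinishes", "finishes", "goals", "ownGoals",
           "offsides", "sucessfulLongPasses", "unsucessfulLongPasses",
           "sucessfulPasses", "unsucessfulPasses", "win", "draw", "defeat"]
df = df[columns]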

Separate the data for training and test

We will use the Pandas iloc indexer to split the dataset into training and test sets. The training set holds the stats for every match from round 1 to round 37; the test set holds the stats for the matches of round 38. Each round contains 10 matches, each between 2 teams, i.e. 20 rows of data.

Input:

#.copy() avoids SettingWithCopyWarning when we modify these sets later
training = df.iloc[:-20].copy() #select all the rows except for the last 20
test = df.iloc[-20:].copy() #select the last 20 rows
print(training) #print to check
print(test) #print to check

Output:

     assistances  receivedBalls  recoveredBalls  ...  win  draw  defeat
0              5            127               7  ...    0     1       0
1              4            319               4  ...    0     1       0
2              5            297               2  ...    1     0       0
3              2            172               0  ...    0     0       1
4              0            202               2  ...    0     1       0
5              0            149               1  ...    0     1       0
..           ...            ...             ...  ...  ...   ...     ...
735            1            266               4  ...    0     1       0
736            4            214               1  ...    0     1       0
737            7            239               7  ...    0     1       0
738            3            254               3  ...    1     0       0
739            2            172               1  ...    0     0       1

[740 rows x 30 columns]

     assistances  receivedBalls  recoveredBalls  ...  win  draw  defeat
740            3            187               3  ...    1     0       0
741            0            244               1  ...    0     0       1
742            2            134               2  ...    0     1       0
743            1            166               0  ...    0     1       0
..           ...            ...             ...  ...  ...   ...     ...
756            2            193               5  ...    0     1       0
757            3            138               3  ...    0     1       0
758            5            105               2  ...    1     0       0
759            3            131               1  ...    0     0       1

[20 rows x 30 columns]

When you print the training and the test data, you can see the dataframe dimensions. The training dataframe has [740 rows x 30 columns], which means we now have a dataframe with the stats from round 1 to round 37. The test dataframe has [20 rows x 30 columns], i.e. a dataframe with the stats for round 38.
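You can also check the dimensions programmatically through the dataframes' shape attribute:

print(training.shape)  # (740, 30) -> rounds 1 to 37
print(test.shape)      # (20, 30)  -> round 38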

Normalize the data between 0 and 1

Check out the values we have now: they vary from 0 to over 300. We need to normalize these values between 0 and 1. We will check each column (except for the last three, i.e. the results) for its maximum value and then divide all the values in that column by 10, 100 or 1000, whichever power of ten is just above that maximum. We can tackle this with the following code.

Input:

#normalize the data between 0 and 1
for e in range(len(training.columns) - 3): #iterate over each feature column
    #check the maximum value of the column across both sets
    num = max(training.iloc[:, e].max(), test.iloc[:, e].max())
    if num < 10:
        training.iloc[:, e] /= 10
        test.iloc[:, e] /= 10
    elif num < 100:
        training.iloc[:, e] /= 100
        test.iloc[:, e] /= 100
    elif num < 1000:
        training.iloc[:, e] /= 1000
        test.iloc[:, e] /= 1000
    else:
        print("Error in normalization! Please check!")
print(training) #print to check
print(test) #print to check

Output:

     assistances  receivedBalls  recoveredBalls  ...  win  draw  defeat
0           0.05          0.127            0.07  ...    0     1       0
1           0.04          0.319            0.04  ...    0     1       0
2           0.05          0.297            0.02  ...    1     0       0
3           0.02          0.172            0.00  ...    0     0       1
4           0.00          0.202            0.02  ...    0     1       0
..           ...            ...             ...  ...  ...   ...     ...
736         0.04          0.214            0.01  ...    0     1       0
737         0.07          0.239            0.07  ...    0     1       0
738         0.03          0.254            0.03  ...    1     0       0
739         0.02          0.172            0.01  ...    0     0       1

[740 rows x 30 columns]

     assistances  receivedBalls  recoveredBalls  ...  win  draw  defeat
740         0.03          0.187            0.03  ...    1     0       0
741         0.00          0.244            0.01  ...    0     0       1
742         0.02          0.134            0.02  ...    0     1       0
..           ...            ...             ...  ...  ...   ...     ...
757         0.03          0.138            0.03  ...    0     1       0
758         0.05          0.105            0.02  ...    1     0       0
759         0.03          0.131            0.01  ...    0     0       1

[20 rows x 30 columns]

Now we have two dataframes (one for training, one for test) and both are normalized between 0 and 1.
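As an aside, the same power-of-ten scaling can be written in a vectorized way. This is just a sketch (assuming non-negative feature maxima, as in this dataset); the explicit loop above is arguably easier to read:

import numpy as np

#scale each feature column by the smallest power of ten above its maximum
feature_cols = training.columns[:-3]
col_max = pandas.concat([training[feature_cols], test[feature_cols]]).max()
scale = 10 ** np.ceil(np.log10(col_max.clip(lower=1) + 1))
training[feature_cols] = training[feature_cols] / scale
test[feature_cols] = test[feature_cols] / scale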

Shuffle the training and test sets

Why shuffle the data? Shuffling helps models remain general and overfit less. In our case we need to shuffle because the dataset is ordered by class/target, i.e. by team/result. Shuffling makes sure that the training and test sets are representative of the overall distribution of the data.

Observe in the training set that each row represents one team's stats for one match. All matches are listed in order: the team in row 0 played against the team in row 1, the team in row 2 played against the team in row 3, and so on. So if the team in an even row (0, 2, 4, …) wins the match, the team in the following row has a defeat; likewise, if the team in an even row gets a draw, the following team also gets a draw. This happens because rows 0 and 1 describe the same match, as the snippet below checks.
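A quick sanity check of that pairing (a sketch, run on the extracted dataframe df before shuffling):

#rows i and i+1 describe the same match, so a win in one row
#implies a defeat in the other, and draws come in pairs
for i in range(0, len(df), 2):
    assert df.iloc[i]['win'] == df.iloc[i + 1]['defeat']
    assert df.iloc[i]['draw'] == df.iloc[i + 1]['draw']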

Shuffling the sets is a simple task and it can be done in multiple ways. See below where we use the sample method from Pandas as a way to solve this issue.

Input:

training = training.sample(frac=1) #shuffle the training data
test = test.sample(frac=1) #shuffle the test data
print(training) #print to check
print(test) #print to check

Output:

     assistances  receivedBalls  recoveredBalls  ...  win  draw  defeat
570         0.03          0.253            0.01  ...    1     0       0
23          0.01          0.180            0.02  ...    0     0       1
106         0.02          0.168            0.01  ...    1     0       0
462         0.02          0.180            0.03  ...    0     0       1
..           ...            ...             ...  ...  ...   ...     ...
501         0.03          0.153            0.01  ...    0     1       0
287         0.04          0.264            0.01  ...    0     0       1
346         0.00          0.243            0.01  ...    0     0       1

[740 rows x 30 columns]

     assistances  receivedBalls  recoveredBalls  ...  win  draw  defeat
755         0.04          0.343            0.09  ...    0     1       0
751         0.01          0.154            0.02  ...    0     0       1
753         0.02          0.154            0.03  ...    0     0       1
752         0.04          0.146            0.04  ...    1     0       0
..           ...            ...             ...  ...  ...   ...     ...
757         0.03          0.138            0.03  ...    0     1       0
759         0.03          0.131            0.01  ...    0     0       1
754         0.01          0.142            0.02  ...    0     1       0
756         0.02          0.193            0.05  ...    0     1       0

[20 rows x 30 columns]

The frac keyword argument specifies the fraction of rows to be returned in the random sample, so frac=1 means return all rows (in random order).
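If you need the shuffle to be reproducible across runs, sample also accepts a seed; a minimal sketch:

training = training.sample(frac=1, random_state=42) #fixed seed -> same shuffle every run
test = test.sample(frac=1, random_state=42)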

Split training and test sets into input and output

The next thing to do is to split the training and test sets into input and output sets. This is a simple task, and we will again use the iloc indexer.

Input:

#all rows, all columns except for the last 3 columns
training_input = training.iloc[:, :-3]
#all rows, the last 3 columns
training_output = training.iloc[:, -3:]
#all rows, all columns except for the last 3 columns
test_input = test.iloc[:, :-3]
#all rows, the last 3 columns
test_output = test.iloc[:, -3:]
print(test_input) #print to check
print(test_output) #print to check

Output:

     assistances  receivedBalls  ...  sucessfulPasses  unsucessfulPasses
752         0.04          0.146  ...            0.143               0.30
755         0.04          0.343  ...            0.337               0.68
758         0.05          0.105  ...            0.097               0.38
..           ...            ...  ...              ...                ...
753         0.02          0.154  ...            0.146               0.37
757         0.03          0.138  ...            0.136               0.57
754         0.01          0.142  ...            0.136               0.64

[20 rows x 27 columns]

     win  draw  defeat
752    1     0       0
755    0     1       0
758    1     0       0
..   ...   ...     ...
753    0     0       1
757    0     1       0
754    0     1       0

[20 rows x 3 columns]

You can print all four dataframes if you want to check them. What we did here was split the training dataframe into two other dataframes, training_input and training_output, and the same with the test set.
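Once again, the shape attribute gives a quick check of the dimensions:

print(training_input.shape)   # (740, 27)
print(training_output.shape)  # (740, 3)
print(test_input.shape)       # (20, 27)
print(test_output.shape)      # (20, 3)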

Convert both output sets

We are almost there. The training input and the test input sets are ready. Now we need to convert both output sets. Why? Because each output set contains three columns representing win, draw and defeat. A 1 in the win column means the team won its match in that round; the same goes for the draw and defeat columns. That gives us three classes, which would require three nodes/neurons in the output layer of our model. To simplify the models, we will convert these three classes into two classes. This is the same methodology used in the paper.

The new output represents whether a team won or not; whether it drew or not; whether it lost or not. So we will train three different neural network models: one for wins, one for draws, and one for defeats. This will be explained in more detail in Part II of this tutorial.

For this tutorial I will demonstrate the winners scenario; the draws and defeats scenarios follow the same idea. If a team won the match, the output is 1; if it drew or lost, the output is 0. All we need to do is convert the three output columns into a single column.

Input:

# separating the output into two classes: win or draw-defeat
# for the winners convert the output:
# from (1, 0, 0) to 1
# from (0, 1, 0) to 0
# from (0, 0, 1) to 0
def convert_output_win(source):
    target = source.copy() # make a copy of the source
    target['new'] = 0 # new column, default value 0 (draw or defeat)
    for i, row in target.iterrows():
        # write back with .at: assigning to the row copy that
        # iterrows() yields would not modify the dataframe
        if row['win'] == 1:
            target.at[i, 'new'] = 1
    return target.iloc[:, -1] # return all rows, the last column

training_output = convert_output_win(training_output)
test_output = convert_output_win(test_output)

You may want to create a function to solve this issue in order to make it easier for you to maintain/reuse your code.
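As an alternative to the loop, note that the win column already encodes exactly the target we want (1 for a win, 0 otherwise), so for the winners model a vectorized shortcut would do the same job; a sketch:

#equivalent shortcut: take the 'win' column directly from the
#original three-column output sets
training_output = training_output['win']
test_output = test_output['win']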

And… we are done… with the data! What we have done so far:

  • Downloaded one of the three available datasets.
  • Loaded the dataset into a Pandas dataframe.
  • Selected the features we want to use.
  • Split the full dataset into the subsets we need: training input, training output, test input, test output.
  • Shuffled all sets and normalized them between 0 and 1.
  • All sets are ready to be used… Finally!

In Part II of this tutorial we will focus on creating and setting up the neural network models. We will need all the sets we prepared here in this first part.
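As a small preview of Part II (a sketch; the names here are illustrative, not final), PyTorch works with tensors rather than dataframes, so the sets will be converted along these lines:

import torch

#hypothetical preview: convert the Pandas data to float tensors
train_x = torch.tensor(training_input.values, dtype=torch.float32)
train_y = torch.tensor(training_output.values, dtype=torch.float32)
test_x = torch.tensor(test_input.values, dtype=torch.float32)
test_y = torch.tensor(test_output.values, dtype=torch.float32)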

See you in Part II!!!


André Luiz França Batista

/* Computer Science professor at Federal Institute of Triangulo Mineiro. Interested in Artificial Intelligence, Data Science, Games and Web development. */