My predictions for the FIFA World Cup 2018 were a big disaster. Here’s why.

About a week ago, the final of the FIFA World Cup 2018 was played, and the outcome was not what most of us had predicted. This World Cup must have been the most unpredictable ever, but was it really? The answer is yes. On 16th–22nd June 2018, during the *#DehubKenyaHack*, I set out to predict the winners of this year’s World Cup, and guess what, I did: *Code*. But how did that turn out? Terribly wrong. Even so, the historical data that I used to build my prediction model suggested that the FIFA 2018 outcome was unlikely.

According to the model, I expected the champions to be Brazil, with Germany as runners-up and Argentina and Portugal taking the 3rd and 4th positions respectively. As it turned out, none of these teams came anywhere close to these predictions.

Expected

1. Brazil

2. Germany

3. Argentina

4. Portugal

Observed

1. France

2. Croatia

3. Belgium

4. England

*The closest the model got to accuracy was predicting that Belgium would reach the quarter-finals, although it then had Belgium eliminated by Germany.*

Semifinals:

Finals:

So what happened?

The FIFA ranking, which is a major feature in the model, was only created in the 1990s, whereas the data covers all historical matches since the first FIFA World Cup in 1930 for all participating teams. A great deal of data was therefore missing.

The model’s training accuracy was 57.3% (a figure we can usually ignore, since accuracy on data the model has already seen says little about how it generalizes), and its test accuracy was just 55.1%, so there was a good chance that the predictions would be inaccurate.

But it’s not just me.

Sportskeeda ranked the top five teams as Brazil, Germany, France, Spain and Belgium (which was not so bad). This ranking, however, considered the ratings of each team’s players ‘on the basis of their strengths and weaknesses on paper’, which the model did not. (But at least we agreed on Belgium.)

About the model:

The model used a logistic function: a sigmoid function that takes any real input t and outputs a value between zero and one. In other words, it takes log-odds as input and outputs a probability.

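Assuming t is a linear function of a single explanatory variable x (written here with intercept β0 and slope β1), then

$$t = \beta_0 + \beta_1 x$$

Our function can be rewritten as:

$$P(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

P(x) is interpreted as the probability of the dependent variable equaling a “success” or “case” rather than a failure or non-case.

Complement: 1 − P(x) is the probability of the dependent variable equaling a “failure” or “non-case” rather than a success or case.

The inverse of the logistic function (the logit, or log-odds) is

$$\ln\frac{P(x)}{1 - P(x)} = \beta_0 + \beta_1 x$$

And equivalently,

$$\frac{P(x)}{1 - P(x)} = e^{\beta_0 + \beta_1 x}$$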
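A minimal numerical sketch of the logistic function and its inverse may help here. The coefficients `beta0` and `beta1` and the feature value `x` are made-up numbers for illustration only; they are not the values fitted by the World Cup model.

```python
import math


def sigmoid(t: float) -> float:
    """Logistic function: maps any real t to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def logit(p: float) -> float:
    """Inverse of the logistic function: maps a probability to log-odds."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients and feature value, chosen only for illustration.
beta0, beta1 = -0.5, 0.8
x = 1.2  # e.g. some ranking-based feature for a match (assumed)

t = beta0 + beta1 * x      # linear predictor, i.e. the log-odds
p_win = sigmoid(t)         # P(x): probability of a "success"
p_not_win = 1.0 - p_win    # complement: probability of a "failure"
odds = p_win / p_not_win   # equals exp(t)

print(f"log-odds={t:.3f}  P(x)={p_win:.3f}  odds={odds:.3f}")
print(f"logit(P(x)) recovers the log-odds: {logit(p_win):.3f}")
```

Applying `logit` to the output of `sigmoid` returns the original linear predictor, which is what "inverse" means here.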

To sum up these major issues: since the World Cup is a relatively rare event (held every 4 years), the available historical data is limited; that is, the sample size is relatively small and (maybe) obsolete. Statisticians in many industries face similar problems when trying to predict unexpected events, or when working from flawed or incomplete data. Simply handing this task over to the models that we build might, in most cases, not be as effective as we wish. Besides, many machine learning and statistical data mining techniques assume that the training data (historical, in this case), which we use to train models, behaves similarly to the test data on which the model is later evaluated.

However, this assumption does not always hold: the data might be outdated, and it may be impractical or expensive (e.g. because of time constraints, in my case) to get more current data for which the assumption does hold.

*Perhaps, to make the results more accurate next time:*

We may invest more in cleaning the datasets.

We may improve accuracy by splitting the data properly. Andrew Ng from Coursera suggests the following approach when working on a dataset:

Separate your dataset into 3 different sets: the training set, the cross-validation set and the test set;

For polynomial regression (for example), optimize the coefficients using the training set;

Find the best polynomial degree using the cross-validation set, with the coefficients computed from the training set;

Finally, estimate how well your model generalizes using the test set.

This way, the training set and the cross-validation set are used to build the model, while the test set, which played no part in building it, gives a more honest estimate of how well the predictions generalize.
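The three-way split described above can be sketched roughly as follows. This is an illustration on synthetic data, not the World Cup dataset: the quadratic ground truth, the 60/20/20 split and the helper names `fit_poly`, `predict` and `mse` are all assumptions made for the example, and the least-squares fit is a small pure-Python routine rather than a library call.

```python
import random


def fit_poly(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (Gauss-Jordan)."""
    n = degree + 1
    # Augmented matrix [A^T A | A^T y], where A[i][j] = xs[i] ** j.
    m = [[sum(x ** (r + c) for x in xs) for c in range(n)]
         + [sum(y * x ** r for x, y in zip(xs, ys))] for r in range(n)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def predict(coeffs, x):
    return sum(c * x ** j for j, c in enumerate(coeffs))


def mse(coeffs, xs, ys):
    return sum((predict(coeffs, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data (assumed): a noisy quadratic sampled on [-1, 1].
random.seed(0)
data = []
for _ in range(100):
    x = random.uniform(-1, 1)
    data.append((x, 1.0 - 2.0 * x + 3.0 * x ** 2 + random.gauss(0, 0.1)))

# Step 1: split into training, cross-validation and test sets.
train, cv, test = data[:60], data[60:80], data[80:]
x_tr, y_tr = [r[0] for r in train], [r[1] for r in train]
x_cv, y_cv = [r[0] for r in cv], [r[1] for r in cv]
x_te, y_te = [r[0] for r in test], [r[1] for r in test]

# Steps 2 and 3: fit coefficients on the training set for each degree,
# then keep the degree with the lowest cross-validation error.
best_degree, best_coeffs, best_cv_error = None, None, float("inf")
for degree in range(1, 6):
    coeffs = fit_poly(x_tr, y_tr, degree)
    cv_error = mse(coeffs, x_cv, y_cv)
    if cv_error < best_cv_error:
        best_degree, best_coeffs, best_cv_error = degree, coeffs, cv_error

# Step 4: the test set is touched only once, to estimate generalization.
test_error = mse(best_coeffs, x_te, y_te)
print(f"chosen degree={best_degree}  cv MSE={best_cv_error:.4f}  test MSE={test_error:.4f}")
```

The point of the design is that the test set never influences either the coefficients or the choice of degree, so its error is not flattered by the model selection step.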

And finally, we can consider transfer learning. This offers a way to keep models relevant by starting from models fitted solely on historical data. These models are then enriched with recent data from similar domains that could better capture current trends (in our case, for instance, we could consider players’ ratings from their clubs).

If we manage to identify the areas of knowledge that are “transferable” to the target domain, we can then train the model by identifying the commonalities between the target task, recent tasks, previous tasks, and similar-but-not-the-same tasks. In other words, this will guide our model to learn explicitly from the relevant parts of the datasets.
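One very simple form of this idea is warm-start fine-tuning: fit a logistic regression on the large historical dataset, then continue training from those weights on a small, recent dataset. The sketch below is only an illustration under that assumption. Both datasets are synthetic, and the "true" weights, sample sizes and helper names (`make_rows`, `gradient_descent`, `log_loss`) are invented for the example.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def log_loss(w, rows):
    """Mean negative log-likelihood of the labels under weights w."""
    total = 0.0
    for x, y in rows:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(rows)


def gradient_descent(w, rows, lr=0.1, steps=200):
    """Full-batch gradient descent on the logistic loss, starting from w."""
    w = list(w)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in rows:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj / len(rows)
        w = [wi - lr * gj for wi, gj in zip(w, grad)]
    return w


def make_rows(true_w, n, rng):
    """Synthetic binary outcomes drawn from a logistic model (assumed)."""
    rows = []
    for _ in range(n):
        x = [1.0, rng.gauss(0, 1), rng.gauss(0, 1)]  # bias + two features
        p = sigmoid(sum(wi * xi for wi, xi in zip(true_w, x)))
        rows.append((x, 1 if rng.random() < p else 0))
    return rows


rng = random.Random(0)
historical = make_rows([0.0, 2.0, -1.0], 500, rng)  # large, older source data
recent = make_rows([0.0, 1.0, 1.0], 40, rng)        # small, current target data

# Step 1: learn from the historical data alone.
w_source = gradient_descent([0.0, 0.0, 0.0], historical)
# Step 2: start from those weights and adapt to the recent data.
w_tuned = gradient_descent(w_source, recent, steps=50)

print(f"loss on recent data before fine-tuning: {log_loss(w_source, recent):.4f}")
print(f"loss on recent data after fine-tuning:  {log_loss(w_tuned, recent):.4f}")
```

In practice the fine-tuned model should be judged on held-out recent matches rather than on the same rows it was tuned on; the comparison here only shows the mechanics of starting from historical weights.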