Predicting the results of the 2014 FIFA World Cup

Our model predicts that the final match (July 13th) will be between Brazil and Spain. Since Brazil is predicted to be 13% better than Spain, we would have a champion from South America after many years. For third place, Germany would defeat Argentina in a tough game. Join me to see how this exciting tournament can be predicted with the help of Statistics and Machine Learning.

Omar U. Florez
Jun 9, 2014

The FIFA World Cup will begin this Friday. It’s a huge event that catches the attention of the entire world every 4 years. Trying to guess who is going to win seems to be an interesting task for a machine learner.

As we know, any informed guess is formed by two components: past information and future expectations. Interestingly, past information comes from facts and history, while expectations come from people’s opinions and their own subjective way of observing the world. Combining both types of information to precisely explain a real problem, like the FIFA World Cup, is hard. Let’s use human perception and historical data to approximate the probability that a team will win the World Cup. What’s more, let’s unroll every group and phase to see which teams are going to make it to the final.

Human Perception

Detailed information about the performance of a team could consider individual player performances, the number of goals scored in the season, or the coach’s experience in World Cup games. The good news is that human opinion already takes all of these into account. To quantify that opinion, we use current online betting odds (June 8th, 2014) from 8 well-known betting companies (PaddyPower, WilliamHill, Ladbrokes, BetFred, bet365, betFair, betVictor, Winner, and SkyBet). These expected values represent an average of many real people’s estimates of how well a team will do in the cup. This is, by itself, a powerful statistic.

Betting odds provide meaningful information about people’s expectations.
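
To make the averaging step concrete, here is a minimal sketch of how odds from several bookmakers could be turned into one expectation per team. It assumes the odds are quoted in fractional form (for example 3/1), which this post does not state. The team names and odds are made up for illustration, and `implied_probability` and `average_expectation` are hypothetical helper names.

```python
def implied_probability(fractional_odds):
    """Convert fractional odds such as '3/1' into an implied probability.

    Odds of a/b mean a stake of b wins a, so the implied probability is b / (a + b).
    """
    numerator, denominator = fractional_odds.split("/")
    return int(denominator) / (int(numerator) + int(denominator))


def average_expectation(odds_from_bookmakers):
    """Average the implied probabilities quoted by several bookmakers."""
    probabilities = [implied_probability(o) for o in odds_from_bookmakers]
    return sum(probabilities) / len(probabilities)


# Hypothetical odds for two placeholder teams (not real quotes).
odds = {
    "Team A": ["3/1", "7/2", "3/1"],
    "Team B": ["6/1", "13/2", "6/1"],
}

expectation = {team: average_expectation(quotes) for team, quotes in odds.items()}

# A higher average probability means people expect the team to do better.
ranking = sorted(expectation, key=expectation.get, reverse=True)
print(ranking)
```

Averaging over bookmakers smooths out the differences between individual companies, which is the sense in which the value reflects many people’s opinions at once.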

Historical Facts

For this part we can use the official FIFA ranking, which measures the performance of national teams every month based on several metrics (games won, goals scored, fair play, etc.). Additionally, I included the number of times a team has won the World Cup before and whether a team is playing at home. These are all historical facts.

Historical data for every national team.

Machine Learning

We now have information to predict the performance of each national team. Our goal is to learn the behavior of the function f(X) that takes as input historical and subjective data (X). The learning part consists of finding the combination of weights (W) over the input data that best approximates the average opinion of people across 8 betting stores (Y). Known as continuous regression, this is a convex optimization that takes advantage of the first derivative of the error function (the derivative of the error between f(X) and Y with respect to W) to iteratively move W toward a value that satisfies (i.e., approximates) the equation below. I used a kernel-based non-linear regression with a Radial Basis Function kernel and obtained a mean squared error of 0.0886. In other words:

Average human expectation = W * [betting odds from 1 online store + FIFA ranking + locality + championship history]
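
As a rough illustration of the kind of training described here, the sketch below fits a regression with a Radial Basis Function kernel by following the first derivative of the mean squared error. It is not the code used for this post: the feature rows, the target values, the `gamma` setting, and the function names are all assumptions made up for the example.

```python
import math


def rbf_kernel(a, b, gamma=0.5):
    """Radial Basis Function kernel: exp(-gamma * squared distance)."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


def predict(x, X_train, alpha, gamma=0.5):
    """f(x) is a weighted sum of kernel similarities to the training rows."""
    return sum(a * rbf_kernel(x, xi, gamma) for a, xi in zip(alpha, X_train))


def mse(X, y, alpha, gamma=0.5):
    """Mean squared error between predictions and targets."""
    errors = [predict(xi, X, alpha, gamma) - yi for xi, yi in zip(X, y)]
    return sum(e * e for e in errors) / len(y)


def fit(X, y, gamma=0.5, steps=2000):
    """Gradient descent on the mean squared error with respect to the weights."""
    n = len(X)
    K = [[rbf_kernel(xi, xj, gamma) for xj in X] for xi in X]
    alpha = [0.0] * n
    learning_rate = 0.5 / n  # conservative step so the error never increases
    for _ in range(steps):
        residual = [sum(K[i][j] * alpha[j] for j in range(n)) - y[i]
                    for i in range(n)]
        # Derivative of the error with respect to each weight alpha[j].
        gradient = [(2.0 / n) * sum(K[i][j] * residual[i] for i in range(n))
                    for j in range(n)]
        alpha = [a - learning_rate * g for a, g in zip(alpha, gradient)]
    return alpha


# Hypothetical rows: [scaled FIFA rank, scaled past titles, plays at home].
X = [[0.1, 1.0, 1.0], [0.0, 0.2, 0.0], [0.5, 0.0, 0.0], [0.9, 0.0, 0.0]]
# Hypothetical targets: average position according to the bookmakers.
y = [1.0, 2.0, 8.0, 15.0]

alpha = fit(X, y)
print(mse(X, y, [0.0] * len(y)), "->", mse(X, y, alpha))
```

The weights here are attached to training examples rather than to input columns, which is what lets a kernel model capture non-linear relationships between the inputs and the target.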

The lower the predicted value of Y, the better, since it tells us how high a team can place in the ranking. Note that we will use this predicted ranking in every match to decide which team goes to the next phase and, eventually, which one will be the champion.

Let’s start the game!

Once the model is trained, it’s possible to compute a predicted ranking for each team, as illustrated below. Note that the resulting ranking is different from the ones provided by FIFA or the online betting companies. This is because the model captures how the variables jointly relate to a high or low ranking value across the entire data set, rather than on a case-by-case basis.

Resulting ranking after training the model.

Using the learned ranking, let’s recreate the whole tournament through its Group Phase, Second Phase, Quarters, Semi-finals, and Grand Final. In summary, we try to fill in the following fixture.

1) Group Phase

Group A:
Brazil vs Cameroon : Brazil
Cameroon vs Croatia : Croatia
Croatia vs Mexico : Croatia
Cameroon vs Mexico : Mexico
Brazil vs Mexico : Brazil
Brazil vs Croatia : Brazil
Points per team: [(‘Brazil’, 9), (‘Croatia’, 6), (‘Mexico’, 3), (‘Cameroon’, 0)]

Group B:
Australia vs Spain : Spain
Australia vs Holland : Holland
Australia vs Chile : Chile
Chile vs Spain : Spain
Holland vs Spain : Spain
Chile vs Holland : Holland
Points per team: [(‘Spain’, 9), (‘Holland’, 6), (‘Chile’, 3), (‘Australia’, 0)]

Group C:
Ivory Coast vs Japan : Ivory Coast
Colombia vs Greece : Colombia
Greece vs Japan : Greece
Colombia vs Ivory Coast : Colombia
Greece vs Ivory Coast : Ivory Coast
Colombia vs Japan : Colombia
Points per team: [(‘Colombia’, 9), (‘Ivory Coast’, 6), (‘Greece’, 3), (‘Japan’, 0)]

Group D:
Italy vs Uruguay : Uruguay
Costa Rica vs Italy : Italy
Costa Rica vs Uruguay : Uruguay
Costa Rica vs England : England
England vs Italy : Italy
England vs Uruguay : Uruguay
Points per team: [(‘Uruguay’, 9), (‘Italy’, 6), (‘England’, 3), (‘Costa Rica’, 0)]

Group E:
France vs Switzerland : France
Honduras vs Switzerland : Switzerland
Ecuador vs Honduras : Ecuador
Ecuador vs France : France
Ecuador vs Switzerland : Switzerland
France vs Honduras : France
Points per team: [(‘France’, 9), (‘Switzerland’, 6), (‘Ecuador’, 3), (‘Honduras’, 0)]

Group F:
Bosnia-Herzegovina vs Nigeria : Bosnia-Herzegovina
Argentina vs Iran : Argentina
Iran vs Nigeria : Nigeria
Bosnia-Herzegovina vs Iran : Bosnia-Herzegovina
Argentina vs Bosnia-Herzegovina : Argentina
Argentina vs Nigeria : Argentina
Points per team: [(‘Argentina’, 9), (‘Bosnia-Herzegovina’, 6), (‘Nigeria’, 3), (‘Iran’, 0)]

Group G:
Germany vs Portugal : Germany
Germany vs Ghana : Germany
Germany vs USA : Germany
Ghana vs Portugal : Portugal
Portugal vs USA : Portugal
Ghana vs USA : USA
Points per team: [(‘Germany’, 9), (‘Portugal’, 6), (‘USA’, 3), (‘Ghana’, 0)]

Group H:
Russia vs South Korea : Russia
Algeria vs Russia : Russia
Belgium vs South Korea : Belgium
Belgium vs Russia : Belgium
Algeria vs South Korea : South Korea
Algeria vs Belgium : Belgium
Points per team: [(‘Belgium’, 9), (‘Russia’, 6), (‘South Korea’, 3), (‘Algeria’, 0)]
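
The group tables above follow from one simple rule: in every match the team with the lower (better) predicted value wins and gets 3 points. A minimal sketch of that rule is shown below. The numeric values in `predicted_rank` are placeholders chosen only to reproduce the Group A order, not the model’s actual output, and `play` and `group_stage` are hypothetical names.

```python
from itertools import combinations


def play(team_a, team_b, predicted_rank):
    """Return the team with the lower (better) predicted rank value."""
    return team_a if predicted_rank[team_a] < predicted_rank[team_b] else team_b


def group_stage(teams, predicted_rank):
    """Play every pair once, award 3 points per win, and sort by points."""
    points = {team: 0 for team in teams}
    for team_a, team_b in combinations(teams, 2):
        points[play(team_a, team_b, predicted_rank)] += 3
    return sorted(points.items(), key=lambda item: item[1], reverse=True)


# Placeholder values, lower is better (not the real predicted ranking).
predicted_rank = {"Brazil": 1.0, "Croatia": 2.0, "Mexico": 3.0, "Cameroon": 4.0}

table = group_stage(["Brazil", "Cameroon", "Croatia", "Mexico"], predicted_rank)
print("Points per team:", table)
```

Because the winner depends only on the predicted value, this sketch never produces a draw, which is why every group above ends with 9, 6, 3 and 0 points.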

2) Second Phase

Match between Brazil and Holland : Brazil
Match between Colombia and Italy : Colombia
Match between France and Bosnia-Herzegovina : France
Match between Germany and Russia : Germany

Match between Spain and Croatia : Spain
Match between Uruguay and Ivory Coast : Uruguay
Match between Argentina and Switzerland : Argentina
Match between Belgium and Portugal : Portugal

3) Quarters

Match between Brazil and Colombia : Brazil
Match between France and Germany : Germany

Match between Spain and Uruguay : Spain
Match between Argentina and Portugal : Argentina

4) Semi-finals

Match between Brazil and Germany : Brazil
Match between Spain and Argentina : Spain

5) GRAND FINAL

1st place:
Match between Brazil and Spain : Brazil
3rd place:
Match between Argentina and Germany : Germany
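
The knockout rounds can be sketched in the same way: adjacent teams in the bracket meet, the better predicted value advances, and the two losing semi-finalists play for third place. The values in `predicted_rank` are again placeholders picked to match the results listed above, and `knockout` is a hypothetical helper, not the code behind this post.

```python
def play(team_a, team_b, predicted_rank):
    """Return the team with the lower (better) predicted rank value."""
    return team_a if predicted_rank[team_a] < predicted_rank[team_b] else team_b


def knockout(bracket, predicted_rank):
    """Return (champion, third place) for a bracket of 4, 8, 16... teams."""
    teams = list(bracket)
    losers = []
    while len(teams) > 2:
        pairs = list(zip(teams[0::2], teams[1::2]))
        winners = [play(a, b, predicted_rank) for a, b in pairs]
        if len(teams) == 4:
            # Remember the losing semi-finalists for the third-place match.
            losers = [b if w == a else a for (a, b), w in zip(pairs, winners)]
        teams = winners
    champion = play(teams[0], teams[1], predicted_rank)
    third_place = play(losers[0], losers[1], predicted_rank)
    return champion, third_place


# Placeholder values, lower is better (not the real predicted ranking).
predicted_rank = {
    "Brazil": 1, "Spain": 2, "Germany": 3, "Argentina": 4,
    "France": 5, "Colombia": 6, "Uruguay": 7, "Portugal": 8,
}

# Quarter-final bracket in the order listed above.
quarter_finalists = ["Brazil", "Colombia", "France", "Germany",
                     "Spain", "Uruguay", "Argentina", "Portugal"]

print(knockout(quarter_finalists, predicted_rank))
```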

It’s been a great tournament. The locals defeated the current champion and got huge support from their fans in the Brazil vs Spain final. Argentina vs Germany is also a great game, but a tricky one. Brazil and Argentina are neighbors, so large fan support for Argentina can be expected, especially in this game. This could tip the result toward a victory for Argentina, as opposed to the predicted victory for Germany. Time will tell.

Conclusion

Hope you enjoyed it. I can’t wait to enjoy this event and see how accurate the predictions were. The beauty of considering human opinion is that it’s dynamic, so it changes over time. This means that the model above will become more accurate as the online betting odds are updated. The algorithm uses an online optimization that I implemented before for a similar problem, so it can capture the trend in the sequence generated by adding new evidence over time when updating the latent variables W. Stay tuned for more updates on this. Finally, discussions and suggestions on how to improve this work are kindly welcome at Omar.Florez@usu.edu

—Omar

Update June 09, 2014: The model matches 100% of the predictions of famous statistician Nate Silver, editor-in-chief of ESPN’s FiveThirtyEight blog, for the leader of each group in the group phase. For the second position, his algorithm tends to favor teams from South America, which actually makes sense. In total, we agree on 81% (13/16) of the teams that will make it to the Second Phase: http://fivethirtyeight.com/interactives/world-cup/

Omar U. Florez

Senior Research Manager in AI at Capital One - Conversational AI Research team. Teaching computers to see, read, and understand | Views & opinions are my own