Modeling Predictions for the 2018 World Cup

Delaney Ambrosen
Kenyon College Sports Analytics
3 min read · Jun 14, 2018

A collaborative effort by Alexander Powell and Delaney Ambrosen

Every four years the world tunes in to watch 32 nations battle it out on the pitch for the World Cup trophy. The 2014 World Cup reached 3.2 billion viewers, with more than one billion people watching the final. It is the most watched live sporting event in the world. And, in just a few days, this tournament will once again capture the world’s attention.

Predicting and modeling the World Cup is significantly more challenging than predicting other tournaments because national teams play far less often than professional club teams, often only every few months. Thus, models for the World Cup must either be based on smaller sets of data or on data that reaches further back into history. Our model is based on data for every international soccer game, dating back to the late 1800s, which we scraped from soccer-db.info using Python.

To build this model from so many matches spread over such a long period of time, we created several weights for different aspects of the games. First, the model weights each match by the level of competition (for example, World Cup matches are weighted more heavily than qualifiers, which are weighted more heavily than friendlies). It also has a time-dependent weight, modeled with a logistic curve for each individual nation, so that recent games are weighted more heavily than games that took place in 1968 (we only looked at the last 50 years for the final simulation). From these weights, we built a mixture model that uses each team’s adjusted offensive and defensive values (goals scored and goals conceded per match).
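As a rough illustration of how the two weights could combine, here is a minimal sketch. The numeric competition weights, the logistic midpoint and steepness, and the names `COMPETITION_WEIGHT`, `time_weight` and `weighted_rates` are all assumptions made for this example; the article gives only the ordering of the weights, not their values, and the actual model fits a separate curve per nation.

```python
import math

# Hypothetical values: the article states only that World Cup > qualifier > friendly.
COMPETITION_WEIGHT = {"world_cup": 3.0, "qualifier": 2.0, "friendly": 1.0}

def time_weight(years_ago, midpoint=12.0, steepness=0.3):
    """Logistic decay: close to 1 for recent matches, close to 0 for old ones.

    midpoint and steepness are illustrative, not fitted values.
    """
    return 1.0 / (1.0 + math.exp(steepness * (years_ago - midpoint)))

def weighted_rates(matches):
    """Return weighted (goals scored, goals conceded) per match for one team.

    matches: list of dicts with keys goals_for, goals_against,
    competition and years_ago.
    """
    total_w = scored = conceded = 0.0
    for m in matches:
        w = COMPETITION_WEIGHT[m["competition"]] * time_weight(m["years_ago"])
        total_w += w
        scored += w * m["goals_for"]
        conceded += w * m["goals_against"]
    if total_w == 0:
        raise ValueError("no weighted matches")
    return scored / total_w, conceded / total_w
```

With these assumed parameters, a match played 12 years ago counts half as much as a brand-new one, and a match from 50 years ago contributes almost nothing.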

We then created a second iteration of the model in which we used each team’s values from the first pass to re-weight their matches by the strength of the opposition. This reduced the inflated scoring values of teams that play in smaller, lower-level confederations, so that historically more successful teams such as Germany and Brazil come out stronger by comparison.
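One way to picture this second pass is to weight each match by how strong the opponent looked in the first pass. The sketch below is only an assumption about the general idea, not the authors' implementation; `opponent_adjusted_offense` and the `first_pass_strength` dictionary are hypothetical names.

```python
def opponent_adjusted_offense(matches, first_pass_strength):
    """Goals scored per match, weighting each match by opponent strength.

    matches: list of (opponent, goals_for) tuples for one team.
    first_pass_strength: dict mapping opponent -> positive strength value
    taken from the first iteration of the model (hypothetical input).
    """
    mean_strength = sum(first_pass_strength.values()) / len(first_pass_strength)
    total_w = weighted_goals = 0.0
    for opponent, goals_for in matches:
        # Goals against strong opponents count more than goals against weak ones.
        w = first_pass_strength[opponent] / mean_strength
        total_w += w
        weighted_goals += w * goals_for
    return weighted_goals / total_w
```

A team that scores five against a weak side and one against a strong side ends up with an adjusted rate below its raw average of three goals per match, which is the deflation effect described above.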

Using this model, we ran the simulation “hot,” meaning that each game we simulated was added to the game database and influenced a team’s performance in its next simulated game. So, if Germany were to win a simulated game against Mexico, Germany’s offensive score would increase, and Mexico’s defensive score would decrease.
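The feedback loop can be sketched in a few lines. This example swaps in a simple Poisson draw for the scoreline, which is an assumption made for illustration and not the mixture model described earlier; the helper names `poisson`, `rates` and `simulate_hot`, and the rule for combining attack and defense, are likewise hypothetical.

```python
import math
import random

def poisson(rng, lam):
    """Draw a Poisson-distributed integer with Knuth's method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def rates(history):
    """Average goals scored and conceded over a list of (goals_for, goals_against)."""
    n = len(history)
    return (sum(gf for gf, _ in history) / n,
            sum(ga for _, ga in history) / n)

def simulate_hot(histories, team_a, team_b, rng):
    """Simulate one match and append the result to both teams' histories."""
    a_off, a_def = rates(histories[team_a])
    b_off, b_def = rates(histories[team_b])
    # Assumed rule: expected goals = mean of own scoring rate and
    # the opponent's conceding rate.
    goals_a = poisson(rng, (a_off + b_def) / 2)
    goals_b = poisson(rng, (b_off + a_def) / 2)
    # "Hot" step: the simulated result feeds into every later simulated match.
    histories[team_a].append((goals_a, goals_b))
    histories[team_b].append((goals_b, goals_a))
    return goals_a, goals_b
```

Because the simulated scoreline is appended to each history, calling `rates` again after the match returns updated offensive and defensive values for both teams.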

In each simulation, we go through each of the group stage matches. This is rather straightforward using the previous weights and assumptions, but it gets tricky when evaluating who advances from each group. We have two functions that work in conjunction with one another to determine the group standings and apply the array of tiebreaker rules FIFA has for the competition. It takes a fair amount of code to look at points accumulated (3 for a win, 1 for a draw), goal differential, goals scored, the head-to-head matchup, and then, if all else fails, a coin flip. After the group stage, the winner and runner-up of each group advance to the round of 16, followed by the quarterfinals, semifinals and the final. The table below displays the probability of advancing to different stages of the tournament for the top 15 teams in our simulations.
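The tiebreaker cascade described above can be sketched as a sort key. This is a simplified illustration under stated assumptions, not the authors' two functions: `table` and `rank_group` are hypothetical names, head-to-head is reduced to points earned in matches among the tied teams, and the coin flip is a random draw.

```python
import random
from collections import defaultdict

def table(results, teams=None):
    """Per-team [points, goal difference, goals scored].

    results: list of (team_a, goals_a, team_b, goals_b).
    If teams is given, only matches between those teams are counted.
    """
    stats = defaultdict(lambda: [0, 0, 0])
    for a, ga, b, gb in results:
        if teams is not None and not (a in teams and b in teams):
            continue
        for team, scored, conceded in ((a, ga, gb), (b, gb, ga)):
            row = stats[team]
            row[0] += 3 if scored > conceded else 1 if scored == conceded else 0
            row[1] += scored - conceded
            row[2] += scored
    return stats

def rank_group(results, rng):
    """Return the group's teams ordered best-first."""
    overall = table(results)
    keys = {team: tuple(row) for team, row in overall.items()}

    def sort_key(team):
        # Teams level on points, goal difference and goals scored.
        tied = {t for t, k in keys.items() if k == keys[team]}
        h2h_points = table(results, tied)[team][0] if len(tied) > 1 else 0
        # Last resort: a random draw stands in for the coin flip.
        return keys[team] + (h2h_points, rng.random())

    return sorted(overall, key=sort_key, reverse=True)
```

The first two teams in the returned list would be the ones that advance to the round of 16.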

Our model predicts that Spain will win the 2018 World Cup, with Brazil and Germany in second and third place. However, our model cannot take into account recent events that could have a significant impact on the teams, such as the recent news that Spain fired its head coach. It will be interesting to see how our model compares to the real results.

Delaney Ambrosen is an Economics major and Math minor at Kenyon College. Alexander Powell is a recently graduated Mathematics/Statistics major. You can email him at aepowell95@gmail.com and find his code for this project and others on his GitHub page: https://github.com/powellae.

Delaney is an economics major and statistics and computer science minor at Kenyon College.