Final Squads — Analyzing 1,472 Feet

Welcome to the inaugural post for “Analyzing the World Cup using Google Cloud”, a publication that carries forward work from the previous World Cup in 2014 (blog, predictions, and Strata session). Our lineup for this 21st FIFA World Cup is a bit different: some new players, new tools, and new strategies. Even with all the new wrinkles, we still have the same goal in front of us: have fun analyzing the most maddening and beautiful game on the planet.

In this post we walk through the questions we want to answer and the data we might need to achieve our GOOAAAALLLSSS!!!

Spoiler alert: there is no tech or Google Cloud talk in this post, just a conversation about the problem space and data challenges.

On June 14th, 2018, 32 teams will begin competing to become the world champion of men’s football. Only eight countries have won the previous 20 World Cups, with Brazil holding five titles and Germany four. On the surface, the odds are foreboding for newcomers, but this may not be a “typical” World Cup. For example, four-time champion Italy did not qualify this time around, nor did the United States. Iceland, however, with a population of 334,252, qualified for its first ever World Cup appearance.

Regardless of the cathartic reasons, crazy conspiracies, and technical theories as to why certain teams qualified and others did not, we are faced with the forward-looking question: “Who is going to win?” The internet is filled with speculation: some predictions favor Germany, others Brazil, others France. We thought it would be fun to kick this ball around, but from a slightly different perspective, focusing not only on who will win but also on why they will win.

Team v Player

With only 32 teams in the tournament, one might think this is an elementary predictive analysis challenge. To create a model, one could simply look at a team’s prior tournament performances, or at a team’s qualifying matches (10 to 20 depending on continent), or at the players on each team. Each of these approaches has its own challenges.

Looking solely at a team’s prior tournament performances assumes team continuity, which is a false assumption given that rosters and coaches are constantly changing. Iceland, for one, has no tournament priors at all. If you look at a team’s qualifying matches, you have a limited number of observations, and those are biased by continental grouping as well as by roster changes.

Consider this example: Brazil owns the most famous kit in international soccer, but do the legendary performances of Pelé, Zico, Ronaldo, Ronaldinho, etc. have any actual effect on the team today? Of course not. Brazil lost to Germany 7–1 in the 2014 World Cup semifinal while Neymar was injured and Thiago Silva suspended. On that day, the fact that they were missing their two best players meant far more than anything accomplished by Pelé, Zico or, more recently, Ronaldo or Ronaldinho. A projection based simply on past team performances wouldn’t capture the absence of these world class athletes. Could a team-based model have accounted for the missing weight of the two starters?

Although straightforward to acquire and analyze, team level data can be too coarse grained and lossy. This isn’t to say that team level data has poor predictive qualities. On the contrary, it can be very powerful, especially when combined with player level data. To be sure, several factors can impact a team’s performance: coaches, trainers, technical directors, kit managers, and politics. However, at the end of the day, the players on the field drive the majority of the team’s performance.

Hunting Clubs

As we started planning our analysis strategy, we decided to hunt for club and international player level data for each national team player, in addition to team data. Turning to more fine grained player level data expanded our analysis scope from 32 to 736 subjects, as final rosters include 23 players per team. 736 is by no means a big data challenge, but it does present some additional challenges, namely timing and scarcity.

Timing: We started our project in early April 2018; however, final rosters weren’t due until June 4th. You can read the final rosters here. Therefore, if we wanted to implement player level analysis, we’d be time constrained. Once we get into World Cup matches, lineups aren’t due until 85 minutes before kickoff, so we also face time constraints in seeking more model accuracy (more on this later).

Resource Scarcity: Even if we knew the final 736 players, the club diversity for these players varies wildly (e.g. Iran’s National Team), and in some cases the data is very resource intensive to acquire. Acquiring data for a player with limited playing experience on a sub-par club team presents a tough cost/benefit trade-off that we had to optimize for.

Scoping begets more challenges.

Estimating Rosters

With limitations on both time and data acquisition, we developed a methodology to estimate who might end up on each World Cup roster. This process, empirical in nature, began by looking at the most recent call-ups for each country’s squad. Based on research, we added a few players here and there to create a pool of 801 players. The pool represents more than 50 different domestic leagues around the world, from the English Premier League, which hosts 120 of our sample pool, all the way down to Guinée Championnat National, which is represented by a sole individual.

Twenty-one different leagues are projected to have 10 or more players in Russia this summer, making them a natural place to start. In many cases, these were important leagues to get data from: the EPL, La Liga, Serie A, the Bundesliga and Ligue 1 are all projected to have 50+ players at the World Cup, spread among 18+ nationalities. Other leagues had a high number of players projected at the tournament, but their coverage wasn’t as strong because those players all play for the same country. For example, the Iran Pro League and Liga Panameña de Fútbol are each projected to have 14 players in Russia, tied for the 15th most of any league. The problem is that every player projected to go to the World Cup from these leagues plays for the Iranian or Panamanian national team. On the flip side, the Portuguese Primeira Liga is projected to send 16 players, a marginal upgrade on the Iran Pro League and Liga Panameña de Fútbol. The bigger difference is that those 16 players are spread out among 10 different countries, meaning the addition of that one league can tell us something about almost ⅓ of the teams playing in Russia this summer, rather than telling us a lot about one particular team.
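To make the breadth-versus-depth trade-off concrete, here is a minimal sketch that ranks the three leagues named above by how many national teams each one touches. The player and country counts come straight from this post; the `LeagueProjection` class and the ranking rule are our own illustrative assumptions, not a description of the actual pipeline.

```python
from dataclasses import dataclass

TEAMS_IN_TOURNAMENT = 32


@dataclass(frozen=True)
class LeagueProjection:
    """Hypothetical record for one domestic league (illustrative only)."""
    name: str
    players: int         # projected World Cup players from this league
    national_teams: int  # distinct national teams those players represent

    @property
    def team_share(self) -> float:
        """Fraction of the 32 World Cup teams this league tells us about."""
        return self.national_teams / TEAMS_IN_TOURNAMENT


leagues = [
    LeagueProjection("Iran Pro League", 14, 1),
    LeagueProjection("Liga Panameña de Fútbol", 14, 1),
    LeagueProjection("Primeira Liga", 16, 10),
]

# Assumed ranking rule: breadth (national teams) first, player count second.
ranked = sorted(leagues, key=lambda lg: (lg.national_teams, lg.players), reverse=True)

for lg in ranked:
    print(f"{lg.name}: {lg.players} players, {lg.team_share:.0%} of teams")
```

Sorting on breadth first puts the Primeira Liga on top even though it sends only two more players than the other two leagues.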

Our goal was to have domestic data coverage for at least 75% of the final player pool, along with international coverage for all 736 players. We will then use this individual player data, coupled with team level data, to build out our predictive analysis.
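As a quick worked example of what that target means in head count, the arithmetic can be sketched as follows. The constants come from this post; the `meets_domestic_target` helper is a hypothetical name introduced only for illustration.

```python
import math

TEAMS = 32
SQUAD_SIZE = 23
DOMESTIC_TARGET = 0.75  # coverage goal stated above

# Final pool: 32 squads of 23 players each.
final_pool = TEAMS * SQUAD_SIZE

# Smallest number of players with domestic data that satisfies the goal.
min_domestic = math.ceil(DOMESTIC_TARGET * final_pool)


def meets_domestic_target(players_with_domestic_data: int) -> bool:
    """Hypothetical helper: True once domestic coverage reaches the target."""
    return players_with_domestic_data >= min_domestic


print(final_pool, min_domestic)
```

In other words, hitting the goal means having domestic data for at least 552 of the 736 players.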

Early Team Looks

While we were waiting on final rosters, we took a cursory look at team level data, scouring it for upset and underdog potential through the lens of goals conceded.

Croatia conceded only five goals in their 12 qualifying matches and definitely have enough firepower to scare teams going forward, but a largely similar group crashed out in the group stage in 2014. Spain and England each conceded only three goals in their 10 qualifying matches, while Germany and Portugal each conceded four. But clearly, none of those four would be considered “underdogs.” Each has won a major international tournament (all but England in the last decade) and is among bookmakers’ top 10 favorites to win the entire tournament. The best team from a defensive standpoint in CONMEBOL qualifying was Brazil, conceding 11 in 18 games, but if anyone on Earth would be surprised to see Brazil make a deep tournament run, they haven’t been following the sport.

In Asia, Japan went the entire first qualifying round (eight matches) without conceding a single goal, but that was in a particularly weak group, and they went on to concede seven in their final 10 qualifying matches. Iran could be a potential defensive wildcard, as their 18 qualifying matches saw them concede only five goals. Iran has been to the World Cup four times, never advancing past the first round. While their defensive record might indicate they are due to break that streak, it would be a massive upset if they did, as their group includes Morocco (more on them later) and Iberian powerhouses Spain and Portugal (see above).
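To compare these defensive records on a common scale, here is a small sketch that converts the totals quoted above into goals conceded per match. The figures are the ones cited in this post; the ranking itself is only illustrative, since qualifying opposition varies a lot in strength from one confederation to another.

```python
# (goals conceded, qualifying matches) as quoted in this post
qualifying_defense = {
    "Croatia": (5, 12),
    "Spain": (3, 10),
    "England": (3, 10),
    "Germany": (4, 10),
    "Portugal": (4, 10),
    "Brazil": (11, 18),
    "Iran": (5, 18),
}


def conceded_per_match(goals: int, matches: int) -> float:
    """Average goals conceded per qualifying match."""
    return goals / matches


rates = {
    team: conceded_per_match(goals, matches)
    for team, (goals, matches) in qualifying_defense.items()
}

# Lowest rate first, i.e. the stingiest defense on paper.
by_stinginess = sorted(rates, key=rates.get)

for team in by_stinginess:
    print(f"{team}: {rates[team]:.2f} conceded per match")
```

On a per-match basis Iran edges out Spain and England, which is why they stand out as a potential defensive wildcard.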

Both Costa Rica and Mexico finished CONCACAF qualifying under 0.7 goals conceded per game, and each is in a group with one perennial powerhouse (Brazil and Germany, respectively). In each group, the three other teams can expect to fight it out for second place, so defensive resiliency could enable either one to advance to the knockout round.

Each team qualifying from Africa did so by conceding fewer than a goal per match, but the standout was clearly Morocco. They conceded only once in eight qualifying matches and have not been scored on in their last six competitive fixtures, including two shutouts against the Ivory Coast. As impressive as this record is, advancing past the group stage would be far more impressive. Remember, they are in that group with three teams already highlighted: Spain, Portugal and Iran. Morocco is currently +1600 to win Group B and +50000 to win the tournament…
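For readers less familiar with this notation, here is a minimal sketch that turns those prices into implied probabilities. It assumes the figures are American moneyline odds, and the `implied_probability` function is our own illustrative helper. Bookmaker prices include a margin, so these numbers overstate the true chance slightly.

```python
def implied_probability(moneyline: int) -> float:
    """Convert American moneyline odds to an implied probability.

    Positive odds (+1600) show the profit on a 100 stake; negative odds
    (-200) show the stake needed to win 100. The bookmaker's margin is
    not removed.
    """
    if moneyline == 0:
        raise ValueError("moneyline odds cannot be zero")
    if moneyline > 0:
        return 100 / (moneyline + 100)
    return -moneyline / (-moneyline + 100)


# Prices quoted above for Morocco.
group_b = implied_probability(1600)
tournament = implied_probability(50000)

print(f"Win Group B: {group_b:.1%}")
print(f"Win the tournament: {tournament:.2%}")
```

Put differently, the market gives Morocco roughly a 1 in 17 chance of topping the group and about a 1 in 500 chance of lifting the trophy.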

Fulltime

We look forward to the upcoming weeks of matches and international theater. In our next few posts, we’ll explore our data coverage, ETL processes, feature modeling, and hopefully a few insightful predictions using Google Cloud. Right now it’s time to look at those final 1,472 feet.
