Making World Cup Sausage with Cloud Dataflow and BigQuery

Eric Schmidt
Jun 14, 2018 · 7 min read

The 2018 World Cup is finally here. In the opening match, Saudi Arabia will take the pitch against the host country Russia, who is favored by the majority of analysts. In our opening post we explored the predictive qualities of player data versus team level data and the challenges of World Cup predictions. For this post, we dig into data sources and share some initial modeling output.

Similar to our work in 2014, we are focused on predicting match victors for this World Cup. More specifically, we will be modeling the concept of expected goals (xG) for each team, diving into the likelihood that a shot will result in a goal. There are numerous approaches to xG modeling; as the World Cup plays out, we will share the inner workings of our model. To accomplish our goals (no pun intended), we are using match and touch event data from Opta Sports: specifically, their F8 (match stats) and F24 (event) feeds to harvest some predictive signal for these maddening matches.

Digging Into the Data

The F8 data includes timestamps for only a few important events (cards and goals) and is mostly familiar stats aggregated at the player and team level for each game — things like shot totals broken down by body part, whether they were inside or outside the box, and the end result of the shot. You can find this data on most club and league sites; however, gathering it at scale and with consistent quality would be a challenge. Opta provides quality structured feeds for a massive number of leagues.

The F24 is a much more fine-grained view, capable of telling a lot more of the game story. Imagine the field as an X,Y coordinate plane. More than 75 different types of events are included with location and player information, plus up to 20 pieces of supplemental information about each event, telling you things like the body part used and the play type the event belongs to. One could use the F24 data to replay the majority of match activity. Gathering this information at scale with high quality is challenging at best, as clubs and leagues do not publish this data. Fortunately, Opta provides wide and deep coverage for the leagues at the F24 level.

We built an ingestion pipeline with Cloud Dataflow and Apache Beam to process thousands and thousands of roster, schedule, F8, and F24 files to create a lovely BigQuery dataset. This way we could easily re-run our ingest processes whenever we needed to tweak any logic or mapping — ingesting 5 years of 20+ leagues with a single serverless job! We will have a future post on data pipelining.
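To give a feel for the per-file work inside that pipeline, here is a minimal sketch of the kind of parsing step a Beam DoFn might perform, flattening event XML into BigQuery-ready rows. The element and attribute names (Event, Q, type_id, qualifier_id) are assumptions that follow the general shape of Opta's F24 XML, not its exact schema:

```python
import xml.etree.ElementTree as ET

def parse_f24(xml_text):
    """Flatten F24-style events and their qualifiers into dicts ready
    for a BigQuery sink. Element/attribute names are illustrative."""
    root = ET.fromstring(xml_text)
    rows = []
    for event in root.iter("Event"):
        rows.append({
            "event_id": event.get("id"),
            "type_id": int(event.get("type_id")),
            "x": float(event.get("x")),
            "y": float(event.get("y")),
            # One repeated record per qualifier attached to the event.
            "qualifiers": [
                {"qualifier_id": int(q.get("qualifier_id")),
                 "value": q.get("value")}
                for q in event.findall("Q")
            ],
        })
    return rows
```

In the real pipeline a function like this would sit inside a ParDo, with the resulting dicts streamed into a BigQuery sink.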

Below is a snapshot of the F24 qualifiers output table in BigQuery, coming in at 1.17GB. This is in that awkward middle ground: not super big, yet not small enough to easily manage with simple in-memory data structures for aggregations and feature development.

It’s important to understand that even though our core predictive target is estimating the likelihood that a shot will lead to a goal, the underlying models for F8 and F24 usage are quite different. We built a model using F8 data first, separating shots into four categories: open play outside the box, open play inside the box, free kicks, and penalties. We then include player shot history to arrive at an expected value. We had to do this because some leagues have no F24 data, and/or we decided that the investment-to-value of F24 would not provide meaningful player coverage, so F8 was good enough.
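As a rough illustration of that F8 approach (not the actual model), one can bucket shots into those four categories and use each bucket's historical conversion rate as a baseline expected-goal value; the data layout below is hypothetical:

```python
from collections import defaultdict

def category_xg(shots):
    """Baseline expected-goal value per shot category.

    shots: iterable of (category, scored) pairs, e.g.
    ("open_play_inside_box", True). Returns each category's
    historical conversion rate as its xG value.
    """
    totals = defaultdict(int)
    goals = defaultdict(int)
    for category, scored in shots:
        totals[category] += 1
        goals[category] += int(scored)
    return {c: goals[c] / totals[c] for c in totals}
```

Player shot history can then adjust these baselines up or down for the shooter in question.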

With the raw data loaded into BigQuery and the goal of creating the majority of our candidate features in BigQuery views, it was time to write some SQL. Most features came directly from the raw data; for example, Opta includes a qualifier in the F24 feed saying that a shot was assisted (or not, if the qualifier is absent), making for easy feature creation. Some features are categorical (for example, shot origin play type) and some were calculated (for example, shot distance), but assembling all of this data was done in SQL with BigQuery.
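For a calculated feature like shot distance, the arithmetic is simple whether you do it in SQL or elsewhere. A sketch in Python: Opta-style feeds express locations on a 0–100 coordinate plane with the attacking goal's centre at (100, 50); the 105m x 68m pitch dimensions used here to convert to metres are an assumption:

```python
import math

# Assumed pitch dimensions for converting 0-100 coordinates to metres.
PITCH_LENGTH_M = 105.0
PITCH_WIDTH_M = 68.0

def shot_distance_m(x, y):
    """Distance in metres from (x, y) to the centre of the goal mouth,
    where (100, 50) is the goal centre in Opta-style coordinates."""
    dx = (100.0 - x) / 100.0 * PITCH_LENGTH_M
    dy = (50.0 - y) / 100.0 * PITCH_WIDTH_M
    return math.hypot(dx, dy)
```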

Building Features

Armed with squads, schedules, matches, and touch data we built several BigQuery views to seed our initial feature list. The beauty here is that we can do a great deal of aggregation and shaping in SQL versus doing this in Python or R code. As we make edits to our feature set we simply update the SQL. Below is 1.35GB of shots processed in 12 seconds.

The underlying view aggregates all of the events and qualifiers for all touches and creates our base feature list. It’s a big query.

Again, using views in BigQuery is super powerful, as you can quickly iterate on your feature modeling and push a great deal of processing down into SQL, avoiding the wait time of otherwise processing the data locally in your notebook environment.

Initial Modeling

With the basic feature workflow out of the way, we transition to modeling. Given the nature of our prediction (the probability of a goal from the particular shot in question), we decided to move forward with linear regression models given our prior experience with this data and problem domain. Along with the normal preprocessing to standardize features and to separate our dataset into training and testing groups, we had to reorient some categorical features. All of this was easily done with scikit-learn, using Cloud Datalab as our notebook environment.
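As a self-contained sketch of those preprocessing and fitting steps (using plain NumPy and synthetic data in place of the scikit-learn pipeline and real shot table), one might write:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the real shot table: one numeric feature
# (shot distance) plus a categorical play type that needs one-hot
# encoding -- the "reorienting" of categorical features noted above.
n = 1000
distance = rng.uniform(2, 35, n)
play_type = rng.integers(0, 3, n)  # 3 hypothetical play-type categories
goal = (rng.random(n) < np.clip(0.5 - distance / 80, 0.02, 1.0)).astype(float)

dist_std = (distance - distance.mean()) / distance.std()  # standardize
one_hot = np.eye(3)[play_type]                            # one-hot encode

# Design matrix: intercept, standardized distance, two dummy columns
# (one category dropped to avoid collinearity with the intercept).
X = np.column_stack([np.ones(n), dist_std, one_hot[:, 1:]])

# Train/test split, then an ordinary least-squares fit standing in for
# the scikit-learn linear regression described in the post.
X_train, X_test = X[:800], X[800:]
y_train, y_test = goal[:800], goal[800:]
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

pred = X_test @ coef
avg_xg = float(pred.mean())  # average predicted xG on held-out shots
```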

First we looked at all shots regardless of their path to the goal. Then we looked at clean and on-target shots.

All Shots

Our dataset for all shots (from output above) included:

  • 104,879 shots from 12 different competitions (EPL, Champions, La Liga, World Cup Qualifiers, …)
  • Trained on 77,311 shots between June 2006 and September 2017
  • Tested on 27,568 shots from October 2017 to May 2018
  • Overall expected goal value from test set: 0.109

We assessed this model’s performance in two main ways: with actual goal conversion rates (from the data we tested on) and with our separate F8 dataset.

There is a 0.003 difference between the predicted and actual values. As you can see, our average expected goal value projection splits the difference between these two benchmarks.

From the 27,568 shots in our test set, the average expected goal projection for true positives was 0.402. For true negatives it was 0.072. That separation is a significant indicator that the model distinguishes scoring chances from non-scoring ones.
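That comparison is simple to compute: split the test shots by actual outcome and average the model's predicted xG within each group. A sketch with made-up numbers:

```python
def xg_split(predictions, outcomes):
    """Average predicted xG among shots that scored vs. those that did not.

    predictions: per-shot expected-goal values; outcomes: 1 if goal else 0.
    Returns (avg for converted shots, avg for unconverted shots).
    """
    scored = [p for p, o in zip(predictions, outcomes) if o]
    missed = [p for p, o in zip(predictions, outcomes) if not o]
    return sum(scored) / len(scored), sum(missed) / len(missed)
```

A wide gap between the two averages is exactly the separation reported above.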

Shots On Target — Goalkeeper’s View

When we drill into goalkeeper performance, we face a much smaller dataset. This is because a keeper does not have any effect (at least in terms of an on-ball event…) on a shot that is blocked by the defense or is eventually off-target. With that in mind, to assess goalkeeping performance, we looked at:

  • 34,269 shots (on target)
  • Trained on 25,061 shots between June 2006 and September 2017
  • Tested on 9,028 shots between October 2017 and May 2018
  • Overall expected goal value from test set: 0.330

Again, we assessed performance using both the actual conversion rate of this test set and a comparison with our F8 data.

Similar to our all-shots tests, our predicted vs. actual values for on-target shots were very close. Looking at goal conversion, there is a clear signal that shots on target convert to actual goals at a higher rate. This makes sense. Shoot the ball at the net!

IMPORTANT: Conversion rates are not uniform across all competitions. For example, the English Premier League conversion rate is different than Major League Soccer’s, which is different than the Chinese Super League’s. In our opening post we explored the challenge of formulating a dataset for a tournament that happens only every four years, in which the teams rarely play games of similar significance.

The blending of historical league and World Cup data for our data sausage helped us bootstrap initial models. In future posts we will explore weights to handle league effects, competition quality, and other factors that can contribute to a more precisely “spiced” sausage. For now this passes the sniff test, at least enough to move forward with our analysis.

Wrapping Up

Armed with data ingested into BigQuery and a collection of actionable SQL views to drive our initial models, we now turn to model refinement and actual World Cup match projections. This will be done on a per-game basis as lineups are announced (85 minutes prior to the match). Stay tuned and good luck to your country of choice!
