From Football Newbies to NFL (data) Champions | A Winner’s Interview with The Zoo
In our first winner’s interview of 2020, we’d like to congratulate The Zoo on their first place win in the NFL Big Data Bowl competition! Please enjoy this joint Q&A between top competitors and teammates, Dmitry and Philipp.
Have follow-up questions for them? Leave a comment below.
Let’s start with an introduction. Who is ‘The Zoo’?
We started competing together on Kaggle about a year ago. Back then, we had just started working at the same company in Austria as data scientists. We got to know each other outside of work and realized we had the same desire to get more practice with real-world machine learning tasks. We started doing a Friday afternoon hackathon, and one day decided to try a Kaggle competition just to see how it goes. Both of us had little Kaggle experience, so we went into it without any expectations outside of the desire to learn new things.
Since we tackled textual problems at work, it seemed obvious to start with an NLP competition (Quora Insincere Questions). We came up with a team name, and went for it. What started as a weekly hackathon ended in an intense two months of work, ending up with an astonishing and exciting win. Since then, it was difficult to stop — as soon as one competition was finishing, we were jumping to the next one. Being primarily motivated to learn, we tried to pick different types of tasks and team up with new people.
Hooked after your first competition! That happens 😁. How did you decide on the team name ‘The Zoo’?
At work, we set up several virtual machines, serving different purposes, and to avoid memorizing all the IP addresses, we started calling them by animal names. So you can hear stuff like “elephant has grown too big” or “falcon is so fast now” in our room, which makes sense to us, but for an outsider, it sounds like we might have gotten a bit insane and now believe we work in the zoo. The pictures of animals hanging on the walls labeled with IP addresses do not make it any better. That’s how the name The Zoo came up, and as it was a lucky charm for us from the start, we stuck to it.
After successfully competing in further competitions, we saw the NFL Big Data Bowl competition pop up. We were quite quickly intrigued by it and decided to participate from the start. While both of us are interested in various kinds of sports, primarily soccer, we had little to no knowledge about American football and the NFL. But not knowing the details of the domain had not stopped us before, and luckily it was no showstopper this time either: it turned out to be probably the best and most enjoyable competition we have ever participated in. We even know some of the rules now, perhaps more than most people in Europe.
What was your background before entering this challenge?
Most of our achievements on Kaggle are the results of our teamwork over the past year. We are especially proud of them, as the competitions we participated in were from a wide range of areas, including NLP, tabular data, signal data, computer vision, and time series. Now we can add sports analytics to the list, but before that the top results were:
- Quora Insincere Questions Classification: 1st out of 4037
- LANL Earthquake Prediction: 1st out of 4521
- IEEE-CIS Fraud Detection: 6th out of 6381
- Santander Customer Transaction Prediction: 7th out of 8802
- APTOS 2019 Blindness Detection: 9th out of 2931
Can you tell us a bit more about your personal backgrounds before joining Kaggle as a team?
Dmitry: I studied at Moscow State University and graduated as a specialist in applied math/data mining. It gave me a strong background in math, statistics, and computer science. I worked for financial institutions in the area of risk management for quite a long time but decided to switch to a more general data science role a few years ago. I tried a couple of Kaggle competitions 3–4 years ago and got my first gold medal back then, but after that, I had a break until around a year ago due to lack of time.
Philipp: I studied software engineering and economics in Austria, where I also finished my Ph.D. in computer science. My Ph.D. was highly data science-related, focusing a lot on statistical methods for studying phenomena on the Web. After my Ph.D., I was employed as a postdoctoral researcher in Germany following up on my research before I decided to switch from academia to industry, joining an insurance company in Austria. That means I already had quite some experience in data science before joining Kaggle, but only started to really dive deep into machine learning since then.
So, we know you knew almost nothing about American football at the start of this competition. Did either of you have any other prior experience or knowledge that helped you succeed?
We had almost no domain knowledge before we started and relatively little after it finished. As in many other competitions, it is machine learning skills and experience which usually make the difference. We strongly believe that data scientists can adapt quite quickly to new kinds of data. However, NFL Big Data Bowl showed how important it is first to understand what stands behind the data and predictions, think what approach makes the most sense in this particular case, and proceed with the modeling accordingly.
Usually, it takes at least half the time of the competition to find out which techniques and which prior experience help you succeed. This time was no exception. Closer to the end, we realized that the most relevant experience we had was from NLP tasks, and the Quora Insincere Questions Classification competition in particular. What also became clear to us was that top competitors were applying Transformers to this task. That may sound strange, but if you go through the solutions of the Predicting Molecular Properties contest, it becomes clear how this powerful NLP approach can be translated to more “geometrical” tasks.
Let’s get technical. Share what you can about the data, the problem itself, and how you got started.
There is a lot of manually collected data in sports, including quite detailed numbers and stats per game and even per player. The NFL worked hard on extending the analytical capabilities of the teams and introduced the proprietary ‘Next Gen Stats’. In a nutshell, the NFL started collecting tracking data for each player in each game at a frequency of 10 records per second, which includes a player’s position, speed, acceleration, direction of movement, and orientation. Since 2017, the data has also been distributed back to the teams, allowing them to analyze all the plays of all the teams in the league. The NFL sponsored the Big Data Bowl competition on Kaggle to show how much value this data can bring for analytics, focusing in this competition on the task of predicting how far a rusher will run.
As we were and still are American football newbies, it was inevitable to start with learning the basics. Therefore, we will begin by explaining the basics of American football just as we learned them during the competition. Most of these things should be trivial to people who have watched at least a couple of NFL games, and we apologize for potentially explaining some of the details poorly.
The goal of the team with the ball (the offense) is to advance and eventually reach the opposite end of the field to score points. The team without the ball (the defense) is trying to prevent the offense from advancing, for example by tackling the ball carrier. The game is split into plays or downs, each of which starts with an offense player passing the ball to one of his teammates and ends, for example, with the ball carrier either getting tackled by the defense or reaching the end of the field. The down starts from the spot to which the offense last advanced; the distance from that spot to the end of the field is often called the yardline.
The offense has four downs to advance at least 10 yards. If the offense succeeds in advancing at least 10 yards, the number of tries is reset, and they have four downs to advance at least 10 yards from the new yardline. If the offense fails to do so after four downs, possession is given to the defense. During a play, each team has 11 players on the field, and each of them has a specific position and tasks assigned for that play (e.g., center, quarterback, running back). An NFL team usually has 53 players on the roster, and most of the players specialize in either defense or offense plays.
There is, of course, much more about the game rules and peculiarities, but we will focus only on the details necessary for the competition task.
Typically, a down starts with a snap, where the center throws or hands the ball backward to one of the backs, usually the quarterback. The quarterback is the leader of the offense. His job is typically either to hand the ball off to one of his running backs or to throw the ball to an open teammate. In some instances, the quarterback will run the ball himself. The competition was dedicated to the first type of play, called running (rushing) plays. The input data was a snapshot of the play at the moment of handoff, the moment when the quarterback hands the ball off to a running back (a role specialized in running plays). The goal was to predict how far the ball carrier advances forward by the end of the play.
The training dataset contained the data at the moment of the handoff from all the running plays of all the games of the 2017 and 2018 seasons. The public leaderboard was based on data from games that took place in September 2019. To eliminate any leaks, the private leaderboard was constructed from December 2019 games, while the submission deadline was set to the end of November. In other words, the models were assessed on games that had not even taken place by the submission deadline, which is a nice and robust way to validate them.
The evaluation metric is usually the first thing to check when one starts a competition. In this case, it was the Continuous Ranked Probability Score (CRPS), which neither of us had seen before. But, after exploring it for a bit, we realized there is nothing exceptional about it. Instead of predicting an average like in a regression case, you need to predict a cumulative distribution of yards, meaning that for each yardage you predict the probability of the rusher running at most that far. Even though we ended up optimizing CRPS directly, it was sufficient to treat the problem as a multi-class classification task with the outcomes ranging from -99 to 99 yards.
As a second step, we naturally delved into the data and started to think about feature engineering. The training set contains around 23,000 plays with variables for each of the 22 players, such as coordinates on the field, speed, acceleration, direction of movement, orientation, position, age, height, weight, name, jersey number, and even college name. Some general information is also available, such as the type of formation, names of the teams, score, which down it is, how many yards are to go, stadium name, playing surface type, weather, temperature, humidity, and wind direction and speed. Of course, most of it turned out to be useless for the task, but such data gives plenty of room for creative feature engineering. Some of the general features can be used as predictors, but all of them are very weak, even though one could have expected running plays to have different expected values on the 4th down, when a team is losing, or when a successful rusher has the ball.
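To make the metric concrete, here is a minimal sketch of CRPS as we understand the competition's definition: the mean squared gap between the predicted cumulative distribution and the step function that jumps from 0 to 1 at the true yardage, averaged over the 199 yardage values and over plays. The function names `crps` and `pmf_to_cdf` are our own and purely illustrative.

```python
def pmf_to_cdf(pmf):
    """Turn multi-class probabilities over yardages into a cumulative distribution."""
    cdf, running = [], 0.0
    for p in pmf:
        running += p
        cdf.append(min(running, 1.0))  # guard against rounding above 1
    return cdf


def crps(pred_cdfs, true_yards, lo=-99, hi=99):
    """Mean squared gap between predicted CDFs and the true step functions."""
    n_bins = hi - lo + 1  # 199 possible yardage values
    total = 0.0
    for cdf, y in zip(pred_cdfs, true_yards):
        assert len(cdf) == n_bins
        for k, p in enumerate(cdf):
            n = lo + k
            step = 1.0 if n >= y else 0.0  # Heaviside step H(n - y)
            total += (p - step) ** 2
    return total / (n_bins * len(true_yards))


# A prediction that puts all probability on 5 yards when the true gain is 3.
pmf = [0.0] * 199
pmf[5 + 99] = 1.0
score = crps([pmf_to_cdf(pmf)], [3])
```

A perfect prediction scores 0, and a confident prediction that is off by two yards is penalized on exactly the two yardage values where the two step functions disagree.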
The most successful features tried to assess a rusher’s position against the defenders, which is quite logical. Variables such as the distance to the closest defender, the speed of the nearest defender, and the speed at which the closest defender is approaching the rusher were easy to generate and add to any model. Also, one can try to assess how much free space the rusher has in the direction of his movement, and there are several excellent ways to do that, for example by using Voronoi areas. Further feature ideas can include the second closest defender, the closest offensive player, the offensive player closest to the closest defender, and so forth. We then utilized these and similar features to model the problem with classic gradient boosted tree models like LightGBM and simple feed-forward neural networks, which worked reasonably well. But, after 1–2 weeks, we realized that such an approach is very limited, and there are multiple reasons why:
- The generated features are arbitrary and probably cannot assess the positions in full.
- The features usually take into account a single defender, leaving ten others out.
- All attempts to generate features across all defenders more or less failed. After thinking about it, this makes sense as each player has a different contribution in stopping the rusher, based on their relative locations, speeds, and movement directions.
- Attempts to generate features for every player for either LGB or feed-forward NN required some ordering of the players, such as by distance to the rusher. Any ordering would have been arbitrary and hence not optimal.
- One cannot solve these issues using player positions, as they are quite unreliable in this situation: some positions are entirely missing in some plays, and they are, to some extent, judgmental. Besides, during a running play, the attention of each defender is on the rusher, regardless of the role the defender had in the beginning.
- It is even more challenging to come up with useful features using data of the offense.
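For reference, the kind of hand-crafted feature discussed above (distance to the closest defender, his speed, and how fast he is closing in) could be computed roughly as follows. This is an illustrative sketch, not our competition code: the dictionary layout and the assumption that velocities are already available as x/y components are ours.

```python
import math


def closest_defender_features(rusher, defenders):
    """rusher / defenders: dicts with x, y (yards) and vx, vy (yards per second)."""
    best = None
    for d in defenders:
        dx, dy = d["x"] - rusher["x"], d["y"] - rusher["y"]
        dist = math.hypot(dx, dy)
        if best is None or dist < best["dist"]:
            # Relative velocity of the defender with respect to the rusher.
            rvx, rvy = d["vx"] - rusher["vx"], d["vy"] - rusher["vy"]
            # Closing speed: rate at which the gap shrinks (positive = approaching).
            closing = -(dx * rvx + dy * rvy) / dist if dist > 0 else 0.0
            best = {
                "dist": dist,
                "speed": math.hypot(d["vx"], d["vy"]),
                "closing": closing,
            }
    return best


rusher = {"x": 0.0, "y": 0.0, "vx": 5.0, "vy": 0.0}
defenders = [
    {"x": 3.0, "y": 4.0, "vx": 0.0, "vy": 0.0},
    {"x": 10.0, "y": 0.0, "vx": -2.0, "vy": 0.0},
]
features = closest_defender_features(rusher, defenders)
```

The sketch also shows the limitation listed above: only one of the eleven defenders ends up contributing to the features.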
So we decided to start over and approach this task a bit differently using what we have learned before.
To simplify things, we decided to begin with removing the offense players from the picture for a bit, except for the ball carrier, of course. Now the whole setup becomes much more straightforward: one player is rushing towards the end of the field, and 11 defenders are trying to tackle him as fast as possible to minimize his progression. It is safe to assume that any of the defenders can end up tackling the rusher, just the chances of that happening depend on their relative locations and velocities. It is also likely that there is little to no cooperation between defenders at this stage — each one is merely trying to stop the rusher. So, the desired approach should follow these basic principles:
- Disregard dependencies between defenders.
- Do not impose any order of the players.
- Assess each defender against the rusher independently.
- Aggregate the assessments across defenders to make the prediction.
- Preferably, we have an automated approach to generate features.
After that, it became quite clear what exactly we should try out. We need a neural network that can work with an unordered set of players and which takes the data of each rusher-defender pair, learns the patterns from their tracking data, and then aggregates the results across all defenders. So, it is a convolutional neural network with a window size of 1 and a pooling on top. Instead of supplying the coordinates and speed of the rusher for each defender, it also made sense to simply subtract them from the coordinates and speed of each defender. So for each defender, we are using location and speed relative to the rusher. After a couple of attempts, we managed to fit a NN outperforming our best LGB. And that was achieved without tuning the NN and without using data about the ten offense players.
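The two ideas in this step, features relative to the rusher and order-independent aggregation, can be sketched in a few lines of NumPy. The column layout (x, y, vx, vy), the single shared linear map, and the choice of max pooling here are simplifying assumptions for illustration only.

```python
import numpy as np


def relative_to_rusher(rusher, defenders):
    """rusher: (4,) array [x, y, vx, vy]; defenders: (11, 4) array, same columns."""
    return defenders - rusher  # broadcasting subtracts the rusher from every row


def pooled_embedding(rel, weight, bias):
    """Apply one shared linear map to every defender, then pool over defenders."""
    per_defender = rel @ weight + bias  # (11, hidden)
    return per_defender.max(axis=0)     # (hidden,), independent of row order


rng = np.random.default_rng(0)
rusher = np.array([10.0, 20.0, 4.0, 1.0])
defenders = rng.normal(size=(11, 4))
weight, bias = rng.normal(size=(4, 6)), rng.normal(size=6)

rel = relative_to_rusher(rusher, defenders)
embedding = pooled_embedding(rel, weight, bias)
```

Because every defender goes through the same weights and the pooling ignores row order, shuffling the defenders leaves the result unchanged, so no artificial ordering of players is needed.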
Tuning the NN took us quite some time, but was very fruitful. Probably the most sensitive part of the structure was to define how the network extracts the information from a rusher-defender pair. A convolution is nothing else than a single dense layer applied to each defender repeatedly. So, by default, we capture only linear features. To allow more flexible feature extraction, it is typical to have a sequence of dense + activation blocks, which in this case is simply a sequence of convolution (window size = 1) + activation blocks. The optimal structure had three such blocks.
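The claim that a window-size-1 convolution is just a dense layer applied to each defender can be checked directly. Below is a small NumPy sketch with a generic 1D convolution written out by hand; the layer sizes and random weights are arbitrary placeholders, not the tuned values.

```python
import numpy as np


def conv1d(x, kernels, bias):
    """Valid 1D convolution. x: (length, in_ch); kernels: (window, in_ch, out_ch)."""
    window = kernels.shape[0]
    out_len = x.shape[0] - window + 1
    out = np.empty((out_len, kernels.shape[2]))
    for i in range(out_len):
        patch = x[i:i + window]  # (window, in_ch)
        out[i] = np.einsum("wi,wio->o", patch, kernels) + bias
    return out


def relu(x):
    return np.maximum(x, 0.0)


def pair_encoder(x, layers):
    """Stack of convolution (window size = 1) + activation blocks."""
    for weight, bias in layers:
        x = relu(conv1d(x, weight[None], bias))  # weight[None] adds the window axis
    return x


rng = np.random.default_rng(0)
x = rng.normal(size=(11, 4))  # 11 defenders, 4 relative features each
sizes = [4, 16, 16, 16]       # three blocks, hidden width chosen arbitrarily
layers = [(rng.normal(size=(a, b)), rng.normal(size=b))
          for a, b in zip(sizes[:-1], sizes[1:])]

w0, b0 = layers[0]
same_as_dense = np.allclose(conv1d(x, w0[None], b0), x @ w0 + b0)
encoded = pair_encoder(x, layers)
```

With a single block and no activation the per-defender features are linear in the inputs; the stacked blocks are what allow non-linear interactions between relative position and velocity.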
The next task was to bring the ten missing offense players back into the game. What is their responsibility during a running play? To the best of our understanding, they are only blocking the defense, trying not to let them come near the rusher. It also seems like they do not require cooperation with each other in these circumstances. In a nutshell, we see a similar setup as before: a defender is running forward, and ten players are stopping him. So, it was natural to apply convolutions over the ten offense players for each defender and then do pooling.
Putting both steps together, we implemented the following logic:
- For each defender, we assess his position and movements against each of 10 offense players, provided their locations and velocities relative to each other and the rusher. Then aggregate across offense players.
- We assess each defender’s position against the ball carrier in the same manner, utilizing how well the offense is blocking him from the previous step. Then aggregate across defense players to predict the expected yardage of the rush.
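One way to prepare the input for this two-step logic is a tensor with one cell per defender-offense pair. The sketch below assumes four columns per player (x, y, vx, vy) and a particular choice of eight pair features; both are illustrative assumptions rather than the exact feature set we used.

```python
import numpy as np


def pairwise_features(rusher, defense, offense):
    """rusher: (4,), defense: (11, 4), offense: (10, 4); columns are x, y, vx, vy.

    Returns an (11, 10, 8) array: for each defender-offense pair, the defender
    relative to the rusher followed by the offense player relative to the defender.
    """
    def_rel = defense - rusher                              # (11, 4)
    off_vs_def = offense[None, :, :] - defense[:, None, :]  # (11, 10, 4)
    def_rel_tiled = np.broadcast_to(def_rel[:, None, :], off_vs_def.shape)
    return np.concatenate([def_rel_tiled, off_vs_def], axis=-1)


rng = np.random.default_rng(1)
rusher = rng.normal(size=4)
defense = rng.normal(size=(11, 4))
offense = rng.normal(size=(10, 4))
feats = pairwise_features(rusher, defense, offense)
```

Pooling over the second axis aggregates across offense players for each defender, and pooling over the first axis afterwards aggregates across defenders.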
Given that, the simplified neural network structure looks like this:
The first block of convolutions learns to work with defense-offense pairs of players, using geometric features relative to the rusher. The second block of convolutions determines the necessary information per defense player before the aggregation. The final block consists of typical dense layers before the output of yards to be predicted. For pooling, we use a weighted sum of average and max pooling. We use ReLU activations and Batch Normalization and directly optimize the CRPS metric. For training, we use Adam with a fixed one-cycle scheduler and implement everything in PyTorch.
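A compact PyTorch sketch of this structure is given below. It is a simplified illustration under stated assumptions, not our exact model: the layer widths, number of blocks, pooling weight, order of activation and normalization, learning rates, and the names `RushNet`, `MixedPool`, and `crps_loss` are all placeholders chosen for the example.

```python
import torch
import torch.nn as nn


class MixedPool(nn.Module):
    """Weighted sum of average and max pooling over one dimension."""

    def __init__(self, dim, avg_weight=0.7):  # 0.7 is an assumed weight
        super().__init__()
        self.dim, self.avg_weight = dim, avg_weight

    def forward(self, x):
        return self.avg_weight * x.mean(dim=self.dim) + (1 - self.avg_weight) * x.amax(dim=self.dim)


def conv_block(conv, norm, c_in, c_out):
    # Convolution with window size 1, then activation and batch normalization.
    return nn.Sequential(conv(c_in, c_out, kernel_size=1), nn.ReLU(), norm(c_out))


class RushNet(nn.Module):
    def __init__(self, n_features=8, hidden=32, n_bins=199):
        super().__init__()
        # Block 1: defender-offense pairs, input (batch, features, 11, 10).
        self.pair_convs = nn.Sequential(
            conv_block(nn.Conv2d, nn.BatchNorm2d, n_features, hidden),
            conv_block(nn.Conv2d, nn.BatchNorm2d, hidden, hidden),
            conv_block(nn.Conv2d, nn.BatchNorm2d, hidden, hidden),
        )
        self.pool_offense = MixedPool(dim=3)
        # Block 2: one vector per defender, shape (batch, hidden, 11).
        self.defender_convs = nn.Sequential(
            conv_block(nn.Conv1d, nn.BatchNorm1d, hidden, hidden),
            conv_block(nn.Conv1d, nn.BatchNorm1d, hidden, hidden),
        )
        self.pool_defense = MixedPool(dim=2)
        # Block 3: dense layers producing one logit per yardage value.
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, n_bins))

    def forward(self, x):
        x = self.pool_offense(self.pair_convs(x))      # (batch, hidden, 11)
        x = self.pool_defense(self.defender_convs(x))  # (batch, hidden)
        pmf = torch.softmax(self.head(x), dim=1)
        return torch.cumsum(pmf, dim=1).clamp(max=1.0)  # predicted CDF over -99..99


def crps_loss(cdf, yards, lo=-99):
    """CRPS for integer true yardages `yards` of shape (batch,)."""
    grid = torch.arange(cdf.shape[1], device=cdf.device) + lo
    step = (grid[None, :] >= yards[:, None]).float()
    return ((cdf - step) ** 2).mean()


def demo_training_step():
    """One optimization step on random stand-in data (not real tracking data)."""
    model = RushNet()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=1e-3, total_steps=10)
    x = torch.randn(16, 8, 11, 10)
    yards = torch.randint(-5, 20, (16,))
    loss = crps_loss(model(x), yards)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
    return loss.item()
```

The input is expected as (batch, features, defenders, offense players), so the output depends neither on the order of the defenders nor on the order of the offense players.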
Evaluation and progression
At first, the cross-validation setup was a simple K-fold split grouped by game id, meaning that all plays from the same game always landed in the same fold. It worked well and gave a high linear correlation between CV and LB scores. But that changed as soon as we added acceleration as an extra variable. It turned out that speed and acceleration, being variables derived from the sequence of a player’s coordinates, were processed differently across seasons. We managed to improve the quality of these features by utilizing other available variables to adjust them, but also decided to start assessing the CV score using 2018 data only, as it was more similar to the 2019 data that our model was evaluated on in the end. That brought the correlation back to perfection.
Having such a correlation between CV and LB allows you to rely on CV in assessing model structures and parameters. That simplified our daily routine, as we did not need to check the public leaderboard regularly and could entirely rely on our CV setup, which also made us less susceptible to overfitting the public leaderboard. It also allowed us to fully judge the usefulness of each experiment we conducted, be it an additional feature, changes in the neural network structure, or some adjustments to the training routine.
In the end, we had the best score on both the public leaderboard and the private leaderboard after the subsequent rescoring on future data. Our distance to second place was as large as the distance from second place to 25th place.
One of the critical insights we derived from our analysis is that the expected yardage of a running play after the hand-off does not significantly depend on individual players’ data. The expected yardage is defined by the positions and velocities of the players at that moment. On the one hand, it can give indications of how well a running back/defense/offense is performing against expected yardages during rushing plays. On the other hand, it allows for an analysis of how well a team does in building up the best rushing situations on the field. We can assess which tactical choices and which individual player performances help the team improve expected running play success, which players contribute more in defending against running plays, which offensive players are better at creating space for running backs, and so on.
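A validation scheme of this kind can be sketched in plain Python. In this illustration, all plays of a game share a fold and only plays from the chosen season are scored; whether 2017 plays stay in the training part of each fold is an assumption of the sketch, as are the function name and the toy data.

```python
def grouped_folds(game_ids, seasons, n_folds=5, valid_season=2018):
    """Yield (train_idx, valid_idx) pairs of play indices.

    Plays of one game never appear on both sides of a split, and only plays
    from `valid_season` are used for scoring.
    """
    unique_games = sorted(set(game_ids))
    fold_of_game = {g: i % n_folds for i, g in enumerate(unique_games)}
    for fold in range(n_folds):
        train_idx = [i for i, g in enumerate(game_ids) if fold_of_game[g] != fold]
        valid_idx = [
            i for i, g in enumerate(game_ids)
            if fold_of_game[g] == fold and seasons[i] == valid_season
        ]
        yield train_idx, valid_idx


# Toy data: six plays from four games across two seasons.
game_ids = ["g1", "g1", "g2", "g2", "g3", "g4"]
seasons = [2017, 2017, 2018, 2018, 2018, 2017]
splits = list(grouped_folds(game_ids, seasons, n_folds=2))
```

Grouping by game avoids near-duplicate situations from the same game leaking between the training and validation parts.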
All the conclusions we made are coming only from modeling a specific type of play based solely on a single time snapshot of the tracking data. There is a huge potential to explore and extract value from it. We believe it is possible to add time components and also apply our ideas to passing plays to assess players, tactical decisions, and teams. But there are many other applications yet to be discovered.
What words of wisdom would you share with future Kaggle competitors?
Don’t worry about having domain knowledge to attempt a specific problem. The main thing we learned in this competition is that you don’t necessarily need domain knowledge or industry experience to successfully tackle a data science challenge. Sometimes lacking it can even be an advantage, as you go in blindly without many prior assumptions that might wrongly steer your exploratory analyses. What both of us love is to come up with simple and creative solutions to interesting problems, which is why this competition was ideally suited for us. You never really know what to expect from a competition or project beforehand, though, so it is crucial never to be intimidated by certain data or competition settings. Be bold! See it as a great learning opportunity. Even if things do not work out as expected in the end, it can be valuable to approach different kinds of problems to transfer acquired skills to the issues you are more regularly working with. For example, who would have thought that methods employed in a competition predicting bonds between molecules (convolutional layers, graph neural networks, transformers, etc.) would have been useful in a competition predicting rushing yards in American football?