There is no off-season in the cloud

Eric Schmidt
Analyzing NCAA Basketball with GCP
13 min read · Aug 28, 2018

With the World Cup behind us and baseball meandering its way through late summer, you’d be forgiven for thinking that we had reached a lull in the Google Cloud sports data analysis universe. If you did, you’ll be pleased to know that our recent Google Cloud NEXT conference proved quite the opposite. Our showcase highlighting Google Cloud’s NCAA partnership demonstrated a new way to make the cloud more tangible than ever: by designing a basketball competition using elements of data science and predictive analytics implemented on a half-court in the middle of San Francisco’s Moscone Center. Naturally.

Some context: applied data science

This past year, the Google Cloud Developer Relations data analytics team worked on projects with entities from MLB, the NBA, and the NCAA, with the goal of gaining new insights through applied data science. These projects ranged from looking at player performance metrics to predicting game flow to modeling the impact of potential rule changes. In addition to being fun, these projects provided unique lessons about data science and ideas for how we might convey them to a larger audience.

So as Google prepared to welcome tens of thousands of developers to NEXT’18 in San Francisco, we wanted to create an opportunity for attendees to have fun while learning how to tackle projects similar to what we’d been working on. Instead of just throwing a bunch of data at attendees, we thought it might be neat to build a smart court at which attendees would not only be part of the data creation process, but also receive that data for their own analysis.

We decided to focus on a deeper understanding of the mechanics of shooting a basketball. While there’s no shortage of information published around optimum release angle, spin rate, hoop entry angle, and so on, most of this information focuses on the attributes of a shot; we wanted to focus on the mechanics of a human. With better insight into shooting mechanics (like release height, arm motion, shoulder squareness to basket, head position, etc.), we could better understand how they influence shots.

Unsurprisingly, we’d need a lot of data. Turns out, it didn’t quite exist:

  • The NCAA has a wealth of team and player game data that enables one to analyze metrics like shooting percentage and offensive efficiency. This is helpful, but too coarse for our purposes — it doesn’t tell you where or how shots were made, for example.
  • The NBA as a league and several NCAA D1 teams use SecondSpectrum to analyze tracking data of on-court action. This is useful for conducting formation and game flow analysis, but doesn’t provide fidelity into player motion mechanics.
  • There are several high-quality shooting analysis systems, like NoahBasketball, that give players insight into the metrics of their shots (e.g. launch angle, launch speed, etc.). These systems help players improve their mechanics, but don’t provide fine-grained information on the components of their mechanics.
  • Various video analysis systems exist; however, these applications require building custom models that analyze human motion in the specific context of shooting a basketball, and/or require painstaking manual frame-by-frame annotation.
  • Custom tracking devices (e.g., accelerometer, gyroscope, GPS) like those used in the World Cup and, for the past few years, in the NFL are very helpful for game flow tracking, but not so much when it comes to human mechanics.

Net-net (no pun intended), we didn’t see a system that would provide us with the data we’d need for our desired analysis. To get it, we would build a system using Google Cloud and a collection of other technologies, as well as a predictive model that would estimate the make probability of a participant’s shots in real-time. We would develop a diverse set of shooting mechanics data, and NEXT attendees would receive immediate insights around their jumpshot in an educational and interactive experience.

The smart court definitely wasn’t built in a day, and its foundational bricks rest pretty deep. Let’s dig.

Scoping and data generation

Given the challenges above, we decided to implement a 3D infrared (IR)-based system. These are typically used for virtual reality, robotics, animation, and constrained movement science applications, and have proliferated along with them. With a little bit of sweat, you can use them to perform real-time motion analysis: 3D IR tracking systems can track specific objects within a few millimeters of precision, and can produce highly accurate, high-fidelity data at high frame rates.

We considered buying the raw components and building the system ourselves, but we applied that golden rule of data science and business: “Am I an expert in this domain? If not, find someone who is and leverage their skills.” Our goal wasn’t to become elite 3D motion capture experts, so we went and found some.

We teamed up with Seattle-based Mocap Now to help us design, implement, test, and run the data capture system. Here are the basic steps of building out the motion capture workflow:

1. Define our final analysis goal(s), which included being able to track a basketball through an entire shooting motion: player movement, player catch, player release, and shot finalization. Here is one of our first “high-tech” diagrams used to brainstorm our workflow.

2. Map out the volume of space we wanted to analyze. Due to size limitations at the Moscone Center, we ended up building out an analysis volume that was 26’ deep by 34’ wide by 35’ high, which is roughly two-thirds of a half court of an NCAA basketball floor. This volume was split down the center of the rim through the free throw line.

3. Finalize the number of markers we’d capture on the player and the court. We chose left foot, right foot, left hand, right hand, left elbow, right elbow, head, ball, and basket. This gave us nine points to track using rigid bodies as markers: lightweight, reflective devices attachable with velcro. Below left is one of our test shooting participants demonstrating how the rigid body markers light up from the camera flash. (Below right is me wearing my UW Huskies jersey, because they’re awesome.)

4. Calibrate 34 high-speed IR cameras (the camera count is a function of the desired analysis volume). These cameras shoot at 180 frames per second, scanning the analysis volume for, and then resolving, the location of each rigid body.

5. Test, test, test. We took approximately 4,000 shots in our Seattle-based performance stage, across 12 different player types (short, tall, beginner, novice, expert, etc.), all shooting a mixture of short- and long-range jumpers. Below is another member of our test group, Lak Lakshmanan, a wicked data scientist with a decent jumper.

6. With the basic capture workflow in place and a diverse group of test shooters, we began building out the workflow we’d implement onsite in San Francisco.

Data ingestion

Creating data is one thing; ingesting it and making it actionable is a whole other ball of wax. Each of the nine rigid bodies emitted six data points at 180 frames per second, or 9,720 data points per second. For each rigid body we tracked location and rotation data, in addition to a frame timestamp and other metadata.
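Roughly, a single sample for one rigid body looked like the record below. The field names are illustrative rather than our production schema, assuming the six values per body are a 3D position plus a 3-axis rotation:

# One tracked sample for a single rigid body at a single frame (illustrative only).
frame_sample = {
    "session_id": "test-session-001",
    "frame_timestamp_ms": 1535414400123,
    "rigid_body": "ball",                          # ball, head, left_hand, ...
    "pos_x": 3.21, "pos_y": 1.87, "pos_z": 2.44,   # location
    "rot_x": 0.12, "rot_y": -0.54, "rot_z": 1.08,  # rotation
}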

To capture this data in real-time, we implemented a socket listener using a .NET-based SDK optimized to process the marker data from the cameras. Using this SDK drove us to use Windows to host our initial ingestion processes, which had downstream impact on the rest of the architecture.

This data was captured locally and, after a bit of pre-processing, immediately published to Firebase, Cloud Spanner, Cloud Pub/Sub, and Google Cloud Storage. In our architecture (more below), we published all raw data to Cloud Pub/Sub, which emitted a real-time stream read by Cloud Dataflow (Apache Beam). This stream was processed and written to BigQuery as well as Google Cloud Storage. Cloud Spanner and Firebase were used for our real-time scoreboard and shot ranking. For all this work we relied on the .NET SDK for Google Cloud to communicate with each service. (With former Microsofties and heavy Python and Java users on the team, the reviews of the .NET SDK’s factoring and overall user experience were strong — in some ways it was easier than the Python SDK, and most importantly, it got the job done.)
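The production listener used the .NET client, but the publish call has the same shape in any of the Cloud client libraries. Here is a minimal sketch of it in Python, with a hypothetical project and topic name:

import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "raw-frames")  # hypothetical names

def publish_frame(frame_sample):
    """Publish one frame sample as JSON; Pub/Sub payloads are bytes."""
    data = json.dumps(frame_sample).encode("utf-8")
    future = publisher.publish(topic_path, data, rigid_body=frame_sample["rigid_body"])
    return future.result()  # blocks until the server assigns a message ID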

Data processing

After ingesting hundreds of two-minute shooting sessions, there were more than 500 million data points ((9 bodies × 6 values × 180 fps) × 120 seconds × 500 sessions) loaded into Google Cloud Storage. In order to become actionable, the data needed to get into BigQuery, and for that, we needed Apache Beam and Cloud Dataflow.

Here is the top-level Apache Beam pipeline that reads off the raw data files, parses and analyzes the data, and then writes each shooting session to BigQuery:

public static void main(String[] args) {
  Pipeline p = Pipeline.create(useCloudDataflow());

  p.apply(FileIO.match().filepattern("gs://naismith-dev/sessions/*/*.txt"))
   .apply(new ForceShuffle<>(25))
   .apply(FileIO.readMatches())
   .apply(ParDo.of(new ExtractFileFn()))
   .apply(new SessionWriter());

  p.run();
}

There are several powerful aspects to this simple-looking graph:

  • It allows for easy switching between running the pipeline locally or on Google Cloud.
  • It’s mode-agnostic: for testing, data collection and processing could run in batch mode; for the eventual live feed in San Francisco, it could easily rely on Cloud Pub/Sub as the data source and process in streaming mode.
  • In batch mode, it can be pointed at a subset of the files, or just a single file.
  • It allows for iteration. We adjusted how much processing was wanted or needed in the extraction function, and also ended up adding more logic to remove downstream burden on SQL processing and data shaping in pandas.

Armed with this Apache Beam graph, we executed a Cloud Dataflow job to ingest all the shot data. No need to worry about how to parallelize the graph — Cloud Dataflow optimizes the parallelization of each graph step and work items for you. For this graph, the pipeline only takes about four minutes of wall clock time to run.
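As noted above, the same graph shape can run in streaming mode with Cloud Pub/Sub as the source, which is how the live court fed BigQuery. Here is a minimal sketch of that variant using the Beam Python SDK (our production pipeline was Java; the project, topic, and table names below are hypothetical):

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Streaming variant: Pub/Sub in, BigQuery out (the table is assumed to already exist).
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | "ReadFrames" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/raw-frames")
     | "Decode" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
           "my-project:naismith.raw_frames",
           create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))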

Data exploration

With our test shooting data loaded into BigQuery, we moved on to data analysis and exploration. But in order to make sense of what was happening in each shot, we still needed to derive some data. Specifically, we needed to determine when a shot was released, and then look backward and forward in the frames so that we could understand the shooting mechanics as well as calculate the shot metrics.

Fortunately, performing this enrichment and exploration was straightforward due to Python and R’s seamless access to BigQuery. We piped the raw frame data from BigQuery into a pandas DataFrame and built a collection of functions to calculate vector information about the ball and the shooter. For example:
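Below is a minimal sketch of the kind of helper we mean (not the production code). It assumes the frames have been pivoted to one row per frame, with hypothetical columns like frame_ts, ball_x, and right_hand_x:

import numpy as np
import pandas as pd

def add_ball_kinematics(frames: pd.DataFrame) -> pd.DataFrame:
    """Add per-frame ball velocity and speed columns (illustrative only)."""
    df = frames.sort_values("frame_ts").copy()
    dt = df["frame_ts"].diff()  # seconds between frames, roughly 1/180
    for axis in ("x", "y", "z"):
        df["ball_v" + axis] = df["ball_" + axis].diff() / dt
    df["ball_speed"] = np.sqrt(df["ball_vx"]**2 + df["ball_vy"]**2 + df["ball_vz"]**2)
    return df

def find_release_frame(frames: pd.DataFrame, min_gap: float = 0.3) -> int:
    """Estimate the release as the first frame where the ball separates from
    the shooting hand by more than min_gap (in court units, illustrative)."""
    gap = np.sqrt((frames["ball_x"] - frames["right_hand_x"])**2 +
                  (frames["ball_y"] - frames["right_hand_y"])**2 +
                  (frames["ball_z"] - frames["right_hand_z"])**2)
    return int(frames.index[gap > min_gap][0])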

(Note: The code above eventually went back into the Apache Beam graph in order to save post-processing time when performing data exploration. Using pandas to shape data inside an Apache Beam graph — very copacetic.)

For exploration, we started looking at basic descriptive metrics over a subset of shots:

Shot count: 1077
Overall make percentage: 0.420
Average shot distance: 13.148
Average shot angle (offset cft): 91.976
Average shot release angle: 47.833
Average shot release speed: 14.478
Average shot release height: 2490.105

And here is a scatter plot of makes vs. misses for these shots:

Predictive modeling

We were ready to build a predictive model for the NEXT installation. Since we’d be receiving shooting data from a more diverse group than our test group, we wanted to build a model that was fairly straightforward, yet skillful enough to be meaningful.

Using the few thousand shots from our test data, we started to look at various features:

After additional feature analysis, we settled on the following features for what we called ‘the simple model’:

['distance', 'angle_to_rim',
 'prior_1_distance', 'prior_1_angle_to_rim', 'prior_1_launch_speed', 'prior_1_launch_angle',
 'prior_1_release_height', 'prior_1_release_dist_to_rim', 'prior_1_outcome',
 'prior_2_distance', 'prior_2_angle_to_rim', 'prior_2_launch_speed', 'prior_2_launch_angle',
 'prior_2_release_height', 'prior_2_release_dist_to_rim', 'prior_2_outcome',
 'priors_distance_avg', 'priors_angle_to_rim_avg', 'priors_launch_speed_avg',
 'priors_launch_angle_avg', 'priors_release_height_avg', 'priors_deviation_avg',
 'priors_release_dist_to_rim_avg']

Using the distance and angle to the rim of the current shot, the shooter’s prior two shot results, and the average data from all other priors, we estimated the probability of that particular shooter making a given shot. These features were used to train a TensorFlow DNNClassifier.
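As a rough sketch, training that simple model with tf.estimator.DNNClassifier on a pandas DataFrame of those features might look like the following. The hidden layer sizes, the label column name, and the input plumbing are illustrative choices on our part, not the production configuration:

import tensorflow as tf

def train_simple_model(train_df, feature_names, label_col="outcome", model_dir=None):
    """Train a DNNClassifier on the shot features (illustrative sketch)."""
    feature_columns = [tf.feature_column.numeric_column(name) for name in feature_names]
    classifier = tf.estimator.DNNClassifier(
        feature_columns=feature_columns,
        hidden_units=[64, 32],   # illustrative sizes
        n_classes=2,             # make or miss
        model_dir=model_dir)
    input_fn = tf.estimator.inputs.pandas_input_fn(
        x=train_df[feature_names],
        y=train_df[label_col],
        batch_size=64,
        num_epochs=None,
        shuffle=True)
    classifier.train(input_fn=input_fn, steps=2000)
    return classifier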

The game

Using this predictive model, we built a game to drive some competition while folks were shooting. The game is based on something we called the Naismith Score, which used the distance each shot was from the basket, the distance the active player moved since their previous shot, and the predictive model. Every 0.05 seconds we would re-score the make probability for the shooter and update the shot location data with the score potential for the shot. The score was specifically (distance + deviation) * (1 + make probability) * (0 or 1, for a miss or a make). We realize that equation may look a bit harsh, but the fact is, you need to make buckets in basketball.
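In code, the scoring rule is a one-liner; here it is as a sketch (the function name and signature are ours, not the production implementation):

def naismith_score(distance, deviation, make_probability, made):
    """(distance + deviation) * (1 + make probability), zeroed out on a miss."""
    return (distance + deviation) * (1 + make_probability) * (1 if made else 0)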

For example, below is Mingge Deng, one of our software engineers for BigQuery ML. Somehow he put up one session that locked him into 4th place, but then came back and went 0 for 13. You can see the sea of red misses on the three-point line, with a final shot that would have yielded a point bounty of 2947.

(Note: We utilized TensorFlow Serving via Docker to host our trained model on the same machine that was analyzing the real-time frame data. Latency from on-court movement to prediction was <40ms. TensorFlow Serving saved us a ton of time as we were able to rely on a pre-built container, a pre-built HTTP server, and the extensibility layers for serving in a production environment.)
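Behind that container, a prediction is a single HTTP call to TensorFlow Serving’s REST API. A minimal sketch, assuming the model was exported with a raw-feature serving signature and using a hypothetical host and model name:

import requests

# TensorFlow Serving's REST predict endpoint; host, port, and model name are illustrative.
SERVING_URL = "http://localhost:8501/v1/models/simple_shot_model:predict"

def predict_make_probability(feature_row):
    """Send one dict of shot features; the exact fields in the response
    depend on how the estimator was exported."""
    response = requests.post(SERVING_URL, json={"instances": [feature_row]})
    response.raise_for_status()
    return response.json()["predictions"][0]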

So what happened?

Mingge got caught up in trying to make all three balls, ignoring the reward function of the Naismith Score. Over time, we had numerous savvy data scientists and sharpshooters starting to “game” the system, which incentivized them to shoot from far, then come in close, then move out far, since the scoring system was biased towards distance and movement. Moreover, if the model determined you were a good shooter (read: you made a few shots in a row relative to location and the broader shooting population), future good shots wouldn’t be rewarded as highly until you missed one and the model recalibrated. One could thus game the system by missing a long one, and still pick up a big score by running back in and making a short jumper.

That said, the reality was that in order to score you still needed to make shots. So even as we explained to participants how the model worked, thereby opening up the possibility of the model being gamed (or, as we think of it, basketball engine optimization), the human element took hold and their mechanics became the next challenge. On the court, we were able to give real-time feedback on jump elevation, release point, and release trajectory, as well as provide the as-yet-unquantified data point of positive reinforcement to help them make a few more shots.

Moral of the story? Don’t change the game, change the player — specifically by helping them get better at shooting the rock.

The architecture

While our team of basketball nerds handled all the rebounds bouncing around the court, our team of Google Cloud products handled all the data bouncing around the court. All in, we used Firebase for session scoring, Spanner for leaderboard tracking, Google Cloud Storage for raw session data, TensorFlow to serve up the model locally, and a pipeline from Pub/Sub to Dataflow to BigQuery for data analysis (and future model retraining!).

We had wanted to demonstrate how Google Cloud might let you easily build scalable application workflows with turnkey serverless services, while also casually slipping in some data science fundamentals through a fun, applied experience. To be fair, while some might have been more interested in their shooting fundamentals instead, all participants still walked off the court and received an email with their user handle and unique link to a Colab notebook to access their raw data from their 60-second shooting session. Once broken out into a DataFrame, all that shooting data might translate into something looking like this — the motion (y-axis) of the ball during a shot.
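A sketch of what that first notebook cell might look like: load the raw frames into pandas and plot the ball’s height over time. The bucket path and column names are hypothetical, and pandas needs gcsfs installed to read gs:// paths directly:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical path to one participant's raw session export.
frames = pd.read_csv("gs://naismith-dev/sessions/<your-session-id>/session.txt")

# Keep only the ball's rigid body and plot its vertical position per frame.
ball = frames[frames["rigid_body"] == "ball"]
plt.plot(ball["frame_timestamp_ms"], ball["pos_y"])  # assuming y is the vertical axis
plt.xlabel("time (ms)")
plt.ylabel("ball height")
plt.title("Ball motion over the shooting session")
plt.show()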

A pre-built Colab notebook and some tips on how to start hacking on the dataset? Not a bad upgrade to your standard tech conference swag.

The score

With over 350 participants producing ~4,600 shots over two and a half days and countless audience members cheering (or heckling) their colleagues from the sidelines, the smart court was a highlight for many at NEXT. But it also gave us an opportunity to revisit some of the work from our March Madness efforts, including a long-awaited reveal of BQML and how it played into our predictive analysis — particularly around three-pointers.

While hanging out at the court, attendees were also able to immerse themselves into our previous work on Architecting Live NCAA Predictions: From Archives to Insights, diving deep into the process behind the live ads from this year’s Final Four.

In our next post, we’ll explore some of our findings around the mechanics of shooting a basketball (properly or otherwise). Meanwhile, you can get even more of a flavor of the smart court in action by checking out the showcase video here.

Until NEXT time — your Google Cloud sports nerds,
