A Method to the Madness: Helping Solve NCAA Basketball Analysis Using Google Cloud

Alok Pattani
Analyzing NCAA Basketball with GCP
12 min read · Mar 11, 2020
Google Cloud March Madness Insights Dashboard

With conference tournaments in full swing and the Big Dance (if circumstances allow) mere days away, we are back with more NCAA basketball content from Google Cloud! This year’s NCAA Tournaments may not be open to the public, but our Google Cloud March Madness Insights dashboard will be. Our small team of data science and sports analytics enthusiasts remains ready to provide you with some truly unique content over the next few weeks (health guidance permitting). We’ll be showcasing both the technical aspects and the basketball-related results of our work using Google Cloud tools to gain valuable insights from NCAA basketball data, right here on Medium.

We’re very excited to add some new features and insights to our coverage this year, including:

  • Women’s basketball data and analysis: Every team or player metric we’ve created for men, we’ve applied to the women’s game as well. By building our pipeline to ingest and process data from the women’s game in the same way as we do on the men’s side, we can now evaluate Baylor’s dominant women’s team in all the same ways we look at the men’s.
  • Advanced player metrics: We took last year’s team-level analysis and applied similar principles to the individual player level, including some all-in-one metrics that account for competition faced. Want to see how Dayton star Obi Toppin stacks up to the top major-conference men’s players? Or exactly why Oregon’s Sabrina Ionescu is considered one of the greatest women’s college basketball players of all time? We’ve got you covered.
  • Team résumé analysis: While it can be insightful to measure performance in various aspects of the game (shooting, defense, etc.), tournament selection and seeding often reflect something much more straightforward: which teams you played and who you beat. Our analysis of the quality of each win or loss on a team’s résumé, as well as an overall rating of how hard it is to achieve each team’s record given its schedule, will help identify the most deserving teams leading into Selection Sunday and Monday.

We’re also bringing back some of our top features from last season, including:

  • Schedule-adjusted efficiencies, pace, and Four Factors: We compute versions of these top-level basketball analytics to evaluate team offense, defense, and overall performance — popularized by Dean Oliver, Ken Pomeroy, and others — that adjust for the competition each team faced (very important in college basketball!). We laid out the methods behind these calculations last year, and have now applied them to women’s teams as well for this year.
  • In-game scoring metrics: The final score doesn’t always tell the whole story. Analyzing a team’s score progression within games using play-by-play reveals more about dominance, comebacks, and game control. We discussed how BigQuery allows us to scale these types of computations over millions of rows of play-by-play data a year ago, and we’ve put those calculations back into action this season, too.

All of this data resides in our dashboard, which we’d highly recommend bookmarking from now until champions are crowned on April 5 and 6. We’ll be updating all our metrics frequently, and maybe even adding in some new things there as the postseason progresses.

Using the Google Cloud March Madness Insights Dashboard

Each page contains a set of filters on the left that can be used to expand or limit the data for consideration by season, team properties, player-related attributes, or game-related information, depending on the page.

  • Use the “Sport” dropdown at top left of every page to toggle between men’s (“MBB”) or women’s (“WBB”) basketball.
  • Where available, use the “Through” filter to see how things stood as of the “Selection Date” each year, which is a useful cutoff point. The other option, “Season Latest,” shows end-of-season data for previous years and the most recent data for this year.
  • Pages also have a timestamp (usually toward bottom left) that indicates when the data feeding the tables or charts on each page was last updated.

Many tables also employ an extremely useful Data Studio feature called “Optional metrics.” This lets you add different values to the current table you are viewing (and take out others) by clicking on the icon with a gear located at the top right of a table, then adding/removing as desired. For instance, you can add each team’s raw efficiency ranks to the adjusted efficiency stats on the “Team Efficiency & Pace” page (as shown below) to see how schedule adjustment impacts a team’s numbers.

Below is a guide to the types of data-driven insights you can find using this dashboard, by page.

Team Overall Rankings

Here we have a table and chart displaying two overall team performance metrics. “Adj Net Eff” represents a team’s point differential per 100 possessions, adjusted for schedule (opponent and site of each game). Teams with a higher Adj Net Eff are the more powerful/“best” teams, and are more likely to do well going forward.
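To make “point differential per 100 possessions” concrete, here is a minimal Python sketch of the raw (unadjusted) version of the stat. The function names and the 0.44 free-throw weight in the possession estimate are our illustrative assumptions, and the schedule adjustment that the dashboard applies is not reproduced here.

```python
def estimate_possessions(fga, orb, tov, fta, ft_weight=0.44):
    """Box-score possession estimate.

    The 0.44 weight on free throw attempts is a common convention and an
    assumption here, not necessarily the value used for the dashboard.
    """
    return fga - orb + tov + ft_weight * fta


def net_efficiency(points_for, points_against, possessions):
    """Raw point differential per 100 possessions (no schedule adjustment)."""
    return 100.0 * (points_for - points_against) / possessions


# Hypothetical game: an 80-70 win played over 70 possessions.
print(round(net_efficiency(80, 70, 70), 1))
```

A team that outscores its opponent by 10 points over 70 possessions comes out a little above +14 per 100 possessions, which is why pace has to be accounted for before comparing margins across teams.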

“Résumé” ratings/ranks are based on the chance of an average “top” (NCAA Tournament-type) team achieving the team’s W-L, given its schedule. These ratings reward accomplishment, assessing a team’s record by how hard it was to achieve (very similar to ESPN’s Strength of Record, which I played a large role in developing). Teams with better résumé ranks are the “most deserving” teams, having impressive results looking backward. (More detail on these calculations in another post soon.)
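As a rough illustration of the résumé idea, the sketch below computes the chance that a team wins at least a given number of games, given a per-game win probability for an average “top” team. This is only a sketch under our own assumptions (independent games, hypothetical probabilities, and a function name we made up); it is not the exact calculation behind the dashboard.

```python
def prob_at_least_wins(win_probs, wins):
    """P(winning at least `wins` games), treating games as independent.

    win_probs: per-game win probabilities for an average "top" team.
    Builds the distribution of total wins one game at a time.
    """
    dist = [1.0]  # dist[k] = probability of exactly k wins so far
    for p in win_probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)  # lose this game
            nxt[k + 1] += mass * p      # win this game
        dist = nxt
    return sum(dist[wins:])


# Hypothetical four-game schedule where the team went 3-1.
schedule = [0.90, 0.75, 0.60, 0.40]
print(prob_at_least_wins(schedule, 3))
```

The lower this probability, the harder the record was to achieve, and so the more impressive the résumé.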

The chart at the bottom plots each team-season as a single point, with teams that are “more deserving” further up and those that are “more powerful” further to the right. Some interesting teams to examine are those toward the top right (in general, high quality teams) and somewhat off the black line (those whose résumé and true quality don’t align as much).

For instance, compare Virginia and Texas Tech, the two teams that played for last season’s national title on the men’s side:

Texas Tech has been very effective on a per-possession basis, rating higher than Virginia by a few points in adjusted net efficiency — Tech would likely be favored if the two teams faced off on the court. But UVA ranks much better by résumé, having big-time victories over top ACC rivals Duke, Louisville, and Florida State, as well as 10 other wins vs teams ranked in the Top 100 in adjusted net efficiency. The Red Raiders haven’t been able to put up nearly as many quality wins (and have six more total losses), so they’ll likely be placed a few seed lines down from the Cavaliers when the bracket is revealed on Sunday.

Team Efficiency & Pace

This page shows schedule-adjusted season-long values for pace as well as offensive, defensive, and net (overall) efficiency. This is similar to what you see on the home page of kenpom.com, with the added benefits of various types of filters, ability to view multiple seasons together, and, by virtue of including women’s teams, nearly twice as many data points to analyze.

This season, there are three women’s teams that have an adjusted net efficiency of more than +50 points per 100 possessions (the gap between top women’s teams and the average is much larger than the distance on the men’s side, every year). Using the dashboard, we can quickly see that those three teams dominate in different ways: Oregon with an amazing offense (~12 points per 100 possessions better than anyone else!), Baylor with its top-ranked defense (on par with their title-winning squad from last season), and South Carolina with the most balanced team, ranking second in both categories.

Team Four Factors

Going one step beyond the efficiencies, we have ratings for each team in the Four Factors that comprise offensive and defensive play: shooting, turnovers, rebounding, and getting to/converting from the free throw line. (This is the only public-facing resource we know of with schedule-adjusted versions of these metrics!)
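For readers new to the Four Factors, here is a small Python sketch of the standard raw (unadjusted) offensive versions computed from box score totals. The `TeamBox` type, the field names, the 0.44 free-throw weight, and the choice of free throws made per field goal attempt for the free throw factor are our assumptions for illustration; the schedule adjustment used on the dashboard is not shown.

```python
from dataclasses import dataclass


@dataclass
class TeamBox:
    """Hypothetical container for one team's box score totals."""
    fgm: int      # field goals made
    fga: int      # field goals attempted
    fg3m: int     # three-pointers made
    ftm: int      # free throws made
    fta: int      # free throws attempted
    tov: int      # turnovers
    orb: int      # offensive rebounds
    opp_drb: int  # opponent defensive rebounds


def four_factors(b):
    """Raw offensive Four Factors (no schedule adjustment)."""
    return {
        "efg_pct": (b.fgm + 0.5 * b.fg3m) / b.fga,           # shooting
        "tov_pct": b.tov / (b.fga + 0.44 * b.fta + b.tov),   # turnovers
        "orb_pct": b.orb / (b.orb + b.opp_drb),              # rebounding
        "ft_rate": b.ftm / b.fga,                            # free throws
    }


# Hypothetical single-game totals.
box = TeamBox(fgm=30, fga=60, fg3m=8, ftm=15, fta=20, tov=12, orb=10, opp_drb=25)
print(four_factors(box))
```

The defensive versions are the same calculations applied to the opponent’s box score.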

Kansas, Gonzaga, and Duke are the three top men’s teams in net efficiency. We can use their offensive and defensive Four Factors to see how different facets of the game help distinguish their performance on the way to achieving success.

On offense, all three shoot very well from the field, do a good job protecting the ball, and have pretty solid offensive rebounding rates. The teams are a bit more distinctive on defense, with Kansas really strong at limiting opponent shooting from the field, Duke also very good in that area and solid at forcing turnovers, and Gonzaga’s strengths being in defensive rebounding and limiting opponent free throw rate (while being relatively weak in the first two factors).

Two-Team Game-by-Game Résumé Comparison

This page is more complex, but it offers a unique and dynamic way to evaluate and compare two teams’ résumés side by side, which is a super-useful feature as selection and seeding debates arise this time of year. Each long table displays the selected team’s schedule and results, with a few résumé-oriented columns to add perspective:

  • “Exp Win%”: the chance of an average “top” (NCAA Tournament-type) team winning the game, given the opponent (rated according to adjusted net efficiency) and site. Lower values represent harder games.
  • “Win Value”: based on the team’s result (W or L) compared to Exp Win%. Winning against quality opponents (and away from home) gets higher (positive) values; losing to poor ones (especially at home) gets lower (negative) values.

(More details to come in a résumé evaluation-focused post.)

Game-level filters on the right allow you to drill down to specific games based on date, opponent rank (e.g. Top 50 in adjusted net efficiency), site, or result, with each team’s schedule table then being filtered accordingly. The summary table up top for each team shows W-L, total expected wins, and total win value across the games that fit the filters selected.
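To show how the summary numbers relate to the per-game columns, here is a minimal Python sketch. It assumes that Win Value is simply the result (1 for a win, 0 for a loss) minus Exp Win%; that is our reading of the column descriptions rather than a confirmed formula, and the function names and sample games are hypothetical.

```python
def win_value(won, exp_win_pct):
    """Result (1 for a win, 0 for a loss) minus the expected win probability."""
    return (1.0 if won else 0.0) - exp_win_pct


def summarize_resume(games):
    """games: list of (won, exp_win_pct) pairs for the games passing the filters."""
    wins = sum(1 for won, _ in games if won)
    return {
        "wins": wins,
        "losses": len(games) - wins,
        "expected_wins": sum(p for _, p in games),
        "total_win_value": sum(win_value(won, p) for won, p in games),
    }


# Hypothetical schedule: a tough road win, a bad home loss, and a "gimme" win.
games = [(True, 0.60), (False, 0.90), (True, 0.95)]
print(summarize_resume(games))
```

Under this assumption, total win value is just actual wins minus expected wins, so a single bad loss can wipe out the credit earned from a couple of solid victories.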

Going back to the Virginia-Texas Tech men’s comparison mentioned above, let’s look at their game-level results side-by-side.

The Cavaliers have built a stronger résumé on the back of some very close wins in tough games. Notice how many more non-“gimme” wins (those with win values in the 0.2 to 0.5 range) they have compared to what shows up on the Red Raiders’ résumé to date.

Player Season Stats

As promised, we have various advanced metrics for individual players based on box score stats from all games played. Many of our metrics are based on the calculations developed by basketball analytics experts and seen on the invaluable Sports Reference websites, like individual offensive and defensive ratings and win shares. But we go further, as we did with some of the top NBA draft prospects last June, by adjusting some stats for the level of competition faced and converting the win share calculations to a more college-friendly “Wins Above Average” (WAA) metric. WAA represents the total number of wins contributed by the player compared to an average D-I player with the same minutes, taking into account efficiency (more player-based on offense, more team-based on defense), offensive usage, and playing time. We have not found any other public-facing resource with these sorts of all-in-one, schedule-adjusted metrics for both men’s and women’s college basketball at this scale.

These advanced metrics are really good for highlighting the amazing play of the aforementioned Sabrina Ionescu, arguably this season’s most dominant college basketball player — men’s or women’s. Ionescu ranks outside the top 50 in women’s Division I in points and rebounds per game (she does lead the nation in assists per game), but most fans and analysts would agree that she is the game’s best player, by far. Taking into account all her box-score stats and the very impressive competition her team has played this year, her +11.8 adjusted WAA — nearly three wins ahead of everyone else! — validates this. Depending on how far the Ducks advance in the NCAA Tournament (they’re a lock for a top seed), Ionescu could challenge for the top women’s WAA in the six seasons for which we have data. She already owns three of the top eight single-season marks during this span.

Some other notes on the metrics shown on this dashboard page:

  • Sorting by rate stats, like offensive and defensive rating, can lead to some funky results with players with very few minutes or possessions. The “% of Team Min Played” and “% of Off Poss” filters on the left can help with setting a minimum threshold for qualification on those rate stats, and also isolating to players who have taken on various shares of their team’s offensive load.
  • The “optional metrics” for this table include raw versions of some of the advanced metrics, for comparison’s sake.
  • No, there is no player in the men’s game this season that approaches the amazing 0.40+ win shares per 40 minutes that Zion Williamson posted at Duke last year.

Player Game Stats

We also have player stats on the game level, including schedule-adjusted ones similar to those on the season level from the previous page, along with non-adjusted game score, a metric specifically designed for measuring single-game productivity. This page is probably the best showcase of BigQuery’s BI Engine: the underlying table contains more than 1.3 million rows, covering every player in every game across men’s and women’s D-I basketball over the last six seasons, and it can be filtered and sorted in various ways in a matter of seconds!

We can see that the top men’s game by Adj WAA in all six seasons for which we have data came earlier this year: Marquette’s Markus Howard posting a +0.602 vs USC back in November, when he scored 51 points on nearly 80% true shooting in a 22-point win. Howard has a penchant for this type of peak single-game performance, accounting for three of the top 10 individual player games by this metric during the span of our data.
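For reference, the two single-game measures mentioned on this page, game score and true shooting percentage, can be sketched in a few lines of Python. We use the widely published Game Score formula and the common 0.44 free-throw weight for true shooting; whether the dashboard uses exactly these coefficients is an assumption on our part, and the stat line below is made up.

```python
def game_score(pts, fgm, fga, ftm, fta, orb, drb, stl, ast, blk, pf, tov):
    """Single-game productivity using the widely published Game Score weights."""
    return (pts + 0.4 * fgm - 0.7 * fga - 0.4 * (fta - ftm)
            + 0.7 * orb + 0.3 * drb + stl + 0.7 * ast + 0.7 * blk
            - 0.4 * pf - tov)


def true_shooting_pct(pts, fga, fta):
    """Points per shooting possession, scaled so 1.0 means 2 points per attempt."""
    return pts / (2.0 * (fga + 0.44 * fta))


# Hypothetical stat line (not a real player's game).
line = dict(pts=20, fgm=8, fga=15, ftm=3, fta=4, orb=2, drb=5,
            stl=1, ast=4, blk=1, pf=2, tov=3)
print(round(game_score(**line), 1))
print(round(true_shooting_pct(line["pts"], line["fga"], line["fta"]), 3))
```

Neither of these is schedule-adjusted, which is why they are best read alongside the adjusted metrics on the same page.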

Team In-Game Score Metrics

As mentioned earlier, we’ve used play-by-play data to measure how long each team has spent leading, tied, or trailing, both by any amount and at different margin thresholds (10+ points and 20+ points). These are available on both the team season and single game levels. A couple of interesting notes from these reports:

  • The top women’s teams in percentage of minutes leading are the three dominant teams mentioned above, followed by a more surprising conference champion: Princeton, from the Ivy League. The Tigers have lost only once this season (by two in overtime at Iowa), and have spent 55% of minutes played leading by double digits.
  • What’s the smallest amount of time a men’s team spent leading in a win this season? How about Oklahoma, who just last weekend won a game at TCU in which they didn’t lead till Austin Reaves hit the game-winner with 0.5 seconds left!
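The idea behind these in-game metrics can be sketched in Python for a single game. This is a simplified illustration, not our BigQuery implementation: the event format (elapsed seconds plus both scores after each scoring play), the function names, and the 2,400-second regulation default are assumptions made for the example.

```python
def margin_segments(events, game_seconds=2400):
    """Yield (duration, margin) for each stretch between scoring plays.

    events: (elapsed_seconds, team_score, opp_score) tuples, sorted by time,
    giving the score right after each scoring play.
    """
    prev_t, margin = 0, 0
    for t, team_score, opp_score in events:
        yield t - prev_t, margin
        prev_t, margin = t, team_score - opp_score
    yield game_seconds - prev_t, margin  # stretch from last score to the buzzer


def time_leading(events, min_margin=1, game_seconds=2400):
    """Seconds spent leading by at least `min_margin` points."""
    return sum(d for d, m in margin_segments(events, game_seconds) if m >= min_margin)


def time_tied(events, game_seconds=2400):
    """Seconds spent with the score level."""
    return sum(d for d, m in margin_segments(events, game_seconds) if m == 0)


# Hypothetical game: an early lead, a long stretch trailing, and a late winner.
events = [(60, 2, 0), (600, 2, 3), (2399, 4, 3)]
print(time_leading(events), time_tied(events), time_leading(events, min_margin=10))
```

Setting `min_margin` to 10 or 20 gives the double-digit and 20+ point versions, and passing a larger `game_seconds` handles overtime games.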

Take a deep breath, as we just covered a lot of ground! But hopefully that sets the table for all the interesting things that are possible to learn with our NCAA basketball metrics and dashboard this season.

Over the next few weeks, be on the lookout here, on Twitter, on LinkedIn, and even on some of the men’s tournament game broadcasts themselves, as Steve Sandmeyer, Eric Schmidt, Elissa Lerner, and I will be sharing some of our unique Google Cloud NCAA basketball insights. Our results will help you debate and discuss who should get into the tournament field, then help assess the strongest teams and players you want to rely on when filling out your brackets, and finally serve as a valuable complement to your NCAA Tournament viewing experience. Embrace the madness of March!
