From College to the Pros with Google Cloud Platform (Part 2)

Published in Analyzing NCAA Basketball with GCP · 15 min read · Jun 20, 2019

The 2019 NBA Draft is nearly here! In part 1 of this look at the journey from college to the pros through data, we laid a lot of groundwork, namely how we:

Used Sports Reference’s college basketball site and a useful Python API to get nine seasons of NCAA player and team game and season data into BigQuery
Created advanced college basketball player stats on the game and season level using views in BigQuery
Adjusted those ratings for schedule to get a better assessment of player performance using AI Platform Jupyter Notebooks
Created an interactive Data Studio dashboard that allows examination of more than 40,000 player seasons and more than a million player games in seconds, thanks to optimization using BigQuery BI Engine.

In this post, we’ll demonstrate how to go even further on this data science journey in GCP. By getting some more data, we’ll look at past NCAA players who were drafted into the NBA to help provide some perspective on this year’s prospects.

Before we dig in though, it’s worth noting that any attempt to use data science (if it can be called “science” at all) to project players’ NBA success and decide whom to draft is precarious. Actually, the NBA Draft is a risky proposition in general. This isn’t just an idle caveat: in the 2012 NBA Draft, the #1 pick was Kentucky’s Anthony Davis, a generational talent for whom the Lakers just agreed to trade a quarter of their team and much of their future draft capital. But the next year, the #1 pick was UNLV’s Anthony Bennett, who bounced around the NBA, barely making any team’s rotation, before being out of the league midway through his 4th season. Talk about a crapshoot! As NBCSports’ Tom Haberstroh recently noted, despite having more data and more ways to assess players than seemingly ever, NBA teams may actually be getting worse at drafting.

All that said, let’s definitely still attempt to project how great Zion Williamson, Ja Morant, R.J. Barrett, and others might turn out to be.

Getting More Data on NBA Draft Prospects

While the schedule-adjusted NCAA player metrics we’ve created will be helpful, if we’re going to have any success in using data to project college players into the NBA, we’ll need more than those alone. Remember, not everyone gets drafted. In our dataset of nearly 17,000 NCAA Division I men’s basketball players from 2011–12 to 2017–18 (excluding this past season), just over 2% went pro. Only 60 players get drafted each year (30 NBA teams, two rounds), and some of those picks are used on non-collegiate prospects. Moreover, not all NCAA players who do well in the NBA were top players in college by the stats — either basic or advanced. College success and pro success don’t overlap as much as you’d expect.
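As a rough sanity check on that 2% figure, here is a minimal Python sketch of the arithmetic. It assumes the seven seasons from 2011–12 to 2017–18 line up with the seven drafts from 2012 to 2018, and it treats every pick as if it went to a college player, so the result is an upper bound rather than the actual share.

```python
# Back-of-the-envelope check using only the numbers quoted above.
# Assumption: seven college seasons map to seven drafts (2012 through 2018).
NBA_TEAMS = 30
ROUNDS = 2
DRAFTS = 7
PLAYERS_IN_DATASET = 17_000  # "nearly 17,000", so this is approximate

picks_per_draft = NBA_TEAMS * ROUNDS        # 60 picks per year
total_picks = picks_per_draft * DRAFTS      # 420 picks over seven drafts
max_share_drafted = total_picks / PLAYERS_IN_DATASET

print(f"At most {max_share_drafted:.1%} of players could have been drafted")
```

The ceiling of roughly 2.5% is consistent with "just over 2%" once picks spent on non-collegiate prospects are taken out.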

So, we need some way to whittle down our large set of college players to a more manageable group of players who actually have some real chance of being drafted. Fortunately, sites like NBADraft.net help with this each year, ranking the Top 100 NBA Draft prospects on a “Big Board” that evolves leading up to draft day. Some of these prospects are international players or others who didn’t play NCAA Division I men’s basketball. But generally, 80–90% of prospects did play at D-I schools, so we have some college statistics for them.

Pulling in Big Board data for 2019 and past years not only allows us to filter down our list of college players to true prospects, but also tells us what basketball scouts thought of the player’s ability to succeed in the NBA prior to him being drafted. Since this ranking is often based on much more than a player’s stats (like his athleticism, Combine measurements, NBA fit, character, and other things that are harder to quantify using college performance alone), it may provide new information that helps project NBA success.

As we did with getting college data into useful shape in Part 1, we used open-source software to help us get those Big Boards and other data into BigQuery. This time, we used R, a free software environment that is very popular among statisticians and data scientists. Conveniently, R has the nbastatR package, which can be installed directly from GitHub using the devtools package. nbastatR has useful functions that can pull NBA-related data from various sources, including the NBA’s official stats site, NBADraft.net, and Basketball Reference.

While there wasn’t a great way to run R on GCP until recently, thankfully, AI Platform R Notebooks were launched in beta just a few weeks ago! Like Python notebooks, these notebooks are managed, web-based interactive development environments for R. Service account keys allow us to connect R with BigQuery, which lets us write data that was gathered in R back to our main database. Below is a screenshot of some of the setup cells in our notebook to gather NBA Draft-related data using nbastatR (and other R packages).

AI Platform R notebook setup on Google Cloud Platform

Beyond NBADraft.net Big Boards, we also added a few other data sets to aid our analysis:

Past NBA Draft information: who was picked at each pick, by whom, dating back to 2011, all gathered using nbastatR’s draft function.
Recruiting Services Consensus Index (RSCI) player ranks: ranks of players before they got to the NCAA, which may provide additional info to help predict performance. This was gathered using R’s web scraping tools (namely, rvest and associated packages).
A few NBA player honors: who made the All-Rookie, All-Star, and All-NBA teams for the past several seasons, to get some indication of which players were successful in the NBA. We gathered this manually.

After creating these tables in BigQuery, it was time for one of the most common necessary evils in sports analytics: player ID mapping! BigQuery can help with some clever name matching (by player and year), but a lot of players still had to be mapped manually so that we could join all relevant players across our various data sets.
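To illustrate what name matching by player and year can look like, here is a minimal Python sketch. It is not the BigQuery logic used for this project: the field names (`player_id`, `prospect_id`, `draft_year`), the IDs, and the normalization rules are all assumptions made for the example. The idea is to reduce each name to a comparable key, match on key plus year, and set aside anything without exactly one candidate for manual mapping.

```python
import re
import unicodedata

# Generational suffixes that often appear in one source but not another.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def normalize_name(name):
    """Reduce a player name to a lowercase, letters-only matching key."""
    # Split accented characters into base letter + mark, then drop non-ASCII.
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Treat hyphens as spaces, then strip periods, apostrophes, etc.
    lowered = ascii_only.lower().replace("-", " ")
    letters_only = re.sub(r"[^a-z ]", "", lowered)
    tokens = [t for t in letters_only.split() if t not in SUFFIXES]
    return "".join(tokens)


def match_players(college_players, draft_prospects):
    """Match prospects to college players on (normalized name, draft year)."""
    index = {}
    for player in college_players:
        key = (normalize_name(player["name"]), player["draft_year"])
        index.setdefault(key, []).append(player["player_id"])

    matched, needs_review = {}, []
    for prospect in draft_prospects:
        key = (normalize_name(prospect["name"]), prospect["draft_year"])
        candidates = index.get(key, [])
        if len(candidates) == 1:
            matched[prospect["prospect_id"]] = candidates[0]
        else:
            # Zero or multiple candidates: leave for manual mapping.
            needs_review.append(prospect["prospect_id"])
    return matched, needs_review


# Hypothetical IDs; the names are spelled differently across sources.
college_players = [
    {"player_id": "c1", "name": "R.J. Barrett", "draft_year": 2019},
    {"player_id": "c2", "name": "Ja Morant", "draft_year": 2019},
]
draft_prospects = [
    {"prospect_id": "p1", "name": "RJ Barrett", "draft_year": 2019},
    {"prospect_id": "p2", "name": "Ja Morant", "draft_year": 2019},
    {"prospect_id": "p3", "name": "Sekou Doumbouya", "draft_year": 2019},
]
matched, needs_review = match_players(college_players, draft_prospects)
```

Here `p3` lands in `needs_review` because the prospect has no college record to match against. Nicknames and transliteration differences would end up there as well, which is why some manual mapping is hard to avoid.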

Pulling all of our data together in one place is worth the effort though, and provides some major upside. For one, we can now merge all this data in BigQuery and allow for combined [NCAA stats]-[Draft prospect data]-[NBA success] analysis in our Data Studio dashboard. The second row of filters on Page 1 of that dashboard allows for filtering down to draft prospects (by rank), only draftees (by pick number), or only players who went on to big NBA success (made an All-Star or All-NBA team). This opens up the ability to answer questions like these very quickly:

Who had the highest schedule-adjusted ORtg among this year’s Draft prospects? (Texas’ Jaxson Hayes, at 140.9)
Who had the most adjusted win shares in a college season among players not taken in the lottery? (UConn’s Shabazz Napier, with 10.1 in 2013–14)
How did NBA All-Stars stack up in raw and adjusted stats in their final college seasons? (see below)

Pro tip: use the “MostRec Yr” filter to restrict the dashboard to the player’s latest college season only, something we do frequently in our analysis.

Does Schedule-Adjusting College Stats Matter for NBA Draft Prospects?

We spent a lot of time in part 1 talking about schedule-adjusting player stats, particularly advanced ones like off/def ratings and win shares, and how taking into account the level of competition faced in each game helps more accurately evaluate the significance of collegiate performance. But we wanted to know, are adjusted stats more indicative than raw stats in determining which NCAA players might go on to NBA success?

A first cut at this question is to filter our dashboard to those players who have made an NBA All-Star or All-NBA team, and compare their raw and adjusted stats in their final college season. There are only 18 such players for whom we have college data since 2010–11, so we realize that excludes many current NBA stars. However, on average, the adjusted stats were considerably better: adjusted ORtg 5.1 points higher, adjusted DRtg 5.0 points lower (lower is better for defensive rating), and adjusted win shares 1.3 wins higher. All but one player (Damian Lillard of Weber State, a small school in the Big Sky conference) had better adjusted stats than raw ones in their final college season.

Given that 18 is a pretty small sample size and heavily skewed toward power-conference schools, we took a more rigorous look. Our next cut was the subset of drafted college players who made an All-Rookie team. As big a step down from All-Star/All-NBA as that may seem, it at least separates those who had even marginal NBA success from those who didn’t contribute much. So in addition to All-Stars and All-NBA players, All-Rookie selections expand our sample to 71 players.

Each of those 71 players was ranked among NCAA prospects within his draft year in raw and adjusted win shares per 40 minutes (a rate stat that does well in evaluating a player’s overall efficiency, as opposed to total value), so we could see which ranking painted each player in a better light amongst his peers going into the draft. The plot below shows each player’s raw rank on the x-axis and adjusted rank on the y-axis (hover over names to get more info on when/where the player was drafted, click and drag to zoom in), with the diagonal line corresponding to equal ranks:

As you can see, a good majority of these “successful” NBA players (52 of the 71) lie above the diagonal line, indicating better (lower) adjusted ranks than raw ranks. This includes a few stars that NBA fans just finished watching in the Finals — Kawhi Leonard, Klay Thompson, and Draymond Green — as well as a few others that are in the news lately with free agency fast approaching — Kemba Walker, Jimmy Butler, and D’Angelo Russell. Each of those players ranked better in adjusted win share rate than raw in their respective draft classes, meaning adjusted win share rate would’ve done a better job forecasting their success. Based on this sample, NCAA adjusted win shares per 40 minutes does seem to have some greater predictive value toward NBA success than the raw version of the stat does.
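The rank comparison described above can be sketched in a few lines of Python. This is only an illustration: the players, field names, and stat values below are made-up placeholders, not figures from our data, and ties are simply left in input order. Each player is ranked within his draft year on the raw stat and again on the adjusted stat, and then we keep those whose adjusted rank is better (lower).

```python
from collections import defaultdict


def rank_within_year(players, stat):
    """Return {player_id: rank}, where rank 1 is the highest `stat` in each draft year."""
    by_year = defaultdict(list)
    for player in players:
        by_year[player["draft_year"]].append(player)

    ranks = {}
    for year_players in by_year.values():
        # Sort descending so the best rate gets rank 1; ties keep input order.
        ordered = sorted(year_players, key=lambda p: p[stat], reverse=True)
        for position, player in enumerate(ordered, start=1):
            ranks[player["player_id"]] = position
    return ranks


# Placeholder values for illustration only.
players = [
    {"player_id": "A", "draft_year": 2015, "raw_ws40": 0.25, "adj_ws40": 0.22},
    {"player_id": "B", "draft_year": 2015, "raw_ws40": 0.20, "adj_ws40": 0.27},
    {"player_id": "C", "draft_year": 2015, "raw_ws40": 0.18, "adj_ws40": 0.19},
    {"player_id": "D", "draft_year": 2016, "raw_ws40": 0.21, "adj_ws40": 0.24},
    {"player_id": "E", "draft_year": 2016, "raw_ws40": 0.23, "adj_ws40": 0.20},
]

raw_rank = rank_within_year(players, "raw_ws40")
adj_rank = rank_within_year(players, "adj_ws40")

# Players whom the adjusted stat ranks more favorably than the raw stat.
improved = sorted(pid for pid in raw_rank if adj_rank[pid] < raw_rank[pid])
```

In this toy example, players B and D move up once their stats are adjusted, which is the pattern the 52 players in our sample showed.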

So… How Great Will Zion Be?

All that is great, but let’s get to the fun! There’s little doubt the Pelicans will take Duke phenom Zion Williamson, one of the most hyped pro prospects in recent years, #1 overall. So how good will he be in the NBA?

In case it wasn’t clear from part 1 (it probably was), Zion Williamson’s lone college season was one of the most amazing in our nine seasons of data. It rated:

5th-highest in adjusted wins above average (WAA), despite the fact that he missed nearly six full games
Highest in adjusted offensive rating among players who took on more than their average share of team offensive possessions (minimum 500 minutes & 20% OffPoss%)
Highest (by a good margin) in adjusted win shares per 40 minutes (min. 500 minutes)
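For readers less familiar with rate stats, here is a small Python sketch of win shares per 40 minutes with a minutes qualifier. The formula (win shares divided by minutes, times 40) is the standard definition of the stat. The function names, field names, and sample values are assumptions for illustration only.

```python
def win_shares_per_40(win_shares, minutes):
    """Win shares scaled to a 40-minute (full college game) rate."""
    return 40.0 * win_shares / minutes


def qualified_leaders(seasons, min_minutes=500):
    """Rank player seasons by adjusted WS/40, dropping small samples."""
    qualified = [s for s in seasons if s["minutes"] >= min_minutes]
    return sorted(
        qualified,
        key=lambda s: win_shares_per_40(s["adj_win_shares"], s["minutes"]),
        reverse=True,
    )


# Placeholder seasons, not real player data.
seasons = [
    {"player": "Player A", "adj_win_shares": 9.0, "minutes": 1000},  # 0.36 per 40
    {"player": "Player B", "adj_win_shares": 3.0, "minutes": 240},   # 0.50, but too few minutes
    {"player": "Player C", "adj_win_shares": 6.0, "minutes": 800},   # 0.30 per 40
]
leaders = [s["player"] for s in qualified_leaders(seasons)]
```

The minutes threshold matters because a rate stat can look extreme over a handful of games. Player B has the best rate here but is excluded for playing only 240 minutes.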

Let’s use the dashboard to see who else ranked highly in adjusted win shares per 40 minutes among all other top-10 prospects in our nine seasons of data:

Highest adjusted win shares per 40 minutes among Top 10-ranked NBA Draft prospects from 2011 to 2019

If the three names below him are any indication, Zion is in great company. Kyrie Irving, Karl-Anthony Towns, and Anthony Davis are All-Stars already in, or heading into, the primes of their NBA careers. There are some disappointing top-5 selections down the list, as Cody Zeller and Derrick Williams didn’t live up to expectations, but they were also less efficient overall in college. Generally this list seems to portend well for Zion.

Williamson is also rated #1 on NBADraft.net’s Big Board, was top-5 in the RSCI rankings a year ago, and was only a freshman in 2018–19 (players who do well earlier in college tend to leave early for the NBA, and therefore are generally preferable to older college players with similar production). So, pretty much every possible feature we have available to put in a predictive model seems to be in Zion’s favor. Without even running the analysis, it seems safe to say he has a very good chance of being a great pro. (ESPN Analytics’ sophisticated Draft projection model, which uses information like this and more, gives Williamson a 72% chance of playing at an All-Star level in his first four seasons, one of the highest such percentages in the model’s history.)

Ja Morant vs R.J. Barrett

After the Pelicans take Zion at #1, most reports and mock drafts have the Grizzlies selecting Murray State sophomore Ja Morant at #2, and then the Knicks likely taking another Duke freshman, R.J. Barrett, at #3 (if they don’t trade the pick). Morant and Barrett were both 1st-Team All-Americans last year and seem to have high NBA potential. There is some debate on whether Barrett, who rated #1 in the 2018 RSCI rankings (ahead of Williamson), is being overlooked and is a better pro prospect than Morant. Time will tell, of course, but comparing these two players using raw and schedule-adjusted college stats is a great way to reframe this debate.

R.J. Barrett and Ja Morant raw and adjusted statistics from the 2018–19 season

In terms of raw stats, Ja Morant put up a higher offensive rating while shouldering a major load (more than 36% possession usage while on the court, compared to Barrett’s 31%). He also displayed a roughly equal defensive rating in 2018–19. This translates to 8.2 win shares, which is second-highest among 2019 draft prospects. Barrett lagged behind with 6.2 win shares, 14th-highest among players on the Big Board.

But things flip when adjusting for the fact that Barrett played an extremely challenging schedule (both in and out of conference) at Duke, while Morant generally faced more mediocre competition in the Ohio Valley Conference. Looking at adjusted stats, Barrett’s offensive rating is slightly higher and defensive rating is several points better, translating to 9.1 adjusted win shares (fifth-highest among Draft prospects) compared to 7.7 for Morant (13th). That said, Morant fans can point to the fact that he performed very well when he did face power-conference competition in 2018–19, putting up 0.17 adjusted wins above average or more in games against Alabama, Auburn, Marquette, and Florida State.

Barrett also has youth on his side: he’s a year younger than Morant, and Morant’s freshman season performance wasn’t nearly as good. Both are undoubtedly worthy candidates to be top selections Thursday, but we wonder if the prospects might rank differently if NBA teams were using adjusted college stats more prominently.

Looking at ALL 2019 College Prospects

For a complete look at all 2019 college prospects, we decided to show NCAA adjusted win shares per 40 minutes vs. NBADraft.net Big Board rank in the interactive scatterplot below (hover over names to see schools and actual values, click and drag to zoom in). Going in prospect order from left to right, we can see which prospects were more or less efficient than we’d expect at generating wins in college last season.

Zion (top left) is literally almost off the chart, as we’d expect at this point, but the second-highest displayed name is a pretty surprising one: Gonzaga junior Brandon Clarke, the 25th-ranked prospect. In fact, Clarke’s 0.368 adjusted win shares per 40 minutes in 2018–19 is the third-highest of any player with at least 500 minutes in our nine seasons of data, behind only Williamson and the senior season of former Wisconsin star Frank Kaminsky. Clarke’s adjusted offensive and defensive ratings are within decimal points of Zion’s (though with a lower OffPoss%, which matters). Given this, it’s easy to see why Clarke is being considered an analytics darling in this draft class and might go as high as the lottery, despite his Big Board ranking suggesting otherwise.

But recent college production is only one piece of a prospect’s profile. At almost 23, Clarke is on the older end of draft prospects, and his two seasons at San Jose State before transferring to Gonzaga were much less impressive. Kaminsky, who was also hyper-efficient in his final season in college and entered the draft at a similar age, has struggled to make much impact in the NBA after the Hornets took him #9 overall in 2015. But if a team can get Clarke in the mid-to-late first round, it seems like a worthwhile risk given his standing among the other prospects likely to be available at that point.

Some other interesting prospects from the plot above:

Oregon freshman Bol Bol (#19), a top-10 recruit and son of former NBA giant Manute Bol, had the fifth-highest NCAA adjusted win share rate among 2019 prospects. But that came in only nine games, as he missed most of the season with a foot injury. Was his high college efficiency a harbinger of a productive NBA career, or do small sample size and injury concerns mitigate his projection? It’s no wonder some have Bol as the most polarizing prospect in this year’s draft.
Duke freshman Cam Reddish (#15), the third member of the Blue Devils’ lauded 2018 recruiting class, had some inconsistencies throughout the season and finished 67th among college prospects in win share rate. That doesn’t seem to bode well for his NBA future, but it could be that he struggled to fit in with Williamson and Barrett dominating the ball so thoroughly. There is precedent for some top-20 prospects with low efficiency as college freshmen still having NBA success: Zach LaVine, Jaylen Brown, and Andre Drummond are the ones that emerge using the dashboard when looking for situations similar to Reddish’s.
Tennessee junior Grant Williams (#39) would likely go much higher than he’s projected if college performance translated directly to the NBA. The two-time SEC Player of the Year ranked in the top-5 in NCAA in adjusted win shares, wins above average, and win share rate (min. 500 minutes) in 2018–19. But history shows other juniors and seniors with second-round Big Board rankings and high NCAA win share rates can be a mixed bag: Josh Hart, Mike Scott, and Malcolm Brogdon have become solid NBA contributors, but others like Sindarius Thornwell and Nolan Smith struggled to find significant roles.

Next Steps

You’ll notice that we stop short of a “full” NBA projection model that ranks top college prospects based on college stats, Big Board rankings, recruit rankings, and other data we might be able to gather. That would be the logical next step given the data we’ve put together, and GCP tools like BigQuery ML and AI Platform notebooks can surely help.

But what, exactly, do we project? As in a typical data science/machine learning task, there are many caveats around measuring the success of an NBA player that need to be considered before we go about modeling. For instance:

Choosing a true NBA outcome measure: we used NBA honors as a proxy for which players turned out to be “good.” But there are many more quantitative ways to evaluate players — the NBA version of win shares, box plus-minus, real plus-minus, and others — that would allow more continuous evaluation of NBA success (i.e. not just “award-winning player or not”).
Choosing a time frame: even once we choose a metric, we’d need to decide what portion of the player’s career we are interested in projecting. Some players contribute early, while others take some time to develop into stars (current favorites Steph Curry, Kawhi Leonard, James Harden, and Giannis Antetokounmpo took 4–5 seasons to make their first All-Star appearance). If we choose a longer time horizon, we’ll have less data to work with, since we haven’t observed enough of the careers of recent draftees yet.
Incorporating variance into projections: do we want to model around highest average value, maximize the chance of drafting a future All-Star (even if there is some downside), or sacrifice upside and minimize the chance of taking someone who won’t contribute at all? A good model will provide accurate projections; a better one will also account for the range of possibilities around the mean projection.
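To make the variance point concrete, here is a minimal Python sketch. It assumes we already have a set of possible outcomes for each prospect on some value metric (for example, simulated draws from a model). The outcome values, thresholds, and function name are all hypothetical. Two prospects can share the same mean projection while offering very different chances of becoming a star or a bust.

```python
from statistics import mean


def summarize_projection(outcomes, star_threshold, bust_threshold):
    """Summarize a list of possible outcomes for one prospect."""
    n = len(outcomes)
    return {
        "mean": mean(outcomes),
        # Share of outcomes at or above the "star" level.
        "p_star": sum(o >= star_threshold for o in outcomes) / n,
        # Share of outcomes at or below the "bust" level.
        "p_bust": sum(o <= bust_threshold for o in outcomes) / n,
    }


# Hypothetical outcome draws on an arbitrary value scale.
steady = [4.0, 5.0, 5.0, 6.0, 5.0]
boom_or_bust = [0.0, 1.0, 12.0, 2.0, 10.0]

steady_summary = summarize_projection(steady, star_threshold=8.0, bust_threshold=1.0)
risky_summary = summarize_projection(boom_or_bust, star_threshold=8.0, bust_threshold=1.0)
```

Both prospects project to the same average, but only the second has any chance of clearing the star threshold, and only the second carries bust risk. Which profile a team prefers depends on its outlook at the time of the draft.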

If we were helping out an NBA team with their draft modeling, we might build multiple models using different outcome measures and time frames, attempt to account for variance in each projection, and then employ the outputs that best fit the team’s outlook at the time of the draft (e.g. does the team need a contributor now? is the team starting a rebuild? etc.). If you’re curious to see how some models are currently addressing these questions, take a look at the analytically-based projections from ESPN’s Analytics Team, Kevin Pelton, FiveThirtyEight, and others. These do a nice job of navigating these issues while offering projections for all prospects.

But not every data-related project has to end up with a fancy machine learning model spitting out predictions. Sometimes just organizing the data, using subject-matter expertise to add valuable calculations, and getting it in a usable format is a big and meaningful step in a data science journey. Our dashboard combining college and NBA information is a really fun and useful tool that can be used in the player evaluation process, and schedule-adjusted advanced stats for players can become a great addition to the “college-to-pro” draft projection toolbox.

We hope that the process we detailed in these two posts around ingesting, transforming, combining, analyzing, visualizing, and interacting with data has helped illustrate the power of various GCP tools working together to create data-driven insights. Enjoy the Draft!

Special thanks to Elissa Lerner