Settling the GOAT Debate Once and For All with Data?

Dehao Zhang
Published in The Press Box · Mar 26, 2024 · 7 min read

It’s that time of the year again. The NCAA Division I Basketball Tournament (aka March Madness) just started last week, and I am sure many people are excited and will be following over the next few weeks.

In pretty much any sport, we hear people say things like "Player X/Y/Z is THE BEST player," or even "the Greatest of All Time (GOAT)," and fans will go back and forth on why their pick is the best and why the other players are not.

As an analytical person, my natural question is always: best in terms of what?

There are tons of player- and team-level stats out there, but there simply isn't one that directly measures the overall "goodness" or "greatness" of a player. Otherwise, we could just sort by it, see who's at the top, and there would be no debate. Another way to compare is to have the players in question play against each other and check the results. However, we all know that's not always feasible either.

So how can we approach this with a data-driven mindset?

In this post, I will walk you through a series of steps that help you tackle this challenge, with simple math and an example on NBA players. I believe this framework generalizes well beyond sports and can be applicable in many areas of personal or business decision-making.

We all know the famous quote, "If you can't measure it, you can't improve it" [1]. A small pivot applies well here: "If you can't measure it, you can't compare or rank it." With that, let's get right into it!

Step 1: Establish the Measurement Foundation

Complex concepts like the overall greatness of a player are usually composites of multiple dimensions, so it is important to break the high-level concept down into lower-level ones and identify those first, before jumping straight into specific attributes and metrics.

To borrow from another field, developer productivity is a high-level and complex topic. A 2021 paper presented a framework called SPACE that outlines 5 dimensions (Satisfaction & Well-Being, Performance, Activity, Communication & Collaboration, and Efficiency & Flow) at the individual/team/system level to provide a comprehensive view of the components that go into productivity [2]. Many of us may associate developer productivity with lines of code produced per unit of time, but that is just one example metric along the "individual activity" dimension.

Similarly, in the NBA, people often talk about how good a player is based on how many 30- or 40-point games they had. You can probably see that this looks at only one aspect. For illustrative purposes, we may consider the following 4 dimensions:

  1. Scoring Ability
  2. Champions and Awards
  3. Efficiency and Versatility
  4. Impact on Team Success

In general, one principle that I have found useful when identifying these dimensions is MECE (Mutually Exclusive and Collectively Exhaustive), which is a fundamental concept in problem-solving [3]. In this context, we want to cover as much ground as possible while avoiding overlapping elements. One practical tip is to come up with edge cases and see if they can be balanced out through other dimensions. For example, suppose player A has won many championships with minimal contribution to that success. In this case, while their score on the "Champions and Awards" dimension might be higher, their scores on the other dimensions might be low.

Establishing the foundation is critical and requires domain knowledge and acumen. Subject matter experts may already have an intuitive sense of what these dimensions are, but for non-experts, a bit of research is highly valuable.

Step 2: Identify Metric(s) and Collect Data

Next, we need to identify metrics. We can consider a Venn diagram with one set representing metrics that are great indicators of those dimensions and another set representing measurable and available data. The intersection would be our ideal candidates, or low-hanging fruit. Depending on data availability, try to have at least one metric for each dimension. When you have multiple metrics, run a correlational analysis to check for redundancy. For example, a player's average points per game (PPG) probably has a high correlation with their number of games with 20+ points, so we might want to include only one of them instead of both.
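The redundancy check above can be sketched with a small correlation matrix. The players and numbers here are entirely made up for illustration, not the article's actual dataset:

```python
import pandas as pd

# Hypothetical career stats for five players; all values are invented
# purely to illustrate the redundancy check.
df = pd.DataFrame({
    "ppg":       [30.0, 27.0, 25.0, 23.0, 20.0],  # career points per game
    "games_20p": [900, 800, 700, 600, 400],       # games with 20+ points
    "titles":    [6, 2, 11, 4, 1],                # championships won
})

# Pairwise Pearson correlations across the candidate metrics.
corr = df.corr()
print(corr.round(2))

# In this toy data, ppg and games_20p move together almost perfectly,
# so we would keep only one of them in the final metric set.
```

A common rule of thumb is to flag pairs with, say, |r| above 0.8 and keep whichever metric is easier to interpret or has better data coverage.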

Through some research, I have chosen 10 example metrics for the 4 dimensions, as shown in the table below. Fortunately, there are public datasets for many sports scenarios (data.world, the nba_api Python package, etc.) [4][5]. Make sure to check their licenses and usage terms for any commercial use, however. In addition, some data cleaning and processing may be needed.
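As a small sketch of the cleaning step, public datasets often store numbers as strings or leave values missing. The column names and values below are hypothetical, not from any of the datasets cited above:

```python
import pandas as pd

# Hypothetical raw rows as they might arrive from a public dataset.
raw = pd.DataFrame({
    "player": ["Player A", "Player B", "Player C"],
    "ppg":    ["30.1", "27.1", None],   # numbers stored as strings, one missing
    "ts_pct": [0.569, 0.588, 0.551],    # true shooting percentage
})

clean = raw.copy()
clean["ppg"] = pd.to_numeric(clean["ppg"], errors="coerce")   # strings -> floats
clean = clean.dropna(subset=["ppg"]).reset_index(drop=True)   # drop missing rows
print(clean)
```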

Example of 10 Identified Metrics. Definitions of Advanced Stats: TS, PER, WS, BPM

For many metrics like PPG, people often wonder whether we should use the career average or the value from the player's peak season. If we find the correlation between them to be relatively low, we can either 1) adopt one, then run a sensitivity analysis at the end to see whether the result changes materially when the other metric is used instead, or 2) include both metrics and assign weights.

Step 3: Determine the Scoring Function(s)

In order to determine a player's score, we need to come up with a scoring function for each metric. We can think of this as similar to the feature scaling/normalization step in classical machine learning, which avoids bias from the original scale of the metrics. Common techniques include binning, normalization, standardization, and custom logic. Oftentimes the desirable treatment depends on the type of data and its distribution. There isn't really a single answer here, and we may need to go through some trial and error to see what makes the most sense. For example, we might decide that the scoring difference between someone who never won an MVP award and someone with 1 MVP award should be higher than that between 5 and 6 MVP awards, so it might make sense to make this function increasing and concave down.
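An increasing, concave-down scoring function like the MVP example can be sketched with a saturating exponential. The functional form, cap, and rate below are my own assumptions for illustration, not the article's exact choice:

```python
import math

def score_mvps(n_mvps: int, max_score: float = 10.0, k: float = 0.5) -> float:
    """Concave, increasing score for a count metric such as MVP awards.

    Early awards add more score than later ones: the function rises
    steeply at first, then saturates toward max_score.
    """
    return max_score * (1 - math.exp(-k * n_mvps))

for n in [0, 1, 5, 6]:
    print(n, round(score_mvps(n), 2))
```

With these assumed parameters, going from 0 to 1 MVP adds far more score than going from 5 to 6, which matches the intuition described above. Min-max normalization or binning would be more natural choices for metrics like PPG that don't need diminishing returns.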

Tip: If you are not an expert, it is a good idea to validate the logic of these choices with a SME.

Step 4: Assign Weights

We are getting close to the final answer! Now we need to assign a weight to each scoring metric. Make sure the weights sum to 100%. In this example I have decided to allocate 20%, 50%, 20%, and 10% to the 4 dimensions. See the weight assigned to each metric in the table below:

Example Weight Assignment Across the 10 Metrics

Similar to the scoring function, the choice of weights can be made based on domain understanding, or “learned” through supervised ML in some cases. When we get to the next step, it is generally advisable to do a sensitivity analysis to see the impact of the chosen weights on the final score.

An alternative order is to assign weights and calculate a “raw total score” first, before applying the scoring function to get to the final result. The idea remains the same.

Step 5: Report, Analyze, and Interpret

Now is the moment of truth! With the individual scores and assigned weights, we can take the dot product as the final score. In the NBA example, here is what the top 25 list turns out to be (drum roll, please):
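The final-score computation is just a dot product of a player's per-metric scores against the weight vector. The 10 scores and the per-metric weight split below are illustrative placeholders, not the article's actual values; the only constraint carried over from Step 4 is that the weights sum to 100%:

```python
import numpy as np

# Per-metric scores for one hypothetical player (0-10 scale), one entry
# per metric, and the corresponding weights (must sum to 1.0).
scores  = np.array([9.5, 8.0, 10.0, 7.5, 9.0, 8.5, 7.0, 9.0, 8.0, 9.5])
weights = np.array([0.10, 0.10, 0.20, 0.15, 0.15, 0.05, 0.05, 0.10, 0.05, 0.05])

assert abs(weights.sum() - 1.0) < 1e-9  # sanity check: weights sum to 100%

# The final score is the dot product of scores and weights.
final_score = float(scores @ weights)
print(round(final_score, 2))
```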

Example Top 25 Rankings with Calculated Individual/Total Scores

One quick observation is that the difference between the top 3 players (Michael Jordan, LeBron James, and Kareem Abdul-Jabbar) is pretty small, which suggests that the order between them is likely to change given a different configuration in Step 3 or 4. Different people would likely reach different ranking outcomes depending on their chosen combination of metrics, scoring functions, and weights. The final step, however, is deterministic.

From another angle, say you are a big fan of Bill Russell and are not happy with his ranking. You can choose to put more weight on the number of titles and MVP awards and less weight on the other metrics, and you will likely see his ranking move up!

As mentioned in Step 4, we also want to do some sensitivity analysis to test the robustness of the choices made in Steps 2–4. You can parametrize the weights with sliders similar to the screenshot below to help analyze their impact.
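A programmatic version of that sensitivity analysis just re-ranks the players under several weight configurations and checks whether the ordering holds. The three players, their dimension scores, and the alternative weight vectors here are all invented for illustration; only the 20/50/20/10 baseline split comes from Step 4:

```python
import numpy as np

players = ["Player A", "Player B", "Player C"]

# Hypothetical scores on the 4 dimensions (Scoring, Champions & Awards,
# Efficiency & Versatility, Impact on Team Success), one row per player.
scores = np.array([
    [9.8,  9.0, 9.5, 9.2],
    [9.5,  9.6, 9.0, 9.4],
    [8.0, 10.0, 7.5, 8.5],
])

# Baseline weights from Step 4, plus two assumed alternative allocations.
weight_configs = {
    "baseline":      np.array([0.20, 0.50, 0.20, 0.10]),
    "titles_heavy":  np.array([0.10, 0.70, 0.10, 0.10]),
    "scoring_heavy": np.array([0.50, 0.20, 0.20, 0.10]),
}

rankings = {}
for name, w in weight_configs.items():
    totals = scores @ w                                  # weighted total per player
    rankings[name] = [players[i] for i in np.argsort(-totals)]
    print(name, rankings[name])
```

In this toy setup the leader flips between configurations, which is exactly the kind of fragility a sensitivity analysis is meant to surface.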

Example Slicers for 4 Chosen Metrics: PPG, Titles Won, Triple-Doubles, and True Shooting Percentage

Closing Thoughts

It's always interesting to think about the different models behind the ranking systems of the world, from the college football playoff system to the various college ranking systems. The ranking outcomes matter, but what goes into the ranking system is also important to study and understand. To me, the outcome is the tip of the iceberg; the bulk of the work and logic is underneath. Whether you work as a data professional, a product manager, in strategy, or elsewhere, understanding the nuances of the measurement system itself, the "why" behind the choices, and the potential drivers is important for amplifying impact and value.


I hope you have enjoyed this post. Let me know if you find this framework helpful, and/or if you have used it to come up with a ranking in a field you are interested in!
