Using machine learning to predict the 2017 NBA All Star rosters

I love basketball, so a lot of my projects revolve around it. There is a virtually unlimited amount of data to mine and an endless number of projects to build from it.

I write my code using the PyCharm IDE on both Windows (bleh) and OSX (meh). I collaborate with my brilliant friend and Python mentor, Devin. He has contributed to a lot of my projects. For my basketball research, I utilize the fantastic Basketball Reference website.

I’m interested in machine learning, which is a type of artificial intelligence. It’s a deep topic, and I’m a beginner, but I’ve used it for classification (sorting objects into categories) and regression (identifying a relationship between variables). To determine which NBA players would make the All Star team, I found it best to use a Decision Tree classification model. (All Star? Y/N)

In order to figure out which players would be named an All Star, we have to analyze past statistics. Each statistic is a “feature.” What are some characteristics that would make an NBA player an All Star? He probably scores a lot of points, plays a lot of minutes, gets to the free throw line, etc. We can extract many features from the data set: steals, turnovers, blocks, rebounds, shooting percentages, player age, games played, and so on. But if you try to train the computer on too many features, that creates a lot of noise and can lead to something called “overfitting.” It probably isn’t important to know what color sneakers the player wears or what type of gum he likes to chew.
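As a rough sketch of what building that feature set looks like in code (the rows and column names here are made-up stand-ins, not the real Basketball Reference file), we load the CSV-style data with pandas and keep only the statistical columns:

```python
import pandas as pd

# Hypothetical stand-in for the real CSV, which has one row
# per player-season going back to 1980.
df = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "PTS": [28.1, 9.4, 19.8],   # points per game
    "MP": [36.2, 18.7, 33.0],   # minutes per game
    "FTA": [8.9, 1.2, 4.4],     # free throw attempts per game
    "All Star": [1, 0, 0],      # the label: 1 = True, 0 = False
})

# Keep only the stat columns as features; dropping the label and
# the player name keeps the answer from leaking into the inputs.
feature_cols = ["PTS", "MP", "FTA"]
X = df[feature_cols]
y = df["All Star"]
```

Trimming the feature list this way is also the first defense against the overfitting problem mentioned above.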

So what features are important for predicting All Star selections? We can use a nice module from the scikit-learn Python library to determine feature importance. The computer looks through our big CSV file of NBA players and all their respective statistics over all the seasons (the file I used dates back to 1980). Included in the file is a column called “All Star”. That is populated by a 1 (True) or 0 (False). The computer can start to see and “learn” features that are associated with All Stars.
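A sketch of how that works in scikit-learn, using synthetic stand-in numbers instead of the real CSV: fit a `DecisionTreeClassifier` on the features and the All Star labels, then read its `feature_importances_` attribute, which sums to 1.0 and so converts naturally to percentages:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 player-seasons with three made-up stat columns.
pts = rng.normal(12, 6, 200)
mp = rng.normal(24, 8, 200)
fta = rng.normal(3, 2, 200)
X = np.column_stack([pts, mp, fta])

# For illustration only, label the heavy scorers as All Stars.
y = (pts > 20).astype(int)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# feature_importances_ sums to 1.0; print each as a percentage.
for name, imp in zip(["PTS", "MP", "FTA"], clf.feature_importances_):
    print(f"{name}: {imp:.1%}")
```

On the real data set, this is the kind of output behind the “top 10 most important features” list.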

Examples of “impressive” stats are highlighted in yellow. Red is bad.

Here are the top 10 most important features from our data set expressed as percentages.

So after learning from past players, we can turn the algorithm loose on the players from the 2016–17 season. We tell the computer, “Hey, we don’t know who should be an All Star. How about you look through all the players’ stats, and, knowing what you know about what makes an All Star, try to ‘classify’ them for me.” Something to note here: I didn’t tell the computer how many classifications it needed to make. I didn’t say there had to be a certain number of players from the Eastern and Western conferences. It just looked through the cold, hard stats and said, “OK, here’s what I think.” So, using the list of features mentioned above, here is the list of predictions.
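In code, that step is just a `predict()` call on the new season’s feature matrix (again with synthetic stand-in numbers here). Notice that nothing in the call constrains how many 1s come back, or balances them by conference:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

# Train on hypothetical historical player-seasons (labels known)...
X_hist = rng.normal(size=(300, 3)) * [6, 8, 2] + [12, 24, 3]
y_hist = (X_hist[:, 0] > 20).astype(int)
clf = DecisionTreeClassifier(random_state=0).fit(X_hist, y_hist)

# ...then classify the new season's players. Each row gets an
# independent 1 (All Star) or 0 (not) -- no roster-size hint given.
X_2017 = rng.normal(size=(5, 3)) * [6, 8, 2] + [12, 24, 3]
preds = clf.predict(X_2017)
```

The model simply answers yes or no per player, which is exactly why it can name more (or fewer) than 24 All Stars.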

After the All Star rosters were finalized on 1/26/17, I plugged them into a spreadsheet and compared them to my predictions. How did I do? 20/24 classifications were correct. That’s 83% — pretty good! I editorialized some of the analysis and notes. If the algorithm failed to predict an actual All Star, then that player must be overrated *cough*Jordan*cough* — and if the algorithm predicted a player who was not named to the All Star team, I call that a snub! *cough*Lillard*cough*
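The scoring itself is simple arithmetic. With placeholder predicted/actual lists arranged to give 20 agreements out of 24 spots (the real numbers came from my spreadsheet), it looks like:

```python
# Placeholder vectors: 1 = All Star, 0 = not. Arranged so that
# 20 of the 24 entries agree, matching the write-up's result.
predicted = [1] * 24
actual = [1] * 20 + [0] * 4

correct = sum(p == a for p, a in zip(predicted, actual))
accuracy = correct / len(actual)
print(f"{correct}/{len(actual)} = {accuracy:.0%}")  # prints 20/24 = 83%
```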

Statistical data pulled from Basketball Reference.