Summary for Our Project
Our first blog post, “Introduction of Our Project”, covers the background information, the research background, plans for further investigation, definitions, and an outline of our project.
1.1 Our Big Idea
Given the availability of NBA statistics for teams and their opponents, our aim is to combine them with the Elo rating system (a method for calculating the relative skill levels of competitors) and predict the point spread for a matchup between any two NBA teams.
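As a rough illustration of the Elo idea, the sketch below implements the textbook Elo update for a single game. It is not our modified NBA version: the K-factor of 20, the starting rating of 1500, and the function names are assumptions chosen for the example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Textbook Elo expected score of team A against team B (between 0 and 1)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 20.0):
    """Return both ratings after one game.

    k=20 is an assumed K-factor, not a value taken from the project.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * ((1.0 if a_won else 0.0) - exp_a)
    # Whatever the winner gains, the loser gives up.
    return rating_a + delta, rating_b - delta


# Two teams start level (1500 is an assumed starting rating) and team A wins.
team_a, team_b = update_elo(1500.0, 1500.0, a_won=True)
```

The rating difference between two teams before a game can then be used as one input feature for a point-spread model.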
1.2 Measurable Progress
In our project, we used data cleaning and preprocessing techniques to process more than 20 datasets, which we then merged into a final data frame. We also implemented the Elo rating system for the NBA, with three modifications based on experience, and added several data visualizations, such as radar charts and scatter plots, using the Plotly library. Our baseline model is Linear Regression. Its results showed an overfitting problem, so we applied PCA for dimensionality reduction. To improve the predictions, we tried not only Linear Regression on the PCA-reduced features (PCA itself being an unsupervised step), but also other supervised learning methods: SVM, Lasso Regression, and a tree-based model (Random Forest). Our final Random Forest model performed well on the current 2018–2019 NBA data. We consider its prediction accuracy high relative to existing research results, especially given the many uncertainties and accidental events in an NBA regular season. Finally, we outline future work to improve our models, add more crucial features, and apply our models to sports betting.
2. Data Acquisition and Preprocessing
Our second blog post, “Data Scraping and Data Cleaning”, covers the installation of Beautiful Soup, the dataset sources, web scraping, and data cleaning. Our third blog post, “Elo Score”, introduces the Elo rating system and its implementation for the NBA. We also combined the Elo score dataset, which is updated after every game, with all the other datasets and merged everything into a final data frame.
The final data frame has 6567 rows and 137 columns. Because we assign every team the same Elo score at the beginning of the 2013–2014 NBA regular season, the first 600 rows do not reflect the differences in ability among teams very well, so we drop them. For the baseline model and the advanced models, we split the remaining in-sample dataset into a training set (70%) and a testing set (30%).
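The row-dropping and splitting step can be sketched as follows. This is a minimal standard-library illustration, not our actual code: it assumes a random shuffle with a fixed seed (the split could equally be chronological), and the function name and the stand-in list of row indices are hypothetical.

```python
import random


def drop_warmup_and_split(rows, warmup=600, train_frac=0.7, seed=0):
    """Drop the first `warmup` rows, then split the rest into train/test.

    Assumes a shuffled split; a chronological split would skip the shuffle.
    """
    kept = list(rows[warmup:])
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(kept)
    cut = int(len(kept) * train_frac)
    return kept[:cut], kept[cut:]


# Stand-in for the 6567-row data frame: one integer per row.
all_rows = list(range(6567))
train_rows, test_rows = drop_warmup_and_split(all_rows)
```

With 6567 rows, dropping the first 600 leaves 5967 rows for the 70/30 split.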
Our fourth blog post, “Data Visualization”, includes radar charts for team statistics and a visualization of the Elo score. For a clearer view, we use a Plotly scatter plot to show how the Elo score fluctuates. The scatter plot below shows the Elo score fluctuation for the Northwest Division. In the interactive figure, the exact Elo score of each team in any game can be read off; in this report, we can only provide a screenshot of the plot.
We can also view the Elo score fluctuation of a single team by double-clicking that team's label. The figure below shows the Elo score fluctuation of the Minnesota Timberwolves.
Our fifth blog post, “Machine Learning”, covers the baseline model and four advanced models. We use the out-of-sample dataset to make predictions. Prediction error figures for each advanced model are also provided.
The methodology of our project is summarized in the flow chart below:
The following figures show the prediction errors of Random Forest, SVM, Linear Regression after PCA, and Lasso Regression.
The numerical evaluation of our four advanced models is shown in the table below:
Our final Random Forest model performed well on the current 2018–2019 NBA data, but there are numerous improvements that could be made.
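For readers who want to reproduce a prediction-error summary, the sketch below computes two common regression error measures on out-of-sample predictions. The choice of MAE and RMSE is an assumption for illustration, and the point spreads are made-up numbers, not results from our models.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted spreads."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference; penalizes large misses more."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical point spreads for four games (illustrative values only).
actual_spread = [5.0, -3.0, 12.0, -8.0]
predicted_spread = [3.0, -1.0, 9.0, -10.0]

mae = mean_absolute_error(actual_spread, predicted_spread)
rmse = root_mean_squared_error(actual_spread, predicted_spread)
```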
6. Future Works and Challenges
6.1 Accidental Factors
NBA games are subject to accidental events such as injuries and player suspensions. When these events involve crucial players such as LeBron James or Stephen Curry, they have a large influence on the prediction. We are therefore trying to work with real-time data so that the model can take such current events into account.
6.2 Sports Betting
Because of the popularity of the NBA, many people and companies try to predict the outcome and point spread of games. Many NBA betting problems come down to predicting the interval in which the point difference will fall. We therefore plan to use our point-spread prediction to create a probability interval that can be used in sports betting.
Each team plays only 82 games in an NBA regular season, which limits our prediction accuracy. Moreover, the game itself is changing: over the past five years, the Golden State Warriors seem to have led a trend toward shooting three-pointers. In the current season, many games have had very large point differences, which are difficult to predict. These remain big challenges for us.
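One simple way to turn a point-spread prediction into a probability interval is sketched below. It assumes the model's held-out errors are roughly normal and centered on zero, which we have not verified. The residuals, the predicted spread, and the function names are all hypothetical.

```python
import statistics


def spread_interval(predicted_spread, residuals, z=1.96):
    """Approximate 95% interval: prediction +/- z * std of held-out residuals.

    Assumes roughly normal, zero-centered errors (an unverified assumption).
    """
    sigma = statistics.pstdev(residuals)
    return predicted_spread - z * sigma, predicted_spread + z * sigma


def prob_beats_line(predicted_spread, residuals, line):
    """Probability that the actual spread exceeds a given betting line."""
    sigma = statistics.pstdev(residuals)
    dist = statistics.NormalDist(mu=predicted_spread, sigma=sigma)
    return 1.0 - dist.cdf(line)


# Hypothetical held-out residuals (actual minus predicted), in points.
held_out_residuals = [-12.0, -6.0, 0.0, 6.0, 12.0]

low, high = spread_interval(4.0, held_out_residuals)
p_cover = prob_beats_line(4.0, held_out_residuals, line=2.5)
```

If the residuals turn out to be skewed or heavy-tailed, empirical quantiles of the residuals would be a safer basis for the interval than the normal approximation.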