Finding College Basketball Teams Similar to UConn to Predict the Next NCAA Champion

Calvin Chu
INST414: Data Science Techniques
3 min read · Dec 8, 2023

Insight

For this insight, I want to measure how similar college basketball teams are to UConn using three queries: (1) efficiency and winning percentage against D1 teams, (2) turnovers and offensive rebounds, and (3) field goal, 2-point, and 3-point percentages. I chose UConn because it won the 2023 championship, and finding teams that resemble it can help fans build their brackets for the 2024 NCAA basketball championship. This insight would be useful for college basketball fans who want to increase their chances of winning bets on offensive scoring from their brackets.

Data Collection

The data I collected came from Kaggle and has 363 rows and 22 columns, with one row per college basketball team in 2023. The features I use to compare teams to UConn are mainly offensive stats: offensive/defensive efficiency and win probability against D1 teams for query one; turnovers and offensive rebounds for query two; and field goal percentage, 2-point percentage, and 3-point percentage for query three. For my similarity metrics, I decided to use Euclidean distance for query three and cosine similarity for queries one and two.
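To make the difference between the two metrics concrete, here is a minimal plain-Python sketch of what each one computes. The function names and the vectors are made up for illustration; they are not the sklearn implementations or real team stats.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between u and v: 0.0 means identical."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Made-up vectors for illustration only, not real team stats.
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]  # same direction as u, twice the magnitude

cos_uv = cosine_similarity(u, v)
dist_uv = euclidean_distance(u, v)
```

Cosine similarity treats u and v as a perfect match because they point in the same direction, while Euclidean distance still reports a gap because their magnitudes differ. That is worth keeping in mind when choosing a metric for each query.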

Query list (findings)

Query one (offensive/defensive efficiency and win probability against D1 teams)

Query two (turnovers and offensive rebounds)

Query three (field goal percentage, 2-point percentage, and 3-point percentage)

Data cleaning/software

For software, I imported cosine similarity and Euclidean distance from sklearn.metrics.pairwise to calculate similarity for the three queries. For data cleaning, I filled any null values in the data frame with 0 to avoid errors when calculating similarity. I then removed the conference, WAB (wins above bubble, which relates to a team's chances of making March Madness), postseason, and seed columns because they are irrelevant to calculating offensive similarity. Finally, I set the team name as the index so the matrix is ready for calculation.

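import pandas as pd
# Load the 2023 team stats and index the rows by team name
df = pd.read_csv('cbb23.csv')
df = df.set_index('TEAM')
# Drop columns that are irrelevant to offensive similarity
df = df.drop(columns=['CONF','WAB','POSTSEASON','SEED'])
# fillna returns a new frame, so assign the result back
df = df.fillna(0)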
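With a cleaned frame like the one above, a single query could be sketched roughly as follows. This is only an illustration under assumptions: the helper name most_similar, the toy frame, and its column and team names are all hypothetical placeholders, and the numbers are not real statistics.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

def most_similar(df, target, columns, metric="cosine", top_n=10):
    """Rank teams by similarity to `target` over the given stat columns.

    Hypothetical helper for illustration, not the author's exact code.
    """
    features = df[columns]
    matrix = features.to_numpy()
    target_vec = features.loc[[target]].to_numpy()  # shape (1, n_columns)
    if metric == "cosine":
        scores = cosine_similarity(matrix, target_vec).ravel()
        ascending = False  # higher score means more similar
    elif metric == "euclidean":
        scores = euclidean_distances(matrix, target_vec).ravel()
        ascending = True   # smaller distance means more similar
    else:
        raise ValueError(f"unknown metric: {metric}")
    ranked = pd.Series(scores, index=features.index, name=metric)
    ranked = ranked.drop(target)  # do not compare the target with itself
    return ranked.sort_values(ascending=ascending).head(top_n)

# Placeholder values and names, not real team statistics.
toy = pd.DataFrame(
    {"stat_a": [50.0, 51.0, 40.0, 60.0],
     "stat_b": [35.0, 36.0, 20.0, 10.0]},
    index=["Target", "Team A", "Team B", "Team C"],
)
closest = most_similar(toy, "Target", ["stat_a", "stat_b"],
                       metric="euclidean", top_n=2)
```

On the real data, the target label and the column list would need to match whatever names appear in the CSV, with metric="cosine" for queries one and two and metric="euclidean" for query three.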

Limitations

What is missing from this data that would help with calculating similarity is a shooting percentage and an overall winning percentage for the season for each team. These would show which teams with the highest winning percentages are similar to UConn, making it easier to choose teams for the full 2024 bracket. Free throw stats, such as how often a team makes its free throws in a game, would also be helpful. Combined with the data in query three, they would extend the search for teams with scoring ability like UConn's. Leaving them out could introduce bias, because some teams make more free throws than others, and including them could change the similarity rankings and make them more accurate. Another limitation is that the data covers only one season of college basketball, the most recent one in 2023. Most teams could change by next season and may adjust their strategies, which could affect their future stats and championship chances.

Link to GitHub

https://github.com/Calvinchu20/INST414Medium.git
