College Basketball Games Clusters

Pengtong Yang
INST414: Data Science Techniques
May 12, 2022

After a few searches for a dataset suitable for clustering, I found an interesting one about basketball. The data were collected from college basketball games. My goal was to compare the number of games won against the number of games played, grouped into three clusters.

To group the data by similarity, I separated the observations into three clusters using the games won (W) variable together with the number of games played.

The libraries used are scikit-learn (sklearn), pandas, and matplotlib.

I used plt.scatter to create the simple scatterplot. I used KMeans with a k-value to set the number of clusters. I also used the cluster_centers_ attribute to get the centroid of each cluster.
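The clustering step described above can be sketched roughly as follows. This is a minimal sketch, not the original notebook: the data are randomly generated stand-ins, and the column names G (games played) and W (games won) are assumptions about how the real dataset is laid out.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the real data (assumed columns: G = games played, W = games won).
rng = np.random.default_rng(0)
games_played = rng.integers(25, 41, size=200)
games_won = np.minimum(games_played, rng.integers(5, 36, size=200))
df = pd.DataFrame({"G": games_played, "W": games_won})

# k = 3: ask KMeans for three clusters and assign each row to one of them.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(df[["G", "W"]])

# One centroid per cluster, as (G, W) coordinates.
centroids = kmeans.cluster_centers_
print(centroids)
```

Setting random_state keeps the cluster assignment the same from run to run, which makes the output easier to compare while debugging.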

I started with a basic scatterplot of the number of games played against the number of games won. As shown in the scatterplot below, the data points are spread out toward the right side of the graph.
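A basic scatterplot of this kind might look like the sketch below. The values are randomly generated for illustration only, so the shape of the point cloud will not match the real data.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in values; the real post reads these from the dataset.
rng = np.random.default_rng(0)
games_played = rng.integers(25, 41, size=200)
games_won = np.minimum(games_played, rng.integers(5, 36, size=200))

# Plain scatterplot: games played on the x-axis, games won on the y-axis.
points = plt.scatter(games_played, games_won)
plt.xlabel("Number of games played")
plt.ylabel("Number of games won")
plt.title("Games played vs. games won")
# Call plt.show() in an interactive session to display the figure.
```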

The k-value is set to 3, so the scatterplot is separated into three clusters, each with a centroid marking its center. Cluster 1 is shown in blue, Cluster 2 in red, and Cluster 3 in purple. The centroids of all clusters are shown as black stars.
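The colored plot described above could be produced along these lines. This is a sketch under assumptions: the data are synthetic, and which cluster ends up blue, red, or purple depends on the label order KMeans happens to assign.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in values for games played and games won.
rng = np.random.default_rng(0)
games_played = rng.integers(25, 41, size=200)
games_won = np.minimum(games_played, rng.integers(5, 36, size=200))
X = np.column_stack([games_played, games_won])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

# Draw each cluster in its own color.
colors = ["blue", "red", "purple"]
for k, color in enumerate(colors):
    mask = labels == k
    plt.scatter(X[mask, 0], X[mask, 1], c=color, label=f"Cluster {k + 1}")

# Mark the centroids with black stars.
stars = plt.scatter(centroids[:, 0], centroids[:, 1],
                    c="black", marker="*", s=200, label="Centroids")
plt.xlabel("Number of games played")
plt.ylabel("Number of games won")
plt.legend()
# Call plt.show() in an interactive session to display the figure.
```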

Bugs Encountered:

1. Spelling: as this was my first time working with clusters, I made many spelling mistakes in the code. For instance, KMeans was often misspelled as Kmeans or Kmean, and cluster was misspelled as custer.

2. If the k-value is set to a larger number such as 5, the clusters in the scatterplot become more spread out. If the number is smaller than 3, only that many clusters are shown on the graph. To avoid the problem, the k-value should match the intended number of clusters (three here) and be double-checked.
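One way to double-check the k-value is to count how many distinct cluster labels KMeans actually returns. The sketch below does this on synthetic data for a few k-values; it is only an illustration of the check, not the code used in the post.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in values for games played and games won.
rng = np.random.default_rng(0)
games_played = rng.integers(25, 41, size=200)
games_won = np.minimum(games_played, rng.integers(5, 36, size=200))
X = np.column_stack([games_played, games_won])

# Fit with several k-values and record how many distinct labels come back.
label_counts = {}
for k in (2, 3, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    label_counts[k] = len(np.unique(model.labels_))

print(label_counts)
```

Each k-value should report that many distinct labels. If the plotting code is written for exactly three colors, any other k-value will leave some clusters undrawn or some colors unused, which is one possible explanation for the behavior described above.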

Limitations:

The dataset has 2,455 observations and 25 variables, so it might seem like a very good dataset with a sufficient number of observations and variables. On closer examination, however, the values are very similar to one another, which limited the number of clusters that could be used. Also, some variables are not numeric: TEAM, the name of the college basketball team, is text; CONF, the conference, mixes letters and numbers; and POSTSEASON, the round in which the team's season ended, is also a combination of letters and numbers. It is very difficult to use all the variables to create clusters without a deep data-cleaning process. Finding a clean dataset is therefore also very important in this case.
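A first cleaning step could be to keep only the numeric columns before clustering. The sketch below shows the idea on a tiny made-up table; the rows, team names, and the G column are hypothetical and serve only to mirror the mix of text and numeric variables described above.

```python
import pandas as pd

# Tiny made-up table mixing text and numeric columns (all values are illustrative).
df = pd.DataFrame({
    "TEAM": ["Team A", "Team B", "Team C"],
    "CONF": ["B10", "ACC", "P12"],
    "POSTSEASON": ["R64", "E8", "S16"],
    "G": [33, 38, 35],
    "W": [20, 31, 26],
})

# Keep only numeric columns, which KMeans can use directly.
numeric = df.select_dtypes(include="number")
print(list(numeric.columns))
```

Text columns such as TEAM, CONF, and POSTSEASON would need to be encoded as numbers or dropped before they could take part in the clustering.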

Takeaways:

Spelling is extremely important in more advanced coding such as clustering. A minor mistake can waste hours spent hunting for a spelling error that could have been avoided with extra caution. Clusters are useful in many ways, especially when colors and centroids are included. Unlike a traditional scatterplot, a graph with clusters is easy to read and understand at first glance. A more advanced coding project such as clustering requires more time for planning and organizing the data. In the planning phase, a simple diagram can divide the code into main steps, with minor steps under each main step; this makes the whole project go much more smoothly.
