First Experience with Machine Learning: An Unsupervised Approach — Utilizing K-Means to Classify Mushrooms
First and foremost, let us explain our actions and the rationale behind them. This article encapsulates an account of our inaugural venture into the realm of machine learning, a collaborative project undertaken as an integral part of our college studies. In this pursuit, our objective was to craft an application proficient in discerning between edible and poisonous mushrooms. To realize this goal, we harnessed the Mushroom Classification dataset and employed the unsupervised K-means method.
For the execution of this project, we opted for the Google Colab environment, utilizing the Python programming language. Additionally, we leveraged the AI development platform WandB and incorporated several libraries such as Pandas, NumPy, Matplotlib for graph creation, Pytest, Pickle, TensorFlow, and various functions from the Scikit-learn library.
Why mushrooms?
For our inaugural experience, we sought a subject that was both intriguing and aligned with our shared interests. During our research into potential biological datasets, the mushroom dataset captivated our attention. The prospect of exploring the world of mushrooms and implementing our own machine learning algorithm to derive insights from this newfound knowledge ignited a sense of excitement and curiosity.
What is K-means?
K-Means is an unsupervised machine learning algorithm that clusters data into distinct sets based on their characteristics and similarities. Being an unsupervised algorithm means we don’t initially know the number or composition of these clusters. Instead, we employ analytical techniques and the K-Means algorithm itself to discover and define these clusters.
This project uses K-means in an unusual way. Unlike traditional applications, our dataset already contains the information about the groups that K-Means would typically aim to discover. The objective here is to utilize K-Means to cluster the data without prior knowledge of the desired labels and subsequently analyze the results.
The functioning of K-Means is as follows:
- Initialization: The algorithm starts by randomly selecting K centroids, where K is defined by the user.
- Point assignment: Each data point is assigned to the nearest centroid.
- Centroid update: The centroids of each cluster are recalculated as the average of the points assigned to that cluster.
- Repetition: Steps 2 and 3 are repeated until the centroids stop moving, that is, until the algorithm converges.
- Result: At this point, the grouping process is complete, and each data point is assigned to a specific cluster (see the sketch below).
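In practice we relied on scikit-learn’s implementation, but to make these steps concrete, here is a minimal NumPy sketch of the algorithm, given a NumPy array of features (all names are illustrative, not our production code):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick K random data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Point assignment: each point joins its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Centroid update: each centroid moves to the mean of its points
        # (empty clusters are not handled, for simplicity)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Repetition: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # 5. Result: every point now belongs to a specific cluster
    return labels, centroids
```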
Before the K-means process, let’s examine the preprocessing
In the preprocessing phase, it is essential to transform the dataset into a format suitable for use in K-means. This involves taking precautions to ensure the integrity of values and performing normalizations on the raw data. Although two common issues, duplicated rows and empty values (NaN), are not present in the dataset, we still need to address the normalization of values by converting categorical values into numeric ones.
To achieve this, the following code snippet employs a simple function called Encoder:
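The original snippet is not reproduced here, but a minimal sketch of what such an Encoder can look like, assuming the dataset is loaded into a Pandas DataFrame and using scikit-learn’s LabelEncoder (the file path is an assumption):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encoder(df: pd.DataFrame) -> pd.DataFrame:
    """Convert every categorical (string) column into numeric codes."""
    encoded = df.copy()
    for column in encoded.columns:
        encoded[column] = LabelEncoder().fit_transform(encoded[column])
    return encoded

# The Mushroom Classification dataset is entirely categorical,
# so every column is encoded (the file path is illustrative).
df = encoder(pd.read_csv("mushrooms.csv"))
```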
This code converts the categorical values in our dataset into numeric ones, concluding the preprocessing step and preparing the data for the subsequent K-means process.
Validating Preprocessing Data for Our Project
To ensure the success of creating the K-means model, it is indispensable to verify that the preprocessing has been executed accurately. Performing tests on the dataset resulting from the preprocessing phase is crucial. For instance, columns containing values in string format can prevent K-means from running at all, as the algorithm is designed to operate on quantitative data. Additionally, the presence of outliers (values that deviate significantly from the dataset’s overall distribution) can distort K-means results.
With these considerations in mind, we will delve into the data validation phase. During this stage, we will conduct various tests to confirm that the data within the columns has been appropriately handled during preprocessing:
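Our actual test suite is not shown here; the sketch below illustrates the kind of pytest checks this stage can contain (the file name and the outlier threshold are assumptions):

```python
import pandas as pd
import pytest

@pytest.fixture
def df():
    # Illustrative path: load the already-encoded dataset.
    return pd.read_csv("mushrooms_encoded.csv")

def test_all_columns_are_numeric(df):
    # K-means works with quantitative data only.
    assert all(pd.api.types.is_numeric_dtype(df[col]) for col in df.columns)

def test_no_missing_values(df):
    assert not df.isna().any().any()

def test_no_duplicated_rows(df):
    assert not df.duplicated().any()

def test_no_extreme_outliers(df):
    # Crude z-score check; constant columns are guarded against division by zero.
    std = df.std(ddof=0).replace(0, 1)
    z = (df - df.mean()) / std
    assert (z.abs() < 4).all().all()
```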
Segregating data into training and testing
After all the data normalization, we need to split our preprocessed dataset into two parts: the training data, used to create our model, and the test data, used to validate it, thus ensuring its robustness and generalization capabilities.
In this stage, we establish a reliable way to assess the model’s ability to make accurate predictions on new, real-world data. This essential step in the machine learning pipeline not only validates the model’s performance but also enhances its reliability and applicability in diverse scenarios.
This is the segregation data code:
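The snippet itself is not reproduced here; reconstructed from the description below, and assuming the encoded DataFrame df with its target column named "class", it would look roughly like this:

```python
from sklearn.model_selection import train_test_split

seed = 42  # controls the randomness of the split, for reproducibility

# Stratify on the target label so both sets keep the same
# proportion of edible and poisonous mushrooms.
train, test = train_test_split(
    df,
    test_size=0.20,
    random_state=seed,
    stratify=df["class"],
)

splits = {"train": train, "test": test}
```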
This section of code splits the database into 80% training data and 20% test data using a stratified random split on the target label, ensuring that the training and test sets contain the same proportion of each class and preventing bias in the model. To ensure reproducibility, we also create a seed variable that controls the randomness of the split.
The splits variable is a dictionary that stores the training and test datasets returned by scikit-learn’s train_test_split function. The only thing left to do is save our segregated data somewhere; for that, we use the Weights & Biases platform, an AI developer platform.
Searching for the Ideal Number of Clusters in the Creation of Our K-means Algorithm
To determine the optimal number of clusters, we employ various metrics to approach the best value. These metrics include:
- Elbow Method: This metric gauges the inertia, in other words, the sum of squared distances from each data point to the centroid of its cluster. Inertia always decreases as clusters are added, so rather than simply minimizing it, we look for the “elbow”: the point beyond which adding another cluster yields only a marginal improvement.
- Silhouette Score: The silhouette score measures how well a given data point fits within its own cluster compared to other clusters. It is computed per point, ranges from -1 to 1, and is averaged over all points. A high score indicates that the object is very similar to its own cluster and dissimilar from the others.
- Silhouette Coefficient: In addition to the average silhouette score, this visualization plots the silhouette of every point in each cluster along with the proportion of values per cluster. It provides a more nuanced perspective on the quality of the cluster assignments.
- Cluster Analysis: Using a graph, we assess the proportion of values within each cluster, employing a 2D simplification (a sketch of how the first two metrics can be computed follows this list).
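As a reference, here is a sketch of how the elbow and silhouette metrics can be computed with scikit-learn; the variable names reuse the earlier snippets and are assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = splits["train"].drop(columns=["class"])  # features only

inertias, silhouettes = [], []
ks = range(2, 10)
for k in ks:
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(X)
    inertias.append(km.inertia_)                            # elbow method
    silhouettes.append(silhouette_score(X, km.labels_))     # silhouette score

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(list(ks), inertias, marker="o")
ax1.set_title("Elbow method (inertia)")
ax2.plot(list(ks), silhouettes, marker="o")
ax2.set_title("Average silhouette score")
plt.show()
```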
Result of the search for the ideal number of clusters
Following the elbow method and the average silhouette score, the optimal choice for the number of clusters in our K-means model is 3. Close behind are 6 and 2 clusters, which also have high silhouette scores. Although 6 clusters achieves an even higher silhouette score than 2, it is less favored by the elbow method, making it a less advisable choice.
Additionally, the graphs depicting the silhouette coefficient and clusters reaffirm favorable outcomes for 2 and 3 clusters. Notably, the 2-cluster configuration outperforms the 3-cluster arrangement due to a more uniform distribution, especially evident when scrutinizing the numbers between the silhouette labels.
It is crucial to emphasize that each of these methods serves as a heuristic approach to approximate the ideal value and does not offer universally applicable solutions. In our specific case, a definitive answer is discernible within the original dataset. By examining the data, we observe that our target label comprises 2 values (edible and poisonous). Consequently, we already possess a predefined choice for K-means.
Training the Model
After determining the optimal number of clusters for our algorithm, we proceeded with training the K-means model using two clusters. The goal is to empower the model to categorize data into two distinct groups: edible and poisonous mushrooms. As mentioned earlier, K-means performs this grouping based on characteristic values. For instance, the characteristics of a mushroom include its height, shape, etc. Therefore, we leverage the data present in each record to create categories. K-means calculates the distance between these values, with this distance being the determining criterion to classify a mushroom as either edible or poisonous.
It’s important to note that, before initiating the model training, we removed the ‘class’ column, which contained the categories of edible and poisonous. This removal is crucial to prevent information leakage during training, ensuring a more accurate evaluation of the model’s performance:
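A sketch of that removal, reusing the splits dictionary and the column name assumed earlier:

```python
# Keep the target label aside for later evaluation and train
# only on the feature columns, so the model never sees the answer.
X_train = splits["train"].drop(columns=["class"])
y_train = splits["train"]["class"]
```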
After determining the optimal number of clusters for our algorithm and removing the target label, the model is now ready for training. We achieved satisfactory results by running K-means as follows:
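Our exact training snippet is not reproduced here; with scikit-learn it amounts to something like this:

```python
from sklearn.cluster import KMeans

# Two clusters: one is expected to capture the edible mushrooms,
# the other the poisonous ones.
kmeans = KMeans(n_clusters=2, random_state=seed, n_init=10)
kmeans.fit(X_train)
```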
Testing and Validation of the Model
After training the model, it is ready to analyze a new dataset containing information about mushrooms. The objective is to group this information, classifying each mushroom as either poisonous or edible.
It’s important to recall that during the data segregation phase, we divided our dataset into two distinct parts. One part was reserved for training the model, while the other was designated for testing. Therefore, when running the model with a new dataset, different from the one used in training, we ensure that the model can correctly group new instances not seen before.
In this way, we achieved an accuracy of approximately 84%. Although K-means is not a perfect algorithm and shows some limitations in separating the classes precisely, we can assert that our model determines whether a mushroom is edible or poisonous with roughly 84% accuracy.
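Our evaluation code is not shown here; one common way to score a 2-cluster model against known binary labels is sketched below. It assumes the encoded labels are 0 and 1, since cluster IDs are arbitrary and may be flipped relative to the true labels:

```python
from sklearn.metrics import accuracy_score

X_test = splits["test"].drop(columns=["class"])
y_test = splits["test"]["class"]

pred = kmeans.predict(X_test)

# Cluster IDs are arbitrary, so evaluate both possible
# cluster-to-label mappings and keep the better one.
accuracy = max(
    accuracy_score(y_test, pred),
    accuracy_score(y_test, 1 - pred),
)
print(f"accuracy: {accuracy:.2%}")
```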
Next, we present two comparative graphs. The left graph is generated from the data from the test set, displaying the total number of edible and poisonous mushrooms. Meanwhile, the graph on the right is constructed based on the results obtained by our model.
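A sketch of how such a pair of bar charts can be produced with Matplotlib, reusing the test variables from the previous step:

```python
import pandas as pd
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
# Left: true label counts in the test set.
y_test.value_counts().sort_index().plot.bar(ax=ax1, title="Test set labels")
# Right: counts per cluster predicted by the model.
pd.Series(pred).value_counts().sort_index().plot.bar(ax=ax2, title="Model clusters")
plt.show()
```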
Final considerations of our experience
This project was part of our study on machine learning, in which we utilized the K-means algorithm to classify mushrooms as edible or poisonous. Every step, from pre-processing to model training, was crucial. The choice of the mushroom dataset not only piqued our interest but also unveiled how K-means handles datasets, requiring only that the data be adapted to a format the algorithm understands. This revealed a wide range of areas in which K-means can be employed to group data and recognize it as a specific type. Pre-processing the data, careful validation, selecting the optimal number of clusters, and training the model were vital steps that led us to an accuracy of around 84% in the algorithm’s results.
In summary, this project not only delved into K-means but also demonstrated its applicability in a biological context. This journey has significantly contributed to our understanding of machine learning and marks just the beginning in this fascinating world.
Credits
My LinkedIn account: linkedin.com/in/valmir-francisco-581222288
My GitHub account: github.com/valfra0425
Special thanks to my co-worker: linkedin.com/in/claudio-henrique-8047a7266
His GitHub account: github.com/ClauHenrique
Files
Notebooks: https://github.com/valfra0425/project-mushrooms