Published in Analytics Vidhya

CLASSIFICATION METHOD FOR ESTIMATING THE NUMBERS OF RINGS OF ABALONE

The iridescent surface inside a red abalone shell from Northern California (the adjacent coin is 25 mm in diameter) (Source: https://en.wikipedia.org/wiki/Abalone)

Introduction

Abalone is a common name for any of a group of small to very large sea snails, marine gastropod molluscs in the family Haliotidae.

Other common names are ear shells, sea ears, and, rarely, muttonfish or muttonshells in parts of Australia, ormer in the UK, perlemoen in South Africa, and the Maori name for three species in New Zealand is pāua.

Abalone are marine snails. Their taxonomy puts them in the family Haliotidae, which contains only one genus, Haliotis, which once contained six subgenera. These subgenera have become alternate representations of Haliotis. The number of species recognized worldwide ranges between 30 and 130 with over 230 species-level taxa described. The most comprehensive treatment of the family considers 56 species valid, with 18 additional subspecies.

The shells of abalones have a low, open spiral structure, and are characterized by several open respiratory pores in a row near the shell’s outer edge. The thick inner layer of the shell is composed of nacre (mother-of-pearl), which in many species is highly iridescent, giving rise to a range of strong, changeable colors, which make the shells attractive to humans as decorative objects, jewelry, and as a source of colorful mother-of-pearl.

The flesh of abalones is widely considered to be a desirable food, and is consumed raw or cooked by a variety of cultures.

In this article, we try to predict the number of rings of an abalone using classification methods.

Methodology

The dataset is retrieved from http://archive.ics.uci.edu/ml/datasets/Abalone. We first check the data for missing values.

We then preprocess the data and run the classification with several classifiers.

The classifiers we will use are logistic regression, random forest, and SVM.

Import Important Libraries

Before we go through all the processes, first we import all the libraries that we will need.
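The original import cell is not reproduced here. A plausible set of imports for the steps described in this article, assuming pandas and scikit-learn are the tools in use, might look like this:

```python
# Data handling
import numpy as np
import pandas as pd

# Splitting, scaling, and enumerating parameter combinations
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.preprocessing import StandardScaler

# The three classifiers compared in this article
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
```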

Getting the Dataset

The dataset is retrieved from http://archive.ics.uci.edu/ml/datasets/Abalone. The dataset page lists, for each attribute, its name, type, measurement unit, and a brief description. The number of rings is the value to predict.

Now we get the dataset
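A minimal sketch of the loading step, assuming the data file is comma-separated with no header row. The column names below are assumptions based on the attribute list on the dataset page (only Sex, Height, and Rings are named in this article), and `load_abalone` is a hypothetical helper, not a library function. The three sample rows are illustrative and only serve to make the sketch runnable offline.

```python
import io
import pandas as pd

# Assumed column names; the raw file itself carries no header.
COLUMNS = ["Sex", "Length", "Diameter", "Height", "Whole_weight",
           "Shucked_weight", "Viscera_weight", "Shell_weight", "Rings"]

def load_abalone(source):
    """Read the abalone data from a path, URL, or file-like object."""
    return pd.read_csv(source, header=None, names=COLUMNS)

# Illustrative rows in the same layout as the data file.
sample = io.StringIO(
    "M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15\n"
    "F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9\n"
    "I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7\n"
)
df = load_abalone(sample)
print(df.head())
```

In practice, `source` would be the path or URL of the downloaded data file instead of the in-memory sample.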

Now take a look at the data conditions

From the information above, all features are continuous variables except for the Sex feature. In the Height feature, the minimum value is zero. A height of zero is not physically possible, so these entries most likely represent missing values, which we handle during preprocessing.
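The inspection described above can be sketched with `DataFrame.info()` and `DataFrame.describe()`. The small frame below uses made-up values purely for illustration; on the real data the same calls reveal the zero minimum in Height.

```python
import pandas as pd

# Tiny illustrative frame; the values are made up for the example.
df = pd.DataFrame({
    "Sex": ["M", "F", "I", "I"],
    "Height": [0.095, 0.135, 0.080, 0.0],
    "Rings": [15, 9, 7, 6],
})

df.info()                 # dtypes and non-null counts per column
summary = df.describe()   # summary statistics for the numeric columns only
print(summary)

# A minimum of 0 in Height is what flags the suspicious rows.
print(summary.loc["min", "Height"])
```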

Next, take a look at the target in this case in the Rings column

We can see that the target takes integer values from 1 to 29 (28 does not occur), so this is a multi-class classification problem.
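One way to look at the target distribution, sketched here on a made-up series rather than the real Rings column, is to count how often each value occurs and list the values that occur only once:

```python
import pandas as pd

# Made-up ring counts, for illustration only.
rings = pd.Series([7, 9, 9, 10, 10, 10, 29], name="Rings")

# Number of samples per class, ordered by ring count.
counts = rings.value_counts().sort_index()
print(counts)

# Classes represented by a single sample.
singletons = counts[counts == 1].index.tolist()
print(singletons)
```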

Data Preprocessing

Dealing with missing values

We first check how many missing values there are in the Height feature and which Sex category they belong to.

There are 2 missing values, both in the infant category. We change the value 0 to null and then fill in the missing values with the average of the Height feature for infants.

So we fill in the missing values with 0.107996.
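The imputation steps above can be sketched as follows, on a small made-up frame. Whether the original notebook computed the infant mean before or after replacing the zeros is not shown; this sketch assumes the zeros are excluded from the mean.

```python
import numpy as np
import pandas as pd

# Made-up values; one infant row has an impossible height of 0.
df = pd.DataFrame({
    "Sex": ["I", "I", "I", "M", "F"],
    "Height": [0.10, 0.0, 0.12, 0.15, 0.14],
})

# Which rows have a zero height, and which Sex category are they in?
print(df[df["Height"] == 0])

# Treat 0 as missing, then fill with the mean height of infants.
df["Height"] = df["Height"].replace(0, np.nan)
infant_mean = df.loc[df["Sex"] == "I", "Height"].mean()  # NaN is skipped by mean()
df["Height"] = df["Height"].fillna(infant_mean)
print(df)
```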

Encoding categorical features

As we have seen, the Sex feature is categorical, so we need to encode it. We use one-hot encoding for this.

After the encoding, the number of columns increases to 11.
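A minimal sketch of the encoding with `pandas.get_dummies`, on three illustrative rows. Column names other than Sex, Height, and Rings are assumptions. The single Sex column is replaced by three indicator columns, which takes the frame from 9 to 11 columns.

```python
import pandas as pd

# Illustrative rows with the nine original columns (names partly assumed).
df = pd.DataFrame({
    "Sex": ["M", "F", "I"],
    "Length": [0.455, 0.530, 0.330],
    "Diameter": [0.365, 0.420, 0.255],
    "Height": [0.095, 0.135, 0.080],
    "Whole_weight": [0.5140, 0.6770, 0.2050],
    "Shucked_weight": [0.2245, 0.2565, 0.0895],
    "Viscera_weight": [0.1010, 0.1415, 0.0395],
    "Shell_weight": [0.150, 0.210, 0.055],
    "Rings": [15, 9, 7],
})

# One indicator column per category: Sex_F, Sex_I, Sex_M.
encoded = pd.get_dummies(df, columns=["Sex"])
print(encoded.columns.tolist())
print(encoded.shape)
```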

Splitting data

We split the data into three parts: a training set, a validation set, and a test set. We use a separate validation set instead of cross-validation because several target classes contain only one sample, so those classes cannot appear in every fold.
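A three-way split can be sketched with two calls to `train_test_split`. The 60/20/20 proportions and the random seed are assumptions, since the original split sizes are not stated, and the data here is synthetic. No `stratify` argument is used, because stratifying fails when a class has only one member.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded features and the Rings target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = rng.integers(1, 30, size=100)

# First hold out 20% as the test set ...
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# ... then take 25% of the remainder (20% of the total) for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```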

Data standardization

To speed up training of the classifiers, we first standardize the features.
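A sketch of the standardization step with `StandardScaler`, on synthetic data. Fitting the scaler on the training set only and reusing it for the validation and test sets is an assumption about the original workflow, though it is the usual way to avoid leaking information from held-out data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the three splits.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(60, 3))
X_val = rng.normal(loc=5.0, scale=2.0, size=(20, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Learn the per-feature mean and standard deviation from the training set only.
scaler = StandardScaler().fit(X_train)

# Apply the same transformation to every split.
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```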

Classification

We will classify the dataset using three classifiers, namely logistic regression, random forest, and SVM.

We also determine the best parameters for each classifier. We do not use cross-validation for this, because several target classes contain only a single sample. Instead, we use a simple grid search: each parameter combination is trained on the training set and scored on the validation set, and the best-scoring combination is kept.
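The simple grid search can be sketched as a loop over parameter combinations. `simple_grid_search` is a hypothetical helper written for this illustration, the data is synthetic, and the grid values are assumptions; `ParameterGrid` is only used to enumerate the combinations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid, train_test_split

def simple_grid_search(make_model, param_grid, X_train, y_train, X_val, y_val):
    """Fit one model per parameter combination and keep the best validation score."""
    best_score, best_params = -1.0, None
    for params in ParameterGrid(param_grid):
        model = make_model(**params).fit(X_train, y_train)
        score = model.score(X_val, y_val)  # accuracy on the validation set
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Synthetic three-class problem standing in for the abalone data.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

candidate_C = [0.01, 0.1, 1, 10]
best_params, best_score = simple_grid_search(
    lambda **p: LogisticRegression(max_iter=1000, **p),
    {"C": candidate_C},
    X_train, y_train, X_val, y_val)
print(best_params, best_score)
```

The test set is never touched inside the loop, so it remains available for a final, unbiased score once the best parameters are chosen.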

Logistic regression

The parameters that we tune for logistic regression are C and solver. Since this is a multi-class classification, we choose the solver from newton-cg, sag, saga, and lbfgs. We set multi_class to multinomial and penalty to l2.
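A hedged sketch of this search on synthetic data follows. The C values, `max_iter`, and the seed are assumptions. The `multi_class` and `penalty` arguments are left at their defaults here: recent scikit-learn versions deprecate `multi_class`, the four solvers listed above already fit a multinomial model on multi-class data, and l2 is the default penalty.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data, split and standardized as in the earlier steps.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_val = scaler.transform(X_train), scaler.transform(X_val)

# Assumed candidate values for C; the solvers are those named in the text.
lr_grid = {"C": [0.01, 0.1, 1, 10],
           "solver": ["newton-cg", "sag", "saga", "lbfgs"]}

lr_results = []
for params in ParameterGrid(lr_grid):
    clf = LogisticRegression(max_iter=2000, random_state=0, **params)
    clf.fit(X_train, y_train)
    lr_results.append((clf.score(X_val, y_val), params))

# Keep the combination with the highest validation accuracy.
lr_best_score, lr_best_params = max(lr_results, key=lambda r: r[0])
print(lr_best_params, lr_best_score)
```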

With the best parameters, we get a training set score of 0.29 and a test set score of 0.25.

Random forest

The parameters that we tune for random forest are criterion, max_depth, and max_features.
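The same loop applies to the random forest. The candidate values below, the number of trees, and the seed are assumptions for illustration, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic three-class data standing in for the abalone features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Assumed candidate values for the three tuned parameters.
rf_grid = {"criterion": ["gini", "entropy"],
           "max_depth": [3, 5, None],
           "max_features": ["sqrt", "log2"]}

rf_results = []
for params in ParameterGrid(rf_grid):
    clf = RandomForestClassifier(n_estimators=50, random_state=0, **params)
    clf.fit(X_train, y_train)
    rf_results.append((clf.score(X_val, y_val), params))

# Keep the combination with the highest validation accuracy.
rf_best_score, rf_best_params = max(rf_results, key=lambda r: r[0])
print(rf_best_params, rf_best_score)
```

Limiting `max_depth` is one way to keep the training score close to the validation score, since fully grown trees tend to memorize the training set.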

With the best parameters, we get a training set score of 0.29 and a test set score of 0.27.

Support Vector Machine

The parameters that we will set for classification using SVM are kernel, C, and gamma.
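For the SVM, a sketch of the search over kernel, C, and gamma using `SVC` on synthetic data follows. The candidate values are assumptions. Note that gamma has no effect when the kernel is linear, so some combinations in this grid are redundant.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic three-class data, split and standardized as in the earlier steps.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_val = scaler.transform(X_train), scaler.transform(X_val)

# Assumed candidate values for the three tuned parameters.
svm_grid = {"kernel": ["rbf", "linear"],
            "C": [0.1, 1, 10],
            "gamma": ["scale", 0.01, 0.1]}

svm_results = []
for params in ParameterGrid(svm_grid):
    clf = SVC(**params).fit(X_train, y_train)
    svm_results.append((clf.score(X_val, y_val), params))

# Keep the combination with the highest validation accuracy.
svm_best_score, svm_best_params = max(svm_results, key=lambda r: r[0])
print(svm_best_params, svm_best_score)
```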

With the best parameters, we get a training set score of 0.31 and a test set score of 0.26.

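We can summarize the scores for each classifier as follows:

Logistic regression: training set score 0.29, test set score 0.25
Random forest: training set score 0.29, test set score 0.27
SVM: training set score 0.31, test set score 0.26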

We can observe that every model's test accuracy is below 0.3, which is low and makes the models of little use for prediction. This is probably due to the large number of classes (28) and the strong class imbalance in the target.

Summary

In this report, three classification algorithms were used to predict the number of rings. For each one, we tuned the hyperparameters with a simple grid search and selected the best configuration by accuracy. We then presented and discussed the results in the form of training set and test set scores. Based on this analysis, the random forest model has the best test accuracy of the three models, but the margin over the other models is small. Further analysis is needed to address these limitations.

References

1. Wikipedia. Abalone. https://en.wikipedia.org/wiki/Abalone
2. Han, Yizhen. 2019. Machine Learning Project: Predict The Age of Abalone. RMIT University.
3. Müller, Andreas C., and Sarah Guido. 2017. Introduction to Machine Learning with Python.
