Abalone Sustainability and Profitability — Sex Classification and Weight Prediction from Physical Measurements in R

Applying multivariate analysis and classification in R: forecasting profitability and enhancing sustainability by predicting weight and sex from physical measurements.

Hshan.T
The Startup
12 min read · Feb 27, 2021


Abalone diver, Tasmania. Photo: Stuart Gibson

Intro

Abalone is a type of marine snail with high nutritional and economic value; almost the whole abalone, including the viscera and shell, can be processed, serving as a source of income for the fishing industry. High market demand for abalone has led to overexploitation, raising public concern about environmental issues. Governments enforce strict laws and regulations on abalone harvesting to ensure the sustainability of abalone populations. Analyzing the relationships between the variables in the collected data can inform the design of equipment that supports both profitability and sustainability by providing instant results underwater.

Exploratory Analysis

The dataset adopted here is a modified version of data from a 1994 study on the population biology of abalone, retrieved from the UCI Machine Learning Repository. There are 4,177 records, 9 variables and no missing values. Information on the variables is as follows:
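The data can be loaded directly from the UCI repository. A minimal sketch, assuming the standard, unmodified UCI file and its documented column order (the article itself works with a modified version of the data):

```r
# Load the abalone data from the UCI Machine Learning Repository.
# NOTE: this is the standard UCI file; the article uses a modified version.
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"
abalone <- read.csv(url, header = FALSE,
                    col.names = c("Sex", "Length", "Diameter", "Height",
                                  "Whole.weight", "Shucked.weight",
                                  "Viscera.weight", "Shell.weight", "Rings"))
abalone$Sex <- factor(abalone$Sex)

dim(abalone)          # 4177 records, 9 variables
sum(is.na(abalone))   # 0 missing values
str(abalone)          # variable names and types
```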

Table 1: Variables information.

R code:

In the dataset, abalones are classified into three categories: F (female), M (male) and I (infant). Harvesting "I" is prohibited to ensure the sustainability of the abalone population, while "F" and "M" have different economic values depending on market trends. A descriptive statistics summary for the numerical variables is presented below in Table 2 to give a rough idea of what the data look like.

R code:

Table 2: Statistics Summary for Numerical Variables.

A "Height" of 0.0 casts doubt on reporting correctness. The ranges vary considerably across the data frame, from a minimum of 28.0 for "Rings" to a maximum of 226.0 for "Whole weight". These wide ranges deserve attention, as they may bias distance-based analyses and the modeling process. The same two variables also have the minimum and maximum standard deviations, reflecting how widely the datapoints spread around their respective means. The difference between median and mean is within 6 for all variables. It is noticeable that the median is larger than the mean for "Length", "Diameter" and "Height", while for the rest the mean is larger than the median. This suggests possible left skewness for the former and right skewness for the latter.
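A statistics summary along the lines of Table 2 can be produced with base R. A sketch, assuming the dataset is loaded into a data frame named `abalone`:

```r
# Descriptive statistics for the numerical variables,
# assuming the dataset is loaded into a data frame `abalone`.
num_vars <- abalone[sapply(abalone, is.numeric)]

summary(num_vars)                             # min, quartiles, median, mean, max
sapply(num_vars, sd)                          # standard deviations
sapply(num_vars, function(x) diff(range(x)))  # ranges per variable
```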

Figure 1: Pairplots of All Variables for Different Sex Groups.

In Figure 1, outliers are visible in the first-row boxplots for the numerical variables from the second column onwards. The density plots for categories "F" and "M" overlap over a large area, whereas "I" can be better differentiated from the other two categories: part of the green area lies apart from the other two groups. The diagonal plots give an initial indication that differentiating "M" from "F" may be challenging, given the very limited dissimilarities in appearance features. "Height" has rather symmetric curves with long right tails, mainly due to the outliers shown in the boxplots. The other variables' distributions are skewed: the first three numeric variables exhibit negative skewness, while the last five are positively skewed. The variables are highly but not perfectly correlated, with correlation coefficients above 0.5 for every pair except "Shucked weight" and "Rings". Positive correlation implies that two variables move in the same direction: when one increases, so does the other, and vice versa. Generally, the majority of "I" samples have smaller values across all variables.
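Pair plots coloured by sex, as in Figure 1, can be drawn with the GGally package. This is a sketch; the package choice is an assumption, as the article does not state which plotting library it uses:

```r
library(GGally)
library(ggplot2)

# Pairwise plots of all variables, coloured by the Sex factor:
# density plots on the diagonal, scatter plots and correlation
# coefficients off-diagonal, grouped boxplots for the factor column.
ggpairs(abalone, aes(colour = Sex, alpha = 0.5))
```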

Figure 2: Countplot of “Sex”.

For the categorical variable "Sex", the numbers of samples in each category do not differ much, with a maximum difference of 221 between "F" and "M". Evaluation metrics will therefore not suffer severe bias when judging model performance. However, their appropriateness is questionable in some scenarios, for example when transforming the task into a binary problem such as "I" or "Not I". Thorough consideration is required when choosing evaluation metrics and weighing the cost associated with misclassification.
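The category counts behind Figure 2 can be checked directly (a sketch, assuming a data frame `abalone` with a `Sex` factor):

```r
# Counts per category; per the article, the largest gap is 221,
# between "M" and "F".
counts <- table(abalone$Sex)
counts
max(counts) - min(counts)
```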

Data dimensionality is not expected to be a challenge in this analysis, so no dimension-reduction approach will be applied, avoiding the loss of useful information. The results in this section support quicker and more efficient decisions about the transformations required on the subsets of data used to address each problem.

Discussion and Results

Harvesting an infant abalone may endanger the population. Abalone of different sexes have different body compositions with distinct economic values. The discussion below provides the fundamental ideas supporting the logic behind the equipment's development.

Part 1:

This part aims to build a model predicting the "Sex" of abalones from measurements of their physical features: "Length", "Diameter" and "Height". "Sex" is highly influential for both sustainability and profitability. This is clearly a classification problem, and it will be addressed with different methods, including multiclass classification and one-vs-all, which transforms a multiclass problem into multiple binary-class problems. Linear discriminant analysis (LDA), quadratic discriminant analysis (QDA) and support vector machines (SVM) are proposed as solutions for either type of classification problem.

Data Transformation and Scaling

Preprocessing, scaling and outlier elimination are crucial, especially for SVM since it is a distance-based model; they ensure consistent value ranges and reduce the impact of abnormal datapoints across all variables involved. The shape of the separating boundaries determines which type of discriminant analysis is suitable.
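The article does not state the exact transformations used; one plausible sketch is a log transform followed by standardization, with outliers flagged by the usual 1.5 × IQR boxplot rule:

```r
# Hypothetical preprocessing sketch; the article's exact transforms are not given.
pred <- c("Length", "Diameter", "Height")
ab   <- subset(abalone, Height > 0)   # drop the doubtful Height == 0 records

# Log-transform to reduce skewness, then centre and scale each predictor.
trans <- as.data.frame(scale(log(ab[pred])))

# Flag outliers per variable with the 1.5 * IQR rule.
is_out <- sapply(trans, function(x) {
  q   <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
})
keep  <- rowSums(is_out) == 0
clean <- cbind(trans[keep, ], Sex = ab$Sex[keep])
```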

R code:

Figure 3: Pairplots of Transformed and Scaled Data.

Note the reduced outliers and skewness in the transformed data. "Height" remains largely unchanged, as its curves are symmetric with only a few outliers lying far in the right tail; the skewness is expected to be due to those outliers. The "box" of the boxplots (1st quartile to 3rd quartile) for the infant group (green) lies near the edge of the other two groups without overlapping, which could serve as a boundary for identifying "I". "I" samples with abnormally large variable values might be confused with the other sexes.

Table 3: Statistics Summary for Transformed and Scaled Data.
Figure 4: Pairplots of Transformed, Scaled and Outliers Eliminated Data.
Figure 5: 3D Scatter Plot of Transformed Data with Different Group Combinations to Visualize Linearity of Boundaries (RMarkdown output should provide rotatable plots).

Decomposing the 3D visualization of datapoints in Figure 5 shows that "I" is separable from "F" and "M" to a certain degree. "I" samples (orange dots) appear mixed in with the other groups at larger values in the non-diagonal plots. The first plot gives an overview suggesting that the linearity of the boundary between "F" and "M" is doubtful. Hence, QDA might work better than LDA for distinguishing "F" from "M".

1.1 Multiclass Classification

1.1.1 Linear Discriminant Analysis (LDA)

R code:

Table 4: Confusion Matrix for LDA Classifier Built with Original Untransformed Data.

The original set of data (Sex ~ Length + Height + Diameter) achieved an accuracy of 0.5188 with LDA classification. This result will be used as the benchmark for assessing models in this sub-section.

Table 5: Confusion Matrix for LDA Classifier Built with Transformed Data.

LDA with transformed data attained a slightly higher accuracy of 0.5250. This improvement is mainly due to more correct classifications and fewer misclassifications for group "I". There is still little difference in classification performance for groups "F" and "M".
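An LDA fit and confusion matrix along these lines can be sketched with MASS. The exact train/test handling in the article is not shown, so this resubstitutes on the full data as an assumption:

```r
library(MASS)

# LDA on the three physical measurements; predictions on the same data.
fit_lda <- lda(Sex ~ Length + Diameter + Height, data = abalone)
pred    <- predict(fit_lda)$class

table(Predicted = pred, Actual = abalone$Sex)  # confusion matrix
mean(pred == abalone$Sex)                      # overall accuracy
```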

There are a number of important assumptions that must be met to validate applying the model to the data:

· Multivariate normality

· Covariance equality

· Linearly separable Groups

R code:

Table 6: LDA Assumptions Testing.
Table 7: Skewness and Kurtosis of Transformed Data.

Some skewness and kurtosis values are not close to zero, and all the statistical test results are significant. None of the assumptions is fulfilled, so LDA is not a good model. Given the non-linearity concluded above, the next model proposed is QDA, which imposes less strict assumptions.
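These assumptions can be checked with, for example, the MVN and biotools packages. This is a sketch; the article does not name the tests or packages it used:

```r
library(MVN)       # multivariate normality tests
library(biotools)  # Box's M test for covariance equality

X <- abalone[, c("Length", "Diameter", "Height")]

# Mardia's test for multivariate normality
# (also reports multivariate skewness and kurtosis).
mvn(X, mvnTest = "mardia")

# Box's M test: H0 is equal covariance matrices across Sex groups.
boxM(X, abalone$Sex)
```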

1.1.2 Quadratic Discriminant Analysis (QDA)

R code:

Table 8: Confusion Matrix for QDA Classifier Built with Transformed Data.

A significant increase in misclassification of "M" deteriorated the overall accuracy to 0.519, which is lower than LDA's. Nevertheless, both LDA and QDA accuracies hover around an unsatisfying 0.52, and they do not differ much.
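QDA swaps in per-class covariance matrices and is a one-line change from LDA in MASS (a sketch, under the same resubstitution assumption as before):

```r
library(MASS)

# QDA allows a separate covariance matrix per class,
# giving quadratic decision boundaries.
fit_qda <- qda(Sex ~ Length + Diameter + Height, data = abalone)
pred    <- predict(fit_qda)$class

table(Predicted = pred, Actual = abalone$Sex)  # confusion matrix
mean(pred == abalone$Sex)                      # overall accuracy
```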

1.1.3 Support Vector Machine (SVM)

This is a distance-based model, not a parametric model like LDA and QDA. It makes no assumptions about variable distributions, but it is sensitive to measurement scales. SVM recorded a cross-validated accuracy of 0.5291, currently the highest, although the increment remains insignificant. The difficulty of classifying "F" and "M", seen earlier, might be the main contributor to the poor performance; this can be further verified by breaking the task down into binary scenarios in the next section. To the current extent, SVM is the best model for multiclass classification.
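A cross-validated SVM can be fitted with e1071. A sketch; the kernel choice is an assumption, as the article does not specify it:

```r
library(e1071)

# RBF-kernel SVM with built-in 10-fold cross-validation.
# svm() scales inputs by default, which matters for a distance-based model.
fit_svm <- svm(Sex ~ Length + Diameter + Height, data = abalone,
               kernel = "radial", cross = 10)

fit_svm$tot.accuracy   # total 10-fold cross-validated accuracy (%)
summary(fit_svm)
```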

The binary classification tasks will encounter a class-imbalance problem. Since no assumption is made about the cost associated with misclassifying each group, each is assumed to be equally important. The F1 score, which balances precision and recall, is adopted as the metric.
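Precision, recall and F1 for a binary split can be computed directly from the confusion-matrix counts. A small helper function (hypothetical, not from the article):

```r
# F1 score for a binary problem, treating `positive` as the class of interest.
f1_score <- function(pred, actual, positive) {
  tp <- sum(pred == positive & actual == positive)  # true positives
  fp <- sum(pred == positive & actual != positive)  # false positives
  fn <- sum(pred != positive & actual == positive)  # false negatives
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}
```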

1.2 Binary Classification for “I”

The LDA assumptions have been shown to be violated, hence we do not experiment with LDA on the binary classification "I" or "Not I".
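The one-vs-all label and a logistic-regression baseline can be sketched as follows (resubstitution on the full data and the 0.5 cutoff are assumptions):

```r
# Binary target: 1 = "I" (infant), 0 = "Not I".
abalone$isI <- as.integer(abalone$Sex == "I")

# Logistic regression on the three physical measurements.
fit_glm <- glm(isI ~ Length + Diameter + Height,
               data = abalone, family = binomial)
prob <- predict(fit_glm, type = "response")  # P(infant)
pred <- ifelse(prob > 0.5, 1, 0)

table(Predicted = pred, Actual = abalone$isI)  # confusion matrix
mean(pred == abalone$isI)                      # accuracy
```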

R code:

Table 9: Performance Evaluation for Binary Classification (“I” or “Not I”).

Logistic regression has the lowest accuracy but a higher F1 score than QDA. SVM has the highest accuracy and the highest F1 score. Therefore, it is reasonable to conclude that SVM is the model most capable of discriminating "I" from the other samples.

1.3 Binary Classification for “F”

R code:

Table 10: Performance Evaluation for Binary Classification (“F” or “Not F”).

Performance on the F1 score is terrible, attaining a maximum of only about 0.3. Among the classifiers experimented with here, SVM has the worst F1 score and is rejected; QDA, with moderate accuracy and the highest F1 score of 0.33, is chosen. However, this poor result is attainable by chance and does not demonstrate any predictive power of the classifier.

1.4 Binary Classification for “M”

R code:

Table 11: Performance Evaluation for Binary Classification (“M” or “Not M”).

SVM has lower accuracy but a higher F1 score than logistic regression. The F1 score is emphasized due to the imbalanced group sizes, and the accuracy difference of 0.0017 is minimal. SVM is chosen as the best estimator.

Result Summary:

Part 2:

This part focuses on profitability analysis based on "Shucked weight" and "Viscera weight", which is a regression problem. The company seeks a way to estimate an abalone's potential economic value from its physical size: "Length", "Diameter" and "Height". A model is first built to predict "Shucked weight" and "Viscera weight"; economic value is then forecast by applying a linear combination of the weight estimates and fluctuating prices.

Predictors: {Length, Diameter, Height}

Dependent Variables: {Shucked weight, Viscera weight}

2.1 Predicting “Shucked weight” and “Viscera weight” From “Length”, “Diameter” and “Height”

Data Transformation and Scaling

The statistics summary for the variables involved is given in Table 2. Although "Sex" is not taken into consideration in this part, pair plots of the data without distinguishing groups are similar to Figure 1, including the skewness, correlation and outliers, as observable in Figure 6. Non-linear relationships are apparent between the dependent variables and the predictors. Some transformations will be required to fit a good linear model; the transformed result is shown in Figure 7.

R code:

Figure 6: Pair plot of Original Data without Differentiating by “Sex” Class.
Figure 7: Pair plots with Transformed and Outlier Removed Data.

Multivariate Normality Tests

Although the transformed data fail both the multivariate and univariate normality tests, the curves for the transformed predictors in Figure 7 are more symmetric and more normal-shaped than those of the original dataset. Despite the failed hypothesis tests, the multivariate normality assumption for the predictors is accepted based on inspection of their distribution plots on the diagonal.

Multivariate Linear Regression Model Fitting

Considering a linear model for each dependent variable separately, all predictors are significant for both dependent variables. Each univariate linear model explains around 88% of the variation in its dependent variable. The MANOVA, which analyzes the multivariate linear model, gives a statistically significant result, highlighting the importance of including all predictors.
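A multivariate linear model with both weights as responses, plus the MANOVA, can be sketched with base R. The variable names follow the UCI file; in practice the transformed versions would be substituted:

```r
# Multivariate linear regression: two responses, three predictors.
fit_mlr <- lm(cbind(Shucked.weight, Viscera.weight) ~
                Length + Diameter + Height, data = abalone)
summary(fit_mlr)   # per-response coefficients and R-squared

# MANOVA on the same specification: multivariate tests per predictor
# (Pillai's trace by default).
fit_man <- manova(cbind(Shucked.weight, Viscera.weight) ~
                    Length + Diameter + Height, data = abalone)
summary(fit_man)
```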

R code:

Residual Diagnostics on MLR

Figure 8: Decorrelated Residual Pairplot.
Figure 9: Residual Plots
Table 12: MLR Assumptions Testing.

In Figure 9, the red lines show a small degree of curvature; there is a trend in how the residuals are scattered. The MLR model residuals fail the normality tests and have non-zero skewness and kurtosis, violating the normality assumption on residuals. Nonetheless, the residual density plots in Figure 8 show curves that are rather symmetric, with long tails towards both ends. Numerous outliers are marked in Figure 9.
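Residual diagnostics of this kind can be sketched as follows, assuming the multivariate model is stored as `fit_mlr` (an `lm` fit with a two-column response); the Shapiro-Wilk test is a common choice, not necessarily the article's:

```r
# Assuming `fit_mlr` is an lm() fit with a two-column response matrix.
res <- resid(fit_mlr)   # one residual column per response

# Univariate normality check per response.
apply(res, 2, shapiro.test)

# Fitted-vs-residual plots with a smoother to reveal curvature trends.
fit_vals <- fitted(fit_mlr)
par(mfrow = c(1, 2))
for (j in 1:2) {
  plot(fit_vals[, j], res[, j],
       xlab = "Fitted", ylab = "Residual", main = colnames(res)[j])
  lines(lowess(fit_vals[, j], res[, j]), col = "red")
}
```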

R code:

The next model is built with those residual outliers removed, to test their influence on model performance. The variation explained by each univariate linear model improves to 0.913 and 0.886 for "Shucked weight" and "Viscera weight" respectively. MANOVA again gives a significant result, with all estimated coefficients in the linear model non-zero. Residual diagnostics for this new model also fail the normality hypothesis tests, but with lower skewness and kurtosis. The outliers were having a negative impact on the previous model.

Figure 10: Residual Plots with Residuals Outliers for First Model Is Removed.

Although the model explains more variability in the dependent variables, there is a higher-degree curvature trend in the residual plots. Further removal of residual outliers worsened the model, and residual normality is still violated.

Consequently, within the limited extent of this analysis and combining the experimental results above, outliers are the main reason for the residual-normality violation. The residual curves in Figure 8 are symmetric and more normal-shaped than those of the untransformed predictors, with similar spread around the peak, so we accept the residual assumption for the multivariate linear model. The first multivariate linear model is adopted for forecasting the weights, X. As stated earlier in the data transformation section, we assume normally distributed transformed predictors. Therefore, given the transformed predictors, the dependent variables are jointly normally distributed by the properties of the multivariate normal distribution.

Assuming the sample size is large enough, the sample mean vector and sample covariance matrix are good estimates of the parameters.
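The paragraph above amounts to plugging in the sample moments. A sketch, assuming the response columns are available in `abalone`:

```r
# Sample estimates of the mean vector and covariance matrix of
# X = (Shucked.weight, Viscera.weight).
X <- abalone[, c("Shucked.weight", "Viscera.weight")]
mu_hat    <- colMeans(X)   # estimated mean vector
Sigma_hat <- cov(X)        # estimated covariance matrix
```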

2.2 Profitability Index, S

Adopting the joint probability model for the dependent variables, X, from the previous section, S is modelled as a univariate normal distribution by applying the properties of the multivariate normal.
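If S is a price-weighted linear combination S = p'X of the jointly normal weights, then S ~ N(p'μ, p'Σp). A sketch with hypothetical prices, assuming sample estimates named `mu_hat` and `Sigma_hat` are available:

```r
# Hypothetical unit prices for shucked meat and viscera (not from the article).
p <- c(shucked = 2.0, viscera = 0.5)

# S = p'X is univariate normal with mean p'mu and variance p' Sigma p.
mu_S  <- sum(p * mu_hat)
var_S <- as.numeric(t(p) %*% Sigma_hat %*% p)

# Example use: probability the index exceeds some threshold s0.
s0 <- 3
pnorm(s0, mean = mu_S, sd = sqrt(var_S), lower.tail = FALSE)
```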

R code:

Conclusion

When the multiclass problem is transformed into several one-versus-all binary classifications, there is an obvious difference in sample size between categories, which induces a great amount of bias in the modeling objective function. In this case, environmental sustainability should be emphasized when distinguishing "F" and "I" from the samples. It is suggested to assign a higher cost to misclassifications related to sustainability, as a destroyed environment cannot be restored. The performance of the classifiers built in this report is not very satisfying. Additional features, including appearance characteristics, may be included to obtain a clearer boundary between "F" and "M".

As for the profitability analysis, the joint probability model for "Shucked weight" and "Viscera weight" follows a normal distribution under the assumption that the transformed predictors are normally distributed, despite the series of failed normality tests. As the sample size is large, a number of outliers causing slight departures from normality at the tails might distort the distribution and fail the tests. Further analysis of the outliers is desirable for better modeling, or a copula method could be considered.

The discussions above are for learning and exploration purposes. A more proper workflow should be planned and conducted to build and assess a valid model. The overall process may be modified, for example by splitting the dataset into training, validation and test sets, or by applying cross-validation, to estimate performance more accurately and convincingly.
