Effect of Target Transformation on Abalone Age Prediction Using Regression Model
Dimas Buntarto
Introduction
Regression is a method for describing how one variable depends on one or more other variables. In simple linear regression, the relationship between the variables is linear: a change in the variable X is accompanied by a proportional change in the variable Y.
Regression is commonly used to examine the effect of one or more variables on another, and generally to make predictions or forecasts. The relationship between the variables is functional and is expressed as a mathematical model.
The variable whose value is predicted is called the target. In regression, the target has a continuous value.
In real-world data, the target is sometimes not normally distributed, and this can affect the model’s performance on the data.
For this reason, the target is sometimes transformed before the analysis so that its distribution is closer to normal.
This paper will try to determine the effect of target transformation on the analysis using regression models. The data we will use is the abalone dataset.
Objective
- Determine the effect of target transformation on the regression models used to predict the age of abalone.
Methods
1. Get the dataset from http://archive.ics.uci.edu/ml/datasets/Abalone.
2. Check the dataset.
3. Perform exploratory data analysis on the dataset.
4. Perform feature engineering on the dataset.
5. Compare regression models with and without target transformation.
The regression models that we will use are:
1. Linear Regression
2. K Nearest Neighbors
3. Decision Tree
4. Support Vector Machine
For the target transformation, we will use `np.sqrt`.
About The Dataset
The dataset is retrieved from http://archive.ics.uci.edu/ml/datasets/Abalone. For each attribute, the source gives the attribute name, attribute type, measurement unit, and a brief description.
Importing Libraries
Getting The Dataset
We first get the abalone dataset from http://archive.ics.uci.edu/ml/datasets/Abalone.
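A minimal sketch of the loading step, assuming the data is a headerless comma-separated file. The column names other than `Sex`, `Length`, `Diameter`, `Height`, and `Rings` are taken from the attribute list on the dataset page and may differ from the names used in the original analysis. The `load_abalone` helper is hypothetical, and the three rows below are only illustrative rows in the file's format, standing in for the downloaded file.

```python
import io

import pandas as pd

# Assumed column names; the raw file is assumed to have no header row.
COLUMNS = [
    "Sex", "Length", "Diameter", "Height", "Whole weight",
    "Shucked weight", "Viscera weight", "Shell weight", "Rings",
]


def load_abalone(source):
    """Read abalone data from a path, URL, or file-like object (hypothetical helper)."""
    return pd.read_csv(source, header=None, names=COLUMNS)


# Illustrative rows only; in practice pass the path of the downloaded data file.
sample = io.StringIO(
    "M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15\n"
    "F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9\n"
    "I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7\n"
)
df = load_abalone(sample)
print(df.shape)   # (number of rows, number of columns)
print(df.head())
```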
From the table above we can see that the data has 4,177 samples with 9 features.
Checking The Dataset
We first check the condition of the data which includes the types of features and data descriptions.
All features are float and integer except for the `Sex` feature. From the information above, we can find out that there are no missing values in the data.
We look at the summary statistics of the data to make sure the values are sensible.
Information on the data states that there are no missing values. However, the data description shows a value of 0 in the height measurement (the `Height` column). This 0 value is probably a measurement error, so we have to deal with it first.
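The checks described above can be sketched as follows. The small frame is made-up data with the same kinds of columns as the abalone dataset, so the printed values are not the real ones; the point is only which pandas calls reveal the column types, the missing values, and the suspicious minimum.

```python
import pandas as pd

# Made-up stand-in for the abalone frame (not real measurements).
df = pd.DataFrame({
    "Sex": ["M", "F", "I", "I"],
    "Length": [0.455, 0.530, 0.330, 0.430],
    "Height": [0.095, 0.135, 0.080, 0.000],
    "Rings": [15, 9, 7, 8],
})

# Column types: Sex is the only non-numeric column here.
print(df.dtypes)

# Number of missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Summary statistics: the "min" row exposes the zero in Height.
summary = df.describe()
print(summary)
```

A value can be present (not missing) and still be implausible, which is why the summary statistics are checked in addition to the missing-value count.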
Handling measurement error
The data in the `Height` column may have measurement errors. We first check the position and number of the 0 values.
There are 2 samples that appear to have measurement errors in the `Height` column. Both zero values belong to infants (`Sex` = I).
Since there are only 2 samples, we could simply drop them. In this paper, however, we replace them with the average height of infants (I).
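One way to locate and replace the zero heights is sketched below on made-up data. Computing the infant mean from the non-zero rows only is an assumption; the original analysis may have included the zeros in the average.

```python
import pandas as pd

# Made-up stand-in data with one zero height among the infants.
df = pd.DataFrame({
    "Sex":    ["M",   "I",   "I",   "I",   "F"],
    "Height": [0.095, 0.080, 0.000, 0.100, 0.135],
})

# Position and number of the zero values.
zero_mask = df["Height"] == 0
print(df[zero_mask])
print("zero heights:", int(zero_mask.sum()))

# Mean height of infants, computed from the valid (non-zero) infant rows only.
infant_mean = df.loc[(df["Sex"] == "I") & ~zero_mask, "Height"].mean()

# Replace the zeros with that mean.
df.loc[zero_mask, "Height"] = infant_mean
print(df)
```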
We can see that there are no longer any zero values in the `Height` column.
Splitting the Dataset
Before we split the data into train data and test data, we first determine the age of the abalone. The age of an abalone, in years, is obtained by adding 1.5 to the number of rings.
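The age calculation is a single column operation. In this sketch the `Rings` column is dropped afterwards, which is an assumption: keeping it as a feature would let the model read the target directly.

```python
import pandas as pd

# Made-up ring counts.
df = pd.DataFrame({"Rings": [15, 7, 9]})

# Age in years = number of rings + 1.5.
df["Age"] = df["Rings"] + 1.5

# Assumption: Rings is removed so it cannot leak the target into the features.
df = df.drop(columns="Rings")
print(df)
```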
The age of abalone has been determined.
Next, we will split the data into train data and test data, with the test data making up 30% of the dataset.
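A sketch of the 70/30 split with `train_test_split`, using randomly generated stand-in data. The `random_state` value is an arbitrary choice for reproducibility and is not taken from the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Randomly generated stand-in data (100 rows), not the abalone dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Length": rng.uniform(0.1, 0.8, 100),
    "Age": rng.integers(3, 30, 100) + 1.5,
})

# Hold out 30% of the rows as test data.
train, test = train_test_split(df, test_size=0.3, random_state=42)
print(len(train), len(test))
```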
To make it easier for the next step, we will separate the existing variables.
Exploratory Data Analysis
Sex feature
Let’s start analyzing the train data. We begin with the `Sex` column. Because sex is categorical data, we will analyze it using a pie chart.
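The shares that a pie chart displays can also be computed directly. The sketch below uses a made-up `Sex` column and prints the percentage of each category; calling `.plot.pie()` on the same counts would draw the chart if matplotlib is installed.

```python
import pandas as pd

# Made-up category labels, not the real distribution.
sex = pd.Series(["M", "F", "I", "M", "F", "I", "M", "I", "F", "M"], name="Sex")

# Percentage of samples in each category.
shares = sex.value_counts(normalize=True) * 100
print(shares.round(1))
```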
The sex distribution is fairly balanced, at about 30 percent for each category.
We will still use this categorical feature in later modeling.
We will use one-hot encoding for further processing of this categorical feature.
Numerical features
Next, let’s look at the distribution of the numerical features.
From the graph above, we can see that the numerical features are not normally distributed, except for the `Height` feature, which looks roughly normally distributed. The `Length` and `Diameter` features have values concentrated on the right of the curve, which means that their skewness is negative. The other features have values concentrated on the left of the curve, which means their skewness is positive.
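The sign convention can be checked numerically with `Series.skew()`. The two made-up samples below are mirror images of each other: one has its long tail towards small values, the other towards large values.

```python
import pandas as pd

# Bulk of the values on the right, long tail to the left.
left_tail = pd.Series([1, 5, 6, 7, 7, 8, 8, 8, 9, 9])

# Bulk of the values on the left, long tail to the right.
right_tail = pd.Series([1, 1, 2, 2, 2, 3, 3, 4, 5, 9])

print(left_tail.skew())    # negative skewness
print(right_tail.skew())   # positive skewness
```

On a data frame, `df.skew(numeric_only=True)` gives the same statistic for every numerical column at once.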
Let’s see if there are outliers in these numerical features.
We can see that every numerical feature has outliers. These outliers are a possible reason why the data is not normally distributed.
We will not remove these outliers, because doing so could discard information in the dataset.
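Outliers can also be counted rather than read off a plot. The sketch below uses the 1.5 × IQR rule, the same rule a standard box plot uses for its whiskers; whether the original analysis used exactly this rule is an assumption. The `iqr_outliers` helper is hypothetical and the values are made up.

```python
import pandas as pd


def iqr_outliers(values: pd.Series) -> pd.Series:
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return values[(values < lower) | (values > upper)]


# Made-up heights with one clearly extreme value.
height = pd.Series([0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 1.13])
outliers = iqr_outliers(height)
print(outliers)
```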
In a later step, we will scale the numerical features with `MinMaxScaler`.
Target distribution
Next, we will look at the distribution of the target (the `Age` column).
In the graph above, we can see that the target is not normally distributed: most values lie on the left of the curve with a tail to the right, which means that the skewness is positive.
To make sure, we check whether there are outliers in the target.
The target has outliers, which contribute to it not being normally distributed. We will not remove these outliers either, for fear of losing information in the dataset.
We will perform transformations on the target in the hope of improving the performance of the model.
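To illustrate why a square-root transformation is a reasonable candidate for a positively skewed target, the sketch below compares the skewness of a made-up, right-skewed age sample before and after `np.sqrt`. The numbers are illustrative only and say nothing about the actual abalone ages.

```python
import numpy as np
import pandas as pd

# Made-up ages with a long right tail.
age = pd.Series([4.5, 6.5, 7.5, 8.5, 8.5, 9.5, 9.5, 10.5,
                 10.5, 11.5, 12.5, 14.5, 18.5, 24.5, 30.5])

# The square root compresses large values more than small ones.
age_sqrt = np.sqrt(age)

print("skewness before:", age.skew())
print("skewness after: ", age_sqrt.skew())
```

A lower skewness does not guarantee a better model; that is the question the comparison in the modelling section addresses.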
Pearson correlation
We now look at the correlation between features, especially between the numerical features and the target.
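A sketch of the correlation check on randomly generated stand-in data, using `DataFrame.corr` with the Pearson method and keeping only the column for the target. The coefficients of the generated data are arbitrary and were chosen only to produce a moderate linear relationship.

```python
import numpy as np
import pandas as pd

# Randomly generated stand-in data, not the abalone dataset.
rng = np.random.default_rng(0)
n = 200
length = rng.uniform(0.1, 0.8, n)
df = pd.DataFrame({
    "Length": length,
    "Height": 0.3 * length + rng.normal(0, 0.02, n),
    "Age": 5 + 10 * length + rng.normal(0, 2.5, n),
})

# Pearson correlation of every numerical feature with the target.
corr_with_target = (
    df.corr(method="pearson")["Age"]
    .drop("Age")
    .sort_values(ascending=False)
)
print(corr_with_target)
```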
Each numerical feature has a linear correlation with the target of only around 0.5, which means the linear relationship with the target is moderate, neither strong nor weak.
We can also see this in the following scatter plots.
We can see that there is a linear relationship between the numerical features and the target, even though the Pearson correlation is only moderate.
Preprocessing and Feature Engineering
We first perform categorical feature engineering. We will use one-hot encoding to encode the categorical features.
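One-hot encoding replaces the single `Sex` column with one indicator column per category. The sketch below uses `pd.get_dummies` on made-up rows; the original analysis may have used a different encoder, such as scikit-learn's `OneHotEncoder`.

```python
import pandas as pd

# Made-up rows with the categorical column and two other columns.
df = pd.DataFrame({
    "Sex": ["M", "F", "I", "M"],
    "Length": [0.455, 0.530, 0.330, 0.440],
    "Age": [16.5, 10.5, 8.5, 11.5],
})

# One indicator column per category; the original Sex column is removed.
encoded = pd.get_dummies(df, columns=["Sex"])
print(encoded.columns.tolist())
print(encoded)
```

With three categories, the encoding removes one column and adds three, so the column count grows by two.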
We can see that the number of columns in the dataset has increased to 11. Let’s look at the train set for the result.
Next, we separate the features from the target.
Modelling
We will first scale the features with `MinMaxScaler`. We will also use a `pipeline` to simplify the process, and `GridSearchCV` with `cv` equal to 5 to help estimate the best parameters.
For the models that use a transformed target, we will use `TransformedTargetRegressor`.
We will transform the target with `np.sqrt`.
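The sketch below shows how these pieces can fit together on randomly generated stand-in data. Using `np.square` as `inverse_func` is an assumption (the text only names `np.sqrt`); it is what maps predictions made on the square-root scale back to the original scale. The step names `scaler` and `model` are arbitrary.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Randomly generated stand-in data with a positive, right-skewed target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = (1.0 + X @ np.array([2.0, 1.0, 0.5]) + rng.normal(0, 0.1, 200)) ** 2

# Scaling and the regressor are chained so they are always fitted together.
pipe = Pipeline([("scaler", MinMaxScaler()), ("model", LinearRegression())])

# The wrapper fits the pipeline on sqrt(y) and squares the predictions,
# so predict() returns values on the original scale of y.
model = TransformedTargetRegressor(
    regressor=pipe, func=np.sqrt, inverse_func=np.square
)
model.fit(X, y)
pred = model.predict(X)
print(pred[:5])
print(y[:5])
```

Because the predictions come back on the original scale, errors with and without the transformation can be compared directly.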
Linear regression
We start with a linear regression model. For linear regression, the parameter we estimate is `fit_intercept`.
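A sketch of the comparison for linear regression on randomly generated stand-in data. The scoring choice (`neg_mean_squared_error`), the `random_state`, and the step names are assumptions. Note the parameter names: without the wrapper the grid addresses `model__fit_intercept`, and with `TransformedTargetRegressor` it gains the `regressor__` prefix.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Randomly generated stand-in data, not the abalone dataset.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = (1.0 + X @ np.array([2.0, 1.0, 0.5]) + rng.normal(0, 0.1, 300)) ** 2
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

pipe = Pipeline([("scaler", MinMaxScaler()), ("model", LinearRegression())])

# Without target transformation.
plain = GridSearchCV(
    pipe, {"model__fit_intercept": [True, False]},
    cv=5, scoring="neg_mean_squared_error",
)
plain.fit(X_train, y_train)

# With target transformation (sqrt forward, square back).
wrapped = TransformedTargetRegressor(
    regressor=pipe, func=np.sqrt, inverse_func=np.square
)
transformed = GridSearchCV(
    wrapped, {"regressor__model__fit_intercept": [True, False]},
    cv=5, scoring="neg_mean_squared_error",
)
transformed.fit(X_train, y_train)

# Test-set mean squared error on the original scale for both variants.
test_mse = {}
for name, search in [("plain", plain), ("sqrt target", transformed)]:
    test_mse[name] = mean_squared_error(y_test, search.predict(X_test))
    print(name, search.best_params_, round(test_mse[name], 4))
```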
Linear regression without target transformation
Linear regression with target transformation
Nearest Neighbors Regression
The parameter we will estimate is `n_neighbors`.
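A sketch of the same comparison for K-nearest neighbors on randomly generated stand-in data. The candidate values for `n_neighbors` are hypothetical. The gap between the train score and the test score is printed as a simple indicator of overfitting; here `score` is the default R² of the regressor.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Randomly generated stand-in data, not the abalone dataset.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(200, 3))
y = (1.0 + X.sum(axis=1) + rng.normal(0, 0.1, 200)) ** 2
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

pipe = Pipeline([("scaler", MinMaxScaler()), ("model", KNeighborsRegressor())])
k_values = [3, 5, 9, 15]  # hypothetical candidates

plain = GridSearchCV(pipe, {"model__n_neighbors": k_values}, cv=5)
plain.fit(X_train, y_train)

wrapped = TransformedTargetRegressor(
    regressor=pipe, func=np.sqrt, inverse_func=np.square
)
transformed = GridSearchCV(
    wrapped, {"regressor__model__n_neighbors": k_values}, cv=5
)
transformed.fit(X_train, y_train)

# Train score minus test score: a large positive gap suggests overfitting.
gaps = {}
for name, search in [("plain", plain), ("sqrt target", transformed)]:
    gaps[name] = search.score(X_train, y_train) - search.score(X_test, y_test)
    print(name, search.best_params_, round(gaps[name], 4))
```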
Nearest Neighbors regression without target transformation
Nearest Neighbors regression with target transformation
Decision Tree regression
The parameters we will set for the Decision Tree regression are `max_depth` and `max_features`.
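A sketch of the grid for the decision tree on randomly generated stand-in data. The candidate values for `max_depth` and `max_features` and the `random_state` are hypothetical, not the values used in the original analysis.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

# Randomly generated stand-in data, not the abalone dataset.
rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(200, 3))
y = (1.0 + X.sum(axis=1) + rng.normal(0, 0.1, 200)) ** 2

pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", DecisionTreeRegressor(random_state=0)),
])

# Hypothetical candidates; None means "no limit" for both parameters.
depths = [2, 4, 6, None]
features = [1, 2, None]

plain = GridSearchCV(
    pipe,
    {"model__max_depth": depths, "model__max_features": features},
    cv=5,
)
plain.fit(X, y)

wrapped = TransformedTargetRegressor(
    regressor=pipe, func=np.sqrt, inverse_func=np.square
)
transformed = GridSearchCV(
    wrapped,
    {"regressor__model__max_depth": depths,
     "regressor__model__max_features": features},
    cv=5,
)
transformed.fit(X, y)

print("plain:      ", plain.best_params_)
print("sqrt target:", transformed.best_params_)
```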
Decision Tree without target transformation
Decision Tree with target transformation
Support Vector Regression
The parameters we will be setting are `C` and `gamma`.
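A sketch of the grid for support vector regression on randomly generated stand-in data. The candidate values for `C` and `gamma` are hypothetical, and the default RBF kernel of `SVR` is assumed.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

# Randomly generated stand-in data, not the abalone dataset.
rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(200, 3))
y = (1.0 + X.sum(axis=1) + rng.normal(0, 0.1, 200)) ** 2

pipe = Pipeline([("scaler", MinMaxScaler()), ("model", SVR())])

# Hypothetical candidates.
c_values = [0.1, 1, 10]
gamma_values = ["scale", 0.1, 1]

plain = GridSearchCV(
    pipe, {"model__C": c_values, "model__gamma": gamma_values}, cv=5
)
plain.fit(X, y)

wrapped = TransformedTargetRegressor(
    regressor=pipe, func=np.sqrt, inverse_func=np.square
)
transformed = GridSearchCV(
    wrapped,
    {"regressor__model__C": c_values, "regressor__model__gamma": gamma_values},
    cv=5,
)
transformed.fit(X, y)

print("plain:      ", plain.best_params_)
print("sqrt target:", transformed.best_params_)
```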
Support Vector Regression without target transformation
Support Vector Regression with target transformation
We summarize the results of all the models above in the following table.
Discussion
In linear regression, the target transformation reduces the mean squared error and improves the performance of the model. However, it also reduces the complexity of the model: with a transformed target, the linear regression model underfits.
In KNN regression, the model overfits the data both with and without the target transformation. Transforming the target reduces the model’s performance and increases the mean squared error, although it does make the model slightly more general.
The effect of the target transformation is quite good for Decision Tree regression. It reduces the mean squared error and, although it reduces the complexity of the model, it makes the model more general to the data.
The target transformation has an unfavorable effect on the SVR model. It increases the mean squared error and reduces the performance of the model. It also increases the complexity of the model, so that the model becomes less general to the data.
Conclusion
From all the calculations above, it can be concluded:
- Target transformation has a different effect on each model.
- Of the four models used above, the one best suited as a predictor is linear regression without target transformation, because it generalizes best to the data, even though it does not have the smallest mean squared error.
Further Analysis
In the future, it would be useful to repeat the analysis with other regression models as well as other target transformation methods, such as `boxcox`, `np.log1p`, `np.log`, and so on.
References
1. http://archive.ics.uci.edu/ml/datasets/Abalone
2. https://thomassoseco.blogspot.com/2019/02/skewness-dan-kurtosis.html
3. https://www.kaggle.com/ashokkumarpalivela/abalone-age-prediction-end-to-end-project
4. https://scikit-learn.org/stable/auto_examples/compose/plot_transformed_target.html
5. https://www.merdeka.com/jateng/regresi-adalah-metode-untuk-menentukan-sebab-akibat-kenali-jenis-dan-contohnya-kln.html?page=2