## Regression using Principal Components & ElasticNet | Towards AI

# Prediction of Relative Locations of CT Slices in CT Images

## Predicting the relative location of CT slices on the axial axis of the human body using regression techniques on very high-dimensional data

Regression is one of the most fundamental techniques in machine learning. In simple terms, it means predicting a continuous variable from other independent variables, which may be categorical or continuous. The challenge comes with high dimensionality, i.e. when there are too many independent variables. In this article, we will discuss a technique for regression modeling on high-dimensional data using Principal Components and ElasticNet. We will also see how to save the model for future use.

## Getting Data & Problem Definition

We will use Python 3.x as the programming language, with ‘scikit-learn’ and ‘seaborn’ as libraries for this article.

The data used here can be found at the UCI Machine Learning Repository. The dataset is named “Relative location of CT slices on axial axis Data Set”. It contains features extracted from medical CT scan images of various patients (male and female). The features are numerical. As per UCI, the goal is ‘predicting the relative location of a CT slice on the axial axis of the human body’.

Let’s explore the dataset to understand it more clearly.
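The original loading code is not shown, so the following is only a minimal sketch of how the data might be read and inspected with pandas. The helper `load_ct_slices` and the commented-out file name are hypothetical, and the three-row table is a made-up stand-in with the same column layout (only 3 of the 384 feature columns), not real data.

```python
import io

import pandas as pd


def load_ct_slices(source):
    """Read the CT slice table from a file path or a file-like object."""
    return pd.read_csv(source)


# With the real download this might look like (file name is an assumption):
# df = load_ct_slices("slice_localization_data.csv")

# Made-up stand-in with the same column layout, so the sketch runs anywhere
sample_csv = io.StringIO(
    "patientId,value0,value1,value2,reference\n"
    "0,0.0,0.25,-0.25,21.8\n"
    "0,0.1,0.20,-0.25,22.5\n"
    "1,0.0,0.00,0.98,65.1\n"
)
df = load_ct_slices(sample_csv)

print(df.shape)   # (rows, columns)
print(df.head())  # first few records
print(df.dtypes)  # confirms that the features are numerical
```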

In the dataset, the variables named ‘value0’, ‘value1’, ..., ‘value383’ contain the feature values of the CT scan images for each patient. The last variable, ‘reference’, is our target variable and contains the relative location of the CT slice.

So, our problem is to predict ‘reference’ from the other features. There are 384 features in total (independent variables such as ‘value0’, ‘value1’, etc.) and the dataset has 53,500 records. As our target variable ‘reference’ is continuous, this is a regression problem.

## Analyzing Data

With so many features, our task becomes more complicated, and choosing the right subset of them is very important. Are all of these features equally important? Or are they correlated with each other? We will try to answer these questions.

First, we need to drop the unnecessary variable **‘patientId’** and separate the features from the target variable.

`df = df.drop(['patientId'], axis=1)`

`df_y = df['reference']`

`df_x = df.drop(['reference'], axis=1)`

Now ‘df_y’ holds our target variable and the data frame ‘df_x’ contains all the features.

Principal Component Analysis (PCA) can reveal a lot of detail about the data and reduce the number of features. Let’s see how we can do that.

We ran PCA configured to retain 95% of the variance in the data. It turned out that a total of 212 Principal Components (PCs) are needed to account for 95% of the variance.
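The PCA step described above can be sketched as follows. This is an illustration on synthetic data, not the article's original code: the stand-in `df_x` has 40 correlated columns instead of 384, so the number of components kept will differ from 212. Passing a float between 0 and 1 as `n_components` asks scikit-learn to keep just enough components to reach that share of explained variance.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic stand-in for df_x: 500 rows, 40 correlated columns from 8 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 8))
mixing = rng.normal(size=(8, 40))
df_x = pd.DataFrame(
    latent @ mixing + 0.1 * rng.normal(size=(500, 40)),
    columns=[f"value{i}" for i in range(40)],
)

# A float n_components keeps the smallest number of components whose
# cumulative explained variance reaches that fraction
pca = PCA(n_components=0.95)
pca_vectors = pca.fit_transform(df_x)

print("Components kept:", pca.n_components_)
print("Variance retained:", pca.explained_variance_ratio_.sum())
print("Shape of pca_vectors:", pca_vectors.shape)
```

The components come back sorted by decreasing explained variance, which is what makes "first" and "last" PCs meaningful later on.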

‘pca_vectors’ look like this:

We can treat these PCs as features. In this way, we reduce the number of dimensions from 384 to 212 (a reduction of about 44.8%).

We could reduce the dimensions further, but that may compromise the accuracy of the solution. For that reason, we will stop here and use these PCs as our features. Please remember that PCs are virtual features: each one is a linear combination of the original features, so it cannot be mapped directly to any physical property of the data. Also, the PCs are uncorrelated with each other, so the multicollinearity problem is gone.

We will now see how these PCs explain our target variable ‘reference’. We will draw a ‘*regression plot*’ for the first 3 (most important) and the last 3 (least important) PCs, since the PCs are sorted in decreasing order of the percentage of variance explained.
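A regression plot of this kind can be drawn with seaborn's `regplot`. The sketch below uses synthetic data and a fixed six components, so that the "first 3" and "last 3" PCs are simply the two rows of the grid; the data, the grid layout and the styling are assumptions made for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.decomposition import PCA

# Synthetic stand-in: 20 correlated features and a target driven by two latent factors
rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 6))
X = latent @ rng.normal(size=(6, 20)) + 0.1 * rng.normal(size=(300, 20))
y = 3.0 * latent[:, 0] - 2.0 * latent[:, 1] + rng.normal(scale=0.5, size=300)

pca_vectors = PCA(n_components=6).fit_transform(X)

# Top row: first 3 PCs, bottom row: last 3 PCs
fig, axes = plt.subplots(2, 3, figsize=(12, 6), sharey=True)
for col, ax in zip(range(6), axes.ravel()):
    sns.regplot(
        x=pca_vectors[:, col],
        y=y,
        ax=ax,
        ci=None,  # skip the bootstrap confidence band to keep the sketch fast
        scatter_kws={"s": 5},
        line_kws={"color": "red"},
    )
    ax.set_xlabel(f"PC{col + 1}")
    ax.set_ylabel("reference")
fig.tight_layout()
# In a notebook the figure renders inline; in a script, call plt.show() or fig.savefig(...)
```

A steep fitted line suggests that the PC carries linear information about the target, while a nearly flat line suggests that it carries little.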

Regression Plot

We can see that the first 3 PCs are quite dominant and clearly influence ‘reference’. Next come the last 3 PCs.

Regression Plot

As the regression lines are almost parallel to the x-axis, we can say that the last 3 PCs have very little linear influence on ‘reference’ and are the least important.

## Building the Machine Learning model

Now it is time to do the actual work, i.e. building the model. We already have a feature set, so we will use a regularised linear regression model. ElasticNet often performs well when dealing with a large number of features. It is a trade-off between ‘Lasso’ and ‘Ridge’ regression, with the mix controlled by a parameter α (alpha). For α = 1 the penalty becomes ‘Lasso’ and for α = 0 it becomes ‘Ridge’; a value between 0 and 1 blends the two. α is a hyperparameter here. Note that scikit-learn calls this mixing parameter ‘l1_ratio’ and reserves the name ‘alpha’ for the overall strength of the penalty.
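To make the Lasso end of that trade-off concrete, here is a small sketch on synthetic data. It relies on the scikit-learn naming, where `l1_ratio` is the Lasso/Ridge mix and `alpha` is the penalty strength; the data and the chosen `alpha=0.1` are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data: only the first two of ten features matter
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
enet_as_lasso = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)  # pure L1 penalty
enet_mixed = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)     # half L1, half L2

# With l1_ratio=1.0 ElasticNet solves the same problem as Lasso
print("Same as Lasso:", np.allclose(lasso.coef_, enet_as_lasso.coef_))
print("Non-zero coefficients (Lasso):", np.count_nonzero(lasso.coef_))
print("Non-zero coefficients (mixed):", np.count_nonzero(enet_mixed.coef_))
```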

First, we should split the data into training and test sets.
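The split can be done with scikit-learn's `train_test_split`. The article does not state the ratio it used, so the 80/20 split, the fixed seed and the names `train_x` and `train_y` below are assumptions; random arrays stand in for the PCs and the target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-ins for the principal components and the target
rng = np.random.default_rng(3)
pca_vectors = rng.normal(size=(1000, 5))
df_y = rng.normal(size=1000)

# Assumption: hold out 20% of the rows for testing, with a fixed seed for repeatability
train_x, test_x, train_y, test_y = train_test_split(
    pca_vectors, df_y, test_size=0.2, random_state=42
)

print(train_x.shape, test_x.shape)
```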

We should run cross-validation to find the optimal hyperparameter. We will test the α values (0.1, 0.3, 0.5, 0.7, 1.0).
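One way to run this search is scikit-learn's `ElasticNetCV`. Since the original code is not shown, the sketch below is one possible reading: the candidate values are passed as `l1_ratio` because the text describes them as the Lasso/Ridge mix, and the penalty strength is left to the estimator's automatic grid. Passing the same values as `alphas` would tune the penalty strength instead. The data and the 5-fold setting are assumptions.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split

# Synthetic data: 15 features, only the first four carry signal
rng = np.random.default_rng(4)
X = rng.normal(size=(600, 15))
true_coef = np.zeros(15)
true_coef[:4] = [4.0, -3.0, 2.0, 1.0]
y = X @ true_coef + rng.normal(scale=0.5, size=600)

train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Assumption: the tested values are the L1/L2 mix; cv=5 is also a guess
regr_en = ElasticNetCV(l1_ratio=[0.1, 0.3, 0.5, 0.7, 1.0], cv=5)
regr_en.fit(train_x, train_y)

print("Chosen l1_ratio:", regr_en.l1_ratio_)
print("Chosen alpha:", regr_en.alpha_)
```

After fitting, `regr_en.alpha_` and `regr_en.l1_ratio_` hold the combination with the lowest cross-validated error.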

Let’s see the accuracy.

`print('R2 value : ', r2_en)`
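The article does not show how ‘r2_en’ was computed. A plausible sketch, assuming it is the R2 score on the test data, uses `r2_score` from scikit-learn. The manual calculation next to it shows what the number means: one minus the ratio of the residual sum of squares to the total sum of squares. The data and the fixed ElasticNet settings are assumptions.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import r2_score

# Synthetic data with a simple ordered train/test split
rng = np.random.default_rng(5)
X = rng.normal(size=(400, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=1.0, size=400)
train_x, test_x = X[:300], X[300:]
train_y, test_y = y[:300], y[300:]

regr_en = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(train_x, train_y)
pred_y = regr_en.predict(test_x)

# Library call
r2_en = r2_score(test_y, pred_y)

# Same quantity by hand: 1 - SS_res / SS_tot
ss_res = np.sum((test_y - pred_y) ** 2)
ss_tot = np.sum((test_y - test_y.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print('R2 value : ', r2_en)
print('R2 by hand: ', r2_manual)
```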

The model explains almost 85% of the variance, which is quite good.

Let’s see the values of the coefficients, the α value and the intercept.

`print('Intercept: ', regr_en.intercept_)`

`print('Alpha: ', regr_en.alpha_)`

`print('Coefficients: ', regr_en.coef_)`

## Analyzing Result

We got an R2 value of about 0.85 from the linear regression model. Now we will see how the estimates relate to the most important and least important PCs, as in the previous analysis.

We will draw a *‘regression plot’* using the *‘reference’* value estimated by our model instead of the original value (unlike the previous analysis).

Using the first 3 PCs

Regression Plot

Using the last 3 PCs

Regression Plot

We see a significant improvement in Figure 10 as compared to Figure 7. This suggests that the estimated values fit well.

We can get the residuals, i.e. the estimation errors, for the test data as below:

`residuals = test_y - regr_en.predict(test_x)`

Now, we will look at the *‘residual plot’* for the test data.

We can see that there is no pattern in the ‘residual plot’; the residuals appear randomly scattered. In linear regression, this indicates a good model.
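A residual plot like the one described can be sketched with matplotlib by plotting the residuals against the predicted values. The synthetic data, the fixed ElasticNet settings and the plot styling are assumptions; only the `residuals` expression follows the article.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import ElasticNet

# Synthetic data with a truly linear relationship plus noise
rng = np.random.default_rng(6)
X = rng.normal(size=(500, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=500)
train_x, test_x = X[:400], X[400:]
train_y, test_y = y[:400], y[400:]

regr_en = ElasticNet(alpha=0.01, l1_ratio=0.5).fit(train_x, train_y)
predicted = regr_en.predict(test_x)
residuals = test_y - predicted

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(predicted, residuals, s=8)
ax.axhline(0.0, color="red", linewidth=1)  # reference line at zero error
ax.set_xlabel("Predicted 'reference'")
ax.set_ylabel("Residual")
ax.set_title("Residual plot (test data)")
fig.tight_layout()
```

A curve, a funnel shape or clusters in such a plot would hint that a linear model is missing some structure; an even band around zero is the desired outcome.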

## Building the pipeline for production-ready model

For use in production, we need to build the model as a machine learning pipeline.
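A pipeline of this kind can be sketched with scikit-learn's `Pipeline`, chaining PCA and ElasticNet so that raw features go in and predictions come out. The step names, the synthetic data, the split and the names `pl_train_x` and `pl_train_y` are assumptions; the original pipeline code is not shown and may include other steps.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the raw (pre-PCA) features and the target
rng = np.random.default_rng(7)
latent = rng.normal(size=(800, 6))
X = latent @ rng.normal(size=(6, 30)) + 0.1 * rng.normal(size=(800, 30))
y = latent @ np.array([3.0, -2.0, 1.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.3, size=800)

pl_train_x, pl_test_x, pl_train_y, pl_test_y = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# PCA is fitted on the training data only and then re-applied at predict time
pl_model = Pipeline([
    ("pca", PCA(n_components=0.95)),
    ("regr", ElasticNetCV(l1_ratio=[0.1, 0.3, 0.5, 0.7, 1.0], cv=5)),
])
pl_model.fit(pl_train_x, pl_train_y)

# Predict a single instance and compare with its true value
print("Predicted:", pl_model.predict(pl_test_x[20:21]))
print("Actual:   ", pl_test_y[20:21])
```

Wrapping both steps in one object means the same PCA projection learned during training is applied to every new instance, which avoids mismatches between training and serving.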

We can predict any real-time data instance using this pipeline model. For example, take one instance from the test data:

`pl_test_x[20:21]`

The ‘reference’ value for this instance in the original dataset:

`pl_test_y[20:21].values`

Now, let’s predict the ‘reference’ value using our model and compare it with the original dataset value:

`pl_model.predict(pl_test_x[20:21])`

The deviation from the original value is negligible.

## Persisting the model for future use

We can persist the model and load it on demand for any further use.
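The article does not show which library it used for persistence. A common choice for scikit-learn models is `joblib`, sketched below with a small stand-in pipeline; the file name is hypothetical, and a temporary directory is used only so the sketch cleans up after itself.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline

# Small stand-in pipeline fitted on synthetic data
rng = np.random.default_rng(8)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
pl_model = Pipeline([
    ("pca", PCA(n_components=5)),
    ("regr", ElasticNet(alpha=0.01)),
]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "ct_slice_model.joblib")  # hypothetical file name
    joblib.dump(pl_model, model_path)       # save the fitted pipeline to disk
    loaded_model = joblib.load(model_path)  # load it back on demand

# The reloaded model should reproduce the original predictions
same = np.allclose(pl_model.predict(X[:5]), loaded_model.predict(X[:5]))
print("Predictions match after reload:", same)
```

Models saved this way should be loaded with the same scikit-learn version that produced them, and only from trusted sources, since loading executes pickled code.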

## Conclusion

We learned how to use Principal Components to describe features and dependencies, build a regression model and predict values. There are other regression models as well, and readers of this article can try them. The Jupyter notebook for this article can be found on GitHub.

