Regression using Principal Components & ElasticNet

Prediction of Relative Locations of CT Slices in CT Images

Predicting the relative location of CT slices on the axial axis of the human body using regression techniques on very high-dimensional data

Avishek Nag
Jun 9, 2019 · 6 min read

Regression is one of the most fundamental techniques in Machine Learning. In simple terms, it means predicting a continuous variable from other independent categorical/continuous variables. The challenge comes when we have high dimensionality, i.e., too many independent variables. In this article, we will discuss a technique for regression modeling on high-dimensional data using Principal Components and ElasticNet. We will also see how to save the resulting model for future use.

We will use Python 3.x as the programming language, with scikit-learn and seaborn as the libraries for this article.

The data used here can be found at the UCI Machine Learning Repository, under the name “Relative location of CT slices on axial axis Data Set”. It contains features extracted from medical CT scan images for various patients (male and female). The features are numerical. As per UCI, the goal is ‘predicting the relative location of a CT slice on the axial axis of the human body’.

Let’s explore the dataset to understand it more clearly.
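A minimal loading sketch, assuming the UCI archive has been downloaded and extracted locally (the file name is the one shipped in the archive; adjust the path as needed):

import pandas as pd

# Load the extracted UCI CSV file
df = pd.read_csv('slice_localization_data.csv')
print(df.shape)  # expect (53500, 386): patientId + 384 features + 'reference'
df.head()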

Figure 1

Figure 2

From the above dataset, we can see that the variables named ‘value0’, ‘value1’, … ‘value383’ contain feature values of the CT scan images for each patient. The last variable is ‘reference’; this is our target variable and it contains the relative location of the CT slice.

So, our problem is to predict ‘reference’ from the other features. There are 384 features in total (independent variables ‘value0’, ‘value1’, etc.) and 53,500 records. As our target variable ‘reference’ is continuous, this is a regression problem.

With so many features, our task becomes more complicated, and choosing the right subset of them is very important. Are all of these features equally important? Or are they correlated with each other? We will try to find the answers.

First, we drop the unnecessary ‘patientId’ variable and separate the features from the target variable.

df = df.drop(['patientId'], axis=1)    # drop the identifier column
df_y = df['reference']                 # target variable
df_x = df.drop(['reference'], axis=1)  # the 384 feature columns

Therefore, ‘df_y’ is our target variable and the data frame ‘df_x’ contains all the features.

Principal Component Analysis (PCA) can reveal a lot of detail and reduce the number of features. Let’s see how we can do that.

Figure 3

Figure 4
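The code in the figures above appears only as images; a minimal equivalent sketch with scikit-learn (the variable name ‘pca_vectors’ follows the article’s usage):

from sklearn.decomposition import PCA

# Keep as many components as needed to explain 95% of the variance
pca = PCA(n_components=0.95)
pca_vectors = pca.fit_transform(df_x)
print('Number of principal components:', pca.n_components_)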

We have run PCA keeping enough components to hold 95% of the variance in the data. It turned out that 212 Principal Components (PCs) account for that 95% variance.

‘pca_vectors’ looks like this:

Figure 5

We can treat these PCs as features. By doing this, we are able to reduce the dimensionality from 384 to 212 (a reduction of about 44.8%).

We could reduce it further, but we might have to compromise on the accuracy of the solution. For that reason, we will stop here and use these PCs as our features. Remember that PCs are virtual features that don’t exist physically, i.e., it is not possible to draw any physical resemblance to the data. Also, the PCs are uncorrelated with each other, so the multicollinearity problem is gone.

We will now see how these PCs explain our target variable ‘reference’. We will draw a ‘regression plot’ using the first 3 (most important) and last 3 (least important) PCs (the PCs are sorted in decreasing order of the percentage of variance explained).
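A sketch of how such plots can be drawn with seaborn (the PC column names ‘pc0’, ‘pc1’, … are assumptions for illustration):

import matplotlib.pyplot as plt
import seaborn as sns

# Collect the PCs and the target in one frame (hypothetical column names)
df_pc = pd.DataFrame(pca_vectors,
                     columns=['pc%d' % i for i in range(pca.n_components_)])
df_pc['reference'] = df_y.values

# Regression plot of the first 3 PCs against 'reference'
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, ['pc0', 'pc1', 'pc2']):
    sns.regplot(x=col, y='reference', data=df_pc, ax=ax,
                scatter_kws={'s': 2}, line_kws={'color': 'red'})
plt.show()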

Regression Plot

Figure 6

We can see that the first 3 PCs are quite dominant and clearly influence ‘reference’. Next come the last 3 PCs.

Regression Plot

Figure 7

As the regression lines are almost parallel to the x-axis, we can say that the last 3 PCs are the least important.

Now, it is time to do the actual work, i.e., building the model. We already have a feature set. We will use a regularised linear regression model: ElasticNet gives better accuracy when dealing with a large number of features. It is a tradeoff between ‘Lasso’ and ‘Ridge’ regression, controlled by a mixing parameter α (alpha) that weights the L1 penalty against the L2 penalty: for α = 1 it becomes ‘Lasso’ and for α = 0 it becomes ‘Ridge’ (in scikit-learn this mixing parameter is called ‘l1_ratio’, while ‘alpha’ denotes the overall regularisation strength). We should set α between 0 and 1 for better accuracy; α is a hyperparameter here.

First, we should split the data into training and test sets.
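A sketch of the split (the 80/20 ratio and the random seed are assumptions; the variable names match the later snippets):

from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(
    pca_vectors, df_y, test_size=0.2, random_state=42)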

We should run cross-validation to find the optimal hyperparameter. We will test α values in the range (0.1, 0.3, 0.5, 0.7, 1.0).
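A sketch with scikit-learn’s ElasticNetCV; the candidate α values go to its ‘l1_ratio’ argument, cv=5 is an assumption, and ‘r2_en’ is computed here for the accuracy check below:

from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import r2_score

# Cross-validate the L1/L2 mixing parameter over the candidate values
regr_en = ElasticNetCV(l1_ratio=[0.1, 0.3, 0.5, 0.7, 1.0], cv=5)
regr_en.fit(train_x, train_y)

# R-squared on the held-out test set
r2_en = r2_score(test_y, regr_en.predict(test_x))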

Let’s see the accuracy

print('R2 value : ', r2_en)

Almost 85% of the variance is explained by the model, which is quite good.

Let’s see the values of the coefficients, the α value, and the intercept.

print('Intercept: ', regr_en.intercept_)
print('Alpha: ', regr_en.alpha_)        # regularisation strength chosen by CV
print('Coefficients: ', regr_en.coef_)  # one weight per principal component
Figure 8

We got 85% accuracy as per the linear regression methodology. Now we will see how the model is influenced by the most important and least important PCs, as in the previous analysis.

We will draw a ‘regression plot’ using the ‘reference’ value estimated by our model instead of the original value (unlike the previous analysis).
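A sketch reusing the earlier plotting pattern, this time against predicted values (the column names are the assumed ones from above):

# Predicted 'reference' for the test set, alongside the PC features
df_pred = pd.DataFrame(test_x,
                       columns=['pc%d' % i for i in range(test_x.shape[1])])
df_pred['predicted'] = regr_en.predict(test_x)

# First 3 PCs vs predicted values; swap in the last 3 PCs for the second plot
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, ['pc0', 'pc1', 'pc2']):
    sns.regplot(x=col, y='predicted', data=df_pred, ax=ax,
                scatter_kws={'s': 2}, line_kws={'color': 'red'})
plt.show()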

Using the first 3 PCs

Regression Plot

Figure 9

Using the last 3 PCs

Regression Plot

Figure 10

We see a significant improvement in Figure 10 as compared to Figure 7. This means the estimated values are doing well.

We can get the residuals, or errors of estimation, for the test data as below:

residuals = test_y - regr_en.predict(test_x)

Now, we will look at the ‘residual plot’ for the test data.
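A minimal sketch of such a plot (the styling choices are assumptions):

# Residuals vs predicted values; a patternless cloud indicates a good fit
plt.scatter(regr_en.predict(test_x), residuals, s=2)
plt.axhline(y=0, color='red')
plt.xlabel('Predicted reference')
plt.ylabel('Residual')
plt.show()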

Figure 11

We can see that there is no pattern in the residual plot; it is completely random. As per linear regression methodology, this indicates a good model.

For use in production, we need to build the model as a machine learning pipeline.
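A sketch with scikit-learn’s Pipeline, chaining the PCA step and the ElasticNet model; the step names and the ‘pl_*’ variable names (matching the snippets below) are assumptions:

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split

# Split on the raw features this time, since PCA now lives inside the pipeline
pl_train_x, pl_test_x, pl_train_y, pl_test_y = train_test_split(
    df_x, df_y, test_size=0.2, random_state=42)

pl_model = Pipeline([
    ('pca', PCA(n_components=0.95)),
    ('regr', ElasticNetCV(l1_ratio=[0.1, 0.3, 0.5, 0.7, 1.0], cv=5)),
])
pl_model.fit(pl_train_x, pl_train_y)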

We can predict any real-time data instance using this pipeline model

pl_test_x[20:21]
Figure 12

‘reference’ value in the original dataset

pl_test_y[20:21].values
Figure 13

Now, let’s predict the ‘reference’ value using our model and compare it with the original dataset value.

pl_model.predict(pl_test_x[20:21])
Figure 14

There is a negligible deviation.

We can persist the model and load it on demand for any further use.
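A sketch using joblib (the file name is a placeholder):

import joblib

# Save the fitted pipeline to disk...
joblib.dump(pl_model, 'ct_slice_model.pkl')

# ...and load it back later, on demand
pl_model_loaded = joblib.load('ct_slice_model.pkl')
pl_model_loaded.predict(pl_test_x[20:21])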

Figure 15

We learned how to use Principal Components to describe features and dependencies, build a regression model, and predict values. There are also other regression models; readers of this article can try them. The Jupyter notebook for this article can be found on GitHub.


Recently I authored a book on ML (https://twitter.com/bpbonline/status/1256146448346988546)
