Regression is one of the most fundamental techniques in Machine Learning. In simple terms, it means predicting a continuous variable from other independent categorical or continuous variables. The challenge comes when we have high dimensionality, i.e. too many independent variables. In this article, we will discuss a technique for regression modelling on high-dimensional data using Principal Components and ElasticNet. We will also see how to save that model for future use.
Getting Data & Problem Definition
We will use Python 3.x as the programming language, with 'scikit-learn' and 'seaborn' as the main libraries for this article.
The data used here can be found at the UCI Machine Learning Repository, under the name "Relative location of CT slices on axial axis Data Set". It contains features extracted from medical CT scan images of various patients (male and female). All features are numerical. As per UCI, the goal is 'predicting the relative location of a CT slice on the axial axis of the human body'.
Let's explore the dataset to understand it more clearly.
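The exploration code isn't shown in the article; as a sketch, the snippet below builds a tiny synthetic data frame that mimics the schema of the UCI file (the real one has 53,500 rows and 386 columns) and inspects it with pandas. The column names follow the article; the random values are placeholders.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the UCI CT-slice data: the real file has
# 53500 rows with columns patientId, value0..value383 and reference.
rng = np.random.default_rng(0)
n_rows, n_feats = 5, 384
df = pd.DataFrame(rng.random((n_rows, n_feats)),
                  columns=[f'value{i}' for i in range(n_feats)])
df.insert(0, 'patientId', range(n_rows))
df['reference'] = rng.random(n_rows)

print(df.shape)   # (5, 386) here; (53500, 386) for the real dataset
print(df.head(3))
```

With the real data, replace the synthetic frame by `pd.read_csv(...)` on the downloaded UCI file.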
From the dataset above, we can see that the variables named 'value0', 'value1', …, 'value383' contain the feature values of the CT scan images for each patient. The last variable, 'reference', is our target; it contains the relative location of the CT slice.
So, our problem is to predict 'reference' from the other features. There are 384 features in total (independent variables 'value0', 'value1', etc.) and the dataset contains 53,500 records. As our target variable 'reference' is continuous in nature, this is a regression problem.
With so many features, our task becomes more complicated, and choosing the right subset of them is very important here. Are all of these features equally important? Or are they correlated with each other? We will try to answer these questions.
First, we drop the unnecessary 'patientId' variable and separate the features from the target variable.
df = df.drop(['patientId'], axis=1)
df_y = df['reference']
df_x = df.drop(['reference'], axis=1)
Therefore, the series 'df_y' is our target variable and the data frame 'df_x' contains all the features.
Principal Component Analysis (PCA) can reveal a lot of detail and reduce the number of features. Let's see how we can do that.
We ran PCA, keeping enough components to hold 95% of the variance in the data. It turned out that a total of 212 Principal Components (PCs) are responsible for 95% of the variance.
'pca_vectors' looks like this:
We can treat these PCs as features. By doing so, we reduce the dimensionality from 384 to 212 (almost a 44.7% reduction).
We could reduce further, but we might have to compromise on the accuracy of the solution. For that reason, we will stop here and use these PCs as our features. Remember that PCs are derived features that don't exist physically, i.e. it is not possible to draw any physical resemblance to the original data. Also, the PCs are uncorrelated with each other, and hence the multicollinearity problem is gone.
We will now see how these PCs explain our target variable 'reference'. We will draw a 'regression plot' using the first 3 (most important) and the last 3 (least important) PCs, as they are sorted in decreasing order of the percentage of variance explained.
We can see that the first 3 PCs are quite dominant and strongly influence 'reference'. Next come the last 3 PCs.
As their regression lines are almost parallel to the x-axis, we can say that the last 3 PCs are the least important.
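The plotting code isn't shown in the article; a sketch of such a regression plot with seaborn's `regplot` follows, using synthetic stand-in PCs (the column names `pc_1`..`pc_3` are made up for illustration). The figure is saved to a file so the script can run headlessly.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in: three PCs, the first two actually driving 'reference'.
rng = np.random.default_rng(0)
pcs = rng.normal(size=(200, 3))
reference = 3 * pcs[:, 0] - 2 * pcs[:, 1] + rng.normal(size=200)
plot_df = pd.DataFrame(pcs, columns=['pc_1', 'pc_2', 'pc_3'])
plot_df['reference'] = reference

# One regression plot per PC: a flat fitted line means a weak influence.
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for i, ax in enumerate(axes):
    sns.regplot(x=f'pc_{i + 1}', y='reference', data=plot_df, ax=ax)
fig.savefig('pc_regplots.png')
```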
Building the Machine Learning model
Now it is time to do the actual work, i.e. building the model. We already have a feature set. We will use a regularised linear regression model: ElasticNet gives better accuracy when dealing with a large number of features. It is a trade-off between 'Lasso' and 'Ridge' regression, with a mixing parameter α (alpha): for α = 1 it becomes 'Lasso', and for α = 0 it becomes 'Ridge'. We should set α between 0 and 1 for better accuracy, so α is a hyperparameter here. (Note that in scikit-learn this mixing parameter is called `l1_ratio`, while `alpha` denotes the overall regularisation strength.)
First, we split the data into a training set and a test set.
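The split itself isn't shown; a sketch with scikit-learn's `train_test_split` follows, using synthetic stand-ins for the 212 PCA features and the target. The 80/20 ratio and the random seed are assumptions, not stated in the article.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-ins for the PCA-transformed features and the 'reference' target.
rng = np.random.default_rng(0)
df_x_pca = pd.DataFrame(rng.normal(size=(1000, 212)))
df_y = pd.Series(rng.normal(size=1000), name='reference')

# An 80/20 split is assumed here; the article does not state its ratio.
train_x, test_x, train_y, test_y = train_test_split(
    df_x_pca, df_y, test_size=0.2, random_state=42)
print(train_x.shape, test_x.shape)  # (800, 212) (200, 212)
```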
Next, we run cross-validation to find the optimal hyperparameter. We will test α values in the range (0.1, 0.3, 0.5, 0.7, 1.0).
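The cross-validation code isn't shown; a minimal sketch with `ElasticNetCV` follows, on small synthetic data. The α grid from the article is passed as `l1_ratio` (scikit-learn's name for the Lasso/Ridge mixing parameter); `ElasticNetCV` also searches its own grid of `alpha` penalty strengths internally.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Synthetic stand-in data; in the article this would be the PCA features.
rng = np.random.default_rng(0)
train_x = rng.normal(size=(300, 20))
coef = np.zeros(20)
coef[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]   # only a few informative features
train_y = train_x @ coef + 0.1 * rng.normal(size=300)

# Cross-validate over the article's candidate mixing values.
regr_en = ElasticNetCV(l1_ratio=[0.1, 0.3, 0.5, 0.7, 1.0], cv=5)
regr_en.fit(train_x, train_y)
print('chosen l1_ratio:', regr_en.l1_ratio_, ' chosen alpha:', regr_en.alpha_)
```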
Let's see the accuracy (the `score` method of a fitted regressor returns R² on held-out data):

r2_en = regr_en.score(test_x, test_y)
print('R2 value : ', r2_en)
The model explains almost 85% of the variance, which is quite good.
Let's see the values of the coefficients, the α value and the intercept:
print('Intercept: ', regr_en.intercept_)
print('Alpha: ', regr_en.alpha_)
print('Coefficients: ' , regr_en.coef_)
So the model reaches an R² of about 0.85 under the linear regression methodology. Now, as in the previous analysis, we will see how its predictions are influenced by the most important and least important PCs.
This time we will draw the 'regression plot' using the 'reference' value estimated by our model instead of the original value (unlike the previous analysis).
Using the first 3 PCs
Using the last 3 PCs
We see a significant improvement in Figure 10 as compared to Figure 7. This means the estimated values are doing well.
We can get the residuals (errors of estimation) for the test data as below:
residuals = test_y - regr_en.predict(test_x)
Now, let's look at the 'residual plot' for the test data.
We can see that there is no pattern in the 'residual plot'; it is completely random. As per linear regression methodology, this indicates a good model.
Building the pipeline for a production-ready model
For use in production, we need to build the model as a machine learning pipeline.
We can predict any real-time data instance using this pipeline model.
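The pipeline code isn't shown in the article; a minimal sketch follows, chaining PCA and ElasticNetCV in a scikit-learn `Pipeline` on synthetic stand-in data. The benefit is that a raw feature row can be fed straight in: the pipeline applies the PCA projection before predicting.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the raw 384 CT features and the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 384))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=400)

# PCA (95% variance) followed by cross-validated ElasticNet; step names
# and the reduced grids are illustrative choices, not the article's.
model = Pipeline([
    ('pca', PCA(n_components=0.95)),
    ('regr', ElasticNetCV(l1_ratio=[0.1, 0.5, 1.0], n_alphas=20, cv=3)),
])
model.fit(X, y)
print(model.predict(X[:1]))  # predict one "real-time" raw instance
```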
‘reference’ value in the original dataset
Now, let's predict the 'reference' value using our model and compare it with the original dataset value.
There is a negligible deviation.
Persisting the model for future use
We can persist the model and load it on demand for any further use.
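The persistence code isn't shown; a sketch with `joblib` (a standard choice for scikit-learn models, installed alongside it) follows. A small stand-in model is dumped to disk and reloaded; the file name is an illustrative assumption.

```python
import joblib
import numpy as np
from sklearn.linear_model import ElasticNet

# Fit a small stand-in model on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 2.0, 0.0, 0.0, -1.0])
model = ElasticNet(alpha=0.01).fit(X, y)

joblib.dump(model, 'ct_slice_model.pkl')       # persist to disk
restored = joblib.load('ct_slice_model.pkl')   # load on demand later
print(np.allclose(model.predict(X), restored.predict(X)))
```

The restored estimator behaves identically to the original, so it can serve predictions without retraining.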
We learned how to use Principal Components to describe features and their dependencies, build a regression model, and predict values. There are other regression models too, which readers of this article can try. The Jupyter notebook for this article can be found on GitHub.
References

Principal Component Analysis — https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c
Elastic Net, Lasso & Ridge regression — https://medium.com/@yongddeng/regression-analysis-lasso-ridge-and-elastic-net-9e65dc61d6d3
I recently authored a book on Machine Learning (https://twitter.com/bpbonline/status/1256146448346988546).