Compressing A Cascaded Regression Model: How to Sparsify the Matrix

Project done by Jinbae, a machine learning intern at BinaryVR

Before We Get Into It

From Where Did This Project Start?

First, we want to explain the basic notions this article deals with and why we conducted this project. If you already know about CRM, LRM, and sparse approximation, you may skip this introduction.

CRM, the Cascaded Regression Model, is built on the idea of the Linear Regression Model, a.k.a. LRM. LRM is a statistical/machine learning method for finding a hyperplane that fits given data points. In matrix form, this means modeling the regression matrix R in the equation Y = X*R + c.
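As a tiny illustration of that equation (toy data invented here; NumPy's least-squares solver stands in for whatever training procedure is actually used), R and c can be recovered from samples of X and Y:

```python
import numpy as np

# Toy illustration of fitting R in Y = X*R + c by ordinary least squares.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))      # 100 samples, 5 features
R_true = rng.standard_normal((5, 2))   # ground-truth regression matrix
c_true = np.array([0.5, -1.0])         # intercept
Y = X @ R_true + c_true

# Absorb the intercept by appending a column of ones to X.
X1 = np.hstack([X, np.ones((100, 1))])
R_hat, *_ = np.linalg.lstsq(X1, Y, rcond=None)

print(np.allclose(R_hat[:5], R_true))  # recovered R
print(np.allclose(R_hat[5], c_true))   # recovered intercept c
```

Because the toy data is noiseless and overdetermined, the solver recovers R and c exactly up to numerical precision.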

Reference: Linear Regression Wiki Page

The idea behind CRM is very interesting. CRM splits its regression process into several steps and solves each step by mixing linear and nonlinear layers. This is why the method is often called ensemble learning.

Thanks to this modification, non-linear tasks can be tackled with linear approaches. A well-known use case of CRM is object detection, and it is frequently applied in face recognition, pose detection, and similar tasks.

Reference: Cascaded Classifiers Wiki Page, Ensemble Learning Wiki Page

Our project is rooted in a distinctive feature of CRM: its matrix multiplications can be massively compressed through a sparse approximation process. After training, a subset of the regression matrix's coefficients is insignificant, contributing so little that it can be removed without hurting overall performance. In addition, the cascade of several steps dilutes the impact of the performance drop, so the model can be compressed even further.

The question is how much we can compress the CRM with minimal performance drop. How much can we sparsify a model matrix without hurting the overall CRM's performance quality? And which compression method results in the smallest loss of quality? These questions were the starting point of this project.

How to Sparsify the Matrix in CRM

In this R&D project, we experiment with sparse approximation in CRM. This article compares performance across various regularization methods and sparsity degrees, introduces another compression approach based on Singular Value Decomposition (SVD), and examines sparse approximation applied after SVD. We include the detailed experimental process, the reasoning behind it, and our conclusions and insights. At the end of the article, we discuss further thinking points and potential ideas for future experiments.

[1] Project description

The goal of the project:
Building a sparse Cascaded Regression model through sparse approximation

Theoretical approaches to attempt:
1) Sparse approximation with different regularization methods
- Loss value comparison depending on the regularization methods and sparsity degree
2) Sparse approximation after matrix factorization
- Singular Value Decomposition and sparsifying affiliated components

[2] What we will cover here

1. Which R Model Will Perform Better After the Compression?
- Model building with different regularization methods
a) How to sparsify the matrix
b) Different regularization methods
c) Charts: loss value comparison in R for different regularization methods

2. Another Approach to Overcome the Loss/Sparsity Threshold
- Truncated Singular Value Decomposition and Sparsifying Factorized Components
a) Limitation of the previous approach
b) Singular Value Decomposition and Truncated SVD
c) How to sparsify the truncated SVD model
d) Charts: loss value comparison in truncated SVD for the number of singular values

3. So, Who’s the Winner?
- The conclusion of the experiment
a) Sparse approximation comparison
b) Future direction

Which R Model Will Perform Better After the Compression?

Model Building with Different Regularization Methods

How to Sparsify the Matrix

To put it simply, we want to build a model matrix R in the equation ‘X*R=Y.’ If you build R with traditional machine learning or deep learning methods, R can get so large that it cannot run under certain conditions, for instance, on a mobile phone. That is why we need to compress the model. Since the model is expressed as a matrix, compression often means sparsifying that matrix.

In calculating the equation ‘X*R=Y,’ the more zero-valued components R has, the less time the calculation takes. A matrix with numerous zero-valued components is called a sparse matrix, and reengineering R into a sparse R is called sparsification. Unfortunately, by the nature of compression, sparsification causes a performance drop that grows with the sparsity degree, so you need to find a compromise sparsity level. We call this sparse approximation.
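To make the storage and speed benefit concrete, here is a small sketch (a hypothetical 90%-sparse matrix; SciPy's CSR format is one common way to store it): only the nonzero entries are kept, and the product touches only those entries.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(1)
R = rng.standard_normal((1000, 200))
R[rng.random(R.shape) < 0.9] = 0.0       # zero out ~90% of the entries

R_csr = sparse.csr_matrix(R)             # stores only the nonzero entries
x = rng.standard_normal(1000)

y_dense = R.T @ x                        # full dense multiply
y_sparse = R_csr.T @ x                   # touches only the ~10% nonzeros
print(np.allclose(y_dense, y_sparse))    # same result, far fewer multiply-adds
```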

So how can we make R sparse, that is, give it numerous zero components? The simplest approach is this: change the smallest component values to 0, then retrain R to compensate for the performance drop. Although recovering the exact original quality is not guaranteed, retraining helps recover a good part of it. We set a threshold on the allowed quality drop and push for the maximum percentage of zero components in R that stays within that threshold.
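A minimal sketch of this prune-then-retrain loop (toy data, and plain gradient descent standing in for the actual training setup):

```python
import numpy as np

def prune_smallest(R, sparsity):
    """Zero out the fraction `sparsity` of entries with the smallest magnitude."""
    thr = np.quantile(np.abs(R), sparsity)
    mask = np.abs(R) >= thr          # keep only entries at or above the threshold
    return R * mask, mask

def retrain(X, Y, mask, lr=0.1, steps=500):
    """Least-squares retraining in which pruned entries are held at zero."""
    R = np.zeros(mask.shape)
    for _ in range(steps):
        grad = X.T @ (X @ R - Y) / len(X)
        R -= lr * grad
        R *= mask                    # zero-constraint: pruned entries stay zero
    return R

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
Y = X @ rng.standard_normal((10, 3))

R_full, *_ = np.linalg.lstsq(X, Y, rcond=None)
R_pruned, mask = prune_smallest(R_full, sparsity=0.5)
R_retrained = retrain(X, Y, mask)

loss = lambda R: np.mean((X @ R - Y) ** 2)
print(loss(R_retrained) <= loss(R_pruned))   # retraining recovers some quality
```

Retraining cannot beat the unpruned model, but it finds the best matrix on the surviving support, which is why it recovers part of the dropped quality.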

Different Regularization Methods

We set diverse regularization methods as independent variables and compared the resulting performance. Regularization is the process of adding information in order to prevent overfitting. All the regularization methods we experimented with are listed below.

1. Ridge (L2)
2. Variable Selection: Lasso (L1), Exponential Lasso (EL)
3. Group Selection: Group Lasso (GL)
4. Bi-level Selection: Group Exponential Lasso (GEL)
5. Mixed method: ElasticNet (Elas: L1 + L2), Sparse Group Lasso (SGL: L1 + GL)

First, we compared the loss values of the regularization methods for the initial training of R. Second, we compared loss values after sparsifying and retraining each R without a regularization term. These two comparisons are designed to figure out which regularization method is practically effective. Our third and final step applies the best regularization method from the previous two attempts to the initial build of R and then retrains that R with each of the different regularization methods. This last experiment verifies whether the best method for initial training differs from the best method for retraining.
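To illustrate why the L1 family matters for sparsification, here is a small numpy-only comparison (ISTA, a standard proximal-gradient solver, standing in for whichever solver was actually used; the data and λ are made up): L1's soft-thresholding produces exact zeros, while L2 only shrinks values toward zero.

```python
import numpy as np

def soft_threshold(Z, t):
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def lasso_ista(X, Y, lam, lr=0.1, steps=500):
    """L1-regularized least squares via proximal gradient (ISTA)."""
    R = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(steps):
        grad = X.T @ (X @ R - Y) / len(X)
        R = soft_threshold(R - lr * grad, lr * lam)
    return R

def ridge(X, Y, lam):
    """L2-regularized least squares (closed form)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X / len(X) + lam * np.eye(d), X.T @ Y / len(X))

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
R_true = np.zeros((20, 1))
R_true[:5] = 1.0                               # only 5 of 20 features matter
Y = X @ R_true + 0.1 * rng.standard_normal((200, 1))

R_l1 = lasso_ista(X, Y, lam=0.1)
R_l2 = ridge(X, Y, lam=0.1)
print((R_l1 == 0).mean(), (R_l2 == 0).mean())  # L1 yields exact zeros, L2 none
```

The group and bi-level variants in the list follow the same principle but threshold whole groups of coefficients instead of individual entries.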


Charts: Loss Values in R with Different Regularization Methods

[Chart 1]

This chart compares performance in the initial regularization training. In this experiment, Sparse Group Lasso performed best, with the smallest validation loss; however, the validation loss values of the other methods were similar.

[Chart 2]

This chart compares performance after sparsification and regularization-free retraining. The experiments were conducted at step 0 with a 50% sparsity rate. Unlike Chart 1, Chart 2 shows that L1 is slightly better. Still, the difference is so marginal that it does not carry meaningful insight.

[Chart 3]

Next, we retrained the sparse R with different regularization methods. Chart 3 demonstrates that the Rs retrained with regularization perform better than the R retrained without it.

Summary

Throughout the experiment, we concluded that we can reach 50% or more sparsity with only a minor performance drop. The validation loss values did not differ significantly across the regularization methods. Nonetheless, an interesting point is that the ranking of the methods by validation loss is similar in all three experiments. This indicates that methods that do well in the initial feature selection tend to perform better in retraining as well. For example, Sparse Group Lasso had the smallest validation loss in all three experiments.

The real problem lies beyond all these findings: no matter which method we applied, R could not reach the goal sparsity of over 80%. None of these trials produced a single regularization method strong enough to satisfy both the performance and the size thresholds.

Another Approach to Overcome the Loss/Sparsity Threshold

- Truncated SVD and Sparsifying Factorized Components

Factoring R Before Sparse Approximation

How can we sparsify R further without hurting overall performance? What if we view R not as a whole but as several components? Matrix factorization lets you glance at different characteristics of the whole matrix. For example, 30 can be factorized into 2*3*5, revealing that 30 carries the characteristics of 2, 3, and 5 at the same time. Following this idea, we decided to factorize R first and then sparsify each component.

Singular Value Decomposition and Truncated SVD

There are various matrix factorization schemes; here, we used the Singular Value Decomposition method. SVD factorizes an arbitrary matrix A into U*S*V^T, as described below.

Reference: Singular Value Decomposition — From a book ‘Foundations of Data Science’ Chapter 3.4 (Avrim Blum, John Hopcroft, and Ravindran Kannan, 2018)

With SVD, each factor of A bears a distinctive character. Look at S, for example: S is a matrix whose components are all zero except on the diagonal, and those diagonal entries are called singular values. Considering how the matrices multiply, U and V can be truncated according to how many singular values of S are kept. A low-rank approximation thus becomes possible, controlled by the number of retained singular values. This compression method is called truncated SVD, and with it you can tune the compression level via the number of singular values.
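A minimal numpy sketch of truncated SVD on a made-up matrix; the Frobenius error of the rank-k approximation equals exactly the energy in the discarded singular values (the Eckart-Young result):

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.standard_normal((100, 40)) @ rng.standard_normal((40, 60))  # rank <= 40

U, s, Vt = np.linalg.svd(R, full_matrices=False)

k = 20                                    # keep the 20 largest singular values
R_k = (U[:, :k] * s[:k]) @ Vt[:k]         # rank-k approximation of R

# Storage drops from 100*60 = 6000 values to 20*(100 + 60) + 20 = 3220.
err = np.linalg.norm(R - R_k)
print(np.isclose(err, np.sqrt(np.sum(s[k:] ** 2))))  # discarded-energy identity
```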

Reference: Trace norm regularization and faster inference for embedded speech recognition RNNs (Markus Kliegl, Siddharth Goyal, Kexin Zhao, Kavya Srinet and Mohammad Shoeybi, 2017)

How to Sparsify the Truncated SVD Model

First, we conducted a low-rank approximation based on the number of singular values. Then, to push the truncated SVD compression further, we sparsified the U of the truncated SVD and retrained it to achieve additional sparsity.

How to:
1. Train a regressor with ridge regularization
2. Factorize the R matrix into three matrices (USV)
3. Truncate the U and V
4. Retrain the U and V
5. Sparsify the U
6. Retrain U and V iteratively with zero-constraint
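The six-step recipe above can be sketched roughly as follows (toy data invented here; plain gradient steps and a simple magnitude threshold stand in for the actual training details, S is folded into U so that R ≈ A*B, and steps 4 and 6 are merged into one constrained retraining loop for brevity):

```python
import numpy as np

def truncated_sparse_regressor(X, Y, k, sparsity, lam=0.1, lr=0.01, steps=500):
    """Sketch: ridge fit -> truncated SVD -> sparsify U -> constrained retrain."""
    n, d = X.shape
    # 1) Train a regressor with ridge regularization (closed form)
    R = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ Y / n)
    # 2-3) Factorize R and truncate to the k largest singular values
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    A = U[:, :k] * s[:k]                 # fold S into U, so R ~ A @ B
    B = Vt[:k].copy()
    # 5) Sparsify A: zero out its smallest-magnitude entries
    mask = np.abs(A) >= np.quantile(np.abs(A), sparsity)
    A = A * mask
    # 4/6) Retrain A and B with gradient steps, keeping pruned entries at zero
    for _ in range(steps):
        E = X @ (A @ B) - Y              # residual
        gA = (X.T @ E) @ B.T / n
        gB = A.T @ (X.T @ E) / n
        A -= lr * gA
        A *= mask                        # zero-constraint on A
        B -= lr * gB
    return A, B

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
Y = X @ rng.standard_normal((10, 3))

A, B = truncated_sparse_regressor(X, Y, k=2, sparsity=0.5)
print(A.shape, B.shape, (A == 0).mean())
```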

Charts: Loss Value Comparison in Truncated SVD Depending On Singular Values

Before sparsifying U, we compared the performance of the original R with that of the retrained Rs for different numbers of singular values.

[Chart 1]

Take a look at where the validation loss increases rapidly, in other words, where R's performance drops sharply. The retrained truncated SVD showed better results than the truncated SVD without retraining.

[Chart 2]

Next, let’s compare the performance of the truncated SVD model with that of the retrained truncated SVD model. The graph indicates that the retrained truncated SVD model performs slightly better than the regular truncated SVD model. However, the gap is so narrow that it does not hold significant meaning.

[Chart 3]

The third graph compares the performance of the regular truncated SVD model with that of the truncated SVD model after sparsifying U. While maintaining the same level of performance quality, how much can we sparsify the matrix? The red dotted line shows the performance of the regular truncated SVD model, the blue line shows the performance of the sparsified U, and the yellow line shows the performance of the retrained sparsified U. The graph indicates that the retrained sparsified U reaches 45% sparsity while matching the performance of the regular truncated SVD model.

Summary

Overall, we compressed R by first truncating singular values and then sparsifying U. From the first compression, we observed that R with only half of its singular values still performed on par with the original R. From the second compression, we saved an additional 40-50% of space by sparsifying U. Refer to Chart 3 above.

So Who’s the Winner?

In conclusion, the results were as we expected. Sparsifying the U obtained from the truncated SVD of R was more effective than simply sparsifying R. When R is factorized, we can see the tendencies of each component, which makes it possible to separate the small insignificant parts from the large significant ones. We also found some interesting points that we should experiment with in the future. Let's look through them in this chapter.

Future Direction 1: Relationship Between Loss and Singular Values

What do you see in the graph below? The curve looks almost like a 1/x graph, with the loss decaying rapidly. With that in mind, focus on the red box in the graph: a few initial singular values account for most of the performance.


This suggests experimenting with how many singular values to keep for an optimal CRM. The other variable to control is the number of cascade steps. Finding an optimal pair of these two variables would give us a better engineering setup.

Future Direction 2: Understanding Hidden Relationships Among Features

Another possible way to boost performance relates to feature selection. Although not covered in depth here, selecting features and controlling their values are important in building R. While modulating the feature values, we noticed that some features seem to share smaller parts related to other features. Group selection can also be a good way to enhance performance, so we recommend giving it a try.

Reference: Group selection in a high-dimensional model (Jian Huang, Patrick Breheny, and Shuangge Ma, 2012); Group exponential lasso for bi-level variable selection (Patrick Breheny, 2015)

Unfortunately, we cannot share the detailed results in this blog. However, we believe this line of thinking can lead to meaningful results: by understanding the hidden relationships among features, you can compress the least relevant components first.

Explore open positions: https://angel.co/binaryvr/jobs
Send your resume for the internship: contact@binaryvr.com
Learn about working at BinaryVR: What Made Engineers from Tech Giants Gather at a Small AI Startup?

We are BinaryVR, aiming for seamless interaction between AI and people's daily lives in the computer vision field. We develop the world's top-quality facial motion capture solutions, HyprFace and BinaryFace, keeping our core value in constant evolution.
