An Ultimate Guide to Sparse Analysis: A Mathematical Perspective

A comprehensive guide for analyzing sparse systems with applications in Machine Learning and Computer Vision

Sai Darahas A

In this article, I will try to demystify the idea of sparsity, one of the most important concepts powering today's multimedia industry. We will learn different methods for analyzing systems with sparse representations and discuss specific applications, in domains like machine learning and computer vision, where these methods have had a profound impact.

First, we will see how data is represented in matrix form, and then dive into a few important definitions and concepts, such as singular value decomposition, that will help us appreciate the intricacies of the different algorithms behind these applications.

Keywords: Singular Value Decomposition (SVD), Compressed Sensing, Regression, Recommendation Systems, Econometrics.

A vector or a matrix is said to be sparse if it has only a few non-zero entries.

But, why do we want systems with sparse representations anyway?

We are drowning in information and starving for knowledge. — Rutherford D. Rogers

With the advent of Big Data, we need a reliable way of storing and representing data. This brings us to the idea of sparsity, whose main goal is to reduce the dimensionality of the data for a more compact representation.

Some general applications of sparse analysis include econometrics, genome-wide association, image denoising, image inpainting, image super-resolution, video surveillance, face recognition, and compressive sensing. We will look at a few of these applications towards the end of the article.

This article ties all these applications together and presents a comprehensive explanation of sparsity and of the methods used to analyze such systems. A word of warning: it can be a little heavy on math, so I recommend some basic knowledge of linear algebra and optimization before reading any further. Nevertheless, I will define new terms and illustrate important concepts as and when required throughout this article. Sit tight and let's begin…

The data is represented in matrix form as shown below.

This matrix A is called the data or system matrix and has m rows and n columns. In compression applications it is also called a dictionary, comprising the discrete cosine transform (DCT) or discrete wavelet transform (DWT) coefficients of the input, while in image super-resolution A is a transformation matrix that converts a low-resolution image into a high-resolution one.

In machine learning, each column vector in the data matrix A represents a feature. For example, suppose you want to buy some groceries at the market. Each item is characterized by features such as weight (in kg or pounds), percentage of discount, price per unit, and so on.

As stated earlier, the data matrix, here A, is considered sparse if it can be represented efficiently using only a few of its non-zero entries.

Now let us recall a few important definitions from linear algebra.

Span

The set of all linear combinations of vectors v1, v2, …,vn is called the span of these vectors and is written as Span{v1,…,vn}.

For example, if vectors v1, v2, and v3 in R³ are linearly independent, then

  1. Span{v1} is a line through the origin
  2. Span{v1, v2} is a plane (a copy of R²)
  3. Span{v1, v2, v3} is all of R³

Basis

Let V be a vector space. A set B of vectors is a basis for V if the vectors in B are linearly independent and Span B = V.

For example, {[1, 0, 0], [0, 1, 0], [0, 0, 1]} is a basis for the vector space R³.

Rank

The rank of a matrix is defined as the maximum number of linearly independent row or column vectors in the matrix.

For example, the rank of the matrix

is equal to 2.

Note: For any given matrix, the row rank equals the column rank.
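To make this concrete, here is a quick NumPy check on an illustrative 3×3 matrix (not the one in the figure above) whose third row is the sum of the first two, so only two rows, and hence two columns, are linearly independent.

import numpy as np

# Illustrative matrix: the third row is the sum of the first two,
# so the matrix has only two linearly independent rows (and columns).
A = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [5., 7., 9.]])

print(np.linalg.matrix_rank(A))   # prints 2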

Orthogonal

A square matrix U is said to be orthogonal if U'U = I. The rows (and columns) of an orthogonal matrix form an orthonormal basis, i.e., each has length one and they are mutually perpendicular.

Solving systems of linear equations

The general form of the matrix equation is Ax = b.

Here A is the system or data matrix, x is a vector comprising the solutions of the equation, and b is the output. If we unravel each of these terms, we get

It is important to note that the output b can also be represented as a linear combination of the column vectors of A, weighted by the elements of the vector x, as

This is just a more intuitive form of matrix-vector multiplication.

If the equation Ax=b does not have a solution, which is often the case in the real world, we can instead find the best approximate solution by minimizing the squared differences between b and Ax, known as the least-squares problem, shown below.

More specifically, our goal is to minimize the sum of squares of errors between Ax and output b.
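As a minimal sketch, NumPy's lstsq routine solves exactly this minimization; the matrix A and the vector b below are randomly generated, purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))   # 100 observations, 5 features (over-determined)
b = rng.standard_normal(100)

# x minimizes the sum of squared errors ||Ax - b||^2
x, residual, rank, sv = np.linalg.lstsq(A, b, rcond=None)
print(x)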

However, if the system is under-determined, i.e., it contains more unknowns than equations, least squares no longer gives a unique solution. For any under-determined system, the number of columns is greater than the number of rows, so there are more features than observations or samples (m < n). Such systems usually have infinitely many solutions, making the model too flexible: the solutions can easily overfit the data, leaving the model incapable of generalization.

To solve such problems we need to restrict the solution space by incorporating some prior knowledge about the data, using regularization.

Singular Value Decomposition (SVD)

Before talking about regularization, let us look at SVD, one of the most important concepts in linear algebra, which asserts that any matrix, for instance X, can be decomposed (or factorized) into a product of three matrices.

where both U and V are orthogonal matrices and Σ is a diagonal matrix.

We refer to the values σ1, σ2, …, σn as the singular values, the vectors u1, u2, …, un as the left singular vectors, and the vectors v1, v2, …, vn as the right singular vectors.

The singular values are always non-negative and appear in descending order, i.e., σ1 ≥ σ2 ≥ … ≥ σn, corresponding to the relative importance of each term in the expansion

Each term in the above sum has rank 1; after all, it depends only on one column vector and one row vector (remember that for any matrix the row rank equals the column rank). Thus, truncating the sum to its first k terms gives the closest rank-k approximation of the matrix X.

For example, the best rank-one approximation of the matrix X is given by u1σ1v1'.

This notion of decomposing the matrix as a sum of rank 1 matrices can be used to solve the rank minimization problem stated as

where ||.||F is called the Frobenius norm.

Essentially, we are trying to find the best rank-r approximation of the matrix A by minimizing the Frobenius norm of the difference between the approximation Â and A.

The Frobenius norm of a matrix is the square root of the sum of the squares of its entries

The solution to this problem can be found using singular value decomposition, and it is sparse in the sense that all but the first r singular values are set to zero.
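Here is a minimal NumPy sketch of this idea: keep only the first r singular values and the corresponding singular vectors to obtain the best rank-r approximation in the Frobenius norm. The matrix A and the choice r = 2 are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))
r = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Best rank-r approximation: the sum of the first r rank-one terms u_i * sigma_i * v_i'
A_hat = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

print(np.linalg.matrix_rank(A_hat))        # r
print(np.linalg.norm(A - A_hat, 'fro'))    # approximation error in the Frobenius norm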

There are several variants of singular value decomposition such as principal component analysis (PCA), Robust PCA, linear discriminant analysis (LDA), etc.

Note

It turns out that the right singular vectors are nothing but the eigenvectors of the matrix X’X and the left singular vectors are the eigenvectors of matrix XX’. The eigenvalues of the symmetric positive semidefinite matrix X’X are given by the squares of corresponding singular values σ1,σ2,…,σn of X.
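This relationship is easy to verify numerically; the matrix X below is random, purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 4))

s = np.linalg.svd(X, compute_uv=False)         # singular values of X (descending)
eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]    # eigenvalues of X'X, sorted descending

print(np.allclose(eigvals, s**2))              # True: eigenvalues are the squared singular values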

Regularization

As discussed earlier, using regularization we can constrain the solution space by incorporating some prior knowledge into the problem, such as the solution x is sparse (or small).

Requiring the solutions of the equation Ax=b to be sparse makes the system more robust and in turn prevents the model from overfitting.

The regularization is achieved through a norm of the solution vector. A norm is a function on a vector space that satisfies certain properties related to scalability and additivity and takes the value zero only if the input vector is zero.

When the model is regularized using the l2 norm, the problem of finding the optimal solution is called Ridge; when it is regularized using the l1 norm, the problem is called Lasso.

Among the norms used for regularization, only the l1 norm and the lp norms (0 < p < 1) are proven to give sparse solutions. Nevertheless, we shall discuss each of these norms in the coming sections.

l2 norm

The l2 norm or Euclidean norm of a vector x is

As discussed earlier, we are trying to find the optimal solutions for the least-squares problem

By regularizing the equation Ax=b using l2 norm, we get

We can visualize the solution space of this problem in 2D as shown below.

As the figure suggests, while the l2 norm can regularize the model, it does not yield sparse solutions.

By swapping the constraint and objective we can rewrite the same problem as shown below.

This reformulates into the Lagrangian form as

Notice that here we added an extra term, λ||x||2, which carries out the regularization of the model.

This way of regularizing the model using the l2 norm is called the Ridge problem. The parameter (in fact, hyperparameter) λ controls the amount of regularization. There is a one-to-one correspondence between λ and S: both constrain the values of x, and the optimal values of these parameters are usually data-dependent.
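In scikit-learn, this corresponds to the Ridge estimator, whose alpha parameter plays the role of λ; the random data and the value alpha = 1.0 below are illustrative only.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
b = rng.standard_normal(50)

# alpha plays the role of the regularization parameter lambda
ridge = Ridge(alpha=1.0)
ridge.fit(A, b)

print(ridge.coef_)   # weights are shrunk toward zero, but rarely exactly zero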

l0 norm

The l0 norm of a vector x is the number of non-zero entries of x

By using the l0 norm for regularization, the problem becomes

We can visualize the solution space in 2D as shown below.

We can see that, though we obtain sparse solutions using the l0 norm, they are not unique. Moreover, since the l0 norm is non-convex, solving this problem is NP-hard. However, there are greedy algorithms, such as Matching Pursuit, that solve such problems approximately, as in the sketch below.
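For instance, scikit-learn's Orthogonal Matching Pursuit greedily selects one column of A at a time until the requested number of non-zero coefficients is reached. The data below is synthetic, and the 3-sparse ground truth is an illustrative choice.

import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20)
x_true[[2, 7, 11]] = [1.5, -2.0, 3.0]        # a 3-sparse ground truth
b = A @ x_true

# Greedily pick columns of A until 3 non-zero coefficients are used
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=3)
omp.fit(A, b)

print(np.nonzero(omp.coef_)[0])              # should recover the support {2, 7, 11}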

l1 norm

The l1 norm of a vector x is

By adding the l1 norm to the above problem, we can obtain sparse solutions, i.e., zero out the effect of less important features. For example, if

The solutions are simply

Therefore, the problem to solve becomes

We can visualize the solution space in 2D as shown below.

Again, by swapping the constraint and objective we can rewrite the same problem as shown below.

This is called the Lasso problem, with the constraint S limiting the values of x and thereby forcing the optimization algorithm to yield sparse solutions.

The same problem is reformulated into the Lagrangian form as

Though the Lasso yields sparse solutions, unlike Ridge it has no closed-form solution. However, we can use several methods, including matching pursuit, orthogonal matching pursuit, the fast iterative shrinkage-thresholding algorithm (FISTA), and the alternating direction method of multipliers (ADMM), to solve the optimization problem.
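In scikit-learn, a scaled version of this Lagrangian form is solved by the Lasso estimator (via coordinate descent); with the synthetic data below, most coefficients come out exactly zero. The value alpha = 0.1 is an illustrative choice.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20)
x_true[[2, 7, 11]] = [1.5, -2.0, 3.0]
b = A @ x_true + 0.01 * rng.standard_normal(50)   # slightly noisy observations

# alpha plays the role of lambda in the Lagrangian form
lasso = Lasso(alpha=0.1)
lasso.fit(A, b)

print(np.count_nonzero(lasso.coef_))   # only a few coefficients are non-zero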

Note

Sometimes the Lasso problem is also called the Basis Pursuit problem. Essentially both are the same and the terms are used interchangeably, although Lasso is more popular in the statistics and machine learning communities, while Basis Pursuit is more common in signal processing.

lp norm

The generalized lp norm for a vector x is

where 0<p<1.

By adding the lp norm to the given problem, we get

We can visualize the solution space in 2D as shown below.

By swapping the constraint and objective we can rewrite the problem as shown below.

The same problem is reformulated into the Lagrangian form as shown below.

Though we can obtain sparse solutions using the lp norm, it is non-convex, and hence the problem becomes NP-hard.

The figure shown below summarizes all the norms for regularization

Out of all these norms, the l1 norm is the best practical choice for regularization: it is the only one that is both convex and able to yield sparse solutions.

Applications

Now, we shall examine a few applications where sparsity is employed. In all of them, the formulation of the problem remains mostly the same; only the approach for finding the solutions changes.

Econometrics

Econometrics deals with predicting quantities related to the economy of a business or any other entity. It involves testing and evaluating various economic theories and hypotheses, which helps governments and businesses formulate appropriate policies.

Regression is widely used for solving various problems in econometrics. It is a technique for finding the correlation between a dependent variable (y) and a series of independent variables (xi). In linear regression, the dependent variable is a linear combination of the independent variables, as shown below.

For example, if a person working at a company receives a wage y, then x1 can be the amount of training, x2 can be their years of experience, and so on.

Note that in the above equation, the weights of the independent variables x1, x2, …, xn are denoted by α1, α2, …, αn. These x1, x2, …, xn do not correspond to the entries of the vector x in Ax=b; they are actually the values of a row of the matrix A, and α1, α2, …, αn are in fact the entries of x in Ax=b.

We know that the general form of the matrix equation is Ax=b.

This can also be formulated as a regression problem, where the output b is simply a linear combination of the columns of A, i.e., the matrix-vector multiplication of A and the vector x, as shown below.

Each element of x, also called a weight, indicates the extent to which the corresponding feature contributes to the final output. Thus, finding the optimal values of x, denoted by x*, is usually the objective of such problems.

Note: In linear algebra, the same problem is stated as finding the coefficients x* that give the vector in Span{a1, a2, …, an} closest to b.

The closed-form solution is

Here X⁺ = (X'X)⁻¹X' is the pseudoinverse or generalized inverse (more specifically, the Moore-Penrose inverse) of X.

Note: The pseudoinverse itself is computed using Singular Value Decomposition (SVD).
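As a small numerical check (with random data, purely for illustration), the least-squares weights can be computed directly with NumPy's pseudoinverse, which internally relies on the SVD, and they match the explicit normal-equations formula.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
b = rng.standard_normal(100)

# Closed-form least-squares solution x* = pinv(A) @ b
x_star = np.linalg.pinv(A) @ b

# Same answer as the normal-equations formula (A'A)^(-1) A' b
x_normal = np.linalg.solve(A.T @ A, A.T @ b)
print(np.allclose(x_star, x_normal))   # True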

The general assumption across all regression algorithms is that the feature vectors ai are linearly independent of each other (ideally, they even form an orthonormal basis), so that X'X is invertible.

Implementation

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)                    # X: the data matrix A, y: the output b
lin_reg.intercept_, lin_reg.coef_    # fitted bias term and weights x*

We solved the regression problem by finding the optimal values x*. However, if the system is under-determined, i.e., when the given dataset is small relative to the number of features, we can leverage the different regularization techniques discussed earlier to prevent the model from overfitting.

Netflix Movie Challenge

Netflix launched this challenge in 2006 to improve the recommendations of movies to its customers. SVD played a central role in most of the top-performing algorithms.

The data provided in this challenge is largely sparse, with many empty entries.

Each entry corresponds to the rating of a customer for a particular movie, and each column contains the ratings assigned by different customers.

The main goal was to recover the complete matrix from the given limited observations by finding the best approximation of the ratings in the empty entries. These ratings can then be used to recommend movies to other customers with similar rating patterns.

Therefore, we are essentially seeking an approximation Â that imputes all the missing entries of A, a problem called matrix completion.

One way to solve this is by constraining the rank of A. So the optimization problem to be solved is

or simply

This problem is non-convex and requires some heuristic knowledge to find local minima. We start with an initial guess for missing values and then iterate by minimizing the rank until convergence.

An alternative approach is to use the nuclear norm for which exact solutions are derivable. The nuclear norm of a matrix is simply the sum of its singular values. It is also called the trace norm.

We can define a projection operator over the subset Ω of observed entries that replaces the missing entries with zeros and leaves the observed entries unchanged.

Therefore, the optimization problem to be solved becomes

where ||.||* is the nuclear norm

We again start with an initial guess for the missing entries, compute the SVD of the resulting full matrix, and then threshold its singular values to obtain new estimates of the missing entries. This process is repeated until the problem converges, as in the sketch below.
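Below is a much simplified, soft-impute style sketch of this iteration, not the exact algorithm used in the challenge; the small ratings matrix M (with np.nan marking missing entries), the threshold tau, and the iteration count are all illustrative choices.

import numpy as np

def complete_matrix(M, tau=1.0, n_iters=200):
    # Iterative singular value soft-thresholding with the observed entries held fixed
    observed = ~np.isnan(M)
    A_hat = np.where(observed, M, 0.0)       # initial guess: zeros for missing entries
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(A_hat, full_matrices=False)
        s = np.maximum(s - tau, 0.0)         # soft-threshold the singular values
        A_hat = U @ np.diag(s) @ Vt
        A_hat[observed] = M[observed]        # copy the observed ratings back in
    return A_hat

# Tiny illustrative ratings matrix (rows: customers, columns: movies)
M = np.array([[5., 4., np.nan],
              [4., np.nan, 1.],
              [np.nan, 4., 2.],
              [1., 1., np.nan]])

print(np.round(complete_matrix(M), 2))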

Note: In practice, we allow Â to make some errors on the observed entries to prevent the model from overfitting.

You can find similar applications of recommendation systems in amazon.com, Target, Walmart, etc.

Video Surveillance

In video surveillance, each input frame B of the video is separated into a background L (the stationary part) and a foreground E (the moving parts).

Since the background is essentially the same in every input frame, it can be characterized by a low-rank matrix (a rank minimization problem).

On the other hand, the moving parts occupy only a limited portion of the input frame and are thus represented using a sparse matrix (a sparse regularization problem).

Therefore, the resultant problem is formulated as shown below.

This can be solved using singular value decomposition, where the background L is decomposed into the product UΣV'. Since the rank of L is constrained to k, we are only interested in the first k singular values. Therefore, we can rewrite the problem as

However, this is an NP-hard problem, so we relax it by introducing the nuclear norm. The resultant problem is

where ||.||* is the nuclear norm
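A heavily simplified sketch of the idea (not full Robust PCA, which alternates nuclear-norm and l1 minimization steps): stack the vectorized frames as columns of B, approximate the background L with a rank-k truncated SVD, and treat the residual as the foreground E. The random B and k = 1 below are stand-ins for real video data.

import numpy as np

def split_background_foreground(B, k=1):
    # Approximate the background as a rank-k matrix; the residual is the foreground
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    L = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # low-rank background
    E = B - L                                   # (approximately sparse) foreground
    return L, E

# B: each column is one vectorized video frame (random stand-in here)
rng = np.random.default_rng(0)
B = rng.standard_normal((1000, 30))
L, E = split_background_foreground(B, k=1)
print(np.linalg.matrix_rank(L))   # 1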

Note:

If the singular value decomposition of A is A = UΣV', and σ denotes the vector containing all the singular values of A, then the following hold (a quick numerical check follows the list):

||σ||0 equals the rank of A

||σ||1 equals the nuclear norm of A

||σ||2 equals the Frobenius norm of A
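These three identities are easy to check with NumPy on a random matrix (purely for illustration).

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
sigma = np.linalg.svd(A, compute_uv=False)

print(np.count_nonzero(sigma), np.linalg.matrix_rank(A))    # ||sigma||_0 equals rank(A)
print(sigma.sum(), np.linalg.norm(A, 'nuc'))                # ||sigma||_1 equals the nuclear norm
print(np.linalg.norm(sigma), np.linalg.norm(A, 'fro'))      # ||sigma||_2 equals the Frobenius norm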

Compressed Sensing

Compressed sensing (also known as compressive sensing, compressive sampling, or sparse sampling) is a technique for efficiently acquiring and reconstructing a signal, by finding solutions to under-determined linear systems.

The flowchart of a typical image acquisition process is as shown below.

Compressed sensing is based on the assumption that the image has a sparse representation, which allows it to be sampled at a rate below the Nyquist rate. The image is expected to have some redundancies; in fact, it is impossible to compress an image that has no redundancies, such as pure noise. These redundancies are essentially due to correlations present in the data.

So, the goal of any compression algorithm is to decorrelate the data, identify the importance of each component, and in turn, where feasible, reduce some of them to zero.

The image is compressed by sampling at a rate (M) lower than the Nyquist rate (N). This sparse representation of the image can easily be transmitted or stored, and when required it is reconstructed to emulate the quality of the original image. However, we can expect some perceptual artifacts, if not some loss, since this is compression after all. The recovered signal is obtained as Ax*, as in the sketch below.
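A minimal sketch of the recovery step on synthetic data: take M random measurements of an N-dimensional sparse signal and recover it with l1-regularized least squares (the Lasso standing in for Basis Pursuit). The dimensions, measurement matrix, and alpha below are all illustrative assumptions.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, M = 200, 60                                   # signal length and number of measurements (M < N)
x_true = np.zeros(N)
x_true[rng.choice(N, size=5, replace=False)] = rng.standard_normal(5)

A = rng.standard_normal((M, N)) / np.sqrt(M)     # random measurement matrix
b = A @ x_true                                   # compressed measurements

# l1-regularized recovery of the sparse signal from far fewer measurements than N
lasso = Lasso(alpha=1e-3, fit_intercept=False, max_iter=10000)
lasso.fit(A, b)
x_star = lasso.coef_

print(np.count_nonzero(np.abs(x_star) > 1e-3))   # close to 5, the true sparsity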

Consider the following example

The image on the left is the original image in grayscale. The image on the right is the compressed image formed by reducing some intensity values to zero.

You can see the effect of reducing the intensity values of some pixels to zero. Though the resultant compressed image is sparse, this method is naive since we are inadvertently losing too many details.

Below is a much better compression!

Compressed image

The concept of sparsity is employed in almost every domain, and therefore it is important to know the underlying intricacies to appreciate these applications. There are many other applications of sparsity in domains like neuroscience and medical imaging, but those are perhaps for another article.

I hope you enjoyed reading it. Let me know your comments below 👇.

References

  1. Statistical Learning with Sparsity: The Lasso and Generalizations, Trevor Hastie, Robert Tibshirani, and Martin Wainwright.
  2. Machine Learning: A Bayesian and Optimization Perspective, Sergios Theodoridis, Chapters 9 and 10.
  3. The Netflix Recommender System: Algorithms, Business Value, and Innovation, Carlos A. Gomez-Uribe and Neil Hunt, Netflix, Inc.

Originally published at https://saidarahas.com on August 26, 2020.
