Norms: essential for optimisation

Most optimisation problems reduce to norm minimisation.

Umer Hasan
3 min read · May 12, 2020

Norms are a way to measure the size of a vector, matrix, function, or tensor. There are many different types of norms, each useful for different things. Norms are usually written as the L-p norm for a particular value of p, for example p=0, p=1, p=2, p=∞.

[Figure: visualisation of the unit balls of the L-1, L-2, and L-∞ vector norms.]
The lines show the unit vectors (vectors of norm 1) under each norm. As p goes towards 0, the L-1 diamond collapses further until it reaches a cross shape.

The p-norm, for any p, essentially takes the absolute value of every entry in a vector or matrix to the pth power, sums the resulting values, and then takes the pth root. This is most clearly visualised with the L-2 vector norm, where we recover the Euclidean distance via the Pythagorean theorem.
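
As a rough sketch of that calculation (using NumPy and an arbitrary example vector, not one from the article), the generic p-norm can be written out directly and checked against NumPy's built-in:

```python
import numpy as np

x = np.array([3.0, -4.0])  # example vector, chosen so the L-2 norm is exactly 5

# Generic p-norm: |x_i|^p for every entry, sum, then take the p-th root.
def p_norm(x, p):
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

print(p_norm(x, 1))           # 7.0  (Manhattan distance)
print(p_norm(x, 2))           # 5.0  (Euclidean distance, by Pythagoras)
print(np.max(np.abs(x)))      # 4.0  (L-infinity norm: the largest absolute entry)

# NumPy's built-in norms agree:
print(np.linalg.norm(x, 1), np.linalg.norm(x, 2), np.linalg.norm(x, np.inf))
```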

The L-0 "norm" counts the number of non-zero entries in a vector, so a low L-0 norm means many zeros. This is the number we would like to know when investigating sparsity in a dataset. Data with a lot of 0s is easier to work with, so we may want to optimise for a lower L-0 norm and hence the most sparsity.
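
As a tiny illustration (NumPy, with a made-up vector), the L-0 "norm" is just a count of non-zero entries:

```python
import numpy as np

x = np.array([0.0, 3.0, 0.0, 0.0, -1.5])

# Count of non-zero entries; the fewer there are, the sparser the vector.
print(np.count_nonzero(x))      # 2
print(np.linalg.norm(x, 0))     # 2.0 — NumPy's ord=0 computes the same count
```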

The L-1 vector norm is better known as the Manhattan distance; we might want to use it in an ML model as a regularization technique. Similarly, the L-∞ norm can also be used for regularization. For more information on regularization, see this article by Prashant Gupta.
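
As a minimal sketch of what that looks like in practice (made-up weights and a made-up penalty strength, not tied to any particular library), an L-1 or squared L-2 penalty is simply a norm of the weight vector added to the training loss:

```python
import numpy as np

w = np.array([0.5, -2.0, 0.0, 1.5])  # hypothetical model weights
alpha = 0.1                          # regularization strength (arbitrary)

l1_penalty = alpha * np.sum(np.abs(w))  # L-1 penalty: pushes weights towards zero (sparsity)
l2_penalty = alpha * np.sum(w ** 2)     # squared L-2 penalty: keeps weights small overall

print(l1_penalty, l2_penalty)
```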

The L-2 norm is probably the most common. For a vector it is the Euclidean distance, as mentioned above. It is so important for matrices that there are 3 types of L-2 matrix norms; a short numerical sketch of all three follows the list below.

1. The induced L-2 matrix norm can be thought of as the maximum amplification factor a matrix provides. This can be seen by considering vectors which lie on the unit circle in 2D, or on an n-dimensional sphere in higher dimensions. This norm is also equal to the largest singular value of the matrix. If we transform the circle using our matrix, the singular values represent the scaling factors in each direction. Below we can see the circle is stretched horizontally while the vertical direction keeps the same values. The vectors which are more horizontal than vertical get scaled more. At the two edge cases, vectors that are completely vertical don't get scaled at all, and vectors that are completely horizontal get scaled the most.
[Figure: the transformation from a circle to an ellipse; the induced L-2 matrix norm is sqrt(10) in this case.]

2. The Frobenius norm is the translation of the L-2 vector norm to matrices: we take the square root of the sum of all entries squared. This is also the same as the square root of the sum of all singular values squared. This norm is instrumental in optimisation for neural networks.

3. The nuclear or trace norm is defined as the sum of the singular values. It is proving useful for tractable solutions and heuristics to intractable optimisation problems, for example as a convex surrogate for matrix rank. This research is being explored further to this day.
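
To tie the three together, here is a minimal numerical sketch (NumPy, with an arbitrary 2×2 matrix rather than the one from the figure) computing each norm from the singular values and checking against NumPy's built-in matrix norms:

```python
import numpy as np

# An arbitrary example matrix; all three L-2-flavoured matrix norms can be
# read off its singular values.
A = np.array([[3.0, 1.0],
              [1.0, 3.0]])

s = np.linalg.svd(A, compute_uv=False)   # singular values, here [4, 2]

induced_l2 = s.max()                      # largest singular value
frobenius  = np.sqrt(np.sum(s ** 2))      # sqrt of the sum of squared singular values
nuclear    = s.sum()                      # sum of the singular values

# NumPy's built-in matrix norms agree with the singular-value formulas:
print(np.isclose(induced_l2, np.linalg.norm(A, 2)))      # True
print(np.isclose(frobenius,  np.linalg.norm(A, 'fro')))  # True
print(np.isclose(nuclear,    np.linalg.norm(A, 'nuc')))  # True
```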

There are further norms not covered here, but use this as a starting point.

Minimising something subject to a constraint is the fundamental problem in optimisation, regression, neural networks, and, broadly speaking, everything. Humans have been trying to optimise since the start of time, from the first spear and wheel to the scale of automation today. In many problems the constraint depends on the specifics of the norm we are using. Therefore, norms are very important for learning the concepts underpinning optimisation. For a detailed overview, Gilbert Strang's lectures are highly recommended.


Further visualisations in this article by Brian Chivers.

Featured image source: https://upload.wikimedia.org/wikipedia/commons/d/d4/Vector-p-Norms_qtl1.svg
