Why you should consider studying Linear Algebra as a data scientist: SVD

Arnab · Analytics Vidhya · May 12, 2021

This article aims to give the reader a flavor of the topic, so that they can explore it in more depth on their own using the resources mentioned at the end.

Whether you’re doing least-squares approximation or reading a new paper in deep learning, a fundamental understanding of linear algebra is critical to internalizing key concepts.

The gist of Linear Algebra

First, a few definitions that we should all know:

Column space: You can imagine each column of a matrix as a vector in space. The space spanned by all linear combinations of those columns makes up the column space.

Red vectors are the linear combinations of the black vectors
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(projection='3d')  # fig.gca(projection='3d') is deprecated in newer matplotlib
ax.set_xlim3d(-1, 1)
ax.set_ylim3d(-1, 1)
ax.set_zlim3d(-1, 1)
ax.set_xlabel('x-axis')
ax.set_ylabel('y-axis')
ax.set_zlabel('z-axis')

A = np.array([[1, 0], [0, 1], [0, 0]])  # Initial column vectors
C = np.copy(A)
ax.quiver(0, 0, 0, *C[:, 0], normalize=True, color='k')
ax.quiver(0, 0, 0, *C[:, 1], normalize=True, color='k')
for i in range(50):  # Taking 50 random linear combinations
    x1 = np.random.uniform(-100, 100, 1)
    x2 = np.random.uniform(-100, 100, 1)
    ax.quiver(0, 0, 0, *(C[:, 0]*x1 + C[:, 1]*x2), normalize=True, color='r')
plt.show()

Each column of the matrix shown in the example ([1,0,0] and [0,1,0]) is represented by a black vector. The red vectors are linear combinations of both of these columns. Doesn’t it look like all the red vectors lie on the same plane? Because they do! All such vectors constitute the column space of the matrix. In this case, the column space is a plane (since there are two independent columns). But, if the number of independent columns is higher, the space can exist in higher dimensions.

Rank: The number of independent columns (or rows) in a matrix. A column counts as independent if it is not a linear combination of the other columns; essentially, you should not be able to build that particular column out of the remaining ones.

A tiny 3x3 matrix where the rank is 2, since the third column is dependent on the other two.
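As a quick, hypothetical stand-in for that figure (the matrix below is my own example, not the one from the original image), here is a 3x3 matrix whose third column is the sum of the first two:

A = np.array([[1, 0, 1],
              [0, 1, 1],
              [2, 3, 5]])  # third column = first column + second column
np.linalg.matrix_rank(A)   # 2 -- the dependent column adds no new information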
from numpy.linalg import matrix_rank
matrix_rank(np.eye(3))  # returns 3

An identity matrix has rank 3, since all the columns are independent of each other.

At a very high level, we can view the rank of a matrix as its ‘information content’. For instance, if I have an image of 1000x1000 pixel values, why would I want to store all of them if half as many values carry sufficient information?

To get such a low-rank approximation, we can use several methods such as SVD or NMF. The following pictures show that a low-rank approximation is often not that bad at all.

Not much of a difference, is it?
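As a rough sketch of the same idea (using a synthetic matrix instead of the actual image from the post), we can keep only the first few singular values and check how little we lose:

import numpy as np

np.random.seed(0)
signal = np.random.randn(100, 5) @ np.random.randn(5, 100)   # a rank-5 'signal'
A = signal + 0.01 * np.random.randn(100, 100)                # plus a little noise

U, s, VT = np.linalg.svd(A)
k = 5
A_k = U[:, :k] @ np.diag(s[:k]) @ VT[:k, :]                  # rank-k approximation

rel_error = np.linalg.norm(A - A_k) / np.linalg.norm(A)
print(f"Relative error of the rank-{k} approximation: {rel_error:.4f}")  # very small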

Linear Transformations

Any matrix can be thought of as a linear transformation (with respect to a fixed basis ). Whenever we multiply a vector with a matrix, we can think of a scaling, shearing, or rotating effect applied to the vector.

Let’s look at a simple example:

We form a 2 x (N*N) matrix and represent it using a scatterplot. Then we apply a transformation to the matrix and see the results.

def build_array(startx, endx, starty, endy, N):
    ''' Build an array of 2 x (N*N) '''
    xvals = np.array(np.linspace(startx, endx, N))
    yvals = np.array(np.linspace(starty, endy, N))
    arr = np.vstack(np.meshgrid(xvals, yvals)).reshape(2, N*N)
    return arr

def visualize_array(arr):
    ''' scatterplot to visualize array '''
    plt.figure(figsize=(7, 7))
    plt.xlim([0, 30])
    plt.ylim([0, 20])
    plt.scatter(arr[0], arr[1])

arr_1 = build_array(0, 10, 0, 10, 11)            # array
visualize_array(arr_1)
transformation = np.array([[1, 1], [0, 1]])      # transformation
final_array = np.dot(transformation, arr_1)      # apply the transformation
visualize_array(final_array)
Original space on the right, transformed on the left

We can see that the x-axis has stayed in place, owing to [1,0] in the first column of the matrix (the same as the identity), while a shear has occurred due to [1,1] in the second column. I urge you to try the code yourself, apply various transformations, and see the resultant shape, for instance the rotation sketched below.
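For example (my own addition, reusing the helper functions defined above), a rotation matrix spins the grid instead of shearing it:

theta = np.pi / 6  # rotate by 30 degrees
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
rotated_array = np.dot(rotation, arr_1)   # rotate the original grid
visualize_array(rotated_array)            # widen the plot limits if some points fall outside the window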

Eigenvalues and Eigenvectors:

Av = λv. This is the equation we were taught in high school. But what was not clear (at least to me at the time) was the visual intuition behind it. Continuing our line of thought of a matrix as a linear transformation, eigenvectors are the vectors whose direction is not affected when we apply the transformation A to them, and the eigenvalue is simply the scaling factor. Let’s quickly take a look at an example.

def transform(A, B):
    ''' plot the transformation of a 2x1 vector, given a 2x2 matrix '''
    fig = plt.figure(figsize=(7, 7))
    param = np.amax(A) * np.amax(B)
    plt.xlim([-1, param])
    plt.ylim([-1, param])
    plt.quiver(0, 0, B[0], B[1], color='k', scale=1, units='xy', label='original_vector')
    C = A[:, 0]*B[0] + A[:, 1]*B[1]  # Expressing the product as a linear combination of A's columns
    plt.quiver(0, 0, C[0], C[1], color='b', scale=1, units='xy', label='post_transformation_vector')
    plt.legend()
    plt.show()

A = np.array([[2, 4], [0, 2]])  # Transformation
B = [[1], [0]]                  # Eigenvector
transform(A, B)
B = [[2], [1]]                  # Other vector
transform(A, B)
The eigenvector is not affected (beyond scaling) by the transformation A

We can see that the eigenvector was only ‘stretched’ by the transformation. The scaling factor is nothing but the eigenvalue.
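We can confirm this numerically (a small check I am adding here) with np.linalg.eig:

eigenvalues, eigenvectors = np.linalg.eig(np.array([[2, 4], [0, 2]]))
print(eigenvalues)   # [2. 2.] -- the scaling factor we saw in the plot
print(eigenvectors)  # the first column is [1, 0], the eigenvector we transformed above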

An interesting property arises for these eigenvectors when we have symmetric matrices (Transpose(A) = A): the eigenvectors turn out to be orthogonal, or in other words, perpendicular. We use the same function as above to illustrate an example.

A=np.array([[2,1],[1,2]]) # Symmetric Matrix
B=[[1],[1]] #Eigenvector
transform(A,B)
A=np.array([[2,1],[1,2]]) # Symmetric Matrix
B=[[-1],[1]] #Eigenvector
transform(A,B)
The eigenvectors are orthogonal (at 90 degrees to each other)

Now, why would we care about the eigenvectors being orthogonal? Imagine you could factor a matrix into its eigenvectors and eigenvalues; if the matrix is symmetric, you would end up with factor columns that are orthogonal to each other. Orthogonal columns mean each of those directions is completely distinct from the others, and this is how you would go about ‘categorizing’ the entries of an original matrix in the context of unsupervised learning.
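To make this concrete (a small verification of my own, not from the original post), np.linalg.eigh factors a symmetric matrix into orthogonal eigenvectors and a diagonal of eigenvalues:

A = np.array([[2, 1], [1, 2]])      # the symmetric matrix from above
eigenvalues, Q = np.linalg.eigh(A)  # eigh is built for symmetric matrices
print(np.allclose(Q @ np.diag(eigenvalues) @ Q.T, A))  # True: A = Q * diag(eigenvalues) * Transpose(Q)
print(np.allclose(Q.T @ Q, np.eye(2)))                 # True: the eigenvectors are orthonormal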

But there’s a problem: not all matrices are symmetric, and they are definitely not all square (eigenvectors don’t exist for non-square matrices). Can you imagine a dataset with the same number of rows and columns? In the real world, this would hardly ever happen. So what do we do?

We force symmetry: multiplying the transpose of a matrix with itself results in exactly the kind of symmetric (and square) matrix that we want. This is where the whole idea of singular values and the SVD kicks in.
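A quick sanity check (again, a toy example of my own) shows why this trick always works:

X = np.random.randn(6, 3)    # a non-square, non-symmetric matrix
M = X.T @ X                  # 'forcing' symmetry
print(np.allclose(M, M.T))   # True: Transpose(X) @ X is always square and symmetric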

SVD:

Eigendecomposition: for a symmetric matrix, A = Q Λ Transpose(Q), where the columns of Q are the (orthogonal) eigenvectors and Λ is a diagonal matrix of the eigenvalues.

For symmetric matrices, the ‘eigendecomposition’ above is enough, but for a general n x d matrix we use the SVD:

A = U Σ Transpose(V)

The matrices U and V have orthogonal columns. In the middle sits Σ, a diagonal matrix of ‘singular values’ (the substitute for eigenvalues); crucially, they are ordered from largest to smallest (by convention). Keeping only the first k singular values gives us the best rank-k approximation of A. This is important since most of our information can be reconstructed from the first few singular values. If the rank of the matrix is r, the first r columns of U span the column space of A, and the first r columns of V span the row space of A. I didn’t explain row space? It’s just the column space of Transpose(A). What about the remaining (d - r) columns of V beyond those first r? They span the nullspace of A (but that’s for another day).
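Here is a minimal numerical sketch (mine, not from the post) of these properties on a small random matrix:

A = np.random.randn(8, 5)
U, s, VT = np.linalg.svd(A, full_matrices=False)
print(s)                                    # singular values, already sorted largest to smallest
print(np.allclose(U.T @ U, np.eye(5)))      # True: the columns of U are orthonormal
print(np.allclose(VT @ VT.T, np.eye(5)))    # True: the columns of V are orthonormal
print(np.allclose(U @ np.diag(s) @ VT, A))  # True: A = U * Sigma * Transpose(V)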

Let’s take a look at a small example of how we can use this for topic modeling. We will use the fetch_20newsgroups dataset from scikit-learn. This gives us around 13,000 documents from various categories that our model knows nothing about. To construct our initial matrix we use Tf-Idf, so our matrix is of the form documents x words.

So again, the rows are the documents, the columns are the words, and each entry holds the tf-idf score for that particular word in that document.

#### BUILDING UP THE DATAFRAME
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000, max_df=0.5, smooth_idf=True)  # picking 1000 words
df = pd.DataFrame({'document': documents})
df['docs'] = df['document'].str.replace("[^a-zA-Z#]", " ", regex=True)  # cleaning
vectors = vectorizer.fit_transform(df['docs']).todense()
words = vectorizer.get_feature_names_out()  # the 1000-word vocabulary (use get_feature_names() on older scikit-learn)
tf_dict = {}
for i, k in enumerate(vectors):
    a = np.squeeze(np.array(k))
    tf_dict[f'Doc{i}'] = a
documents = pd.DataFrame.from_dict(tf_dict)
documents.index = words
documents.T  # Final DataFrame

After this, we would get a sparse matrix that would look something like this.

Dataframe for SVD decomposition
### SVD AND VIEWING TOPICS
from scipy.linalg import svd

U, s, VT = svd(vectors)

def display_topics(model, feature_names, no_top_words):
    topic_dict = {}
    for topic_idx, topic in enumerate(model):
        topic_dict["Topic %d words" % (topic_idx)] = ['{}'.format(feature_names[i])
                                                      for i in topic.argsort()[:-no_top_words - 1:-1]]
        topic_dict["Topic %d weights" % (topic_idx)] = ['{:.1f}'.format(topic[i])
                                                        for i in topic.argsort()[:-no_top_words - 1:-1]]
    return pd.DataFrame(topic_dict)

VT will contain our right singular vectors.

The eigenvectors of Transpose(A) x A (which turns out to be symmetric) give us our orthogonal right singular vectors. Recall the shape of our document matrix: 13000 x 1000. Can you guess what the shape of VT will be?
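Here is a small illustration of both points with made-up dimensions (13 x 10 instead of the real 13000 x 1000 matrix):

A_small = np.random.randn(13, 10)              # stand-in for the documents x words tf-idf matrix
U_small, s_small, VT_small = np.linalg.svd(A_small)
print(VT_small.shape)                          # (10, 10) -- one right singular vector per word dimension

eigenvalues, eigenvectors = np.linalg.eigh(A_small.T @ A_small)
# The eigenvalues of Transpose(A) @ A are the squared singular values (eigh returns them in ascending order)
print(np.allclose(np.sort(eigenvalues)[::-1], s_small**2))  # True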

If we examine VT and order each row’s words by their weights, we get the most important words for each topic.

#Examining the top 10 words
display_topics(VT[1:3],words,10)

We can see that the first topic looks to be about computers and the second one seems to be about sports.

Let’s check whether the topics are orthogonal, as we expect them to be.

np.dot(VT[2:3], VT[:1].T)  # Multiplying two different topic vectors
# Output: array([[-1.73472348e-17]])
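If we wanted to check every pair of topics at once (an extra check, not in the original post), the rows of VT should multiply out to the identity matrix:

print(np.allclose(np.dot(VT, VT.T), np.eye(VT.shape[0])))  # True: all topics are mutually orthogonal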

It is difficult to articulate everything going on under the hood in one blog post. If you’re new to linear algebra, most of this probably sounded like alien jibber-jabber to you. Hence, I’ll recommend a learning path for linear algebra that will help you deep-dive into the topic.

Resources

There are three main resources for going about this; I would suggest the following order:

  1. 3Blue1Brown Linear Algebra Series: A visual intuition into the concepts I mentioned here, and a great starting point.
  2. Gilbert Strang Lectures on Linear Algebra: This course was meant for understanding linear algebra for data scientists. You can also get the book that goes along with the lectures. The concepts discussed here are theoretical and they will surely help out in the long run.
  3. Fast-AI Computational Linear Algebra: Unlike the Gilbert Strang videos, this focuses on practical applications of the concepts. Once you learn all the theoretical concepts from the Gilbert Strang videos, these lectures will be very easy to follow.

You can find all the code mentioned in this post here.
