Linear Algebra for Machine Learning

Foundational ideas, musings, and examples

Joe Tenini, PhD
Red Ventures Data Science & Engineering
21 min read · Oct 4, 2019

This post comes from the notes for a short course on linear algebra I gave at Red Ventures as part of our data science accelerator program (you can learn more about the program here). As such, the notes are both less formal and less comprehensive than a textbook, but they may be a useful way to get a quick tour of some of the foundational framework for machine learning problems.

Photo by Federico Respini on Unsplash

Notation

Throughout this course I’ll be using some notation that is standard in mathematics, but may seem very funny otherwise (this tastes funny — not “ha ha” funny). Since notation questions are very hard to google, please ask for clarifications in the comment section so that I can do a better job of explaining myself.

Common fields

I will generally use k for an arbitrary field, but there are some fields that come up so often that they get special treatment. In most documents, they are written with “blackboard” styling, but for the purposes of this article, I’ll just use bold so that they can be written inline without any magic character rendering:

  • Q: The field of rational numbers — the set of numbers that can be represented as a/b where a and b are integers.
  • R: The field of real numbers.
  • C: The field of complex numbers — the set of numbers of the form a + bi where a and b are real numbers.

So . . . what’s a field? A field is just a set with an addition and multiplication operation that are well behaved. In particular, each element must have an additive and a multiplicative inverse, meaning every number a has a number that you can add to it to get 0 and a number that you can multiply it by to get 1 (but 0 is special and doesn’t have a multiplicative inverse). So the rational numbers are a field, but the integers are not.

Set notation

We can think of a set S as a collection of objects a, b . . . We will usually use “braces” when defining sets, i.e.

S = {a, b, c, . . . }

If we wish to express that a particular element s is contained within a set S, then we will write s ∈ S.

Sometimes we will define sets using conditions; there is a special notation for this. For example, we could define the set of rational numbers as:

Q = { x ∈ R | x = a/b for some a, b ∈ Z with b ≠ 0 }

which would be read: “Define Q to be the set of elements x in R satisfying the property that x = a/b for some a and b (with b nonzero) in the set of integers Z.”

Given a set S, we may want to refer to the “number of elements in S” (or the cardinality of S), denoted |S|. This is straightforward for finite sets, but can be a little weird for infinite sets. We won’t fuss over the different cardinalities of infinite sets, though it can be fascinating to think about if you’re into that sort of thing.

Functions

A function is a map between sets. Formally, if A and B are sets, then a function from A to B (written f : A → B) is a set:

f ⊆ A × B = { (a, b) | a ∈ A, b ∈ B }

with the property that each a ∈ A appears in exactly one pair; in particular, if (a,b) ∈ f and (a,b’) ∈ f, then b = b’. (In grade school, this property is often described as the “vertical line test.”) We almost never think of f as a set though — we usually think of it as an assignment rule: if (a,b) ∈ f, we say that f maps a to b, written f(a) = b or a ↦ b.

Other notation

A few examples in these notes will take a bit more time to grasp or will be more obscure. I’ll label these examples or remarks with a ♯ as in, watch out, it’s a bit sharp, don’t cut yourself!

Musings and framings

In mathematics these days, one useful concept for understanding certain ideas (maybe it’s topological spaces, or groups, or smooth manifolds) is the categorical perspective. A category C roughly speaking is a set of objects O and a set of morphisms (or arrows) M. To study objects O generally, or a specific object o ∈ O, we often proceed by considering morphisms (or maps) generally or specifically morphisms to or from o.

In this way we can often expose much more of the structure than if we just squint at the objects themselves. (This is not really a new or novel idea; scientists use this idea all the time — if you’d like to understand some new thing, very often watching the way that it interacts with known quantities is fruitful.)

A second tactic is to push your analysis to a simpler category by something called a functor C → C’. This has a very precise meaning in mathematics, and its application has led to leaps forward for the field. In data science, we apply this in spirit — the problem of understanding data is very tricky, but often we can be successful in pushing the problem down to a problem in the category of vector spaces. Linear algebra is the study of this category.

In this short course, we are going to try to understand vector spaces and maps between vector spaces. For those who find themselves transforming data to the point where it is vectors of real numbers . . . congratulations! You’re now in a vector space and can apply your new-found understanding.

Vector Spaces

What is a vector space? Unfortunately one cannot be told what a vector space is — you have to see it for yourself. Only kidding —

Example 1: let’s consider real 2-dimensional space R², introduced in many schools as the x-y plane.

The set R² is first and foremost a set. That is, we can describe it as:

R² = { (a, b) | a, b ∈ R }

that is, the set of all pairs of real numbers. But it is more than a set! It also has a binary operation “+” defined as:

(p, q) + (r, s) = (p + r, q + s)

This binary operation has some fancy properties (which it borrows from R).

  1. Associativity: (p + q) + r = p + (q + r)
  2. Identity: There is an element i (here i = (0, 0)) such that i + p = p for every p.
  3. Inverses: For each p there is an element q such that p + q = q + p = i
  4. Commutativity: For each p and q, we have p + q = q + p

A set with properties (1) and (2) is called a monoid. A set with properties (1), (2), and (3) is called a group, and a set with all four properties is called an abelian group.

So R² is also a monoid, and a group, and an abelian group under addition. But it’s not just these things! There is also an action of the field R on our abelian group R², given by a(p, q) = (ap, aq). If v, w ∈ R² and a, b ∈ R, then we have the following properties:

  1. 1v = v
  2. (ab)v = a(bv)
  3. (a + b)v = av + bv
  4. a(v + w) = av + aw

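These properties are easy to poke at numerically. Here is a quick sketch in R (the vectors and scalars are arbitrary choices of mine) showing that componentwise addition and scaling behave exactly as the axioms promise:

    # two vectors in R^2 and two scalars (arbitrary choices)
    v <- c(1, 2); w <- c(3, -1)
    a <- 2;       b <- -0.5

    v + w                                   # componentwise addition: (4, 1)
    all.equal((a * b) * v, a * (b * v))     # property 2: (ab)v = a(bv)
    all.equal((a + b) * v, a * v + b * v)   # property 3: (a + b)v = av + bv
    all.equal(a * (v + w), a * v + a * w)   # property 4: a(v + w) = av + aw
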
In general when the thing S acting on the abelian group is just a ring, we call the total package an S-module. In the special case when S is a field (like Q, R, or C), we call this thing a vector space.

Additional examples of vector spaces

Example 2: Consider C as a vector space over R — that is, the set of complex numbers a + bi, where a, b ∈ R, considered with an action of R given by:

c(a + bi) = ca + cbi

Example 3: Consider the set R[x] of polynomials in x with real coefficients.

Example 4 (Reproducing Kernel Hilbert Spaces ♯): Consider the function:

K : R² × R² → R given by (x, y) ↦ exp(-|x-y|²)

where x, y are vectors in R² and |.|² refers to the square of the euclidean distance function given by: |(a,b)-(a’,b’)|² = (a-a’)² + (b-b’)².

This function is an example of a kernel (a kernel is a symmetric, positive definite function). For each x ∈ R², we can define a function:

k(x) : R² → R by mapping y ↦ K(x, y)

The set of all functions of the form ∑ α(i)k(x(i)), where the α(i) are in R and the x(i) are points of R² (after adding some points by a technical process called completion), is a very special vector space called a reproducing kernel Hilbert space. When people are using an SVM (Support Vector Machine) classifier on a vector space V, they are choosing a kernel K and then solving the classification problem in the (usually much larger) associated reproducing kernel Hilbert space. The kernel we have chosen above is known as an RBF kernel.

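To make this concrete, here is a small sketch in R that evaluates the kernel above on a handful of points in R² and assembles the corresponding kernel (Gram) matrix; the points themselves are just ones I made up:

    # the RBF kernel K(x, y) = exp(-|x - y|^2) on R^2
    K <- function(x, y) exp(-sum((x - y)^2))

    # a few arbitrary points in R^2, one per row
    X <- rbind(c(0, 0), c(1, 0), c(0, 2))

    # Gram matrix: entry (i, j) is K(X[i, ], X[j, ])
    n <- nrow(X)
    G <- matrix(0, n, n)
    for (i in 1:n) for (j in 1:n) G[i, j] <- K(X[i, ], X[j, ])
    G   # symmetric, with ones on the diagonal
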
So how many vector spaces have we seen so far? We now have three (or four) examples of “real vector spaces” (sometimes we call vector spaces that have an action of R a real vector space). Are they all the same? What does that even mean?! In general, how do we compare objects in a category?!!

This is where our morphisms come into play — they are the tools that allow us to say things like “these two objects are the same” or “this object fits inside this other object, so it’s not bigger.” So what is a morphism in the category of vector spaces?

Linear Transformations

A morphism in the category of vector spaces is called a linear transformation. Suppose we have two vector spaces V and W over the same field k, with vectors v, v’ ∈ V and a ∈ k. First and foremost, a linear transformation is a function from V to W as sets: f : V → W. However, as we’ve mentioned, V and W are more than sets! They have two operations. For f to be a linear transformation, f must respect these operations, by which we mean that f satisfies the following two properties:

  1. f(v + v’) = f(v) + f(v’)
  2. f(av) = a f(v)

Just like V and W have nested structures (they are sets and groups and vector spaces over R), f is a morphism of sets (we call it a function), a morphism of groups (we call it a homomorphism), and a morphism of R vector spaces (we call it a linear transformation).

In general in a category, we consider two objects V and W to be equivalent (usually the phrase is isomorphic) when there are morphisms f : V → W and g: W → V so that f ∘ g is the identity on W and g ∘ f is the identity on V.

So . . . now we can ask the question: “Are R² and C isomorphic (the same)?” to which you should now respond “Isomorphic as what?!”

Examples of linear transformations

Example 5: R² and C are isomorphic as vector spaces over R.

Let f : R² → C be given by (a, b) ↦ a + bi and let g : C → R² be given by a + bi ↦ (a, b). Verify for yourself that f ∘ g and g ∘ f are identity maps on C and R² respectively. To complete the proof of isomorphism you need to also verify that these functions are linear — that is, they satisfy the above two properties.

Example 6: Consider the linear transformation f from the space of degree ≤ 2 polynomials to the space of degree ≤ 1 polynomials given by:

a + bx + cx² ↦ a + bx

Can you convince yourself that this is linear? Could f be an isomorphism?

Example 7: Consider the function f : R → R given by a ↦ 2a + 1.

Is this function a linear transformation?

You may find it surprising that determining whether or not a function is a linear transformation is not more straightforward . . . well, it kind of is, but to understand why, we need to introduce the concept of a basis.

The basis of a vector space

Let’s now introduce an important concept known as a basis. This concept will help us greatly simplify our model for what vector spaces and linear transformations will look like. We will appeal to two theorems that we do not prove in these notes. If you don’t like the feeling of accepting statements without proof . . . great! I encourage you to try to prove them yourself first.

A basis of a vector space V is a set of vectors B = {v(1), v(2), . . . } ⊂ V such that for each u ∈ V, there is a unique set of scalars {α(1), α(2), . . . } such that

u = α(1)v(1) + α(2)v(2) + . . .

(note: the more standard notation would be to have i as a subscript to α and v, but I don’t know of a way to do this in Medium. If you do, please let me know in the comments!)

Note that this definition actually has two properties:

  1. Existence of a representation for every point u ∈ V
  2. Uniqueness of this representation.

These properties are often useful to discuss separately: If B has property (1), then we say that B spans V. More generally, we can refer to span(B) ⊆ V, the set of all vectors in V that can be written as a combination of vectors in B. Property (2) is referred to as linear independence — we say that B is linearly independent if each vector in the span of B can be represented uniquely in terms of B.

In this sense, a basis is a kind of minimal generating set for your vector space V. We now state two results that we will want to have on hand:

Theorem 1. Every vector space has a basis.

Theorem 2. (Dimension theorem for vector spaces) If B and B’ are bases of V, then |B| = |B’|. That is, B and B’ have the same cardinality.

The second result is what makes the following definition well defined:

Let V be a vector space and B be a basis. We call |B| the dimension of V.

Example 8: Consider the real vector space R². Then we can see that:

  • {(1, 0)} is linearly independent but does not span R² — it is not a basis.
  • {(1, 0), (0, 1), (1, 1)} spans R² but is not linearly independent — it is not a basis.
  • {(1, 0), (0, 1)} is linearly independent and spans R² . . . it is a basis! (A quick check of these claims in code follows below.)

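One way to check claims like these on a computer is to put the candidate vectors into a matrix as columns and look at its rank: the set spans R² exactly when the rank is 2, and it is linearly independent exactly when the rank equals the number of vectors. A sketch in R:

    rank_of <- function(M) qr(M)$rank

    B1 <- cbind(c(1, 0))                    # {(1, 0)}
    B2 <- cbind(c(1, 0), c(0, 1), c(1, 1))  # {(1, 0), (0, 1), (1, 1)}
    B3 <- cbind(c(1, 0), c(0, 1))           # {(1, 0), (0, 1)}

    rank_of(B1)  # 1: independent, but rank < 2, so it does not span R^2
    rank_of(B2)  # 2: spans R^2, but rank < 3 vectors, so not independent
    rank_of(B3)  # 2: independent and spans R^2 -- a basis
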
Example 9: What is the dimension of R²?

Example 10: What is the dimension of C as a real vector space? What about as a complex vector space (that is, a vector space over C)?

Solving linear systems

We move now to a question that comes up periodically in machine learning — solving linear systems. This question may seem unrelated to our discussion on bases, but we will see a connection soon enough.

Consider the following system of m equations in n unknowns x(1), . . . , x(n):

a(1,1)x(1) + a(1,2)x(2) + . . . + a(1,n)x(n) = b(1)
a(2,1)x(1) + a(2,2)x(2) + . . . + a(2,n)x(n) = b(2)
. . .
a(m,1)x(1) + a(m,2)x(2) + . . . + a(m,n)x(n) = b(m)

This is typically called a system of linear equations. Under which conditions does there have to be a unique solution?

This may seem like a very difficult problem, but let’s write it with different notation:

Ax = b

where A is the m × n matrix of coefficients a(i,j). Let’s rewrite it again as

x(1)a(1) + x(2)a(2) + . . . + x(n)a(n) = b

where the a(i)’s (the columns of A) and b are m-dimensional real vectors and the x’s are real numbers.

We can ask again, when does this equation have to have a unique solution? Well, by definition, if n = m and the a’s constitute a basis, then there is a unique solution!

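When n = m and the columns do form a basis, finding the unique solution is a one-liner in R with solve(); the numbers below are arbitrary, chosen only so that the columns of A are linearly independent:

    A <- cbind(c(1, 0, 2), c(0, 1, 1), c(1, 1, 0))  # columns a(1), a(2), a(3)
    b <- c(3, 4, 5)

    x <- solve(A, b)   # the unique x with x(1)a(1) + x(2)a(2) + x(3)a(3) = b
    A %*% x            # recovers b

    # if the columns were not a basis (say a(3) = a(1) + a(2)), solve()
    # would throw an error complaining that the system is singular
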
Example 11: When are there no solutions to the above problem? When are there infinitely many solutions?

Example 12 (Bellman equations for MDPs ♯): For those wondering where we might apply this knowledge in machine learning, an exciting application can be found in considering Markov Decision Processes or MDPs. An MDP consists of a collection of states in which an agent chooses actions and is presented with rewards. A policy π prescribes a way to choose actions given a state. One question that comes up in the study of MDPs is to estimate the value V(s) of being in a given state s under a given policy π, defined as the sum of rewards received, discounted in time by a factor of γ. There is a natural recursive relationship between these values described by the Bellman equation:

V(s) = R(s, π(s)) + γ ∑ P(s’|s, π(s)) V(s’)

(the sum running over all states s’), where the R(s, π(s)), γ, and P(s’|s, π(s)) terms are all constants that can be estimated. You’ll notice that if there are n total states, then there are n relationships like this, each linear in the V(s) terms (of which there are also n). Thus, as long as there is linear independence between the relationships, there is a unique solution to the system, and we can determine the values of each state!
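
As a sketch of how this plays out in code, here is a made-up three-state MDP in R (the rewards, discount factor, and transition probabilities below are all invented for illustration); the n Bellman equations can be rearranged as (I - γP)V = R and handed to a linear solver:

    gamma <- 0.9
    R <- c(1, 0, 2)                  # R(s, pi(s)) for the three states
    P <- rbind(c(0.5, 0.5, 0.0),     # P(s' | s, pi(s)); rows sum to 1
               c(0.1, 0.8, 0.1),
               c(0.0, 0.3, 0.7))

    # V(s) = R(s, pi(s)) + gamma * sum_s' P(s'|s, pi(s)) V(s')
    # in matrix form: (I - gamma * P) V = R
    V <- solve(diag(3) - gamma * P, R)
    V                                # the value of each state under pi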

Linear transformations in light of a basis

Defining a linear transformation

Let’s now restrict ourselves to the case of vector spaces with finite bases — so called finite dimensional vector spaces. We are now ready to make a critical observation:

A linear transformation is determined by what it does to a basis

That is, let’s say that V is a real vector space with basis {v(1), v(2), . . . ,v(n)}. Let’s say we’d like to build ourselves a linear transformation L : V → W where W is some other vector space. For a given v ∈ V, what should L(v) be?

Well, we know that v = a(1)v(1) + a(2)v(2) + . . . + a(n)v(n) for some real numbers a(i). By the properties of linear transformations, we have that:

L(v) = a(1)L(v(1)) + a(2)L(v(2)) + . . . + a(n)L(v(n))

So, since any element of V can be written in terms of the v(i) (our basis), we see that specifying a linear transformation V → W is equivalent to choosing n vectors in W — that is, choosing L(v(1)), . . . ,L(v(n)) ∈ W.

Example 13 (Classification theorem for finite dimensional vector spaces): Prove to yourself that all vector spaces over a field k of dimension n are isomorphic.

Rank and Nullity

We have also seen how powerful the concept of dimension can be in describing vector spaces. In fact, we’ve proven that in the finite dimensional case, dimension characterizes the isomorphism class! We would now like to leverage this language for describing linear transformations as well.

Let L : V → W be a linear transformation. Let {v(1), v(2), . . . ,v(n)} be a basis of V. Consider the span of the set {L(v(1)), L(v(2)), . . . , L(v(n))} (we defined the span of a set of vectors above, in the section on bases). This is itself a vector space, written L(V) and often referred to as the image of L. Since L(V) is a vector space, it has a dimension; we call this number the rank of L.

Similarly, we can consider the set of vectors in V that are sent to the zero vector (the vector of all zeros) under L. We call this set the kernel of L, written ker(L) (not to be confused with the kernel function mentioned in the example on reproducing kernel Hilbert spaces!). As we will see in the following exercise, however, ker(L) is more than a set!

Exercise 14: Prove to yourself that ker(L) = {v ∈ V | L(v) = 0 } is a vector space.

Since ker(L) is a vector space, we can ask for its dimension. We refer to this number as the nullity of L. We will now state (without proof) the so-called “Rank-Nullity” Theorem, which governs the relationship between dim(V), rank(L), and nullity(L).

Theorem (Rank-Nullity): Let L : V → W be a linear transformation where V is finite dimensional. Then:

dim(V) = rank(L) + nullity(L)

or said differently:

dim(V) = dim(L(V)) + dim(ker(L))

If you are coming from a data perspective (thinking of points in a vector space as data), then there is a nice intuition behind the rank-nullity theorem. When you transform your data linearly, you start with a certain amount of information, represented by dim(V). The amount of information you have after the transformation (rank(L)) is equal to the amount of information you started with (dim(V)) minus the amount of information lost by the transformation (nullity(L)).

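Here is a small numerical illustration of the theorem in R. The matrix below is one I wrote down so that its third column is the sum of the first two, which forces some information to be lost:

    L <- cbind(c(1, 0, 1), c(0, 1, 1), c(1, 1, 2))  # third column = first + second

    n      <- ncol(L)       # dim(V), the dimension of the domain
    rank_L <- qr(L)$rank    # dim(L(V)), the rank
    null_L <- n - rank_L    # nullity, the dimension of ker(L)

    c(dim_V = n, rank = rank_L, nullity = null_L)   # 3 = 2 + 1
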
Exercise 15: Let L : V → V be a linear transformation. The inverse of L is a map J such that L ∘ J and J ∘ L are the identity on V. Given L, when does J exist? (Hint: it exists only when nullity(L) is a special value)

Coordinates

So far we’ve been writing things really generally. Usually when people talk about vector spaces they are writing tuples, e.g. (1, 3, 0, 0, 7), and matrices (rectangular arrays of numbers).

So what’s going on here? Well, for finite dimensional vector spaces (remember, for a given field and a given dimension there is only one, up to isomorphism!) there is a natural choice of basis: the standard basis. For R⁴, the standard basis looks like this:

{(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)}

which we would abbreviate: {e(1), e(2), e(3), e(4)}.

We then use the shorthand (7, 2, 1, 4) to stand for 7e(1) + 2e(2) + 1e(3) + 4e(4). But what about the matrix . . . what is the matrix?

The matrix is a way to encode the information of a linear transformation. Suppose we have a linear transformation L : R³ → R⁴. We’ve said this is equivalent to choosing 3 vectors in R⁴: L(e(1)), L(e(2)), and L(e(3)).

Say that these vectors are given as below:

  • L(e(1)) = (1, 0, 7, 3)
  • L(e(2)) = (0, 4, 4, 4)
  • L(e(3)) = (2, 1, 3, 9)

Then the matrix M of these choices as column vectors gives us the transformation via matrix multiplication, L(v) = Mv:

M = [ 1 0 2 ]
    [ 0 4 1 ]
    [ 7 4 3 ]
    [ 3 4 9 ]

In this way, we can easily see the structure of the transformation.

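In R, the three chosen vectors become the columns of the matrix, and applying L to any vector in R³ is just matrix multiplication with %*% (the test vector below is an arbitrary choice):

    # columns are L(e(1)), L(e(2)), L(e(3)) from above
    M <- cbind(c(1, 0, 7, 3),
               c(0, 4, 4, 4),
               c(2, 1, 3, 9))

    v <- c(1, 2, -1)   # 1e(1) + 2e(2) - 1e(3) in R^3
    M %*% v            # equals 1*L(e(1)) + 2*L(e(2)) - 1*L(e(3)) in R^4
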
Example 16: Convince yourself that matrix multiplication is linear. For example, if v = a(1)v(1) + . . . +a(n)v(n), then Mv = a(1)Mv(1) + . . . +a(n)Mv(n).

Example 17: Consider the linear transformation L given by:

  • Where does L go from and to (what is the domain and codomain)?
  • What is the dimension of ker(L)?
  • What is the dimension of the image?
  • Can L have an inverse?

Linear regression

A linear regression model M : V → R is an affine transformation. That is, it is of the form x ↦ Lx + b, where L is a linear transformation and b is a vector in the target space (or codomain) — here just a real number.

Example 18: What shape does L have? Reconcile this definition with the characterization of a linear regression as a function of the form f(x(1), . . . ,x(n)) = β(0) + ∑ β(i)x(i).

Alternatively, we can extend V by one dimension to V’ and define a linear regression as a linear transformation V’ → R with the understanding that we will only pass in values of the form (1, x(1), x(2), . . . ,x(n)).

If D ⊂ V’ (our data) is a collection of points in our domain vector space (usually called a feature space), it is natural to think of D as a matrix X where each row is a row of data. If y is a (column) vector of labels for each row of X, then the so called “ordinary least squares method” gives us the desired linear transformation, with coefficient vector given by:

β = (t(X)X)⁻¹t(X)y

Example 19: What needs to be true for this transformation to exist? What fix can we employ if the above transformation does not exist?

To get some insight into the above example, consider the following R code:

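(A sketch of the kind of code in question; the particular numbers are made up, and the key detail is that the third predictor, x3, is an exact linear combination of the other two.)

    set.seed(42)
    x1 <- rnorm(50)
    x2 <- rnorm(50)
    x3 <- x1 + x2                            # perfectly collinear with x1 and x2
    y  <- 2 * x1 - 3 * x2 + rnorm(50, sd = 0.1)
    dat <- data.frame(x1, x2, x3, y)

    # regress the last column on the first three
    fit <- lm(y ~ x1 + x2 + x3, data = dat)
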
This code defines a data matrix and fits a linear regression model using the first 3 columns to model the last column. If you inspect the fit model:

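    coef(fit)
    # the coefficient for x3 comes back as NA: lm() has dropped that column
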
We see that one of the features has been dropped from the model! Can you explain why this was done?

Eigenvectors and eigenvalues

Given a linear transformation L : V → V, we can ask the following interesting question: “Are there any vectors v ∈ V so that L just scales them (i.e. makes them longer or shorter without adjusting the direction)?” This question makes sense because L takes V to V. A (nonzero) vector v satisfying this condition is called an eigenvector, and the associated scaling factor λ (so that L(v) = λv) is called an eigenvalue.

More concretely, let’s say that L : R² → R² with the standard bases so that we can write it as a matrix M. This question then becomes: does there exist a nonzero vector v ∈ R² and a real number λ ∈ R so that Mv = λv?

If we let I be the identity matrix (that is, the matrix of the transformation taking every vector to itself), then this is equivalent to asking if there are nonzero solutions to the equation (M-λI)v = 0.

So we have now reduced the question of whether or not v and λ exist to a question about whether or not M-λI has a non-trivial kernel. Well, we’re already equipped to answer this question, but it turns out that there is a nice numeric trick that we can employ called the determinant. The determinant of an n × n square matrix M is a degree n polynomial in the entries of M. It’s easy to write down in the 2 × 2 case. Let

M = [ a b ]
    [ c d ]

Then the determinant det(M) is given by det(M) = ad - bc. In any case, the relevant fact for us is the following:

Theorem: A square matrix M has an inverse if and only if det(M) ≠ 0.

Recall that having an inverse is the same as having a trivial kernel, so if we would like to know when M-λI has a kernel, we should just look for when det(M-λI)=0. This is just looking for zeros of a degree n polynomial!

Let’s go back to our 2×2 example. We’ve convinced ourselves that λ ∈ R is an eigenvalue of M if and only if det(M - λI) = 0. That is, when:

(a - λ)(d - λ) - bc = 0

which we can expand to λ² - (a + d)λ + (ad - bc) = 0.

Does a degree 2 polynomial have real zeros? Sometimes none, sometimes one, sometimes two. Suppose there is an eigenvalue for M; call it λ’. How can we find the associated eigenvector?

Well, that amounts to finding nonzero vectors v = (v(1), v(2)) in the kernel of M-λ’I, which amounts to solving for v:

(M - λ’I)v = 0

Multiplying the matrix times the vector, this reduces to finding solutions of the linear system:

(a - λ’)v(1) + bv(2) = 0
cv(1) + (d - λ’)v(2) = 0

Example 20: Show that the matrix:

has no eigenvalues. (Note: this is obvious geometrically.)

Example 21: Show that the matrix

has one eigenvalue. Can you find an associated eigenvector?

Example 22: Show that the matrix

has two distinct eigenvalues. Can you find the associated eigenvectors?

It turns out that the fact that the above matrix has a basis of eigenvectors is no accident. In fact, it is a consequence of the following theorem:

Theorem (Real Spectral Theorem): Let M be an n × n real symmetric matrix (that is, t(M) = M, where t(M) denotes the transpose of M). Then the domain Rⁿ of this transformation has a basis consisting of eigenvectors of M, and these eigenvectors can be chosen to be orthogonal.

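Before turning to the covariance matrix, here is a quick numerical illustration in R with a small symmetric matrix of my own choosing: eigen() finds a full set of real eigenvalues, and the eigenvectors it returns are orthogonal, just as the theorem promises.

    S <- rbind(c(2, 1),
               c(1, 2))        # a symmetric 2 x 2 matrix

    e <- eigen(S)
    e$values                   # two real eigenvalues: 3 and 1
    e$vectors                  # columns are the corresponding eigenvectors

    # the eigenvectors form an orthogonal (in fact orthonormal) basis of R^2
    round(t(e$vectors) %*% e$vectors, 10)   # the identity matrix
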
This theorem actually comes into play in data science quite often via the so called “variance-covariance” matrix:

Example 23 (Principal Component Analysis ♯): Given centered and scaled data X (centered means that the sample mean for each column is zero and scaled means the sample variance for each column is 1), the variance-covariance matrix is given by (1/(m-1))t(X)X, where m is the number of rows of X. The principal components of the data are the eigenvectors {p(1), p(2), . . . ,p(n)} of this transformation.

These eigenvectors have some cool properties:

  1. There is a basis of them, and they are orthogonal (by the real spectral theorem).
  2. They are uncorrelated (you can verify this with a simple computation!)
  3. The magnitude of the associated eigenvalue tells us how much of the total variance in the data is explained by that component.

As a result of these properties, people will often trade the original feature space for one spanned by the first k (ordered by magnitude of the associated eigenvalues) eigenvectors of the covariance matrix (that is, the first k principal components). This is a lower dimensional space that still contains a maximal amount of the variance contained in the data.
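
As a closing sketch, the following R code (on a made-up data matrix) walks through this recipe by hand: center and scale the columns, form the variance-covariance matrix, and take its eigenvectors. The same components come out of R's built-in prcomp(), up to sign.

    set.seed(7)
    X <- matrix(rnorm(200 * 3), ncol = 3)
    X[, 3] <- X[, 1] + 0.5 * X[, 2] + rnorm(200, sd = 0.2)   # a correlated feature

    Xs <- scale(X)                        # centered and scaled data
    m  <- nrow(Xs)
    C  <- (1 / (m - 1)) * t(Xs) %*% Xs    # variance-covariance matrix

    e <- eigen(C)
    e$vectors                             # principal components p(1), p(2), p(3)
    e$values / sum(e$values)              # share of total variance explained

    prcomp(X, center = TRUE, scale. = TRUE)$rotation   # same directions, up to sign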

Conclusions

Very often in data science — after all the hard work of gathering business requirements, ironing out data access, querying, cleaning, and encoding data numerically — we find ourselves in the privileged position of needing to reason about points in a vector space (or feature space). Classification and regression models (among other staples of the data science world) sit on top of these vector spaces, and their behavior is largely determined by the geometry of this space.

We now rigorously understand properties of the space like dimension as well as the behavior of structure preserving (linear) transformations. We understand how these transformations retain and lose information, and we have seen many applications of these principles.

It’s my hope that this basic framing can greatly simplify your understanding of many techniques from machine learning like SVMs, MDPs, PCA, or whatever string of letters you may want to look into. If there’s a topic that you think should have made it into these notes, let me know below!
