Elephant in the room: why isn’t this preferred over Gradient Descent‽ What an elegant solution this is, you say. Well, here’s the main reason why: computing the seemingly harmless inverse of X’X, the n by n matrix you get by multiplying the (n by m) transpose of X with the (m by n) matrix X, is, with the standard algorithms used in practice, of roughly cubic time complexity in n! This means that as the number of features n increases, the number of operations required to compute the final result grows roughly as n cubed. If X were rather small, and especially if n were low (that is, the data weren’t high-dimensional), then using the Normal Equation would be feasible. But for industrial applications with large, high-dimensional datasets, the Normal Equation would take extremely, sometimes nonsensically, long.
Rohan #3: Deriving the Normal Equation using matrix calculus
Rohan Kapur

What a great article. You totally connected the dots between what I learned in a machine learning course I took and, more recently, a linear algebra class I attended.

I remember my linear algebra professor mentioning that it is more computationally efficient to solve the normal equations

X’XΘ = X’y

and then just use back substitution to find the solution instead of calculating the inverse of X’X. Do you know if this would be more efficient than gradient descent?
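
For anyone who wants to try this, here is a minimal NumPy sketch of the two routes, not a definitive implementation. It assumes X has full column rank, so that X’X is positive definite and a Cholesky factorization exists. The synthetic data and the helper names forward_substitution and back_substitution are made up for illustration. Note that the factorization is itself still cubic in n, so this route saves a constant factor and avoids forming the inverse; it does not change the order of growth.

```python
import numpy as np

def forward_substitution(L, b):
    """Solve L z = b for lower-triangular L (illustrative helper)."""
    n = L.shape[0]
    z = np.zeros(n)
    for i in range(n):
        z[i] = (b[i] - L[i, :i] @ z[:i]) / L[i, i]
    return z

def back_substitution(U, z):
    """Solve U x = z for upper-triangular U (illustrative helper)."""
    n = U.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (z[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

# Hypothetical synthetic data: m examples, n features.
rng = np.random.default_rng(0)
m, n = 200, 5
X = rng.normal(size=(m, n))
y = X @ rng.normal(size=n) + 0.01 * rng.normal(size=m)

A = X.T @ X  # X'X, an n x n matrix
b = X.T @ y  # X'y, a length-n vector

# Route 1: explicit inverse, as in the closed-form Normal Equation.
theta_inv = np.linalg.inv(A) @ b

# Route 2: factor X'X = L L', then forward and back substitution.
L = np.linalg.cholesky(A)  # assumes X'X is positive definite
z = forward_substitution(L, b)
theta_sub = back_substitution(L.T, z)
```

Both routes should give the same Θ up to floating-point error. In practice a library call such as np.linalg.solve(A, b) does the factor-and-substitute steps in one go.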