In-Depth Analysis

Multiple Regression in Pure SQL

Scriddie
5 min read · Jul 24, 2019

For a Data Science project to be successful, it needs to put the biggest emphasis on the essentials: the data itself. For this blog post we will assume your data to be structured, human-readable and, of course, stored in a relational database. In case your data isn’t that yet, you may want to consider working on this part first. After all, data is not the points in a scatter plot, it is not the nodes or edges of a graph; in its most essential form, data is rows in a SQL table. Thus, as an experienced SQL veteran, you may have found yourself asking: “Why resort to R, Python or any other programming fad when SQL is already Turing complete?” It is with this question that an awesome journey to the heart of Data Science begins.

Francis Galton and friends, 1886
(left to right:) Donald D. Chamberlin, Raymond F. Boyce, Edgar F. Codd (1886)

To illustrate how SQL, when used correctly, is a suitable tool not only for database management but also for data analysis, I will use multiple linear regression as an example.

Usually, we would start out with some sort of feature matrix. However, in a matrix the ordering contains information already. For SQL, we want no such nonsense as having information in the layout and would rather have a more explicit and intuitive long table, with one row per matrix entry holding its row_num, its col_num and its value, like so:

…and some labels in the same format.

I’ve created the labels in the following way:

label = 5 * intercept - feature1 + 2 * feature2
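If you want to play along at home without a MySQL server, here is a sketch of this setup in SQLite, driven (forgive me) from Python. The table names features and labels are my own assumptions; only the row_num/col_num/value layout is from the post.

```python
import sqlite3
import random

# Long-format tables: one row per matrix entry, as described above.
# Table names are assumptions; the column layout follows the post.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE features (row_num INT, col_num INT, value REAL);
CREATE TABLE labels   (row_num INT, col_num INT, value REAL);
""")

random.seed(0)
for i in range(1, 51):
    f1, f2 = random.random(), random.random()
    # col 1 = intercept, col 2 = feature1, col 3 = feature2
    con.executemany(
        "INSERT INTO features VALUES (?, ?, ?)",
        [(i, 1, 1.0), (i, 2, f1), (i, 3, f2)])
    # label = 5 * intercept - feature1 + 2 * feature2
    con.execute("INSERT INTO labels VALUES (?, 1, ?)",
                (i, 5.0 - f1 + 2.0 * f2))

print(con.execute("SELECT COUNT(*) FROM features").fetchone()[0])
```

Fifty observations, three feature columns, one label column: 150 feature rows and 50 label rows in long format.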

Let’s see if we can get a linear regression in pure SQL to figure out this relationship!

Multiple Linear Regression

The analytical solution to a multiple linear regression can be expressed as β = (XᵀX)⁻¹ XᵀY.

So given the matrices, all we really need in terms of linear algebra is transposition, matrix multiplication, the determinant and the adjugate matrix.
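The last two items are what let us dodge a matrix-inversion routine altogether: for any invertible square matrix A, the inverse can be written in terms of the determinant and the adjugate.

```latex
A^{-1} = \frac{1}{\det(A)} \operatorname{adj}(A)
```

With A = XᵀX, that reduces the whole inversion to sub-determinants, sums and one division.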

The true master of a craft can make it work anyways.

Sure, you might think, let’s just define some functions and get started. But are user-defined functions even SQL? They are nothing but a scheme to let a machine do the typing and steal your job. They are an insult to any hard-working and hard-coding (see the similarity?) programmer. With that in mind, let’s get started.

X² and X * Y

First we need to get the initial matrix multiplications done. We need our feature matrix X squared and saved as X², and the features and labels multiplied and saved as X * Y.

For the matrix multiplication, we avoid the need for any transposition by simply switching row_num and col_num for the left side. After joining the two matrices in question on row_num, we group by the col_num of each matrix to get an outcome of the right dimension. Each entry in the result then gets the left side’s col_num as its row_num, the right side’s col_num as its col_num, and the dot product of the two rows as its value.
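The transpose-free join described above can be sketched as follows, again in SQLite for reproducibility; the table name x is an assumption, and a tiny 2 x 2 matrix stands in for the real feature matrix.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE x (row_num INT, col_num INT, value REAL)")
# A small example: X = [[1, 2], [3, 4]]
con.executemany("INSERT INTO x VALUES (?, ?, ?)",
                [(1, 1, 1), (1, 2, 2), (2, 1, 3), (2, 2, 4)])

# Xt * X without a transpose: use the left side's col_num as the output
# row index, join on the shared row_num, group by both column indices.
xtx = con.execute("""
    SELECT a.col_num AS row_num, b.col_num AS col_num,
           SUM(a.value * b.value) AS value
    FROM x a JOIN x b ON a.row_num = b.row_num
    GROUP BY a.col_num, b.col_num
    ORDER BY row_num, col_num
""").fetchall()
print(xtx)  # [(1, 1, 10.0), (1, 2, 14.0), (2, 1, 14.0), (2, 2, 20.0)]
```

Swapping the right-hand x for the labels table gives X * Y by the exact same query.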

Determinant of X²

Now comes a bit of a tricky part. Staying true to our paradigms, we want to write code that could scale well in principle but really does not in practice. Therefore, we will start by finding the determinant using the Leibniz formula:

For the Leibniz formula, we need the product of the elements on the main diagonal of each of the 6 permutations obtainable by switching rows in our 3 x 3 matrix (see the dihedral group of order 6). In the subquery, we form the row permutations by joining in the elements of the remaining rows twice, until we have all possible products of elements from different rows. In the where clause, we filter for those products whose elements form the main diagonal of a permutation. Next, we determine the sign of each determinant component by assigning a dense_rank, starting from zero, to the severity of permutation of each main diagonal product. Finally, we take the sum of the components with their respective signs to arrive at the determinant.
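For the impatient, here is a compact sketch of the Leibniz sum in SQLite. It cheats relative to the post in one place: the six permutations and their signs are written out by hand in a VALUES table rather than derived with a dense_rank over the permutation severity.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE m (row_num INT, col_num INT, value REAL)")
# Example matrix [[2, 0, 1], [1, 3, 0], [0, 1, 4]]
for i, row in enumerate([[2, 0, 1], [1, 3, 0], [0, 1, 4]], 1):
    for j, v in enumerate(row, 1):
        con.execute("INSERT INTO m VALUES (?, ?, ?)", (i, j, v))

# Leibniz: sum over the 6 permutations of {1,2,3}, each contributing
# its sign times the product down the permuted main diagonal.
det = con.execute("""
    WITH perms(p1, p2, p3, sign) AS (VALUES
        (1,2,3, 1), (2,3,1, 1), (3,1,2, 1),
        (1,3,2,-1), (3,2,1,-1), (2,1,3,-1))
    SELECT SUM(p.sign * a.value * b.value * c.value)
    FROM perms p
    JOIN m a ON a.row_num = 1 AND a.col_num = p.p1
    JOIN m b ON b.row_num = 2 AND b.col_num = p.p2
    JOIN m c ON c.row_num = 3 AND c.col_num = p.p3
""").fetchone()[0]
print(det)  # 25.0
```

Cofactor expansion of the same matrix gives 2·(12−0) − 0 + 1·(1−0) = 25, so the joins agree with pen and paper.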

There are two important software development principles at work here:

  1. The more inscrutable your code, the longer its life cycle.
  2. Code only you can understand is code that will keep you employed.

Adjugate and Inverse of X²

Did you think this is where we would give up? Well, much to learn for you there still is, young padawan. Let’s go!

So what’s going on here? Essentially, we start off with an instance of X², then we join in all elements that are possible candidates for the sub-determinants that make up the adjugate matrix.

Of all possible combinations, we only keep the ones where the sum of row_num and col_num of the elements on the main diagonal is the same as the sum of row_num and col_num of the elements on the opposite diagonal. Lastly, we join in the original matrix once more to figure out the position of the resulting sub-determinants and assign the correct sign based on the position.
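The same pairing-and-sign dance can be sketched as a single query, again in SQLite: for each position p, pair up elements a and b of the 2 x 2 minor that excludes p’s row and column, count main-diagonal pairs positively and anti-diagonal pairs negatively, and let (−1) to the power of (row + col) fix the cofactor sign. Table and alias names are assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE m (row_num INT, col_num INT, value REAL)")
# Example matrix [[2, 0, 1], [1, 3, 0], [0, 1, 4]], det = 25
for i, row in enumerate([[2, 0, 1], [1, 3, 0], [0, 1, 4]], 1):
    for j, v in enumerate(row, 1):
        con.execute("INSERT INTO m VALUES (?, ?, ?)", (i, j, v))

# For each p: a, b run over the minor excluding p's row and column,
# with a.row < b.row so each pair appears once.  Main-diagonal pairs
# (a.col < b.col) add, anti-diagonal pairs subtract; that sum is the
# 2 x 2 sub-determinant.  Emitting the result at (p.col_num, p.row_num)
# transposes the cofactor matrix into the adjugate.
adj = con.execute("""
    SELECT p.col_num AS row_num, p.row_num AS col_num,
           CASE WHEN (p.row_num + p.col_num) % 2 = 0 THEN 1 ELSE -1 END
             * SUM(CASE WHEN a.col_num < b.col_num
                        THEN a.value * b.value
                        ELSE -a.value * b.value END) AS value
    FROM m p
    JOIN m a ON a.row_num <> p.row_num AND a.col_num <> p.col_num
    JOIN m b ON b.row_num <> p.row_num AND b.col_num <> p.col_num
           AND b.row_num > a.row_num AND b.col_num <> a.col_num
    GROUP BY p.row_num, p.col_num
    ORDER BY row_num, col_num
""").fetchall()
print(adj)
```

Multiplying the example matrix by this output reproduces 25 times the identity, which is exactly what det(A) · A⁻¹ = adj(A) promises.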

Knocking on heaven’s door

There is only one more thing to do: Multiply the inverse of X² with X*Y.
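Putting every piece together, here is an end-to-end sketch of the whole pipeline in SQLite, run from Python: long-format data, the transpose-free multiplications, the hand-written Leibniz permutation table, the adjugate join, and the final division by the determinant. All table names (x, y, xtx, xty, adj) are assumptions; the queries mirror the approach described above, not the post’s exact MySQL source.

```python
import sqlite3
import random

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE x (row_num INT, col_num INT, value REAL);
CREATE TABLE y (row_num INT, col_num INT, value REAL);
""")
random.seed(0)
for i in range(1, 51):
    f1, f2 = random.random(), random.random()
    con.executemany("INSERT INTO x VALUES (?, ?, ?)",
                    [(i, 1, 1.0), (i, 2, f1), (i, 3, f2)])
    # label = 5 * intercept - feature1 + 2 * feature2
    con.execute("INSERT INTO y VALUES (?, 1, ?)", (i, 5.0 - f1 + 2.0 * f2))

# X² = Xt * X and X * Y = Xt * Y via the transpose-free join.
con.executescript("""
CREATE TABLE xtx AS
    SELECT a.col_num AS row_num, b.col_num AS col_num,
           SUM(a.value * b.value) AS value
    FROM x a JOIN x b ON a.row_num = b.row_num
    GROUP BY a.col_num, b.col_num;
CREATE TABLE xty AS
    SELECT a.col_num AS row_num, b.col_num AS col_num,
           SUM(a.value * b.value) AS value
    FROM x a JOIN y b ON a.row_num = b.row_num
    GROUP BY a.col_num, b.col_num;
CREATE TABLE adj AS
    SELECT p.col_num AS row_num, p.row_num AS col_num,
           CASE WHEN (p.row_num + p.col_num) % 2 = 0 THEN 1 ELSE -1 END
             * SUM(CASE WHEN a.col_num < b.col_num
                        THEN a.value * b.value
                        ELSE -a.value * b.value END) AS value
    FROM xtx p
    JOIN xtx a ON a.row_num <> p.row_num AND a.col_num <> p.col_num
    JOIN xtx b ON b.row_num <> p.row_num AND b.col_num <> p.col_num
             AND b.row_num > a.row_num AND b.col_num <> a.col_num
    GROUP BY p.row_num, p.col_num;
""")

# beta = adj(X²) * (X * Y) / det(X²), with the Leibniz determinant inlined.
beta = con.execute("""
    WITH perms(p1, p2, p3, sign) AS (VALUES
        (1,2,3, 1), (2,3,1, 1), (3,1,2, 1),
        (1,3,2,-1), (3,2,1,-1), (2,1,3,-1)),
    det(value) AS (
        SELECT SUM(p.sign * a.value * b.value * c.value)
        FROM perms p
        JOIN xtx a ON a.row_num = 1 AND a.col_num = p.p1
        JOIN xtx b ON b.row_num = 2 AND b.col_num = p.p2
        JOIN xtx c ON c.row_num = 3 AND c.col_num = p.p3)
    SELECT a.row_num, SUM(a.value * b.value) / d.value
    FROM adj a
    JOIN xty b ON a.col_num = b.row_num
    CROSS JOIN det d
    GROUP BY a.row_num, d.value
    ORDER BY a.row_num
""").fetchall()
print(beta)
```

Since the labels are an exact linear combination of the features, the three coefficients come out as 5, −1 and 2 up to floating-point noise.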

And there we have it, just what we wanted: a beautiful piece of analysis done in pure SQL and presented as three loveless cells in a grid. Good ole SQL!

SQL programmer (1956, colorized)

For the inexperienced it may seem like a lot of work for just a linear regression. But this is where SQL really separates the wheat from the chaff. SQL, from input to output, does not care for your pathetic need to minimize code duplication, structure your code into modular components, or ever see the world as anything but a grid of data. As a true SQL wizard, you have evolved to think like a machine; you have become the thinking machine other AI engineers are trying to build. This allows you to take full advantage of the godlike speed offered by databases. This is what separates you from all the wannabes still trying to avoid typing out the same operation 300 times per script. This is the beauty of data analysis in SQL.

(The full source code including the example data can be found here. It has been written and tested using MySQL 8.0.16.)


Scriddie

Trying my best to write code that is smarter than me.