BigQuery ML gets faster by computing a closed-form solution (sometimes)
How I got a 50% speedup with no loss in accuracy
--
In a previous article, I showed you how to train a machine learning model using just SQL in BigQuery:
CREATE OR REPLACE MODEL flights.arrdelay
OPTIONS
(model_type='linear_reg', input_label_cols=['arr_delay']) AS
SELECT
arr_delay,
carrier,
origin,
dest,
dep_delay,
taxi_out,
distance
FROM
`cloud-training-demos.flights.tzcorr`
WHERE
arr_delay IS NOT NULL
This took me about 6 minutes. But when I run it now on BigQuery, it takes me only 3 minutes and completes in just one iteration:
What’s up? Why did BigQuery ML suddenly get faster? What is happening? Will you always get this speed-up? Is the resulting model accurate?
Answer: (a) The main reason it got faster is because it is using a faster algorithm (i.e., not because of improvements in computational infrastructure). (b) The faster algorithm involves finding a closed-form solution by computing a “pseudo-inverse”. (c) The pseudo-inverse will only work for a subset of machine learning problems. (d) The model is just as good as what you’d get over multiple iterations.
In this article, I will explain these answers further. Feel free to skim the next section (on the math behind the pseudo-inverse) as the details are not that important.
Pseudo-inverse (the math)
When you do linear regression, you are essentially solving the matrix equation XW = Y for the best values of W given the various features (X) and their corresponding label column (Y).
If X consists of N rows and m features, then X is a N×m matrix, W is a m×1 matrix and Y is a N×1 matrix.
If we have exactly m rows and m features, and if the rows are all linearly independent, then this is a system of linear equations and you can compute the correct answer for W by computing the matrix inverse of X and multiplying both sides by it:
X⁻¹X W = X⁻¹ Y
The product of a matrix and its inverse is an identity matrix, so you get a closed-form solution for W:
W =X⁻¹ Y
But normally you will have much more data and so X will not be a square matrix and you can not compute the inverse of X directly. Instead, you can make X square by multiplying it with the transpose of itself:
(Xᵀ X) W = Xᵀ Y
Now, if we invert the squared matrix, we can directly solve for the weights:
W = (Xᵀ X)⁻¹ Xᵀ Y
This thing: (Xᵀ X)⁻¹ Xᵀ is called the Moore-Penrose pseudo-inverse of X, and this is what BigQuery is computing to find the weights directly.
Will this lead to an accurate solution?
Linear regression models belong to the class of “convex optimization” problems. This means that there is only a global minimum — there are no local minima. The normal iterative approach, called mini-batch gradient descent, that BigQuery uses converges towards this global minimum, and the pseudo-inverse directly solves for it. So, both methods will yield equivalent models (not the same model because the solution is not unique if some of your features are linearly correlated). There is no tradeoff in terms of accuracy provided you let the iterative approach run until convergence.
What is the speed up possible?
If the expression (Xᵀ X)⁻¹ Xᵀ can be computed quickly and efficiently, then it is better to directly solve for the solution rather than to use an iterative, gradient-descent approach. And this is what BigQuery now does.
In the flights problem, num_rows is 6 million and num_features is 660 and we got a 50% speed up. If you have larger datasets, you should see more of a speed up. This is because computing Xᵀ X can be done in an embarrassingly parallel way, and BigQuery excels at parallelizing this sort of computation on very large datasets. So, the larger the number of rows, the faster the speed-up versus the gradient-descent approach. Also, the fewer the number of features, the smaller the matrix to invert, and the faster the speedup.
In general, the greater the num_rows/num_features ratio is, the faster the pseudo-inverse will be.
When can BigQuery compute the pseudo-inverse?
The pseudo-inverse helps you find the best weights only if the error you are minimizing is the mean squared error. This is the case for linear regression but only if you are not modifying the error function, e.g. by using L1 regularization. It is not the case for multi-class logistic regression (where the error function is cross-entropy). There is a way available to transform a binary classification problem to make it a linear regression problem, but at present, BigQuery does not implement this.
Secondly, consider the size of the matrix that we are inverting. X is of size N×m where m is the cardinality or number of features after categorical variables are one-hot encoded. So, Xᵀ is of size m×N and Xᵀ X is of size m×m. This means that when you compute the pseudo-inverse, you need to invert a matrix of size m×m. This is an in-memory operation and has to happen on a single BigQuery worker. So, BigQuery will choose to do it only if computing the inverse of an m×m matrix on a single worker is faster than the normal distributed iterative approach. In the flights problem, the cardinality consists of three categorical features that contribute 14 + 322 + 322 (carrier, origin, dest) and 3 numeric features. So, the total cardinality is 661, and inverting a 661x661 matrix is fast enough on a single worker. Currently, the threshold on number of features for flipping from the pseudo-inverse to an iterative approach is in the high thousands, though this will change as hardware specs change.
Third, consider the caveat I started the matrix inverse with — the rows have to be linearly independent in order that the matrix is well-behaved enough for you to invert it. So, BigQuery will compute the pseudo-inverse only if your dataset is large enough. If you have an order of magnitude more rows than you have features, then you should fall into pseudo-inverse territory.
What if I don’t want to?
You can force BigQuery to not use the pseudo-inverse. Force it to use iterative mini-batch gradient descent by explicitly specifying the optimize strategy:
CREATE OR REPLACE MODEL flights.arrdelay
OPTIONS
(model_type='linear_reg', input_label_cols=['arr_delay'], optimize_strategy='batch_gradient_descent') AS
SELECT
arr_delay,
carrier,
origin,
dest,
dep_delay,
taxi_out,
distance
FROM
`cloud-training-demos.flights.tzcorr`
WHERE
arr_delay IS NOT NULL
When I did that, the model took 6.5 minutes and converged in 7 iterations.
It’s better to let BigQuery use the closed form solution when it can, though. Why would you give up a 50% reduction in training time for no loss in accuracy? Pretty much the only reason I can think of is when you have a horrible dataset where many of the rows are identical — this is because the inversion of such ill-posed matrices are still a research problem, but you did clean up your dataset, didn’t you?¹
Summary
BigQuery ML will now compute a closed-form solution using a pseudo-inverse. This is faster than the normal iterative approach, but it will be chosen only if:
- You are doing linear regression without L1 regularization
- You have fewer than a few thousand features (taking into account the cardinality of categorical features)
- You have an order of magnitude more rows than features
The resulting model has the same accuracy as what you’d get had BigQuery used its normal, iterative mini-batch gradient descent approach.
- The actual speed up depends on the number of rows and number of features. The more rows you have, or the fewer features you have, the more of a speed up you can expect.
- You can turn off this feature by explicitly specifying optimize_strategy=’batch_gradient_descent’. Don’t , though.
¹If you are wondering whether this will be a problem, try doing a COUNT(DISTINCT) on your data and see how many duplicate rows you have.
Thanks to Mingge Deng, Amir Hormati, and Felipe Hoffa for helpful feedback and pointers.