Simple Linear Regression — OLS vs Mini-batch Gradient Descent (Python)

LU ZOU
Python experiments · Feb 28, 2019

Motivation

It started with a discussion about linear regression between a traditional statistician (me) and a computer scientist/mathematician. The question is simple but interesting.

Two approaches to fit a linear regression model are:

  1. Ordinary Least Squares (OLS) (an analytical method)
  2. Mini-batch gradient descent (GD) (a numerical method)

Will the two approaches return the same fitted line, given the same data, the same model, and (almost) the same loss function?

The mathematical/theoretical differences are described in this blog: Linear Regression Simplified — Ordinary Least Square vs Gradient Descent. There are three common ways to batch the data for GD:

  1. All data in one batch (Batch Gradient Descent)
  2. One observation per batch (Stochastic Gradient Descent)
  3. Subset observations per batch (Mini-batch Gradient Descent)

This blog focuses on mini-batch GD, which lies between batch GD and stochastic GD. I will try to show the differences in practice.

Generate data (small data size)

An artificial data set of 10 points is generated with an intercept of 6, a slope of 1, and residuals sampled from the standard normal distribution.

import numpy as np

np.random.seed(999)
x = np.random.normal(size=10)
y = x + 6 + np.random.normal(size=10)

The generated data looks like this:

For simplicity, I will use simple linear regression (uni-variate linear regression) with an intercept term.

When batching the data, I use a batch size of 2; two points always determine a line exactly (zero residuals within each batch), which makes things easier in the later investigations.

OLS approach

OLS estimates the slope 𝛽 and the intercept 𝛼 of the straight line by minimizing the sum of squared residuals (SSE):
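SSE(\alpha, \beta) = \sum_{i=1}^{n} \big(y_i - (\alpha + \beta x_i)\big)^2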

From the SSE, we can derive the estimates of 𝛽 and 𝛼 as below:
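\hat{\beta} = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i}(x_i - \bar{x})^2}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}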

This uses all the data at once, in a single step. It can be implemented with the Python class sklearn.linear_model.LinearRegression().

Using OLS, the estimated intercept 𝛼̂ is 6.089 and the estimated slope 𝛽̂ is 0.767.

from sklearn import linear_model

lm = linear_model.LinearRegression()
lm.fit(x.reshape(-1, 1), y)   # lm.intercept_ ≈ 6.089, lm.coef_[0] ≈ 0.767

Gradient descent approach

Similar to OLS, the gradient descent approach aims to minimize a cost function, but it does so iteratively. For simple linear regression, the cost function is proportional to the average of the squared residuals within a batch:
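J(\alpha, \beta) = \frac{1}{2m} \sum_{i=1}^{m} \big(\hat{y}_i - y_i\big)^2, \qquad \hat{y}_i = \alpha + \beta x_i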

where m is the batch size. The partial derivatives of the cost function give the direction and size of the change for the corresponding parameters. In the case of simple linear regression, the partial derivatives with respect to 𝛼 and 𝛽 are as below:
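\frac{\partial J}{\partial \alpha} = \frac{1}{m}\sum_{i=1}^{m} (\hat{y}_i - y_i), \qquad \frac{\partial J}{\partial \beta} = \frac{1}{m}\sum_{i=1}^{m} (\hat{y}_i - y_i)\, x_i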

After each batch, the estimates of 𝛼 and 𝛽 are updated as below:
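\alpha \leftarrow \alpha - \gamma\,\frac{\partial J}{\partial \alpha}, \qquad \beta \leftarrow \beta - \gamma\,\frac{\partial J}{\partial \beta}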

where 𝛾 is the learning rate. When the learning rate is zero, GD learns nothing. The learning rate has no upper limit in principle, but the larger it is, the more the estimates move toward whatever the latest batch suggests.

Define functions

Here is the function I use to batch the data.

def batch_data(num, data, labels, shuffle=True):
    """Split data and labels into batches of size `num` (any leftover points are dropped)."""
    idx = np.arange(0, len(data))
    if shuffle:
        np.random.shuffle(idx)
    ind = np.array(range(1, len(data) + 1))
    b_ends = ind[ind % num == 0]        # end position of each batch
    x_batched = []
    y_batched = []
    for i in b_ends:
        data_batch = data[idx[(i - num):i]]
        labels_batch = labels[idx[(i - num):i]]
        x_batched.append(data_batch)
        y_batched.append(labels_batch)
    return x_batched, y_batched
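
For example, splitting the 10 simulated points into fixed batches of two (the setup used in the rest of the post) gives five batches:

x_b, y_b = batch_data(2, x, y, shuffle=False)
len(x_b)    # 5 batches, each containing 2 points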

The GD model is defined as follows (TensorFlow 1.x):

import tensorflow as tf
import pandas as pd

X = tf.placeholder("float")
Y = tf.placeholder("float")
W = tf.Variable(np.random.randn(), name="W")   # slope
b = tf.Variable(np.random.randn(), name="b")   # intercept
y_pred = tf.add(tf.multiply(X, W), b)

# Mean squared error cost function (divided by 2)
cost = tf.reduce_sum(tf.pow(y_pred - Y, 2)) / (2 * tf.to_float(tf.size(y_pred)))

# Global variables initializer
init = tf.global_variables_initializer()


# Run the model
def run_GD(lr_choices, n_batches, n_epochs):
    lr_ls = []
    epoch_ls = []
    batch_ls = []
    cost_ls = []
    beta_ls = []
    alpha_ls = []

    for lr in lr_choices:
        # Gradient descent optimizer
        optimizer = tf.train.GradientDescentOptimizer(lr).minimize(cost)
        with tf.Session() as sess:
            # Initialize the variables
            sess.run(init)
            # Iterate through all the epochs
            for epoch in range(n_epochs):
                x_b, y_b = batch_data(int(len(x) / n_batches), x, y, shuffle=False)
                for btch in range(n_batches):
                    lr_ls.append(lr)
                    epoch_ls.append(epoch)
                    batch_ls.append(btch)
                    # Feed the current batch into the optimizer via the feed dictionary
                    sess.run(optimizer, feed_dict={X: x_b[btch], Y: y_b[btch]})
                    cost_ls.append(sess.run(cost, feed_dict={X: x_b[btch], Y: y_b[btch]}))
                    beta_ls.append(sess.run(W))
                    alpha_ls.append(sess.run(b))

    return pd.DataFrame({'learning_rate': lr_ls,
                         'epoch': epoch_ls,
                         'batch': batch_ls,
                         'cost': cost_ls,
                         'beta': beta_ls,
                         'intercept': alpha_ls})

Batch GD

Batch GD converges to the same estimates as OLS, provided the learning rate falls within a reasonable range and GD runs for enough epochs.

This is not surprising: in each epoch GD sees all of the data, so the estimates are gradually pushed from their initial values towards the OLS ones.
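
Batch GD corresponds to putting all 10 points into a single batch, for example (the learning rates and epoch count here are illustrative choices, not the exact values used in the post):

results_batch = run_GD(lr_choices=[0.1, 0.4], n_batches=1, n_epochs=50)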

Batch Gradient Descent

Mini-batch GD

The mini-batch GD goes through the batches in the same order 50 times (50 epochs, no shuffling) using different learning rates. The 10 data points are split into 5 batches with the order kept fixed, so that it is easier to track the convergence behavior.
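
A call along these lines reproduces those runs (the exact list of learning rates is my choice; the post compares rates between 0.1 and 1):

results = run_GD(lr_choices=[0.1, 0.4, 0.9, 1.0], n_batches=5, n_epochs=50)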

A learning rate of 1 clearly overshoots and does not converge.

When the learning rate is chosen well, the mini-batch GD results are very close to the OLS estimates.

Because the batches are not shuffled, the final estimates of every epoch are pulled towards the last batch; this repeated pattern is clearly visible in the right-hand plot for a learning rate of 0.9.

mini-batch GD convergence

Looking at the final estimates after the 50th epoch, different learning rates reach different results, and all differ from the OLS estimates.
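
Assuming the results DataFrame from the run_GD call above, the final estimates per learning rate can be extracted like this:

# Last update of the last epoch for each learning rate
final = results[results['epoch'] == results['epoch'].max()]
final = final.groupby('learning_rate').tail(1)[['learning_rate', 'beta', 'intercept']]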

The following plots compare the estimates against the OLS estimates at epochs 0, 1, 2, 3, 4, and 10, excluding the learning rate of 1 (which diverged). A learning rate of 0.1 seems a conservative choice, but does mini-batch GD actually converge at this rate?

Although a learning rate of 0.1 gave fairly good estimates after 50 epochs, the slope estimate did not converge.

mini-batch GD lr=0.1

mini-batch GD vs OLS per batch

To better understand the mini-batch GD process, I did the following experiment:

  • Fit a line per batch using OLS
  • Fit the GD with 50 epochs (shuffling batches, learning rate 0.4)

Then compare the slope and intercept estimates across the 5 batches with the per-batch OLS fits for each epoch (a sketch of the per-batch fits follows below).
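
A minimal sketch of the per-batch OLS fits, assuming a fixed split into batches of two and using np.polyfit (my choice; the post does not show this code):

# OLS fit per batch of two points: np.polyfit returns [slope, intercept] for each batch
x_b, y_b = batch_data(2, x, y, shuffle=False)
per_batch_fits = [np.polyfit(xb, yb, deg=1) for xb, yb in zip(x_b, y_b)]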

The graph below plots the intercept estimates against the slope estimates from the five per-batch OLS fits.

The initial values of the slope and intercept start outside the region spanned by the per-batch OLS estimates, quickly move inside it, and then wander around the average of the OLS estimates. A smaller learning rate narrows the wandering area. This makes sense because each batch carries only partial information.

Because of the small data size, mini-batch GD does not fully converge.

Summary

This blog is not meant to show that one approach is better than the other. I hope it helps in understanding the differences between the two methods in a practical way.

OLS is easy and fast if the data is not big.

Mini-batch GD is beneficial when the data is big and memory is limited. With this approach, the choice of learning rate and batch size matters. We should always check convergence, even if the final estimates look good.
