Mini-Batch Gradient Descent


Mini-batch Gradient Descent is a variant of Gradient Descent, an iterative optimization algorithm used to find the minimum of a cost function in machine learning problems. Unlike Batch Gradient Descent, which processes the entire training set at once, mini-batch Gradient Descent processes small random subsets of the training data (mini-batches), one at each iteration.

The main difference between mini-batch Gradient Descent and Batch Gradient Descent is that, instead of computing the gradient of the cost function over the entire training set, mini-batch Gradient Descent computes it over a small random subset of the training set. This reduces the computational cost of each update and speeds up the convergence of the algorithm.

Here’s an example to illustrate the difference between Batch Gradient Descent and mini-batch Gradient Descent:

Suppose we have a training set with 1000 examples, and we want to train a model using Gradient Descent. If we use Batch Gradient Descent, we will process all 1000 examples at once to compute the gradient of the cost function. This can be computationally expensive, especially if we have a large number of features.

If we use mini-batch Gradient Descent instead, we will divide the training set into small random subsets (e.g., 32, 64, or 128 examples) and process one such subset (i.e., one mini-batch) per iteration. For example, if we use a mini-batch size of 32, we will randomly select 32 examples from the training set and compute the gradient of the cost function using only those examples.
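
As a rough sketch of how that splitting can look in code (the arrays and batch size here are illustrative toy data, not a specific library API), one common approach is to shuffle the example indices once and then walk through them in fixed-size chunks:

import numpy as np

# toy stand-in for the 1000-example training set described above
X = np.random.randn(1000, 10)
y = np.random.randn(1000)
batch_size = 32

indices = np.random.permutation(X.shape[0])  # shuffle the row indices
for start in range(0, X.shape[0], batch_size):
    batch_idx = indices[start:start + batch_size]
    X_batch, y_batch = X[batch_idx], y[batch_idx]
    # ...compute the gradient on (X_batch, y_batch) and update the parameters

(The implementation later in this post instead draws each mini-batch independently at random; both schemes are common.)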

The advantage of mini-batch Gradient Descent is that it can take advantage of vectorization (i.e., performing computations on multiple examples simultaneously), which speeds up the computation of the gradient. Additionally, mini-batch Gradient Descent often converges faster than Batch Gradient Descent because it makes many more parameter updates per pass over the data, and the noise in its gradient estimates can even help the algorithm escape shallow local minima.
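
To make the vectorization point concrete, here is a minimal sketch for plain linear regression (variable names are illustrative): the averaged squared-error gradient over a 32-example mini-batch comes out of a couple of array operations, with no Python loop over individual examples.

import numpy as np

X_batch = np.random.randn(32, 10)  # one mini-batch: 32 examples, 10 features
y_batch = np.random.randn(32)
w, b = np.ones(10), 0.0            # current parameters

y_hat = X_batch @ w + b            # predictions for all 32 examples at once
error = y_batch - y_hat
grad_w = -2 * (error @ X_batch) / len(y_batch)  # gradient w.r.t. the weights
grad_b = -2 * error.mean()                      # gradient w.r.t. the intercept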

However, mini-batch Gradient Descent can be more sensitive to the learning rate and the choice of mini-batch size, both of which affect the stability and convergence of the algorithm. A learning rate that is too high can cause the algorithm to diverge, while one that is too low slows convergence. Similarly, a mini-batch size that is too small introduces noise into the gradient estimates, while one that is too large increases the computational cost per update and erodes the advantages over Batch Gradient Descent.
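
One way to see this sensitivity is to sweep a few learning rates and mini-batch sizes and compare test scores. Here is a minimal sketch using scikit-learn's SGDRegressor, fed one random mini-batch per partial_fit call as done later in this post (the grids of values are arbitrary choices, not recommendations):

import random
from sklearn.datasets import load_diabetes
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

for eta0 in [0.001, 0.01, 0.1]:
    for batch_size in [8, 32, 128]:
        sgd = SGDRegressor(learning_rate='constant', eta0=eta0)
        for _ in range(100):
            # one random mini-batch per update step
            idx = random.sample(range(X_train.shape[0]), batch_size)
            sgd.partial_fit(X_train[idx], y_train[idx])
        score = r2_score(y_test, sgd.predict(X_test))
        print(f"eta0={eta0}, batch_size={batch_size}: r2={score:.3f}")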

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
import numpy as np

X, y = load_diabetes(return_X_y=True)
print(X.shape)  # (442, 10)
print(y.shape)  # (442,)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# baseline: ordinary least squares via scikit-learn
reg = LinearRegression()
reg.fit(X_train, y_train)

print(reg.coef_)
# [  -9.16088483 -205.46225988  516.68462383  340.62734108 -895.54360867
#   561.21453306  153.88478595  126.73431596  861.12139955   52.41982836]
print(reg.intercept_)
# 151.88334520854633

y_pred = reg.predict(X_test)
r2_score(y_test, y_pred)
# 0.4399387660024645
import random

class MBGDRegressor:

    def __init__(self, batch_size, learning_rate=0.01, epochs=100):
        self.coef_ = None
        self.intercept_ = None
        self.lr = learning_rate
        self.epochs = epochs
        self.batch_size = batch_size

    def fit(self, X_train, y_train):
        # initialise the intercept to 0 and every coefficient to 1
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])

        for i in range(self.epochs):
            # roughly one pass over the training set per epoch,
            # one randomly drawn mini-batch at a time
            for j in range(int(X_train.shape[0] / self.batch_size)):
                idx = random.sample(range(X_train.shape[0]), self.batch_size)

                y_hat = np.dot(X_train[idx], self.coef_) + self.intercept_

                # squared-error gradient w.r.t. the intercept (averaged over the batch)
                intercept_der = -2 * np.mean(y_train[idx] - y_hat)
                self.intercept_ = self.intercept_ - (self.lr * intercept_der)

                # gradient w.r.t. the coefficients (note: summed, not averaged, over the batch)
                coef_der = -2 * np.dot((y_train[idx] - y_hat), X_train[idx])
                self.coef_ = self.coef_ - (self.lr * coef_der)

        print(self.intercept_, self.coef_)

    def predict(self, X_test):
        return np.dot(X_test, self.coef_) + self.intercept_
mbr = MBGDRegressor(batch_size=int(X_train.shape[0]/50), learning_rate=0.01, epochs=100)
mbr.fit(X_train, y_train)
# 154.83742294704174 [  38.40848324 -142.67633481  457.28746444  303.60926403  -17.99961807
#   -85.81943788 -192.04933133  116.18528414  407.24722272  105.8082595 ]

y_pred = mbr.predict(X_test)
r2_score(y_test, y_pred)
# 0.4518861449568681
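
One detail worth noting about the class above: fit draws every mini-batch independently with random.sample, so within a single epoch some rows can be visited twice and others skipped. A common alternative shuffles the indices once per epoch and slices them into non-overlapping batches. Here is a minimal standalone sketch of that variant (same update rule as above; the function name is illustrative):

def fit_shuffled(X_train, y_train, batch_size, lr=0.01, epochs=100):
    # non-overlapping mini-batches drawn from a fresh shuffle each epoch
    intercept = 0.0
    coef = np.ones(X_train.shape[1])
    for _ in range(epochs):
        indices = np.random.permutation(X_train.shape[0])
        for start in range(0, X_train.shape[0], batch_size):
            idx = indices[start:start + batch_size]
            y_hat = np.dot(X_train[idx], coef) + intercept
            intercept -= lr * (-2 * np.mean(y_train[idx] - y_hat))
            coef -= lr * (-2 * np.dot(y_train[idx] - y_hat, X_train[idx]))
    return intercept, coef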

from sklearn.linear_model import SGDRegressor

sgd = SGDRegressor(learning_rate='constant', eta0=0.1)
batch_size = 35

for i in range(100):
    # scikit-learn has no built-in mini-batch mode, so we feed one
    # random mini-batch per partial_fit call
    idx = random.sample(range(X_train.shape[0]), batch_size)
    sgd.partial_fit(X_train[idx], y_train[idx])

sgd.coef_
# array([  49.19545821,  -67.84533938,  338.57819421,  247.97315609,
#          25.30849249,  -24.71685159, -155.45845777,  116.19331239,
#         312.91250811,  133.36595993])
sgd.intercept_
# array([148.61489911])

y_pred = sgd.predict(X_test)
r2_score(y_test, y_pred)
# 0.4271789125617129

Stochastic Gradient Descent (SGD) converges faster but with more oscillations.

[Figure: Stochastic Gradient Descent (fast, but random)]

Batch Gradient Descent (BGD) converges more slowly but steadily towards the minimum of the cost function.

[Figure: Batch Gradient Descent (slow)]

Mini-Batch Gradient Descent (MBGD) is a compromise between the two, offering faster convergence than BGD and fewer oscillations than SGD.

[Figure: Mini-Batch Gradient Descent (fast)]

Here’s a more detailed comparison of the convergence behaviour of Batch Gradient Descent (BGD), Stochastic Gradient Descent (SGD), and Mini-Batch Gradient Descent (MBGD), followed by a minimal sketch of the three update loops:

BGD:

  • Computes the gradient of the cost function over the entire training set.
  • Updates the model parameters once per iteration.
  • Converges more slowly but steadily towards the minimum of the cost function.
  • Can be more stable than SGD and MBGD, but may take longer to converge.
  • Requires more memory and computational resources than SGD and MBGD.

SGD:

  • Computes the gradient of the cost function over a single training example.
  • Updates the model parameters once per example.
  • Converges faster but with more oscillations towards the minimum of the cost function.
  • Can be more sensitive to the choice of learning rate and the randomness of the data.
  • Requires less memory and computational resources than BGD.

MBGD:

  • Computes the gradient of the cost function over small random subsets of the training data (mini-batches).
  • Updates the model parameters once per mini-batch.
  • Offers a compromise between the speed of SGD and the stability of BGD.
  • Converges faster than BGD and with fewer oscillations than SGD.
  • Requires less memory and computational resources than BGD, but more than SGD.
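
To make the three update schedules concrete, here is a minimal side-by-side sketch for linear regression (illustrative helper names; gradients averaged over whatever slice of data each variant sees):

import numpy as np

def gradient_step(X_part, y_part, coef, intercept, lr):
    # one parameter update from the averaged squared-error gradient on X_part
    y_hat = X_part @ coef + intercept
    error = y_part - y_hat
    coef = coef + lr * 2 * (error @ X_part) / len(y_part)
    intercept = intercept + lr * 2 * error.mean()
    return coef, intercept

def train(X, y, mode, lr=0.01, epochs=10, batch_size=32):
    coef, intercept = np.ones(X.shape[1]), 0.0
    n = X.shape[0]
    for _ in range(epochs):
        if mode == "bgd":       # one update per epoch, on the full training set
            coef, intercept = gradient_step(X, y, coef, intercept, lr)
        elif mode == "sgd":     # one update per individual example
            for i in np.random.permutation(n):
                coef, intercept = gradient_step(X[i:i+1], y[i:i+1], coef, intercept, lr)
        elif mode == "mbgd":    # one update per mini-batch
            indices = np.random.permutation(n)
            for start in range(0, n, batch_size):
                idx = indices[start:start + batch_size]
                coef, intercept = gradient_step(X[idx], y[idx], coef, intercept, lr)
    return coef, intercept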
