Mini-Batch Gradient Descent


Mini-batch Gradient Descent is a variant of Gradient Descent, an iterative optimization algorithm used to find the minimum of a cost function in machine learning problems. Unlike Batch Gradient Descent, which processes the entire training set at once, mini-batch Gradient Descent processes small random subsets of the training data (mini-batches), one at each iteration.

The main difference between mini-batch Gradient Descent and Batch Gradient Descent is that, instead of computing the gradient of the cost function over the entire training set, mini-batch Gradient Descent computes it over a small random subset of the training set. This reduces the computational cost of each update and speeds up the convergence of the algorithm.

Here’s an example to illustrate the difference between Batch Gradient Descent and mini-batch Gradient Descent:

Suppose we have a training set with 1000 examples, and we want to train a model using Gradient Descent. If we use Batch Gradient Descent, we will process all 1000 examples at once to compute the gradient of the cost function. This can be computationally expensive, especially if we have a large number of features.

If we use mini-batch Gradient Descent instead, we will divide the training set into small random subsets (e.g., 32, 64, or 128 examples) and process one such subset (i.e., one mini-batch) per iteration. For example, if we use a mini-batch size of 32, we will randomly select 32 examples from the training set and compute the gradient of the cost function using only those examples.
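
As a rough sketch of how that splitting can look in code (the arrays and batch size here are illustrative toy data, not a specific library API), one common approach is to shuffle the example indices once and then walk through them in fixed-size chunks:

import numpy as np

# toy stand-in for the 1000-example training set described above
X = np.random.randn(1000, 10)
y = np.random.randn(1000)
batch_size = 32

indices = np.random.permutation(X.shape[0])  # shuffle the row indices
for start in range(0, X.shape[0], batch_size):
    batch_idx = indices[start:start + batch_size]
    X_batch, y_batch = X[batch_idx], y[batch_idx]
    # ...compute the gradient on (X_batch, y_batch) and update the parameters

(The implementation later in this post instead draws each mini-batch independently at random; both schemes are common.)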

The advantage of mini-batch Gradient Descent is that it can take advantage of vectorization (i.e., performing computations on multiple examples simultaneously), which speeds up the computation of the gradient. Additionally, mini-batch Gradient Descent often converges faster than Batch Gradient Descent because it makes many more parameter updates per pass over the data, and the noise in its gradient estimates can even help the algorithm escape shallow local minima.
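
To make the vectorization point concrete, here is a minimal sketch for plain linear regression (variable names are illustrative): the averaged squared-error gradient over a 32-example mini-batch comes out of a couple of array operations, with no Python loop over individual examples.

import numpy as np

X_batch = np.random.randn(32, 10)  # one mini-batch: 32 examples, 10 features
y_batch = np.random.randn(32)
w, b = np.ones(10), 0.0            # current parameters

y_hat = X_batch @ w + b            # predictions for all 32 examples at once
error = y_batch - y_hat
grad_w = -2 * (error @ X_batch) / len(y_batch)  # gradient w.r.t. the weights
grad_b = -2 * error.mean()                      # gradient w.r.t. the intercept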

However, mini-batch Gradient Descent can be more sensitive to the learning rate and the choice of mini-batch size, both of which affect the stability and convergence of the algorithm. A learning rate that is too high can cause the algorithm to diverge, while one that is too low slows convergence. Similarly, a mini-batch size that is too small introduces noise into the gradient estimates, while one that is too large increases the computational cost per update and erodes the advantages over Batch Gradient Descent.
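
One way to see this sensitivity is to sweep a few learning rates and mini-batch sizes and compare test scores. Here is a minimal sketch using scikit-learn's SGDRegressor, fed one random mini-batch per partial_fit call as done later in this post (the grids of values are arbitrary choices, not recommendations):

import random
from sklearn.datasets import load_diabetes
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

for eta0 in [0.001, 0.01, 0.1]:
    for batch_size in [8, 32, 128]:
        sgd = SGDRegressor(learning_rate='constant', eta0=eta0)
        for _ in range(100):
            # one random mini-batch per update step
            idx = random.sample(range(X_train.shape[0]), batch_size)
            sgd.partial_fit(X_train[idx], y_train[idx])
        score = r2_score(y_test, sgd.predict(X_test))
        print(f"eta0={eta0}, batch_size={batch_size}: r2={score:.3f}")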

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
import numpy as np

X, y = load_diabetes(return_X_y=True)
print(X.shape)  # (442, 10)
print(y.shape)  # (442,)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# baseline: ordinary least squares via scikit-learn
reg = LinearRegression()
reg.fit(X_train, y_train)

print(reg.coef_)
# [  -9.16088483 -205.46225988  516.68462383  340.62734108 -895.54360867
#   561.21453306  153.88478595  126.73431596  861.12139955   52.41982836]
print(reg.intercept_)
# 151.88334520854633

y_pred = reg.predict(X_test)
r2_score(y_test, y_pred)
# 0.4399387660024645
import random

class MBGDRegressor:

    def __init__(self, batch_size, learning_rate=0.01, epochs=100):
        self.coef_ = None
        self.intercept_ = None
        self.lr = learning_rate
        self.epochs = epochs
        self.batch_size = batch_size

    def fit(self, X_train, y_train):
        # initialise the intercept to 0 and every coefficient to 1
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])

        for i in range(self.epochs):
            # roughly one pass over the training set per epoch,
            # one randomly drawn mini-batch at a time
            for j in range(int(X_train.shape[0] / self.batch_size)):
                idx = random.sample(range(X_train.shape[0]), self.batch_size)

                y_hat = np.dot(X_train[idx], self.coef_) + self.intercept_

                # squared-error gradient w.r.t. the intercept (averaged over the batch)
                intercept_der = -2 * np.mean(y_train[idx] - y_hat)
                self.intercept_ = self.intercept_ - (self.lr * intercept_der)

                # gradient w.r.t. the coefficients (note: summed, not averaged, over the batch)
                coef_der = -2 * np.dot((y_train[idx] - y_hat), X_train[idx])
                self.coef_ = self.coef_ - (self.lr * coef_der)

        print(self.intercept_, self.coef_)

    def predict(self, X_test):
        return np.dot(X_test, self.coef_) + self.intercept_
mbr = MBGDRegressor(batch_size=int(X_train.shape[0]/50), learning_rate=0.01, epochs=100)
mbr.fit(X_train, y_train)
# 154.83742294704174 [  38.40848324 -142.67633481  457.28746444  303.60926403  -17.99961807
#   -85.81943788 -192.04933133  116.18528414  407.24722272  105.8082595 ]

y_pred = mbr.predict(X_test)
r2_score(y_test, y_pred)
# 0.4518861449568681
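
One detail worth noting about the class above: fit draws every mini-batch independently with random.sample, so within a single epoch some rows can be visited twice and others skipped. A common alternative shuffles the indices once per epoch and slices them into non-overlapping batches. Here is a minimal standalone sketch of that variant (same update rule as above; the function name is illustrative):

def fit_shuffled(X_train, y_train, batch_size, lr=0.01, epochs=100):
    # non-overlapping mini-batches drawn from a fresh shuffle each epoch
    intercept = 0.0
    coef = np.ones(X_train.shape[1])
    for _ in range(epochs):
        indices = np.random.permutation(X_train.shape[0])
        for start in range(0, X_train.shape[0], batch_size):
            idx = indices[start:start + batch_size]
            y_hat = np.dot(X_train[idx], coef) + intercept
            intercept -= lr * (-2 * np.mean(y_train[idx] - y_hat))
            coef -= lr * (-2 * np.dot(y_train[idx] - y_hat, X_train[idx]))
    return intercept, coef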

from sklearn.linear_model import SGDRegressor

sgd = SGDRegressor(learning_rate='constant', eta0=0.1)
batch_size = 35

for i in range(100):
    # scikit-learn has no built-in mini-batch mode, so we feed one
    # random mini-batch per partial_fit call
    idx = random.sample(range(X_train.shape[0]), batch_size)
    sgd.partial_fit(X_train[idx], y_train[idx])

sgd.coef_
# array([  49.19545821,  -67.84533938,  338.57819421,  247.97315609,
#          25.30849249,  -24.71685159, -155.45845777,  116.19331239,
#         312.91250811,  133.36595993])
sgd.intercept_
# array([148.61489911])

y_pred = sgd.predict(X_test)
r2_score(y_test, y_pred)
# 0.4271789125617129

Stochastic Gradient Descent (SGD) converges faster but with more oscillations.

[Figure: Stochastic Gradient Descent (fast, but random)]

Batch Gradient Descent (BGD) converges more slowly but steadily towards the minimum of the cost function.

[Figure: Batch Gradient Descent (slow)]

Mini-Batch Gradient Descent (MBGD) is a compromise between the two, offering faster convergence than BGD and fewer oscillations than SGD.

[Figure: Mini-Batch Gradient Descent (fast)]

Here’s a more detailed comparison of the convergence behaviour of Batch Gradient Descent (BGD), Stochastic Gradient Descent (SGD), and Mini-Batch Gradient Descent (MBGD), followed by a minimal sketch of the three update loops:

BGD:

  • Computes the gradient of the cost function over the entire training set.
  • Updates the model parameters once per iteration.
  • Converges more slowly but steadily towards the minimum of the cost function.
  • Can be more stable than SGD and MBGD, but may take longer to converge.
  • Requires more memory and computational resources than SGD and MBGD.

SGD:

  • Computes the gradient of the cost function over a single training example.
  • Updates the model parameters once per example.
  • Converges faster but with more oscillations towards the minimum of the cost function.
  • Can be more sensitive to the choice of learning rate and the randomness of the data.
  • Requires less memory and computational resources than BGD.

MBGD:

  • Computes the gradient of the cost function over small random subsets of the training data (mini-batches).
  • Updates the model parameters once per mini-batch.
  • Offers a compromise between the speed of SGD and the stability of BGD.
  • Converges faster than BGD and with fewer oscillations than SGD.
  • Requires less memory and computational resources than BGD, but more than SGD.
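
To make the three update schedules concrete, here is a minimal side-by-side sketch for linear regression (illustrative helper names; gradients averaged over whatever slice of data each variant sees):

import numpy as np

def gradient_step(X_part, y_part, coef, intercept, lr):
    # one parameter update from the averaged squared-error gradient on X_part
    y_hat = X_part @ coef + intercept
    error = y_part - y_hat
    coef = coef + lr * 2 * (error @ X_part) / len(y_part)
    intercept = intercept + lr * 2 * error.mean()
    return coef, intercept

def train(X, y, mode, lr=0.01, epochs=10, batch_size=32):
    coef, intercept = np.ones(X.shape[1]), 0.0
    n = X.shape[0]
    for _ in range(epochs):
        if mode == "bgd":       # one update per epoch, on the full training set
            coef, intercept = gradient_step(X, y, coef, intercept, lr)
        elif mode == "sgd":     # one update per individual example
            for i in np.random.permutation(n):
                coef, intercept = gradient_step(X[i:i+1], y[i:i+1], coef, intercept, lr)
        elif mode == "mbgd":    # one update per mini-batch
            indices = np.random.permutation(n)
            for start in range(0, n, batch_size):
                idx = indices[start:start + batch_size]
                coef, intercept = gradient_step(X[idx], y[idx], coef, intercept, lr)
    return coef, intercept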
