Stochastic Gradient Descent

ajaymehta
7 min read · Apr 28, 2023


SGD stands for Stochastic Gradient Descent. It is an optimization algorithm used to minimize the cost function of a machine learning model. The key difference between SGD and Batch Gradient Descent (BGD) is that instead of using the entire dataset to compute the gradient at each step, SGD randomly selects a single data point (or a small subset of data points, called a mini-batch) to compute the gradient.

In SGD, the cost function and its gradient are computed on a single data point, and the model parameters are updated immediately. The process is repeated for all the data points in the dataset (or for each randomly selected mini-batch). Each individual update is therefore much cheaper than in BGD, which must process the entire dataset before making a single update; this is what makes SGD attractive for large datasets.
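To make the difference concrete, here is a minimal sketch on synthetic data (the toy dataset and variable names are illustrative, not from the original post) contrasting one epoch of BGD, which makes a single update from the full-dataset gradient, with one epoch of SGD, which makes one update per randomly chosen example:

import numpy as np

# toy linear-regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

w, b, lr = np.zeros(X.shape[1]), 0.0, 0.01

# Batch Gradient Descent: ONE update per epoch, gradient over all samples
y_hat = X @ w + b
w_grad = -2 * X.T @ (y - y_hat) / len(X)
b_grad = -2 * np.mean(y - y_hat)
w, b = w - lr * w_grad, b - lr * b_grad

# Stochastic Gradient Descent: one update PER SAMPLE, each from a single random point
for _ in range(len(X)):
    i = rng.integers(len(X))
    err = y[i] - (X[i] @ w + b)
    w = w - lr * (-2 * err * X[i])
    b = b - lr * (-2 * err)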

One key advantage of SGD is that it can help the model avoid getting stuck in a poor local minimum, which can happen with BGD. With SGD, the model parameters are updated more frequently and each gradient is computed from a random data point, which introduces noise into the optimization process; that noise can let the optimizer escape from a suboptimal local minimum.

However, since SGD updates the parameters using only a single data point or a small subset of data points, its updates are noisier than those of BGD. This can slow convergence and lead to oscillations or instability in the optimization process. To mitigate this, the learning rate in SGD is often reduced over time to smooth out the updates.

Overall, SGD is a powerful optimization algorithm that is widely used in deep learning, especially for large-scale datasets, due to its computational efficiency and ability to handle noisy data. However, it requires careful tuning of the learning rate and mini-batch size to ensure convergence and stability.

Time Comparison between SGD and BGD

If we compare SGD and BGD with the same number of epochs, the same hardware, and the same learning rate, SGD (with a mini-batch size of 1) will perform far more parameter updates than BGD in each epoch. This is because SGD computes a gradient and updates the parameters for each individual data point, while BGD computes one gradient over the entire dataset and updates the parameters once per epoch.

For example, if we have 10,000 data points and we train for 10 epochs, then with SGD we will perform 100,000 (10,000 × 10) parameter updates. In contrast, with BGD we will perform only 10 updates (one per epoch).
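As a quick back-of-the-envelope check (the batch size of 32 below is just an illustrative choice for the mini-batch case), the number of parameter updates scales like this:

n_samples, epochs, batch_size = 10_000, 10, 32

sgd_updates = n_samples * epochs                     # 100,000 updates (batch size 1)
bgd_updates = epochs                                 # 10 updates (one per epoch)
mbgd_updates = (n_samples // batch_size) * epochs    # 3,120 updates with batches of 32
print(sgd_updates, bgd_updates, mbgd_updates)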

Based on the previous explanations, if we compare SGD and BGD with the same number of epochs and assume the same hardware and learning rate, SGD may take more time than BGD to complete training. This is because SGD performs many more parameter updates per epoch, and each of those updates carries its own overhead.

However, it is important to note that the actual running time of the algorithms depends on many factors, such as the size of the dataset, the complexity of the model, the implementation details, and the choice of hyperparameters. Therefore, in some cases, it is possible that SGD can be faster than BGD even with the same number of epochs.

Overall, the choice between SGD and BGD depends on the specific problem and the trade-offs between convergence speed and computational efficiency. SGD is often preferred for large datasets or complex models due to its ability to update the model parameters more frequently and its scalability to parallel and distributed computing environments. BGD, on the other hand, may be preferred for smaller datasets or simpler models where computational efficiency is more important.

from sklearn.datasets import load_diabetes

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
import time
X,y = load_diabetes(return_X_y=True)
print(X.shape)
print(y.shape)
>>> (442, 10)
>>> (442,)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=2)
reg = LinearRegression()
reg.fit(X_train,y_train)
>>> LinearRegression()
print(reg.coef_)
print(reg.intercept_)
>>> [ -9.16088483 -205.46225988 516.68462383 340.62734108 -895.54360867
561.21453306 153.88478595 126.73431596 861.12139955 52.41982836]
>>> 151.88334520854633
y_pred = reg.predict(X_test)
r2_score(y_test,y_pred)
>>> 0.4399387660024645
class SGDRegressor:

    def __init__(self,learning_rate=0.01,epochs=100):

        self.coef_ = None
        self.intercept_ = None
        self.lr = learning_rate
        self.epochs = epochs

    def fit(self,X_train,y_train):
        # init your coefs
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])

        for i in range(self.epochs):
            for j in range(X_train.shape[0]):
                # pick one random training example for this update
                idx = np.random.randint(0,X_train.shape[0])

                y_hat = np.dot(X_train[idx],self.coef_) + self.intercept_

                # gradients of the squared error for this single example
                intercept_der = -2 * (y_train[idx] - y_hat)
                self.intercept_ = self.intercept_ - (self.lr * intercept_der)

                coef_der = -2 * np.dot((y_train[idx] - y_hat),X_train[idx])
                self.coef_ = self.coef_ - (self.lr * coef_der)

        print(self.intercept_,self.coef_)

    def predict(self,X_test):
        return np.dot(X_test,self.coef_) + self.intercept_
sgd = SGDRegressor(learning_rate=0.01,epochs=40)
start = time.time()
sgd.fit(X_train,y_train)

print("The time taken is",time.time() - start)
>>> 157.2641767094034 [ 58.34618831 -41.65635944 314.02875897 219.47021721 30.2641263
-8.98591625 -165.98040412 132.42543708 294.42536821 122.71528461]
>>> The time taken is 0.17459607124328613
y_pred = sgd.predict(X_test)
r2_score(y_test,y_pred)
>>> 0.4205755931585481
from sklearn.linear_model import SGDRegressor
reg = SGDRegressor(max_iter=100,learning_rate='constant',eta0=0.01)
reg.fit(X_train,y_train)
C:\Users\91842\anaconda3\lib\site-packages\sklearn\linear_model\_stochastic_gradient.py:1220: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
warnings.warn("Maximum number of iteration reached before "
>>> SGDRegressor(learning_rate='constant', max_iter=100)
y_pred = reg.predict(X_test)
r2_score(y_test,y_pred)
>>> 0.43059547063734904
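The ConvergenceWarning above just means the model stopped after the 100 allowed passes over the data; one common remedy (a sketch with illustrative values, not the post's original settings) is to raise max_iter and rely on the tol stopping criterion:

reg = SGDRegressor(max_iter=1000, tol=1e-3, learning_rate='constant', eta0=0.01)
reg.fit(X_train, y_train)
print(r2_score(y_test, reg.predict(X_test)))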

Showing Convergence

Stochastic Gradient Descent vs. Batch Gradient Descent

In SGD you can see this kind of random behavior in how the fit moves towards the solution.

[Figures: convergence animations for Stochastic Gradient Descent and Batch Gradient Descent]

In the case of SGD, the cost function fluctuates more randomly over the iterations, as the algorithm updates the model parameters based on a single example at a time. This can be visualized as a more erratic convergence of the line towards the optimal values of the model parameters. However, the fluctuations can eventually average out over multiple iterations, leading to convergence towards the minimum of the cost function.

[Figure: cost fluctuations under Stochastic Gradient Descent]

In the case of BGD, the cost function decreases smoothly and gradually over the iterations, as the algorithm takes steps in the direction of the steepest gradient of the cost function. This can be visualized as a smooth and gradual convergence of the line towards the optimal values of the model parameters.

[Figure: smooth cost decrease under Batch Gradient Descent]
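One way to see this behavior for yourself is to record the cost after every update and plot the two curves. The sketch below (which assumes matplotlib is available and reuses the X_train/y_train split from earlier; the learning rate and epoch count are illustrative) tracks the training MSE per SGD update and per BGD epoch:

import matplotlib.pyplot as plt

def mse(X, y, w, b):
    return np.mean((y - (X @ w + b)) ** 2)

w_sgd, b_sgd = np.zeros(X_train.shape[1]), 0.0
w_bgd, b_bgd = np.zeros(X_train.shape[1]), 0.0
lr, epochs = 0.01, 40
sgd_losses, bgd_losses = [], []

for _ in range(epochs):
    # SGD: one noisy update per randomly chosen sample
    for _ in range(X_train.shape[0]):
        i = np.random.randint(X_train.shape[0])
        err = y_train[i] - (X_train[i] @ w_sgd + b_sgd)
        w_sgd = w_sgd + lr * 2 * err * X_train[i]
        b_sgd = b_sgd + lr * 2 * err
        sgd_losses.append(mse(X_train, y_train, w_sgd, b_sgd))

    # BGD: one smooth update per epoch from the full-dataset gradient
    err = y_train - (X_train @ w_bgd + b_bgd)
    w_bgd = w_bgd + lr * 2 * X_train.T @ err / len(X_train)
    b_bgd = b_bgd + lr * 2 * np.mean(err)
    bgd_losses.append(mse(X_train, y_train, w_bgd, b_bgd))

plt.plot(sgd_losses, label="SGD (per update)")
plt.plot(np.linspace(0, len(sgd_losses), len(bgd_losses)), bgd_losses, label="BGD (per epoch)")
plt.xlabel("update")
plt.ylabel("training MSE")
plt.legend()
plt.show()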

Stochastic Gradient Descent (SGD) is a popular optimization algorithm used in machine learning for training models. SGD is particularly useful when dealing with large datasets or complex models, where computing the gradient over the entire dataset may be impractical or inefficient.

Here are some situations where SGD may be a good choice:

  1. Large datasets: When working with large datasets, computing the gradient over the entire dataset can be very computationally expensive. In such cases, SGD can be used to update the model parameters using a mini-batch of data points, which can significantly reduce the computation time.
  2. Non-convex optimization problems: In non-convex optimization problems, there may be many local minima, making it challenging for optimization algorithms to find the global minimum. The noise in SGD's updates can help the optimizer escape local minima and reach a better solution.

[Figure: convex and non-convex functions in 3D]

  3. Online learning: In online learning, the data arrives as a stream, and the model needs to be updated in real time. In such cases, SGD can update the model parameters using one data point (or one small chunk) at a time, making it ideal for online learning scenarios; see the partial_fit sketch after this list.
  4. Regularization: SGD can also be combined with regularization techniques such as L1 and L2 regularization, which help prevent overfitting and improve the generalization performance of the model.
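For the online-learning scenario in point 3, scikit-learn's SGDRegressor exposes partial_fit, which performs a single incremental pass over whatever data you hand it. A minimal sketch (the streaming loop and chunk size are illustrative; in a real stream the chunks would arrive over time):

from sklearn.linear_model import SGDRegressor

online_reg = SGDRegressor(learning_rate='constant', eta0=0.01)

# pretend the training data arrives in small chunks over time
chunk_size = 32
for start in range(0, X_train.shape[0], chunk_size):
    X_chunk = X_train[start:start + chunk_size]
    y_chunk = y_train[start:start + chunk_size]
    online_reg.partial_fit(X_chunk, y_chunk)   # one incremental update per chunk

print(r2_score(y_test, online_reg.predict(X_test)))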

Learning Rate Scheduler

Learning rate schedule: Using a fixed learning rate may not be optimal since the optimal learning rate can change during training. Solution: Use a learning rate schedule that gradually decreases the learning rate over time, such as reducing the learning rate by a factor of 2 after a fixed number of epochs.
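As an illustration, a simple step-decay schedule could be dropped into the custom SGDRegressor class above (the drop interval and decay factor below are arbitrary choices for demonstration):

def step_decay(initial_lr, epoch, drop_every=10, factor=0.5):
    # halve the learning rate every `drop_every` epochs
    return initial_lr * (factor ** (epoch // drop_every))

for epoch in [0, 9, 10, 20, 39]:
    print(epoch, step_decay(0.01, epoch))
# 0.01 for epochs 0-9, 0.005 for epochs 10-19, 0.0025 for epochs 20-29, and so on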

Here’s a more detailed comparison of the convergence of Batch Gradient Descent (BGD), Stochastic Gradient Descent (SGD), and Mini-Batch Gradient Descent (MBGD):

BGD:

  • Computes the gradient of the cost function over the entire training set.
  • Updates the model parameters once per iteration.
  • Converges more slowly but steadily towards the minimum of the cost function.
  • Can be more stable than SGD and MBGD, but may take longer to converge.
  • Requires more memory and computational resources than SGD and MBGD.

SGD:

  • Computes the gradient of the cost function over a single training example.
  • Updates the model parameters once per example.
  • Converges faster but with more oscillations towards the minimum of the cost function.
  • Can be more sensitive to the choice of learning rate and the randomness of the data.
  • Requires less memory and computational resources than BGD.

MBGD:

  • Computes the gradient of the cost function over small random subsets of the training data (mini-batches).
  • Updates the model parameters once per mini-batch.
  • Offers a compromise between the speed of SGD and the stability of BGD.
  • Converges faster than BGD and with fewer oscillations than SGD.
  • Requires less memory and computational resources than BGD, but more than SGD.
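All three variants differ only in how many examples feed each update, so a mini-batch version can be sketched in the same style as the earlier class (the class name MBGDRegressor, batch size, and hyperparameters are illustrative, not from the original post):

class MBGDRegressor:

    def __init__(self, batch_size=32, learning_rate=0.01, epochs=100):
        self.coef_ = None
        self.intercept_ = None
        self.lr = learning_rate
        self.epochs = epochs
        self.batch_size = batch_size

    def fit(self, X_train, y_train):
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])

        for i in range(self.epochs):
            for j in range(X_train.shape[0] // self.batch_size):
                # pick a random mini-batch and update once per batch
                idx = np.random.choice(X_train.shape[0], self.batch_size, replace=False)
                y_hat = np.dot(X_train[idx], self.coef_) + self.intercept_

                intercept_der = -2 * np.mean(y_train[idx] - y_hat)
                self.intercept_ = self.intercept_ - (self.lr * intercept_der)

                coef_der = -2 * np.dot((y_train[idx] - y_hat), X_train[idx]) / self.batch_size
                self.coef_ = self.coef_ - (self.lr * coef_der)

    def predict(self, X_test):
        return np.dot(X_test, self.coef_) + self.intercept_

mbgd = MBGDRegressor(batch_size=32, learning_rate=0.01, epochs=100)
mbgd.fit(X_train, y_train)
print(r2_score(y_test, mbgd.predict(X_test)))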
