Stochastic Gradient Descent (SGD)

Rishi
5 min read · May 4, 2023

Introduction

Stochastic Gradient Descent (SGD) is a popular optimization algorithm used in machine learning to train models that can make predictions on new data. In this article, we will explore the basics of SGD, including how it works, its advantages and disadvantages, and how to implement it in Python.

What is Stochastic Gradient Descent?

Gradient descent is an optimization algorithm used in machine learning to minimize a cost function. In supervised learning, the cost function measures how well the model is performing on the training data. The goal of gradient descent is to find the set of model parameters that minimizes the cost function.
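
For example, for a linear model trained with mean squared error, the cost function can be written in a few lines of NumPy. This is a minimal sketch with hypothetical names, assuming X is the feature matrix, y the targets, and w, b the model parameters:

import numpy as np

def mse_cost(X, y, w, b):
    # Mean squared error of the linear model y_hat = X @ w + b
    y_hat = X @ w + b
    return np.mean((y_hat - y) ** 2)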

Stochastic gradient descent is a variant of gradient descent that updates the model parameters using only a subset of the training data, called a mini-batch, at each iteration. The mini-batch size is typically small, ranging from 1 to a few hundred, which makes stochastic gradient descent computationally efficient compared to standard gradient descent, which uses the entire training set at each iteration.
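
As an illustration (a sketch with hypothetical names, assuming the training data lives in NumPy arrays), drawing a mini-batch at each iteration looks like this:

import numpy as np

def sample_minibatch(X, y, batch_size=32):
    # Pick `batch_size` random rows of the training data
    idx = np.random.choice(len(X), size=batch_size, replace=False)
    return X[idx], y[idx]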

How does Stochastic Gradient Descent work?

The basic idea behind stochastic gradient descent is to start with an initial set of model parameters and iteratively update them in the direction of the steepest descent of the cost function. The direction of the steepest descent is given by the negative gradient of the cost function with respect to the model parameters.

At each iteration, we randomly select a mini-batch of training examples and compute the gradient of the cost function with respect to the model parameters using only the examples in the mini-batch. We then update the model parameters using the gradient and a learning rate, which controls the step size of the update.
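
In code, a single update step amounts to one line per parameter. The sketch below assumes the mini-batch gradient grad has already been computed for the current parameters theta (both names are placeholders):

def sgd_step(theta, grad, learning_rate=0.01):
    # Move the parameters a small step against the gradient
    return theta - learning_rate * grad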

The learning rate is a hyperparameter that determines how much to adjust the model parameters at each iteration. A high learning rate can cause the model parameters to oscillate around the minimum of the cost function, while a low learning rate can cause the model to converge slowly. The learning rate is typically set using a grid search or a learning rate schedule that adapts the learning rate during training.
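
As one illustrative example of a schedule (step or exponential decay are equally common choices), an inverse-scaling schedule shrinks the learning rate as the iteration count t grows:

def inverse_scaling_lr(eta0, t, power=0.25):
    # Learning rate at iteration t, starting from eta0 and decaying
    # polynomially; a larger `power` means faster decay
    return eta0 / (t + 1) ** power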

Here’s how the SGD algorithm works:

  1. Initialize the model parameters with some random values.
  2. Repeat until convergence:
     • Shuffle the training examples randomly.
     • For each example in the training set:
       • Compute the gradient of the loss function with respect to the model parameters using the current example.
       • Update the model parameters in the opposite direction of the gradient.

In other words, instead of computing the gradient over the entire training set, we randomly select a single training example and use it to update the parameters. We then repeat this process until the algorithm converges.

The Python code for SGD:

import numpy as np

def sgd(X, y, learning_rate=0.01, epochs=100):
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0

    for epoch in range(epochs):
        for i in range(m):
            # Select a random training example
            j = np.random.randint(m)
            xi = X[j]
            yi = y[j]

            # Compute the gradient of the loss function with respect to w and b
            dw, db = compute_gradient(xi, yi, w, b)

            # Update the parameters
            w -= learning_rate * dw
            b -= learning_rate * db

    return w, b

Here, we loop over the number of epochs, and within each epoch we perform m updates, each time picking a training example at random. For every selected example we compute the gradient with respect to the model parameters using the compute_gradient() function and update the parameters using the learning rate.
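
The compute_gradient() function is left undefined above. As a hedged example, here is what it could look like for a linear model trained with squared loss (one possible choice; the exact gradient depends on your model and loss function):

def compute_gradient(xi, yi, w, b):
    # Gradient of the squared error 0.5 * (prediction - yi)**2
    # with respect to w and b, for a single example (xi, yi)
    error = np.dot(xi, w) + b - yi
    dw = error * xi
    db = error
    return dw, db

With this definition, the sgd() function above fits an ordinary least-squares regression one example at a time.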

Advantages of Stochastic Gradient Descent

  1. Computationally efficient: Stochastic gradient descent is computationally efficient because it only uses a mini-batch of training examples at each iteration, which reduces the memory requirements and allows for parallel processing.
  2. Converges faster: Stochastic gradient descent can converge faster than standard gradient descent because it updates the model parameters more frequently, which allows it to escape from local minima more easily.
  3. Good for large datasets: Stochastic gradient descent is well-suited for large datasets because it can update the model parameters using a subset of the data, which makes it more memory efficient than standard gradient descent.

Disadvantages of Stochastic Gradient Descent

  1. Can be noisy: Stochastic gradient descent can be noisy because it updates the model parameters using only a subset of the training data at each iteration, which can cause the cost function to fluctuate.
  2. May converge to a local minimum: Stochastic gradient descent can converge to a local minimum of the cost function, which may not be the global minimum. To overcome this problem, we can use techniques like momentum, which can help the algorithm escape local minima (see the sketch after this list).
  3. Learning rate selection: Selecting an appropriate learning rate can be challenging, as a high learning rate can cause the algorithm to diverge, while a low learning rate can cause the algorithm to converge slowly.
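
To make the momentum idea from point 2 concrete, here is a minimal sketch (hypothetical names, not tied to any particular library): the update keeps a running velocity that smooths the noisy per-example gradients and can carry the parameters through shallow local minima.

def momentum_update(theta, v, grad, learning_rate=0.01, beta=0.9):
    # Classical momentum: accumulate an exponentially decaying
    # average of past gradients and step along that direction
    v = beta * v + grad
    theta = theta - learning_rate * v
    return theta, v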

Implementing Stochastic Gradient Descent in Python

Let’s now look at an example of how to implement stochastic gradient descent in Python using scikit-learn’s SGDRegressor class. We will use the California Housing dataset, which contains information about housing districts in California and their corresponding median house values.

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the dataset
california = fetch_california_housing()

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(california.data, california.target, test_size=0.2, random_state=42)

# Scale the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Create an instance of the SGDRegressor class
sgd = SGDRegressor(max_iter=1000, eta0=0.01, penalty='l2', random_state=42)

# Train the model using stochastic gradient descent
sgd.fit(X_train, y_train)

# Evaluate the model on the testing data
score = sgd.score(X_test, y_test)

print('Score: %.2f' % score)

In this example, we first load the California Housing dataset and split it into training and testing sets using the train_test_split function. We then scale the data using StandardScaler, which standardizes the features by subtracting the mean and dividing by the standard deviation.

Next, we create an instance of the SGDRegressor class and specify the maximum number of iterations (max_iter), the initial learning rate (eta0), the regularization term (penalty), and the random seed (random_state). We then train the model using stochastic gradient descent by calling the fit method on the training data.
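
If you prefer not to pick these values by hand, a grid search over the SGDRegressor hyperparameters is a common option. The sketch below uses an illustrative search space (the specific grid values are assumptions, not recommendations):

from sklearn.model_selection import GridSearchCV

param_grid = {
    'eta0': [0.001, 0.01, 0.1],   # initial learning rate
    'alpha': [1e-5, 1e-4, 1e-3],  # strength of the l2 penalty
}
search = GridSearchCV(
    SGDRegressor(max_iter=1000, penalty='l2', random_state=42),
    param_grid,
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_)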

Finally, we evaluate the model on the testing data using the score method, which returns the coefficient of determination (R²) of the prediction. On this dataset the model reaches an R² of roughly 0.6, which means it explains about 60% of the variance in the testing data.

Conclusion

Stochastic gradient descent is a powerful optimization algorithm that can be used to train machine learning models efficiently on large datasets. It has many advantages, including computational efficiency, faster convergence, and suitability for large datasets. However, it also has some disadvantages, such as noise and the potential to converge to a local minimum. By understanding these advantages and disadvantages, we can use stochastic gradient descent effectively and choose appropriate hyperparameters to optimize its performance.
