Deep learning basics — weight decay

Sophia Yang, Ph.D.
Published in Analytics Vidhya
2 min read, Sep 4, 2020


What is weight decay?

Weight decay is a regularization technique that adds a small penalty to the loss function, usually the squared L2 norm of the weights (all the weights of the model).

loss = loss + weight_decay * (squared L2 norm of the weights)
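To make the formula concrete, here is a minimal sketch in plain Python. The helper name `penalized_loss`, the weights, and the loss value are all made up for illustration, and the penalty is assumed to be the sum of squared weights (the squared L2 norm), which is the usual form.

```python
def penalized_loss(data_loss, weights, weight_decay):
    """Return data_loss plus weight_decay times the sum of squared weights."""
    l2_penalty = sum(w ** 2 for w in weights)
    return data_loss + weight_decay * l2_penalty


# Hypothetical values, chosen only to make the arithmetic easy to follow.
weights = [3.0, -4.0]   # sum of squares = 9 + 16 = 25
data_loss = 0.5
weight_decay = 1e-4

total = penalized_loss(data_loss, weights, weight_decay)
print(total)  # roughly 0.5 + 1e-4 * 25 = 0.5025
```

With a small weight decay parameter the penalty is a tiny fraction of the loss, but it grows quadratically as the weights get larger.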

Some people prefer to apply weight decay only to the weights and not to the biases. By default, PyTorch applies weight decay to every parameter passed to the optimizer, so both weights and biases are decayed.
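If we want to skip the biases, one possible approach is to use PyTorch's per-parameter options and put the biases in their own group with weight_decay set to 0. The sketch below assumes a small toy model built from nn.Linear layers (a hypothetical model, not one from this post) and splits parameters by checking whether their name ends in "bias".

```python
import torch

# Hypothetical toy model, used only so we have some weights and biases.
model = torch.nn.Sequential(
    torch.nn.Linear(4, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 1),
)

decay, no_decay = [], []
for name, param in model.named_parameters():
    # nn.Linear names its parameters "weight" and "bias".
    if name.endswith("bias"):
        no_decay.append(param)
    else:
        decay.append(param)

# Each dict is a parameter group with its own weight_decay setting.
optimizer = torch.optim.SGD(
    [
        {"params": decay, "weight_decay": 1e-4},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-3,
)
```

The name-based split is a simplifying assumption. Models with other parameter names (for example normalization layers) may need a different rule.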

Why do we use weight decay?

  • To prevent overfitting.
  • To keep the weights small and avoid exploding gradients. Because the L2 norm of the weights is added to the loss, each training iteration minimizes the size of the weights along with the original loss. This keeps the weights small, prevents them from growing out of control, and thus helps avoid exploding gradients.
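To see why the penalty keeps the weights small, it helps to look at a single weight under plain SGD. The sketch below assumes the penalty contributes weight_decay * w to the gradient, which is how torch.optim.SGD applies its weight_decay argument. The function name `sgd_step` and all the numbers are hypothetical, and the data gradient is set to zero so that only the effect of weight decay is visible.

```python
def sgd_step(w, grad, lr, weight_decay):
    """One plain SGD step where the penalty adds weight_decay * w to the gradient."""
    return w - lr * (grad + weight_decay * w)


lr = 0.1
weight_decay = 0.5   # deliberately large so the shrinkage is easy to see
w = 1.0

history = [w]
for _ in range(5):
    # grad=0.0 means the data loss is not pushing the weight anywhere.
    w = sgd_step(w, grad=0.0, lr=lr, weight_decay=weight_decay)
    history.append(w)

print(history)  # each value is (1 - lr * weight_decay) = 0.95 times the previous one
```

Every step multiplies the weight by a factor slightly below 1, so the weight "decays" toward zero unless the data loss pushes it back. This is where the name weight decay comes from.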

How do we use weight decay?

To use weight decay, we can simply set the weight_decay parameter of the torch.optim.SGD optimizer or the torch.optim.Adam optimizer. Here we use 1e-4 as a starting value for weight_decay (PyTorch's own default is 0, which means no weight decay).

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=1e-4)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
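For context, here is a minimal sketch of how the optimizer fits into a training loop. The random data, the single linear layer, and the number of steps are all assumptions made for illustration, not a recommended setup.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy regression problem with random data.
x = torch.randn(64, 10)
y = torch.randn(64, 1)
model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=1e-4)

losses = []
for step in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)   # data loss only, no penalty term here
    loss.backward()
    optimizer.step()              # weight decay is applied inside this call
    losses.append(loss.item())
```

Note that we never add the penalty to the loss ourselves. The optimizer adds the weight decay term to the gradients during optimizer.step(), so the loss values we log are the plain data loss.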

