You have built a deep network (DN), but the predictions are garbage. How are you going to troubleshoot the problem? In this article, we describe some of the most common problems in a deep network implementation. If you have not read Part 4: Visualize Deep Network models and metrics, please read it first. We need to know what to look for before fixing anything.
The 6-part series for “How to start a Deep Learning project?” consists of:
· Part 1: Start a Deep Learning project.
· Part 2: Build a Deep Learning dataset.
· Part 3: Deep Learning designs.
· Part 4: Visualize Deep Network models and metrics.
· Part 5: Debug a Deep Learning Network.
· Part 6: Improve Deep Learning Models performance & network tuning.
Troubleshooting steps for Deep Learning
In early development, we are fighting multiple battles at the same time. As mentioned before, Deep Learning (DL) training consists of millions of iterations to build a model. Locating bugs is hard, and the model breaks easily. Start with something simple and make changes incrementally. Model optimizations like regularization can always wait until after the code is debugged. Focus on verifying that the model is functioning first.
- Set the regularization factors to zero.
- No other regularization (including dropouts).
- Use the Adam optimizer with default settings.
- Use ReLU.
- No data augmentation.
- Fewer DN layers.
- Scale your input data but avoid unnecessary pre-processing.
- Don’t waste time on long training runs or large batch sizes.
Overfitting the model with a small amount of training data is the best way to debug deep learning. If the loss does not drop within a few thousand iterations, debug the code further. Achieve your first milestone by beating the odds of guessing. Then make incremental modifications to the model: add more layers and customization. Train it with the full training dataset. Add regularization to control the overfitting by monitoring the accuracy gap between the training and validation datasets.
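The overfitting check described above can be sketched with a toy model. This is a minimal illustration, not the article's own code: the softmax classifier, the random toy data, and the name `overfit_tiny_dataset` are all assumptions chosen to keep the example self-contained.

```python
import numpy as np

def overfit_tiny_dataset(steps=2000, lr=0.1, seed=0):
    """Train a tiny softmax classifier on 8 samples and return the loss history."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(8, 16))           # 8 samples, 16 features (toy data)
    y = rng.integers(0, 3, size=8)         # 3 classes, random labels
    n = x.shape[0]
    w = rng.normal(scale=0.01, size=(16, 3))
    b = np.zeros(3)
    losses = []
    for _ in range(steps):
        logits = x @ w + b
        logits -= logits.max(axis=1, keepdims=True)    # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        losses.append(float(-np.log(probs[np.arange(n), y]).mean()))
        grad = probs.copy()                # gradient of the loss w.r.t. logits
        grad[np.arange(n), y] -= 1.0
        grad /= n
        w -= lr * (x.T @ grad)
        b -= lr * grad.sum(axis=0)
    return losses

losses = overfit_tiny_dataset()
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

If the last loss is not clearly below the first one, and below the random-guessing loss for 3 classes, the training code itself is the first suspect.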
If stuck, take out all bells and whistles and solve a smaller problem.
Many hyperparameters are more relevant to model optimization than to debugging. Turn them off or use default values. Use the Adam optimizer. It is fast and efficient, and the default learning rate does well. Early problems come mostly from bugs rather than from the model design or tuning. Go through the checklist in the next section before any tuning. Those issues are more common and easier to verify. If the loss still does not drop after verifying the checklist, tune the learning rate. If the loss drops too slowly, increase the learning rate by a factor of 10. If the loss goes up or the gradient explodes, decrease the learning rate by a factor of 10. Repeat the process until the loss drops gradually and nicely. Typical learning rates are between 1 and 1e-7.
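The factor-of-10 rule for the learning rate can be written down as a small helper. This is only a sketch: the function name `suggest_learning_rate` and the `slow_drop` threshold are hypothetical and should be adapted to your own loss curve.

```python
import math

def suggest_learning_rate(losses, lr, slow_drop=0.01):
    """Suggest a new learning rate from a short window of recent losses.

    slow_drop is a hypothetical threshold: a relative improvement below it
    counts as "dropping too slowly".
    """
    first, last = losses[0], losses[-1]
    if any(not math.isfinite(v) for v in losses) or last > first:
        return lr / 10      # loss went up or blew up: the step is too large
    relative_drop = (first - last) / max(abs(first), 1e-12)
    if relative_drop < slow_drop:
        return lr * 10      # loss barely moves: the step is too small
    return lr               # loss drops steadily: keep the current rate
```

For example, a window such as `[2.30, 2.31, 2.45]` would lead to a ten times smaller rate, while `[2.3, 1.9, 1.5]` would keep the current one.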
- Visualize and verify the input data (after data pre-processing and before feeding to the model).
- Verify the accuracy of the input labels (after data shuffle if applicable).
- Do not feed the same batch of data over and over.
- Scale your input properly (likely between -1 and 1 and zero centered).
- Verify the range of your output (e.g. between -1 and 1).
- Always use the mean/variance from the training dataset to rescale the validation/testing dataset.
- All input data to the model has the same dimensions.
- Assess the overall quality of the dataset. (Are there too many outliers or bad samples?)
- The model parameters are initialized correctly. The weights are not set to all 0.
- Debug layers whose activations or gradients diminish or explode (from the rightmost to the leftmost layers).
- Debug layers whose weights are mostly zero or too large.
- Verify and test your loss function.
- For a pre-trained model, verify that your input data range matches the range used in the model.
- Dropout in inference and testing should always be off.
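One checklist item that is easy to get wrong is rescaling the validation and testing data with the training statistics. The sketch below assumes simple per-feature standardization with NumPy; the helper names `fit_scaler` and `apply_scaler` are illustrative, not part of any library.

```python
import numpy as np

def fit_scaler(train):
    """Compute per-feature mean and standard deviation from training data only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0) + 1e-8     # avoid division by zero for constant features
    return mean, std

def apply_scaler(data, mean, std):
    """Rescale any split (train, validation, test) with the training statistics."""
    return (data - mean) / std

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))
val = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

mean, std = fit_scaler(train)
train_scaled = apply_scaler(train, mean, std)
val_scaled = apply_scaler(val, mean, std)   # reuses the training mean and std
```

The validation set will not be exactly zero-centered after this step, and that is expected: it is rescaled the same way the model's training inputs were.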
Initializing the weights to all zeros is one of the most common mistakes, and the DN will never learn anything. Weights should be initialized with a Gaussian distribution.
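A minimal sketch of Gaussian initialization with NumPy follows. The article only says "Gaussian distribution"; the default standard deviation of sqrt(2 / fan_in) is an assumption here (a scaling commonly used with ReLU layers), and the function name `init_weights` is hypothetical.

```python
import numpy as np

def init_weights(fan_in, fan_out, std=None, seed=None):
    """Draw a (fan_in, fan_out) weight matrix from a zero-mean Gaussian."""
    rng = np.random.default_rng(seed)
    if std is None:
        std = np.sqrt(2.0 / fan_in)    # assumption: scaling often used with ReLU
    return rng.normal(0.0, std, size=(fan_in, fan_out))

w = init_weights(256, 128, seed=0)
print(w.shape, float(w.mean()), float(w.std()))
```

Printing the mean and standard deviation right after initialization is a cheap way to confirm that the weights are neither all zero nor unexpectedly large.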
Scaling & normalization
Scaling and normalization are well understood but remain among the most overlooked problems. If the input features and the node outputs are normalized, the model will be much easier to train. If this is not done correctly, the loss will not drop regardless of the learning rate. We should monitor the histograms of the input features and of the node outputs for each layer (before the activation functions). Always scale the input properly. For the node outputs, the ideal shape is zero-centered with values that are not too large (positively or negatively). If that is not the case and we encounter gradient problems in that layer, apply batch normalization for convolution layers and layer normalization for RNN cells.
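When a visualization tool is not at hand, a few summary numbers per layer already reveal badly scaled outputs. The helper below is a sketch under assumptions: `summarize_pre_activations` is a hypothetical name, and the `large` threshold of 5.0 is an arbitrary choice for "too large".

```python
import numpy as np

def summarize_pre_activations(name, z, large=5.0, bins=10):
    """Summarize one layer's outputs (before the activation function) for a batch."""
    counts, _ = np.histogram(z, bins=bins)
    return {
        "layer": name,
        "mean": float(z.mean()),                           # should be near zero
        "std": float(z.std()),
        "frac_large": float((np.abs(z) > large).mean()),   # share of extreme values
        "hist": counts.tolist(),
    }

rng = np.random.default_rng(0)
z = rng.normal(size=(64, 100))                  # mocked pre-activation outputs
healthy = summarize_pre_activations("fc1", z)
shifted = summarize_pre_activations("fc1", z + 20.0)
print(healthy["mean"], shifted["mean"], shifted["frac_large"])
```

A mean far from zero or a high share of extreme values, as in the shifted example, marks a layer worth inspecting before touching the learning rate.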
Verify and test the correctness of your loss function. The loss of your model must be lower than the loss from random guessing. For example, in a classification problem with 10 classes, the cross entropy loss for random guessing is -ln(1/10), or about 2.3.
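The random-guessing baseline is easy to compute and makes a handy first test of a loss function. The sketch below uses a deliberately simple single-sample cross entropy; the function names are illustrative.

```python
import math

def cross_entropy(probs, label):
    """Cross entropy of one sample given predicted class probabilities."""
    return -math.log(probs[label])

def random_guess_loss(num_classes):
    """Loss of a model that assigns the same probability to every class."""
    return -math.log(1.0 / num_classes)

uniform = [1.0 / 10] * 10
print(cross_entropy(uniform, 3), random_guess_loss(10))   # both are about 2.30
```

If a freshly initialized classifier reports a loss far from this baseline, the loss function or the output layer deserves a closer look.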
Review what is doing badly (errors) and improve it. Visualize your errors. In our project, the model performs badly for images with highly entangled structure. Identify the model's weaknesses to make changes. For example, add more convolution layers with smaller filters to disentangle small features. Augment data if necessary, or collect more similar samples to train the model better. In some situations, you may want to remove those samples and constrain yourself to a more focused model.
Turn off regularization (overfit the model) until it makes reasonable predictions.
Once the model code is working, the next parameters to tune are the regularization factors. We increase the volume of our training data and then increase the regularization to narrow the gap between the training and the validation accuracy. Do not overdo it, as we want a slightly overfit model to work with. Monitor both the data cost and the regularization cost closely. The regularization loss should not dominate the data loss over prolonged periods. If the gap does not narrow even with very large regularization, debug the regularization code or method first.
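To watch whether the regularization cost dominates the data cost, both terms can be logged separately. The sketch below assumes an L2 penalty, which the article does not specify; `l2_penalty` and `regularization_share` are hypothetical helpers.

```python
import numpy as np

def l2_penalty(weights, factor):
    """L2 regularization cost for a list of weight arrays (assumed penalty type)."""
    return factor * sum(float(np.sum(w ** 2)) for w in weights)

def regularization_share(data_loss, reg_loss):
    """Fraction of the total loss that comes from regularization."""
    return reg_loss / (data_loss + reg_loss)

weights = [np.ones((2, 3)), np.ones(4)]       # mocked weights: 10 values of 1.0
reg_loss = l2_penalty(weights, factor=0.01)   # 0.01 * 10
share = regularization_share(data_loss=0.9, reg_loss=reg_loss)
print(reg_loss, share)
```

A share that stays close to 1 over many iterations suggests that the regularization factor is too large for the current model.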
Similar to the learning rate, we change testing values on a logarithmic scale (for example, by a factor of 10 at the beginning). Beware that each regularization factor can be of a totally different order of magnitude, and we may need to tune those parameters back and forth.
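Candidate values on a logarithmic scale can be generated as below. The coarse grid follows the factor-of-10 rule from the text; the finer random sampling inside a narrower range is an added assumption about how one might continue, and both function names are hypothetical.

```python
import random

def coarse_candidates(low_exp, high_exp):
    """Powers of 10 from 10**low_exp to 10**high_exp, inclusive."""
    return [10.0 ** e for e in range(low_exp, high_exp + 1)]

def fine_candidates(low_exp, high_exp, count, seed=0):
    """Random values drawn uniformly in log space (assumed refinement step)."""
    rng = random.Random(seed)
    return [10.0 ** rng.uniform(low_exp, high_exp) for _ in range(count)]

print(coarse_candidates(-5, 0))       # six values from 1e-05 up to 1.0
print(fine_candidates(-4, -2, 3))     # three values between 1e-4 and 1e-2
```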
Multiple cost functions
For the first implementations, avoid using multiple data cost functions. The weight for each cost function may be of a different order of magnitude and will require some effort to tune. If we have only one cost function, its weight can be absorbed into the learning rate.
When we use pre-trained models, we may freeze the model parameters in certain layers to speed up computation. Double check that no variables are frozen incorrectly.
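The claim that a single cost weight can be absorbed into the learning rate is easy to check for plain gradient descent. The toy loss and function names below are assumptions for illustration. Note that the equivalence shown here is for plain SGD; adaptive optimizers such as Adam rescale gradients, so the relationship is not the same there.

```python
def grad_loss(theta):
    """Gradient of the toy loss (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

def sgd_step(theta, grad, lr):
    """One plain gradient descent update."""
    return theta - lr * grad

theta, lr, weight = 0.0, 0.01, 5.0
# Option 1: multiply the loss (and therefore its gradient) by the weight.
step_weighted_loss = sgd_step(theta, weight * grad_loss(theta), lr)
# Option 2: keep the loss unweighted and multiply the learning rate instead.
step_scaled_lr = sgd_step(theta, grad_loss(theta), lr * weight)
print(step_weighted_loss, step_scaled_lr)   # the two updates agree
```

With two or more cost functions, each weight changes the direction of the combined gradient, which is why it can no longer be folded into one learning rate.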
Although it is less often discussed, we should unit test core modules so the implementation is less vulnerable to code changes. Verifying the output of a layer may not be easy if its parameters are initialized with a randomizer. Otherwise, we can mock the input data and verify the outputs. For each module (layer), we can verify:
- the shape of the output in both training and inference.
- the number of trainable variables (not the number of parameters).
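The two checks above can be sketched against a toy layer. `DenseWithDropout` is a hypothetical NumPy layer written only for this illustration, not a class from any framework; a fixed seed is used so the random initialization does not make the test flaky.

```python
import numpy as np

class DenseWithDropout:
    """Hypothetical toy layer: dense + ReLU, with dropout only in training."""

    def __init__(self, n_in, n_out, drop_rate=0.5, seed=0):
        self.rng = np.random.default_rng(seed)
        self.w = self.rng.normal(0.0, 0.1, size=(n_in, n_out))
        self.b = np.zeros(n_out)
        self.drop_rate = drop_rate
        self.trainable = [self.w, self.b]      # two variables, 20 parameters

    def forward(self, x, training):
        out = np.maximum(x @ self.w + self.b, 0.0)
        if training:
            mask = self.rng.random(out.shape) >= self.drop_rate
            out = out * mask / (1.0 - self.drop_rate)
        return out

def check_layer():
    layer = DenseWithDropout(n_in=3, n_out=5)
    x = np.ones((4, 3))                        # mocked input batch
    assert layer.forward(x, training=True).shape == (4, 5)
    assert layer.forward(x, training=False).shape == (4, 5)
    assert len(layer.trainable) == 2           # variables, not parameters
    # With dropout off, inference must be deterministic.
    assert np.array_equal(layer.forward(x, False), layer.forward(x, False))
    return True

check_layer()
```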
Always keep track of the shape of the Tensor (matrix) and document it inside the code. For a Tensor with shape [N, channel, W, H], if W (width) and H (height) are swapped, the code will not generate any error if both have the same dimension. Therefore, we should unit test our code with a non-symmetrical shape. For example, we unit test the code with a [4, 3] Tensor instead of a [4, 4] Tensor.
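The effect of a non-symmetrical test shape can be shown with a deliberately buggy conversion. Both functions below are illustrative: the second one swaps W and H by mistake, which a square input cannot reveal.

```python
import numpy as np

def to_channels_last(x):
    """[N, channel, W, H] -> [N, W, H, channel]."""
    return np.transpose(x, (0, 2, 3, 1))

def to_channels_last_buggy(x):
    """Same intent, but W and H are swapped by mistake."""
    return np.transpose(x, (0, 3, 2, 1))

square = np.zeros((2, 2, 4, 4))     # W == H hides the bug
rect = np.zeros((2, 2, 4, 3))       # W = 4, H = 3 exposes it

print(to_channels_last(square).shape, to_channels_last_buggy(square).shape)
print(to_channels_last(rect).shape, to_channels_last_buggy(rect).shape)
```

With the square input, both functions return the same shape, so a shape assertion passes for the buggy version too. With the [4, 3] input, the shapes differ and the test fails as it should.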
If you have any tips on debugging, feel free to share them in the comment section. You have now passed one of the most difficult parts of DL. Let’s beat the state-of-the-art model in Part 6: Improve Deep Learning Models performance & network tuning.