In Part 3, we cover some high-level deep learning strategies, then go into the details of the most common design choices. (Some basic DL background may be needed.)
The 6-part series for “How to start a Deep Learning project?” consists of:
· Part 1: Start a Deep Learning project.
· Part 2: Build a Deep Learning dataset.
· Part 3: Deep Learning designs.
· Part 4: Visualize Deep Network models and metrics.
· Part 5: Debug a Deep Learning Network.
· Part 6: Improve Deep Learning Models performance & network tuning.
Simple and smart
Start your design simple and small. In the study phase, we are flooded with many cool ideas, and we tend to code all the nuts and bolts in one shot. Resist the seduction of exotic ideas: this will not work. Trying to beat the state of the art too early is not practical. Design with fewer layers and customizations, and delay solutions that require unnecessary hyperparameter tuning. Verify that the loss is dropping, and do not waste time training the model with too many iterations or too large a batch size.
After a short debugging session, our model produces pretty unimpressive results after 5,000 iterations. But colors start to be confined to regions, and there is hope that skin tones are showing up.
This gives us valuable feedback on whether the model has started coloring. Do not start with something big; you will spend most of your time debugging it or wondering whether you just need another hour to train the model.
Nevertheless, this is easier said than done. We skip steps. But you have been warned!
Priority & incremental driven design
To create simple designs first, we need to sort out the top priorities, then break complex problems into smaller ones and solve them in steps. "Everyone has a plan 'till they get punched in the mouth." (a quote from Mike Tyson) The right strategy in a DL project is to maneuver quickly based on what you learn. Before jumping to a model using no hints, we start with one that uses spatial color hints. We do not move to a "no hint" design in one step; we first move to a model with color hints but drop the hints' spatial information. When the color quality drops significantly, we shift our priority and refine the model before making the next big jump. We deal with many surprises when designing models. Instead of making a long-term plan that keeps changing, be priority driven: use shorter and smaller design iterations to keep the project manageable.
Avoid random improvements
Analyze the weaknesses of your model first instead of making random improvements like bi-directional LSTM or PReLU. Visualize the errors (badly performing scenarios) and the performance metrics to identify real issues. Random improvements can be counterproductive, increasing training complexity disproportionately for little return.
We apply constraints to the network design to make training more effective. Building deep learning models is not only about putting layers together; adding good constraints makes learning more efficient or more "intelligent". For example, apply attention so the network knows where to look. In the variational autoencoder, we train the latent factors to be normally distributed. In our design, we apply denoising: we corrupt large fractions of the spatial color hints by zeroing them out. Ironically, this forces the model to learn better and to generalize better.
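As a rough illustration of this denoising idea, here is a NumPy sketch (the 90% drop rate and the hint-map shape are made-up values for illustration, not our exact settings):

```python
import numpy as np

def corrupt_hints(hints, drop_prob=0.9, rng=None):
    """Zero out a random fraction of the spatial color hints (denoising)."""
    rng = np.random.default_rng(0) if rng is None else rng
    keep = rng.random(hints.shape[:2]) >= drop_prob   # keep ~10% of locations
    return hints * keep[..., np.newaxis]

hints = np.ones((64, 64, 3))            # a dense 64x64 RGB hint map
sparse = corrupt_hints(hints)
print(sparse.sum() / hints.sum())       # roughly 0.1 of the hints survive
```

The model never sees the full hint map during training, so it cannot simply copy hints; it must learn to propagate color from sparse hints.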
DL is more than adding layers.
For the rest of the article, we will discuss some of the most common design choices encountered in a DL project.
Deep learning software frameworks
In just six months after its release by Google in November 2015, TensorFlow became the most popular deep learning framework. While it seemed implausible for any challenger to emerge soon, PyTorch was released by Facebook a year later and has gained a lot of traction in the research community. As of 2018, there are many deep learning platforms to choose from, including TensorFlow, PyTorch, Caffe, Caffe2, MXNet, CNTK, etc. One key factor triggered the defection of some researchers to PyTorch: the PyTorch design is end-user focused. The API is simple and intuitive, the error messages make sense, and the documentation is well organized. Features like pre-trained models, data pre-processing and loading of common datasets make PyTorch very popular. TensorFlow does an excellent job, but so far it adopts a bottom-up approach that makes things complicated. Its APIs are verbose and debugging is ad hoc. It has about half a dozen API models for building a DN: the result of many consolidations and of matching competitor offerings.
Make no mistake, TensorFlow is still dominating as of February 2018. Its developer community is the biggest, and this may be the only factor that matters. If you want to train a model on multiple machines or deploy the inference engine onto a mobile phone, TensorFlow is the only choice. Nevertheless, if other platforms prove to be more end-user focused, we foresee more defections for small to mid-size projects.
As TensorFlow evolves, there are many API options for building a DN. The highest-level API is the estimator, which provides implicit integration with TensorBoard for performance metrics. However, its adoption remains low outside the built-in estimators. The lowest-level APIs are verbose and spread across many modules. They are now being consolidated into the tf.layers, tf.metrics and tf.losses modules, with wrapper APIs that make building DN layers easier. For researchers who want more intuitive APIs, there are Keras, TFLearn, TF-Slim, etc., all of which work well with TensorFlow. I suggest selecting one that has the pre-trained models you need and the utilities to load your dataset. The amount of recent activity is important too. In the academic world, the Keras APIs are pretty popular for quick prototyping.
Don’t reinvent the wheel. Many deep learning software platforms come with pre-trained models like VGG19, ResNet and Inception V3. Training from scratch takes a long time. As the 2014 VGG paper states: “the VGG model was originally trained with four NVIDIA Titan Black GPUs, training a single net took 2–3 weeks depending on the architecture.”
Many pre-trained models can be repurposed for deep learning problems. For example, we can extract image features using a pre-trained VGG model and feed them to an LSTM model to generate captions. Many pre-trained models are trained with ImageNet images. If your target data is not very different from ImageNet, freeze most of the model’s parameters and retrain only the last few fully connected layers. Otherwise, retrain the whole network end-to-end with your training dataset. In both cases, since the model is already pre-trained, it can be retrained with significantly fewer iterations. As the training is shorter, we can avoid overfitting even if the training dataset is not large enough. This kind of transfer learning also works across disciplines, for example training a Chinese language model from a pre-trained English model.
However, transfer learning is only justifiable for problems requiring a complex model to extract features. In our project, our samples are different from ImageNet, and we need to retrain the model end-to-end. Nevertheless, the training complexity of VGG19 is too high when we only need relatively simple latent factors (the colors). So we decide to build a new but simpler CNN model for feature extraction.
Not all cost functions are created equal. The choice impacts how easy the model is to train. Some cost functions are pretty standard, but some problem domains need careful thought.
- Classification: Cross entropy, SVM
- Regression: Mean square error (MSE)
- Object detection or segmentation: Intersection over Union (IoU)
- Policy optimization: Kullback–Leibler divergence
- Word embedding: Noise Contrastive Estimation (NCE)
- Word vectors: Cosine similarity
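For concreteness, the two most standard cost functions above can be written in a few lines of NumPy (a sketch for illustration; the sample values are made up):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean square error for regression."""
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(labels, probs, eps=1e-12):
    """Cross entropy for classification; eps guards against log(0)."""
    return -np.mean(np.sum(labels * np.log(probs + eps), axis=1))

y_true, y_pred = np.array([1.0, 2.0]), np.array([1.5, 2.0])
labels = np.array([[1.0, 0.0], [0.0, 1.0]])   # one-hot targets
probs  = np.array([[0.9, 0.1], [0.2, 0.8]])   # predicted probabilities
print(mse(y_true, y_pred))            # 0.125
print(cross_entropy(labels, probs))   # ~0.164
```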
Cost functions that look good in theoretical analysis may not perform well in practice. For example, the cost function for the discriminator network in a GAN adopts a more practical, empirical approach than the theoretical one. In some problem domains, the cost function can be part guessing and part experimental, and it can be a combination of several cost functions. In our project, we start with the standard GAN cost functions. We also add a reconstruction cost using MSE and other regularization costs. However, our brain does not judge styling by MSE. One of the unresolved areas in our project is finding a better reconstruction cost function; one possibility is using a perceptual loss to measure the difference in style instead of making a per-pixel comparison.
Finding good cost functions becomes more important when we move into less familiar problem domains.
Good metrics help you compare and tune models better. Search for established metrics for your type of problem. For ad hoc problems, check out Kaggle: it hosts many DL competitions with well-documented metrics. Unfortunately, for our project, it is hard to define a precise formula to measure the accuracy of artistic rendering.
L1 and L2 regularization are both common but L2 regularization is more popular in deep learning.
What is good about L1 regularization? L1 regularization promotes sparsity in the parameters, which encourages representations that disentangle the underlying factors. Since each non-zero parameter adds a penalty to the cost, L1 prefers more zero parameters than L2 regularization does, i.e. it prefers many zeros and a few slightly larger parameters over the many tiny parameters of L2 regularization. L1 regularization makes filters cleaner and easier to interpret, and is therefore a good choice for feature selection. The resulting computation is also easier to optimize and consumes less power, so L1 is more suitable for mobile devices. L1 is also less vulnerable to outliers and works better if the data is less clean. However, L2 regularization remains more popular because its solution may be more stable.
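The sparsity difference is easy to see numerically. In this NumPy sketch (the weights and the penalty strength lam = 0.05 are made-up values), one soft-threshold step for the L1 penalty zeroes out the small weights, while the L2 shrinkage step only scales every weight down:

```python
import numpy as np

w   = np.array([0.02, -0.5, 0.003, 1.2, -0.04])
lam = 0.05

# One proximal (soft-threshold) step for the L1 penalty...
w_l1 = np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)
# ...versus one multiplicative shrinkage step for the L2 penalty.
w_l2 = w * (1.0 - 2.0 * lam)

print(np.count_nonzero(w_l1))  # 2: the small weights become exactly zero
print(np.count_nonzero(w_l2))  # 5: every weight shrinks but stays non-zero
```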
Always monitor gradients closely for diminishing or exploding gradients. Gradient descent problems have many possible causes which are very hard to verify. Do not jump into learning rate tuning or model design changes too fast. Small gradients may simply be caused by programming bugs: for example, input data that is not scaled properly, or weights that are all initialized to zero. Tuning takes time; it yields better returns if we verify other causes first.
If other possible causes are eliminated, apply gradient clipping (in particular for NLP) when gradients explode. Skip connections are a common technique to mitigate the diminishing gradient problem. In ResNet, a residual layer allows the input to bypass the current layer to the next layer. Effectively, this reduces the depth of the network and makes training easier in the early stage.
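Gradient clipping by the global norm can be sketched as follows (the max_norm of 5.0 is just an illustrative threshold):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients together if their global L2 norm exceeds max_norm."""
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if norm > max_norm:
        grads = [g * (max_norm / norm) for g in grads]
    return grads

grads = [np.array([30.0, 40.0])]          # global norm = 50: exploding
clipped = clip_by_global_norm(grads)
print(np.linalg.norm(clipped[0]))         # 5.0: direction kept, magnitude capped
```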
Scale your input features. We often scale features to be zero-centered within a specific range, say [-1, 1]. Improper scaling of the features is one of the most common causes of exploding or diminishing gradients. Sometimes we compute a mean and a variance from the training data to scale the data closer to a Normal distribution. When scaling the validation or testing data, reuse the mean and variance computed from the training data; do not recompute them.
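A minimal sketch of this rule (the feature values are made up):

```python
import numpy as np

x_train = np.array([[1.0, 200.0], [2.0, 220.0], [3.0, 240.0]])
x_val   = np.array([[2.5, 210.0]])

# Compute the statistics from the training data only...
mean, std = x_train.mean(axis=0), x_train.std(axis=0)

x_train_scaled = (x_train - mean) / std
x_val_scaled   = (x_val - mean) / std   # ...and reuse them for validation/testing

print(x_train_scaled.mean(axis=0))      # ~[0, 0]: zero-centered
```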
Batch Normalization & Layer Normalization
The imbalance of the nodes’ outputs before the activation functions in each layer is another major source of gradient problems. If needed, apply batch normalization (BN) to a CNN. A DN learns faster and better if its inputs are properly normalized (scaled). In BN, we compute the means and variances for each spatial location over each batch of training data. For example, with a batch size of 16 and a feature map with 10×10 spatial dimensions, we compute 100 means and 100 variances (one per location). The mean at each location is the average over the corresponding locations of the 16 samples. We use the means and variances to renormalize the node outputs at each location. BN improves accuracy and reduces training time. As a side bonus, we can increase the learning rate further to make training faster.
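The per-location statistics from this example look like the following in NumPy (a sketch of the normalization step only, without BN’s learned scale and shift parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(16, 10, 10))  # batch of 16, one 10x10 feature map

# One mean and one variance per spatial location, averaged over the batch.
mean = x.mean(axis=0)                        # shape (10, 10): 100 means
var  = x.var(axis=0)                         # shape (10, 10): 100 variances
x_bn = (x - mean) / np.sqrt(var + 1e-5)      # renormalize the node outputs

print(mean.shape)                            # (10, 10)
```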
However, BN is not effective for RNNs; we use layer normalization instead. In an RNN, the means and variances from BN are not suitable for renormalizing the outputs of the RNN cells, likely because of the recurrent nature of the RNN and its shared parameters. In layer normalization, the output is renormalized by the mean and variance calculated from the layer’s output for the current sample. A layer with 100 elements uses only one mean and one variance, computed from the current input, to renormalize the layer.
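The contrast with BN is a one-axis change: layer normalization computes the statistics per sample across the layer, not per location across the batch (again a sketch without the learned gain and bias):

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    """Renormalize each sample by its own mean and variance across the layer."""
    mean = h.mean(axis=-1, keepdims=True)   # one mean per sample
    var  = h.var(axis=-1, keepdims=True)    # one variance per sample
    return (h - mean) / np.sqrt(var + eps)

h = np.arange(100, dtype=float).reshape(1, 100)  # one sample, a 100-element layer
out = layer_norm(h)
print(out.mean(), out.std())                     # ~0 and ~1
```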
Dropout can be applied to layers to regularize a model. Dropout became less popular after the introduction of batch normalization in 2015. Batch normalization uses the mean and standard deviation to rescale the node outputs. Because each training batch has a different mean and variance, this behaves like noise, which forces the layers to learn more robustly against variations in the input. Since batch normalization also helps gradient descent, it has gradually replaced dropout.
The benefit of combining dropout with L2 regularization is domain specific. Usually, we test dropout during tuning and collect empirical data to justify its benefit.
In DL, ReLU is the most popular activation function for introducing non-linearity into the model. If the learning rate is too high, many nodes can die and stay dead. If changing the learning rate does not help, we can try leaky ReLU or PReLU. In a leaky ReLU, instead of outputting zero when x < 0, it has a small predefined downward slope (say 0.01, or set by a hyperparameter). Parametric ReLU (PReLU) pushes this one step further: each node has a trainable slope.
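A leaky ReLU is a one-liner (slope 0.01 as in the text; PReLU would make slope a trainable parameter per node):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    """Like ReLU, but negative inputs keep a small slope so nodes cannot die."""
    return np.where(x > 0, x, slope * x)

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print(leaky_relu(x))   # [-0.02  -0.005  0.  1.  3.]
```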
To test the real performance, we split our data into three parts: 70% for training, 20% for validation and 10% for testing. Make sure the samples are properly randomized in each dataset and in each batch of training samples. During training, we use the training dataset to build models with different hyperparameters. We run those models against the validation dataset and pick the one with the highest accuracy. As a last safeguard, we use the 10% testing data for a final sanity check. If your testing result is dramatically different from the validation result, the data should be randomized more, or more data should be collected.
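The 70/20/10 split with shuffling can be sketched as follows (the dataset size and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(1000)

idx = rng.permutation(len(data))            # shuffle before splitting
n_train, n_val = int(0.7 * len(data)), int(0.2 * len(data))

train = data[idx[:n_train]]                 # 70% for training
val   = data[idx[n_train:n_train + n_val]]  # 20% for validation
test  = data[idx[n_train + n_val:]]         # 10% for the final sanity check

print(len(train), len(val), len(test))      # 700 200 100
```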
Setting a baseline helps us compare models and debug. Research projects often require an established model as a baseline. For example, use a VGG19 model as the baseline for classification problems. Alternatively, we can first extend an established, simple model to solve our problem. This helps us understand the problem better and establishes a performance baseline for comparison. In our project, we modify an established GAN implementation and redesign the generative network as our baseline.
We save the models’ outputs and metrics periodically for comparison. Sometimes we want to reproduce the results of a model, or reload a model to train it further. Checkpoints allow us to save models to be reloaded later. However, if the model design has changed, old checkpoints cannot be loaded. While there is no automated process to solve this, we use Git tagging to track multiple models and to reload the correct model for a specific checkpoint. Checkpoints in TensorFlow are huge: our designs take 4 GB per checkpoint. When working in a cloud environment, configure enough storage accordingly. We start and terminate Amazon cloud instances frequently, so we store all the files on Amazon EBS so they can be reattached easily.
Built-in layers from DL software packages are better tested and optimized. Nevertheless, if custom layers are needed:
- Unit test the forward pass and backpropagation code with non-random data.
- Compare the backpropagation result with the naive gradient check.
- Add tiny ϵ for the division or log computation to avoid NaN.
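The naive gradient check from the second bullet compares the backpropagated gradient with a centered finite difference. A toy sketch for a loss f(w) = (wx)², with fixed non-random test values:

```python
def analytic_grad(w, x):
    """Backpropagation result: d/dw (w*x)^2 = 2*w*x^2."""
    return 2.0 * w * x ** 2

def numeric_grad(w, x, eps=1e-5):
    """Naive centered-difference gradient check."""
    loss = lambda w: (w * x) ** 2
    return (loss(w + eps) - loss(w - eps)) / (2.0 * eps)

w, x = 0.7, 1.3                   # fixed, non-random test values
diff = abs(analytic_grad(w, x) - numeric_grad(w, x))
print(diff < 1e-6)                # True: backprop matches the numeric check
```

If the two disagree by more than a small tolerance, the backpropagation code has a bug.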
One of the challenges in DL is reproducibility. If the initial model parameters keep changing between sessions, debugging becomes hard. Hence, we explicitly initialize the seeds for all randomizers. In our project, we initialize the seeds for Python, NumPy and TensorFlow. For final tuning, we turn off the explicit seed initialization so we generate different models on each run. To reproduce the results of a model, we checkpoint it and reload it later.
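A sketch of the seeding (the seed value is arbitrary; the TensorFlow call is left as a comment since it requires the framework to be loaded):

```python
import random
import numpy as np

SEED = 12345

random.seed(SEED)        # Python's built-in randomizer
np.random.seed(SEED)     # NumPy
# tf.set_random_seed(SEED)   # TensorFlow (the 1.x API of the time)

a = np.random.rand(3)
np.random.seed(SEED)     # re-seeding reproduces the same sequence
b = np.random.rand(3)
print(np.array_equal(a, b))   # True: same seed, same "random" numbers
```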
The Adam optimizer is one of the most popular optimizers in DL, if not the most popular. It suits many problems, including models with sparse or noisy gradients. It achieves good results fast, and its greatest benefit is easy tuning; indeed, the default configuration parameters often do well. The Adam optimizer combines the advantages of AdaGrad and RMSProp. Instead of a single learning rate for all parameters, Adam internally maintains a learning rate for each parameter and adapts them separately as learning unfolds. Adam is momentum-based, using a running record of the gradients. Therefore, gradient descent runs smoother, and it dampens the parameter oscillation problems caused by large gradients and learning rates. An alternative but less-used option is SGD with Nesterov momentum.
Adam optimizer tuning
Adam has 4 configurable parameters.
- The learning rate (default 0.001)
- β1: the exponential decay rate for the 1st moment estimates (default 0.9).
- β2: the exponential decay rate for the 2nd-moment estimates (default 0.999). This value should be set close to 1.0 on problems with a sparse gradient.
- ϵ (default 1e-8): a small value added to the mathematical operations to avoid illegal operations like division by zero.
β (momentum) smoothes out the gradient descent by accumulating information on previous descents. The default configuration usually works well for early development. If not, the most likely parameter to tune is the learning rate.
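The update rule behind these four parameters can be sketched in NumPy (a toy run minimizing f(w) = Σw² with the default settings; a sketch of the mechanics, not production code):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: per-parameter running moments with bias correction."""
    m = b1 * m + (1 - b1) * grad          # 1st moment (the momentum record)
    v = b2 * v + (1 - b2) * grad ** 2     # 2nd moment (per-parameter scaling)
    m_hat = m / (1 - b1 ** t)             # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

w = np.array([1.0, -1.0])
m = v = np.zeros_like(w)
for t in range(1, 101):                   # gradient of sum(w^2) is 2*w
    w, m, v = adam_step(w, 2.0 * w, m, v, t)
print(np.abs(w).max())                    # ~0.9: each step moves roughly lr
```

Note how the effective step size is roughly lr per parameter regardless of the gradient magnitude: the second moment normalizes the gradient scale away.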
Here is a brief summary of the major steps in a deep learning project:
• Define task (Object detection, Colorization of line arts)
• Collect dataset (MS Coco, Public web sites)
◦ Search for academic datasets and baselines
◦ Build your own (From Twitter, News, Website,…)
• Define the metrics
◦ Search for established metrics
• Clean and preprocess the data
◦ Select features and transform data
◦ One-hot vector, bag of words, spectrogram etc...
◦ Bucketize, logarithm scale, spectrogram
◦ Remove noise or outliers
◦ Remove invalid and duplicate data
◦ Scale or whiten data
• Split datasets for training, validation and testing
◦ Visualize data
◦ Validate dataset
• Establish a baseline
◦ Compute metrics for the baseline
◦ Analyze errors for areas of improvement
• Select network structure
◦ CNN, LSTM…
• Implement a deep network
◦ Code debugging and validation
◦ Parameter initialization
◦ Compute loss and metrics
◦ Choose hyper-parameters
◦ Visualize, validate and summarize result
◦ Analyze errors
◦ Add layers and nodes
• Fine-tune hyper-parameters
• Try out model variants
Did I miss any core design topics in DL? Feel free to share with us in the comment section. A deep network is one big black box, and people act irrationally when debugging a DN. Instead of spending hours following dead-end leads, we should spend some time creating a framework to visualize the DN models and metrics. In Part 4: Visualize Deep Network models and metrics, we cover what to monitor when troubleshooting and tuning our networks.