Pytorch FAQ Code Snippets for Udacity Deep Learning Scholarship Challenge with Facebook

Uniqtech
Data Science Bootcamp
14 min read · Jan 3, 2019

This is a comprehensive guide for beginners on troubleshooting the PyTorch final challenge project. It compiles interesting FAQs and chats (mostly Slack channel discussions) from the Udacity Deep Learning Scholarship Challenge with Facebook, in preparation for the Deep Learning Nanodegree. It’s a great resource for going through PyTorch tutorials, getting started with PyTorch and, more importantly, learning how hyperparameter tuning improves deep learning performance. If you want to support the publication, go ahead and applaud the article. You can always view this article for free, for example in incognito mode. Your support is greatly appreciated. Please comment below whether or not you are a Medium member. Key author and contributor: Sun.

Looking for code snippets? Comment or message me privately. The honor code forbids sharing solutions, so we can only share valuable hints without giving away the answer. They are especially useful if you are rushing to finish your project. Hints include relevant look-alike code snippets, the relevant Python notebooks used in the Udacity course and where to find them, and how to map them to your final project. Need a last-minute answer? The Slack channel is the best place, with many helping hands. Or email us at hi@uniqtech.co and we will point you in the right direction.

Glossary

  • What does fc stand for? Answer: fully connected.
  • View the model architecture. For example, to understand the layer structure of popular pre-trained models such as ResNet or VGG, use print(model).
  • CUDA. CUDA is NVIDIA’s GPU computing platform; PyTorch uses it to run on the GPU instead of the CPU. You can check whether a GPU is usable on your computer with torch.cuda.is_available().
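
As a minimal sketch of the two glossary items above, the snippet below picks a device and prints a model’s structure. The tiny model and its layer sizes are placeholders we made up for illustration, not the project’s network.

```python
import torch
from torch import nn

# Pick the GPU when CUDA is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tiny stand-in model (hypothetical sizes), just to have something to inspect.
model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))
model = model.to(device)

print(model)        # lists every layer in order
print(device.type)  # "cuda" or "cpu", depending on the machine
```

The same print(model) call works on a torchvision pre-trained model, where the output is much longer.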

Data Preprocessing Data Transformation

  • The first step is to download the dataset, or upload it to the right folder such as Google Drive, and then unzip it. In a Jupyter Notebook, a leading exclamation mark runs the rest of the line as a shell command. Skip this step if you are working in the Udacity student workspace.
  • Use the training set’s class_to_idx to map class (folder) names to the integer labels the data loader returns.

“Why we need model.class_to_idx = image_datasets[‘train’].class_to_idx is because The labels returned do not match the folder names. You’d need to use class_to_idx to get the correct mapping between label and class id. They are loaded in order for character strings, i.e. 1, 10, 100, 101, 102, 11, 2, 3, 4… and given labels 0,1,2,3,4,… The folder names (which here happen to be numbers) are treated as characters.” — Udacity slack channel
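
The string-sorting behaviour described in the quote can be reproduced in plain Python. This is only a sketch: the folder names are a small hypothetical subset, and idx_to_class is a helper name we introduce here for the inverted mapping.

```python
# Folder names as they might appear on disk (hypothetical subset of the 102).
folder_names = ["1", "2", "3", "10", "11", "100", "101", "102"]

# Folder-based loaders sort the names as strings, then number them from 0.
classes = sorted(folder_names)
class_to_idx = {name: idx for idx, name in enumerate(classes)}

# Invert the mapping to go from a predicted label back to the folder name.
idx_to_class = {idx: name for name, idx in class_to_idx.items()}

print(classes)            # ['1', '10', '100', '101', '102', '11', '2', '3']
print(class_to_idx["2"])  # 6, not 1
print(idx_to_class[1])    # '10'
```

Folder "2" ends up with label 6, which is why predictions have to go through the inverted mapping before you look up a flower name.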

  • There’s no provided test dataset for the Oxford102 flower dataset. It is withheld in the challenge to evaluate student performance. There is not really enough data to shuffle and re-split, but some advanced students augment the dataset with external data. There are user-made test datasets on Kaggle. It’s not uncommon to augment datasets in real Kaggle competitions.
  • Because there’s no provided test dataset, it is okay to augment the dataset with another. Be careful: you may introduce data leakage if the same images end up in both train and validation.
  • Map the integer folder names to meaningful class names. Store the mapping on the model first: model.class_to_idx = image_datasets['train'].class_to_idx
  • You can augment the dataset using rotation, flipping and other variations. This is common best practice in industry. Data augmentation is a professional tool. For this dataset, hue and contrast variations do not necessarily perform better. No one knows for sure, but the intuition is that color may actually be important for classifying flower species. Color noise makes the generated data deviate from the true population distribution, so it is no longer representative and hence not good for training.

Image transformation is very important. In our experiments with various combinations of learning rate, epochs and StepLR step sizes, accuracy capped at around 90% no matter what optimization we did whenever the transformations were not right. We eventually realized that our initial transformations were not set correctly, so all the extra optimization was wasted. Double- and triple-check your transformation functions and preprocessing steps before you proceed with the rest of the project. It will save you a lot of time.

Release Note and Dependencies

  • PyTorch 1.0.0, the latest version, has Google Cloud, AWS and production scripting support, but it is not the version the Udacity grader uses. Udacity uses 0.4.0, which in turn differs from the 0.4.1 installed on Google Colab. Make sure to pass strict=False to load_state_dict() when loading a checkpoint so that no errors are raised, because 0.4.1 state dicts contain keys that 0.4.0 does not. To check your PyTorch version, use torch.__version__

“The PyTorch deep learning framework is now supported throughout more of Google Cloud’s AI platforms and services. This coincides with Facebook’s release of PyTorch 1.0 in preview. With this news, we’re now offering a new VM image and an extended TensorRT package in Kubeflow so it’s easier to support serving PyTorch models. Other features coming soon include deeper TensorBoard integration and the ability to connect PyTorch to Cloud TPUs” Google Cloud Platform Nov 6 2018

  • Dependencies

Model Building, Model Architecting

  • Train an existing model. To achieve 95%+ accuracy, people have elected to train existing large models such as ResNet and VGG from scratch! It’s surprising. The trade-off: training takes a long time and there may not be enough data (only 6500+ images available for 102 classes). Still, those who tried reported a gain in accuracy.
  • Using an existing model usually means freezing some of its layers and parameters and not training them. Turn off autograd for those parameters by setting requires_grad to False.
  • A typical custom classifier stacks multiple linear layers with ReLU activation and dropout.
  • The number of layers turns out to be important. If there are too many layers, the model may not learn and adjust its weights well due to insufficient data. Ctrl+F search Curse of Dimensionality in this article. A rule of thumb: more layers require more data. If data is not available, try fewer layers.
  • The pursuit of an elegant solution? Occam’s razor says that the simpler, more elegant solution that achieves equal accuracy and quality is preferred over a complex one. While it is possible to train all layers of a complex model, you can build equally sophisticated models by training just the last one or two layers of the pre-trained model instead of all the layers. The curse of dimensionality also means that every additional feature weight requires exponentially more data to train, so we don’t want to train too many new layers. Considering the dataset only has 6500+ images, we will run into data issues. So as you design your model, think about the 80/20 rule: how do you achieve 80% effectiveness with 20% of the work and effort?
  • Hint: four fully connected layers as the final classifier, even with dropout, do not fit this particular dataset too well.
  • Calculating accuracy is very important as well. How do you judge a model? By its performance, of course, measured here by accuracy. Looking for the code snippet? Hint: it appears in a past lesson.
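
To make the freezing and classifier points above concrete, here is a small sketch. The backbone is a made-up stand-in for a pre-trained feature extractor, and the hidden size of 64 and the dropout rate of 0.2 are our assumptions, not values from the course. The last lines show one generic way to compute accuracy on a batch.

```python
from collections import OrderedDict

import torch
from torch import nn

# Stand-in for a pre-trained feature extractor (hypothetical; a real project
# would load one from torchvision.models instead).
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

# Freeze the backbone: autograd will not compute gradients for these weights.
for param in backbone.parameters():
    param.requires_grad = False

# Lightweight classifier: linear layers with ReLU and dropout, 102 outputs.
classifier = nn.Sequential(OrderedDict([
    ("fc1", nn.Linear(8, 64)),
    ("relu", nn.ReLU()),
    ("dropout", nn.Dropout(p=0.2)),
    ("fc2", nn.Linear(64, 102)),
    ("output", nn.LogSoftmax(dim=1)),
]))
model = nn.Sequential(backbone, classifier)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(f"trainable: {trainable}, frozen: {frozen}")

# Accuracy on a dummy batch: fraction of samples whose top class matches the label.
images = torch.randn(4, 3, 32, 32)
labels = torch.randint(0, 102, (4,))
model.eval()  # disables dropout for evaluation
with torch.no_grad():
    log_probs = model(images)
accuracy = (log_probs.argmax(dim=1) == labels).float().mean().item()
print(f"accuracy: {accuracy:.2f}")
```

Only the classifier’s parameters count as trainable here, so an optimizer built from classifier.parameters() leaves the backbone untouched.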

Model Training in Google Cloud vs Training on Local Machine

Thanks to new cloud technology like Google Colab, once you set up the notebook you can continue to train and monitor from your mobile devices! Cloud choices include Google Colab, Kaggle and AWS. Local choices include your own laptop or your gaming computer.

  • We repurposed an MSI NVIDIA GTX 1060 previously used for Assassin’s Creed Origins :D If you want to know more, let us know.
  • Regardless of where the model is trained, if the training loss has dropped close to zero but the validation loss is not decreasing (there’s no test dataset), watch out: your model may be overfitting and memorizing the training data. Halt and tune your parameters. Even if you achieve 99% accuracy, your model may not generalize, so it may not be usable elsewhere.
  • Udacity scholarship student Avinash explains how to get started with Google Colab installing dependencies and required libraries. Link
  • If you want to go above and beyond, you can plot the train validation loss in graphs to visualize the change over time.
  • Being able to train on a GPU locally has been a major advantage. We were able to iterate through parameter tuning combinations quickly and without interruption. Google Colab has a 12-hour timeout as well as a 12GB quota limit. If you are an advanced user, avoid repeatedly downloading the flower dataset; store it in your Google Drive instead.
  • After deleting your model in Google Drive, be sure to empty trash to actually delete it.
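
The overfitting warning above can be turned into a simple check on the recorded losses. This is a rough heuristic of our own, not something from the course: the function name, the patience value and the loss numbers are all made up for illustration.

```python
def looks_overfit(train_losses, valid_losses, patience=3):
    """Heuristic: validation loss has not beaten its earlier best for
    `patience` epochs while training loss is still going down."""
    if len(valid_losses) <= patience:
        return False
    best_before = min(valid_losses[:-patience])
    valid_stalled = min(valid_losses[-patience:]) >= best_before
    train_falling = train_losses[-1] < train_losses[-patience - 1]
    return valid_stalled and train_falling


# Hypothetical loss histories, one value per epoch.
train_history = [2.0, 1.2, 0.7, 0.4, 0.2, 0.1]
memorizing = [2.1, 1.5, 1.1, 1.2, 1.3, 1.4]  # validation loss turns upward
healthy = [2.1, 1.5, 1.1, 0.9, 0.8, 0.7]     # validation loss keeps falling

print(looks_overfit(train_history, memorizing))  # True
print(looks_overfit(train_history, healthy))     # False
```

Appending the per-epoch losses to two lists like these is also all you need to plot the training and validation curves.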

Hyperparameter Tuning

Depending on the model you are using, tuning these parameters may help improve accuracy! The best values may also be completely different from one model to another.

  • Learning rate: too large a learning rate (in this case, even 0.1) can show signs of non-convergence: the weights jump around too much and cannot converge, and you may observe a dismal 2% accuracy. That’s a signal that you need to tone down your learning rate. For this project, people have seen good results with learning rates as low as 0.005 and 0.001. Some pros also use StepLR, a scheduler that is stepped in the training loop to decay the learning rate so that it shrinks over time: later in training, the updates should get smaller and smaller. Remember to define the scheduler and also to call scheduler.step() in the training loop.
  • Read more about scheduler here.
  • StepLR multiplies the learning rate by gamma (for example 0.1) every step_size epochs. If you train for 20 epochs, step_size needs to be smaller than 20 for the decay to take effect.
  • If the learning rate is set too high initially, you may miss the optimum you are searching for.
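
The step-size point above is easy to see with the schedule written out in plain Python. This sketch mirrors how a StepLR-style decay behaves; the step_lr function and the starting rate of 0.001 are our own illustration, not library code.

```python
def step_lr(initial_lr, epoch, step_size, gamma=0.1):
    """Learning rate at a given epoch under a StepLR-style schedule."""
    return initial_lr * gamma ** (epoch // step_size)


num_epochs = 20
for step_size in (7, 20):
    schedule = [step_lr(0.001, epoch, step_size) for epoch in range(num_epochs)]
    print(f"step_size={step_size}: first lr {schedule[0]}, last lr {schedule[-1]}")
```

With step_size=7 the rate has been cut twice by the last epoch, while with step_size=20 it never changes during a 20-epoch run.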

Print out learning rate

You can read the current learning rate from the optimizer: each entry in optimizer.param_groups stores its own 'lr'.
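
A minimal sketch of printing the learning rate inside a training loop with StepLR is shown here. The model, the data and the step_size of 2 are placeholders chosen so the decay is visible within four epochs.

```python
import torch
from torch import nn, optim
from torch.optim import lr_scheduler

model = nn.Linear(4, 2)  # stand-in model
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.1)

lrs = []
for epoch in range(4):
    # Placeholder for one epoch of training: a single dummy update.
    optimizer.zero_grad()
    loss = model(torch.randn(8, 4)).pow(2).mean()
    loss.backward()
    optimizer.step()

    # Read the learning rate used during this epoch.
    current_lr = optimizer.param_groups[0]["lr"]
    lrs.append(current_lr)
    print(f"epoch {epoch}: lr = {current_lr}")

    # Step the scheduler once per epoch, after the optimizer updates.
    scheduler.step()
```

Recent PyTorch releases expect scheduler.step() after optimizer.step(), as above; tutorials from the 0.4 era often call it at the top of the epoch instead.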

  • Batch size: 6500+ images / 102 classes ≈ 64, a reasonable maximum batch size. People generally saw good results between 16 and 64. We tried 64 and the results weren’t ideal; 30–50 worked great.
  • Epochs: people used anywhere from 16 to 64. Why 16? The number is arbitrary. People who are hardcore about the leaderboard trained for more than 50 epochs, but it’s possible to get fairly good results with 5–10 epochs. Increasing the number of epochs can help, but at some point the accuracy stops improving; that’s when you realize epochs have limits. It’s also possible to overfit, especially if the training loss gets too low (the model is memorizing) and/or the validation loss starts to go up. See a later section, where we explain how too many epochs may mean that your model is overfitting or memorizing. Our many experiments showed us that, for this dataset, more epochs do not yield extraordinary results unless you are aiming for 99%.
  • Optimizer: Adam, which adapts the learning rate for each parameter, or Stochastic Gradient Descent (SGD) with momentum. The momentum helps you get unstuck from local minima.
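
Here is how those three knobs (batch size, epochs, optimizer) typically show up in code. Treat it as a sketch on random stand-in data: the feature size, the batch size of 32 and the five epochs are arbitrary picks from the ranges above, not recommended settings.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Random stand-in data: 100 samples with 10 features, 102 possible classes.
dataset = TensorDataset(torch.randn(100, 10), torch.randint(0, 102, (100,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

model = nn.Linear(10, 102)  # stand-in for a real network
criterion = nn.CrossEntropyLoss()

# The two optimizer choices mentioned above; swap one for the other to compare.
adam = optim.Adam(model.parameters(), lr=0.001)
sgd = optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

optimizer = sgd
num_epochs = 5
for epoch in range(num_epochs):
    running_loss = 0.0
    for inputs, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"epoch {epoch + 1}: mean loss {running_loss / len(loader):.3f}")
```

Keeping these values in named variables at the top of the notebook makes it much easier to record which combination produced which accuracy.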

Transfer Learning

In practice, very few people train an entire Convolutional Network from scratch (with random initialization), because it is relatively rare to have a dataset of sufficient size. Instead, it is common to pretrain a ConvNet on a very large dataset (e.g. ImageNet, which contains 1.2 million images with 1000 categories), and then use the ConvNet either as an initialization or a fixed feature extractor for the task of interest. — Pytorch Documentation

  • Pretrained models can be large: 1GB+ for VGG and 500MB+ for ResNet152. Choose wisely. You can imagine neither in its raw state is optimized for mobile. Google has MobileNet.
  • It’s common practice to replace only the last layer (the classifier, or the last fully connected layer), but people who were hardcore and wanted 90%+ accuracy had to train the entire ResNet model, not just the last layer, and they got better results than with VGG. People who trained the entire model reported accuracy gains. In our experiments, training just the last layer, replaced with a fully connected layer with 102 outputs, gave a maximum of about 90% accuracy.

A particularly cool strategy we saw in the Slack channel was: freeze all pre-trained model layers except for the last layer. Train the last layer for n epochs. Then stop. Unfreeze the entire model, and train for a few more epochs. This way you have trained an entire complex model with a new classifier or final fully connected layer, and you are able to do that with a limited-size dataset.

  • The consensus is that ResNet152 gives much better results than ResNet50, but some people got high scores with simpler models too. Some train for more than 15 epochs.
  • You can use print(model) to print the entire architecture of a pretrained model such as ResNet. You will notice that the last layers, the ones replaced during transfer learning, have different names in different models. Quiz: what is the last layer of ResNet152 called? Answer: fc. You replace it by assigning a new layer to model.fc.
  • With transfer learning and a sufficiently sophisticated torchvision pretrained model, it is not hard to get 80% accuracy on the validation set with just a few epochs of training. If your accuracy is low, consider tuning your parameters. People have seen a lightweight classifier (1–2 layers) give great results. There’s no need for a complex classifier structure if you are using transfer learning.
  • Unfreezing a large model can easily use up memory and GPU quota and prolong training time.
  • ResNet without dropout, trained with the Adam optimizer, has been popular. That said, SGD with momentum and other models have claimed high spots on the leaderboard as well.

An active developer gave the following advice: “Use resnet 152 with 2 layer in classifier without dropout. Use adam as optimiser. Train it for 10 epoches only with the classifier. Then save the best model and unfreeze the whole model and train it for 15 epoches with lower lr” — Manisha
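
The two-phase recipe in the quote could look roughly like this. It is a sketch on a toy model with random data: the backbone, the epoch counts and the learning rates are placeholders, and set_trainable and train are helper names we made up.

```python
import torch
from torch import nn, optim

# Toy stand-ins: "backbone" plays the pre-trained layers, "classifier" the new head.
backbone = nn.Sequential(nn.Linear(16, 8), nn.ReLU())
classifier = nn.Linear(8, 102)
model = nn.Sequential(backbone, classifier)


def set_trainable(module, flag):
    """Freeze (False) or unfreeze (True) every parameter of a module."""
    for param in module.parameters():
        param.requires_grad = flag


def train(model, optimizer, epochs):
    """Placeholder training loop on random data."""
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        inputs = torch.randn(32, 16)
        labels = torch.randint(0, 102, (32,))
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()


backbone_before = backbone[0].weight.detach().clone()

# Phase 1: train only the classifier while the backbone stays frozen.
set_trainable(backbone, False)
train(model, optim.Adam(classifier.parameters(), lr=0.001), epochs=2)
phase1_unchanged = torch.equal(backbone_before, backbone[0].weight)

# Phase 2: unfreeze everything and continue with a lower learning rate.
set_trainable(backbone, True)
train(model, optim.Adam(model.parameters(), lr=0.0001), epochs=2)
phase2_changed = not torch.equal(backbone_before, backbone[0].weight)

print(phase1_unchanged, phase2_changed)
```

Creating a fresh optimizer for phase 2 is the simplest way to make sure the newly unfrozen parameters are actually updated.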

Model Tuning

  • Curse of dimensionality: before you add four fully connected layers to the end of a pre-trained model, as discussed in the Model Building section, think twice. Do you have enough data to train all 4 layers? As the number of parameters increases, the need for data increases exponentially. For example, the Udacity Deep Learning Scholarship Challenge has 6500+ image files for 102 Oxford flower species, which is only about 63 images per class. That’s barely enough data for one layer.
  • See the hyperparameter tuning section for more tips.
  • If your accuracy is below 20% after 5 epochs, something is seriously wrong with your model. If your accuracy is between 80% and 90%, you are heading in the right direction, but if it doesn’t change after 10–20 epochs you need a better architecture and setup. Getting to 80% should be easy; most people can probably reach it, so 80% is probably not enough to get to the next round. Getting to 90–95% is possible without too much effort, and quite a few students seem to have reached this. Getting above 95% is hard. You can take a look at the specs in the leaderboard to see what models and parameters people used. Getting close to 99% is very hard. People had to run many epochs, and some of these seemingly high-performance models may be overfitting. We would say that if you are at 80%+ you should at least attempt the final project. If you have 95% you are doing well. If you are at 99%, great, but are you overfitting?
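
To put the curse-of-dimensionality point above in rough numbers, the sketch below counts the weights in a one-layer head and in a four-layer head. The input size of 2048 matches ResNet152’s final layer; the hidden sizes are hypothetical, chosen only for illustration.

```python
def linear_params(sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


num_images, num_classes = 6500, 102
images_per_class = num_images // num_classes

one_layer = linear_params([2048, 102])
four_layers = linear_params([2048, 1024, 512, 256, 102])  # hypothetical hidden sizes

print(f"images per class: {images_per_class}")
print(f"one-layer head:   {one_layer:,} parameters")
print(f"four-layer head:  {four_layers:,} parameters")
```

Even the single layer has far more weights than there are training images, which is one way to see why a deeper head struggles on this dataset.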

Save and Load Model Checkpoint

  • Pro tip: did you know you can save and load models locally and in Google Drive? This way you don’t have to start from scratch every time. For example, if you have already trained for 5 epochs, you can save the weights and later train for another 5 epochs. Now you have done 10 epochs in total! Very convenient. The free GPU resources time out and get erased very often, so remember that incremental training is possible.
  • Save a checkpoint and load it locally

Checkpoint files commonly use the extensions .pt and .pth.
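
A checkpoint round trip can be sketched as follows. The dictionary keys (epoch, model_state_dict, class_to_idx) are our own naming, and the model is a tiny stand-in; adapt both to your project.

```python
import os
import tempfile

import torch
from torch import nn

model = nn.Linear(4, 2)  # stand-in for the trained model

# Bundle what is needed to resume or predict later (key names are our choice).
checkpoint = {
    "epoch": 5,
    "model_state_dict": model.state_dict(),
    "class_to_idx": {"1": 0, "10": 1},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "checkpoint.pth")
    torch.save(checkpoint, path)

    # map_location="cpu" lets a GPU-trained checkpoint load on a CPU-only machine.
    loaded = torch.load(path, map_location="cpu")

# Rebuild the same architecture, then restore the weights.
restored = nn.Linear(4, 2)
restored.load_state_dict(loaded["model_state_dict"])
start_epoch = loaded["epoch"]
print(f"resuming after epoch {start_epoch}")
```

On Google Colab, pointing the path at a mounted Google Drive folder keeps the file around after the runtime is recycled.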

Model Submission and Scoring

  • Set CPU as device

The Udacity server uses the CPU instead of a GPU, so don’t forget to modify your code before submitting (see SECTION D #2). @Vittorio Nardone

  • Use strict=False to ignore PyTorch 0.4.1 parameters that don’t exist in PyTorch 0.4.0

“Solution: Most probably you are using pytorch version 0.4.1 (or newer) on your computer, this version introduces several new keys in state dict such as num_batches_tracked that don’t exist in version 0.4.0 running in udacity lab machine. But this new keys can be ignored when you load the stated_dict using the option strict=False in function load_state_dict()” — Udacity slack channel
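
The effect of strict=False can be reproduced on any PyTorch version by adding a key the model does not know about. In this sketch the extra key name is invented to simulate a checkpoint saved by a newer release, and an in-memory buffer stands in for the checkpoint file.

```python
import io

import torch
from torch import nn


def build_model():
    return nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4))


# Simulate a checkpoint written by a newer version: one extra, unknown key.
state_dict = build_model().state_dict()
state_dict["1.key_from_newer_version"] = torch.tensor(0)  # hypothetical key

buffer = io.BytesIO()
torch.save(state_dict, buffer)
buffer.seek(0)

# Load onto the CPU regardless of where the checkpoint was saved.
loaded = torch.load(buffer, map_location="cpu")

model = build_model()
try:
    model.load_state_dict(loaded)  # strict=True is the default
    strict_failed = False
except RuntimeError:
    strict_failed = True

# strict=False skips keys that do not match instead of raising.
result = model.load_state_dict(loaded, strict=False)
model = model.to(torch.device("cpu"))
print(strict_failed, result.unexpected_keys)
```

Since strict=False also hides genuine mismatches, it is worth printing the skipped keys once to confirm that only the expected ones were ignored.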

  • What is my final project score? You will not be told the score. Udacity withheld the grading dataset so that you cannot just tune your model to memorize it. This is commonly known as a private leaderboard and is frequently used in grading and evaluating performance (Kaggle competitions work like this).
  • When submitting your model, make sure you have good validation accuracy in your notebook. Also make sure the training loss is not artificially low (your model memorized the data) while the validation loss stays high (your model did not learn anything that generalizes). If both are happening, you may be overfitting even if your validation accuracy is high, and your standing on the private leaderboard will be much lower than expected.

Developer Tools

  • Unofficial leaderboard: Link
  • Here is a list of useful Jupyter Notebooks from the course, directly relevant for the project:
  • An Udacity scholar published this extremely useful Google Colab notebook to get you started:

Error Messages and Troubleshooting

  • NaN error: not a number. Check whether you divide by zero anywhere in your loss calculation.
  • NameError: if Google Colab times out or its GPU quota is maxed out, you will receive this error. You need to restart the runtime (Menu > Runtime > Restart runtime), then install and import everything again from scratch. If you are working locally, a NameError usually means a missing variable or a missing import statement.
  • SyntaxError: Udacity uses Python 3, so code written with Python 2 style print statements raises a syntax error.
  • UserWarning: Implicit dimension choice for log_softmax has been deprecated. This means that you must specify the dimension for log_softmax to operate on. Usually you can use nn.LogSoftmax(dim=1), which normalizes across dimension 1 (the class scores of each sample).
  • RuntimeError: cuda runtime error. The Udacity test code uses the CPU instead of a GPU, so you need to switch devices.
  • AttributeError: module ‘PIL.Image’ has no attribute ‘register_extensions’ Link to explanation
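
For the log_softmax warning above, the sketch below shows what dim=1 means on a batch of class scores. The batch size of 4 and the random scores are placeholders.

```python
import torch
from torch import nn

# Raw scores for a batch of 4 samples and 102 classes: shape (batch, classes).
logits = torch.randn(4, 102)

# dim=1 normalizes over the class scores of each sample.
log_softmax = nn.LogSoftmax(dim=1)
log_probs = log_softmax(logits)

# Exponentiating gives probabilities that sum to 1 for every sample.
row_sums = log_probs.exp().sum(dim=1)
print(row_sums)

# NLLLoss expects log-probabilities, so it pairs with LogSoftmax.
labels = torch.randint(0, 102, (4,))
loss = nn.NLLLoss()(log_probs, labels)
print(loss.item())
```

Using dim=0 here would normalize across the batch instead, which runs without error but gives meaningless probabilities.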

Advanced Training

You can get even fancier by plotting the result of the classification in a confusion matrix. You can also plot the training loss and validation loss curve with epochs.
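
A confusion matrix is just a table of counts, so a minimal version needs no extra libraries. The labels below are made up, with three classes instead of 102 to keep the output readable.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[true][predicted] += 1
    return matrix


# Hypothetical labels for six validation images and three classes.
true_labels = [0, 0, 1, 2, 2, 2]
predicted_labels = [0, 1, 1, 2, 2, 0]

matrix = confusion_matrix(true_labels, predicted_labels, num_classes=3)
for row in matrix:
    print(row)
# [1, 1, 0]
# [0, 1, 0]
# [1, 0, 2]

# Correct predictions sit on the diagonal.
correct = sum(matrix[i][i] for i in range(3))
print(correct / len(true_labels))
```

Off-diagonal cells show which species get mistaken for which, something a single accuracy number cannot tell you.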

  • Calculate the mean and standard deviation of the flower dataset rather than using the ImageNet mean and stdev values
  • Do a train-test-validation split instead of just the current train-val split
  • Use advanced optimizer. Try DenseNet. Unfreeze all layers.
  • Instead of starting from random initialization, train first, then take the in-progress or newly trained weights and fine-tune all hyperparameters from there.
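
For the first idea in the list above, per-channel statistics can be computed as sketched here. A random tensor stands in for a batch of flower images already scaled to the 0–1 range; on the real dataset you would accumulate these statistics over all training batches.

```python
import torch

# Stand-in for a batch of RGB images scaled to [0, 1]: shape (N, 3, H, W).
images = torch.rand(16, 3, 32, 32)

# Reduce over batch, height and width, leaving one value per color channel.
mean = images.mean(dim=(0, 2, 3))
std = images.std(dim=(0, 2, 3))
print(mean, std)

# Normalizing with these values centers each channel around zero.
normalized = (images - mean[None, :, None, None]) / std[None, :, None, None]
channel_means = normalized.mean(dim=(0, 2, 3))
print(channel_means)
```

The resulting mean and std are what you would pass to the normalization transform in place of the ImageNet values.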

Appendix

pytorch facebook challenge project course outline and syllabus
