[Personal Notes] Deep Learning by Andrew Ng — Course 3: Structuring Machine Learning Projects

Keon Yong Lee
Mar 11, 2019
Orthogonalization

We briefly encountered orthogonalization while covering early stopping. Early stopping manipulates both bias and variance at once, which is not ideal for orthogonalization, a strategy that fixes different problems by controlling different knobs of the model. First, we try to fit the train set; if the model is not performing well there, we can build a bigger network or try different optimization techniques. Then we move on to fitting the dev set; regularization or a bigger train set can help when good train-set results do not transfer to the dev set. If we have a test set and the model does not do as well on it as on the dev set, we might need a bigger dev set. Finally, we check whether the model works on real-world data; if it struggles there, we need to change the dev set and/or the cost function.

Single Number Evaluation Metric

We need metrics to evaluate the performance of our model on different sets. Having multiple metrics makes it confusing to judge performance correctly, so we should try to come up with a single number evaluation metric that combines them into one score. For example, both precision and recall describe the accuracy of a classifier. With the two, we can compute their harmonic mean, the F1 Score = 2 / (1/P + 1/R). With the clear goal of maximizing the F1 Score, evaluating models is much easier than juggling two numbers.
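A minimal sketch of the harmonic mean at work; the example precision/recall values are made up for illustration:

```python
def f1_score(precision, recall):
    # Harmonic mean: F1 = 2 / (1/P + 1/R), dragged toward the weaker of the two.
    return 2 / (1 / precision + 1 / recall)

# P = 0.95 with R = 0.10 has an arithmetic mean of 0.525,
# but the F1 Score stays below 0.2, exposing the poor recall.
```

This is why F1 beats a plain average as a single number: a model cannot hide a terrible recall behind an excellent precision.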

Satisficing vs. Optimizing Metrics

When we compare different models, we might not end up with a single metric if we care about more than accuracy, such as runtime and memory. In such a situation, we set accuracy as the optimizing metric that we try to maximize, and make runtime and memory satisficing metrics that only need to pass certain thresholds. The best model is then the one with the highest accuracy among those that satisfy the runtime and memory standards.
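The selection rule can be sketched as a filter-then-maximize; the candidate models and thresholds below are hypothetical:

```python
# Hypothetical candidates: accuracy is the optimizing metric,
# runtime and memory are satisficing metrics with fixed thresholds.
models = [
    {"name": "A", "accuracy": 0.90, "runtime_ms": 80,   "memory_mb": 300},
    {"name": "B", "accuracy": 0.92, "runtime_ms": 95,   "memory_mb": 450},
    {"name": "C", "accuracy": 0.95, "runtime_ms": 1500, "memory_mb": 900},
]

def pick_best(candidates, max_runtime_ms=100, max_memory_mb=500):
    # Satisficing metrics filter; the optimizing metric decides among survivors.
    ok = [m for m in candidates
          if m["runtime_ms"] <= max_runtime_ms and m["memory_mb"] <= max_memory_mb]
    return max(ok, key=lambda m: m["accuracy"])
```

Here model C has the highest accuracy but fails the runtime threshold, so B wins even though it is not the most accurate model overall.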

Train/Dev/Test Sets

It is extremely important that the dev and test sets come from the same distribution, because we want to optimize the model toward a single, consistent target. By the same logic, they should also reflect the kind of data we expect to see in the future on real-world problems. For example, if we are building a recommendation model for Netflix, we cannot use data from Amazon for the dev and test sets. It makes no more sense than working through Calculus problem sets to prepare for a Linear Algebra final.

Size of Dev and Test Sets

We learned that giving a large portion of the data to the dev/test sets is not a good idea in the modern big-data era. Because Deep Learning is so data hungry, we need to use as much data as possible for training. The dev/test sets just need to be big enough to give us confidence in the evaluation; the rest goes to the train set.
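A small arithmetic sketch of this idea, assuming the roughly 1%/1% dev/test fractions often suggested for very large datasets:

```python
def split_sizes(n_examples, dev_frac=0.01, test_frac=0.01):
    # Dev/test only need enough examples to make differences between
    # models statistically convincing; everything else feeds training.
    dev = int(n_examples * dev_frac)
    test = int(n_examples * test_frac)
    return n_examples - dev - test, dev, test

# With a million examples, even a 98/1/1 split leaves 10,000 dev examples,
# unlike the traditional 60/20/20 split used on small datasets.
```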

When To Change Dev/Test Sets Or Evaluation Metric

It is not uncommon to change the dev/test sets or the evaluation metric during development because, in many cases, it is not clear at the beginning which target we should be aiming for. So we should define a reasonable target, and change it when we realize it was wrong.

Human Level Performance

Often when we develop a model, we first try to make it perform as well as humans and then push beyond human-level performance toward the lowest error possible, called Bayes error. It is much easier to reach human-level performance than to surpass it, because below that level we can use human intuition to improve the model: humans can help by labeling data, analyzing errors manually, and guiding bias/variance analysis. Once the model outperforms humans, however, human input is no longer much help, which makes progress toward Bayes error much slower.

Avoidable Bias

In order to do a proper bias/variance analysis, we need to figure out what our avoidable bias is. By definition, it is train error - Bayes error. For many perception tasks, it is nearly impossible for computers to be better than humans, so human-level error is usually close to Bayes error, and we can use avoidable bias = train error - human (Bayes) error. During bias/variance analysis, we first focus on minimizing the avoidable bias, and then work on reducing the variance, which is dev error - train error. The problems where machines can surpass humans tend to involve a lot of structured data. In those cases, bias/variance analysis becomes harder because we no longer have a clear picture of what Bayes error should be.
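The prioritization rule can be sketched in a few lines; the error values in the comments are made up for illustration:

```python
def diagnose(human_error, train_error, dev_error):
    # Human-level error stands in for Bayes error on perception tasks.
    avoidable_bias = train_error - human_error   # train gap above Bayes proxy
    variance = dev_error - train_error           # generalization gap
    return "reduce bias" if avoidable_bias >= variance else "reduce variance"

# human 1%, train 8%, dev 10%: the 7-point avoidable bias dominates.
# human 1%, train 2%, dev 10%: the 8-point variance dominates instead.
```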

Improving Model Performance

To fix an avoidable bias problem, we can train a bigger model, train for longer, use a better optimization algorithm, or change the model architecture and hyperparameters. To reduce variance, we can get more data, use regularization, or likewise change the architecture and hyperparameters.

Carrying Out Error Analysis

As mentioned, analyzing errors manually gives us insight into what is going on with our model and how to improve it. During error analysis, we look through the mistakes the model made and figure out whether they are worth fixing. For example, to see if it is worth fixing a cat classifier that labels some dogs as cats, collect around 100 mislabeled examples and count how many of them are dogs. If dogs make up 5% of a 10% error rate, the best possible error after fixing the dog problem is 9.5%. On the other hand, if dogs make up 50% of the 10% error, we could reach 5% error, which is well worth trying. The best possible improvement is called the ceiling. When we have several error-fixing ideas, we should make a chart to easily see which approach has the highest potential for improvement.
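The ceiling computation above can be sketched as a tally over hand-inspected mistakes; the category names and the 100-example sample are made up for illustration:

```python
from collections import Counter

def error_ceilings(mislabeled_categories, current_error):
    # For each category found among ~100 hand-inspected mistakes, estimate the
    # ceiling: the best error achievable if that category were fully fixed.
    counts = Counter(mislabeled_categories)
    n = len(mislabeled_categories)
    return {cat: current_error * (1 - c / n) for cat, c in counts.items()}

# 50 of 100 mistakes are dogs: fixing dogs could take 10% error down to 5%.
sample = ["dog"] * 50 + ["blurry"] * 30 + ["big cat"] * 20
ceilings = error_ceilings(sample, 0.10)
```

Sorting the resulting dictionary by value is exactly the chart the notes describe: the smallest ceiling marks the most promising fix.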

Cleaning Up Incorrectly Labeled Examples

Deep Learning algorithms are pretty robust to random, unintentional label errors in the train set. However, they are vulnerable to systematic errors that consistently mislabel something. So we usually leave incorrectly labeled examples in the train set untouched. For the dev/test sets, we can carry out error analysis to see whether fixing them is worth our time. If we decide to fix labels, we should apply the same process to both the dev and test sets so that their distributions stay the same.

Build First System Quickly, Then Iterate

As we have seen, it is almost impossible to get a good model on the first try. So do not hesitate too long before building a first system quickly. To do so, first set up the train, dev, and test sets and the evaluation metric. Then go ahead and build an initial system quickly, so that bias/variance analysis can tell us what to prioritize next.

Training And Testing On Different Distribution

Sometimes it is inevitable that the train and dev/test sets come from different sources, because Deep Learning algorithms are so data hungry that we need all the data we can collect. In this case, rather than randomly shuffling everything, we should fill the dev/test sets with the kind of data we expect in real-world situations and use the rest for training. This gives better performance in the long run because we are aiming at the right target.

Bias And Variance With Mismatched Data

If the train and dev/test sets have different distributions, variance no longer equals dev error - train error, which makes bias/variance analysis unreliable. To recover it, we set up a train-dev set that comes from the same distribution as the train set but is not used in training. With this new split, variance = train-dev error - train error and data mismatch = dev error - train-dev error.
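The full decomposition can be sketched as below; the four error values are hypothetical numbers chosen to show a mismatch-dominated case:

```python
def error_decomposition(human, train, train_dev, dev):
    # The train-dev set shares the train distribution but is held out of
    # training, letting us split the dev-set error into three causes.
    return {
        "avoidable bias": train - human,      # train error above Bayes proxy
        "variance": train_dev - train,        # same-distribution generalization gap
        "data mismatch": dev - train_dev,     # gap caused by the shifted distribution
    }

# Hypothetical errors: human 1%, train 2%, train-dev 3%, dev 9%.
gaps = error_decomposition(0.01, 0.02, 0.03, 0.09)
```

In this example the six-point gap between train-dev and dev error dwarfs the other two terms, so data mismatch, not bias or variance, is the problem to attack.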

Addressing Data Mismatch

If data mismatch becomes a major problem, there is no completely systematic solution, but we can try a few things. First, carry out manual error analysis to understand the difference between the train and dev/test sets. Then try to make the train set more similar to the dev/test sets, or collect more training data that resembles them. If you decide to artificially synthesize data, be careful not to overfit to a small subset of all possible examples.

Transfer Learning

In transfer learning, we transfer knowledge from task A to task B. For example, we could start an X-ray model from a network trained on ImageNet data rather than from scratch. To do transfer learning, we first replace the output layer so that it serves our purpose (the correct number of classes). We can retrain the whole network if we have enough data for task B; if not, we should retrain only the later layers, because the early layers have already learned from task A to recognize simple features that are often useful for task B as well. For example, the early layers of a model trained on ImageNet will detect low-level features like edges and shapes that also apply to X-ray images. Training on task A first is called pretraining, and retraining layers on task B is called fine tuning. To speed up fine tuning when the early layers are frozen, we can precompute the activations that feed into the first retrained layer; since the frozen layers never change, this saves many layers of computation on every pass. Transfer learning makes sense when we do not have a lot of task-B data and tasks A and B have the same type of input, like images or audio.
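A toy sketch of freezing early layers and retraining a new output layer, with the activation-caching trick. Everything here is made up for illustration: the "pretrained" layer is just a fixed random projection standing in for task-A features, and the task-B data is synthetic.

```python
import math, random

random.seed(0)

# Stand-in for frozen pretrained early layers (real transfer learning
# would load weights learned on task A, e.g. ImageNet conv layers).
W1 = [[random.gauss(0, 1) for _ in range(4)] for _ in range(8)]

def frozen_features(x):
    # ReLU(x @ W1): never updated during fine tuning.
    return [max(0.0, sum(xi * W1[i][j] for i, xi in enumerate(x)))
            for j in range(4)]

# Tiny synthetic task-B dataset: 8 inputs, binary label derived from x[0].
inputs = [[random.gauss(0, 1) for _ in range(8)] for _ in range(40)]
data = [(x, 1 if x[0] > 0 else 0) for x in inputs]

# Precompute the frozen activations once: the caching trick that speeds up
# fine tuning, since the inputs to the retrained layer never change.
cached = [(frozen_features(x), y) for x, y in data]

# New output layer for task B, trained from scratch (old one discarded).
w, b = [0.0] * 4, 0.0

def predict(a):
    z = sum(wi * ai for wi, ai in zip(w, a)) + b
    p = 1.0 / (1.0 + math.exp(-z))              # sigmoid output
    return min(max(p, 1e-12), 1 - 1e-12)        # clamp for numerical safety

def loss():
    return -sum(y * math.log(predict(a)) + (1 - y) * math.log(1 - predict(a))
                for a, y in cached) / len(cached)

before = loss()
for _ in range(300):                 # fine tune: update only the new layer
    for a, y in cached:
        g = predict(a) - y           # d(loss)/d(logit) for sigmoid + NLL
        for j in range(4):
            w[j] -= 0.1 * g * a[j]
        b -= 0.1 * g
after = loss()                       # loss drops even though W1 stayed frozen
```

Because only the small output layer is trained, each pass touches the cached activations rather than recomputing the frozen layers, which is where the speedup comes from in real networks with many frozen layers.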

Multitask Learning

In multitask learning, we train one neural network on multiple tasks simultaneously, whereas in transfer learning we trained sequentially from task A to task B. The intuition is that the earlier layers share the same low-level feature learning, so data from the different tasks help each other, which can beat training separate networks. For example, we could build an object recognition system that detects multiple objects at the same time. Each image then has a multi-hot label like [1, 0, 1, 0], because an image may contain more than one object, which makes the usual cross-entropy loss with Softmax unusable. Instead, we modify the loss so that it is more like cross-entropy with a Sigmoid applied to each class independently. Multitask learning is reasonable when we are training on a set of tasks that share low-level features and have similar amounts of data, and when we have the resources to train a network big enough to do several tasks at once.
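A minimal sketch of that per-class Sigmoid cross-entropy for one example; the handling of `None` labels (tasks left unlabeled for that image) follows the course's point that partially labeled data can still be used:

```python
import math

def multitask_loss(logits, labels):
    # Independent Sigmoid cross-entropy per class: each output answers its
    # own "is object j present?" question, unlike Softmax, which forces
    # the classes to compete for a single probability mass.
    total = 0.0
    for z, y in zip(logits, labels):
        if y is None:        # unlabeled task for this example: skip it,
            continue         # so partially labeled data still contributes
        p = 1.0 / (1.0 + math.exp(-z))
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total
```

With all logits at 0 (probability 0.5 everywhere), each labeled class contributes log 2 to the loss regardless of its label, which is the expected "maximally uncertain" baseline.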

End-To-End Deep Learning

An end-to-end Deep Learning system uses a single neural network to do everything the system needs. The alternative is a pipeline of multiple stages, each its own learning system or a hand-designed feature extractor. For example, a face recognizer might use one network to detect faces in images of people and another to identify whose face each one is. The pros of end-to-end Deep Learning are that we let the data speak for itself without imposing human intuition, and that we do not spend effort on hand-designing components. The cons are that it may need a lot of data and that it excludes potentially useful hand-designed components.