Neural Network 07 — Setting up your Machine Learning Application
Welcome to lesson 07. 😀 In this lesson you will learn how to set up your machine learning application. If you are interested in the earlier lessons of this series, you can go through them using the following links.
- Prerequisites
- Logistic Regression is a solid base
- Neural Network Representation
- Activation functions
- Gradient descent for Neural Network
- Deep L-layer neural network
If you are good to go, awesome! let’s get started. 😎
Train / Dev / Test sets
If you already have some knowledge about machine learning, you know that we usually split our dataset into different portions for training, validating, and testing our model.
Applied machine learning is a highly iterative process. Making good choices in how you set up your training, development, and test sets makes a huge difference in building a high-performance neural network.
When we train a neural network, we have to make a number of decisions, such as choosing:
- number of layers
- number of hidden units
- learning rate
- activation functions
- and many more…
When we start a new application, it is impossible to correctly guess the right values for all of these hyperparameters on the first run.
We have to iterate through these steps many times to find a better neural network for our application.
As I mentioned earlier, setting up our train, development, and test sets well can make this iteration much more efficient.
We usually follow these steps iteratively.
- Typically, we divide our dataset into 3 parts:
a). Train set
b). Hold-out cross validation set / development / dev set
c). Test set
- We train our model on the train set and validate its output on the dev set. We repeat these steps until we find the best model.
- Then we run our best model on the test set and evaluate it there, in order to get an unbiased estimate of how well our algorithm is doing.
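The steps above can be sketched as a loop. This is a minimal, illustrative sketch: `train_model` and `error_rate` are hypothetical stand-ins for real training and evaluation, not actual network code.

```python
def train_model(train_set, learning_rate):
    # Stand-in for real training; a real version would fit a network here.
    return {"lr": learning_rate}

def error_rate(model, dataset):
    # Stand-in evaluation; made deterministic purely for illustration.
    return abs(model["lr"] - 0.01)

data = list(range(1000))  # hypothetical dataset of 1,000 records
train, dev, test = data[:600], data[600:800], data[800:]

# Steps 1+2: train candidates on the train set, compare them on the dev set.
best_model, best_err = None, float("inf")
for lr in [0.1, 0.01, 0.001]:
    model = train_model(train, lr)
    err = error_rate(model, dev)
    if err < best_err:
        best_model, best_err = model, err

# Step 3: the test set is touched exactly once, by the final model only,
# so the resulting error estimate stays unbiased.
unbiased_estimate = error_rate(best_model, test)
```

The key point the sketch encodes is that hyperparameter choices are made against the dev set only; the test set never influences which model is picked.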
Is a 70% / 30% train / test split still valid?
In an earlier era of machine learning, people usually split their datasets 70% / 30% into train / test, or 60% / 20% / 20% into train / dev / test sets.
This strategy is reasonable for datasets with 100, 1,000, or even 100,000 records.
But…
Now, in this Big Data era, the above percentages no longer make sense.
For example, let’s say we have a dataset with 1 million records. We do not need 20% of 1 million as our dev / test sets. (That is a huge amount, and in a way it wastes valuable data.)
10,000 records from that dataset are good enough for the validation and evaluation process.
So, for the above 1-million-record dataset, the splitting ratio can be set as
98% | 1% | 1% (train | dev | test)
For even larger datasets, we can push these ratios further, e.g.
99.5% | 0.4% | 0.1% (train | dev | test)
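The arithmetic behind those ratios can be made concrete with a small helper (the function name and thresholds here are my own, for illustration):

```python
def split_sizes(n, train_frac, dev_frac):
    """Turn split fractions into record counts; test gets the remainder."""
    train_n = round(n * train_frac)
    dev_n = round(n * dev_frac)
    return train_n, dev_n, n - train_n - dev_n

# Classic 60 / 20 / 20 split on 10,000 records:
print(split_sizes(10_000, 0.60, 0.20))      # (6000, 2000, 2000)

# 98 / 1 / 1 split on 1 million records: dev and test still get
# 10,000 examples each, plenty for validation and evaluation.
print(split_sizes(1_000_000, 0.98, 0.01))   # (980000, 10000, 10000)
```

Notice that the absolute dev/test sizes, not the percentages, are what matter: 1% of a million records gives the same 10,000 examples that a 20% slice gives on a 50,000-record dataset.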
Mismatched train/test distributions
Let’s consider an image processing task. There are many sources from which we can collect images for such a task:
- Original images we collected for our image processing task
- Our own images
- Augmented images from existing image sets
- Images downloaded from the internet
- Images uploaded by users through our application
These sources can contain professional, high-resolution images as well as poor-quality, low-resolution ones, so the amount of useful information in the images differs from source to source.
E.g., our original dataset may contain mostly professional, high-resolution images, while the images uploaded by users may be of poor quality and low resolution.
In that case, those two datasets have two different data distributions. The following diagram shows how we can manage the situation to some extent.
Not having a test set is also totally fine; the train and dev sets are the essential parts here.
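One common way to manage such a mismatch (the counts and names below are hypothetical) is to build dev and test entirely from the distribution you actually care about, here the user uploads, and fold the plentiful but differently-distributed web images into the train set only:

```python
# Hypothetical image lists standing in for real files.
web_images = [f"web_{i}" for i in range(200_000)]    # scraped, high quality
user_images = [f"user_{i}" for i in range(10_000)]   # target distribution

# Reserve dev and test entirely from user images, so evaluation
# reflects the data the deployed application will actually see...
dev = user_images[:2_500]
test = user_images[2_500:5_000]

# ...and train on everything else, mixing web images with the
# remaining user images.
train = web_images + user_images[5_000:]

print(len(train), len(dev), len(test))  # 205000 2500 2500
```

The design choice here is that dev/test purity matters more than train purity: the model may train on a mixed distribution, but it is always judged on the target one.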
Bias and variance
In machine learning or deep learning, bias and variance are two sources of error that can affect the performance of a model.
Bias: refers to the error introduced by approximating a real-world problem, which may be complex, with a simplified model.
High bias causes poor performance on both the train and test sets, which is called underfitting. It is typically caused by models with a very simple architecture: the model is too simple to capture the underlying patterns of the training set.
Variance: refers to the model’s sensitivity to variations in the training set.
High variance causes poor performance on the test set or on new/unseen data, which is called overfitting. It is typically caused by a model that is too complex.
The following diagram explains it more clearly.
Even though we say that low bias and low variance are good, we have to consider the tradeoff between the two. Cross-validation, regularization, and ensemble techniques can be used to manage the balance between bias and variance.
But… how do we decide whether a given error percentage is high or low? Let’s say task 1 has a 5% train error and task 2 has a 15% train error. Can we say task 1 performs better than task 2? BIG NO! 😏
It depends on the Bayes optimal error of each individual task.
Bayes optimal error
Bayes optimal error is the lowest possible error that any model can achieve on a particular task.
If the Bayes optimal error of a task is 15%, then a 15% train error does not indicate high bias.
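This comparison against Bayes optimal error can be sketched as a small diagnostic helper. The 2% thresholds and label strings are my own illustrative choices, not values from the lesson:

```python
def diagnose(train_err, dev_err, bayes_err=0.0, gap=0.02):
    """Rough bias/variance diagnosis; thresholds are illustrative."""
    avoidable_bias = train_err - bayes_err   # how far training is from the best achievable
    variance = dev_err - train_err           # how much worse we do on unseen data
    labels = []
    if avoidable_bias > gap:
        labels.append("high bias (underfitting)")
    if variance > gap:
        labels.append("high variance (overfitting)")
    return labels or ["looks okay"]

# Task 1: 5% train error with Bayes error ~0% -> noticeable avoidable bias.
print(diagnose(0.05, 0.06, bayes_err=0.0))   # ['high bias (underfitting)']

# Task 2: 15% train error with Bayes error ~15% -> no bias problem at all.
print(diagnose(0.15, 0.16, bayes_err=0.15))  # ['looks okay']
```

This is exactly why the 5%-vs-15% comparison in the text is meaningless on its own: it is the gap above Bayes error, not the raw number, that signals high bias.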
Basic recipe for machine learning
In an earlier era of machine learning there was a lot of discussion about the “bias variance tradeoff”, because at that time, reducing the bias increased the variance, and reducing the variance increased the bias. So we needed some kind of balance/tradeoff between bias and variance to get optimal model performance.
But…
In the modern deep learning / big data era, as long as we can keep training a bigger network and can keep getting more data:
training a bigger network ➡ just reduces the bias ✔
getting more data ➡ just reduces the variance ✔
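That recipe can be written as a single decision step. Again the thresholds and messages here are illustrative stand-ins, not part of the lesson:

```python
def basic_recipe(train_err, dev_err, target_err=0.02, gap=0.02):
    """One step of the basic recipe: decide what to try next.
    target_err plays the role of (an estimate of) Bayes optimal error."""
    if train_err - target_err > gap:
        # Training error is far above the best achievable -> bias problem.
        return "high bias: try a bigger network or train longer"
    if dev_err - train_err > gap:
        # Dev error is far above training error -> variance problem.
        return "high variance: get more data or add regularization"
    return "done: both bias and variance look acceptable"

print(basic_recipe(0.15, 0.16))  # bias branch: fix the network first
print(basic_recipe(0.02, 0.11))  # variance branch: fix the data next
print(basic_recipe(0.02, 0.03))  # both gaps small: stop iterating
```

Note the ordering the recipe implies: fix bias first (the train error), then fix variance (the train-to-dev gap), and repeat until both are acceptable.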
Great!!! Completed another lesson ✅. 😄 Hope you enjoyed this lesson as well. See you in the next lesson. Good Luck!!! Keep Learning!!! 🎯