Deep Learning
Fastai Chapter 4: Questions & Answers
The answer key for the questionnaire at the end of the chapter
Chapter Summary:
Chapter 4 provides an overview of the training process. It covers loading datasets, making predictions, measuring loss, calculating gradients, and updating weights and biases. It also covers some of the tensor operations, activation functions, loss functions, optimizer functions, and learning rate.
01. How is a grayscale image represented on a computer? How about a color image?
Grayscale Image: An image with one channel that’s represented as a 2-dimensional matrix. It contains pixel values that represent the intensity of light for each pixel in the image where zero is a black pixel, 255 is a white pixel, and all the values in between are the different shades of gray pixels.
Color Image: An image with three channels that’s represented as a rank-3 tensor (a 3-dimensional array). It contains three 2-dimensional matrices of pixel values, one each for red, green, and blue, where each value represents the intensity of that color for each pixel in the image.
02. How are the files and folders in the MNIST_SAMPLE dataset structured? Why?
The MNIST_SAMPLE dataset is structured using the file and folder names. It separates the training and validation sets in the dataset into the train and valid subdirectories. It separates the images in the train and valid subdirectories into the 3 and 7 subdirectories. It also organizes the images in the 3 and 7 subdirectories using sequentially numbered file names.
- It makes it possible to extract the labels from the file and folder names.
03. Explain how the “pixel similarity” approach to classifying digits works.
Pixel Similarity: An approach that measures the difference between a new image and the ideal image of each digit. It creates each ideal image by calculating the average value of every pixel across all the training images of that digit. It then measures the distance between the pixel values in the new image and each ideal image.
- The new image is classified as the digit whose ideal image is the shortest distance away.
04. What is list comprehension? Create one now that selects odd numbers from a list and doubles them.
List Comprehension: A syntax that creates a new list based on an existing list. It creates the new list by evaluating an expression for each item in the existing list. It contains three parts, declared between square brackets: the expression, the for loop, and an optional if-condition.
[expression for item in list if-condition]
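As a small illustration of the template above, the comprehension below selects the odd numbers from a list and doubles them. The names numbers and doubled_odds are only examples.

```python
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# expression: n * 2 | for loop: for n in numbers | if-condition: n is odd
doubled_odds = [n * 2 for n in numbers if n % 2 == 1]

print(doubled_odds)  # [2, 6, 10, 14, 18]
```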
05. What is a rank-3 tensor?
Tensor Rank: Refers to the number of dimensions in the tensor. It refers to n-dimensions where rank zero is a scalar with zero dimensions, rank one is a vector with one dimension, rank two is a matrix with two dimensions, and rank three is a cuboid with three dimensions. It can also be determined by the number of indices that are required to access a value within the tensor.
06. What is the difference between tensor rank and shape? How do you get the rank from the shape?
Tensor Shape: Describes the number of items in each axis in the tensor. It contains information about the number of matrices, number of rows in each matrix, and number of columns in each matrix. It also helps visualize the tensor which becomes useful for higher rank tensors that are more abstract.
- The rank refers to the number of dimensions in the tensor.
- The shape refers to the number of items in each dimension of the tensor.
- The rank is equal to the length of the shape, that is, how many numbers the shape contains.
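A minimal sketch of the relationship between rank and shape, assuming PyTorch is installed. The tensor t is a made-up example of four matrices with three rows and five columns each.

```python
import torch

# A hypothetical stack of 4 matrices, each with 3 rows and 5 columns
t = torch.zeros(4, 3, 5)

shape = t.shape      # number of items along each axis
rank = len(t.shape)  # the rank is the length of the shape

print(shape)         # torch.Size([4, 3, 5])
print(rank, t.ndim)  # 3 3
```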
07. What are RMSE and L1 norms?
Mean Absolute Error (MAE): A loss function that measures the distance between the predicted and label values. It calculates the mean of the absolute difference between the predicted and label values. It also gives equal emphasis to normal and outlier errors using the same linear scale.
- It performs well when there are large errors from outliers.
- It performs poorly when there aren’t large errors from outliers.
- It performs the same calculation as the “L1 Norm” in mathematics.
mae = (predicted_values - label_values).abs().mean()
Root Mean Squared Error (RMSE): A loss function that measures the distance between the predicted and label values. It calculates the square root of the mean of the squared difference between the predicted and label values. It also gives significantly more emphasis to outlier errors.
- It performs well when there aren’t large errors from outliers.
- It performs poorly when there are large errors from outliers.
- It performs the same calculation as the “L2 Norm” in mathematics.
rmse = ((predicted_values - label_values)**2).mean().sqrt()
08. How can you apply a calculation on thousands of numbers at once, many thousands of times faster than a Python loop?
NumPy Array: A multi-dimensional table of data that can perform numerical operations on its items. It contains items that all have the same data type, which can be any type, including arrays of arrays. Its operations run in optimized C code on the CPU, which is thousands of times faster than pure Python.
PyTorch Tensor: A data structure that’s similar to the NumPy array but with an extra restriction that unlocks new capabilities. It can only contain items that are the same basic numeric data type. Its operations run in optimized C code on the CPU, which is thousands of times faster than pure Python, or on the GPU, which can be up to millions of times faster.
- By performing the calculations using NumPy arrays or PyTorch tensors.
09. Create a 3×3 tensor or array containing the numbers from 1 to 9. Double it. Select the bottom-right four numbers.
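One possible answer, sketched with PyTorch (a NumPy array would work similarly). The variable names are only examples.

```python
import torch

# Numbers 1 to 9 arranged as a 3x3 tensor
t = torch.arange(1, 10).view(3, 3)

# Double every item
doubled = t * 2

# Rows 1 onward and columns 1 onward give the bottom-right 2x2 block
bottom_right = doubled[1:, 1:]

print(bottom_right.tolist())  # [[10, 12], [16, 18]]
```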
10. What is broadcasting?
Broadcasting: A mechanism that performs an operation between tensors with different shapes. It compares the shapes starting from the last dimension and requires each pair of dimensions to be equal or one of them to be one. It automatically expands the tensor with the smaller rank, without copying its data, so that it behaves as if it had the same shape as the larger tensor.
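A short sketch of broadcasting, assuming PyTorch is installed. The tensors matrix and row are made-up examples: a vector of shape (3,) is added to every row of a matrix of shape (2, 3).

```python
import torch

matrix = torch.tensor([[1., 2., 3.],
                       [4., 5., 6.]])  # shape (2, 3)
row = torch.tensor([10., 20., 30.])    # shape (3,)

# The row is treated as if it were repeated for each row of the matrix
result = matrix + row

print(result.shape)     # torch.Size([2, 3])
print(result.tolist())  # [[11.0, 22.0, 33.0], [14.0, 25.0, 36.0]]
```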
11. Are metrics generally calculated using the training set or the validation set? Why?
Validation Set: A subset of the dataset that isn’t used to train the model. It can be used to produce an unbiased evaluation of the model because the model isn’t allowed to learn or memorize the data in the subset. It can also be used to detect underfitting and overfitting and to guide the fine-tuning of the model.
- The metrics are usually calculated using the validation set.
- It produces an unbiased evaluation of the performance of the model.
12. What is SGD?
Stochastic Gradient Descent (SGD): An optimization function that minimizes the loss value. It initializes the weights and biases using random values. It calculates the gradients of the loss function to identify whether increasing or decreasing the weights and biases will reduce the loss value. It multiplies the learning rate by the gradients to determine how much to increase or decrease the weights and biases. It also subtracts the result from the weights and biases to update the weights and biases.
- Imagine being lost in the mountains with your car parked at the bottom of the mountain. It would be good to always take steps downhill which eventually leads to your car. It would also be good to know how big of steps to take and to keep taking steps until you’ve reached the bottom.
13. Why does SGD use mini-batches?
The optimization algorithm can calculate the gradients using one or more items at the same time. It can use the entire dataset, but that takes too long and might not fit into memory. It can use a single item in the dataset, but that can be imprecise and unstable. It can also use multiple items in a mini-batch, which is reasonably accurate and stable, and more so for larger mini-batches.
- It produces results with the best speed, accuracy, and stability overall.
14. What are the seven steps in SGD for machine learning?
- Initialize the weights and biases with random values.
- Calculate the predictions.
- Calculate the loss.
- Calculate the gradients.
- Update the weights.
- Go to step two and repeat the process.
- Stop when the model is good enough.
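The seven steps can be sketched in plain Python on a toy problem. This is only an illustration under stated assumptions: the data is made up to follow y = 3 * x, the model has a single weight w, and the gradient is calculated over the whole toy dataset at each step rather than over mini-batches.

```python
import random

random.seed(0)

# Toy data following y = 3 * x (made up for illustration)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

learning_rate = 0.01
w = random.uniform(-1.0, 1.0)        # step 1: initialize with a random value

for step in range(1000):             # step 6: repeat the process
    preds = [w * x for x in xs]      # step 2: calculate the predictions
    # step 3: calculate the loss (mean squared error)
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    if loss < 1e-8:                  # step 7: stop when good enough
        break
    # step 4: calculate the gradient of the loss with respect to w
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    w -= learning_rate * grad        # step 5: update the weight

print(round(w, 3))
```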
15. How do we initialize the weights in a model?
The weights and biases in the model are initialized using random values. This produces incorrect predictions and high loss values at first, but that’s fine because the optimization algorithm continuously updates the weights and biases, which minimizes the loss value to produce the most accurate predictions.
16. What is loss?
Loss: A value that measures how incorrect the predictions are. It calculates the distance between the predicted values and label values using one of the loss functions. It can be used to monitor for underfitting and overfitting. It can also be used to calculate the gradients that update the weights and biases.
17. Why can’t we always use a high learning rate?
Learning Rate: A hyperparameter that controls how much to adjust the weights and biases at each optimization step during the training process. It causes the training process to take too long or get stuck in a local minimum when the value is too low. It also causes the training process to overshoot the global minimum or bounce around forever when the value is too high.
- It causes the training process to overshoot the global minimum.
- It causes the training process to bounce around forever.
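The effect can be seen on a toy function. The sketch below, with a hypothetical helper named descend, runs gradient descent on f(x) = x**2, whose minimum is at zero and whose gradient is 2 * x.

```python
def descend(learning_rate, steps=20, start=1.0):
    """Run gradient descent on f(x) = x**2, whose gradient is 2 * x."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * x
    return x

small = descend(0.1)  # each step shrinks x toward the minimum at 0
large = descend(1.1)  # each step overshoots and lands further away

print(abs(small) < 1.0, abs(large) > 1.0)  # True True
```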
18. What is a gradient?
Gradient: A vector of partial derivatives of the loss function with respect to the weights and biases. It measures the change in the loss value with respect to the changes in the weights and biases. It also identifies whether increasing or decreasing the weights and biases will reduce the loss value.
19. Do you need to know how to calculate gradients yourself?
requires_grad_: A method that informs PyTorch to calculate the gradients of the tensor. It tags the tensor to keep track of all the operations that are applied to the tensor so the gradients can be calculated during the backward pass. It also sets the gradients to automatically accumulate in the grad attribute.
- No, it isn’t necessary to know how to calculate gradients manually.
import torch
variable_name = torch.tensor([1., 2., 3.]).requires_grad_()
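As a minimal sketch of PyTorch calculating a gradient on its own, the example below differentiates y = x**2 at x = 3, where the derivative 2 * x equals 6. The names x and y are only examples.

```python
import torch

x = torch.tensor(3.0).requires_grad_()

y = x ** 2    # forward pass: the operations on x are recorded
y.backward()  # backward pass: calculates dy/dx

print(x.grad)  # tensor(6.)
```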
20. Why can’t we use accuracy as a loss function?
Classification Accuracy: A metric that measures how often the model makes correct predictions. It can be calculated by dividing the number of correct predictions by the total number of predictions. It also only works well when there’s approximately an equal number of items in each class.
Loss Function: A function that calculates the loss value by measuring the distance between the predicted and label values. It changes the loss value in response to the changing weights and biases which produces a slightly better loss value as the model makes slightly better predictions.
- The accuracy metric isn’t sensitive enough to changes in the weights and biases to allow the model to learn effectively during the training process. It only changes when the weights and biases have changed enough to make different predictions. It also produces many gradients that equal zero, which prevents the weights and biases from updating.
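The difference in sensitivity can be sketched in plain Python. Everything here is made up for illustration: a one-weight model, four inputs with binary labels, and a simple loss that averages the distance between each predicted probability and its label.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Made-up inputs and binary labels
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

def accuracy(w):
    preds = [sigmoid(w * x) > 0.5 for x in xs]
    return sum(p == bool(y) for p, y in zip(preds, ys)) / len(xs)

def loss(w):
    # Mean distance between the predicted probability and the label
    return sum(abs(sigmoid(w * x) - y) for x, y in zip(xs, ys)) / len(xs)

# A small change in the weight leaves the accuracy unchanged...
print(accuracy(1.0), accuracy(1.1))  # 1.0 1.0
# ...but the loss still improves, which gives SGD a useful gradient
print(loss(1.0) > loss(1.1))         # True
```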
21. Draw the sigmoid function. What is special about its shape?
Sigmoid: An activation function that converts the predicted values into probability values that range between 0.0 and 1.0. It maps large positive numbers to values close to 1.0 and large negative numbers to values close to 0.0. It is also well suited to binary and multi-label classification problems.
- Sigmoid is an activation function that’s named after its shape which resembles the letter “S” when plotted. It produces a smooth, always increasing curve that gradually transitions from values just above 0.0 to values just below 1.0.
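In place of a drawing, the shape can be checked numerically. This is a minimal sketch in plain Python using the standard formula 1 / (1 + e^-x).

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# The outputs rise smoothly from near 0.0, through 0.5 at x = 0, to near 1.0
for x in [-6, -2, 0, 2, 6]:
    print(x, round(sigmoid(x), 3))
```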
22. What is the difference between a loss function and a metric?
Loss Function: A function that calculates the loss value by measuring the distance between the predicted and label values. It changes the loss value in response to the changing weights and biases which produces a slightly better loss value as the model makes slightly better predictions.
- It provides an interpretation of the performance of the model that’s easy for the computer to understand which helps to minimize the loss value and monitor for things like overfitting, underfitting, and convergence.
Metric: A function that evaluates the performance of the model. It usually measures performance differently based on the type of model and dataset. It measures the accuracy, precision, or recall for classification models with balanced datasets. It also measures the area under the curve of receiver operating characteristics for classification models with imbalanced datasets.
- It provides an interpretation of the performance of the model that’s easy for humans to understand which helps give meaning to the numbers in the context of the goals of the overall project and project stakeholders.
23. What is the function to calculate new weights using a learning rate?
Optimizer: A function that minimizes the loss function. It calculates the gradients of the loss function to identify whether increasing or decreasing the weights and biases will reduce the loss value. It multiplies the learning rate by the gradients to determine how much to increase or decrease the weights and biases. It also subtracts the result from the weights and biases to update the weights and biases which produces a smaller loss value.
- Different optimizer functions perform different variations of the task.
24. What does the DataLoader class do?
DataLoader: A class that separates the dataset into mini-batches. It can shuffle the data before the data is separated. It can pass the mini-batches to the Learner object using sequential or parallel processing. It takes the data through the dataset parameter and the batch size, an integer value, through the bs parameter. The DataLoader objects for the training and validation sets are then grouped in a DataLoaders object.
25. Write pseudocode showing the basic steps taken in each epoch for SGD.
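for x_batch, y_batch in train_dataloader:
    # Forward pass and loss for one mini-batch
    predictions = linear_model(x_batch)
    loss = mnist_loss(predictions, y_batch)
    # Backward pass: calculate the gradients
    loss.backward()
    # Step: update each parameter, then reset its gradient
    for parameter in parameters:
        parameter.data -= parameter.grad.data * learning_rate
        parameter.grad = None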
26. Create a function that, if passed two arguments [1,2,3,4] and ‘abcd’, returns [(1, ‘a’), (2, ‘b’), (3, ‘c’), (4, ‘d’)]. What is special about that output data structure?
The output has the same data structure as the Dataset object that’s used in PyTorch. It contains a list of tuple objects that each store an input and label value. It also pairs the input and label values found at the same index in the first and second arguments of the zip function.
- It shows that a PyTorch Dataset is a simple, convenient structure rather than dark magic.
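One way to write the requested function is a thin wrapper around zip. The name pair_up is hypothetical.

```python
def pair_up(inputs, labels):
    """Pair each input with the label at the same position."""
    return list(zip(inputs, labels))

pairs = pair_up([1, 2, 3, 4], 'abcd')
print(pairs)  # [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
```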
27. What does view do in PyTorch?
View: A method that reshapes the tensor object without changing the contents. It can perform memory-efficient reshaping, slicing, and element-wise operations because it doesn’t create a copy of the data. It also shares the underlying data with the original tensor which means any changes that are made to the data in the view will be reflected in the original tensor.
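A small sketch of the shared data, assuming PyTorch is installed. The tensors original and reshaped are made-up examples.

```python
import torch

original = torch.arange(6)      # shape (6,)
reshaped = original.view(2, 3)  # shape (2, 3), same underlying data

reshaped[0, 0] = 100            # modify the data through the view

print(original[0].item())       # 100
```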
28. What are the bias parameters in a neural network? Why do we need them?
Bias: A parameter that offsets the output value of the nodes in the model to better fit the data during the training process. It gets added to the sum of the products of the input and weight values before the result is passed to the activation function. It also shifts the activation function to the left or right which moves the entire curve to delay or accelerate the activation.
- Without it, the value passed to the activation function would always be zero whenever the input values are zero.
29. What does the @ operator do in Python?
Matrix Multiplication (@): An operator that applies matrix multiplication between two arrays. It performs the same operation as the matmul function from the NumPy library. It also makes matrix formulas much easier to read which makes it much easier to work with for both experts and non-experts.
# Nested calls to the matmul function
np.matmul(np.matmul(np.matmul(A, B), C), D)
# The same product written with the @ operator
A @ B @ C @ D
30. What does the backward method do?
Backward: A method that calculates the gradient of the loss value with respect to the weights and biases so they can later be updated to reduce the loss value. It calculates the gradient by passing backward through the history of operations that are tracked by the tensor, which is known as backpropagation. It also stores the gradients in the grad attribute of each tensor.
31. Why do we have to zero the gradients?
The gradients need to be reset after the backward method is called because they accumulate in the grad attribute every time the backward method is called, which would prevent the weights and biases from updating correctly.
for parameter in parameters:
    parameter.data -= parameter.grad.data * learning_rate
    parameter.grad = None  # reset so the next backward call starts fresh
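The accumulation can be observed directly. In this minimal sketch, which assumes PyTorch is installed, the gradient of w * 3 with respect to w is always 3, yet a second backward call without a reset reports 6.

```python
import torch

w = torch.tensor(2.0).requires_grad_()

(w * 3).backward()
first = w.grad.item()   # 3.0

(w * 3).backward()      # no reset in between
second = w.grad.item()  # 6.0: the new gradient was added to the old one

w.grad = None           # reset the gradient
(w * 3).backward()
third = w.grad.item()   # 3.0 again

print(first, second, third)  # 3.0 6.0 3.0
```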
32. What information do we have to pass to Learner?
Learner: A class that stores everything that’s needed to train the model and perform transfer learning. It mostly performs the training loop, customizes the training loop, loads and saves the model, and prints the evaluation metrics. It also requires the following items to create the Learner object.
- DataLoaders object.
- Model class.
- Optimizer function.
- Loss function.
33. Show Python or pseudocode for the basic steps of a training loop.
for _ in range(number_of_epochs):
    for x_batch, y_batch in train_dataloader:
        predictions = linear_model(x_batch)
        loss = mnist_loss(predictions, y_batch)
        loss.backward()
        for parameter in parameters:
            parameter.data -= parameter.grad.data * learning_rate
            parameter.grad = None
34. What is ReLU? Draw a plot of it for values from -2 to +2.
Rectified Linear Unit (ReLU): An activation function that replaces all the negative input values with zero and leaves the positive input values unchanged. It helps reduce the vanishing gradient problem. It can also prevent the weights and biases from updating properly when too many activations are zero, because the gradient of ReLU is zero for negative inputs.
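In place of a plot, the values from -2 to +2 can be listed with a plain Python sketch. The output is flat at zero for negative inputs and a straight line with slope one for positive inputs.

```python
def relu(x):
    return max(0.0, x)

for x in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(x, relu(x))
```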
35. What is an activation function?
Activation Function: A function that decides which input values are most important for making predictions. It adds non-linearity to the architecture of the model which lets the model identify complex relationships between the input and output values that are essential for learning complex data.
36. What’s the difference between F.relu and nn.ReLU?
Both apply the same ReLU activation function in different ways.
- F.relu is a function that’s typically called on a tensor inside the forward method when building the model by defining a class.
- nn.ReLU is a module that’s typically used as a layer when building the model using the Sequential module.
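A brief sketch of the two forms side by side, assuming PyTorch is installed. The layer sizes and variable names are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-2.0, 0.0, 2.0])

# Functional form: called directly on a tensor
out_functional = F.relu(x)

# Module form: a layer object, for example as one step in nn.Sequential
model = nn.Sequential(nn.Linear(3, 3), nn.ReLU())
out_module = nn.ReLU()(x)

print(torch.equal(out_functional, out_module))  # True
```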
37. The universal approximation theorem shows that any function can be approximated as closely as needed using just one nonlinearity. So why do we normally use more?
There are performance benefits for using more than two layers and one non-linear activation function. It turns out that smaller matrices with more layers perform better than large matrices with fewer layers. It also means the model will train faster, use fewer parameters, and take up less memory.