Deep Learning

Fastai Chapter 4: Questions & Answers

The answer key for the questionnaire at the end of the chapter

David Littlefield

Published in Geek Culture

14 min read · Jul 23, 2021

Chapter Summary:

Chapter 4 provides an overview of the training process. It covers loading datasets, making predictions, measuring loss, calculating gradients, and updating weights and biases. It also covers some of the tensor operations, activation functions, loss functions, optimizer functions, and learning rate.

01. How is a grayscale image represented on a computer? How about a color image?

Grayscale Image: An image with one channel that’s represented as a 2-dimensional matrix. It contains pixel values that represent the intensity of light for each pixel in the image where zero is a black pixel, 255 is a white pixel, and all the values in between are the different shades of gray pixels.

Color Image: An image with three channels that are represented as a 3-dimensional matrix. It contains three 2-dimensional matrices which contain pixel values that represent the intensity of color for each pixel in the image where each matrix represents the different shades of red, green, and blue.

02. How are the files and folders in the MNIST_SAMPLE dataset structured? Why?

The MNIST_SAMPLE dataset is structured using file and folder names. It separates the training and validation sets into the train and valid subdirectories. It separates the images in the train and valid subdirectories into the 3 and 7 subdirectories. It also organizes the images in the 3 and 7 subdirectories using sequentially numbered file names.

It makes it possible to extract the labels from the file and folder names.

03. Explain how the “pixel similarity” approach to classifying digits works.

Pixel Similarity: An approach that measures the difference between the new image and the ideal image. It creates the ideal image by calculating the average pixel value for every pixel in the specified images. It also measures the distance between the pixel values in the new image and the ideal image.

The new image is classified as the digit whose ideal image is the shortest distance away.
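
The approach can be sketched with NumPy as follows. This is a minimal illustration rather than the chapter’s code: the image stacks are synthetic stand-ins (the fake “threes” are bright on the left half and the fake “sevens” on the right half), and the names threes, sevens, distance, and is_three are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for stacks of 28x28 training images (values near 0..1).
threes = rng.random((20, 28, 28)) * 0.1
threes[:, :, :14] += 0.8   # fake "threes" are bright on the left half
sevens = rng.random((20, 28, 28)) * 0.1
sevens[:, :, 14:] += 0.8   # fake "sevens" are bright on the right half

# The ideal image of each digit is the per-pixel mean over its stack.
ideal_three = threes.mean(axis=0)
ideal_seven = sevens.mean(axis=0)

def distance(image, ideal):
    # Mean absolute difference between the pixels of the two images.
    return np.abs(image - ideal).mean()

def is_three(image):
    # Pick whichever ideal image is closer to the new image.
    return distance(image, ideal_three) < distance(image, ideal_seven)

new_image = rng.random((28, 28)) * 0.1
new_image[:, :14] += 0.8   # built like a fake "three"
print(is_three(new_image))  # True
```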

04. What is list comprehension? Create one now that selects odd numbers from a list and doubles them.

List Comprehension: A syntax that creates a new list based on an existing list or other iterable. It creates the new list by applying an expression to each item in the existing list. It contains three parts, the expression, the for loop, and an optional if-condition, which are all written between square brackets.

[expression for item in list if-condition]
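
One possible answer to the second part of the question is shown below. The variable names numbers and doubled_odds are chosen for the example.

```python
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Keep only the odd numbers (if-condition) and double each one (expression).
doubled_odds = [number * 2 for number in numbers if number % 2 == 1]

print(doubled_odds)  # [2, 6, 10, 14, 18]
```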

05. What is a rank-3 tensor?

Tensor Rank: Refers to the number of dimensions in the tensor. It refers to n-dimensions where rank zero is a scalar with zero dimensions, rank one is a vector with one dimension, rank two is a matrix with two dimensions, and rank three is a cuboid with three dimensions. It can also be determined by the number of indices that are required to access a value within the tensor.

06. What is the difference between tensor rank and shape? How do you get the rank from the shape?

Tensor Shape: Describes the number of items in each axis in the tensor. It contains information about the number of matrices, number of rows in each matrix, and number of columns in each matrix. It also helps visualize the tensor which becomes useful for higher rank tensors that are more abstract.

The rank refers to the number of dimensions in the tensor.
The shape refers to the number of items in each dimension of the tensor.
The rank is equal to the length of the shape of the tensor.
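
The relationship can be checked with a short sketch. It uses a NumPy array for convenience, and the array images is a hypothetical stack of four 28×28 grayscale images; PyTorch tensors expose the same shape attribute.

```python
import numpy as np

# A hypothetical stack of 4 grayscale images, each 28 rows by 28 columns.
images = np.zeros((4, 28, 28))

shape = images.shape       # number of items along each axis
rank = len(images.shape)   # number of axes

print(shape)  # (4, 28, 28)
print(rank)   # 3
```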

07. What are RMSE and L1 norms?

Mean Absolute Error (MAE): A loss function that measures the distance between the predicted and label values. It calculates the mean of the absolute difference between the predicted and label values. It also gives equal emphasis to normal and outlier errors using the same linear scale.

It is less sensitive to outliers because every error is penalized on the same linear scale.
It is less suitable when large errors should be penalized more heavily than small ones.
It corresponds to the “L1 Norm” in mathematics, averaged over the number of items.

mae = (predicted_values - label_values).abs().mean()

Root Mean Squared Error (RMSE): A loss function that measures the distance between the predicted and label values. It calculates the square root of the mean of the squared difference between the predicted and label values. It also gives significantly more emphasis to outlier errors.

It is a good choice when large errors should be penalized more heavily than small ones.
It is more sensitive to outliers because every error is squared before it’s averaged.
It corresponds to the “L2 Norm” in mathematics, scaled by the square root of the number of items.

rmse = ((predicted_values - label_values)**2).mean().sqrt()
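
A small worked example makes the difference visible. It is a sketch with made-up values, written with NumPy instead of tensors, where three predictions are exact and one is off by 4.

```python
import numpy as np

label_values = np.array([1.0, 2.0, 3.0, 4.0])
predicted_values = np.array([1.0, 2.0, 3.0, 8.0])  # the last prediction is off by 4

errors = predicted_values - label_values

mae = np.abs(errors).mean()            # (0 + 0 + 0 + 4) / 4 = 1.0
rmse = np.sqrt((errors ** 2).mean())   # sqrt((0 + 0 + 0 + 16) / 4) = 2.0

print(mae, rmse)
```

The single outlier error counts twice as much in the RMSE as in the MAE because it is squared before the mean is taken.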

08. How can you apply a calculation on thousands of numbers at once, many thousands of times faster than a Python loop?

NumPy Array: A multi-dimensional table of data that supports fast numerical operations. It contains items that all have the same data type, which can be any type, including arrays of arrays. Its operations are implemented in optimized C code that runs on the CPU, which can be thousands of times faster than a pure Python loop.

PyTorch Tensor: A data structure that’s similar to the NumPy array but with an extra restriction that unlocks new capabilities. It can only contain items that are the same basic numeric data type. Its operations also run in optimized C code on the CPU, or on the GPU, which can be faster still for large computations.

By performing the calculations using NumPy arrays or PyTorch tensors.

09. Create a 3×3 tensor or array containing the numbers from 1 to 9. Double it. Select the bottom-right four numbers.
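
One possible answer, sketched with NumPy since the question allows either a tensor or an array. The variable names are chosen for the example.

```python
import numpy as np

# Create a 3x3 array containing the numbers from 1 to 9.
grid = np.arange(1, 10).reshape(3, 3)

# Double every item at once.
doubled = grid * 2

# Select the last two rows and the last two columns.
bottom_right = doubled[1:, 1:]

print(bottom_right)
# [[10 12]
#  [16 18]]
```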

10. What is broadcasting?

Broadcasting: A mechanism that lets element-wise operations run between tensors with different shapes. It compares the shapes starting from the last dimension and requires each pair of dimensions to be equal or one of them to be one. It automatically expands the tensor with the smaller rank to match the shape of the larger tensor, without actually copying the data.
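
The rule can be illustrated with a short NumPy sketch (PyTorch tensors follow the same rule). The arrays matrix, row, and column are made up for the example.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])       # shape (2, 3)
row = np.array([10, 20, 30])         # shape (3,)
column = np.array([[100], [200]])    # shape (2, 1)

# The row is treated as shape (1, 3) and repeated for each row of the matrix.
added_row = matrix + row
print(added_row)
# [[11 22 33]
#  [14 25 36]]

# The column has a dimension of size one that is stretched across the columns.
added_column = matrix + column
print(added_column)
# [[101 102 103]
#  [204 205 206]]

# Shapes (2, 3) and (2,) are incompatible: the last dimensions are 3 and 2.
try:
    matrix + np.array([1, 2])
except ValueError as error:
    print("Incompatible shapes:", error)
```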

11. Are metrics generally calculated using the training set or the validation set? Why?

Validation Set: A subset of the dataset that isn’t used to train the model. It can be used to produce an unbiased evaluation of the model because the model isn’t allowed to learn or memorize the data in the subset. It can also be used to detect underfitting and overfitting, which helps when tuning the model.

The metrics are usually calculated using the validation set.
It produces an unbiased evaluation of the performance of the model.

12. What is SGD?

Stochastic Gradient Descent (SGD): An optimization function that minimizes the loss value. It initializes the weights and biases using random values. It calculates the gradients of the loss function to identify whether increasing or decreasing the weights and biases will reduce the loss value. It multiplies the learning rate by the gradients to determine how much to increase or decrease the weights and biases. It also subtracts the result from the weights and biases to update the weights and biases.

Imagine being lost in the mountains with your car parked at the bottom of the mountain. It would be good to always take steps downhill which eventually leads to your car. It would also be good to know how big of steps to take and to keep taking steps until you’ve reached the bottom.

13. Why does SGD use mini-batches?

The optimization algorithm can calculate the gradients using one or more items at a time. It can use the entire dataset, but that takes too long and might not fit into memory. It can use a single item in the dataset, but that produces imprecise and unstable gradients. It can also use multiple items in a mini-batch, which is reasonably accurate and stable for larger mini-batches and lets the GPU process the items in parallel.

It produces results with the best speed, accuracy, and stability overall.

14. What are the seven steps in SGD for machine learning?

1. Initialize the weights and biases with random values.
2. Calculate the predictions.
3. Calculate the loss.
4. Calculate the gradients.
5. Update the weights.
6. Go to step two and repeat the process.
7. Stop when the model is good enough.
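
The seven steps above can be sketched in plain Python. This is an illustration under several assumptions: the toy data follows y = 2x + 1, the loss is mean squared error, the gradients are derived by hand, and the whole dataset is used at every step instead of a mini-batch to keep the example short.

```python
import random

random.seed(42)

# Hypothetical toy data generated from y = 2x + 1.
inputs = [0.0, 1.0, 2.0, 3.0, 4.0]
labels = [2 * x + 1 for x in inputs]
count = len(inputs)

# Step 1: initialize the weight and bias with random values.
weight = random.uniform(-1, 1)
bias = random.uniform(-1, 1)
learning_rate = 0.05
initial_loss = None

for step in range(2000):
    # Step 2: calculate the predictions.
    predictions = [weight * x + bias for x in inputs]

    # Step 3: calculate the loss (mean squared error).
    loss = sum((p - y) ** 2 for p, y in zip(predictions, labels)) / count
    if initial_loss is None:
        initial_loss = loss

    # Step 7: stop when the model is good enough.
    if loss < 1e-6:
        break

    # Step 4: calculate the gradients of the loss (derived by hand for this loss).
    grad_weight = sum(2 * (p - y) * x for p, y, x in zip(predictions, labels, inputs)) / count
    grad_bias = sum(2 * (p - y) for p, y in zip(predictions, labels)) / count

    # Step 5: update the weight and bias.
    weight -= learning_rate * grad_weight
    bias -= learning_rate * grad_bias

    # Step 6: the loop goes back to step two.

# The weight and bias should end up close to 2 and 1.
print(round(weight, 2), round(bias, 2))
```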

15. How do we initialize the weights in a model?

The weights and biases in the model are initialized using random values. This produces incorrect predictions and high loss values at first, but that’s fine because the optimization algorithm continuously updates the weights and biases, which minimizes the loss value and produces increasingly accurate predictions.

16. What is loss?

Loss: A value that measures how incorrect the predictions are. It calculates the distance between the predicted values and label values using one of the loss functions. It can be used to monitor for underfitting and overfitting. It can also be used to calculate the gradient to update the weights and biases.

17. Why can’t we always use a high learning rate?

Learning Rate: A hyperparameter that controls how much to adjust the weights and biases at each optimization step during the training process. It causes the training process to take too long or get stuck in a local minimum when the value is too low. It also causes the training process to overshoot the global minimum or bounce around forever when the value is too high.

It causes the training process to overshoot the global minimum.
It causes the training process to bounce around forever.

18. What is a gradient?

Gradient: A vector of partial derivatives of the loss function with respect to the weights and biases. It measures the change in the loss value with respect to the changes in the weights and biases. It also identifies whether increasing or decreasing the weights and biases will reduce the loss value.

19. Do you need to know how to calculate gradients yourself?

requires_grad_: A method that tells PyTorch to track gradients for the tensor. It tags the tensor so that all the operations applied to it are recorded, which lets PyTorch calculate the gradients during the backward pass. The gradients then accumulate in the grad attribute of the tensor.

No, it isn’t necessary to know how to calculate gradients manually.

import torch
variable_name = torch.tensor([1., 2., 3.]).requires_grad_()
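
A short sketch shows PyTorch doing the calculus. The function being differentiated (a sum of squares) and the names inputs and output are assumptions made for the example.

```python
import torch

# Tag the tensor so PyTorch records the operations applied to it.
inputs = torch.tensor([1., 2., 3.]).requires_grad_()

# A hypothetical function to differentiate: the sum of the squares.
output = (inputs ** 2).sum()

# The backward pass fills inputs.grad with d(output)/d(inputs) = 2 * inputs.
output.backward()

print(inputs.grad)  # tensor([2., 4., 6.])
```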

20. Why can’t we use accuracy as a loss function?

Classification Accuracy: A metric that measures how often the model makes correct predictions. It can be calculated by dividing the number of correct predictions by the total number of predictions. It also only works well when there’s approximately an equal number of items in each class.

Loss Function: A function that calculates the loss value by measuring the distance between the predicted and label values. It changes the loss value in response to the changing weights and biases which produces a slightly better loss value as the model makes slightly better predictions.

The accuracy metric isn’t sensitive enough to changes in the weights and biases to allow the model to learn effectively during the training process. It only changes when the weights and biases have changed enough to make different predictions. It also produces many gradients that equal zero, which prevents the weights and biases from updating.
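
The difference in sensitivity can be seen in a small sketch. Everything here is hypothetical: a one-weight model, four made-up data points, a 0.5 threshold for accuracy, and a loss that averages the distance between the sigmoid output and the label.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Hypothetical data: negative inputs belong to class 0, positive inputs to class 1.
inputs = [-2.0, -1.0, 1.0, 2.0]
labels = [0, 0, 1, 1]

def accuracy(weight):
    # Fraction of items where the thresholded prediction matches the label.
    correct = [(sigmoid(weight * x) > 0.5) == (y == 1) for x, y in zip(inputs, labels)]
    return sum(correct) / len(labels)

def loss(weight):
    # Mean distance between the predicted probability and the label.
    return sum(abs(sigmoid(weight * x) - y) for x, y in zip(inputs, labels)) / len(labels)

# A small nudge to the weight leaves the accuracy unchanged (zero gradient)...
print(accuracy(1.0), accuracy(1.1))

# ...but the loss gets slightly lower, which gives SGD a direction to follow.
print(loss(1.0), loss(1.1))
```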

21. Draw the sigmoid function. What is special about its shape?

Sigmoid: An activation function that converts the predicted values into probability values that range between 0.0 and 1.0. It maps large positive numbers to values close to 1.0 and large negative numbers to values close to 0.0. It is also well suited to binary and multi-label classification problems.

Sigmoid is an activation function that’s named after its shape, which resembles the letter “S” when plotted. It produces a smooth curve that always increases and gradually transitions from just above 0.0 to just below 1.0, which makes it easier for SGD to find meaningful gradients.
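
In place of a drawing, the curve can be traced by evaluating the function at a few points. This is a plain-Python sketch, and the sample points are arbitrary.

```python
import math

def sigmoid(x):
    # Squash any real number into the open interval (0, 1).
    return 1 / (1 + math.exp(-x))

# The outputs rise smoothly from near 0, through 0.5 at x = 0, to near 1.
for x in [-6, -2, 0, 2, 6]:
    print(f"{x:>3} -> {sigmoid(x):.4f}")
```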

22. What is the difference between a loss function and a metric?

Loss Function: A function that measures the performance of the model for the optimizer. It provides an interpretation of the performance of the model that’s easy for the computer to understand, which helps to minimize the loss value and monitor for things like overfitting, underfitting, and convergence.

Metric: A function that evaluates the performance of the model. It usually measures performance differently based on the type of model and dataset. It measures the accuracy, precision, or recall for classification models with balanced datasets. It also measures the area under the receiver operating characteristic curve (ROC AUC) for classification models with imbalanced datasets.

The metric provides an interpretation of the performance of the model that’s easy for humans to understand, which helps give meaning to the numbers in the context of the goals of the overall project and project stakeholders.

23. What is the function to calculate new weights using a learning rate?

Optimizer: A function that minimizes the loss function. It uses the gradients of the loss function to identify whether increasing or decreasing the weights and biases will reduce the loss value. It multiplies the learning rate by the gradients to determine how much to increase or decrease the weights and biases. It also subtracts the result from the weights and biases to update them, which produces a smaller loss value.

Different optimizer functions perform different variations of the task.

24. What does the DataLoader class do?

DataLoader: A class that separates the dataset into mini-batches. It can shuffle the data before the data is separated. It can load the mini-batches using sequential or parallel processing. It takes the dataset parameter to specify the data and the bs parameter, an integer, to specify the batch size. The training and validation DataLoader objects are also wrapped in a DataLoaders object, which is passed to the Learner object.

25. Write pseudocode showing the basic steps taken in each epoch for SGD.

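for x_batch, y_batch in train_dataloader:
    # Calculate the predictions and the loss for the mini-batch
    predictions = linear_model(x_batch)
    loss = mnist_loss(predictions, y_batch)
    # Calculate the gradients
    loss.backward()
    # Update the parameters, then reset the gradients
    for parameter in parameters:
        parameter.data -= parameter.grad.data * learning_rate
        parameter.grad = None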

26. Create a function that, if passed two arguments [1,2,3,4] and ‘abcd’, returns [(1, ‘a’), (2, ‘b’), (3, ‘c’), (4, ‘d’)]. What is special about that output data structure?

The output has the same data structure as the Dataset object that’s used in PyTorch. It contains a list of tuple objects that each store an input and label value. It pairs the input and label values at each index of the two sequences that are passed as the first and second arguments of the zip function.

It shows that a PyTorch Dataset is a simple and convenient data structure rather than dark magic.
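
One possible answer to the first part of the question is shown below. The function name pair_items is chosen for the example.

```python
def pair_items(inputs, labels):
    # zip pairs the items at the same index; list() collects the pairs.
    return list(zip(inputs, labels))

dataset = pair_items([1, 2, 3, 4], 'abcd')

print(dataset)  # [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
```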

27. What does view do in PyTorch?

View: A method that reshapes the tensor object without changing the contents. It can perform memory-efficient reshaping, slicing, and element-wise operations because it doesn’t create a copy of the data. It also shares the underlying data with the original tensor which means any changes that are made to the data in the view will be reflected in the original tensor.
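
A brief sketch shows both the reshaping and the shared data. The tensors original, reshaped, and flattened are made up for the example.

```python
import torch

original = torch.arange(6)       # tensor([0, 1, 2, 3, 4, 5])
reshaped = original.view(2, 3)   # the same six values arranged as 2 rows by 3 columns

# -1 asks PyTorch to infer that dimension: two 28x28 images become two rows of 784.
flattened = torch.zeros(2, 28, 28).view(-1, 28 * 28)
print(flattened.shape)  # torch.Size([2, 784])

# Writing through the view changes the original tensor because the data is shared.
reshaped[0, 0] = 100
print(original[0])  # tensor(100)
```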

28. What are the bias parameters in a neural network? Why do we need them?

Bias: A parameter that offsets the output value of the nodes in the model to better fit the data during the training process. It gets added to the sum of the product of the input and weight values before the result is passed to the activation function. It also shifts the activation function to the left or right which moves the entire curve to delay or accelerate the activation.

29. What does the @ operator do in Python?

Matrix Multiplication (@): An operator that applies matrix multiplication between two arrays. It performs the same operation as the matmul function from the NumPy library. It also makes matrix formulas much easier to read which makes it much easier to work with for both experts and non-experts.

np.matmul(np.matmul(np.matmul(A, B), C), D)
A @ B @ C @ D
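
A small sketch with two made-up 2×2 arrays confirms that the operator matches matmul and differs from element-wise multiplication.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])

# Matrix multiplication: rows of A combined with columns of B.
product = A @ B
print(product)
# [[2 1]
#  [4 3]]

# The * operator multiplies item by item instead.
elementwise = A * B
print(elementwise)
# [[0 2]
#  [3 0]]
```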

30. What does the backward method do?

Backward: A method that calculates the gradient of the loss value with respect to the weights and biases, which are later used to update the weights and biases and eventually reduce the loss value. It calculates the gradients by working backward through the history of operations that were tracked for the tensor. It also stores the gradients in the grad attribute of each tensor.

31. Why do we have to zero the gradients?

The gradients need to be reset after the backward method is called because they accumulate in the grad attribute every time the backward method is called, which would prevent the weights and biases from updating correctly.

for parameter in parameters:
    parameter.data -= parameter.grad.data * learning_rate
    parameter.grad = None
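
The accumulation is easy to observe in isolation. The sketch below uses a single made-up parameter named weight and a hypothetical loss of weight squared, whose gradient at 3 is 6.

```python
import torch

weight = torch.tensor(3.).requires_grad_()

# First backward pass: the gradient of weight ** 2 is 2 * 3 = 6.
(weight ** 2).backward()
first_gradient = weight.grad.item()

# Second backward pass without a reset: the new gradient is added to the old one.
(weight ** 2).backward()
accumulated_gradient = weight.grad.item()

# Reset the gradient, then run the backward pass again.
weight.grad = None
(weight ** 2).backward()
reset_gradient = weight.grad.item()

print(first_gradient, accumulated_gradient, reset_gradient)  # 6.0 12.0 6.0
```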

32. What information do we have to pass to Learner?

Learner: A class that stores everything that’s needed to train the model and perform transfer learning. It mostly performs the training loop, customizes the training loop, loads and saves the model, and prints the evaluation metrics. It also requires the following items to create the Learner object.

DataLoaders object.
Model class.
Optimizer function.
Loss function.

33. Show Python or pseudocode for the basic steps of a training loop.

for _ in range(number_of_epochs):
    for x_batch, y_batch in train_dataloader:
        predictions = linear_model(x_batch)
        loss = mnist_loss(predictions, y_batch)
        loss.backward()
        for parameter in parameters:
            parameter.data -= parameter.grad.data * learning_rate
            parameter.grad = None

34. What is ReLU? Draw a plot of it for values from -2 to +2.

Rectified Linear Unit (ReLU): An activation function that replaces all the negative input values with zero and leaves the positive input values unchanged. It helps reduce the vanishing gradient problem because its gradient is one for positive inputs. It has a drawback, though: when too many activations are zero, the weights and biases can’t update properly because the gradient for those activations is also zero.
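
In place of a plot, the function can be evaluated at a few points from -2 to +2. This is a plain-Python sketch, and the sample points are chosen for the example.

```python
def relu(x):
    # Negative values become zero; positive values pass through unchanged.
    return max(0.0, x)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [relu(x) for x in xs]

print(ys)  # [0.0, 0.0, 0.0, 1.0, 2.0]
```

The plot is flat at zero for inputs up to zero and then rises as a straight line with a slope of one.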

35. What is an activation function?

Activation Function: A function that decides which input values are most important for making predictions. It adds non-linearity to the architecture of the model which lets the model identify complex relationships between the input and output values that are essential for learning complex data.

36. What’s the difference between F.relu and nn.ReLU?

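These apply the same ReLU activation function in different ways.

F.relu is a function that’s called directly on a tensor, which is convenient inside the forward method when the model is built by defining a class.

nn.ReLU is a module (a class) that has to be instantiated first, which lets it be used as a layer when the model is built using the Sequential module.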
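
The two forms can be compared side by side. In this sketch, the tensor activations and the layer sizes in the small Sequential model are made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

activations = torch.tensor([-2., -1., 0., 1., 2.])

# F.relu is a plain function that is called directly on a tensor.
functional_output = F.relu(activations)

# nn.ReLU is a module: create an instance first, then call it like a function.
relu_layer = nn.ReLU()
module_output = relu_layer(activations)

print(functional_output)  # tensor([0., 0., 0., 1., 2.])
print(torch.equal(functional_output, module_output))  # True

# Because nn.ReLU is a module, it can sit between layers in nn.Sequential.
# The layer sizes here are hypothetical.
model = nn.Sequential(nn.Linear(5, 3), nn.ReLU(), nn.Linear(3, 1))
```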

37. The universal approximation theorem shows that any function can be approximated as closely as needed using just one nonlinearity. So why do we normally use more?

There are performance benefits for using more than two layers and one non-linear activation function. It turns out that smaller matrices with more layers perform better than large matrices with fewer layers. It also means the model will train faster, use fewer parameters, and take up less memory.
