Understanding TensorFlow: Part 3–2

Series 3–2: Neural network-related operations

dan lee
8 min read · Sep 9, 2021

Today’s outline:

  1. Defining loss
  2. Automatic differentiation and gradients
  3. Optimization of neural networks

Defining loss

We know that in order for a neural network to learn something useful, a loss needs to be defined. There are several functions for automatically calculating the loss in TensorFlow, two of which are shown in the following code. The tf.nn.l2_loss function computes half the squared L2 norm (that is, sum(t**2)/2), which is closely related to the mean squared error, and tf.nn.softmax_cross_entropy_with_logits is another type of loss, which generally gives better performance in classification tasks. By logits, we mean the unnormalized output of the neural network (that is, the linear output of the last layer of the neural network):

# Returns half of the L2 norm of t, given by sum(t**2)/2
x = tf.constant([[2,4],[6,8]], dtype=tf.float32)
x_hat = tf.constant([[1,2],[3,4]], dtype=tf.float32)
# MSE = (1**2 + 2**2 + 3**2 + 4**2)/2 = 15
MSE = tf.nn.l2_loss(x - x_hat)

# A common loss function used in neural networks to optimize the network.
# Calculating the cross entropy with logits (unnormalized outputs of the last layer)
# instead of the normalized outputs leads to better numerical stability.
y = tf.constant([[1,0],[0,1]], dtype=tf.float32)
y_hat = tf.constant([[3,1],[2,5]], dtype=tf.float32)
# This function alone doesn't average the cross entropy losses of all data points;
# you need to do that manually using the reduce_mean function.
CE = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=y_hat))

tf.keras has the corresponding operations tf.keras.losses.MSE and tf.keras.losses.categorical_crossentropy. Note that tf.nn.l2_loss computes sum(t**2)/2 directly, while tf.keras.losses.MSE returns a per-sample mean over the last axis, so you need to aggregate the result yourself, here with tf.reduce_sum (with two columns per row, this reproduces the sum(t**2)/2 value above). tf.keras.losses.categorical_crossentropy needs the parameter from_logits=True in order to accept the unnormalized output of the neural network:

# MSE with Keras
x = tf.constant([[2,4],[6,8]], dtype=tf.float32)
x_hat = tf.constant([[1,2],[3,4]], dtype=tf.float32)
MSE_keras = tf.reduce_sum(tf.keras.losses.MSE(x, x_hat))

# Cross entropy with Keras
y = tf.constant([[1,0],[0,1]], dtype=tf.float32)
y_hat = tf.constant([[3,1],[2,5]], dtype=tf.float32)
CE_keras = tf.reduce_mean(tf.keras.losses.categorical_crossentropy(y, y_hat, from_logits=True))

Automatic differentiation and gradients

tf.Variable

In most cases, you will want to calculate gradients with respect to a model's trainable variables. So before introducing how TensorFlow calculates gradients, we need to introduce tf.Variable.

Variables play an important role in TensorFlow. A variable is essentially a tensor with a specific shape defining how many dimensions the variable will have and the size of each dimension. However, unlike a regular tensor, variables are mutable, meaning that their values can change after they are defined. This is an ideal property for implementing the parameters of a learning model (for example, neural network weights), where the weights change slightly after each step of learning. For example, if you define a variable with x = tf.Variable(0, dtype=tf.int32), you can change its value using an operation such as x.assign(x + 1). However, if you define a tensor such as x = tf.constant(0, dtype=tf.int32), you cannot change its value the way you can for a variable; it stays 0 until the end of the program's execution.
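
To make this concrete, here is a minimal sketch of the difference:

import tensorflow as tf

v = tf.Variable(0, dtype=tf.int32)
v.assign(v + 1)        # OK: variables are mutable
print(v.numpy())       # 1

c = tf.constant(0, dtype=tf.int32)
# c.assign(c + 1)      # would fail: ordinary tensors have no assign() method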

Variable creation is quite simple. When creating a variable, a few things are of high importance. We list them here and discuss each in detail in the following paragraphs:

  • Variable shape
  • Data type
  • Initial value
  • Name (optional)

The variable shape is a 1D vector of the [x,y,z,…] format. Each value in the list indicates how large the corresponding dimension or axis is. For instance, if you require a 2D tensor with 50 rows and 10 columns as the variable, the shape would be equal to [50,10].
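
For example, a quick sketch of creating a variable with that shape (initialized here with zeros; initializers are discussed below):

w = tf.Variable(tf.zeros([50, 10]))
print(w.shape)   # (50, 10)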

The data type plays an important role in determining the size of a variable. There are many different data types, including the commonly used tf.bool, tf.uint8, tf.float32, and tf.int32. Each data type requires a certain number of bits to represent a single value. For example, tf.uint8 requires 8 bits, whereas tf.float32 requires 32 bits. It is common practice to use the same data type throughout a computation, as mixing types leads to data type mismatches. So if you have two tensors of different data types that you need to combine, you have to explicitly convert one tensor to the other tensor's type using the tf.cast(…) operation.

The tf.cast(…) operation is designed to cope with such situations. For example, if you have an x variable with the tf.int32 type, which needs to be converted to tf.float32, employ tf.cast(x,dtype=tf.float32) to convert x to tf.float32.
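
A minimal sketch of such a conversion:

x = tf.constant([1, 2, 3], dtype=tf.int32)
x_float = tf.cast(x, dtype=tf.float32)
print(x_float.dtype)   # <dtype: 'float32'>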

Next, a variable requires an initial value to be initialized with. TensorFlow provides several different initializers for our convenience, including constant and normal-distribution initializers. Here are a few popular TensorFlow initializers you can use to initialize variables (a short sketch of using them follows the list):

  • tf.zeros
  • tf.constant
  • tf.random.uniform
  • tf.random.truncated_normal
  • tf.random.normal
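
As a sketch, any of these can produce the initial value passed to tf.Variable (the shapes and distribution parameters below are arbitrary examples):

v_zeros = tf.Variable(tf.zeros([3, 2]))
v_const = tf.Variable(tf.constant(5.0, shape=[3, 2]))
v_uniform = tf.Variable(tf.random.uniform([3, 2], minval=0, maxval=1))
v_truncated = tf.Variable(tf.random.truncated_normal([3, 2], stddev=0.1))
v_normal = tf.Variable(tf.random.normal([3, 2], mean=0.0, stddev=1.0))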

Finally, the name of the variable is used as an ID to identify it in the computational graph. So if you ever visualize the graph, the variable will appear under the name passed to the name keyword argument. If you do not specify a name, TensorFlow uses its default naming scheme.
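
A quick sketch of naming a variable (the name 'weights' is an arbitrary example):

v = tf.Variable(tf.zeros([3, 2]), name='weights')
print(v.name)   # weights:0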

We mentioned at the very beginning of this series that the fundamental data type in TensorFlow is the Tensor, so a tf.Variable is a tensor too. But it is a little special compared with ordinary tensors: a tf.Variable object stores a mutable tf.Tensor. More precisely, variables are backed by tensors; they have a dtype and a shape, and they can be exported to NumPy just like other tensors. The difference is that variables cannot be reshaped in place, while ordinary tensors cannot be reassigned; only a tf.Variable supports assign.

# Create a variable
a = tf.Variable([2.0, 3.0])
# Reassign its value
a.assign([1, 2])

The ‘trainable’ parameter of tf.Variable is worth noting: if a tf.Variable is created with trainable=False, gradient computation with respect to it is ignored by default. The default value of ‘trainable’ is True.
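
A quick sketch (a step counter is a typical use case for a non-trainable variable):

step = tf.Variable(0, trainable=False)   # e.g. a counter that should not be trained
w = tf.Variable(1.0)                     # trainable=True by default
print(step.trainable, w.trainable)       # False True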

One more thing: adding a tf.Variable and a tf.Tensor yields a tf.Tensor.

a = tf.Variable(1.0)
b = tf.constant(2.0)
c = a + b
print(c)

# Returns (out) =>
tf.Tensor(3.0, shape=(), dtype=float32)

tf.GradientTape

TensorFlow 2.x provides the tf.GradientTape API for automatic differentiation. It ‘records’ the operations that happen during the forward pass onto a ‘tape’, then replays that tape in reverse order to compute the gradients during the backward pass.

Let's look at an example that calculates gradients with respect to a model's trainable variables.

# Trainable variables in a model are watched by default
layer = tf.keras.layers.Dense(2, activation='relu')
x = tf.constant([[1., 2., 3.]])

with tf.GradientTape() as tape:
    # Forward pass
    y = layer(x)
    loss = tf.reduce_mean(y**2)

# Calculate gradients with respect to every trainable variable
grad = tape.gradient(loss, layer.trainable_variables)

# Returns (out) => (the exact values depend on the random weight initialization)
[<tf.Tensor: shape=(3, 2), dtype=float32, numpy=array([[0. , 1.0097853], [0. , 2.0195706], [0. , 3.029356 ]], dtype=float32)>,
 <tf.Tensor: shape=(2,), dtype=float32, numpy=array([0. , 1.0097853], dtype=float32)>]

tf.GradientTape automatically records (‘watches’) operations that touch a trainable tf.Variable. This default behavior makes it convenient to calculate the gradient of a loss with respect to all of a model's trainable variables. A tf.Tensor, however, is not watched by default, so TensorFlow provides the GradientTape.watch() method to give you control over whose gradient gets calculated.

# Use GradientTape.watch() to take control
x0 = tf.Variable(0.0, trainable=False)
x1 = tf.Variable(10.0)
x2 = tf.constant(4.0)

with tf.GradientTape() as tape:
    tape.watch(x2)
    y0 = tf.math.sin(x0)
    y1 = tf.nn.softplus(x1)
    y2 = tf.pow(x2, 2.0)
    y = y0 + y1 + y2
    y_sum = tf.reduce_sum(y)

grad = tape.gradient(y_sum, {'x0': x0, 'x1': x1, 'x2': x2})
print('dy/dx0:', grad['x0'])
print('dy/dx1:', grad['x1'].numpy())
print('dy/dx2:', grad['x2'])

# Returns (out) =>
dy/dx0: None
dy/dx1: 0.9999546
dy/dx2: tf.Tensor(8.0, shape=(), dtype=float32)

As we can see from the example above, x0 is not trainable and not watched, so the GradientTape ignores it and its gradient comes back as None. x1 is a trainable tf.Variable; it is watched by the GradientTape automatically and gets a gradient. x2 is a constant; it has to be watched explicitly by the GradientTape to get a gradient.
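
One related detail: by default, the resources held by a tape are released as soon as tape.gradient is called once. If you need several gradient calls from a single recording, you can pass persistent=True, as in this minimal sketch:

x = tf.Variable(3.0)
with tf.GradientTape(persistent=True) as tape:
    y = x * x
    z = y * y
print(tape.gradient(y, x).numpy())   # dy/dx = 2x = 6.0
print(tape.gradient(z, x).numpy())   # dz/dx = 4x**3 = 108.0
del tape   # drop the reference to release the tape's resources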

Optimization of neural networks

After defining the loss of a neural network, our objective is to minimize that loss over time. Optimization is the procedure used for this. In other words, the objective of the optimizer is to find the neural network parameters (that is, weights and bias values) that give the minimum loss for all the inputs. Again, our beloved TensorFlow provides us with several different optimizers, so we don’t have to worry about implementing them from scratch.

Figure 2.9 illustrates a simple optimization problem and shows how the optimization happens over time. The curve can be imagined as the loss curve (for high dimensions, we say loss surface), where x can be thought of as the parameters of the neural network (in this case a neural network with a single weight), and y can be thought of as the loss. We have an initial guess of x=2. From this point, we use the optimizer to reach the minimum y (that is, loss), which is obtained at x=0. More specifically, we take small steps in the direction opposite to the gradient at a given point and continue for several steps in this manner. However, in real-world problems, the loss surface will not be as nice as in the illustration; it will be far more complex:

Figure 2.9: The optimization process
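
To make the figure concrete, here is a minimal hand-rolled sketch of the same procedure on the loss y = x**2 (whose gradient is 2x), starting from the initial guess x = 2:

x = 2.0
learning_rate = 0.1
for step in range(5):
    grad = 2 * x                 # dy/dx for y = x**2
    x -= learning_rate * grad    # small step opposite to the gradient
    print(step, x)               # x: 1.6, 1.28, 1.024, 0.8192, 0.65536 -> toward 0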

In this example, we use tf.keras.optimizers.SGD. The learning_rate parameter denotes the step size you take in the direction of minimization (distance between two red dots):

# Optimizers play the role of tuning neural network parameters so that
# their task error is minimal.
# For example, the task error can be the MSE for a regression task
# or the cross entropy for a classification task.
opt = tf.keras.optimizers.SGD(learning_rate=0.1)
var = tf.Variable(1.0)
loss = lambda: (var ** 2)/2.0   # d(loss)/d(var) = var
# The first step is `var -= learning_rate * grad`, i.e. 1.0 - 0.1 * 1.0
opt.minimize(loss, [var])
print(var.numpy())

# Returns (out) => 0.9

Every time you execute the minimize operation, var moves closer to the value that gives the minimum of the loss.
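
For example, continuing the snippet above, a sketch of repeating the step in a loop:

for _ in range(50):
    opt.minimize(loss, [var])
print(var.numpy())   # roughly 0.005 after 51 steps in total; var approaches the minimum at 0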

If you want to calculate the gradients yourself with GradientTape, you can use optimizer.apply_gradients to update the variables. Here is an example:

var = tf.Variable(1.0)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

with tf.GradientTape() as tape:
    # Compute the loss inside the tape so the operation is recorded
    loss = (var ** 2)/2.0

# Compute the gradient of the loss with respect to var and apply it
grads = tape.gradient(loss, [var])
optimizer.apply_gradients(zip(grads, [var]))
print(var.numpy())

# Returns (out) => 0.9

Other Series of Understanding TensorFlow:

Series 1: https://medium.com/@Adline125/understanding-tensorflow-series-979e71cc5562

Series 2: https://medium.com/@Adline125/understanding-tensorflow-fcc431891d08

Series 3–1: https://medium.com/@Adline125/understanding-tensorflow-ce18f0e1bbbc

Series 3–2: https://medium.com/@Adline125/understanding-tensorflow-2c6496b71368

Series 4: https://medium.com/@Adline125/understanding-tensorflow-94bdea8e1fd9

dan lee

NLP Engineer, Google Developer Expert, AI Specialist at Yodo1