How do neural networks work (Part 2 — building a model)

Jeremy
Jan 14, 2023

Intuition

The concept of neural networks is intuitive enough. If we could replicate the human brain, could we create something as smart as ourselves?

Unfortunately there are still many aspects of the brain that are a mystery. One model of the human brain is a graph where each node is a neuron that acts to strengthen or weaken input signals. Pathways will fire and become prominent based on repeated stimuli. The translation of this mental model into code is the origin of neural networks (see part 1).

By Izaak Neutelings

Understanding backpropagation

Initially, the network’s parameters start in a randomized state, as the network has not yet seen any input. This is loosely analogous to the human brain, where some people’s starting point may be better suited than others’ for different tasks.

Backpropagation is the process used by the model to ‘learn’, going from the initial randomized state to a state where inputs are correctly mapped to desired outputs. It does this by updating weights and biases layer by layer, starting from the last layer.

Recall each layer of a neural network can be formulated as a function:
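Written out to match the forward function implemented later in this post, a fully connected layer with input vector x, weight matrix W and bias vector b computes the following (the index notation is my own assumption):

```latex
y = W x + b, \qquad y_i = \sum_{j} W_{ij}\, x_j + b_i
```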

For a given layer, how should the weights (W) and biases (B) be updated? The intuition is that we want to vary W and B so as to reduce the error function as much as possible. One way to do this is to iteratively move in the direction of steepest descent, which eventually brings us to a (local) minimum. But which direction is the direction of steepest descent?

Recall from multivariable calculus that the gradient vector points in the direction of steepest ascent. The gradient is the vector of partial derivatives of a function with respect to each of its variables. So to find the direction of steepest descent, all we need to do is compute the gradient of the error function with respect to W and B respectively, and update the parameters in the opposite direction.
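To make the idea concrete, here is a minimal sketch of steepest descent on a toy function rather than a network. The function f(w, b) = (w - 3)^2 + (b + 1)^2, the helper names and the step size are all assumptions chosen for illustration; the minimum is known to be at w = 3, b = -1.

```python
def gradient(w, b):
    # Partial derivatives of the toy function f(w, b) = (w - 3)^2 + (b + 1)^2
    return 2 * (w - 3), 2 * (b + 1)


def descend(w, b, learning_rate=0.1, steps=100):
    # Repeatedly step against the gradient, i.e. in the direction of steepest descent
    for _ in range(steps):
        dw, db = gradient(w, b)
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b


w_min, b_min = descend(0.0, 0.0)
print(w_min, b_min)  # both values end up very close to 3 and -1
```

Each step shrinks the distance to the minimum by a constant factor here, which is why a fixed number of steps is enough for this toy problem. Real error surfaces are rarely this well behaved.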

Therefore, to compute the direction of steepest descent in terms of each parameter, we have the following set of partial derivative equations:
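Written per component so that they line up with the backward function implemented later (where de_dw[i][j] = de_dy[i] * input[j] and de_db = de_dy), and assuming y_i = Σ_j W_ij x_j + b_i, the equations take this form:

```latex
\frac{\partial e}{\partial W_{ij}}
  = \frac{\partial e}{\partial y_i}\,\frac{\partial y_i}{\partial W_{ij}}
  = \frac{\partial e}{\partial y_i}\, x_j
\qquad
\frac{\partial e}{\partial b_i}
  = \frac{\partial e}{\partial y_i}\,\frac{\partial y_i}{\partial b_i}
  = \frac{\partial e}{\partial y_i}
```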

For now let’s ignore how we find the partial derivative of the error function with respect to y — I’ll cover this later.

The application of the chain rule is valid here as the error function varies based on y (the output of the current layer). In turn, y varies based on the terms: x, w, b.
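One way to convince yourself the chain rule gives the right answer is to compare it against a numerical estimate. The sketch below assumes a single 2-by-2 layer and a squared-error function (covered later in this post); the helper names forward and error and all of the numbers are hypothetical, chosen only for illustration.

```python
def forward(W, b, x):
    # y_i = sum_j W[i][j] * x[j] + b[i]
    return [sum(W[i][j] * x[j] for j in range(len(x))) + b[i] for i in range(len(b))]


def error(y, target):
    # Mean squared error over the outputs of the layer
    return sum((t - o) ** 2 for t, o in zip(target, y)) / len(y)


W = [[0.2, -0.5], [0.7, 0.1]]
b = [0.3, -0.4]
x = [1.5, -2.0]
target = [1.0, 0.0]

# Analytic gradient via the chain rule: de/dW[i][j] = de/dy[i] * x[j]
y = forward(W, b, x)
de_dy = [-2 / len(y) * (t - o) for t, o in zip(target, y)]
analytic = [[de_dy[i] * x[j] for j in range(2)] for i in range(2)]

# Numerical gradient via central differences, nudging one weight at a time
eps = 1e-6
numeric = [[0.0, 0.0], [0.0, 0.0]]
for i in range(2):
    for j in range(2):
        W[i][j] += eps
        e_plus = error(forward(W, b, x), target)
        W[i][j] -= 2 * eps
        e_minus = error(forward(W, b, x), target)
        W[i][j] += eps  # restore the original weight
        numeric[i][j] = (e_plus - e_minus) / (2 * eps)

print(analytic)
print(numeric)
```

The two printed matrices should agree to several decimal places, which is the usual sanity check for a hand-written backward pass.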

As the gradient points in the direction of steepest ascent, we negate it to move in the direction of steepest descent, scaled by a learning rate. Thus, we have our weight and bias update equations:
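In symbols, with α standing for the learning rate (the symbol is my assumption; the code later calls it learning_rate), the updates match what the backward function does:

```latex
W \leftarrow W - \alpha\, \frac{\partial e}{\partial W},
\qquad
b \leftarrow b - \alpha\, \frac{\partial e}{\partial b}
```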

While we’ve updated W and B, is there a way to update the input x of the current layer? Not directly, but we could indirectly affect x by updating the weights and biases of the previous layer. To do this, we want to compute the partial derivative of e with respect to x in the current layer.
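Since every output y_i of the layer depends on x_j, the chain rule sums over all of them. Assuming y_i = Σ_j W_ij x_j + b_i, this matches the de_dx computed in the backward function later in this post:

```latex
\frac{\partial e}{\partial x_j}
  = \sum_{i} \frac{\partial e}{\partial y_i}\,\frac{\partial y_i}{\partial x_j}
  = \sum_{i} \frac{\partial e}{\partial y_i}\, W_{ij}
```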

Note that the input x of the current layer is the output y of the previous layer. Therefore de/dx of the current layer is exactly the de/dy that the previous layer needs to update its own weights and biases, so every layer except the last receives this vector from the layer after it.

The final step is to find de/dy for the last layer, which kicks off the whole process. While there are many different error functions that could be used, suppose we use the mean squared error (MSE) function. Then we have:
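Here y* denotes the label (a symbol I am assuming) and n is the number of terms being averaged; the training code later in this post averages over the examples in a batch. With that notation, the error and its derivative are:

```latex
e = \frac{1}{n} \sum_{i=1}^{n} \left(y^{*}_i - y_i\right)^2,
\qquad
\frac{\partial e}{\partial y_i} = -\frac{2}{n} \left(y^{*}_i - y_i\right)
```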

Commonly, MSE is used for regression problems and binary cross-entropy is used for classification problems. Although the dummy dataset I’ll be using is a classification problem, I am sticking with MSE as the error function to keep things simple.

Now we have everything we need. Let’s go into implementing this in code.

Implementing a neural network

One of the best ways to learn how neural networks work is to build one ourselves. These days, practically no one builds a network from scratch, opting instead to use TensorFlow or PyTorch to take care of all the required boilerplate. While these packages are useful, it is still important to understand how the underlying models actually work.

I have chosen Python as the level of abstraction for this exercise. While we could go lower level to C or bit manipulation, I think that would cloud the key ideas behind a neural network.

Let’s suppose we want to train a neural network to solve the XOR problem. As we already know what each input value should map to, we can construct this dataset in the following way:

import random

NUM_EXAMPLES = 1000

data = []
for i in range(NUM_EXAMPLES):
    val_1 = random.random()
    val_2 = random.random()

    # XOR on thresholded inputs: the label is 1 when exactly one value is >= 0.5
    if (val_1 < 0.5 and val_2 >= 0.5) or (val_1 >= 0.5 and val_2 < 0.5):
        label = 1.0
    else:
        label = 0.0

    data.append(([val_1, val_2], [label]))

Now let’s implement the network. First, we define a class for a fully connected layer:

import random
from typing import List


class FCLayer:
    def __init__(self, input_dim, output_dim):
        self.input_dim = input_dim
        self.output_dim = output_dim

        # Instantiate initial weights as an output_dim by input_dim matrix
        # This instantiates a randomized weight for every edge directed to
        # every node in the layer
        self.weights = [
            [
                random.random() for j in range(input_dim)
            ] for i in range(output_dim)
        ]
        # This instantiates a bias value for every node in the layer
        self.bias = [
            random.random() for i in range(output_dim)
        ]

    def forward(self, input: List[float]):
        assert len(input) == self.input_dim

        # Save the input passed to the forward function for backpropagation
        self.input = input

        # y[i] = sum_j weights[i][j] * input[j] + bias[i]
        y = [
            sum(
                [
                    input[j] * self.weights[i][j] for j in range(self.input_dim)
                ]
            ) + self.bias[i] for i in range(self.output_dim)
        ]

        return y

    def backward(self, de_dy: List[float], learning_rate: float):
        assert len(de_dy) == self.output_dim

        # Gradient with respect to the input, computed before the weights
        # are updated, and passed back to the previous layer
        de_dx = [
            sum(
                [
                    de_dy[j] * self.weights[j][i] for j in range(self.output_dim)
                ]
            ) for i in range(self.input_dim)
        ]
        # Gradient with respect to each weight: de_dy[i] * input[j]
        de_dw = [
            [
                de_dy[i] * self.input[j] for j in range(self.input_dim)
            ] for i in range(self.output_dim)
        ]
        # Gradient with respect to each bias is de_dy itself
        de_db = de_dy

        # Step against the gradient, scaled by the learning rate
        for i in range(self.output_dim):
            for j in range(self.input_dim):
                self.weights[i][j] -= de_dw[i][j] * learning_rate

        for i in range(self.output_dim):
            self.bias[i] -= de_db[i] * learning_rate

        return de_dx

Note the for loops used in the forward and backward functions are just implementations of matrix operations. Therefore using libraries such as NumPy can simplify the code and lead to better performance.
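As a rough sketch of what that could look like, here is a NumPy counterpart of the layer. The class name NumpyFCLayer and the seed parameter are hypothetical additions of mine, not part of the implementation above; the maths is meant to be the same.

```python
import numpy as np


class NumpyFCLayer:
    """Hypothetical NumPy counterpart of FCLayer (illustrative only)."""

    def __init__(self, input_dim, output_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Uniform values in [0, 1), mirroring random.random() in FCLayer
        self.weights = rng.random((output_dim, input_dim))
        self.bias = rng.random(output_dim)

    def forward(self, x):
        # Cache the input for the backward pass, then compute y = W x + b
        self.input = np.asarray(x, dtype=float)
        return self.weights @ self.input + self.bias

    def backward(self, de_dy, learning_rate):
        de_dy = np.asarray(de_dy, dtype=float)
        # de/dx = W^T de/dy, computed before the weights are updated
        de_dx = self.weights.T @ de_dy
        # de/dW is the outer product of de/dy and the cached input
        self.weights -= learning_rate * np.outer(de_dy, self.input)
        self.bias -= learning_rate * de_dy
        return de_dx
```

The nested list comprehensions become a matrix-vector product, a transposed matrix-vector product and an outer product, which is where both the brevity and the speed-up come from.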

With the current layer definition, a sequence of layers will ultimately only produce a linear combination of the original input (plus a constant). Consider a 2-layer network with 2 nodes in the first layer and 1 node in the second:

In the first layer we end up with the following values:
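Using my own notation (an assumption), with h_1 and h_2 for the outputs of the two first-layer nodes, w_ij for the first-layer weights and b_i for the first-layer biases, the values are:

```latex
\begin{aligned}
h_1 &= w_{11} x_1 + w_{12} x_2 + b_1 \\
h_2 &= w_{21} x_1 + w_{22} x_2 + b_2
\end{aligned}
```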

In the second (final) layer we are still left with a linear combination:
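To spell this out, write the first-layer outputs as h_i = w_i1 x_1 + w_i2 x_2 + b_i, and let v_1, v_2 and c be the weights and bias of the final node (all of these symbols are my own assumption). Expanding and collecting terms gives:

```latex
\begin{aligned}
y &= v_1 h_1 + v_2 h_2 + c \\
  &= (v_1 w_{11} + v_2 w_{21})\, x_1 + (v_1 w_{12} + v_2 w_{22})\, x_2 + (v_1 b_1 + v_2 b_2 + c)
\end{aligned}
```

The result has exactly the form of a single layer applied to x_1 and x_2, so the extra layer adds no expressive power.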

Therefore we need to add a non-linear activation function between layers to allow the model to learn non-linear mappings. This limitation is why the perceptron network (see part 1) could not solve the XOR problem. The ReLU function is a common choice, in part because it helps mitigate the vanishing gradient problem in deep neural networks.

We can see this solves the linear combination problem as the final node in our dummy network above would then be:
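With ReLU applied to each first-layer node, and using assumed symbols w_ij and b_i for the first-layer weights and biases and v_1, v_2, c for the final node, the output would be:

```latex
y = v_1\, \mathrm{ReLU}(w_{11} x_1 + w_{12} x_2 + b_1)
  + v_2\, \mathrm{ReLU}(w_{21} x_1 + w_{22} x_2 + b_2) + c
```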

As ReLU is non-linear, it is not possible to rewrite this as a linear combination of the input x1, x2 (you would need cases). The caveat to this would be if the values passed to ReLU are always non-negative for all inputs. In that case ReLU would make no difference as ReLU is simply the identity function for non-negative inputs. In practice, ReLU works well.
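A tiny sketch makes the difference tangible. The weights below are hand-picked by me for illustration (they are not learned, and the biases are zero), and the function names are hypothetical. Without an activation the two hidden nodes cancel out, while with ReLU the same weights compute |x1 - x2|, which reproduces XOR on the four corner inputs.

```python
def relu(v):
    return max(0.0, v)


def identity(v):
    return v


def two_layer(x1, x2, activation):
    # First layer: two nodes with hand-picked weights and zero biases
    h1 = activation(1.0 * x1 - 1.0 * x2)
    h2 = activation(-1.0 * x1 + 1.0 * x2)
    # Second layer: a single node that adds the two hidden values
    return 1.0 * h1 + 1.0 * h2


for x1, x2 in [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]:
    print(x1, x2, two_layer(x1, x2, identity), two_layer(x1, x2, relu))
```

With the identity activation every row prints 0.0 for the network output, whereas the ReLU version outputs 1.0 exactly when the two inputs differ.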

Needing cases hints at what ReLU is doing under the hood. While it can be said to be non-linear, more precisely, it is piecewise linear. It simulates a curve via a series of smaller straight lines.

Recall ReLU and its derivative are defined as:
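In symbols, and following the convention used in the code below of treating the derivative at exactly 0 as 0 (strictly speaking it is undefined there):

```latex
\mathrm{ReLU}(x) = \max(0, x),
\qquad
\mathrm{ReLU}'(x) =
\begin{cases}
1 & \text{if } x > 0 \\
0 & \text{if } x \le 0
\end{cases}
```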

Therefore the value we pass backwards from this activation layer is de/dy where the input was positive and 0 elsewhere:
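Since the activation is applied element by element, the chain rule needs no summation here. With x_i as the input to the activation and y_i = ReLU(x_i) as its output:

```latex
\frac{\partial e}{\partial x_i}
  = \frac{\partial e}{\partial y_i}\, \mathrm{ReLU}'(x_i)
  =
\begin{cases}
\dfrac{\partial e}{\partial y_i} & \text{if } x_i > 0 \\
0 & \text{if } x_i \le 0
\end{cases}
```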

class ReLUActivation:
    def forward(self, input: List[float]):
        # Save the input so backward knows which entries were positive
        self.input = input

        y = [
            0.0 if input[i] <= 0 else input[i] for i in range(len(input))
        ]
        return y

    def backward(self, de_dy: List[float], learning_rate: float):
        # Here we disregard the learning rate as this is an activation function
        de_dx = [
            0.0 if self.input[i] <= 0 else de_dy[i] for i in range(len(de_dy))
        ]
        return de_dx

Now we define a class for the model which contains a train and predict function:

import math


class Model:
    def __init__(self, layers, num_epochs, batch_size, learning_rate):
        self.layers = layers
        self.num_epochs = num_epochs
        self.batch_size = batch_size
        self.learning_rate = learning_rate

    def predict(self, input):
        output = input

        for layer in self.layers:
            output = layer.forward(output)

        return output

    def train(self, train_set):
        for epoch in range(self.num_epochs):
            cur_pos = 0
            epoch_error_sum = [0 for i in range(len(train_set[0][1]))]

            while cur_pos < len(train_set):
                # Slicing also copes with a final batch smaller than batch_size
                batch = train_set[cur_pos:cur_pos + self.batch_size]
                error_sum = [0 for i in range(len(batch[0][1]))]

                for output, label in batch:
                    # Forward pass: each layer caches its own input
                    for layer in self.layers:
                        output = layer.forward(output)

                    assert len(output) == len(label)

                    # MSE contribution of this example and its gradient
                    de_dy = [0 for k in range(len(label))]
                    for k in range(len(label)):
                        error_sum[k] += 1/self.batch_size * math.pow(label[k] - output[k], 2)
                        de_dy[k] = -2/self.batch_size * (label[k] - output[k])

                    # Backpropagate once per example: each layer only caches
                    # the input of its most recent forward pass, so a gradient
                    # summed over the batch would be paired with the wrong input
                    for layer in reversed(self.layers):
                        de_dy = layer.backward(de_dy, self.learning_rate)

                for i in range(len(epoch_error_sum)):
                    epoch_error_sum[i] += error_sum[i]

                cur_pos += self.batch_size

            print(f"Epoch {epoch}, error {epoch_error_sum}")

We then construct and train the model:

NUM_EPOCHS = 300
BATCH_SIZE = 5
LEARNING_RATE = 0.01

model = Model(
    [
        FCLayer(2, 64),
        ReLUActivation(),
        FCLayer(64, 1),
    ],
    NUM_EPOCHS,
    BATCH_SIZE,
    LEARNING_RATE
)

model.train(data)

Finally, we can evaluate the model:

import pandas as pd
import plotly.express as px

rows = []
error_sum = [0 for i in range(len(data[0][1]))]

for input, label in data:
    prediction = model.predict(input)

    # Collect rows in a list; DataFrame.append was removed in pandas 2.0
    rows.append({
        'x1': input[0],
        'x2': input[1],
        'prediction': prediction[0],
        'label': label[0]
    })

    for i in range(len(error_sum)):
        error_sum[i] += abs(label[i] - prediction[i])

predictions_df = pd.DataFrame(rows, columns=['x1', 'x2', 'prediction', 'label'])

# Mean absolute error over the dataset
print([total / len(data) for total in error_sum])

fig = px.scatter(predictions_df, x="x1", y="x2", color='prediction')
fig.show()

We can see from the scatterplot below that the model is predicting high values for points in the top left and bottom right quadrants, and low values for points in the top right and bottom left quadrants. This is precisely the solution we’re looking for.
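If you would like a single number alongside the plot, one option is to threshold the predictions and count how many match the labels. This is a sketch of my own rather than part of the implementation above: the function name classification_accuracy and the 0.5 threshold are assumptions, and the sample values are made up purely to show the call.

```python
def classification_accuracy(predictions, labels, threshold=0.5):
    # Treat a prediction at or above the threshold as class 1.0, otherwise 0.0
    correct = sum(
        1 for prediction, label in zip(predictions, labels)
        if (1.0 if prediction >= threshold else 0.0) == label
    )
    return correct / len(labels)


# Made-up values for illustration: three of the four predictions match
print(classification_accuracy([0.9, 0.2, 0.6, 0.4], [1.0, 0.0, 0.0, 0.0]))  # 0.75
```

With the model above, the inputs would be the 'prediction' and 'label' columns of predictions_df.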

You can find the full implementation of this network here.

What’s next

Now that we’ve seen how to build a neural network, we can see it’s really not that complicated. What’s surprising is how effective these networks are at mimicking and even outperforming humans on real-world tasks.

In part 3, let’s investigate the effectiveness of these models.
