Implement Transformer via PyTorch Step-by-Step, Part 2

Ming H.

Following on from part 1, where we defined the self-attention function along with the multi-head mechanism, we will continue building the blocks for the encoder and decoder.

As you can see, deep learning is like stacking basic building blocks to form a complex architecture. The encoder part is fairly easy. :) To make the code more straightforward to follow, I chose not to clone the add & norm sublayers, at the slight cost of somewhat longer code.

import torch.nn as nn

class EncoderLayer(nn.Module):

    def __init__(self, d_model, hidden, n_head):
        super(EncoderLayer, self).__init__()

        # one LayerNorm per add & norm sublayer
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        # multi-head attention we defined in part 1
        self.attention_layer = MultiHeadAttention(d_model, n_head)

        # feed-forward layer which we will define below
        self.feed_forward_layer = FeedForwardLayer(d_model, hidden)

    def forward(self, x):
        # keep a copy of the input for the residual connection
        _x = x

        # sublayer 1: multi-head attention, then add & norm
        atten = self.attention_layer(x)
        _atten = _x + self.norm1(atten)

        # sublayer 2: feed-forward, then add & norm
        x = self.feed_forward_layer(_atten)

        return self.norm2(x) + _atten

Let’s define the feed-forward layer class:

class FeedForwardLayer(nn.Module):

    def __init__(self, d_model, hidden):
        super(FeedForwardLayer, self).__init__()

        # project up to the hidden size, then back down to d_model
        self.linear1 = nn.Linear(d_model, hidden)
        self.linear2 = nn.Linear(hidden, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.linear1(x)
        x = self.relu(x)
        x = self.linear2(x)
        return x
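As a quick sanity check, the feed-forward layer should keep the (batch, seq_len, d_model) shape unchanged. Here is a minimal sketch, using the original paper’s sizes (d_model = 512, hidden = 2048) as hypothetical values:

import torch

ffn = FeedForwardLayer(d_model=512, hidden=2048)
dummy = torch.randn(2, 10, 512)   # (batch, seq_len, d_model)
out = ffn(dummy)
print(out.shape)                  # torch.Size([2, 10, 512])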

We now have everything we need for the encoder layer; we just need to copy it and connect six copies sequentially, as the original paper did. Remember the copy helper function we defined in part 1? It comes in handy now!
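In case you don’t have part 1 open, the clones helper is typically just deep copies of a module wrapped in an nn.ModuleList. A minimal sketch (your version from part 1 may differ slightly):

import copy
import torch.nn as nn

def clones(module, n_copy):
    # return n_copy independent deep copies of the given module
    return nn.ModuleList([copy.deepcopy(module) for _ in range(n_copy)])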

class Encoder(nn.Module):

    def __init__(self, d_model, hidden, n_head, n_copy):
        super().__init__()

        # stack n_copy identical encoder layers (n_copy = 6 in the paper)
        self.layers = clones(
            EncoderLayer(d_model, hidden, n_head),
            n_copy)

    def forward(self, x):
        # pass the input through each encoder layer in turn
        for layer in self.layers:
            x = layer(x)
        return x
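To check that the stack wires up correctly, we can push a dummy tensor through it. This is a minimal sketch, assuming the MultiHeadAttention from part 1 accepts a single input tensor the way it is called above, and using the paper’s base hyperparameters as hypothetical values:

import torch

encoder = Encoder(d_model=512, hidden=2048, n_head=8, n_copy=6)
dummy = torch.randn(2, 10, 512)   # (batch, seq_len, d_model)
out = encoder(dummy)
print(out.shape)                  # torch.Size([2, 10, 512])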

Now the encoder block is done! We can follow similar logic to write the decoder block. The main difference is that, to prevent the model from cheating by reading future words in the sentence, we have to mask every position after the current position i by setting its attention score to -inf.
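A minimal sketch of that mask as a standalone helper (how you pass it into the attention function from part 1 depends on that function’s interface): put -inf above the diagonal and add the result to the attention scores before the softmax, so every position after i receives zero weight.

import torch

def subsequent_mask(seq_len):
    # -inf above the diagonal: position i cannot attend to positions > i
    future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    return torch.zeros(seq_len, seq_len).masked_fill(future, float('-inf'))

scores = torch.randn(4, 4)                       # toy attention scores
weights = torch.softmax(scores + subsequent_mask(4), dim=-1)
print(weights)                                   # upper triangle is all zeros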
