Transformers — Encoder Decoder Architecture (part-1)

5 min read · Jun 11, 2024

This post is about the Transformer architecture behind models that have become very popular recently, such as BERT and GPT-3. CLIP, which I covered in an earlier post, also uses Transformers. To follow the Transformer, it helps to first know an earlier deep learning architecture, the Recurrent Neural Network (RNN). The content is a little long.

Part 1 : Input Embedding and Positional Embedding
Part 2 : Encoder Stage
Part 3 : Decoder Stage
Part 4 : Pros and Cons, Resources
Part 5 : Research Papers using Transformers

Before getting to Transformers, a few words on the encoder-decoder design. Encoder-decoder models grew out of sequence-to-sequence tasks such as Machine Translation, Text Summarization, and Question Answering. I think the figure will make the difference clear.

There is a lot to say about Transformers, so I plan to explain the architecture in three parts:

Input embedding and Positional embedding
Encoder
Decoder

I follow the paper "Attention Is All You Need" and add some extra information along the way. The English passages quoted below come from the paper, and I explain them one by one.

Attention Is All You Need paper link: https://arxiv.org/abs/1706.03762

For each part, I use the simplest example I can and explain in detail how the math is carried out.

This post covers the Input Embedding and Positional Embedding that sit inside the Encoder.

What is Input Embedding ?

In short, the embedding layer is the layer that learns the information carried by textual words. Words have both semantic and syntactic meaning, and the embedding has to represent both.

The embedding process itself consists of several steps.

Step (1) Tokenization

Each incoming input sentence first has to be tokenized.

The example here is split at the word level. There are many tokenization schemes: Word Level Tokenization, Character Level Tokenization, Byte Pair Encoding (BPE), WordPiece, Sentence Tokenization, and so on. The better the tokenization, the better the downstream model tends to perform. I won't go into detail here, but it is a good topic for a deep dive. For Burmese, even a good tokenizer is still hard to find.
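To make the difference between two of the schemes above concrete, here is a minimal sketch of word-level and character-level tokenization. The function names and the regular expression are my own illustration, not something taken from the paper, and real tokenizers such as BPE or WordPiece are considerably more involved.

```python
import re

def word_level_tokenize(sentence):
    # Runs of word characters become one token; each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", sentence)

def char_level_tokenize(sentence):
    # Every character, including spaces, becomes a token.
    return list(sentence)

sentence = "Hello world!"
word_tokens = word_level_tokenize(sentence)
char_tokens = char_level_tokenize(sentence)

print(word_tokens)        # ['Hello', 'world', '!']
print(len(char_tokens))   # 12
```

Word-level tokens keep sequences short but need a large vocabulary, while character-level tokens need a tiny vocabulary but produce much longer sequences.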

Step (2) Vocabulary Mapping and Embedding look up

The figure makes this clearer. To get the vocabulary size of a dataset, take the tokens produced by the split above and keep only the unique ones. If there are 128 unique tokens in total, the indices run from 0 to 127. Each word is then represented by its integer index instead of the word itself. This is the most basic approach. Instead of building it this way, you can also take embeddings from a separately trained embedding model, for example word2vec from Google.

(‘Hello’, ‘world’, ‘!’, <pad>, <pad>, …, <pad>)

After vocabulary mapping:

(8667, 1362, 106, 0, 0, …, 0)

Here 0 stands for the <pad> token. Transformer architectures use many other special tokens as well, such as [CLS] (classification token), [SEP] (separator token), [PAD] (padding token), [MASK] (masking token), [UNK] (unknown token), and [EOS] (end of sentence token). Some of these will come up in the later parts.

Once we have the word embedding, the next step is the Positional Embedding.
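The vocabulary mapping, padding, and embedding lookup described in this step can be sketched as follows. This is only an illustration under my own assumptions: the helper names `build_vocab` and `encode`, the tiny two-sentence corpus, and the random embedding table are all made up, so the resulting ids differ from the 8667, 1362, 106 in the example above. It also does not handle unknown tokens or sentences longer than `max_length`.

```python
import numpy as np

PAD = "<pad>"

def build_vocab(tokenized_sentences):
    # Index 0 is reserved for the padding token, matching the example above.
    vocab = {PAD: 0}
    for tokens in tokenized_sentences:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def encode(tokens, vocab, max_length):
    # Map tokens to integer ids, then right-pad with the <pad> id.
    ids = [vocab[token] for token in tokens]
    return ids + [vocab[PAD]] * (max_length - len(ids))

corpus = [["Hello", "world", "!"], ["Hello", "again"]]
vocab = build_vocab(corpus)
token_ids = encode(corpus[0], vocab, max_length=6)
print(token_ids)  # [1, 2, 3, 0, 0, 0]

# Embedding lookup: one row per vocabulary entry.
# The values are random here; in a real model they are learned.
d_model = 4
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))
embedded = embedding_table[token_ids]
print(embedded.shape)  # (6, 4)
```

Each integer id simply selects one row of the table, so all the padded positions end up with the same vector.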

What is Positional Embedding ?

“Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence.” (https://arxiv.org/abs/1706.03762)

With the earlier RNN architectures, the model learned the order of sequential data by processing it sequentially. On very long sentences, an RNN gradually loses the earlier context the further it goes, so attention mechanisms were added. Even then, RNNs still have dependency issues on long sequences. The Transformer architecture no longer processes tokens sequentially and instead learns over them in parallel. To keep track of position, a Positional Embedding is therefore computed and added in. The example below shows what can happen to the data when position is left out.

Tom bit a dog.
A dog bit Tom.

The two sentences have completely different meanings. That is why positional embedding matters so much for Transformers, which do not learn sequentially the way an RNN does.

How is the Positional Embedding computed?
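A small sketch can make this point concrete using the two sentences above. Everything here is an assumption for illustration: the tokens are lowercased with punctuation dropped, the word vectors are random stand-ins for learned embeddings, and the position vectors are random stand-ins rather than the paper's sinusoidal encoding.

```python
import numpy as np

sentence_a = ["tom", "bit", "a", "dog"]
sentence_b = ["a", "dog", "bit", "tom"]

rng = np.random.default_rng(0)
vocab = {word: index for index, word in enumerate(sorted(set(sentence_a)))}
word_vectors = rng.normal(size=(len(vocab), 4))   # stand-in word embeddings
position_vectors = rng.normal(size=(4, 4))        # stand-in position vectors

def embed(tokens, use_position):
    # Look up one vector per token, optionally adding the vector for its position.
    vectors = word_vectors[[vocab[token] for token in tokens]]
    if use_position:
        vectors = vectors + position_vectors[: len(tokens)]
    return vectors

# Vector for "tom" in each sentence, without position information.
tom_a_plain = embed(sentence_a, False)[sentence_a.index("tom")]
tom_b_plain = embed(sentence_b, False)[sentence_b.index("tom")]
same_without_position = bool(np.allclose(tom_a_plain, tom_b_plain))

# Vector for "tom" in each sentence, with position information added.
tom_a_pos = embed(sentence_a, True)[sentence_a.index("tom")]
tom_b_pos = embed(sentence_b, True)[sentence_b.index("tom")]
same_with_position = bool(np.allclose(tom_a_pos, tom_b_pos))

print(same_without_position)  # True: "tom" looks identical in both sentences
print(same_with_position)     # False: "tom" at position 0 differs from "tom" at position 3
```

Without position, both sentences are just the same four vectors in a different row order, so a model that looks at all tokens in parallel has nothing to tell them apart.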

“The positional encodings have the same dimension d_model as the embeddings, so that the two can be summed.” (https://arxiv.org/abs/1706.03762)

The Positional Embedding has the same dimension as the input embedding. In the base Transformer the embedding size is 512, so the Positional Embedding is also 512-dimensional. In the example here the dimension is 3. Because the dimensions match, the two vectors can simply be added together.

The way the Positional Embedding is computed depends on whether the embedding dimension index is even or odd (not on whether the token's position is even or odd).

“In this work, we use sine and cosine functions of different frequencies:

PE_{(pos, 2i)} = \sin(pos / 10000^{2i / d_{model}})

PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i / d_{model}})”

“where pos is the position and i is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π.” (https://arxiv.org/abs/1706.03762)

In other words, even dimension indices (2i) use sine, odd dimension indices (2i + 1) use cosine, and each sine/cosine pair shares the same frequency term.
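The wavelength statement in the quote can be checked numerically. A sinusoid sin(pos / c) repeats every 2π · c positions, so the pair with index i has wavelength 2π · 10000^(2i / d_model). The sketch below assumes the base model's d_model = 512; the variable names are my own.

```python
import numpy as np

d_model = 512
pair_index = np.arange(d_model // 2)   # i in the formula, one per sine/cosine pair
wavelengths = 2 * np.pi * 10000 ** (2 * pair_index / d_model)

# In a geometric progression, the ratio between neighbours is constant.
ratios = wavelengths[1:] / wavelengths[:-1]
constant_ratio = bool(np.allclose(ratios, 10000 ** (2 / d_model)))

print(wavelengths[0])    # 2π, about 6.283
print(wavelengths[-1])   # a little under 10000 · 2π
print(constant_ratio)    # True
```

The first dimensions therefore change quickly from one position to the next, while the last ones change very slowly across the whole sequence.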

The values in the example are random, so they are not the real values the formula would give. They are only there to illustrate the idea.

Did something occur to you here? Why are the vectors added element-wise? Why not concatenate them instead? 🤔
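For comparison, here is a minimal sketch that computes the actual formula values for an example of this size (3 positions, 3 dimensions). It follows the paper's convention of sine on even dimension indices and cosine on odd ones; the function name `sinusoidal_encoding` is my own.

```python
import numpy as np

def sinusoidal_encoding(seq_length, d_model):
    positions = np.arange(seq_length)[:, None]   # shape (seq_length, 1)
    dims = np.arange(d_model)[None, :]           # shape (1, d_model)
    # Dimensions 2i and 2i + 1 share the same angle pos / 10000^(2i / d_model).
    angles = positions / (10000 ** (2 * (dims // 2) / d_model))
    # Even dimensions take the sine, odd dimensions take the cosine.
    return np.where(dims % 2 == 0, np.sin(angles), np.cos(angles))

encoding = sinusoidal_encoding(seq_length=3, d_model=3)
print(np.round(encoding, 4))
# Approximately:
# position 0 -> [0.0,     1.0,     0.0   ]
# position 1 -> [0.8415,  0.5403,  0.0022]
# position 2 -> [0.9093, -0.4161,  0.0043]
```

Position 0 always comes out as alternating 0 and 1, because sin(0) = 0 and cos(0) = 1.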

Element-Wise vs Concatenation

With vector concatenation, the positional embedding and the word embedding stay separate, so they remain independent and optimization may work better. But because the two vectors are joined end to end, a 5-dimensional word embedding plus a 5-dimensional positional embedding becomes 10 dimensions in total. The extra dimensions make optimization take longer, and storing data with twice the dimensions needs more memory. This is a common explanation for why the paper sums the two vectors element-wise. If you want to read more, there are many research papers on this question.
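The size difference between the two options is easy to see in code. This is a sketch with random stand-in values and the 5-dimensional embeddings from the paragraph above; the sequence length of 3 is my own choice.

```python
import numpy as np

seq_length, d_model = 3, 5
rng = np.random.default_rng(0)
word_embeddings = rng.normal(size=(seq_length, d_model))         # stand-in values
positional_embeddings = rng.normal(size=(seq_length, d_model))   # stand-in values

# Element-wise addition keeps the dimension at d_model.
summed = word_embeddings + positional_embeddings

# Concatenation joins the vectors end to end, doubling the dimension.
concatenated = np.concatenate([word_embeddings, positional_embeddings], axis=-1)

print(summed.shape)                         # (3, 5)
print(concatenated.shape)                   # (3, 10)
print(concatenated.nbytes / summed.nbytes)  # 2.0
```

Every layer that consumes the concatenated vectors would also need weights for the wider input, which is where the extra optimization cost comes from.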

import numpy as np
import matplotlib.pyplot as plt

def get_positional_encoding(seq_length, d_model):
    """
    Compute the positional encoding matrix.

    Args:
    seq_length (int): The length of the sequence.
    d_model (int): The dimension of the embeddings.

    Returns:
    np.ndarray: A seq_length x d_model matrix containing the positional encodings.
    """
    # Initialize the positional encoding matrix
    positional_encoding = np.zeros((seq_length, d_model))
    
    # Compute the positional encoding values
    for pos in range(seq_length):
        for i in range(0, d_model, 2):
            # Dimensions i (even) and i + 1 (odd) share the same frequency term.
            angle = pos / (10000 ** (i / d_model))
            positional_encoding[pos, i] = np.sin(angle)
            if i + 1 < d_model:
                positional_encoding[pos, i + 1] = np.cos(angle)
    
    return positional_encoding

# Example usage
seq_length = 128  # Length of the sequence (3 in the small example above)
d_model = 512     # Dimension of the embeddings

# Get the positional encodings
positional_encodings = get_positional_encoding(seq_length, d_model)

# Display the positional encodings
print(positional_encodings)

# Visualize the positional encodings
plt.figure(figsize=(10, 8))
plt.pcolor(positional_encodings, cmap='viridis')
plt.colorbar()
plt.xlabel('Embedding Dimension')
plt.ylabel('Position')
plt.title('Positional Encodings')
plt.show()

In the plot, the y axis runs from 0 to 127 and the x axis from 0 to 511. The y axis is the position within the sequence (up to the maximum length) and the x axis is the embedding dimension. The color bar on the side spans -1 to 1: colors toward the blue end are closer to -1, and yellow is closer to 1. The plot shows that each position gets its own pattern of values, which is what makes positional information usable by the Transformer.

After this step, the Positional Embedding and the Input Embedding are added together and passed to the next layer, the Attention layer. I will cover that in a separate post. The Encoder has only two sub-layers, Attention and Feed Forward, but there are several kinds of attention, such as Dot Product attention and Multi Head attention, so I will go through them in detail.

Stay tuned for more updates 🤗