๐Ÿ“ Artikel ini ditulis dalam Bahasa Indonesia & English
๐Ÿ“ This article is available in English & Bahasa Indonesia

⚡ Neural Network Tutorial — Page 8

Transformer &
Attention Mechanism



The revolutionary architecture behind GPT, BERT, and all modern LLMs. Page 8 covers: Self-Attention, Scaled Dot-Product Attention, Multi-Head Attention, Positional Encoding, and building a Transformer block from scratch — understanding the engine behind modern AI.

📅 March 2026 · ⏱ 32 min read
๐Ÿท TransformerSelf-AttentionMulti-HeadPositional EncodingGPTBERT
📚 Neural Network Tutorial Series:

📑 Table of Contents — Page 8

  1. Why Transformer? — RNN limitations solved
  2. Self-Attention — Every word "looks at" all other words
  3. Scaled Dot-Product Attention — Q, K, V from scratch
  4. Multi-Head Attention — Parallel attention from multiple perspectives
  5. Positional Encoding — Telling the model word position
  6. Transformer Block — Attention + FFN + LayerNorm + Residual
  7. GPT vs BERT — Decoder-only vs Encoder-only
  8. Summary & Page 9 Preview
🤔

1. Why Transformer? — "Attention Is All You Need"

RNN processes sequentially (slow), Transformer processes in parallel (fast!)


RNNs/LSTMs have two problems: (1) they process words one at a time — no parallelism, so they are slow on GPUs; (2) information from early words "fades" by the end of the sequence. The Transformer solves both with Self-Attention — every word can directly "see" every other word, simultaneously.

RNN vs Transformer:
  RNN (sequential):       w₁ → w₂ → w₃ → w₄   (one at a time) — O(n) sequential steps, weak long-range modeling, low GPU utilization
  Transformer (parallel): w₁ ⟷ w₂ ⟷ w₃ ⟷ w₄   (all at once!) — O(1) parallel steps, strong long-range modeling, maximum GPU utilization
๐Ÿ‘๏ธ

2. Self-Attention — Scaled Dot-Product

Query, Key, Value — an elegant "search" mechanism

Each word produces three vectors: Query (what it is looking for), Key (what it offers), and Value (the information it provides). The attention score measures how well a Query matches a Key; the output is the weighted sum of the Values.

34_self_attention.py — Self-Attention from Scratch (Python)
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q, K, V: (batch, seq_len, d_k)
    Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) · V
    """
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (B, S, S)

    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)  # causal mask

    weights = softmax(scores)        # attention weights
    output = weights @ V             # weighted sum of values
    return output, weights

# Demo: 1 batch, 4 words, 8-dim embeddings
seq_len, d_model = 4, 8
X = np.random.randn(1, seq_len, d_model)

# Project to Q, K, V
W_q = np.random.randn(d_model, d_model) * 0.1
W_k = np.random.randn(d_model, d_model) * 0.1
W_v = np.random.randn(d_model, d_model) * 0.1

Q, K, V = X @ W_q, X @ W_k, X @ W_v
out, attn_weights = scaled_dot_product_attention(Q, K, V)
print(f"Output shape: {out.shape}")      # (1, 4, 8)
print(f"Attn weights:\n{attn_weights[0].round(3)}")
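The `mask` argument above is what turns this into causal (GPT-style) attention. A minimal sketch, repeating the helper definitions from 34_self_attention.py so it runs on its own: `np.tril` builds a lower-triangular keep-mask (1 = attend, 0 = blocked), so each position can only attend to itself and earlier positions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)  # block masked positions
    weights = softmax(scores)
    return weights @ V, weights

np.random.seed(0)
seq_len, d_k = 4, 8
Q = np.random.randn(1, seq_len, d_k)
K = np.random.randn(1, seq_len, d_k)
V = np.random.randn(1, seq_len, d_k)

# Causal keep-mask: lower triangle of 1s, broadcast over the batch dim
causal_mask = np.tril(np.ones((seq_len, seq_len)))

_, w = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
# Each row still sums to 1, but weights on future positions are ~0,
# so the weight matrix comes out lower-triangular:
print(w[0].round(3))
```

Row 0 attends only to itself (weight 1.0), row 1 splits its weight over positions 0-1, and so on — exactly the pattern a decoder-only model like GPT uses during training.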
🔀

3. Multi-Head Attention & Transformer Block

Parallel attention from multiple perspectives + Feed-Forward + Residual
35_transformer_block.py — Transformer Block (Python)
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MultiHeadAttention:
    def __init__(self, d_model, num_heads):
        self.h = num_heads
        self.d_k = d_model // num_heads
        s = np.sqrt(2.0/d_model)
        self.W_q = np.random.randn(d_model, d_model) * s
        self.W_k = np.random.randn(d_model, d_model) * s
        self.W_v = np.random.randn(d_model, d_model) * s
        self.W_o = np.random.randn(d_model, d_model) * s

    def forward(self, X, mask=None):
        B, S, D = X.shape
        Q = (X @ self.W_q).reshape(B, S, self.h, self.d_k).transpose(0,2,1,3)
        K = (X @ self.W_k).reshape(B, S, self.h, self.d_k).transpose(0,2,1,3)
        V = (X @ self.W_v).reshape(B, S, self.h, self.d_k).transpose(0,2,1,3)
        scores = Q @ K.transpose(0,1,3,2) / np.sqrt(self.d_k)
        if mask is not None: scores = np.where(mask==0, -1e9, scores)
        w = softmax(scores)
        out = (w @ V).transpose(0,2,1,3).reshape(B, S, D)
        return out @ self.W_o

class TransformerBlock:
    def __init__(self, d_model, num_heads, d_ff):
        self.attn = MultiHeadAttention(d_model, num_heads)
        self.ff_W1 = np.random.randn(d_model, d_ff) * np.sqrt(2/d_model)
        self.ff_b1 = np.zeros((1,1,d_ff))
        self.ff_W2 = np.random.randn(d_ff, d_model) * np.sqrt(2/d_ff)
        self.ff_b2 = np.zeros((1,1,d_model))

    def layer_norm(self, x, eps=1e-6):
        mean = x.mean(-1, keepdims=True)
        std = x.std(-1, keepdims=True)
        return (x - mean) / (std + eps)

    def forward(self, x, mask=None):
        # Self-Attention + Residual + LayerNorm
        attn_out = self.attn.forward(x, mask)
        x = self.layer_norm(x + attn_out)
        # Feed-Forward + Residual + LayerNorm
        ff = np.maximum(0, x @ self.ff_W1 + self.ff_b1) @ self.ff_W2 + self.ff_b2
        x = self.layer_norm(x + ff)
        return x

# Demo
block = TransformerBlock(d_model=64, num_heads=4, d_ff=128)
X = np.random.randn(2, 10, 64)  # batch=2, seq=10, dim=64
out = block.forward(X)
print(f"Transformer block output: {out.shape}")  # (2, 10, 64)


🎓 Positional Encoding: Since Transformers have no built-in notion of order (unlike RNNs), we add position signals using sin/cos functions: PE(pos, 2i) = sin(pos/10000^(2i/d)) and PE(pos, 2i+1) = cos(pos/10000^(2i/d)). This tells the model "this word is at position 5".
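The sin/cos scheme in the note above can be sketched directly in NumPy: even dimensions get sin, odd dimensions get cos, so every position receives a unique, smoothly varying signature that is simply added to the word embeddings before the first block.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal PE: PE(pos, 2i) = sin(pos / 10000^(2i/d)),
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d))."""
    pos = np.arange(seq_len)[:, None]                # (S, 1)
    i = np.arange(0, d_model, 2)[None, :]            # even dims: (1, d/2)
    angles = pos / np.power(10000.0, i / d_model)    # (S, d/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dims -> sin
    pe[:, 1::2] = np.cos(angles)                     # odd dims  -> cos
    return pe

pe = positional_encoding(seq_len=10, d_model=64)
print(pe.shape)  # (10, 64)
# Usage: X = token_embeddings + pe  (broadcast over the batch dim)
```

At position 0 every sin entry is 0 and every cos entry is 1, and all values stay in [-1, 1], so the signal is the same scale as typical embeddings.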

📝

4. Page 8 Summary

What we learned
Concept              | What It Is                                       | Key Code
Self-Attention       | Every token attends to all others                | softmax(QK^T/√d)·V
Q, K, V              | Query, Key, Value — three projections of input   | X @ W_q, X @ W_k, X @ W_v
Multi-Head           | Parallel attention from H perspectives           | concat(head_1..head_h) @ W_o
Positional Encoding  | Word position signal (sin/cos)                   | sin(pos/10000^(2i/d))
Layer Norm           | Per-token normalization (not per-batch)          | (x - mean) / (std + ε)
Residual Connection  | Skip connection: x + sublayer(x)                 | x = x + attn(x)
GPT                  | Decoder-only Transformer (causal mask)           | mask = tril(ones)
BERT                 | Encoder-only Transformer (bidirectional)         | no causal mask
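The GPT and BERT rows differ only in the mask. A minimal sketch of the two mask shapes, following this page's convention that attention is kept where mask == 1 and blocked where mask == 0 (so the causal keep-mask is lower-triangular):

```python
import numpy as np

S = 4  # sequence length

# GPT (decoder-only): causal — position i may attend only to positions <= i
gpt_mask = np.tril(np.ones((S, S)))

# BERT (encoder-only): bidirectional — every position attends everywhere
bert_mask = np.ones((S, S))

print(gpt_mask)
```

Passing `gpt_mask` to the attention functions above yields the autoregressive behavior GPT needs for next-token prediction, while `bert_mask` (or no mask at all) gives BERT's full bidirectional context.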
โ† Page Sebelumnyaโ† Previous Page

Page 7 — Generative Adversarial Network (GAN)


📘

Coming Next: Page 9 — Transfer Learning & Fine-Tuning

Using giant pre-trained models (ResNet, BERT, GPT) and adapting them for your specific task. Feature extraction, fine-tuning strategies, and domain adaptation. Stay tuned!