Table of Contents – Page 8

- Why Transformer? – RNN limitations solved
- Self-Attention – Every word "looks at" all other words
- Scaled Dot-Product Attention – Q, K, V from scratch
- Multi-Head Attention – Parallel attention from multiple perspectives
- Positional Encoding – Telling the model word position
- Transformer Block – Attention + FFN + LayerNorm + Residual
- GPT vs BERT – Decoder-only vs Encoder-only
- Summary & Page 9 Preview
1. Why Transformer? – "Attention Is All You Need"
RNN/LSTM architectures have two problems: (1) they process words one at a time, so they cannot be parallelized and are slow on GPUs; (2) information from early words "fades" by the end of the sequence. The Transformer solves both with Self-Attention: every word can directly "see" every other word, simultaneously.
2. Self-Attention – Scaled Dot-Product
Each word produces three vectors: Query (what it is looking for), Key (what it offers), and Value (the information it provides). The attention score measures how well a Query matches a Key; the output is the weighted sum of the Values.
```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q, K, V: (batch, seq_len, d_k)
    Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) · V
    """
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (B, S, S)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)    # causal mask
    weights = softmax(scores)                         # attention weights
    output = weights @ V                              # weighted sum of values
    return output, weights

# Demo: 1 batch, 4 words, 8-dim embeddings
seq_len, d_model = 4, 8
X = np.random.randn(1, seq_len, d_model)

# Project to Q, K, V
W_q = np.random.randn(d_model, d_model) * 0.1
W_k = np.random.randn(d_model, d_model) * 0.1
W_v = np.random.randn(d_model, d_model) * 0.1
Q, K, V = X @ W_q, X @ W_k, X @ W_v

out, attn_weights = scaled_dot_product_attention(Q, K, V)
print(f"Output shape: {out.shape}")  # (1, 4, 8)
print(f"Attn weights:\n{attn_weights[0].round(3)}")
```
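The `mask` argument is what makes the attention causal (GPT-style). A minimal sketch of building such a mask, assuming the convention used here (mask entry 1 = "allowed", 0 = "blocked"), which makes the causal mask lower-triangular via `np.tril`:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_k = 4, 8
# Causal mask: position i may attend only to positions <= i,
# so the allowed (1) entries form the lower triangle.
mask = np.tril(np.ones((seq_len, seq_len)))

Q = np.random.randn(1, seq_len, d_k)
K = np.random.randn(1, seq_len, d_k)
V = np.random.randn(1, seq_len, d_k)

scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
scores = np.where(mask == 0, -1e9, scores)  # block attention to future tokens
weights = softmax(scores)

# Row 0 attends only to token 0; all future positions get ~0 weight
print(weights[0].round(3))
```

Each row of the printed matrix still sums to 1, but everything above the diagonal is zero: token i never "sees" tokens that come after it.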
3. Multi-Head Attention & Transformer Block
```python
import numpy as np

def softmax(x, axis=-1):  # as defined in the previous snippet
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MultiHeadAttention:
    def __init__(self, d_model, num_heads):
        self.h = num_heads
        self.d_k = d_model // num_heads
        s = np.sqrt(2.0 / d_model)
        self.W_q = np.random.randn(d_model, d_model) * s
        self.W_k = np.random.randn(d_model, d_model) * s
        self.W_v = np.random.randn(d_model, d_model) * s
        self.W_o = np.random.randn(d_model, d_model) * s

    def forward(self, X, mask=None):
        B, S, D = X.shape
        # Split into h heads: (B, S, D) -> (B, h, S, d_k)
        Q = (X @ self.W_q).reshape(B, S, self.h, self.d_k).transpose(0, 2, 1, 3)
        K = (X @ self.W_k).reshape(B, S, self.h, self.d_k).transpose(0, 2, 1, 3)
        V = (X @ self.W_v).reshape(B, S, self.h, self.d_k).transpose(0, 2, 1, 3)
        scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(self.d_k)
        if mask is not None:
            scores = np.where(mask == 0, -1e9, scores)
        w = softmax(scores)
        # Merge heads back: (B, h, S, d_k) -> (B, S, D)
        out = (w @ V).transpose(0, 2, 1, 3).reshape(B, S, D)
        return out @ self.W_o

class TransformerBlock:
    def __init__(self, d_model, num_heads, d_ff):
        self.attn = MultiHeadAttention(d_model, num_heads)
        self.ff_W1 = np.random.randn(d_model, d_ff) * np.sqrt(2 / d_model)
        self.ff_b1 = np.zeros((1, 1, d_ff))
        self.ff_W2 = np.random.randn(d_ff, d_model) * np.sqrt(2 / d_ff)
        self.ff_b2 = np.zeros((1, 1, d_model))

    def layer_norm(self, x, eps=1e-6):
        mean = x.mean(-1, keepdims=True)
        std = x.std(-1, keepdims=True)
        return (x - mean) / (std + eps)

    def forward(self, x, mask=None):
        # Self-Attention + Residual + LayerNorm
        attn_out = self.attn.forward(x, mask)
        x = self.layer_norm(x + attn_out)
        # Feed-Forward (ReLU) + Residual + LayerNorm
        ff = np.maximum(0, x @ self.ff_W1 + self.ff_b1) @ self.ff_W2 + self.ff_b2
        x = self.layer_norm(x + ff)
        return x

# Demo
block = TransformerBlock(d_model=64, num_heads=4, d_ff=128)
X = np.random.randn(2, 10, 64)  # batch=2, seq=10, dim=64
out = block.forward(X)
print(f"Transformer block output: {out.shape}")  # (2, 10, 64)
```
Positional Encoding: Since the Transformer has no built-in notion of order (unlike an RNN), we add a position signal using sin/cos functions: PE(pos, 2i) = sin(pos/10000^(2i/d)). This tells the model "this word is at position 5".
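The sin/cos formula can be sketched in a few lines of NumPy. This is a minimal sketch: `positional_encoding` is a hypothetical helper name, and it follows the standard pairing PE(pos, 2i) = sin(·), PE(pos, 2i+1) = cos(·) from the "Attention Is All You Need" paper:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding:
    PE(pos, 2i)   = sin(pos / 10000^(2i/d))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
    """
    pos = np.arange(seq_len)[:, None]              # (S, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d/2) — even dims
    angles = pos / np.power(10000.0, i / d_model)  # (S, d/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even indices: sin
    pe[:, 1::2] = np.cos(angles)                   # odd indices: cos
    return pe

pe = positional_encoding(10, 64)
X = np.random.randn(1, 10, 64)
X_with_pos = X + pe  # broadcast over the batch dimension
print(pe.shape)      # (10, 64)
```

The encoding is simply added to the input embeddings before the first Transformer block; no extra parameters are learned for it.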
4. Page 8 Summary
| Concept | What It Is | Key Code |
|---|---|---|
| Self-Attention | Every token attends to all others | softmax(QK^T/√d)·V |
| Q, K, V | Query, Key, Value – three projections of the input | X @ W_q, X @ W_k, X @ W_v |
| Multi-Head | Parallel attention from H perspectives | concat(head_1..head_h) @ W_o |
| Positional Encoding | Word position signal (sin/cos) | sin(pos/10000^(2i/d)) |
| Layer Norm | Per-token normalization (not per-batch) | (x - mean) / (std + ε) |
| Residual Connection | Skip connection: x + sublayer(x) | x = x + attn(x) |
| GPT | Decoder-only Transformer (causal mask) | mask = tril(ones) |
| BERT | Encoder-only Transformer (bidirectional) | no causal mask |
Page 7 – Generative Adversarial Network (GAN)
Coming Next: Page 9 – Transfer Learning & Fine-Tuning
Using giant pre-trained models (ResNet, BERT, GPT) and adapting them to your specific task: feature extraction, fine-tuning strategies, and domain adaptation. Stay tuned!