Learning PyTorch Series Part 8: Transformer from Scratch

📑 Table of Contents - Part 8

Why Transformer? - LSTM limitations and the birth of attention
Self-Attention - Q, K, V: "Which words should I pay attention to?"
Positional Encoding - Giving word-order information to a parallel model
Multi-Head Attention - Attending to many aspects at once
Code: Transformer Block - Full implementation
BERT vs GPT - Encoder-only vs Decoder-only
Summary & Preview of Part 9

⚡

1. Why Transformer?

LSTMs are slow and sequential. Transformers are parallel and far more powerful.

🐌 LSTM (1997)

Sequential: one word is processed per step, so long texts are slow. Long-range dependencies are hard to capture (in "The cat, which sat on the mat, was...", the verb "was" has to be linked back to "cat"). Time steps cannot be parallelized, so training is slow.

⚡ Transformer (2017)

Parallel: ALL words are processed at once. Self-attention connects every word to every other word. Training parallelizes across the whole sequence, which makes it much faster on GPUs than a step-by-step LSTM. The foundation of BERT, GPT, and Claude.

🎯

2. Self-Attention: "Who Should I Pay Attention To?"

Every word asks: "Which words in this sentence are relevant for understanding me?"

🎯 Self-Attention: "The cat sat on the mat"

🎓 Formula: Attention(Q, K, V) = softmax(QK^T / √d_k) · V

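Q·K^T measures how well each pair of words "matches" (a similarity score). Dividing by √d_k keeps the scores from growing with the dimension, so the softmax does not saturate and training stays numerically stable. Softmax turns each row of scores into probabilities (sum = 1). Multiplying by V gives the final output: a weighted average of all words' value vectors, weighted by relevance.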
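To make the formula concrete, here is a minimal sketch that evaluates it by hand on three toy tokens using only the Python standard library. The helper names `softmax` and `attention` and the numbers in `Q`, `K`, `V` are made up for illustration; they are not part of the original listing.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for small list-of-lists matrices."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        # One similarity score per key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    d_v = len(V[0])
    # Each output row is a weighted average of the rows of V
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(d_v)]
        for row in weights
    ]
    return weights, output

# Three toy "tokens" with d_k = 2 (hypothetical values)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

weights, output = attention(Q, K, V)
for w in weights:
    print([round(x, 3) for x in w], "sum =", round(sum(w), 3))
```

Each printed row is one token's attention distribution over all three tokens, and every row sums to 1. Token 0 puts more weight on the keys it matches (tokens 0 and 2) than on token 1.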

27_self_attention.py — From Scratch

import torch
import torch.nn as nn
import math

class SelfAttention(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        self.d_model = d_model
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)

    def forward(self, x):
        # x: [batch, seq_len, d_model]
        Q = self.W_q(x)   # What am I looking for?
        K = self.W_k(x)   # What do I contain?
        V = self.W_v(x)   # What info do I have?

        # Attention scores: Q · K^T / sqrt(d_k)
        scores = torch.matmul(Q, K.transpose(-2, -1))
        scores = scores / math.sqrt(self.d_model)

        # Softmax → attention weights (probabilities)
        attn_weights = torch.softmax(scores, dim=-1)

        # Weighted sum of values
        output = torch.matmul(attn_weights, V)
        return output
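As a sanity check, the three steps in `forward()` above (scores, softmax, weighted sum) can be compared against PyTorch's built-in fused attention. This is a sketch that assumes PyTorch 2.0 or later, where `torch.nn.functional.scaled_dot_product_attention` is available; the tensor sizes are arbitrary, and the learned projections `W_q`, `W_k`, `W_v` are left out so that only the attention math is compared.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H, L, d_k = 2, 4, 5, 16          # batch, heads, sequence length, per-head dim
Q = torch.randn(B, H, L, d_k)
K = torch.randn(B, H, L, d_k)
V = torch.randn(B, H, L, d_k)

# Manual version: same steps as forward() above
scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
manual = torch.softmax(scores, dim=-1) @ V

# Built-in version; by default it also scales by 1/sqrt(d_k)
builtin = F.scaled_dot_product_attention(Q, K, V)

print(torch.allclose(manual, builtin, atol=1e-5))
```

In practice the built-in is usually the better choice for speed and memory, while the from-scratch version is easier to read and to modify.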

🔱

4. Multi-Head Attention

8 attention "heads" look at different aspects in parallel

28_multihead_attention.py

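import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        # Each head gets an equal slice of the model dimension
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.n_heads = n_heads
        self.d_k = d_model // n_heads  # 512/8 = 64 per head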

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, L, D = x.shape
        # Split into heads: [B, L, D] → [B, heads, L, d_k]
        Q = self.W_q(x).view(B, L, self.n_heads, self.d_k).transpose(1,2)
        K = self.W_k(x).view(B, L, self.n_heads, self.d_k).transpose(1,2)
        V = self.W_v(x).view(B, L, self.n_heads, self.d_k).transpose(1,2)

        # Scaled dot-product attention per head
        scores = (Q @ K.transpose(-2,-1)) / math.sqrt(self.d_k)
        attn = torch.softmax(scores, dim=-1)
        out = attn @ V  # [B, heads, L, d_k]

        # Concat heads & project
        out = out.transpose(1,2).contiguous().view(B, L, D)
        return self.W_o(out)
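The `view` and `transpose` calls in `forward()` are the easiest part to get wrong, so it can help to check them in isolation. The sketch below uses the same sizes as the class defaults and skips the learned projections; the variable names are illustrative only.

```python
import torch

B, L, n_heads, d_k = 2, 5, 8, 64
D = n_heads * d_k                     # 512

x = torch.randn(B, L, D)

# Split: [B, L, D] -> [B, L, heads, d_k] -> [B, heads, L, d_k]
heads = x.view(B, L, n_heads, d_k).transpose(1, 2)
print(heads.shape)

# Head h holds the feature slice h*d_k:(h+1)*d_k of every token
h = 3
same_slice = torch.equal(heads[:, h], x[:, :, h * d_k:(h + 1) * d_k])

# Stand-in for `attn @ V`: a fresh tensor laid out as [B, heads, L, d_k]
out = heads.contiguous()

# After transposing back, memory is no longer contiguous, so view() needs a copy first
needs_copy = not out.transpose(1, 2).is_contiguous()

# Merge: [B, heads, L, d_k] -> [B, L, heads, d_k] -> [B, L, D]
merged = out.transpose(1, 2).contiguous().view(B, L, D)
roundtrip = torch.equal(merged, x)
print(same_slice, needs_copy, roundtrip)
```

The split is only a reinterpretation of the last dimension, which is why splitting and then merging returns the original tensor unchanged.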

🏗️

5. Full Transformer Block

Multi-Head Attention + Feed-Forward + LayerNorm + Residual

29_transformer_block.py

import torch.nn as nn
# Assumes MultiHeadAttention from 28_multihead_attention.py is in scope

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Sub-layer 1 (pre-norm): LayerNorm → Multi-Head Attention → Dropout → Residual
        attn_out = self.attn(self.norm1(x))
        x = x + self.dropout(attn_out)    # Residual!

        # Sub-layer 2 (pre-norm): LayerNorm → Feed-Forward → Dropout → Residual
        ff_out = self.ff(self.norm2(x))
        x = x + self.dropout(ff_out)       # Residual!
        return x

# Stack 6 blocks = Transformer Encoder
encoder = nn.Sequential(*[TransformerBlock() for _ in range(6)])
# Total params: ~18.9M (6 × ~3.15M per block, excluding embeddings)
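The size of the six-block stack can be worked out by hand from the layer shapes. The sketch below counts only the learnable parameters of the layers defined in `TransformerBlock` and `MultiHeadAttention` above, with the default sizes; `linear_params` is a hypothetical helper introduced here for the arithmetic.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of an nn.Linear(n_in, n_out) layer."""
    return n_in * n_out + n_out

d_model, d_ff, n_layers = 512, 2048, 6

attn = 4 * linear_params(d_model, d_model)                        # W_q, W_k, W_v, W_o
ff = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)  # two Linear layers
norms = 2 * (2 * d_model)                                         # weight + bias per LayerNorm

per_block = attn + ff + norms
total = n_layers * per_block
print(f"per block: {per_block:,}  total: {total:,}")
```

The printed total covers the encoder blocks only. Token embeddings, positional encodings, and any output head would add to it.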

🔀

6. BERT vs GPT: Two "Descendants" of the Transformer

Encoder-only vs Decoder-only: two different paradigms

Aspect	BERT (Encoder)	GPT (Decoder)
Architecture	Encoder-only (bidirectional)	Decoder-only (autoregressive)
Attention	Sees ALL words (left + right)	Sees only PREVIOUS words
Training	Masked Language Model (fill in the masked words)	Next Token Prediction (predict the next token)
Best For	Classification, NER, QA, understanding	Text generation, chat, coding
Examples	BERT, RoBERTa, DeBERTa	GPT-4, Claude, LLaMA, Gemini
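The "Attention" row of the table comes down to a single mask on the score matrix. The sketch below illustrates that one difference with random scores, under the assumption that the decoder uses a standard lower-triangular (causal) mask. It does not reproduce how BERT or GPT are actually implemented or trained.

```python
import math
import torch

torch.manual_seed(0)
L, d_k = 4, 8
Q = torch.randn(L, d_k)
K = torch.randn(L, d_k)

scores = Q @ K.T / math.sqrt(d_k)

# Encoder-style (bidirectional): every position attends to every position
bidirectional = torch.softmax(scores, dim=-1)

# Decoder-style (causal): position i may only attend to positions <= i
causal_mask = torch.tril(torch.ones(L, L, dtype=torch.bool))
masked_scores = scores.masked_fill(~causal_mask, float("-inf"))
causal = torch.softmax(masked_scores, dim=-1)

print(bidirectional)
print(causal)
```

In the causal matrix everything above the diagonal is zero, so the first token can only attend to itself, while each row still sums to 1.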

📝

7. Summary of Part 8

Transformer fundamentals

Concept	What It Is	Key Code
Self-Attention	Every word "attends to" all other words	`softmax(QK^T/√d_k) · V`
Q, K, V	Query, Key, Value projections	`W_q, W_k, W_v = nn.Linear`
Multi-Head	Parallel attention from different perspectives	`8 heads × 64 dim = 512`
Positional Encoding	Provides word-order information	Sin/cos functions
LayerNorm + Residual	Stabilizes training, helps gradient flow	`x = x + dropout(sublayer(norm(x)))`
Feed-Forward	Non-linear processing per position	`Linear → GELU → Linear`
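The table lists sin/cos positional encoding without code, so here is a minimal sketch of the fixed sinusoidal scheme from "Attention Is All You Need": PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The function name `sinusoidal_positional_encoding` is hypothetical, and the code assumes `d_model` is even.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a [seq_len, d_model] tensor of fixed sin/cos position encodings."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # [seq_len, 1]
    # 1 / 10000^(2i / d_model) for each even feature index 2i
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even feature indices
    pe[:, 1::2] = torch.cos(position * div_term)  # odd feature indices
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d_model=512)
x = torch.randn(2, 10, 512)   # [batch, seq_len, d_model] token embeddings
x_with_pos = x + pe           # broadcasts over the batch dimension
print(pe.shape, x_with_pos.shape)
```

Because self-attention on its own ignores token order, this tensor is added to the embeddings before the first Transformer block so that each position carries a distinct signal.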

📙

Next: Part 9 — Advanced Training Techniques

Mixed Precision, Distributed Training, Gradient Accumulation, Learning Rate Scheduling, Hyperparameter Tuning, and torch.compile. Optimizations for faster and more efficient training.

🔥

Tech Review Desk: Learning PyTorch Series

Sources: "Attention Is All You Need" (Vaswani et al., 2017), pytorch.org, The Illustrated Transformer (Jay Alammar).

📧 rominur@gmail.com • ✈️ t.me/Jekardah_AI