๐Ÿ“ Artikel ini ditulis dalam Bahasa Indonesia
🔥 PyTorch Learning Series, Part 8

Transformer: Building "Attention is All You Need" from Scratch

The most influential paper of the decade. Part 8 builds the Transformer architecture from scratch: the Self-Attention mechanism, Multi-Head Attention, Positional Encoding, and the Encoder-Decoder stack. The foundation of GPT, BERT, Claude, Gemini, and every modern LLM starts here.

📅 March 2026 · ⏱ 35 min read · 🏷 Transformer • Self-Attention • Multi-Head • Positional Encoding

📑 Table of Contents – Part 8

  1. Why Transformer? – LSTM's limits & the birth of attention
  2. Self-Attention – Q, K, V: "Which words should I attend to?"
  3. Positional Encoding – Giving order to parallelism
  4. Multi-Head Attention – Attend to many aspects at once
  5. Code: Transformer Block – Full implementation
  6. BERT vs GPT – Encoder-only vs Decoder-only
  7. Summary & Part 9 Preview

1. Why Transformer?

LSTM is slow and sequential. The Transformer is parallel and far more powerful.

๐ŸŒ LSTM (2015)

Sequential: one word per step. Slow on long texts. Struggles to capture long-range dependencies ("The cat, which sat on the mat, was..."). Cannot be parallelized → slow training.

⚡ Transformer (2017)

Parallel: ALL words are processed at once. Self-Attention connects EVERY word to every other. Training is massively parallel → orders of magnitude faster. The foundation of BERT, GPT, and Claude.


2. Self-Attention – "Who Should I Pay Attention To?"

Each word asks: "Which words in this sentence are relevant for understanding me?"

🎯 Self-Attention: "The cat sat on the mat"

Attention weights from the word "sat": The 0.05 · cat 0.42 · sat 0.15 · on 0.08 · the 0.03 · mat 0.27
"sat" attends most to "cat" (0.42) and "mat" (0.27) → the model understands "who sat?" (cat) and "sat where?" (mat).

Q (Query) = "What am I looking for?" → q = x @ W_q ("sat" asks: who relates to me?)
K (Key) = "This is my identity" → k = x @ W_k ("cat" says: I am the subject of the sentence)
V (Value) = "This is my information" → v = x @ W_v (the "cat" embedding, e.g. [0.2, -0.1, ...])

🎓 Formula: Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

Q·Kᵀ measures how well every pair of words "matches" (a similarity score). Dividing by √d_k keeps the values numerically stable. Softmax turns the scores into probabilities (each row sums to 1). Multiplying by V yields the final output: a weighted average of all words, weighted by relevance.

27_self_attention.py – From Scratch
import torch
import torch.nn as nn
import math

class SelfAttention(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        self.d_model = d_model
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)

    def forward(self, x):
        # x: [batch, seq_len, d_model]
        Q = self.W_q(x)  # What am I looking for?
        K = self.W_k(x)  # What do I contain?
        V = self.W_v(x)  # What info do I have?

        # Attention scores: Q · K^T / sqrt(d_k)
        scores = torch.matmul(Q, K.transpose(-2, -1))
        scores = scores / math.sqrt(self.d_model)

        # Softmax → attention weights (probabilities)
        attn_weights = torch.softmax(scores, dim=-1)

        # Weighted sum of values
        output = torch.matmul(attn_weights, V)
        return output
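A quick standalone sanity check of the formula, with random matrices standing in for the learned W_q, W_k, W_v (illustrative values only, not trained weights):

```python
import math
import torch

torch.manual_seed(0)
d_model = 8
x = torch.randn(1, 6, d_model)   # batch=1, seq_len=6: "The cat sat on the mat"

# Random projections standing in for the learned W_q, W_k, W_v
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))

Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.transpose(-2, -1) / math.sqrt(d_model)
attn = torch.softmax(scores, dim=-1)   # each row is a probability distribution
out = attn @ V

print(attn.sum(dim=-1))   # every row sums to 1
print(out.shape)          # torch.Size([1, 6, 8]): same shape as the input
```

With trained weights, those per-row probabilities are exactly the 0.42 / 0.27-style scores visualized above.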
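3. Positional Encoding – Giving Order to Parallelism

Because self-attention processes all words in parallel, it is blind to word order: without extra information, "cat bites dog" and "dog bites cat" would produce identical representations. The paper's fix is to add a fixed sin/cos signal to each embedding: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal sketch (the class name and max_len parameter are illustrative, not from this series' files):

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding from "Attention is All You Need"."""
    def __init__(self, d_model=512, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        # Geometric progression of frequencies: 10000^(-2i/d_model)
        div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                             * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dims: sin
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dims: cos
        self.register_buffer("pe", pe.unsqueeze(0))   # [1, max_len, d_model]

    def forward(self, x):
        # x: [batch, seq_len, d_model]; adds position info, no learned params
        return x + self.pe[:, :x.size(1)]
```

Because the encoding is added rather than concatenated, the model can learn to treat the position signal and the word content as different subspaces of the same vector.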

4. Multi-Head Attention

Eight attention "heads" look at different aspects in parallel
28_multihead_attention.py
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.n_heads = n_heads
        self.d_k = d_model // n_heads  # 512/8 = 64 per head
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, L, D = x.shape
        # Split into heads: [B, L, D] → [B, heads, L, d_k]
        Q = self.W_q(x).view(B, L, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(B, L, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(B, L, self.n_heads, self.d_k).transpose(1, 2)

        # Scaled dot-product attention per head
        scores = (Q @ K.transpose(-2, -1)) / math.sqrt(self.d_k)
        attn = torch.softmax(scores, dim=-1)
        out = attn @ V  # [B, heads, L, d_k]

        # Concat heads & project
        out = out.transpose(1, 2).contiguous().view(B, L, D)
        return self.W_o(out)
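In practice you rarely hand-roll this layer: PyTorch ships it as nn.MultiheadAttention. A quick shape check (batch_first=True makes it accept [batch, seq_len, d_model] tensors like the code above):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 512)          # [batch, seq_len, d_model]
out, attn_weights = mha(x, x, x)     # self-attention: query = key = value = x

print(out.shape)           # torch.Size([2, 10, 512]): same as input
print(attn_weights.shape)  # torch.Size([2, 10, 10]): averaged over the 8 heads
```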
๐Ÿ—๏ธ

5. Full Transformer Block

Multi-Head Attention + Feed-Forward + LayerNorm + Residual
29_transformer_block.py
class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Sub-layer 1 (pre-LN): Norm → Multi-Head Attention → Residual
        attn_out = self.attn(self.norm1(x))
        x = x + self.dropout(attn_out)  # Residual!

        # Sub-layer 2 (pre-LN): Norm → Feed-Forward → Residual
        ff_out = self.ff(self.norm2(x))
        x = x + self.dropout(ff_out)  # Residual!
        return x

# Stack 6 blocks = Transformer Encoder
encoder = nn.Sequential(*[TransformerBlock() for _ in range(6)])
# Total params: ~19M for the 6 blocks (embeddings not included)

6. BERT vs GPT – Two "Descendants" of the Transformer

Encoder-only vs Decoder-only: two different paradigms
Aspect | BERT (Encoder) | GPT (Decoder)
Architecture | Encoder-only (bidirectional) | Decoder-only (autoregressive)
Attention | Sees ALL words (left + right) | Sees only PREVIOUS words
Training | Masked Language Model (fill in the masked words) | Next Token Prediction (predict the next word)
Best for | Classification, NER, QA, understanding | Text generation, chat, coding
Examples | BERT, RoBERTa, DeBERTa | GPT-4, Claude, LLaMA, Gemini
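GPT's "sees only previous words" behavior comes from a causal (look-ahead) mask applied to the attention scores before softmax; a minimal sketch:

```python
import torch

seq_len = 5
# Lower-triangular mask: position i may attend only to positions <= i
mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
print(mask.long())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]])

# Masked positions become -inf, so softmax assigns them weight exactly 0
scores = torch.randn(seq_len, seq_len)
scores = scores.masked_fill(~mask, float("-inf"))
attn = torch.softmax(scores, dim=-1)   # upper triangle is all zeros
```

BERT simply omits this mask, so every position attends both left and right.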

7. Part 8 Summary

Transformer fundamentals
Concept | What It Is | Key Code
Self-Attention | Every word "attends to" every other word | softmax(QKᵀ/√d) · V
Q, K, V | Query, Key, Value projections | W_q, W_k, W_v = nn.Linear
Multi-Head | Parallel attention from different perspectives | 8 heads × 64 dims = 512
Positional Encoding | Injects word-order information | Sin/cos functions
LayerNorm + Residual | Stabilizes training, helps gradients flow | x = x + dropout(sublayer(norm(x)))
Feed-Forward | Position-wise non-linear processing | Linear → GELU → Linear
Tech Review Desk – PyTorch Learning Series
Sources: "Attention is All You Need" (Vaswani et al. 2017), pytorch.org, The Illustrated Transformer (Jay Alammar).
📧 rominur@gmail.com  •  ✈️ t.me/Jekardah_AI