๐Ÿ“ Artikel ini ditulis dalam Bahasa Indonesia & English
๐Ÿ“ This article is available in English & Bahasa Indonesia

⚡ Neural Network Tutorial — Page 8

Transformer &
Attention Mechanism



The revolutionary architecture behind GPT, BERT, and all modern LLMs. Page 8 covers: Self-Attention, Scaled Dot-Product Attention, Multi-Head Attention, Positional Encoding, and building a Transformer block from scratch — understanding the engine behind modern AI.

📅 March 2026 · ⏱ 32 min read
๐Ÿท TransformerSelf-AttentionMulti-HeadPositional EncodingGPTBERT
📚 Neural Network Tutorial Series:

📑 Table of Contents — Page 8

  1. Why Transformer? — RNN limitations solved
  2. Self-Attention — Every word "looks at" all other words
  3. Scaled Dot-Product Attention — Q, K, V from scratch
  4. Multi-Head Attention — Parallel attention from multiple perspectives
  5. Positional Encoding — Telling the model word position
  6. Transformer Block — Attention + FFN + LayerNorm + Residual
  7. GPT vs BERT — Decoder-only vs Encoder-only
  8. Summary & Page 9 Preview
🤔

1. Why Transformer? — "Attention Is All You Need"

RNN processes sequentially (slow), Transformer processes in parallel (fast!)


RNNs/LSTMs have two problems: (1) they process words one at a time — no parallelism, so they are slow on GPUs; (2) information from early words "fades" by the end of the sequence. The Transformer solves both with Self-Attention — every word can directly "see" every other word, simultaneously.

RNN vs Transformer:
  RNN (sequential):       w₁ → w₂ → w₃ → w₄   (one at a time) — O(n) sequential steps, weak long-range modeling, low GPU utilization
  Transformer (parallel): w₁ ⟷ w₂ ⟷ w₃ ⟷ w₄   (all at once!) — O(1) parallel steps, strong long-range modeling, maximum GPU utilization
๐Ÿ‘๏ธ

2. Self-Attention — Scaled Dot-Product

Query, Key, Value — an elegant "search" mechanism

Each word produces three vectors: Query (what it is looking for), Key (what it offers), and Value (the information it provides). The attention score measures how well a Query matches a Key; the output is the weighted sum of the Values.

34_self_attention.py — Self-Attention from Scratch (Python)
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q, K, V: (batch, seq_len, d_k)
    Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) · V
    """
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (B, S, S)

    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)  # causal mask

    weights = softmax(scores)        # attention weights
    output = weights @ V             # weighted sum of values
    return output, weights

# Demo: 1 batch, 4 words, 8-dim embeddings
seq_len, d_model = 4, 8
X = np.random.randn(1, seq_len, d_model)

# Project to Q, K, V
W_q = np.random.randn(d_model, d_model) * 0.1
W_k = np.random.randn(d_model, d_model) * 0.1
W_v = np.random.randn(d_model, d_model) * 0.1

Q, K, V = X @ W_q, X @ W_k, X @ W_v
out, attn_weights = scaled_dot_product_attention(Q, K, V)
print(f"Output shape: {out.shape}")      # (1, 4, 8)
print(f"Attn weights:\n{attn_weights[0].round(3)}")
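The `mask` argument above is what turns this into causal (GPT-style) attention. A minimal sketch, repeating the helper definitions from 34_self_attention.py so it runs on its own: `np.tril` builds a lower-triangular keep-mask (1 = attend, 0 = blocked), so each position can only attend to itself and earlier positions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)  # block masked positions
    weights = softmax(scores)
    return weights @ V, weights

np.random.seed(0)
seq_len, d_k = 4, 8
Q = np.random.randn(1, seq_len, d_k)
K = np.random.randn(1, seq_len, d_k)
V = np.random.randn(1, seq_len, d_k)

# Causal keep-mask: lower triangle of 1s, broadcast over the batch dim
causal_mask = np.tril(np.ones((seq_len, seq_len)))

_, w = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
# Each row still sums to 1, but weights on future positions are ~0,
# so the weight matrix comes out lower-triangular:
print(w[0].round(3))
```

Row 0 attends only to itself (weight 1.0), row 1 splits its weight over positions 0-1, and so on — exactly the pattern a decoder-only model like GPT uses during training.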
🔀

3. Multi-Head Attention & Transformer Block

Parallel attention from multiple perspectives + Feed-Forward + Residual
35_transformer_block.py — Transformer Block (Python)
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MultiHeadAttention:
    def __init__(self, d_model, num_heads):
        self.h = num_heads
        self.d_k = d_model // num_heads
        s = np.sqrt(2.0/d_model)
        self.W_q = np.random.randn(d_model, d_model) * s
        self.W_k = np.random.randn(d_model, d_model) * s
        self.W_v = np.random.randn(d_model, d_model) * s
        self.W_o = np.random.randn(d_model, d_model) * s

    def forward(self, X, mask=None):
        B, S, D = X.shape
        Q = (X @ self.W_q).reshape(B, S, self.h, self.d_k).transpose(0,2,1,3)
        K = (X @ self.W_k).reshape(B, S, self.h, self.d_k).transpose(0,2,1,3)
        V = (X @ self.W_v).reshape(B, S, self.h, self.d_k).transpose(0,2,1,3)
        scores = Q @ K.transpose(0,1,3,2) / np.sqrt(self.d_k)
        if mask is not None: scores = np.where(mask==0, -1e9, scores)
        w = softmax(scores)
        out = (w @ V).transpose(0,2,1,3).reshape(B, S, D)
        return out @ self.W_o

class TransformerBlock:
    def __init__(self, d_model, num_heads, d_ff):
        self.attn = MultiHeadAttention(d_model, num_heads)
        self.ff_W1 = np.random.randn(d_model, d_ff) * np.sqrt(2/d_model)
        self.ff_b1 = np.zeros((1,1,d_ff))
        self.ff_W2 = np.random.randn(d_ff, d_model) * np.sqrt(2/d_ff)
        self.ff_b2 = np.zeros((1,1,d_model))

    def layer_norm(self, x, eps=1e-6):
        mean = x.mean(-1, keepdims=True)
        std = x.std(-1, keepdims=True)
        return (x - mean) / (std + eps)

    def forward(self, x, mask=None):
        # Self-Attention + Residual + LayerNorm
        attn_out = self.attn.forward(x, mask)
        x = self.layer_norm(x + attn_out)
        # Feed-Forward + Residual + LayerNorm
        ff = np.maximum(0, x @ self.ff_W1 + self.ff_b1) @ self.ff_W2 + self.ff_b2
        x = self.layer_norm(x + ff)
        return x

# Demo
block = TransformerBlock(d_model=64, num_heads=4, d_ff=128)
X = np.random.randn(2, 10, 64)  # batch=2, seq=10, dim=64
out = block.forward(X)
print(f"Transformer block output: {out.shape}")  # (2, 10, 64)


🎓 Positional Encoding: Since Transformers have no built-in notion of order (unlike RNNs), we add position signals using sin/cos functions: PE(pos, 2i) = sin(pos/10000^(2i/d)) and PE(pos, 2i+1) = cos(pos/10000^(2i/d)). This tells the model "this word is at position 5".
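The sin/cos scheme in the note above can be sketched directly in NumPy: even dimensions get sin, odd dimensions get cos, so every position receives a unique, smoothly varying signature that is simply added to the word embeddings before the first block.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal PE: PE(pos, 2i) = sin(pos / 10000^(2i/d)),
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d))."""
    pos = np.arange(seq_len)[:, None]                # (S, 1)
    i = np.arange(0, d_model, 2)[None, :]            # even dims: (1, d/2)
    angles = pos / np.power(10000.0, i / d_model)    # (S, d/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dims -> sin
    pe[:, 1::2] = np.cos(angles)                     # odd dims  -> cos
    return pe

pe = positional_encoding(seq_len=10, d_model=64)
print(pe.shape)  # (10, 64)
# Usage: X = token_embeddings + pe  (broadcast over the batch dim)
```

At position 0 every sin entry is 0 and every cos entry is 1, and all values stay in [-1, 1], so the signal is the same scale as typical embeddings.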

📝

4. Page 8 Summary

What we learned
Concept              | What It Is                                       | Key Code
Self-Attention       | Every token attends to all others                | softmax(QK^T/√d)·V
Q, K, V              | Query, Key, Value — three projections of input   | X @ W_q, X @ W_k, X @ W_v
Multi-Head           | Parallel attention from H perspectives           | concat(head_1..head_h) @ W_o
Positional Encoding  | Word position signal (sin/cos)                   | sin(pos/10000^(2i/d))
Layer Norm           | Per-token normalization (not per-batch)          | (x - mean) / (std + ε)
Residual Connection  | Skip connection: x + sublayer(x)                 | x = x + attn(x)
GPT                  | Decoder-only Transformer (causal mask)           | mask = tril(ones)
BERT                 | Encoder-only Transformer (bidirectional)         | no causal mask
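The GPT and BERT rows differ only in the mask. A minimal sketch of the two mask shapes, following this page's convention that attention is kept where mask == 1 and blocked where mask == 0 (so the causal keep-mask is lower-triangular):

```python
import numpy as np

S = 4  # sequence length

# GPT (decoder-only): causal — position i may attend only to positions <= i
gpt_mask = np.tril(np.ones((S, S)))

# BERT (encoder-only): bidirectional — every position attends everywhere
bert_mask = np.ones((S, S))

print(gpt_mask)
```

Passing `gpt_mask` to the attention functions above yields the autoregressive behavior GPT needs for next-token prediction, while `bert_mask` (or no mask at all) gives BERT's full bidirectional context.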
โ† Page Sebelumnyaโ† Previous Page

Page 7 — Generative Adversarial Network (GAN)


📘

Coming Next: Page 9 — Transfer Learning & Fine-Tuning

Using giant pre-trained models (ResNet, BERT, GPT) and adapting them for your specific task. Feature extraction, fine-tuning strategies, and domain adaptation. Stay tuned!