📝 This article is available in English & Bahasa Indonesia

🤖 Learn TensorFlow — Page 6

Transformer & BERT
in TensorFlow

The modern architecture that revolutionized NLP. Page 6 covers in depth: LSTM vs Transformer comparison, MultiHeadAttention layer (built-in Keras), building TransformerBlock from building blocks, Positional Encoding (sinusoidal & learned), Transformer text classifier from scratch, fine-tuning BERT from TF Hub and Hugging Face, BERT vs GPT comparison (encoder vs decoder), and subword tokenization (WordPiece, BPE, SentencePiece).

📅 March 2026 · 35 min read
🏷 Transformer · BERT · Attention · TF Hub · Hugging Face · GPT · WordPiece
📚 Learn TensorFlow Series:

📑 Table of Contents — Page 6

  1. LSTM vs Transformer — Sequential vs parallel
  2. MultiHeadAttention — Built-in Keras layer
  3. TransformerBlock — Attn + FFN + Norm + Residual
  4. Positional Encoding — Sinusoidal & learned
  5. Project: Transformer Text Classifier
  6. BERT Fine-Tuning — TF Hub & Hugging Face
  7. BERT vs GPT — Encoder vs Decoder
  8. Tokenizer — WordPiece, BPE, SentencePiece
  9. Summary & Page 7 Preview
🔄

1. Review — From LSTM to Transformer

LSTM processes words one-by-one (sequential). Transformer processes ALL words SIMULTANEOUSLY (parallel).

In Neural Network series Page 8, we discussed Transformer from scratch: scaled dot-product attention, multi-head attention, positional encoding, and TransformerBlock. In Page 5, we used LSTM/GRU for NLP. Now we implement Transformer in Keras — and fine-tune BERT for 95%+ accuracy.

LSTM vs Transformer — Why Transformer Wins

LSTM (sequential):

  "I" → "love" → "this" → "movie" → "very" → "much"
   ↓      ↓        ↓        ↓         ↓        ↓
   h₁ →   h₂  →    h₃  →    h₄   →   h₅   →   h₆  → output

  Problem: "I" information fades by the time we reach "much"
  Speed: O(n) — must run sequentially, CANNOT be parallelized

Transformer (parallel):

  "I"  "love"  "this"  "movie"  "very"  "much"
   ↕      ↕       ↕       ↕        ↕       ↕     ← ALL attend to ALL!
  [=============== Self-Attention ================]

  Every word can directly "see" every other word — no distance decay!
  Speed: O(1) sequential steps per layer — fully parallelizable on GPU!

Result: Transformer = faster training + better long-range understanding
        BERT, GPT, T5, LLaMA → ALL built on the Transformer architecture
Aspect         | LSTM/GRU                         | Transformer
Processing     | Sequential (one by one)          | Parallel (all at once)
Long-range     | Weak (vanishing gradient)        | Strong (direct attention)
Training Speed | Slow (sequential)                | Fast (GPU parallel)
Model Size     | Small-medium (1-50M)             | Large (110M-175B)
Best For       | Small datasets, simple tasks     | Large datasets, complex NLP
Examples       | Sentiment, simple classification | BERT, GPT, ChatGPT, LLaMA
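The contrast above can be made concrete in a few lines of NumPy (a toy sketch with random vectors, not a trained model): the recurrent loop must run once per word because each step needs the previous hidden state, while the attention "step" is a single matrix product over all word pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 8                     # 6 words, 8-dim vectors (toy sizes)
x = rng.normal(size=(seq_len, d))

# LSTM-style: step t NEEDS h from step t-1, so it runs 6 times in a row
W = rng.normal(size=(d, d)) * 0.1     # hypothetical recurrent weights
h = np.zeros(d)
for t in range(seq_len):
    h = np.tanh(x[t] + W @ h)         # 6 dependent, sequential steps

# Attention-style: one matrix product scores EVERY word pair at once
scores = x @ x.T / np.sqrt(d)                                     # (6, 6)
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
context = weights @ x                 # (6, 8), computed in one parallel step

print(h.shape, context.shape)         # (8,) (6, 8)
```

Both paths produce a representation per word, but only the attention path is one big parallel operation, which is exactly what GPUs are good at.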
👁️

2. MultiHeadAttention — Built-in Keras Layer

In NN series we implemented manually. In Keras: one line.
39_multihead_attention.py — Self-Attention in Keras
import tensorflow as tf
from tensorflow.keras import layers

# ===========================
# 1. MultiHeadAttention — one line!
# ===========================
mha = layers.MultiHeadAttention(
    num_heads=8,         # 8 attention heads (parallel)
    key_dim=64,          # dimension per head
    dropout=0.1,         # attention dropout
)
# Total attention dimension = num_heads × key_dim = 8 × 64 = 512

# Self-attention: Q, K, V all come from same input
x = tf.random.normal([2, 200, 128])  # (batch, seq_len, d_model)
attn_output = mha(query=x, value=x, key=x)
print(attn_output.shape)  # (2, 200, 128) — same shape as input!

# Cross-attention: Q from one source, K/V from another
# Used in encoder-decoder (translation, etc.)
# attn_output = mha(query=decoder_input, value=encoder_output)

# ===========================
# 2. Attention weights — where does it look?
# ===========================
attn_output, attn_weights = mha(query=x, value=x, key=x,
                                 return_attention_scores=True)
print(attn_weights.shape)  # (2, 8, 200, 200)
# For each head: 200×200 attention matrix
# attn_weights[0, 0, 5, :] = how much word 5 attends to all other words

# ===========================
# 3. Recall from NN Series Page 8:
# Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) × V
# Multi-head = concat(head_1, ..., head_h) × W_o
# Each head learns DIFFERENT attention patterns:
#   - Head 1: syntactic relationships (subject-verb)
#   - Head 2: semantic similarity (synonyms)
#   - Head 3: position-based (nearby words)
#   - etc.
# ===========================
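The formula recalled in the comments above can be checked by hand. Below is a small NumPy sketch of multi-head attention with hypothetical random projection weights (nothing is trained): it applies Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V per head, then concatenates the heads and projects with W_o.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)   # QK^T / sqrt(d_k)
    weights = softmax(scores)                        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(42)
seq_len, d_model, num_heads = 5, 16, 4
d_k = d_model // num_heads
x = rng.normal(size=(seq_len, d_model))

# One set of Q/K/V projections per head (random, illustrative only)
heads = []
for _ in range(num_heads):
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    out, w = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
    heads.append(out)

# concat(head_1, ..., head_h) × W_o
W_o = rng.normal(size=(d_model, d_model))
multi_head = np.concatenate(heads, axis=-1) @ W_o
print(multi_head.shape)   # (5, 16), same seq_len and d_model as the input
```

This mirrors what `layers.MultiHeadAttention` does internally, minus dropout and learned weights.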
🏗️

3. Transformer Block — Building from Keras Layers

Attention + FeedForward + LayerNorm + Residual = one Transformer block
40_transformer_block.py — Transformer Encoder Block
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# ===========================
# Transformer Encoder Block
# Same architecture as NN Series Page 8!
# ===========================
class TransformerBlock(layers.Layer):
    """One Transformer encoder block.
    Identical to what we built in NN series, but using Keras layers.
    """

    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.d_model = d_model

        # Multi-Head Self-Attention
        self.mha = layers.MultiHeadAttention(
            num_heads=num_heads,
            key_dim=d_model // num_heads,
            dropout=dropout
        )

        # Feed-Forward Network (2 Dense layers)
        self.ffn = keras.Sequential([
            layers.Dense(d_ff, activation='relu'),   # expand
            layers.Dense(d_model),                   # project back
        ])

        # Layer Normalization (NOT BatchNorm!)
        self.norm1 = layers.LayerNormalization(epsilon=1e-6)
        self.norm2 = layers.LayerNormalization(epsilon=1e-6)

        # Dropout
        self.dropout1 = layers.Dropout(dropout)
        self.dropout2 = layers.Dropout(dropout)

    def call(self, x, training=False):
        # Sub-layer 1: Multi-Head Self-Attention + Residual + Norm
        attn_output = self.mha(x, x)                          # self-attention
        attn_output = self.dropout1(attn_output, training=training)
        x = self.norm1(x + attn_output)                       # residual + norm

        # Sub-layer 2: Feed-Forward + Residual + Norm
        ffn_output = self.ffn(x)
        ffn_output = self.dropout2(ffn_output, training=training)
        x = self.norm2(x + ffn_output)                        # residual + norm

        return x

# Test
block = TransformerBlock(d_model=128, num_heads=8, d_ff=256)
x = tf.random.normal([2, 200, 128])
out = block(x)
print(out.shape)  # (2, 200, 128) — same shape! (residual connection)
Transformer Block Architecture (same as NN Series Page 8)

  Input ──┬──► Multi-Head Self-Attention (8 heads)
          │                  │
          │               Dropout
          │                  │
          └──────► Add ◄─────┘            ← Residual Connection
                    │
                LayerNorm
                    │
                    ├──► FeedForward:
                    │      Dense(d_ff)     ← expand (e.g., 128 → 512)
                    │      ReLU
                    │      Dense(d_model)  ← project back (512 → 128)
                    │              │
                    │           Dropout
                    │              │
                    └──────► Add ◄─┘      ← Residual Connection
                              │
                          LayerNorm
                              │
                           Output ← same shape as input! Can stack N blocks.
📍

4. Positional Encoding — Adding Position Information

Attention doesn't know word order — positional encoding adds position info
41_positional_encoding.py — Sinusoidal & Learned PE
import tensorflow as tf
from tensorflow.keras import layers
import numpy as np

# ===========================
# Method 1: Sinusoidal PE (original "Attention Is All You Need")
# ===========================
class SinusoidalPositionalEncoding(layers.Layer):
    def __init__(self, max_len, d_model):
        super().__init__()
        # Pre-compute positional encodings
        positions = np.arange(max_len)[:, np.newaxis]      # (max_len, 1)
        dims = np.arange(d_model)[np.newaxis, :]           # (1, d_model)
        angles = positions / np.power(10000, (2 * (dims // 2)) / d_model)

        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions: sin
        pe[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions: cos
        self.pe = tf.constant(pe[np.newaxis, :, :], dtype=tf.float32)

    def call(self, x):
        return x + self.pe[:, :tf.shape(x)[1], :]  # add PE to embeddings

# ===========================
# Method 2: Learned PE (simpler, used in BERT)
# ===========================
class LearnedPositionalEncoding(layers.Layer):
    def __init__(self, max_len, d_model):
        super().__init__()
        # Just another Embedding layer! Position → vector
        self.pos_embedding = layers.Embedding(max_len, d_model)

    def call(self, x):
        positions = tf.range(tf.shape(x)[1])
        return x + self.pos_embedding(positions)

# In practice, learned PE works just as well as sinusoidal
# BERT uses learned PE, original Transformer uses sinusoidal
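As a sanity check, the sinusoidal table can be rebuilt in plain NumPy (the constants mirror the class above) and its key properties verified: every value stays in [-1, 1], so the PE never overwhelms the token embedding, and every position gets a unique vector.

```python
import numpy as np

max_len, d_model = 50, 32

# Same computation as SinusoidalPositionalEncoding above, minus the layer
positions = np.arange(max_len)[:, np.newaxis]        # (max_len, 1)
dims = np.arange(d_model)[np.newaxis, :]             # (1, d_model)
angles = positions / np.power(10000.0, (2 * (dims // 2)) / d_model)

pe = np.zeros((max_len, d_model))
pe[:, 0::2] = np.sin(angles[:, 0::2])   # even dims: sin
pe[:, 1::2] = np.cos(angles[:, 1::2])   # odd dims: cos

# Bounded values: sin/cos never leave [-1, 1]
assert pe.min() >= -1.0 and pe.max() <= 1.0
# Unique vector per position: that's the whole point of the encoding
assert len({tuple(np.round(row, 6)) for row in pe}) == max_len

# Position 0 follows directly from sin(0)=0, cos(0)=1
print(pe[0, :4])   # [0. 1. 0. 1.]
```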
🎯

5. Project: Transformer Text Classifier from Scratch

Combine all components: Embedding + PE + TransformerBlock + Classifier
42_transformer_classifier.py — Complete Transformer 🔥
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# ===========================
# Complete Transformer Text Classifier
# ===========================
VOCAB_SIZE = 10000
MAX_LEN = 200
D_MODEL = 128
NUM_HEADS = 4
D_FF = 256
NUM_BLOCKS = 2
DROPOUT = 0.1

# Token + Position Embedding
class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, max_len, vocab_size, d_model):
        super().__init__()
        self.token_emb = layers.Embedding(vocab_size, d_model)
        self.pos_emb = layers.Embedding(max_len, d_model)

    def call(self, x):
        positions = tf.range(tf.shape(x)[-1])
        return self.token_emb(x) + self.pos_emb(positions)

# Build model with Functional API
inputs = keras.Input(shape=(MAX_LEN,))

# Embedding + Positional Encoding
x = TokenAndPositionEmbedding(MAX_LEN, VOCAB_SIZE, D_MODEL)(inputs)
x = layers.Dropout(DROPOUT)(x)

# Stack Transformer blocks (TransformerBlock class from 40_transformer_block.py)
for _ in range(NUM_BLOCKS):
    x = TransformerBlock(D_MODEL, NUM_HEADS, D_FF, DROPOUT)(x)

# Pool + Classify
x = layers.GlobalAveragePooling1D()(x)  # (batch, seq, dim) → (batch, dim)
x = layers.Dropout(0.3)(x)
x = layers.Dense(64, activation='relu')(x)
x = layers.Dropout(0.2)(x)
outputs = layers.Dense(1, activation='sigmoid')(x)

model = keras.Model(inputs=inputs, outputs=outputs, name="transformer_classifier")
model.summary()

# Train on IMDB
(X_train, y_train), (X_test, y_test) = keras.datasets.imdb.load_data(num_words=VOCAB_SIZE)
X_train = keras.utils.pad_sequences(X_train, maxlen=MAX_LEN)
X_test = keras.utils.pad_sequences(X_test, maxlen=MAX_LEN)

model.compile(optimizer=keras.optimizers.Adam(1e-4),
              loss='binary_crossentropy', metrics=['accuracy'])

model.fit(X_train, y_train, epochs=10, batch_size=64,
          validation_data=(X_test, y_test),
          callbacks=[keras.callbacks.EarlyStopping(patience=3,
                     restore_best_weights=True)])

test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"Transformer Accuracy: {test_acc:.1%}")
# ~88% — slightly better than BiLSTM for this dataset size!
# The real power of Transformer shows with LARGE models (BERT, GPT)
🤖

6. BERT Fine-Tuning — 95%+ Accuracy

Pre-trained Transformer that revolutionized NLP — fine-tune for any task

BERT (Bidirectional Encoder Representations from Transformers) is a Transformer model pre-trained on Wikipedia + BookCorpus (3.3 billion words) using two tasks: Masked Language Modeling and Next Sentence Prediction. Result: very rich text representations that can be fine-tuned for classification, QA, NER, and almost any NLP task.

43_bert_finetuning.py — BERT with TF Hub
# ===========================
# Method 1: TF Hub BERT
# ===========================
# import tensorflow_hub as hub
# import tensorflow_text  # needed for BERT preprocessing
# 
# # BERT preprocessor + encoder
# preprocess_url = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
# encoder_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"
# 
# # Build model
# text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
# preprocessor = hub.KerasLayer(preprocess_url)
# encoder_inputs = preprocessor(text_input)
# encoder = hub.KerasLayer(encoder_url, trainable=True)
# outputs = encoder(encoder_inputs)
# 
# # Use [CLS] token output for classification
# pooled = outputs["pooled_output"]  # (batch, 768)
# x = tf.keras.layers.Dropout(0.3)(pooled)
# x = tf.keras.layers.Dense(64, activation='relu')(x)
# predictions = tf.keras.layers.Dense(1, activation='sigmoid')(x)
# 
# model = tf.keras.Model(text_input, predictions)
# model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
#               loss='binary_crossentropy', metrics=['accuracy'])
# 
# # Train with RAW TEXT!
# model.fit(train_texts, train_labels, epochs=3, batch_size=32)
# # → 95%+ accuracy on IMDB! (vs 87% with BiLSTM)

# ===========================
# Method 2: Hugging Face Transformers + TF
# ===========================
# from transformers import TFBertModel, BertTokenizer
# 
# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# bert = TFBertModel.from_pretrained('bert-base-uncased')
# 
# # Tokenize
# inputs = tokenizer(texts, padding=True, truncation=True,
#                    max_length=128, return_tensors='tf')
# # inputs['input_ids']:      token indices
# # inputs['attention_mask']:  1=real, 0=padding
# 
# # Get BERT output
# outputs = bert(inputs)
# cls_output = outputs.last_hidden_state[:, 0, :]  # [CLS] token
# # Shape: (batch, 768)
# 
# # Add classifier head and fine-tune
# # Same 2-phase approach as CNN transfer learning (Page 3):
# # Phase 1: Freeze BERT, train head (LR=1e-3)
# # Phase 2: Unfreeze BERT, fine-tune all (LR=2e-5)

🤖 BERT vs BiLSTM — IMDB Accuracy Comparison:
• Simple LSTM: 84%
• BiLSTM: 87%
• Custom Transformer (our code): 88%
• BERT fine-tuned: 95%+ ← Game changer!

Why is BERT so much better? BERT already "read" 3.3 billion words before seeing your data. It already understands grammar, semantics, and language nuances. Fine-tuning only needs to teach the specific task — not language from scratch.
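The 2-phase recipe from the comments above can be sketched without downloading BERT. Here a tiny Dense stack is a stand-in for the encoder (the layer sizes and names are invented for illustration); the pattern of freezing, compiling, unfreezing, and recompiling with a much smaller learning rate is the same one you would apply to the real `hub.KerasLayer` or `TFBertModel`.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Stand-in "backbone": in real fine-tuning this is the pre-trained BERT encoder
backbone = keras.Sequential([
    layers.Dense(32, activation='relu'),
    layers.Dense(32, activation='relu'),
], name="backbone")

inputs = keras.Input(shape=(16,))
features = backbone(inputs)
outputs = layers.Dense(1, activation='sigmoid', name="head")(features)
model = keras.Model(inputs, outputs)

# Phase 1: freeze the backbone, train ONLY the head with a normal LR
backbone.trainable = False
model.compile(optimizer=keras.optimizers.Adam(1e-3), loss='binary_crossentropy')
head_only = len(model.trainable_weights)   # just the head's kernel + bias

# Phase 2: unfreeze, fine-tune EVERYTHING with a tiny LR (2e-5 for real BERT)
backbone.trainable = True                  # must recompile after changing this!
model.compile(optimizer=keras.optimizers.Adam(2e-5), loss='binary_crossentropy')
all_weights = len(model.trainable_weights)

print(head_only, all_weights)   # 2 6 — the frozen phase trains far fewer weights
```

The tiny phase-2 learning rate is the important part: with a large LR, the first noisy gradients from your small dataset would destroy the pre-trained representations.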

⚖️

7. BERT vs GPT — Encoder vs Decoder

Two Transformer approaches: bidirectional understanding vs autoregressive generation
Aspect        | BERT (Encoder)                          | GPT (Decoder)
Attention     | Bidirectional (sees all words)          | Causal/left-only (sees previous words)
Pre-training  | Masked LM: predict the masked-out words | Next token: predict the next word
Best For      | Understanding: classification, NER, QA  | Generation: text completion, chat, code
Famous Models | BERT, RoBERTa, ALBERT, DeBERTa          | GPT-2/3/4, LLaMA, Gemma
Fine-tuning   | Add classifier head + fine-tune         | Prompt engineering / instruction tuning
Parameters    | 110M (base) - 340M (large)              | 117M (GPT-2) - 1.76T (GPT-4)
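The "bidirectional vs causal" row of the table is easy to demonstrate. In this minimal NumPy sketch (random scores, no learned weights), masking the upper triangle of the score matrix before the softmax is exactly what turns encoder-style attention into GPT-style causal attention.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

seq_len = 5
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))   # raw attention scores (toy)

# BERT-style (encoder): softmax over ALL positions, bidirectional
bert_weights = softmax(scores)

# GPT-style (decoder): mask FUTURE positions before the softmax
causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)
masked = np.where(causal_mask, -1e9, scores)   # future positions → -inf
gpt_weights = softmax(masked)

print(np.round(gpt_weights, 2))
# Row t has non-zero weight only on positions <= t;
# the first token can only attend to itself.
```

The same idea is available in Keras via `mha(..., use_causal_mask=True)` (TF 2.10+), which is how a decoder is trained in parallel without leaking future tokens.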
✂️

8. Tokenizer — WordPiece, BPE, and SentencePiece

Breaking text into subwords — the key to handling unknown words
44_tokenizers.py — Subword Tokenization
# ===========================
# Word-level vs Subword tokenization
# ===========================

# Word-level: "unbelievable" → ["unbelievable"]
# Problem: if "unbelievable" not in vocab → [UNK] (lost!)

# Subword (WordPiece — used by BERT):
# "unbelievable" → ["un", "##believ", "##able"]
# All subwords are in vocab! Handles ANY word!

# BPE (Byte Pair Encoding — used by GPT):
# "lower" → ["low", "er"]
# "lowest" → ["low", "est"]
# Learns common character pairs from data

# SentencePiece (used by T5, LLaMA):
# Language-agnostic, treats input as raw bytes
# Works for ANY language without word boundaries

# ===========================
# Using Hugging Face tokenizer
# ===========================
# from transformers import BertTokenizer
# 
# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# tokens = tokenizer.tokenize("I love TensorFlow programming")
# print(tokens)
# ['i', 'love', 'tensor', '##flow', 'programming']
#  ↑ "TensorFlow" split into "tensor" + "##flow"
# 
# encoded = tokenizer("I love TensorFlow", return_tensors="tf",
#                     padding=True, truncation=True, max_length=128)
# print(encoded['input_ids'])
# [[101, 1045, 2293, 23435, 12314, 102, 0, 0, ...]]
# 101=[CLS], 102=[SEP], 0=[PAD]

# ===========================
# Vocab sizes for popular models
# ===========================
# BERT:   30,522 tokens (WordPiece)
# GPT-2:  50,257 tokens (BPE)
# T5:     32,000 tokens (SentencePiece)
# LLaMA:  32,000 tokens (SentencePiece)
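To make the BPE idea concrete, here is a toy merge-learning loop in pure Python. It is a sketch only: real GPT-2 BPE operates on bytes and has extra tie-breaking rules, and the word list below is invented to match the "lower"/"lowest" example above.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Start from characters, with an end-of-word marker like real BPE
    corpus = Counter(tuple(w) + ('</w>',) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the winning pair fused into one symbol
        new_corpus = Counter()
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1]); i += 2
                else:
                    out.append(word[i]); i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges, corpus

# "low" / "lower" / "lowest" share a stem; BPE discovers it purely from counts
words = ["low"] * 5 + ["lower"] * 2 + ["lowest"] * 2
merges, corpus = learn_bpe_merges(words, num_merges=2)
print(merges)   # [('l', 'o'), ('lo', 'w')]  → "low" fuses into one token
```

After two merges, "lowest" would tokenize as ["low", "e", "s", "t", "</w>"]; with more merges (and more data), common suffixes like "est" fuse too.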
📝

9. Page 6 Summary

Everything we learned
Concept             | What It Is                              | Key Code
MultiHeadAttention  | Built-in Keras self-attention           | MultiHeadAttention(num_heads=8, key_dim=64)
TransformerBlock    | Attn + FFN + Norm + Residual            | class TransformerBlock(layers.Layer)
Positional Encoding | Word position info (sin/cos or learned) | Embedding(max_len, d_model)
BERT                | Pre-trained bidirectional Transformer   | hub.KerasLayer(bert_url)
BERT Fine-tuning    | 2-phase: head → unfreeze                | trainable=True, lr=2e-5
BERT vs GPT         | Understanding vs Generation             | Encoder vs Decoder
WordPiece/BPE       | Subword tokenization                    | BertTokenizer.from_pretrained()
← Previous Page

Page 5 — NLP with TensorFlow

📘

Coming Next: Page 7 — Custom Training & Advanced Keras

Going beyond model.fit(): custom training loops with GradientTape (full control!), custom loss functions (Focal Loss, Contrastive Loss), custom metrics (F1 Score, custom AUC), Model subclassing for research architectures, multi-GPU training with tf.distribute.MirroredStrategy, and mixed precision for even faster training.