📝 This article is available in English & Bahasa Indonesia

🤖 Learn TensorFlow — Page 6

Transformer & BERT
in TensorFlow

The modern architecture that revolutionized NLP. Page 6 covers in depth: LSTM vs Transformer comparison, MultiHeadAttention layer (built-in Keras), building TransformerBlock from building blocks, Positional Encoding (sinusoidal & learned), Transformer text classifier from scratch, fine-tuning BERT from TF Hub and Hugging Face, BERT vs GPT comparison (encoder vs decoder), and subword tokenization (WordPiece, BPE, SentencePiece).

📅 March 2026 · 35 min read
🏷 Transformer · BERT · Attention · TF Hub · Hugging Face · GPT · WordPiece
📚 Learn TensorFlow Series:

📑 Table of Contents — Page 6

  1. LSTM vs Transformer — Sequential vs parallel
  2. MultiHeadAttention — Built-in Keras layer
  3. TransformerBlock — Attn + FFN + Norm + Residual
  4. Positional Encoding — Sinusoidal & learned
  5. Project: Transformer Text Classifier
  6. BERT Fine-Tuning — TF Hub & Hugging Face
  7. BERT vs GPT — Encoder vs Decoder
  8. Tokenizer — WordPiece, BPE, SentencePiece
  9. Summary & Page 7 Preview
🔄

1. Review — From LSTM to Transformer

LSTM processes words one-by-one (sequential). Transformer processes ALL words SIMULTANEOUSLY (parallel).

In Neural Network series Page 8, we discussed Transformer from scratch: scaled dot-product attention, multi-head attention, positional encoding, and TransformerBlock. In Page 5, we used LSTM/GRU for NLP. Now we implement Transformer in Keras — and fine-tune BERT for 95%+ accuracy.

LSTM vs Transformer — Why Transformer Wins

LSTM (sequential):

  "I" → "love" → "this" → "movie" → "very" → "much"
   ↓      ↓        ↓        ↓         ↓        ↓
   h₁ →   h₂  →    h₃  →    h₄   →   h₅   →   h₆  → output

  Problem: "I" information fades by the time we reach "much"
  Speed: O(n) — must run sequentially, CANNOT be parallelized

Transformer (parallel):

  "I"  "love"  "this"  "movie"  "very"  "much"
   ↕      ↕       ↕       ↕        ↕       ↕     ← ALL attend to ALL!
  [=============== Self-Attention ================]

  Every word can directly "see" every other word — no distance decay!
  Speed: O(1) sequential steps per layer — fully parallelizable on GPU!

Result: Transformer = faster training + better long-range understanding
        BERT, GPT, T5, LLaMA → ALL built on the Transformer architecture
Aspect         | LSTM/GRU                         | Transformer
Processing     | Sequential (one by one)          | Parallel (all at once)
Long-range     | Weak (vanishing gradient)        | Strong (direct attention)
Training Speed | Slow (sequential)                | Fast (GPU parallel)
Model Size     | Small-medium (1-50M)             | Large (110M-175B)
Best For       | Small datasets, simple tasks     | Large datasets, complex NLP
Examples       | Sentiment, simple classification | BERT, GPT, ChatGPT, LLaMA
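The contrast above can be made concrete in a few lines of NumPy (a toy sketch with random vectors, not a trained model): the recurrent loop must run once per word because each step needs the previous hidden state, while the attention "step" is a single matrix product over all word pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 8                     # 6 words, 8-dim vectors (toy sizes)
x = rng.normal(size=(seq_len, d))

# LSTM-style: step t NEEDS h from step t-1, so it runs 6 times in a row
W = rng.normal(size=(d, d)) * 0.1     # hypothetical recurrent weights
h = np.zeros(d)
for t in range(seq_len):
    h = np.tanh(x[t] + W @ h)         # 6 dependent, sequential steps

# Attention-style: one matrix product scores EVERY word pair at once
scores = x @ x.T / np.sqrt(d)                                     # (6, 6)
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
context = weights @ x                 # (6, 8), computed in one parallel step

print(h.shape, context.shape)         # (8,) (6, 8)
```

Both paths produce a representation per word, but only the attention path is one big parallel operation, which is exactly what GPUs are good at.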
👁️

2. MultiHeadAttention — Built-in Keras Layer

In NN series we implemented manually. In Keras: one line.
39_multihead_attention.py — Self-Attention in Keras
import tensorflow as tf
from tensorflow.keras import layers

# ===========================
# 1. MultiHeadAttention — one line!
# ===========================
mha = layers.MultiHeadAttention(
    num_heads=8,         # 8 attention heads (parallel)
    key_dim=64,          # dimension per head
    dropout=0.1,         # attention dropout
)
# Total attention dimension = num_heads × key_dim = 8 × 64 = 512

# Self-attention: Q, K, V all come from same input
x = tf.random.normal([2, 200, 128])  # (batch, seq_len, d_model)
attn_output = mha(query=x, value=x, key=x)
print(attn_output.shape)  # (2, 200, 128) — same shape as input!

# Cross-attention: Q from one source, K/V from another
# Used in encoder-decoder (translation, etc.)
# attn_output = mha(query=decoder_input, value=encoder_output)

# ===========================
# 2. Attention weights — where does it look?
# ===========================
attn_output, attn_weights = mha(query=x, value=x, key=x,
                                 return_attention_scores=True)
print(attn_weights.shape)  # (2, 8, 200, 200)
# For each head: 200×200 attention matrix
# attn_weights[0, 0, 5, :] = how much word 5 attends to all other words

# ===========================
# 3. Recall from NN Series Page 8:
# Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) × V
# Multi-head = concat(head_1, ..., head_h) × W_o
# Each head learns DIFFERENT attention patterns:
#   - Head 1: syntactic relationships (subject-verb)
#   - Head 2: semantic similarity (synonyms)
#   - Head 3: position-based (nearby words)
#   - etc.
# ===========================
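The formula recalled in the comments above can be checked by hand. Below is a small NumPy sketch of multi-head attention with hypothetical random projection weights (nothing is trained): it applies Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V per head, then concatenates the heads and projects with W_o.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)   # QK^T / sqrt(d_k)
    weights = softmax(scores)                        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(42)
seq_len, d_model, num_heads = 5, 16, 4
d_k = d_model // num_heads
x = rng.normal(size=(seq_len, d_model))

# One set of Q/K/V projections per head (random, illustrative only)
heads = []
for _ in range(num_heads):
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    out, w = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
    heads.append(out)

# concat(head_1, ..., head_h) × W_o
W_o = rng.normal(size=(d_model, d_model))
multi_head = np.concatenate(heads, axis=-1) @ W_o
print(multi_head.shape)   # (5, 16), same seq_len and d_model as the input
```

This mirrors what `layers.MultiHeadAttention` does internally, minus dropout and learned weights.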
🏗️

3. Transformer Block — Building from Keras Layers

Attention + FeedForward + LayerNorm + Residual = one Transformer block
40_transformer_block.py — Transformer Encoder Block
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# ===========================
# Transformer Encoder Block
# Same architecture as NN Series Page 8!
# ===========================
class TransformerBlock(layers.Layer):
    """One Transformer encoder block.
    Identical to what we built in NN series, but using Keras layers.
    """

    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.d_model = d_model

        # Multi-Head Self-Attention
        self.mha = layers.MultiHeadAttention(
            num_heads=num_heads,
            key_dim=d_model // num_heads,
            dropout=dropout
        )

        # Feed-Forward Network (2 Dense layers)
        self.ffn = keras.Sequential([
            layers.Dense(d_ff, activation='relu'),   # expand
            layers.Dense(d_model),                   # project back
        ])

        # Layer Normalization (NOT BatchNorm!)
        self.norm1 = layers.LayerNormalization(epsilon=1e-6)
        self.norm2 = layers.LayerNormalization(epsilon=1e-6)

        # Dropout
        self.dropout1 = layers.Dropout(dropout)
        self.dropout2 = layers.Dropout(dropout)

    def call(self, x, training=False):
        # Sub-layer 1: Multi-Head Self-Attention + Residual + Norm
        attn_output = self.mha(x, x)                          # self-attention
        attn_output = self.dropout1(attn_output, training=training)
        x = self.norm1(x + attn_output)                       # residual + norm

        # Sub-layer 2: Feed-Forward + Residual + Norm
        ffn_output = self.ffn(x)
        ffn_output = self.dropout2(ffn_output, training=training)
        x = self.norm2(x + ffn_output)                        # residual + norm

        return x

# Test
block = TransformerBlock(d_model=128, num_heads=8, d_ff=256)
x = tf.random.normal([2, 200, 128])
out = block(x)
print(out.shape)  # (2, 200, 128) — same shape! (residual connection)
Transformer Block Architecture (same as NN Series Page 8)

  Input ──┬──► Multi-Head Self-Attention (8 heads)
          │                  │
          │               Dropout
          │                  │
          └──────► Add ◄─────┘            ← Residual Connection
                    │
                LayerNorm
                    │
                    ├──► FeedForward:
                    │      Dense(d_ff)     ← expand (e.g., 128 → 512)
                    │      ReLU
                    │      Dense(d_model)  ← project back (512 → 128)
                    │              │
                    │           Dropout
                    │              │
                    └──────► Add ◄─┘      ← Residual Connection
                              │
                          LayerNorm
                              │
                           Output ← same shape as input! Can stack N blocks.
📍

4. Positional Encoding — Adding Position Information

Attention doesn't know word order — positional encoding adds position info
41_positional_encoding.py — Sinusoidal & Learned PE
import tensorflow as tf
from tensorflow.keras import layers
import numpy as np

# ===========================
# Method 1: Sinusoidal PE (original "Attention Is All You Need")
# ===========================
class SinusoidalPositionalEncoding(layers.Layer):
    def __init__(self, max_len, d_model):
        super().__init__()
        # Pre-compute positional encodings
        positions = np.arange(max_len)[:, np.newaxis]      # (max_len, 1)
        dims = np.arange(d_model)[np.newaxis, :]           # (1, d_model)
        angles = positions / np.power(10000, (2 * (dims // 2)) / d_model)

        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions: sin
        pe[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions: cos
        self.pe = tf.constant(pe[np.newaxis, :, :], dtype=tf.float32)

    def call(self, x):
        return x + self.pe[:, :tf.shape(x)[1], :]  # add PE to embeddings

# ===========================
# Method 2: Learned PE (simpler, used in BERT)
# ===========================
class LearnedPositionalEncoding(layers.Layer):
    def __init__(self, max_len, d_model):
        super().__init__()
        # Just another Embedding layer! Position → vector
        self.pos_embedding = layers.Embedding(max_len, d_model)

    def call(self, x):
        positions = tf.range(tf.shape(x)[1])
        return x + self.pos_embedding(positions)

# In practice, learned PE works just as well as sinusoidal
# BERT uses learned PE, original Transformer uses sinusoidal
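As a sanity check, the sinusoidal table can be rebuilt in plain NumPy (the constants mirror the class above) and its key properties verified: every value stays in [-1, 1], so the PE never overwhelms the token embedding, and every position gets a unique vector.

```python
import numpy as np

max_len, d_model = 50, 32

# Same computation as SinusoidalPositionalEncoding above, minus the layer
positions = np.arange(max_len)[:, np.newaxis]        # (max_len, 1)
dims = np.arange(d_model)[np.newaxis, :]             # (1, d_model)
angles = positions / np.power(10000.0, (2 * (dims // 2)) / d_model)

pe = np.zeros((max_len, d_model))
pe[:, 0::2] = np.sin(angles[:, 0::2])   # even dims: sin
pe[:, 1::2] = np.cos(angles[:, 1::2])   # odd dims: cos

# Bounded values: sin/cos never leave [-1, 1]
assert pe.min() >= -1.0 and pe.max() <= 1.0
# Unique vector per position: that's the whole point of the encoding
assert len({tuple(np.round(row, 6)) for row in pe}) == max_len

# Position 0 follows directly from sin(0)=0, cos(0)=1
print(pe[0, :4])   # [0. 1. 0. 1.]
```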
🎯

5. Project: Transformer Text Classifier from Scratch

Combine all components: Embedding + PE + TransformerBlock + Classifier
42_transformer_classifier.py — Complete Transformer 🔥
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# ===========================
# Complete Transformer Text Classifier
# ===========================
VOCAB_SIZE = 10000
MAX_LEN = 200
D_MODEL = 128
NUM_HEADS = 4
D_FF = 256
NUM_BLOCKS = 2
DROPOUT = 0.1

# Token + Position Embedding
class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, max_len, vocab_size, d_model):
        super().__init__()
        self.token_emb = layers.Embedding(vocab_size, d_model)
        self.pos_emb = layers.Embedding(max_len, d_model)

    def call(self, x):
        positions = tf.range(tf.shape(x)[-1])
        return self.token_emb(x) + self.pos_emb(positions)

# Build model with Functional API
inputs = keras.Input(shape=(MAX_LEN,))

# Embedding + Positional Encoding
x = TokenAndPositionEmbedding(MAX_LEN, VOCAB_SIZE, D_MODEL)(inputs)
x = layers.Dropout(DROPOUT)(x)

# Stack Transformer blocks (TransformerBlock class from 40_transformer_block.py)
for _ in range(NUM_BLOCKS):
    x = TransformerBlock(D_MODEL, NUM_HEADS, D_FF, DROPOUT)(x)

# Pool + Classify
x = layers.GlobalAveragePooling1D()(x)  # (batch, seq, dim) → (batch, dim)
x = layers.Dropout(0.3)(x)
x = layers.Dense(64, activation='relu')(x)
x = layers.Dropout(0.2)(x)
outputs = layers.Dense(1, activation='sigmoid')(x)

model = keras.Model(inputs=inputs, outputs=outputs, name="transformer_classifier")
model.summary()

# Train on IMDB
(X_train, y_train), (X_test, y_test) = keras.datasets.imdb.load_data(num_words=VOCAB_SIZE)
X_train = keras.utils.pad_sequences(X_train, maxlen=MAX_LEN)
X_test = keras.utils.pad_sequences(X_test, maxlen=MAX_LEN)

model.compile(optimizer=keras.optimizers.Adam(1e-4),
              loss='binary_crossentropy', metrics=['accuracy'])

model.fit(X_train, y_train, epochs=10, batch_size=64,
          validation_data=(X_test, y_test),
          callbacks=[keras.callbacks.EarlyStopping(patience=3,
                     restore_best_weights=True)])

test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"Transformer Accuracy: {test_acc:.1%}")
# ~88% — slightly better than BiLSTM for this dataset size!
# The real power of Transformer shows with LARGE models (BERT, GPT)
🤖

6. BERT Fine-Tuning — 95%+ Accuracy

Pre-trained Transformer that revolutionized NLP — fine-tune for any task

BERT (Bidirectional Encoder Representations from Transformers) is a Transformer model pre-trained on Wikipedia + BookCorpus (3.3 billion words) using two tasks: Masked Language Modeling and Next Sentence Prediction. Result: very rich text representations that can be fine-tuned for classification, QA, NER, and almost any NLP task.

43_bert_finetuning.py — BERT with TF Hub
# ===========================
# Method 1: TF Hub BERT
# ===========================
# import tensorflow_hub as hub
# import tensorflow_text  # needed for BERT preprocessing
# 
# # BERT preprocessor + encoder
# preprocess_url = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
# encoder_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"
# 
# # Build model
# text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
# preprocessor = hub.KerasLayer(preprocess_url)
# encoder_inputs = preprocessor(text_input)
# encoder = hub.KerasLayer(encoder_url, trainable=True)
# outputs = encoder(encoder_inputs)
# 
# # Use [CLS] token output for classification
# pooled = outputs["pooled_output"]  # (batch, 768)
# x = tf.keras.layers.Dropout(0.3)(pooled)
# x = tf.keras.layers.Dense(64, activation='relu')(x)
# predictions = tf.keras.layers.Dense(1, activation='sigmoid')(x)
# 
# model = tf.keras.Model(text_input, predictions)
# model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
#               loss='binary_crossentropy', metrics=['accuracy'])
# 
# # Train with RAW TEXT!
# model.fit(train_texts, train_labels, epochs=3, batch_size=32)
# # → 95%+ accuracy on IMDB! (vs 87% with BiLSTM)

# ===========================
# Method 2: Hugging Face Transformers + TF
# ===========================
# from transformers import TFBertModel, BertTokenizer
# 
# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# bert = TFBertModel.from_pretrained('bert-base-uncased')
# 
# # Tokenize
# inputs = tokenizer(texts, padding=True, truncation=True,
#                    max_length=128, return_tensors='tf')
# # inputs['input_ids']:      token indices
# # inputs['attention_mask']:  1=real, 0=padding
# 
# # Get BERT output
# outputs = bert(inputs)
# cls_output = outputs.last_hidden_state[:, 0, :]  # [CLS] token
# # Shape: (batch, 768)
# 
# # Add classifier head and fine-tune
# # Same 2-phase approach as CNN transfer learning (Page 3):
# # Phase 1: Freeze BERT, train head (LR=1e-3)
# # Phase 2: Unfreeze BERT, fine-tune all (LR=2e-5)

🤖 BERT vs BiLSTM — IMDB Accuracy Comparison:
• Simple LSTM: 84%
• BiLSTM: 87%
• Custom Transformer (our code): 88%
• BERT fine-tuned: 95%+ ← Game changer!

Why is BERT so much better? BERT already "read" 3.3 billion words before seeing your data. It already understands grammar, semantics, and language nuances. Fine-tuning only needs to teach the specific task — not language from scratch.
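The 2-phase recipe from the comments above can be sketched without downloading BERT. Here a tiny Dense stack is a stand-in for the encoder (the layer sizes and names are invented for illustration); the pattern of freezing, compiling, unfreezing, and recompiling with a much smaller learning rate is the same one you would apply to the real `hub.KerasLayer` or `TFBertModel`.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Stand-in "backbone": in real fine-tuning this is the pre-trained BERT encoder
backbone = keras.Sequential([
    layers.Dense(32, activation='relu'),
    layers.Dense(32, activation='relu'),
], name="backbone")

inputs = keras.Input(shape=(16,))
features = backbone(inputs)
outputs = layers.Dense(1, activation='sigmoid', name="head")(features)
model = keras.Model(inputs, outputs)

# Phase 1: freeze the backbone, train ONLY the head with a normal LR
backbone.trainable = False
model.compile(optimizer=keras.optimizers.Adam(1e-3), loss='binary_crossentropy')
head_only = len(model.trainable_weights)   # just the head's kernel + bias

# Phase 2: unfreeze, fine-tune EVERYTHING with a tiny LR (2e-5 for real BERT)
backbone.trainable = True                  # must recompile after changing this!
model.compile(optimizer=keras.optimizers.Adam(2e-5), loss='binary_crossentropy')
all_weights = len(model.trainable_weights)

print(head_only, all_weights)   # 2 6 — the frozen phase trains far fewer weights
```

The tiny phase-2 learning rate is the important part: with a large LR, the first noisy gradients from your small dataset would destroy the pre-trained representations.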

⚖️

7. BERT vs GPT — Encoder vs Decoder

Two Transformer approaches: bidirectional understanding vs autoregressive generation
Aspect        | BERT (Encoder)                          | GPT (Decoder)
Attention     | Bidirectional (sees all words)          | Causal/left-only (sees previous words)
Pre-training  | Masked LM: predict the masked-out words | Next token: predict the next word
Best For      | Understanding: classification, NER, QA  | Generation: text completion, chat, code
Famous Models | BERT, RoBERTa, ALBERT, DeBERTa          | GPT-2/3/4, LLaMA, Gemma
Fine-tuning   | Add classifier head + fine-tune         | Prompt engineering / instruction tuning
Parameters    | 110M (base) - 340M (large)              | 117M (GPT-2) - 1.76T (GPT-4)
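The "bidirectional vs causal" row of the table is easy to demonstrate. In this minimal NumPy sketch (random scores, no learned weights), masking the upper triangle of the score matrix before the softmax is exactly what turns encoder-style attention into GPT-style causal attention.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

seq_len = 5
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))   # raw attention scores (toy)

# BERT-style (encoder): softmax over ALL positions, bidirectional
bert_weights = softmax(scores)

# GPT-style (decoder): mask FUTURE positions before the softmax
causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)
masked = np.where(causal_mask, -1e9, scores)   # future positions → -inf
gpt_weights = softmax(masked)

print(np.round(gpt_weights, 2))
# Row t has non-zero weight only on positions <= t;
# the first token can only attend to itself.
```

The same idea is available in Keras via `mha(..., use_causal_mask=True)` (TF 2.10+), which is how a decoder is trained in parallel without leaking future tokens.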
✂️

8. Tokenizer — WordPiece, BPE, and SentencePiece

Breaking text into subwords — the key to handling unknown words
44_tokenizers.py — Subword Tokenization
# ===========================
# Word-level vs Subword tokenization
# ===========================

# Word-level: "unbelievable" → ["unbelievable"]
# Problem: if "unbelievable" not in vocab → [UNK] (lost!)

# Subword (WordPiece — used by BERT):
# "unbelievable" → ["un", "##believ", "##able"]
# All subwords are in vocab! Handles ANY word!

# BPE (Byte Pair Encoding — used by GPT):
# "lower" → ["low", "er"]
# "lowest" → ["low", "est"]
# Learns common character pairs from data

# SentencePiece (used by T5, LLaMA):
# Language-agnostic, treats input as raw bytes
# Works for ANY language without word boundaries

# ===========================
# Using Hugging Face tokenizer
# ===========================
# from transformers import BertTokenizer
# 
# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# tokens = tokenizer.tokenize("I love TensorFlow programming")
# print(tokens)
# ['i', 'love', 'tensor', '##flow', 'programming']
#  ↑ "TensorFlow" split into "tensor" + "##flow"
# 
# encoded = tokenizer("I love TensorFlow", return_tensors="tf",
#                     padding=True, truncation=True, max_length=128)
# print(encoded['input_ids'])
# [[101, 1045, 2293, 23435, 12314, 102, 0, 0, ...]]
# 101=[CLS], 102=[SEP], 0=[PAD]

# ===========================
# Vocab sizes for popular models
# ===========================
# BERT:   30,522 tokens (WordPiece)
# GPT-2:  50,257 tokens (BPE)
# T5:     32,000 tokens (SentencePiece)
# LLaMA:  32,000 tokens (SentencePiece)
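To make the BPE idea concrete, here is a toy merge-learning loop in pure Python. It is a sketch only: real GPT-2 BPE operates on bytes and has extra tie-breaking rules, and the word list below is invented to match the "lower"/"lowest" example above.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Start from characters, with an end-of-word marker like real BPE
    corpus = Counter(tuple(w) + ('</w>',) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the winning pair fused into one symbol
        new_corpus = Counter()
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1]); i += 2
                else:
                    out.append(word[i]); i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges, corpus

# "low" / "lower" / "lowest" share a stem; BPE discovers it purely from counts
words = ["low"] * 5 + ["lower"] * 2 + ["lowest"] * 2
merges, corpus = learn_bpe_merges(words, num_merges=2)
print(merges)   # [('l', 'o'), ('lo', 'w')]  → "low" fuses into one token
```

After two merges, "lowest" would tokenize as ["low", "e", "s", "t", "</w>"]; with more merges (and more data), common suffixes like "est" fuse too.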
📝

9. Page 6 Summary

Everything we learned
Concept             | What It Is                              | Key Code
MultiHeadAttention  | Built-in Keras self-attention           | MultiHeadAttention(num_heads=8, key_dim=64)
TransformerBlock    | Attn + FFN + Norm + Residual            | class TransformerBlock(layers.Layer)
Positional Encoding | Word position info (sin/cos or learned) | Embedding(max_len, d_model)
BERT                | Pre-trained bidirectional Transformer   | hub.KerasLayer(bert_url)
BERT Fine-tuning    | 2-phase: head → unfreeze                | trainable=True, lr=2e-5
BERT vs GPT         | Understanding vs Generation             | Encoder vs Decoder
WordPiece/BPE       | Subword tokenization                    | BertTokenizer.from_pretrained()
← Previous Page

Page 5 — NLP with TensorFlow

📘

Coming Next: Page 7 — Custom Training & Advanced Keras

Going beyond model.fit(): custom training loops with GradientTape (full control!), custom loss functions (Focal Loss, Contrastive Loss), custom metrics (F1 Score, custom AUC), Model subclassing for research architectures, multi-GPU training with tf.distribute.MirroredStrategy, and mixed precision for even faster training.