📑 Table of Contents — Page 6
- LSTM vs Transformer — Sequential vs parallel
- MultiHeadAttention — Built-in Keras layer
- TransformerBlock — Attn + FFN + Norm + Residual
- Positional Encoding — Sinusoidal & learned
- Project: Transformer Text Classifier
- BERT Fine-Tuning — TF Hub & Hugging Face
- BERT vs GPT — Encoder vs Decoder
- Tokenizer — WordPiece, BPE, SentencePiece
- Summary & Page 7 Preview
1. Review — From LSTM to Transformer
In Neural Network series Page 8, we discussed Transformer from scratch: scaled dot-product attention, multi-head attention, positional encoding, and TransformerBlock. In Page 5, we used LSTM/GRU for NLP. Now we implement Transformer in Keras — and fine-tune BERT for 95%+ accuracy.
| Aspect | LSTM/GRU | Transformer |
|---|---|---|
| Processing | Sequential (one by one) | Parallel (all at once) |
| Long-range | Weak (vanishing gradient) | Strong (direct attention) |
| Training Speed | Slow (sequential) | Fast (GPU parallel) |
| Model Size | Small-medium (1-50M) | Large (110M-175B) |
| Best For | Small datasets, simple tasks | Large datasets, complex NLP |
| Examples | Sentiment, simple classification | BERT, GPT, ChatGPT, LLaMA |
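The "direct attention" and "parallel" entries in the table can be made concrete with a minimal NumPy sketch of scaled dot-product attention (the same formula recalled later in this page). The shapes and random inputs are illustrative; the point is that attention for all positions is computed in a single matrix multiply, whereas an LSTM must step through the sequence token by token.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V"""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (seq, seq) — every pair at once
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_k = 5, 8
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)             # (5, 8) — one output vector per position
print(weights.sum(axis=-1))  # each attention row sums to 1 (softmax)
```

Every position's output is available at once, which is exactly what lets GPUs parallelize Transformer training.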
2. MultiHeadAttention — Built-in Keras Layer
```python
import tensorflow as tf
from tensorflow.keras import layers

# ===========================
# 1. MultiHeadAttention — one line!
# ===========================
mha = layers.MultiHeadAttention(
    num_heads=8,   # 8 attention heads (parallel)
    key_dim=64,    # dimension per head
    dropout=0.1,   # attention dropout
)
# Total attention dimension = num_heads × key_dim = 8 × 64 = 512

# Self-attention: Q, K, V all come from the same input
x = tf.random.normal([2, 200, 128])  # (batch, seq_len, d_model)
attn_output = mha(query=x, value=x, key=x)
print(attn_output.shape)  # (2, 200, 128) — same shape as input!

# Cross-attention: Q from one source, K/V from another
# Used in encoder-decoder (translation, etc.)
# attn_output = mha(query=decoder_input, value=encoder_output)

# ===========================
# 2. Attention weights — where does it look?
# ===========================
attn_output, attn_weights = mha(query=x, value=x, key=x,
                                return_attention_scores=True)
print(attn_weights.shape)  # (2, 8, 200, 200)
# For each head: a 200×200 attention matrix
# attn_weights[0, 0, 5, :] = how much word 5 attends to all other words

# ===========================
# 3. Recall from NN Series Page 8:
#    Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) × V
#    Multi-head = concat(head_1, ..., head_h) × W_o
# Each head learns DIFFERENT attention patterns:
#   - Head 1: syntactic relationships (subject-verb)
#   - Head 2: semantic similarity (synonyms)
#   - Head 3: position-based (nearby words)
#   - etc.
# ===========================
```
3. Transformer Block — Building from Keras Layers
```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# ===========================
# Transformer Encoder Block
# Same architecture as NN Series Page 8!
# ===========================
class TransformerBlock(layers.Layer):
    """One Transformer encoder block.

    Identical to what we built in the NN series, but using Keras layers.
    """
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.d_model = d_model

        # Multi-Head Self-Attention
        self.mha = layers.MultiHeadAttention(
            num_heads=num_heads,
            key_dim=d_model // num_heads,
            dropout=dropout
        )

        # Feed-Forward Network (2 Dense layers)
        self.ffn = keras.Sequential([
            layers.Dense(d_ff, activation='relu'),  # expand
            layers.Dense(d_model),                  # project back
        ])

        # Layer Normalization (NOT BatchNorm!)
        self.norm1 = layers.LayerNormalization(epsilon=1e-6)
        self.norm2 = layers.LayerNormalization(epsilon=1e-6)

        # Dropout
        self.dropout1 = layers.Dropout(dropout)
        self.dropout2 = layers.Dropout(dropout)

    def call(self, x, training=False):
        # Sub-layer 1: Multi-Head Self-Attention + Residual + Norm
        attn_output = self.mha(x, x)  # self-attention
        attn_output = self.dropout1(attn_output, training=training)
        x = self.norm1(x + attn_output)  # residual + norm

        # Sub-layer 2: Feed-Forward + Residual + Norm
        ffn_output = self.ffn(x)
        ffn_output = self.dropout2(ffn_output, training=training)
        x = self.norm2(x + ffn_output)  # residual + norm
        return x

# Test
block = TransformerBlock(d_model=128, num_heads=8, d_ff=256)
x = tf.random.normal([2, 200, 128])
out = block(x)
print(out.shape)  # (2, 200, 128) — same shape! (residual connection)
```
4. Positional Encoding — Adding Position Information
```python
import tensorflow as tf
from tensorflow.keras import layers
import numpy as np

# ===========================
# Method 1: Sinusoidal PE (original "Attention Is All You Need")
# ===========================
class SinusoidalPositionalEncoding(layers.Layer):
    def __init__(self, max_len, d_model):
        super().__init__()
        # Pre-compute positional encodings
        positions = np.arange(max_len)[:, np.newaxis]  # (max_len, 1)
        dims = np.arange(d_model)[np.newaxis, :]       # (1, d_model)
        angles = positions / np.power(10000, (2 * (dims // 2)) / d_model)

        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions: sin
        pe[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions: cos
        self.pe = tf.constant(pe[np.newaxis, :, :], dtype=tf.float32)

    def call(self, x):
        return x + self.pe[:, :tf.shape(x)[1], :]  # add PE to embeddings

# ===========================
# Method 2: Learned PE (simpler, used in BERT)
# ===========================
class LearnedPositionalEncoding(layers.Layer):
    def __init__(self, max_len, d_model):
        super().__init__()
        # Just another Embedding layer! Position → vector
        self.pos_embedding = layers.Embedding(max_len, d_model)

    def call(self, x):
        positions = tf.range(tf.shape(x)[1])
        return x + self.pos_embedding(positions)

# In practice, learned PE works just as well as sinusoidal.
# BERT uses learned PE; the original Transformer uses sinusoidal.
```
5. Project: Transformer Text Classifier from Scratch
```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# ===========================
# Complete Transformer Text Classifier
# ===========================
VOCAB_SIZE = 10000
MAX_LEN = 200
D_MODEL = 128
NUM_HEADS = 4
D_FF = 256
NUM_BLOCKS = 2
DROPOUT = 0.1

# Token + Position Embedding
class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, max_len, vocab_size, d_model):
        super().__init__()
        self.token_emb = layers.Embedding(vocab_size, d_model)
        self.pos_emb = layers.Embedding(max_len, d_model)

    def call(self, x):
        positions = tf.range(tf.shape(x)[-1])
        return self.token_emb(x) + self.pos_emb(positions)

# Build model with the Functional API
inputs = keras.Input(shape=(MAX_LEN,))

# Embedding + Positional Encoding
x = TokenAndPositionEmbedding(MAX_LEN, VOCAB_SIZE, D_MODEL)(inputs)
x = layers.Dropout(DROPOUT)(x)

# Stack Transformer blocks
for _ in range(NUM_BLOCKS):
    x = TransformerBlock(D_MODEL, NUM_HEADS, D_FF, DROPOUT)(x)

# Pool + Classify
x = layers.GlobalAveragePooling1D()(x)  # (batch, seq, dim) → (batch, dim)
x = layers.Dropout(0.3)(x)
x = layers.Dense(64, activation='relu')(x)
x = layers.Dropout(0.2)(x)
outputs = layers.Dense(1, activation='sigmoid')(x)

model = keras.Model(inputs=inputs, outputs=outputs, name="transformer_classifier")
model.summary()

# Train on IMDB
(X_train, y_train), (X_test, y_test) = keras.datasets.imdb.load_data(num_words=VOCAB_SIZE)
X_train = keras.utils.pad_sequences(X_train, maxlen=MAX_LEN)
X_test = keras.utils.pad_sequences(X_test, maxlen=MAX_LEN)

model.compile(optimizer=keras.optimizers.Adam(1e-4),
              loss='binary_crossentropy', metrics=['accuracy'])

model.fit(X_train, y_train, epochs=10, batch_size=64,
          validation_data=(X_test, y_test),
          callbacks=[keras.callbacks.EarlyStopping(patience=3,
                                                   restore_best_weights=True)])

test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"Transformer Accuracy: {test_acc:.1%}")
# ~88% — slightly better than BiLSTM at this dataset size!
# The real power of Transformers shows with LARGE models (BERT, GPT)
```
6. BERT Fine-Tuning — 95%+ Accuracy
BERT (Bidirectional Encoder Representations from Transformers) is a Transformer model pre-trained on Wikipedia + BookCorpus (3.3 billion words) using two tasks: Masked Language Modeling and Next Sentence Prediction. Result: very rich text representations that can be fine-tuned for classification, QA, NER, and almost any NLP task.
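To make Masked Language Modeling concrete, here is a toy sketch of the masking step, not BERT's exact pipeline: roughly 15% of tokens are hidden and the model must recover them. Real BERT works on WordPiece IDs and uses an 80/10/10 mask/random/keep split; the plain 15%-to-`[MASK]` rule below is a simplification.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=42):
    """Replace ~mask_prob of tokens with [MASK]; return masked seq + targets."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets.append(tok)   # the model must predict this token
        else:
            masked.append(tok)
            targets.append(None)  # no loss on unmasked positions
    return masked, targets

tokens = "the movie was surprisingly good and well acted".split()
masked, targets = mask_tokens(tokens)
print(masked)  # some words replaced by [MASK]
```

Because the model sees context on both sides of each `[MASK]`, the learned representations are bidirectional — exactly what makes BERT strong at understanding tasks.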
```python
# ===========================
# Method 1: TF Hub BERT
# ===========================
# import tensorflow_hub as hub
# import tensorflow_text  # needed for BERT preprocessing
#
# # BERT preprocessor + encoder
# preprocess_url = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
# encoder_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"
#
# # Build model
# text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
# preprocessor = hub.KerasLayer(preprocess_url)
# encoder_inputs = preprocessor(text_input)
# encoder = hub.KerasLayer(encoder_url, trainable=True)
# outputs = encoder(encoder_inputs)
#
# # Use [CLS] token output for classification
# pooled = outputs["pooled_output"]  # (batch, 768)
# x = tf.keras.layers.Dropout(0.3)(pooled)
# x = tf.keras.layers.Dense(64, activation='relu')(x)
# predictions = tf.keras.layers.Dense(1, activation='sigmoid')(x)
#
# model = tf.keras.Model(text_input, predictions)
# model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
#               loss='binary_crossentropy', metrics=['accuracy'])
#
# # Train with RAW TEXT!
# model.fit(train_texts, train_labels, epochs=3, batch_size=32)
# # → 95%+ accuracy on IMDB! (vs 87% with BiLSTM)

# ===========================
# Method 2: Hugging Face Transformers + TF
# ===========================
# from transformers import TFBertModel, BertTokenizer
#
# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# bert = TFBertModel.from_pretrained('bert-base-uncased')
#
# # Tokenize
# inputs = tokenizer(texts, padding=True, truncation=True,
#                    max_length=128, return_tensors='tf')
# # inputs['input_ids']: token indices
# # inputs['attention_mask']: 1=real, 0=padding
#
# # Get BERT output
# outputs = bert(inputs)
# cls_output = outputs.last_hidden_state[:, 0, :]  # [CLS] token
# # Shape: (batch, 768)
#
# # Add classifier head and fine-tune
# # Same 2-phase approach as CNN transfer learning (Page 3):
# #   Phase 1: Freeze BERT, train head (LR=1e-3)
# #   Phase 2: Unfreeze BERT, fine-tune all (LR=2e-5)
```
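The 2-phase freeze/unfreeze schedule mentioned above can be sketched without downloading BERT at all. The snippet below uses a tiny stand-in "base" model purely for illustration — its name, layer sizes, and input shape are made up; with real BERT you would freeze/unfreeze the `hub.KerasLayer` or `TFBertModel` the same way.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Tiny stand-in for a pretrained encoder (illustrative only)
base = keras.Sequential([layers.Dense(16, activation='relu')],
                        name="pretrained_base")

inputs = keras.Input(shape=(8,))
x = base(inputs)
outputs = layers.Dense(1, activation='sigmoid')(x)
model = keras.Model(inputs, outputs)

# Phase 1: freeze the base, train only the new head with a larger LR
base.trainable = False
model.compile(optimizer=keras.optimizers.Adam(1e-3),
              loss='binary_crossentropy')
print(len(model.trainable_weights))  # only the head's kernel + bias

# Phase 2: unfreeze everything, fine-tune with a much smaller LR
base.trainable = True
model.compile(optimizer=keras.optimizers.Adam(2e-5),
              loss='binary_crossentropy')
print(len(model.trainable_weights))  # base + head weights
```

Recompiling after toggling `trainable` matters: Keras snapshots the trainable state at compile time, and the tiny Phase-2 learning rate protects the pretrained weights from being destroyed early in fine-tuning.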
🤖 BERT vs BiLSTM — IMDB Accuracy Comparison:
• Simple LSTM: 84%
• BiLSTM: 87%
• Custom Transformer (our code): 88%
• BERT fine-tuned: 95%+ ← Game changer!
Why is BERT so much better? BERT already "read" 3.3 billion words before seeing your data. It already understands grammar, semantics, and language nuances. Fine-tuning only needs to teach the specific task — not language from scratch.
7. BERT vs GPT — Encoder vs Decoder
| Aspect | BERT (Encoder) | GPT (Decoder) |
|---|---|---|
| Attention | Bidirectional (sees all words) | Causal/Left-only (sees previous words) |
| Pre-training | Masked LM: guess deleted words | Next token: predict next word |
| Best For | Understanding: classification, NER, QA | Generation: text completion, chat, code |
| Famous Models | BERT, RoBERTa, ALBERT, DeBERTa | GPT-2/3/4, LLaMA, Gemma |
| Fine-tuning | Add classifier head + fine-tune | Prompt engineering / instruction tuning |
| Parameters | 110M (base) - 340M (large) | 117M (GPT-2) - 1.76T (GPT-4) |
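The "Bidirectional vs Causal" row of the table comes down to one attention mask. A minimal NumPy sketch (toy sequence length, illustrative only):

```python
import numpy as np

seq_len = 5

# BERT-style (encoder): every position may attend to every other position
bidirectional_mask = np.ones((seq_len, seq_len), dtype=int)

# GPT-style (decoder): lower-triangular — no peeking at future tokens,
# so position i attends only to positions 0..i
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))

print(causal_mask)
print(causal_mask[2])  # [1 1 1 0 0] — third token sees only tokens 0..2
```

In practice the mask sets blocked attention scores to a large negative value before the softmax, so the forbidden positions get ~0 weight. This single constraint is what makes GPT a next-token predictor and BERT a full-context encoder.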
8. Tokenizer — WordPiece, BPE, and SentencePiece
```python
# ===========================
# Word-level vs Subword tokenization
# ===========================
# Word-level: "unbelievable" → ["unbelievable"]
#   Problem: if "unbelievable" is not in vocab → [UNK] (lost!)

# Subword (WordPiece — used by BERT):
#   "unbelievable" → ["un", "##believ", "##able"]
#   All subwords are in vocab! Handles ANY word!

# BPE (Byte Pair Encoding — used by GPT):
#   "lower"  → ["low", "er"]
#   "lowest" → ["low", "est"]
#   Learns common character pairs from data

# SentencePiece (used by T5, LLaMA):
#   Language-agnostic, works directly on the raw text stream
#   Works for ANY language, even without word boundaries

# ===========================
# Using a Hugging Face tokenizer
# ===========================
# from transformers import BertTokenizer
#
# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# tokens = tokenizer.tokenize("I love TensorFlow programming")
# print(tokens)
# # ['i', 'love', 'tensor', '##flow', 'programming']
# #   ↑ "TensorFlow" split into "tensor" + "##flow"
#
# encoded = tokenizer("I love TensorFlow", return_tensors="tf",
#                     padding=True, truncation=True, max_length=128)
# print(encoded['input_ids'])
# # [[101, 1045, 2293, 23435, 12314, 102, 0, 0, ...]]
# # 101=[CLS], 102=[SEP], 0=[PAD]

# ===========================
# Vocab sizes for popular models
# ===========================
# BERT:  30,522 tokens (WordPiece)
# GPT-2: 50,257 tokens (BPE)
# T5:    32,000 tokens (SentencePiece)
# LLaMA: 32,000 tokens (SentencePiece)
```
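How does BPE "learn common character pairs from data"? A toy sketch of one training step (illustrative, not GPT-2's actual implementation): count the most frequent adjacent symbol pair in the corpus, then merge it into a new token. Repeating this until the vocab is full yields merges like "low" + "er".

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the top one."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

# Toy corpus, split into characters (real BPE also tracks word frequencies)
corpus = [list(w) for w in ["low", "lower", "lowest", "low"]]
pair = most_frequent_pair(corpus)  # ('l','o') or ('o','w') — both appear 4×
corpus = merge_pair(corpus, pair)
print(pair)
print(corpus[0])  # "low" is now partially merged, e.g. ['lo', 'w']
```

After enough merge steps, frequent words become single tokens while rare words stay split into reusable subwords — which is why BPE never needs an [UNK] for unseen words.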
9. Page 6 Summary
| Concept | What It Is | Key Code |
|---|---|---|
| MultiHeadAttention | Built-in Keras self-attention | MultiHeadAttention(num_heads=8, key_dim=64) |
| TransformerBlock | Attn + FFN + Norm + Residual | class TransformerBlock(layers.Layer) |
| Positional Encoding | Word position info (sin/cos or learned) | Embedding(max_len, d_model) |
| BERT | Pre-trained bidirectional Transformer | hub.KerasLayer(bert_url) |
| BERT Fine-tuning | 2-phase: head → unfreeze | trainable=True, lr=2e-5 |
| BERT vs GPT | Understanding vs Generation | Encoder vs Decoder |
| WordPiece/BPE | Subword tokenization | BertTokenizer.from_pretrained() |
Coming Next: Page 7 — Custom Training & Advanced Keras
Going beyond model.fit(): custom training loops with GradientTape (full control!), custom loss functions (Focal Loss, Contrastive Loss), custom metrics (F1 Score, custom AUC), Model subclassing for research architectures, multi-GPU training with tf.distribute.MirroredStrategy, and mixed precision for even faster training.