šŸ“ Artikel ini ditulis dalam Bahasa Indonesia & English
šŸ“ This article is available in English & Bahasa Indonesia

šŸ”§ Learn TensorFlow — Page 7

Custom Training &
Advanced Keras

Going beyond model.fit(). Page 7 covers in depth: when and why you need custom training loops, complete implementation with GradientTape and @tf.function, custom loss functions (Focal Loss, Contrastive Loss, Triplet Loss), custom metrics (F1 Score, Matthews Correlation), Model subclassing for research architectures, multi-GPU training with tf.distribute.MirroredStrategy and TPUStrategy, gradient accumulation for large batch sizes on small GPUs, and gradient clipping for stability.

šŸ“… March 2026 Ā· ā± 30 min read
šŸ· GradientTapeCustom LossCustom MetricsSubclassingMulti-GPUtf.distributeTPU
šŸ“š Learn TensorFlow Series:

šŸ“‘ Table of Contents — Page 7

  1. When Do You Need Custom Training? — model.fit() vs GradientTape
  2. Custom Training Loop — Complete GradientTape + @tf.function
  3. Custom Loss Functions — Focal Loss, Contrastive, Triplet
  4. Custom Metrics — F1 Score, Matthews Correlation
  5. Model Subclassing — tf.keras.Model inheritance
  6. tf.distribute — Multi-GPU — MirroredStrategy & TPUStrategy
  7. Gradient Accumulation — Large batches on small GPUs
  8. Gradient Clipping — Training stability
  9. Project: Custom GAN Training Loop
  10. Summary & Page 8 Preview
šŸ¤”

1. When Do You Need Custom Training?

model.fit() is very powerful — but some situations need full control

model.fit() handles 90% of training needs. But there are situations where you need full control over every aspect of the training loop:

model.fit() vs Custom Training Loop — When to Use Which?

āœ… model.fit() — use for:
• Standard classification/regression
• Transfer learning (Page 3)
• Any model with 1 input → 1 output → 1 loss
• Fast prototyping and iteration
• 90% of all deep learning projects!

šŸ”§ Custom Training Loop — use for:
• GAN training: alternating Generator & Discriminator updates
• Reinforcement Learning: reward-based gradient updates
• Multiple loss functions with custom weighting
• Gradient accumulation: simulate large batches on a small GPU
• Research experiments: custom gradient modifications
• Meta-learning: learning to learn (MAML, etc.)
• Curriculum learning: change data difficulty over time

Rule: Start with model.fit(). Switch to a custom loop ONLY when needed.
⚔

2. Custom Training Loop — Complete GradientTape

Full control: forward pass → loss → gradients → optimizer → metrics → logging
45_custom_training.py — Production-Grade Custom Loop šŸ”„ (Python)
import tensorflow as tf
from tensorflow import keras
import time

# ===========================
# 1. Setup
# ===========================
model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(10, activation='softmax')
])

optimizer = keras.optimizers.Adam(learning_rate=1e-3)
loss_fn = keras.losses.SparseCategoricalCrossentropy()
train_acc_metric = keras.metrics.SparseCategoricalAccuracy()
val_acc_metric = keras.metrics.SparseCategoricalAccuracy()
train_loss_metric = keras.metrics.Mean()

# Load data
(X_train, y_train), (X_test, y_test) = keras.datasets.mnist.load_data()
X_train = X_train.reshape(-1, 784).astype('float32') / 255.0
X_test = X_test.reshape(-1, 784).astype('float32') / 255.0

train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train))
train_ds = train_ds.shuffle(60000).batch(64).prefetch(tf.data.AUTOTUNE)
val_ds = tf.data.Dataset.from_tensor_slices((X_test, y_test)).batch(64)

# ===========================
# 2. Train step (compiled with @tf.function for speed!)
# ===========================
@tf.function
def train_step(x_batch, y_batch):
    with tf.GradientTape() as tape:
        # Forward pass (training=True for Dropout/BatchNorm)
        predictions = model(x_batch, training=True)
        loss = loss_fn(y_batch, predictions)

    # Compute gradients
    gradients = tape.gradient(loss, model.trainable_variables)

    # Apply gradients (update weights)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    # Update metrics
    train_acc_metric.update_state(y_batch, predictions)
    train_loss_metric.update_state(loss)
    return loss

# ===========================
# 3. Validation step
# ===========================
@tf.function
def val_step(x_batch, y_batch):
    predictions = model(x_batch, training=False)  # training=False!
    val_acc_metric.update_state(y_batch, predictions)

# ===========================
# 4. Training loop — FULL CONTROL
# ===========================
EPOCHS = 10
best_val_acc = 0

for epoch in range(EPOCHS):
    start_time = time.time()

    # Training
    for x_batch, y_batch in train_ds:
        train_step(x_batch, y_batch)

    train_acc = train_acc_metric.result()
    train_loss = train_loss_metric.result()

    # Validation
    for x_batch, y_batch in val_ds:
        val_step(x_batch, y_batch)

    val_acc = val_acc_metric.result()
    elapsed = time.time() - start_time

    # Logging
    print(f"Epoch {epoch+1}/{EPOCHS} ({elapsed:.1f}s) — "
          f"loss: {train_loss:.4f} — acc: {train_acc:.1%} — "
          f"val_acc: {val_acc:.1%}")

    # Save best model
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        model.save_weights('best_weights.weights.h5')
        print(f"  ↑ New best: {val_acc:.1%}")

    # Reset metrics for next epoch
    train_acc_metric.reset_state()
    train_loss_metric.reset_state()
    val_acc_metric.reset_state()

print(f"\nšŸŽÆ Best Val Accuracy: {best_val_acc:.1%}")
# šŸŽÆ Best Val Accuracy: 98.2%

šŸŽ“ @tf.function — Why It Matters
Without @tf.function: Python eager mode — each op executes immediately. Easy to debug, but slow.
With @tf.function: TF compiles function into graph — ops are fused, memory optimized, can run on GPU/TPU. 2-10Ɨ faster!

Tip: Develop and debug without @tf.function. Once confirmed working, add @tf.function for production speed.
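One convenient way to follow this tip without editing decorators everywhere is TensorFlow's global eager switch; a minimal sketch:

```python
import tensorflow as tf

# Debug mode: force every @tf.function to run eagerly (print/pdb work as usual)
tf.config.run_functions_eagerly(True)

@tf.function
def square(x):
    return x * x

assert float(square(tf.constant(3.0))) == 9.0  # executed eagerly

# Production mode: restore graph compilation for speed
tf.config.run_functions_eagerly(False)
assert float(square(tf.constant(4.0))) == 16.0  # executed as a compiled graph
```

Flip the flag from a config variable so the same training script can run in either mode.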

šŸ“

3. Custom Loss Functions — Beyond CrossEntropy

Focal Loss for imbalanced data, Contrastive for similarity, Triplet for embeddings
46_custom_losses.py — 3 Custom Loss Functions (Python)
import tensorflow as tf

# ===========================
# 1. Focal Loss — for IMBALANCED classes
# Down-weights easy examples, focuses on hard ones
# Paper: "Focal Loss for Dense Object Detection" (Lin et al.)
# ===========================
class FocalLoss(tf.keras.losses.Loss):
    """Focal Loss: reduces loss for well-classified examples.
    Great for imbalanced datasets (e.g., 99% negative, 1% positive).
    gamma=0 → standard cross-entropy. gamma=2 → recommended default.
    """

    def __init__(self, gamma=2.0, alpha=0.25, **kwargs):
        super().__init__(**kwargs)
        self.gamma = gamma  # focusing parameter
        self.alpha = alpha  # class weight

    def call(self, y_true, y_pred):
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1 - 1e-7)
        bce = -y_true * tf.math.log(y_pred) - (1 - y_true) * tf.math.log(1 - y_pred)
        p_t = y_true * y_pred + (1 - y_true) * (1 - y_pred)
        focal_weight = self.alpha * ((1 - p_t) ** self.gamma)
        return tf.reduce_mean(focal_weight * bce)

# Use: model.compile(loss=FocalLoss(gamma=2.0))
# When: fraud detection, medical diagnosis, rare event prediction

# ===========================
# 2. Contrastive Loss — for SIMILARITY learning
# Brings similar pairs closer, pushes dissimilar pairs apart
# ===========================
class ContrastiveLoss(tf.keras.losses.Loss):
    """Contrastive Loss for Siamese networks.
    y=1: same class → minimize distance.
    y=0: different class → maximize distance (up to margin).
    """

    def __init__(self, margin=1.0, **kwargs):
        super().__init__(**kwargs)
        self.margin = margin

    def call(self, y_true, distance):
        y_true = tf.cast(y_true, tf.float32)
        loss_positive = y_true * tf.square(distance)
        loss_negative = (1 - y_true) * tf.square(
            tf.maximum(self.margin - distance, 0))
        return tf.reduce_mean(0.5 * (loss_positive + loss_negative))

# When: face verification, signature matching, duplicate detection

# ===========================
# 3. Simple function-based loss (easiest approach)
# ===========================
def weighted_mse(y_true, y_pred):
    """MSE with higher weight for large errors"""
    error = y_true - y_pred
    weight = tf.where(tf.abs(error) > 1.0, 2.0, 1.0)
    return tf.reduce_mean(weight * tf.square(error))

# Use: model.compile(loss=weighted_mse)
# Any function(y_true, y_pred) → scalar works as a loss!

# ===========================
# 4. Label Smoothing (built-in, but useful to know)
# ===========================
loss = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1)
# Instead of [0, 1, 0]: uses [0.033, 0.933, 0.033]
# Prevents overconfidence → better generalization
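The section intro also lists Triplet Loss, which the listing above stops short of. Here is a hedged sketch of one common formulation; it assumes y_pred carries the anchor, positive, and negative embeddings concatenated along the feature axis, so adapt the splitting to your model's actual outputs:

```python
import tensorflow as tf

class TripletLoss(tf.keras.losses.Loss):
    """Triplet Loss: pull anchor toward positive, push negative away by margin.
    Assumes y_pred = [anchor | positive | negative] embeddings, concatenated
    along axis 1 (y_true is unused, as in the Siamese setups above)."""

    def __init__(self, margin=0.5, **kwargs):
        super().__init__(**kwargs)
        self.margin = margin

    def call(self, y_true, y_pred):
        # Split the stacked embeddings into the three parts of the triplet
        anchor, positive, negative = tf.split(y_pred, 3, axis=1)
        pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=1)
        neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=1)
        # Hinge: loss is zero once negatives are margin farther than positives
        return tf.reduce_mean(tf.maximum(pos_dist - neg_dist + self.margin, 0.0))

# Use: model.compile(loss=TripletLoss(margin=0.5))
# When: face recognition (FaceNet-style), image retrieval, metric learning
```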
šŸ“Š

4. Custom Metrics — F1 Score & Beyond

Accuracy alone isn't enough — F1, MCC, and domain-specific metrics
47_custom_metrics.py — F1 Score & MCC (Python)
import tensorflow as tf

# ===========================
# 1. F1 Score — harmonic mean of precision & recall
# ===========================
class F1Score(tf.keras.metrics.Metric):
    """F1 Score: 2 Ɨ (precision Ɨ recall) / (precision + recall)
    Better than accuracy for imbalanced datasets.
    """

    def __init__(self, name='f1_score', threshold=0.5, **kwargs):
        super().__init__(name=name, **kwargs)
        self.precision = tf.keras.metrics.Precision(thresholds=threshold)
        self.recall = tf.keras.metrics.Recall(thresholds=threshold)

    def update_state(self, y_true, y_pred, sample_weight=None):
        self.precision.update_state(y_true, y_pred, sample_weight)
        self.recall.update_state(y_true, y_pred, sample_weight)

    def result(self):
        p = self.precision.result()
        r = self.recall.result()
        return 2 * ((p * r) / (p + r + tf.keras.backend.epsilon()))

    def reset_state(self):
        self.precision.reset_state()
        self.recall.reset_state()

# Use: model.compile(metrics=['accuracy', F1Score()])
# Output: "f1_score: 0.8723"

# ===========================
# 2. Matthews Correlation Coefficient (MCC)
# Best single metric for binary classification!
# ===========================
class MCC(tf.keras.metrics.Metric):
    """Matthews Correlation Coefficient.
    Range: [-1, +1]. 0 = random, 1 = perfect, -1 = inverse.
    Better than F1 for imbalanced datasets.
    """

    def __init__(self, name='mcc', threshold=0.5, **kwargs):
        super().__init__(name=name, **kwargs)
        self.tp = self.add_weight(name='tp', initializer='zeros')
        self.tn = self.add_weight(name='tn', initializer='zeros')
        self.fp = self.add_weight(name='fp', initializer='zeros')
        self.fn = self.add_weight(name='fn', initializer='zeros')
        self.threshold = threshold

    def update_state(self, y_true, y_pred, sample_weight=None):
        y_pred = tf.cast(y_pred >= self.threshold, tf.float32)
        y_true = tf.cast(y_true, tf.float32)
        self.tp.assign_add(tf.reduce_sum(y_true * y_pred))
        self.tn.assign_add(tf.reduce_sum((1-y_true) * (1-y_pred)))
        self.fp.assign_add(tf.reduce_sum((1-y_true) * y_pred))
        self.fn.assign_add(tf.reduce_sum(y_true * (1-y_pred)))

    def result(self):
        num = self.tp * self.tn - self.fp * self.fn
        den = tf.sqrt((self.tp+self.fp) * (self.tp+self.fn) *
                      (self.tn+self.fp) * (self.tn+self.fn) + 1e-7)
        return num / den

    def reset_state(self):
        self.tp.assign(0); self.tn.assign(0)
        self.fp.assign(0); self.fn.assign(0)

šŸŽ“ When to Use Which Metric?
Accuracy: Balanced classes, simple tasks.
F1 Score: Imbalanced classes, care about both precision & recall.
MCC: Best single metric for binary — balanced even with extreme imbalance.
AUC-ROC: When you need threshold-independent evaluation.
Domain-specific: BLEU (translation), IoU (segmentation), mAP (detection).

šŸ—ļø

5. Model Subclassing — Full Custom Architecture

tf.keras.Model inheritance for research architectures with dynamic logic
48_model_subclassing.py — Custom Model Architecture (Python)
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# ===========================
# 1. Custom Residual Block
# ===========================
class ResidualBlock(layers.Layer):
    """Residual block: output = ReLU(x + F(x))"""

    def __init__(self, units, dropout=0.1, **kwargs):
        super().__init__(**kwargs)
        self.dense1 = layers.Dense(units, activation='relu',
                                   kernel_initializer='he_normal')
        self.dense2 = layers.Dense(units, kernel_initializer='he_normal')
        self.bn1 = layers.BatchNormalization()
        self.bn2 = layers.BatchNormalization()
        self.dropout = layers.Dropout(dropout)
        self.add = layers.Add()

    def call(self, inputs, training=False):
        x = self.dense1(inputs)
        x = self.bn1(x, training=training)
        x = self.dropout(x, training=training)
        x = self.dense2(x)
        x = self.bn2(x, training=training)
        return tf.nn.relu(self.add([x, inputs]))  # residual!

# ===========================
# 2. Full Custom Model with Dynamic Logic
# ===========================
class AdaptiveClassifier(keras.Model):
    """Model with dynamic routing based on input complexity."""

    def __init__(self, num_classes, num_blocks=3):
        super().__init__()
        self.flatten = layers.Flatten()
        self.project = layers.Dense(128, activation='relu')

        # Stack of residual blocks
        self.blocks = [ResidualBlock(128) for _ in range(num_blocks)]

        # Complexity estimator (dynamic routing!)
        self.complexity = layers.Dense(1, activation='sigmoid')

        self.dropout = layers.Dropout(0.3)
        self.classifier = layers.Dense(num_classes, activation='softmax')

    def call(self, inputs, training=False):
        x = self.flatten(inputs)
        x = self.project(x)

        # Estimate complexity — decide how many blocks to use
        complexity_score = self.complexity(tf.stop_gradient(x))

        # Dynamic depth! (not possible with Sequential/Functional)
        for i, block in enumerate(self.blocks):
            x = block(x, training=training)
            # Could add early exit logic here based on complexity_score

        x = self.dropout(x, training=training)
        return self.classifier(x)

# Use exactly like any Keras model!
model = AdaptiveClassifier(num_classes=10, num_blocks=4)
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy', F1Score()])
# model.fit(X_train, y_train, ...) → works!

šŸŽ“ 3 Ways to Build Models — Quick Decision:
Sequential: Linear stack. 80% of cases. model = Sequential([...])
Functional: Branching, skip connections, multi-input. 15%. Model(inputs, outputs)
Subclassing: Dynamic logic (if/else, loops), research. 5%. class MyModel(Model)
Start simple. Upgrade complexity only when needed.
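To make the decision concrete, here is the same tiny two-layer network written all three ways (a toy sketch; the layer sizes are arbitrary):

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# 1. Sequential — a plain linear stack
seq = keras.Sequential([
    layers.Dense(16, activation='relu', input_shape=(8,)),
    layers.Dense(1)
])

# 2. Functional — explicit tensor graph (allows branching/multi-input later)
inp = keras.Input(shape=(8,))
h = layers.Dense(16, activation='relu')(inp)
func = keras.Model(inp, layers.Dense(1)(h))

# 3. Subclassing — arbitrary Python logic inside call()
class Tiny(keras.Model):
    def __init__(self):
        super().__init__()
        self.d1 = layers.Dense(16, activation='relu')
        self.d2 = layers.Dense(1)

    def call(self, x):
        return self.d2(self.d1(x))

# All three produce the same output shape for the same input
x = np.zeros((2, 8), dtype='float32')
for m in (seq, func, Tiny()):
    assert tuple(m(x).shape) == (2, 1)
```

For this architecture the three are interchangeable; the differences only matter once you need branching (Functional) or data-dependent control flow (Subclassing).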

šŸ–„ļø

6. tf.distribute — Multi-GPU & TPU Training

Scale training from 1 GPU to many GPUs/TPUs — minimal code changes
49_multi_gpu.py — tf.distribute Strategies (Python)
import tensorflow as tf

# ===========================
# 1. MirroredStrategy — multi-GPU on ONE machine
# ===========================
strategy = tf.distribute.MirroredStrategy()
print(f"Number of devices: {strategy.num_replicas_in_sync}")
# → "Number of devices: 4" (if 4 GPUs)

# Build model INSIDE strategy scope!
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

# Training — AUTOMATIC data distribution!
model.fit(train_ds, epochs=10)
# Each GPU processes batch_size/num_gpus samples
# Gradients are synchronized across GPUs (all-reduce)
# 2 GPUs ā‰ˆ 1.8Ɨ throughput, 4 GPUs ā‰ˆ 3.5Ɨ throughput

# ===========================
# 2. TPUStrategy — Google TPU (Colab/Cloud)
# ===========================
# resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
# tf.config.experimental_connect_to_cluster(resolver)
# tf.tpu.experimental.initialize_tpu_system(resolver)
# strategy = tf.distribute.TPUStrategy(resolver)
# 
# with strategy.scope():
#     model = build_model()
#     model.compile(...)
# model.fit(train_ds, epochs=10)
# TPU v3-8: ~8Ɨ faster than single GPU!

# ===========================
# 3. MultiWorkerMirroredStrategy — multi-MACHINE
# For distributed training across multiple servers
# ===========================
# strategy = tf.distribute.MultiWorkerMirroredStrategy()
# Same API, but runs across network-connected machines

# ===========================
# Tips for multi-GPU training
# ===========================
# 1. Scale batch_size Ɨ num_GPUs (e.g., 64 Ɨ 4 = 256)
# 2. Scale learning_rate Ɨ num_GPUs (linear scaling rule)
# 3. Use tf.data pipeline (prefetch!) — GPUs are HUNGRY
# 4. Use mixed precision — even faster on multiple GPUs
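Tip 4 (mixed precision) is a one-line global switch; a minimal sketch — note the output layer is pinned to float32 for numerical stability:

```python
import tensorflow as tf

# Compute in float16, keep variables in float32 (combines with any strategy scope)
tf.keras.mixed_precision.set_global_policy('mixed_float16')

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    # Softmax output pinned to float32 — avoids float16 overflow/underflow
    tf.keras.layers.Dense(10, activation='softmax', dtype='float32')
])

assert model.layers[0].compute_dtype == 'float16'  # activations in half precision
assert model.layers[0].dtype == 'float32'          # weights stay full precision

# Reset if the rest of your script expects full precision everywhere
tf.keras.mixed_precision.set_global_policy('float32')
```

Mixed precision roughly halves activation memory and is fastest on GPUs with tensor cores (Volta and newer) and on TPUs.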
šŸ“¦

7. Gradient Accumulation — Large Batches on Small GPUs

Simulate batch size 256 on a GPU that only fits batch 32
50_gradient_accumulation.py (Python)
import tensorflow as tf

# Problem: BERT fine-tuning needs batch_size=32+
# But GPU only fits batch_size=8 (OOM with 32!)
# Solution: accumulate gradients over 4 mini-batches → effective batch=32

ACCUM_STEPS = 4  # accumulate 4 mini-batches
MINI_BATCH = 8   # each mini-batch
# Effective batch = 4 Ɨ 8 = 32

optimizer = tf.keras.optimizers.Adam(1e-3)

# Accumulator variables (same shape as model weights)
# (`model` and `loss_fn` are defined as in 45_custom_training.py)
accum_gradients = [tf.Variable(tf.zeros_like(v), trainable=False)
                   for v in model.trainable_variables]

@tf.function
def train_step_accumulate(x, y, step):
    # Pass `step` as a tensor (e.g. tf.constant(step)) — a raw Python int
    # would force @tf.function to retrace the graph on every call
    with tf.GradientTape() as tape:
        preds = model(x, training=True)
        loss = loss_fn(y, preds) / ACCUM_STEPS  # scale loss!

    grads = tape.gradient(loss, model.trainable_variables)

    # Accumulate
    for accum, grad in zip(accum_gradients, grads):
        accum.assign_add(grad)

    # Apply when accumulated enough
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.apply_gradients(
            zip(accum_gradients, model.trainable_variables))
        for accum in accum_gradients:
            accum.assign(tf.zeros_like(accum))  # reset!

    return loss * ACCUM_STEPS  # return unscaled loss for logging
āœ‚ļø

8. Gradient Clipping — Training Stability

Preventing exploding gradients — especially important for RNNs and Transformers
51_gradient_clipping.py (Python)
import tensorflow as tf

# ===========================
# Method 1: In optimizer (easiest!)
# ===========================
optimizer = tf.keras.optimizers.Adam(
    learning_rate=1e-3,
    clipnorm=1.0,      # clip gradients with L2 norm > 1.0
    # clipvalue=0.5,   # clip each gradient value to [-0.5, 0.5]
)
# clipnorm: scales all gradients so total norm ≤ 1.0
# clipvalue: clips each individual gradient value
# clipnorm is generally preferred (preserves direction)

# ===========================
# Method 2: Manual clipping in custom loop
# ===========================
@tf.function
def train_step_clipped(x, y):
    with tf.GradientTape() as tape:
        preds = model(x, training=True)
        loss = loss_fn(y, preds)

    grads = tape.gradient(loss, model.trainable_variables)

    # Clip gradients
    grads, global_norm = tf.clip_by_global_norm(grads, clip_norm=1.0)
    # global_norm = original gradient norm (useful for monitoring)

    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss, global_norm

# Monitor gradient norm — if consistently > 10, you have a problem!
# Healthy range: 0.1 — 5.0
# > 10: exploding gradients → reduce LR or add clipping
# < 0.001: vanishing gradients → use residual connections

šŸ’” Best Practice: Always use clipnorm=1.0 in the optimizer for RNN, LSTM, GRU, and Transformer training. This prevents gradient explosion without sacrificing convergence. For very stable training: combine gradient clipping + learning rate warmup + cosine decay.
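The warmup + cosine decay combo mentioned above can be wired into any optimizer via a LearningRateSchedule; a hedged sketch (the class name WarmupCosine and the step counts are illustrative):

```python
import tensorflow as tf

class WarmupCosine(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Linear warmup to peak_lr, then cosine decay toward zero."""

    def __init__(self, peak_lr=1e-3, warmup_steps=1000, total_steps=10000):
        self.peak_lr = peak_lr
        self.warmup_steps = warmup_steps
        self.cosine = tf.keras.optimizers.schedules.CosineDecay(
            peak_lr, total_steps - warmup_steps)

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        return tf.cond(
            step < self.warmup_steps,
            lambda: self.peak_lr * step / self.warmup_steps,  # linear warmup
            lambda: self.cosine(step - self.warmup_steps))    # cosine decay

# Combine with gradient clipping, as recommended above
optimizer = tf.keras.optimizers.Adam(
    learning_rate=WarmupCosine(), clipnorm=1.0)
```

Warmup keeps early updates small while BatchNorm statistics and Adam moments are still noisy; cosine decay then anneals the learning rate smoothly to the end of training.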

šŸŽØ

9. Project: Custom GAN Training Loop

Real example of why custom loops are needed — alternating 2 optimizers
52_gan_training_loop.py — GAN Custom Training šŸ”„ (Python)
import tensorflow as tf

# This is IMPOSSIBLE with model.fit()!
# GAN needs to alternate between training D and G

generator = build_generator()      # noise → fake image
discriminator = build_discriminator()  # image → real/fake

g_optimizer = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)
d_optimizer = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

NOISE_DIM = 100
BATCH_SIZE = 64

@tf.function
def train_step(real_images):
    noise = tf.random.normal([BATCH_SIZE, NOISE_DIM])

    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
        # Forward pass
        fake_images = generator(noise, training=True)
        real_output = discriminator(real_images, training=True)
        fake_output = discriminator(fake_images, training=True)

        # Losses
        d_loss_real = bce(tf.ones_like(real_output), real_output)
        d_loss_fake = bce(tf.zeros_like(fake_output), fake_output)
        d_loss = d_loss_real + d_loss_fake

        g_loss = bce(tf.ones_like(fake_output), fake_output)

    # Separate gradient updates — THIS is why we need custom loop!
    d_grads = disc_tape.gradient(d_loss, discriminator.trainable_variables)
    g_grads = gen_tape.gradient(g_loss, generator.trainable_variables)

    d_optimizer.apply_gradients(
        zip(d_grads, discriminator.trainable_variables))
    g_optimizer.apply_gradients(
        zip(g_grads, generator.trainable_variables))

    return d_loss, g_loss

# Training loop
for epoch in range(100):
    for real_batch in train_dataset:
        d_loss, g_loss = train_step(real_batch)

    print(f"Epoch {epoch+1} | D_loss: {d_loss:.4f} | G_loss: {g_loss:.4f}")

    # Generate sample images every 10 epochs
    if (epoch + 1) % 10 == 0:
        noise = tf.random.normal([16, NOISE_DIM])
        generated = generator(noise, training=False)
        # save_images(generated, f"gen_epoch_{epoch+1}.png")

šŸŽ“ Why GANs Need a Custom Loop
model.fit() optimizes one model with one loss and one optimizer. A GAN has:
• 2 models (Generator + Discriminator)
• 2 losses (G_loss + D_loss — opposing directions!)
• 2 optimizers (one per model)
• Alternating updates: train D → freeze D → train G → repeat
This cannot be done with model.fit(). Custom loop = the only way.

šŸ“

10. Page 7 Summary

Everything we've learned
Concept | What It Is | Key Code
Custom Training | Full control with GradientTape | tape.gradient(loss, vars)
@tf.function | Compile to graph (2-10Ɨ faster) | @tf.function
Focal Loss | Loss for imbalanced data | class FocalLoss(Loss)
Contrastive Loss | Loss for similarity learning | class ContrastiveLoss(Loss)
F1 Score | Custom metric balancing precision & recall | class F1Score(Metric)
Subclassing | Model with dynamic logic | class MyModel(keras.Model)
MirroredStrategy | Multi-GPU on one machine | tf.distribute.MirroredStrategy()
Gradient Accum | Large batches on small GPUs | accum.assign_add(grad)
Gradient Clipping | Prevent gradient explosion | clipnorm=1.0
← Previous Page

Page 6 — Transformer & BERT in TensorFlow

šŸ“˜

Coming Next: Page 8 — GAN & Generative Models

Creating images from scratch! Page 8 covers: complete DCGAN architecture (Generator Conv2DTranspose + Discriminator Conv2D), adversarial training loop, Variational Autoencoder (VAE) and reparameterization trick, conditional GAN for specific class generation, Wasserstein GAN for stable training, latent space interpolation, and production GAN training tips.