📋 Table of Contents – Page 4
- Overfitting vs Underfitting – The central ML problem
- L2 Regularization – Penalty: don't have large weights
- Dropout – Randomly kill neurons during training
- Batch Normalization – Stabilize input at every layer
- Adam Optimizer – King of optimizers: momentum + adaptive LR
- Learning Rate Scheduling – LR that changes over time
- Putting It All Together – A production-ready model
- Summary & Page 5 Preview
1. Overfitting vs Underfitting
Overfitting = the model memorizes the training data and can't generalize to new data (train accuracy: 99%, test accuracy: 70%). Underfitting = the model is too simple and doesn't capture the patterns. Both are bad – our goal is the middle: the sweet spot.
💡 Analogy: Studying for an Exam
Underfitting = didn't study at all → bad grade.
Overfitting = memorized all practice problems exactly, but fails on new questions because you don't understand the concepts.
Just Right = understand the concepts → can answer new questions you've never seen.
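The 99%/70% gap above can be turned into a quick diagnostic. This is a minimal sketch (the `diagnose` helper and its thresholds are illustrative, not from the original text):

```python
def diagnose(train_acc, test_acc, gap_threshold=0.1, low_threshold=0.8):
    """Rough diagnosis from train/test accuracy (thresholds are illustrative)."""
    if train_acc < low_threshold and test_acc < low_threshold:
        return "underfitting"   # too simple: bad on BOTH sets
    if train_acc - test_acc > gap_threshold:
        return "overfitting"    # large train/test gap = memorization
    return "just right"

print(diagnose(0.99, 0.70))  # overfitting (the example from above)
print(diagnose(0.65, 0.63))  # underfitting
print(diagnose(0.92, 0.90))  # just right
```

In practice you'd track these two numbers every epoch; a widening gap is the signal to reach for the techniques below.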
2. L2 Regularization (Weight Decay)
Large weights → an overconfident model → overfitting. L2 Regularization adds a penalty to the loss function: Loss = Loss_original + λ × Σ(w²). This forces weights to stay small and keeps the model more "humble".
```python
import numpy as np

class RegularizedNetwork:
    """Network with L2 Regularization"""
    def __init__(self, sizes, lambd=0.01):
        self.L = len(sizes) - 1
        self.lambd = lambd  # regularization strength
        self.params = {}
        for l in range(1, self.L + 1):
            self.params[f'W{l}'] = np.random.randn(
                sizes[l-1], sizes[l]) * np.sqrt(2.0 / sizes[l-1])
            self.params[f'b{l}'] = np.zeros((1, sizes[l]))

    def compute_loss(self, y_pred, y_true):
        """Cross-entropy + L2 penalty"""
        m = y_true.shape[0]
        ce_loss = -np.sum(y_true * np.log(np.clip(y_pred, 1e-12, 1))) / m
        # L2 penalty: sum of all weights squared
        l2_penalty = 0
        for l in range(1, self.L + 1):
            l2_penalty += np.sum(self.params[f'W{l}'] ** 2)
        l2_penalty *= self.lambd / (2 * m)
        return ce_loss + l2_penalty

    def backward_with_l2(self, dW, W, m):
        """Add the L2 gradient to the weight gradient"""
        # dW_regularized = dW + (λ/m) * W
        return dW + (self.lambd / m) * W

# Usage in training:
# dW2 = (1/m) * a1.T @ dz2
# dW2 = model.backward_with_l2(dW2, W2, m)  # ← added penalty!
# W2 -= lr * dW2

print("λ=0.0  → no regularization (might overfit)")
print("λ=0.01 → mild regularization (good start)")
print("λ=0.1  → strong regularization (might underfit)")
```
📌 λ (lambda) controls the penalty strength. Too small → ineffective. Too large → underfitting. Start at 0.01 and tune from there.
3. Dropout – Randomly Kill Neurons
Dropout = during training, randomly deactivate some neurons (typically 20-50%). This forces the network not to rely on any single neuron – each neuron must be useful on its own. During inference, all neurons are active (classic dropout scales outputs at inference; the inverted variant used below scales during training instead).
```python
import numpy as np

class DropoutLayer:
    """Inverted Dropout – no scaling needed at inference"""
    def __init__(self, drop_rate=0.3):
        self.drop_rate = drop_rate
        self.training = True

    def forward(self, x):
        if not self.training:
            return x  # inference: no dropout!
        # Create a random mask (same shape as x)
        self.mask = (np.random.rand(*x.shape) > self.drop_rate)
        # Inverted: scale up by 1/(1-p) during training,
        # so we DON'T need to scale during inference
        return x * self.mask / (1 - self.drop_rate)

    def backward(self, d_out):
        return d_out * self.mask / (1 - self.drop_rate)

# ===========================
# Demo
# ===========================
dropout = DropoutLayer(drop_rate=0.3)
x = np.array([[1.0, 2.0, 3.0, 4.0, 5.0]])

# Training: some neurons killed
dropout.training = True
print("Training: ", dropout.forward(x))
# e.g. [[1.43 0.   4.29 5.71 0.  ]] ← zeros + scaled

# Inference: all neurons active, no scaling needed
dropout.training = False
print("Inference:", dropout.forward(x))
# [[1. 2. 3. 4. 5.]] ← untouched!
```
💡 Inverted Dropout: We scale outputs during training (divide by 1-p), not during inference. Benefit: at inference time, nothing needs to change – faster and simpler in production.
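A quick numerical sanity check of that claim (my own illustration, not from the original text): thanks to the 1/(1-p) scaling, the mean activation with dropout on stays close to the mean with dropout off:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3
x = np.ones((1, 100_000))  # large layer so the averages are stable

# Inverted dropout by hand: zero out ~30%, scale survivors by 1/(1-p)
mask = rng.random(x.shape) > p
dropped = x * mask / (1 - p)

print(f"mean without dropout: {x.mean():.3f}")        # 1.000
print(f"mean with dropout:    {dropped.mean():.3f}")  # ≈ 1.000
```

That preserved expectation is exactly why inference can use the raw activations unchanged.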
4. Batch Normalization
Batch Normalization normalizes each layer's output so mean ≈ 0 and variance ≈ 1 (per mini-batch). This addresses internal covariate shift – the input distribution of each layer changes every time the weights are updated. Result: much faster training, more stable gradients, and you can use larger learning rates.
```python
import numpy as np

class BatchNormLayer:
    """Batch Normalization for fully connected layers"""
    def __init__(self, num_features):
        # Learnable parameters
        self.gamma = np.ones((1, num_features))   # scale
        self.beta = np.zeros((1, num_features))   # shift
        # Running stats for inference
        self.running_mean = np.zeros((1, num_features))
        self.running_var = np.ones((1, num_features))
        self.momentum = 0.9
        self.training = True
        self.eps = 1e-8

    def forward(self, x):
        if self.training:
            # Use batch statistics
            self.mean = x.mean(axis=0, keepdims=True)
            self.var = x.var(axis=0, keepdims=True)
            # Normalize
            self.x_norm = (x - self.mean) / np.sqrt(self.var + self.eps)
            # Update running stats
            self.running_mean = (self.momentum * self.running_mean
                                 + (1 - self.momentum) * self.mean)
            self.running_var = (self.momentum * self.running_var
                                + (1 - self.momentum) * self.var)
        else:
            # Use running stats at inference
            self.x_norm = (x - self.running_mean) / np.sqrt(
                self.running_var + self.eps)
        # Scale & shift (learnable!)
        return self.gamma * self.x_norm + self.beta

    def backward(self, d_out, lr=0.01):
        m = d_out.shape[0]
        std = np.sqrt(self.var + self.eps)
        # Gradients for gamma and beta
        d_gamma = np.sum(d_out * self.x_norm, axis=0, keepdims=True)
        d_beta = np.sum(d_out, axis=0, keepdims=True)
        # Gradient through normalization
        dx_norm = d_out * self.gamma
        d_var = np.sum(dx_norm * (self.x_norm * -0.5 / (self.var + self.eps)),
                       axis=0, keepdims=True)
        d_mean = np.sum(dx_norm * -1 / std, axis=0, keepdims=True)
        # Note: (x - mean) = x_norm * std, so the variance term
        # needs the extra factor of std
        dx = (dx_norm / std
              + d_var * 2 * self.x_norm * std / m
              + d_mean / m)
        # Update learnable params
        self.gamma -= lr * d_gamma
        self.beta -= lr * d_beta
        return dx
```
📌 γ (gamma) and β (beta) are parameters learned by the network. After normalization, the network can choose the optimal scale and shift through backpropagation – so BatchNorm doesn't limit representational power.
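To see what the normalization step does before γ and β are applied, here is a small standalone check (synthetic data and shapes are my own illustration): every feature of a training batch ends up with mean ≈ 0 and variance ≈ 1 regardless of its original scale:

```python
import numpy as np

rng = np.random.default_rng(1)
# A batch of 64 samples, 4 features, deliberately off-center and wide
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))

# The normalization step inside BatchNorm (before scale & shift)
mean = x.mean(axis=0, keepdims=True)
var = x.var(axis=0, keepdims=True)
x_norm = (x - mean) / np.sqrt(var + 1e-8)

print(x_norm.mean(axis=0).round(3))  # ≈ [0. 0. 0. 0.]
print(x_norm.var(axis=0).round(3))   # ≈ [1. 1. 1. 1.]
```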
5. Adam Optimizer – King of Optimizers
Adam (Adaptive Moment Estimation) combines two ideas: Momentum (remembering previous gradient direction) and RMSprop (adapting LR per-parameter). It's the default optimizer that almost always works well.
```python
import numpy as np

class AdamOptimizer:
    """Adam: Adaptive Moment Estimation"""
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1  # momentum decay
        self.beta2 = beta2  # RMSprop decay
        self.eps = eps
        self.m = {}  # 1st moment (mean of gradients)
        self.v = {}  # 2nd moment (mean of squared gradients)
        self.t = 0   # timestep

    def update(self, params, grads):
        """Update all parameters using Adam"""
        self.t += 1
        for key in params:
            if key not in self.m:
                self.m[key] = np.zeros_like(params[key])
                self.v[key] = np.zeros_like(params[key])
            g = grads[key]
            # Update biased moments
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * g
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * g**2
            # Bias correction (important early on!)
            m_hat = self.m[key] / (1 - self.beta1**self.t)
            v_hat = self.v[key] / (1 - self.beta2**self.t)
            # Update parameter
            params[key] -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
        return params

# Usage:
# adam = AdamOptimizer(lr=0.001)
# for epoch in range(100):
#     grads = compute_gradients(...)  # {'W1': dW1, 'b1': db1, ...}
#     params = adam.update(params, grads)
```
📌 Adam Default Hyperparameters: lr=0.001, β₁=0.9, β₂=0.999, ε=1e-8
These defaults work well for most cases. Usually the only thing you need to tune is the learning rate. Start at 0.001; if the loss doesn't decrease, try 0.01 or 0.0001.
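To watch the update rule in action, here's a toy run minimizing f(w) = w² with a single-parameter Adam inlined so the snippet stands alone (a sketch for illustration; the larger lr=0.1 is chosen just so the toy converges quickly):

```python
import numpy as np

lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
w, m, v = 5.0, 0.0, 0.0  # start far from the minimum at w = 0

for t in range(1, 201):
    g = 2 * w                       # gradient of w**2
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)      # bias correction
    v_hat = v / (1 - beta2**t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(f"w after 200 steps: {w:.3f}")  # close to the minimum at 0
```

Note how early steps are roughly ±lr in size (m_hat/√v_hat ≈ 1 when the gradient direction is consistent) – that bounded step size is what makes Adam so forgiving of the learning rate.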
6. Learning Rate Scheduling
A fixed learning rate can be sub-optimal. Ideally: large LR early to quickly approach the minimum, then small LR later for precise convergence. This is called LR scheduling.
```python
import numpy as np

# ===========================
# 1. Step Decay – halve LR every N epochs
# ===========================
def step_decay(initial_lr, epoch, drop_every=10, drop_factor=0.5):
    return initial_lr * (drop_factor ** (epoch // drop_every))

# ===========================
# 2. Exponential Decay
# ===========================
def exp_decay(initial_lr, epoch, decay_rate=0.95):
    return initial_lr * (decay_rate ** epoch)

# ===========================
# 3. Cosine Annealing – smooth decrease
# ===========================
def cosine_annealing(initial_lr, epoch, total_epochs):
    return initial_lr * 0.5 * (1 + np.cos(np.pi * epoch / total_epochs))

# ===========================
# 4. Warmup + Decay (modern best practice)
# ===========================
def warmup_cosine(initial_lr, epoch, warmup=5, total=100):
    if epoch < warmup:
        return initial_lr * (epoch + 1) / warmup  # linear warmup
    return initial_lr * 0.5 * (1 + np.cos(
        np.pi * (epoch - warmup) / (total - warmup)))

# Print LR over epochs
print("Warmup + Cosine Schedule:")
for e in [0, 2, 5, 10, 50, 90, 99]:
    print(f"  Epoch {e:>2}: lr = {warmup_cosine(0.001, e, 5, 100):.6f}")
# Epoch  0: lr = 0.000200  ← warming up
# Epoch  5: lr = 0.001000  ← peak
# Epoch 50: lr = 0.000541  ← decaying
# Epoch 99: lr = 0.000000  ← near zero
```
7. Putting It All Together – A Production-Ready Model
Now let's combine all techniques in one training pipeline. This is the blueprint used in virtually all modern deep learning models.
```python
import numpy as np

# Assumes RegularizedNetwork, DropoutLayer, BatchNormLayer, AdamOptimizer,
# and warmup_cosine from the sections above, plus softmax() and evaluate()
# helpers and one-hot labels y_train_oh.

def train_production_model(X_train, y_train_oh, X_test, y_test):
    """Complete training pipeline with all techniques"""
    # Architecture
    sizes = [784, 256, 128, 10]
    model = RegularizedNetwork(sizes, lambd=0.005)
    dropout1 = DropoutLayer(drop_rate=0.3)
    dropout2 = DropoutLayer(drop_rate=0.2)
    bn1 = BatchNormLayer(256)
    bn2 = BatchNormLayer(128)
    adam = AdamOptimizer(lr=0.001)

    epochs = 30
    batch_size = 64
    best_acc = 0

    print("🚀 Training with ALL techniques")
    print("   L2=0.005 | Dropout=0.3/0.2 | BatchNorm | Adam | Cosine LR")
    print("─" * 55)

    for epoch in range(epochs):
        # LR schedule (feed the scheduled LR into the optimizer)
        lr = warmup_cosine(0.001, epoch, warmup=3, total=epochs)
        adam.lr = lr

        # Set training mode
        for layer in (dropout1, dropout2, bn1, bn2):
            layer.training = True

        # Mini-batch training
        idx = np.random.permutation(len(X_train))
        for i in range(0, len(X_train), batch_size):
            Xb = X_train[idx[i:i+batch_size]]
            yb = y_train_oh[idx[i:i+batch_size]]

            # Forward: Linear → BN → ReLU → Dropout (per layer)
            z1 = Xb @ model.params['W1'] + model.params['b1']
            z1 = bn1.forward(z1)
            a1 = np.maximum(0, z1)  # ReLU
            a1 = dropout1.forward(a1)

            z2 = a1 @ model.params['W2'] + model.params['b2']
            z2 = bn2.forward(z2)
            a2 = np.maximum(0, z2)
            a2 = dropout2.forward(a2)

            z3 = a2 @ model.params['W3'] + model.params['b3']
            a3 = softmax(z3)

            # Backward + Adam update (simplified)
            # ... backprop through each layer in reverse ...
            # adam.update(model.params, grads)

        # Evaluate (switch to inference mode)
        for layer in (dropout1, dropout2, bn1, bn2):
            layer.training = False
        if (epoch + 1) % 5 == 0:
            test_acc = evaluate(model, X_test, y_test)
            print(f"  Epoch {epoch+1:>2} – lr={lr:.6f} – Test: {test_acc:.1f}%")
            if test_acc > best_acc:
                best_acc = test_acc

    print(f"\n🎯 Best Test Accuracy: {best_acc:.1f}%")
```
📋 Production-Ready Model Checklist:
✅ He Initialization – proper weight init
✅ ReLU hidden layers + Softmax output
✅ Batch Normalization – after linear, before activation
✅ Dropout – after activation (0.2-0.5)
✅ L2 Regularization – λ = 0.001-0.01
✅ Adam optimizer – lr = 0.001
✅ LR Schedule – warmup + cosine annealing
✅ Mini-batch – batch size 32-128
✅ Early stopping – stop when validation loss increases
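Early stopping appears in the checklist but not in the pipeline above. A minimal patience-based sketch (the `EarlyStopping` class and its names are my own illustration):

```python
class EarlyStopping:
    """Stop training when val loss hasn't improved for `patience` epochs."""
    def __init__(self, patience=5):
        self.patience = patience
        self.best_loss = float("inf")
        self.counter = 0

    def step(self, val_loss):
        """Call once per epoch; returns True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss  # improvement: reset the counter
            self.counter = 0
        else:
            self.counter += 1          # no improvement this epoch
        return self.counter >= self.patience

# Demo: val loss improves for 3 epochs, then drifts up
stopper = EarlyStopping(patience=3)
for epoch, val_loss in enumerate([0.9, 0.7, 0.6, 0.65, 0.66, 0.7]):
    if stopper.step(val_loss):
        print(f"Stopping at epoch {epoch}")
        break
```

In the training loop above, you would call `stopper.step(val_loss)` right after the evaluation block and `break` out of the epoch loop when it returns True (ideally restoring the best weights seen so far).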
8. Page 4 Summary
| Concept | What It Is | Key Code |
|---|---|---|
| Overfitting | Model memorizes training data | train_acc ≫ test_acc |
| L2 Regularization | Penalty for large weights → simpler model | loss += λ*Σ(w²) |
| Dropout | Randomly kill neurons during training | x * mask / (1-p) |
| Batch Norm | Normalize each layer's output | γ*(x-μ)/σ + β |
| Adam | Optimizer: momentum + adaptive LR | AdamOptimizer(lr=.001) |
| LR Schedule | LR changes over epochs | cosine_annealing() |
| Warmup | LR ramps up slowly at start | lr * epoch / warmup |
| Early Stopping | Stop when val loss increases | if val_loss > best |
← Page 3: Convolutional Neural Network (CNN)
Coming Next: Page 5 – Recurrent Neural Network (RNN) & Sequence Data
Processing sequential data: text, time series, music. Understanding RNNs, the vanishing gradient problem, and building LSTM/GRU from scratch. Creating a text generator and sentiment analyzer. Stay tuned!