๐Ÿ“ Artikel ini ditulis dalam Bahasa Indonesia & English
๐Ÿ“ This article is available in English & Bahasa Indonesia

โš™๏ธ Tutorial Neural Network โ€” Page 4Neural Network Tutorial โ€” Page 4

Regularization &
Advanced Optimization

An accurate model isn't enough — it must be robust and generalize to new data. Page 4 covers: overfitting vs underfitting, Dropout, L2 Regularization, Batch Normalization, plus the powerful Adam optimizer and learning rate scheduling. Building production-ready models.

📅 March 2026 ⏱ 26 min read
๐Ÿท OverfittingDropoutL2BatchNormAdamLR Schedule
📚 Neural Network Tutorial Series:

📑 Table of Contents — Page 4

  1. Overfitting vs Underfitting — The central ML problem
  2. L2 Regularization — A penalty on large weights
  3. Dropout — Randomly kill neurons during training
  4. Batch Normalization — Stabilize input at every layer
  5. Adam Optimizer — King of optimizers: momentum + adaptive LR
  6. Learning Rate Scheduling — LR that changes over time
  7. Putting It All Together — A production-ready model
  8. Summary & Page 5 Preview
📊

1. Overfitting vs Underfitting

The fundamental problem: the model "memorizes" or is too "dumb"

Overfitting = the model memorizes the training data and can't generalize to new data (train accuracy 99%, test accuracy 70%). Underfitting = the model is too simple and doesn't capture the patterns. Both are bad — our goal is the middle: the sweet spot.

Underfitting             Just Right               Overfitting
Train: 60%               Train: 95%               Train: 99.9%
Test:  58%               Test:  93%               Test:  72%
Too simple               Captures the             Memorizes noise
(high bias)              true pattern             (high variance)
Fix: more layers,        ✅ Goal!                 Fix: regularization,
more neurons                                      dropout, more data

💡 Analogy: Studying for an Exam
Underfitting = didn't study at all → bad grade.
Overfitting = memorized all the practice problems exactly, but fails on new questions because you don't understand the concepts.
Just Right = understand the concepts → you can answer new questions you've never seen.
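The same pattern is easy to reproduce numerically. Below is a small self-contained sketch (a toy 1-D polynomial regression, separate from the tutorial's network code): a low degree underfits, while a very high degree drives the train error toward zero at the expense of the test error. The dataset, noise level, and degrees are illustrative choices.

```python
import numpy as np

# Toy 1-D regression: the true signal is sin(2*pi*x), plus noise.
rng = np.random.default_rng(42)
x_train = np.linspace(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 15)
x_test = np.linspace(0.02, 0.98, 50)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 50)

def poly_mse(degree):
    """Fit a polynomial of the given degree; return (train_mse, test_mse)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for degree in [1, 4, 12]:
    tr, te = poly_mse(degree)
    print(f"degree {degree:>2}: train MSE {tr:.3f} | test MSE {te:.3f}")
# Typical pattern: train MSE shrinks as the degree grows, while for the
# highest degree the test MSE is usually much larger than the train MSE.
```

Watching the gap between train and test error, exactly as in the table above, is how overfitting is diagnosed in practice.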

⚖️

2. L2 Regularization (Weight Decay)

Add a "penalty" for weights that grow too large

Large weights → the model is overconfident → overfitting. L2 Regularization adds a penalty to the loss function: Loss = Loss_original + λ × Σ(w²). This forces weights to stay small and keeps the model more "humble".

17_l2_regularization.py
import numpy as np

class RegularizedNetwork:
    """Network with L2 Regularization"""

    def __init__(self, sizes, lambd=0.01):
        self.L = len(sizes) - 1
        self.lambd = lambd  # regularization strength
        self.params = {}
        for l in range(1, self.L + 1):
            self.params[f'W{l}'] = np.random.randn(
                sizes[l-1], sizes[l]) * np.sqrt(2.0/sizes[l-1])
            self.params[f'b{l}'] = np.zeros((1, sizes[l]))

    def compute_loss(self, y_pred, y_true):
        """Cross-entropy + L2 penalty"""
        m = y_true.shape[0]
        ce_loss = -np.sum(y_true * np.log(np.clip(y_pred, 1e-12, 1))) / m

        # L2 penalty: sum of all weights squared
        l2_penalty = 0
        for l in range(1, self.L + 1):
            l2_penalty += np.sum(self.params[f'W{l}'] ** 2)
        l2_penalty *= self.lambd / (2 * m)

        return ce_loss + l2_penalty

    def backward_with_l2(self, dW, W, m):
        """Add L2 gradient to weight gradient"""
        # dW_regularized = dW + (λ/m) * W
        return dW + (self.lambd / m) * W

# Usage in training:
# dW2 = (1/m) * a1.T @ dz2
# dW2 = model.backward_with_l2(dW2, W2, m)  ← added penalty!
# W2 -= lr * dW2

print("λ=0.0  → no regularization (might overfit)")
print("λ=0.01 → mild regularization (good start)")
print("λ=0.1  → strong regularization (might underfit)")

🎓 λ (lambda) controls the penalty strength. Too small → ineffective. Too large → underfitting. Start from 0.01 and tune from there.
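Why L2 is also called "weight decay": one gradient step with the L2 term is algebraically identical to first shrinking W by a factor (1 - lr·λ/m) and then applying the plain gradient step. A quick numerical check of this equivalence (the matrix sizes and hyperparameter values are arbitrary):

```python
import numpy as np

# One SGD step with the L2 gradient term:
#     W -= lr * (dW + (lambda/m) * W)
# is the same as decaying W first, then taking the plain step:
#     W = W * (1 - lr*lambda/m) - lr * dW
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))    # arbitrary weights
dW = rng.standard_normal((4, 3))   # arbitrary gradient
lr, lambd, m = 0.1, 0.01, 64

step_with_l2 = W - lr * (dW + (lambd / m) * W)
step_with_decay = W * (1 - lr * lambd / m) - lr * dW

print(np.allclose(step_with_l2, step_with_decay))  # True
```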

🎲

3. Dropout — Randomly Kill Neurons

The most popular regularization technique in deep learning

Dropout = during training, randomly deactivate some neurons (typically 20-50%). This forces the network not to rely on any single neuron — each neuron must be useful on its own. During inference, all neurons are active (but outputs are scaled).

Dropout: Training vs Inference
- Training (drop_rate = 0.3): ~30% of neurons are zeroed out, with a different random mask for every batch.
- Inference (no dropout): all neurons active; either scale outputs by (1 - drop_rate), or use inverted dropout so no scaling is needed.

18_dropout.py — Dropout from Scratch
import numpy as np

class DropoutLayer:
    """Inverted Dropout โ€” no scaling needed at inference"""

    def __init__(self, drop_rate=0.3):
        self.drop_rate = drop_rate
        self.training = True

    def forward(self, x):
        if not self.training:
            return x  # inference: no dropout!

        # Create random mask (same shape as x)
        self.mask = (np.random.rand(*x.shape) > self.drop_rate)
        # Inverted: scale up by 1/(1-p) during training
        # So we DON'T need to scale during inference
        return x * self.mask / (1 - self.drop_rate)

    def backward(self, d_out):
        return d_out * self.mask / (1 - self.drop_rate)

# ===========================
# Demo
# ===========================
dropout = DropoutLayer(drop_rate=0.3)

x = np.array([[1.0, 2.0, 3.0, 4.0, 5.0]])

# Training: some neurons killed
dropout.training = True
print("Training:", dropout.forward(x))
# e.g. [[ 1.43  0.    4.29  5.71  0.  ]]  ← zeros + scaled

# Inference: all neurons active, no scaling needed
dropout.training = False
print("Inference:", dropout.forward(x))
# [[ 1.  2.  3.  4.  5. ]]  ← untouched!

💡 Inverted Dropout: we scale outputs during training (divide by 1-p), not during inference. Benefit: at inference time nothing needs to change — faster and simpler in production.
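A quick sanity check of that claim: with inverted dropout the expected activation is unchanged, which is exactly why inference can skip any rescaling. This standalone sketch averages many dropout draws on a constant activation (the trial count and drop_rate are illustrative):

```python
import numpy as np

# Many draws of the inverted-dropout forward pass on a constant activation:
# zeroing with probability p and dividing by (1 - p) leaves E[out] = x.
rng = np.random.default_rng(1)
drop_rate = 0.3
x = np.ones((100_000, 1))                 # activation fixed at 1.0

mask = rng.random(x.shape) > drop_rate    # keep with probability 0.7
out = x * mask / (1 - drop_rate)

print(round(out.mean(), 3))  # close to 1.0
```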

📏

4. Batch Normalization

Stabilize input distribution at every layer โ€” faster training

Batch Normalization normalizes each layer's output so that mean ≈ 0 and variance ≈ 1 (per mini-batch). This addresses internal covariate shift — a layer's input distribution changes every time the weights are updated. Result: much faster training, more stable gradients, and you can use larger learning rates.

19_batchnorm.py — Batch Normalization
import numpy as np

class BatchNormLayer:
    """Batch Normalization for fully connected layers"""

    def __init__(self, num_features):
        # Learnable parameters
        self.gamma = np.ones((1, num_features))    # scale
        self.beta = np.zeros((1, num_features))    # shift
        # Running stats for inference
        self.running_mean = np.zeros((1, num_features))
        self.running_var = np.ones((1, num_features))
        self.momentum = 0.9
        self.training = True
        self.eps = 1e-8

    def forward(self, x):
        if self.training:
            # Use batch statistics
            self.mean = x.mean(axis=0, keepdims=True)
            self.var = x.var(axis=0, keepdims=True)

            # Normalize
            self.x_norm = (x - self.mean) / np.sqrt(self.var + self.eps)

            # Update running stats
            self.running_mean = (self.momentum * self.running_mean
                                + (1 - self.momentum) * self.mean)
            self.running_var = (self.momentum * self.running_var
                               + (1 - self.momentum) * self.var)
        else:
            # Use running stats at inference
            self.x_norm = (x - self.running_mean) / np.sqrt(
                self.running_var + self.eps)

        # Scale & shift (learnable!)
        return self.gamma * self.x_norm + self.beta

    def backward(self, d_out, lr=0.01):
        m = d_out.shape[0]

        # Gradients for gamma and beta
        d_gamma = np.sum(d_out * self.x_norm, axis=0, keepdims=True)
        d_beta = np.sum(d_out, axis=0, keepdims=True)

        # Gradient through normalization
        dx_norm = d_out * self.gamma
        d_var = np.sum(dx_norm * (self.x_norm * -0.5
                  / (self.var + self.eps)), axis=0, keepdims=True)
        d_mean = np.sum(dx_norm * -1/np.sqrt(self.var + self.eps),
                  axis=0, keepdims=True)
        # note: x - mean == x_norm * sqrt(var + eps)
        dx = (dx_norm / np.sqrt(self.var + self.eps)
              + d_var * 2 * self.x_norm * np.sqrt(self.var + self.eps) / m
              + d_mean / m)

        # Update learnable params
        self.gamma -= lr * d_gamma
        self.beta -= lr * d_beta
        return dx

🎓 γ (gamma) and β (beta) are parameters learned by the network. After normalization, the network can choose the optimal scale and shift through backpropagation — so BatchNorm doesn't limit representational power.
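The normalization step itself is easy to verify in isolation. A minimal standalone sketch of just the normalize step (equivalent to γ=1, β=0; the batch shape and the deliberately off-scale values are arbitrary):

```python
import numpy as np

# Apply only the normalization step to a badly scaled batch: every feature
# column ends up with mean ~0 and variance ~1, whatever scale it started at.
rng = np.random.default_rng(7)
x = rng.standard_normal((64, 8)) * 50.0 + 300.0   # mean ~300, std ~50

mean = x.mean(axis=0, keepdims=True)
var = x.var(axis=0, keepdims=True)
x_norm = (x - mean) / np.sqrt(var + 1e-8)

print(x_norm.mean(axis=0).round(4))   # all ~0
print(x_norm.var(axis=0).round(4))    # all ~1
```

The γ and β parameters then rescale this standardized signal however the loss prefers.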

🚀

5. Adam Optimizer — King of Optimizers

Momentum + Adaptive Learning Rate = fast and stable convergence

Adam (Adaptive Moment Estimation) combines two ideas: Momentum (remembering the previous gradient direction) and RMSprop (adapting the LR per parameter). It's the default optimizer that almost always works well.

Optimizer Comparison
- SGD (lr = 0.01, fixed): zig-zag path; simple but slow.
- SGD + Momentum (lr = 0.01, β = 0.9): smoother trajectory.
- Adam (lr = 0.001 default, β₁ = 0.9, β₂ = 0.999): fastest and smoothest; best for most cases.

20_adam_optimizer.py — Adam from Scratch
import numpy as np

class AdamOptimizer:
    """Adam: Adaptive Moment Estimation"""

    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1    # momentum decay
        self.beta2 = beta2    # RMSprop decay
        self.eps = eps
        self.m = {}           # 1st moment (mean of gradients)
        self.v = {}           # 2nd moment (mean of squared gradients)
        self.t = 0            # timestep

    def update(self, params, grads):
        """Update all parameters using Adam"""
        self.t += 1

        for key in params:
            if key not in self.m:
                self.m[key] = np.zeros_like(params[key])
                self.v[key] = np.zeros_like(params[key])

            g = grads[key]

            # Update biased moments
            self.m[key] = self.beta1 * self.m[key] + (1-self.beta1) * g
            self.v[key] = self.beta2 * self.v[key] + (1-self.beta2) * g**2

            # Bias correction (important early on!)
            m_hat = self.m[key] / (1 - self.beta1**self.t)
            v_hat = self.v[key] / (1 - self.beta2**self.t)

            # Update parameter
            params[key] -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

        return params

# Usage:
# adam = AdamOptimizer(lr=0.001)
# for epoch in range(100):
#     grads = compute_gradients(...)  # {W1: dW1, b1: db1, ...}
#     params = adam.update(params, grads)

🎓 Adam Default Hyperparameters:
lr=0.001, β₁=0.9, β₂=0.999, ε=1e-8
These defaults work well for most cases. Usually the only thing you need to tune is the learning rate: start at 0.001; if the loss doesn't decrease, try 0.01 or 0.0001.
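To see the update rule in action, here is a minimal scalar version of the same Adam equations applied to a toy problem: minimizing f(w) = (w - 3)², whose gradient is 2(w - 3). The starting point, learning rate, and step count are illustrative:

```python
import numpy as np

# Scalar Adam on f(w) = (w - 3)**2, gradient g = 2*(w - 3), starting at w = 0.
lr, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
w, m, v = 0.0, 0.0, 0.0

for t in range(1, 2001):
    g = 2 * (w - 3)                        # gradient at current w
    m = beta1 * m + (1 - beta1) * g        # 1st moment (momentum)
    v = beta2 * v + (1 - beta2) * g ** 2   # 2nd moment (RMSprop-style)
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(f"w = {w:.2f}")  # ends near the minimum at w = 3
```

Note how early on m_hat/√v_hat has magnitude near 1, so the step size is roughly lr regardless of the raw gradient scale.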

📉

6. Learning Rate Scheduling

Large LR early (exploration), small LR later (fine-tuning)

A fixed learning rate can be sub-optimal. Ideally: a large LR early to approach the minimum quickly, then a small LR later for precise convergence. This is called LR scheduling.

21_lr_scheduling.py
import numpy as np

# ===========================
# 1. Step Decay — halve LR every N epochs
# ===========================
def step_decay(initial_lr, epoch, drop_every=10, drop_factor=0.5):
    return initial_lr * (drop_factor ** (epoch // drop_every))

# ===========================
# 2. Exponential Decay
# ===========================
def exp_decay(initial_lr, epoch, decay_rate=0.95):
    return initial_lr * (decay_rate ** epoch)

# ===========================
# 3. Cosine Annealing — smooth decrease
# ===========================
def cosine_annealing(initial_lr, epoch, total_epochs):
    return initial_lr * 0.5 * (1 + np.cos(np.pi * epoch / total_epochs))

# ===========================
# 4. Warmup + Decay (modern best practice)
# ===========================
def warmup_cosine(initial_lr, epoch, warmup=5, total=100):
    if epoch < warmup:
        return initial_lr * (epoch + 1) / warmup  # linear warmup
    return initial_lr * 0.5 * (1 + np.cos(
        np.pi * (epoch - warmup) / (total - warmup)))

# Print LR over epochs
print("Warmup + Cosine Schedule:")
for e in [0, 2, 5, 10, 50, 90, 99]:
    print(f"  Epoch {e:>2}: lr = {warmup_cosine(0.001, e, 5, 100):.6f}")
# Epoch  0: lr = 0.000200  ← warming up
# Epoch  5: lr = 0.001000  ← peak
# Epoch 50: lr = 0.000541  ← decaying
# Epoch 99: lr = 0.000000  ← near zero
๐Ÿ†

7. Putting It All Together — Production-Ready Model

L2 + Dropout + BatchNorm + Adam + LR Schedule = state-of-the-art

Now let's combine all the techniques in one training pipeline. This is the blueprint used in virtually all modern deep learning models.

22_production_model.py — Full Pipeline 🏆
import numpy as np

def train_production_model(X_train, y_train, X_test, y_test):
    """Complete training pipeline with all techniques"""

    # One-hot encode the labels here (used as y_train_oh in the loop below)
    y_train_oh = np.eye(10)[y_train]

    # Architecture
    sizes = [784, 256, 128, 10]
    model = RegularizedNetwork(sizes, lambd=0.005)
    dropout1 = DropoutLayer(drop_rate=0.3)
    dropout2 = DropoutLayer(drop_rate=0.2)
    bn1 = BatchNormLayer(256)
    bn2 = BatchNormLayer(128)
    adam = AdamOptimizer(lr=0.001)

    epochs = 30
    batch_size = 64
    best_acc = 0

    print("๐Ÿ† Training with ALL techniques")
    print("   L2=0.005 | Dropout=0.3/0.2 | BatchNorm | Adam | Cosine LR")
    print("โ”€" * 55)

    for epoch in range(epochs):
        # LR Schedule
        lr = warmup_cosine(0.001, epoch, warmup=3, total=epochs)

        # Set training mode
        dropout1.training = True
        dropout2.training = True
        bn1.training = True
        bn2.training = True

        # Mini-batch training
        idx = np.random.permutation(len(X_train))
        for i in range(0, len(X_train), batch_size):
            Xb = X_train[idx[i:i+batch_size]]
            yb = y_train_oh[idx[i:i+batch_size]]

            # Forward: Linear โ†’ BN โ†’ ReLU โ†’ Dropout (per layer)
            z1 = Xb @ model.params['W1'] + model.params['b1']
            z1 = bn1.forward(z1)
            a1 = np.maximum(0, z1)  # ReLU
            a1 = dropout1.forward(a1)

            z2 = a1 @ model.params['W2'] + model.params['b2']
            z2 = bn2.forward(z2)
            a2 = np.maximum(0, z2)
            a2 = dropout2.forward(a2)

            z3 = a2 @ model.params['W3'] + model.params['b3']
            a3 = softmax(z3)

            # Backward + Adam update (simplified)
            # ... backprop through each layer in reverse ...
            # adam.update(model.params, grads)

        # Evaluate (switch to inference mode)
        dropout1.training = False
        dropout2.training = False
        bn1.training = False
        bn2.training = False

        if (epoch+1) % 5 == 0:
            test_acc = evaluate(model, X_test, y_test)
            print(f"  Epoch {epoch+1:>2} │ lr={lr:.6f} │ Test: {test_acc:.1f}%")
            if test_acc > best_acc:
                best_acc = test_acc

    print(f"\n🎯 Best Test Accuracy: {best_acc:.1f}%")

🎉 Production-Ready Model Checklist:
✅ He Initialization — proper weight init
✅ ReLU hidden layers + Softmax output
✅ Batch Normalization — after linear, before activation
✅ Dropout — after activation (0.2-0.5)
✅ L2 Regularization — λ = 0.001-0.01
✅ Adam optimizer — lr = 0.001
✅ LR Schedule — warmup + cosine annealing
✅ Mini-batch — batch size 32-128
✅ Early stopping — stop when validation loss increases
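Early stopping appears on the checklist but not in the pipeline code itself. A minimal sketch of the usual patience logic (the helper name and the loss sequence below are made up for illustration):

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch index at which training would stop."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss       # improvement: remember it, reset patience
            wait = 0
        else:
            wait += 1         # no improvement this epoch
            if wait >= patience:
                return epoch  # patience exhausted: stop
    return len(val_losses) - 1

# Validation loss improves until epoch 2, then rises: stop 3 epochs later.
losses = [1.00, 0.80, 0.70, 0.75, 0.76, 0.77, 0.90]
print("stop at epoch", early_stopping_epoch(losses))  # stop at epoch 5
```

In practice you would also keep a copy of the weights from the best epoch and restore them when stopping.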

📝

8. Page 4 Summary

What we've learned
Concept            | What It Is                                 | Key Code
-------------------|--------------------------------------------|-----------------------
Overfitting        | Model memorizes training data              | train_acc ≫ test_acc
L2 Regularization  | Penalty for large weights → simpler model  | loss += λ*Σ(w²)
Dropout            | Randomly kill neurons during training      | x * mask / (1-p)
Batch Norm         | Normalize each layer's output              | γ*(x-μ)/σ + β
Adam               | Optimizer: momentum + adaptive LR          | AdamOptimizer(lr=.001)
LR Schedule        | LR changes over epochs                     | cosine_annealing()
Warmup             | LR ramps up slowly at start                | lr * epoch / warmup
Early Stopping     | Stop when val loss increases               | if val_loss > best
โ† Page Sebelumnyaโ† Previous Page

Page 3 — Convolutional Neural Network (CNN)

📘

Coming Next: Page 5 — Recurrent Neural Network (RNN) & Sequence Data

Processing sequential data: text, time series, music. Understanding RNNs, the vanishing gradient problem, and building LSTM/GRU from scratch. Creating a text generator and a sentiment analyzer. Stay tuned!