📋 Table of Contents – Page 4
- Overfitting vs Underfitting – The central ML problem
- L2 Regularization – Penalty: don't have large weights
- Dropout – Randomly kill neurons during training
- Batch Normalization – Stabilize input at every layer
- Adam Optimizer – King of optimizers: momentum + adaptive LR
- Learning Rate Scheduling – LR that changes over time
- Putting It All Together – A production-ready model
- Summary & Page 5 Preview
1. Overfitting vs Underfitting
Overfitting = the model memorizes the training data and can't generalize to new data (train accuracy: 99%, test accuracy: 70%). Underfitting = the model is too simple and doesn't capture the patterns. Both are bad – our goal is the middle: the sweet spot.
💡 Analogy: Studying for an Exam
Underfitting = didn't study at all → bad grade.
Overfitting = memorized all practice problems exactly, but fails on new questions because you don't understand the concepts.
Just Right = understand the concepts → can answer new questions you've never seen.
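The 99%/70% gap above can be turned into a quick diagnostic. This is a minimal sketch (the `diagnose` helper and its thresholds are illustrative, not from the original text):

```python
def diagnose(train_acc, test_acc, gap_threshold=0.1, low_threshold=0.8):
    """Rough diagnosis from train/test accuracy (thresholds are illustrative)."""
    if train_acc < low_threshold and test_acc < low_threshold:
        return "underfitting"   # too simple: bad on BOTH sets
    if train_acc - test_acc > gap_threshold:
        return "overfitting"    # large train/test gap = memorization
    return "just right"

print(diagnose(0.99, 0.70))  # overfitting (the example from above)
print(diagnose(0.65, 0.63))  # underfitting
print(diagnose(0.92, 0.90))  # just right
```

In practice you'd track these two numbers every epoch; a widening gap is the signal to reach for the techniques below.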
2. L2 Regularization (Weight Decay)
Large weights → an overconfident model → overfitting. L2 Regularization adds a penalty to the loss function: Loss = Loss_original + λ × Σ(w²). This forces weights to stay small and keeps the model more "humble".
```python
import numpy as np

class RegularizedNetwork:
    """Network with L2 Regularization"""
    def __init__(self, sizes, lambd=0.01):
        self.L = len(sizes) - 1
        self.lambd = lambd  # regularization strength
        self.params = {}
        for l in range(1, self.L + 1):
            self.params[f'W{l}'] = np.random.randn(
                sizes[l-1], sizes[l]) * np.sqrt(2.0 / sizes[l-1])
            self.params[f'b{l}'] = np.zeros((1, sizes[l]))

    def compute_loss(self, y_pred, y_true):
        """Cross-entropy + L2 penalty"""
        m = y_true.shape[0]
        ce_loss = -np.sum(y_true * np.log(np.clip(y_pred, 1e-12, 1))) / m
        # L2 penalty: sum of all weights squared
        l2_penalty = 0
        for l in range(1, self.L + 1):
            l2_penalty += np.sum(self.params[f'W{l}'] ** 2)
        l2_penalty *= self.lambd / (2 * m)
        return ce_loss + l2_penalty

    def backward_with_l2(self, dW, W, m):
        """Add the L2 gradient to the weight gradient"""
        # dW_regularized = dW + (λ/m) * W
        return dW + (self.lambd / m) * W

# Usage in training:
# dW2 = (1/m) * a1.T @ dz2
# dW2 = model.backward_with_l2(dW2, W2, m)  # ← added penalty!
# W2 -= lr * dW2

print("λ=0.0  → no regularization (might overfit)")
print("λ=0.01 → mild regularization (good start)")
print("λ=0.1  → strong regularization (might underfit)")
```
📌 λ (lambda) controls the penalty strength. Too small → ineffective. Too large → underfitting. Start at 0.01 and tune from there.
3. Dropout – Randomly Kill Neurons
Dropout = during training, randomly deactivate some neurons (typically 20-50%). This forces the network not to rely on any single neuron – each neuron must be useful on its own. During inference, all neurons are active (classic dropout scales outputs at inference; the inverted variant used below scales during training instead).
```python
import numpy as np

class DropoutLayer:
    """Inverted Dropout – no scaling needed at inference"""
    def __init__(self, drop_rate=0.3):
        self.drop_rate = drop_rate
        self.training = True

    def forward(self, x):
        if not self.training:
            return x  # inference: no dropout!
        # Create a random mask (same shape as x)
        self.mask = (np.random.rand(*x.shape) > self.drop_rate)
        # Inverted: scale up by 1/(1-p) during training,
        # so we DON'T need to scale during inference
        return x * self.mask / (1 - self.drop_rate)

    def backward(self, d_out):
        return d_out * self.mask / (1 - self.drop_rate)

# ===========================
# Demo
# ===========================
dropout = DropoutLayer(drop_rate=0.3)
x = np.array([[1.0, 2.0, 3.0, 4.0, 5.0]])

# Training: some neurons killed
dropout.training = True
print("Training: ", dropout.forward(x))
# e.g. [[1.43 0.   4.29 5.71 0.  ]] ← zeros + scaled

# Inference: all neurons active, no scaling needed
dropout.training = False
print("Inference:", dropout.forward(x))
# [[1. 2. 3. 4. 5.]] ← untouched!
```
💡 Inverted Dropout: We scale outputs during training (divide by 1-p), not during inference. Benefit: at inference time, nothing needs to change – faster and simpler in production.
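A quick numerical sanity check of that claim (my own illustration, not from the original text): thanks to the 1/(1-p) scaling, the mean activation with dropout on stays close to the mean with dropout off:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3
x = np.ones((1, 100_000))  # large layer so the averages are stable

# Inverted dropout by hand: zero out ~30%, scale survivors by 1/(1-p)
mask = rng.random(x.shape) > p
dropped = x * mask / (1 - p)

print(f"mean without dropout: {x.mean():.3f}")        # 1.000
print(f"mean with dropout:    {dropped.mean():.3f}")  # ≈ 1.000
```

That preserved expectation is exactly why inference can use the raw activations unchanged.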
4. Batch Normalization
Batch Normalization normalizes each layer's output so mean ≈ 0 and variance ≈ 1 (per mini-batch). This addresses internal covariate shift – the input distribution of each layer changes every time the weights are updated. Result: much faster training, more stable gradients, and you can use larger learning rates.
```python
import numpy as np

class BatchNormLayer:
    """Batch Normalization for fully connected layers"""
    def __init__(self, num_features):
        # Learnable parameters
        self.gamma = np.ones((1, num_features))   # scale
        self.beta = np.zeros((1, num_features))   # shift
        # Running stats for inference
        self.running_mean = np.zeros((1, num_features))
        self.running_var = np.ones((1, num_features))
        self.momentum = 0.9
        self.training = True
        self.eps = 1e-8

    def forward(self, x):
        if self.training:
            # Use batch statistics
            self.mean = x.mean(axis=0, keepdims=True)
            self.var = x.var(axis=0, keepdims=True)
            # Normalize
            self.x_norm = (x - self.mean) / np.sqrt(self.var + self.eps)
            # Update running stats
            self.running_mean = (self.momentum * self.running_mean
                                 + (1 - self.momentum) * self.mean)
            self.running_var = (self.momentum * self.running_var
                                + (1 - self.momentum) * self.var)
        else:
            # Use running stats at inference
            self.x_norm = (x - self.running_mean) / np.sqrt(
                self.running_var + self.eps)
        # Scale & shift (learnable!)
        return self.gamma * self.x_norm + self.beta

    def backward(self, d_out, lr=0.01):
        m = d_out.shape[0]
        std = np.sqrt(self.var + self.eps)
        # Gradients for gamma and beta
        d_gamma = np.sum(d_out * self.x_norm, axis=0, keepdims=True)
        d_beta = np.sum(d_out, axis=0, keepdims=True)
        # Gradient through normalization
        dx_norm = d_out * self.gamma
        d_var = np.sum(dx_norm * (self.x_norm * -0.5 / (self.var + self.eps)),
                       axis=0, keepdims=True)
        d_mean = np.sum(dx_norm * -1 / std, axis=0, keepdims=True)
        # Note: (x - mean) = x_norm * std, so the variance term
        # needs the extra factor of std
        dx = (dx_norm / std
              + d_var * 2 * self.x_norm * std / m
              + d_mean / m)
        # Update learnable params
        self.gamma -= lr * d_gamma
        self.beta -= lr * d_beta
        return dx
```
📌 γ (gamma) and β (beta) are parameters learned by the network. After normalization, the network can choose the optimal scale and shift through backpropagation – so BatchNorm doesn't limit representational power.
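To see what the normalization step does before γ and β are applied, here is a small standalone check (synthetic data and shapes are my own illustration): every feature of a training batch ends up with mean ≈ 0 and variance ≈ 1 regardless of its original scale:

```python
import numpy as np

rng = np.random.default_rng(1)
# A batch of 64 samples, 4 features, deliberately off-center and wide
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))

# The normalization step inside BatchNorm (before scale & shift)
mean = x.mean(axis=0, keepdims=True)
var = x.var(axis=0, keepdims=True)
x_norm = (x - mean) / np.sqrt(var + 1e-8)

print(x_norm.mean(axis=0).round(3))  # ≈ [0. 0. 0. 0.]
print(x_norm.var(axis=0).round(3))   # ≈ [1. 1. 1. 1.]
```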
5. Adam Optimizer – King of Optimizers
Adam (Adaptive Moment Estimation) combines two ideas: Momentum (remembering previous gradient direction) and RMSprop (adapting LR per-parameter). It's the default optimizer that almost always works well.
```python
import numpy as np

class AdamOptimizer:
    """Adam: Adaptive Moment Estimation"""
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1  # momentum decay
        self.beta2 = beta2  # RMSprop decay
        self.eps = eps
        self.m = {}  # 1st moment (mean of gradients)
        self.v = {}  # 2nd moment (mean of squared gradients)
        self.t = 0   # timestep

    def update(self, params, grads):
        """Update all parameters using Adam"""
        self.t += 1
        for key in params:
            if key not in self.m:
                self.m[key] = np.zeros_like(params[key])
                self.v[key] = np.zeros_like(params[key])
            g = grads[key]
            # Update biased moments
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * g
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * g**2
            # Bias correction (important early on!)
            m_hat = self.m[key] / (1 - self.beta1**self.t)
            v_hat = self.v[key] / (1 - self.beta2**self.t)
            # Update parameter
            params[key] -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
        return params

# Usage:
# adam = AdamOptimizer(lr=0.001)
# for epoch in range(100):
#     grads = compute_gradients(...)  # {'W1': dW1, 'b1': db1, ...}
#     params = adam.update(params, grads)
```
📌 Adam Default Hyperparameters: lr=0.001, β₁=0.9, β₂=0.999, ε=1e-8
These defaults work well for most cases. Usually the only thing you need to tune is the learning rate. Start at 0.001; if the loss doesn't decrease, try 0.01 or 0.0001.
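To watch the update rule in action, here's a toy run minimizing f(w) = w² with a single-parameter Adam inlined so the snippet stands alone (a sketch for illustration; the larger lr=0.1 is chosen just so the toy converges quickly):

```python
import numpy as np

lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
w, m, v = 5.0, 0.0, 0.0  # start far from the minimum at w = 0

for t in range(1, 201):
    g = 2 * w                       # gradient of w**2
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)      # bias correction
    v_hat = v / (1 - beta2**t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(f"w after 200 steps: {w:.3f}")  # close to the minimum at 0
```

Note how early steps are roughly ±lr in size (m_hat/√v_hat ≈ 1 when the gradient direction is consistent) – that bounded step size is what makes Adam so forgiving of the learning rate.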
6. Learning Rate Scheduling
A fixed learning rate can be sub-optimal. Ideally: large LR early to quickly approach the minimum, then small LR later for precise convergence. This is called LR scheduling.
```python
import numpy as np

# ===========================
# 1. Step Decay – halve LR every N epochs
# ===========================
def step_decay(initial_lr, epoch, drop_every=10, drop_factor=0.5):
    return initial_lr * (drop_factor ** (epoch // drop_every))

# ===========================
# 2. Exponential Decay
# ===========================
def exp_decay(initial_lr, epoch, decay_rate=0.95):
    return initial_lr * (decay_rate ** epoch)

# ===========================
# 3. Cosine Annealing – smooth decrease
# ===========================
def cosine_annealing(initial_lr, epoch, total_epochs):
    return initial_lr * 0.5 * (1 + np.cos(np.pi * epoch / total_epochs))

# ===========================
# 4. Warmup + Decay (modern best practice)
# ===========================
def warmup_cosine(initial_lr, epoch, warmup=5, total=100):
    if epoch < warmup:
        return initial_lr * (epoch + 1) / warmup  # linear warmup
    return initial_lr * 0.5 * (1 + np.cos(
        np.pi * (epoch - warmup) / (total - warmup)))

# Print LR over epochs
print("Warmup + Cosine Schedule:")
for e in [0, 2, 5, 10, 50, 90, 99]:
    print(f"  Epoch {e:>2}: lr = {warmup_cosine(0.001, e, 5, 100):.6f}")
# Epoch  0: lr = 0.000200  ← warming up
# Epoch  5: lr = 0.001000  ← peak
# Epoch 50: lr = 0.000541  ← decaying
# Epoch 99: lr = 0.000000  ← near zero
```
7. Putting It All Together – A Production-Ready Model
Now let's combine all techniques in one training pipeline. This is the blueprint used in virtually all modern deep learning models.
```python
import numpy as np

# Assumes RegularizedNetwork, DropoutLayer, BatchNormLayer, AdamOptimizer,
# and warmup_cosine from the sections above, plus softmax() and evaluate()
# helpers and one-hot labels y_train_oh.

def train_production_model(X_train, y_train_oh, X_test, y_test):
    """Complete training pipeline with all techniques"""
    # Architecture
    sizes = [784, 256, 128, 10]
    model = RegularizedNetwork(sizes, lambd=0.005)
    dropout1 = DropoutLayer(drop_rate=0.3)
    dropout2 = DropoutLayer(drop_rate=0.2)
    bn1 = BatchNormLayer(256)
    bn2 = BatchNormLayer(128)
    adam = AdamOptimizer(lr=0.001)

    epochs = 30
    batch_size = 64
    best_acc = 0

    print("🚀 Training with ALL techniques")
    print("   L2=0.005 | Dropout=0.3/0.2 | BatchNorm | Adam | Cosine LR")
    print("─" * 55)

    for epoch in range(epochs):
        # LR schedule (feed the scheduled LR into the optimizer)
        lr = warmup_cosine(0.001, epoch, warmup=3, total=epochs)
        adam.lr = lr

        # Set training mode
        for layer in (dropout1, dropout2, bn1, bn2):
            layer.training = True

        # Mini-batch training
        idx = np.random.permutation(len(X_train))
        for i in range(0, len(X_train), batch_size):
            Xb = X_train[idx[i:i+batch_size]]
            yb = y_train_oh[idx[i:i+batch_size]]

            # Forward: Linear → BN → ReLU → Dropout (per layer)
            z1 = Xb @ model.params['W1'] + model.params['b1']
            z1 = bn1.forward(z1)
            a1 = np.maximum(0, z1)  # ReLU
            a1 = dropout1.forward(a1)

            z2 = a1 @ model.params['W2'] + model.params['b2']
            z2 = bn2.forward(z2)
            a2 = np.maximum(0, z2)
            a2 = dropout2.forward(a2)

            z3 = a2 @ model.params['W3'] + model.params['b3']
            a3 = softmax(z3)

            # Backward + Adam update (simplified)
            # ... backprop through each layer in reverse ...
            # adam.update(model.params, grads)

        # Evaluate (switch to inference mode)
        for layer in (dropout1, dropout2, bn1, bn2):
            layer.training = False
        if (epoch + 1) % 5 == 0:
            test_acc = evaluate(model, X_test, y_test)
            print(f"  Epoch {epoch+1:>2} – lr={lr:.6f} – Test: {test_acc:.1f}%")
            if test_acc > best_acc:
                best_acc = test_acc

    print(f"\n🎯 Best Test Accuracy: {best_acc:.1f}%")
```
📋 Production-Ready Model Checklist:
✅ He Initialization – proper weight init
✅ ReLU hidden layers + Softmax output
✅ Batch Normalization – after linear, before activation
✅ Dropout – after activation (0.2-0.5)
✅ L2 Regularization – λ = 0.001-0.01
✅ Adam optimizer – lr = 0.001
✅ LR Schedule – warmup + cosine annealing
✅ Mini-batch – batch size 32-128
✅ Early stopping – stop when validation loss increases
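Early stopping appears in the checklist but not in the pipeline above. A minimal patience-based sketch (the `EarlyStopping` class and its names are my own illustration):

```python
class EarlyStopping:
    """Stop training when val loss hasn't improved for `patience` epochs."""
    def __init__(self, patience=5):
        self.patience = patience
        self.best_loss = float("inf")
        self.counter = 0

    def step(self, val_loss):
        """Call once per epoch; returns True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss  # improvement: reset the counter
            self.counter = 0
        else:
            self.counter += 1          # no improvement this epoch
        return self.counter >= self.patience

# Demo: val loss improves for 3 epochs, then drifts up
stopper = EarlyStopping(patience=3)
for epoch, val_loss in enumerate([0.9, 0.7, 0.6, 0.65, 0.66, 0.7]):
    if stopper.step(val_loss):
        print(f"Stopping at epoch {epoch}")
        break
```

In the training loop above, you would call `stopper.step(val_loss)` right after the evaluation block and `break` out of the epoch loop when it returns True (ideally restoring the best weights seen so far).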
8. Page 4 Summary
| Concept | What It Is | Key Code |
|---|---|---|
| Overfitting | Model memorizes training data | train_acc ≫ test_acc |
| L2 Regularization | Penalty for large weights → simpler model | loss += λ*Σ(w²) |
| Dropout | Randomly kill neurons during training | x * mask / (1-p) |
| Batch Norm | Normalize each layer's output | γ*(x-μ)/σ + β |
| Adam | Optimizer: momentum + adaptive LR | AdamOptimizer(lr=.001) |
| LR Schedule | LR changes over epochs | cosine_annealing() |
| Warmup | LR ramps up slowly at start | lr * epoch / warmup |
| Early Stopping | Stop when val loss increases | if val_loss > best |
← Page 3: Convolutional Neural Network (CNN)
Coming Next: Page 5 – Recurrent Neural Network (RNN) & Sequence Data
Processing sequential data: text, time series, music. Understanding RNNs, the vanishing gradient problem, and building LSTM/GRU from scratch. Creating a text generator and sentiment analyzer. Stay tuned!