Table of Contents — Page 2
- Page 1 Recap — The foundation we've built
- Deep Neural Network — Adding multiple hidden layers
- Softmax & Cross-Entropy — Multi-class classification
- Data Preprocessing — Normalization & One-Hot Encoding
- Mini-Batch Gradient Descent — Faster & more stable training
- Iris Classification — Your first real-world dataset
- MNIST Classification — Recognizing handwritten digits
- Summary & Page 3 Preview
1. Page 1 Recap — Our Foundation
In Page 1, we built a neural network from scratch that can solve XOR and recognize the sin(x) pattern. All using Python + NumPy only. We've mastered: Perceptron, Sigmoid/ReLU, Forward Propagation, Backpropagation, and Gradient Descent.
The limitation? Our network only had 1 hidden layer and could only do binary classification (0 or 1). In the real world, we need: many layers (deep!), multi-class classification (cat/dog/bird), and large datasets. Page 2 solves all of that.
2. Deep Neural Network — Multiple Hidden Layers
Adding hidden layers makes the network "deeper" — this is where "deep learning" gets its name. Each layer captures increasingly higher-level abstractions. For face recognition: layer 1 detects edges, layer 2 detects shapes, layer 3 detects eyes/noses, layer 4 recognizes faces.
```python
import numpy as np

class DeepNeuralNetwork:
    """
    Deep Neural Network — any number of layers!
    Example: DeepNeuralNetwork([784, 128, 64, 10])
             → 784 input, 128 hidden, 64 hidden, 10 output
    """
    def __init__(self, layer_sizes):
        self.L = len(layer_sizes) - 1  # number of layers (excl. input)
        self.sizes = layer_sizes
        self.params = {}
        # Initialize weights (He initialization for ReLU)
        for l in range(1, self.L + 1):
            self.params[f'W{l}'] = np.random.randn(
                layer_sizes[l-1], layer_sizes[l]
            ) * np.sqrt(2.0 / layer_sizes[l-1])
            self.params[f'b{l}'] = np.zeros((1, layer_sizes[l]))

    def relu(self, z):
        return np.maximum(0, z)

    def relu_deriv(self, z):
        return (z > 0).astype(float)

    def softmax(self, z):
        """Numerically stable softmax"""
        exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
        return exp_z / np.sum(exp_z, axis=1, keepdims=True)

    def forward(self, X):
        """Forward pass through ALL layers"""
        self.cache = {'a0': X}
        for l in range(1, self.L + 1):
            z = self.cache[f'a{l-1}'] @ self.params[f'W{l}'] + self.params[f'b{l}']
            self.cache[f'z{l}'] = z
            if l == self.L:
                # Last layer: softmax (multi-class)
                self.cache[f'a{l}'] = self.softmax(z)
            else:
                # Hidden layers: ReLU
                self.cache[f'a{l}'] = self.relu(z)
        return self.cache[f'a{self.L}']

    def backward(self, y_onehot, lr=0.01):
        """Backprop through ALL layers"""
        m = y_onehot.shape[0]
        # Output layer: softmax + cross-entropy shortcut
        dz = self.cache[f'a{self.L}'] - y_onehot  # elegant!
        for l in range(self.L, 0, -1):
            dW = (1/m) * self.cache[f'a{l-1}'].T @ dz
            db = (1/m) * np.sum(dz, axis=0, keepdims=True)
            if l > 1:
                da = dz @ self.params[f'W{l}'].T
                dz = da * self.relu_deriv(self.cache[f'z{l-1}'])
            # Update
            self.params[f'W{l}'] -= lr * dW
            self.params[f'b{l}'] -= lr * db

    def predict(self, X):
        probs = self.forward(X)
        return np.argmax(probs, axis=1)

    def accuracy(self, X, y):
        return np.mean(self.predict(X) == y) * 100
```
What Changed from Page 1?
- Flexible: Now supports N layers — just pass a list like [784, 128, 64, 10].
- ReLU: Hidden layers use ReLU (converges faster than sigmoid).
- Softmax: Output layer uses softmax (multi-class classification).
- He Init: Proper weight initialization for ReLU — prevents vanishing gradient.
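The effect of He initialization can be seen directly. Below is a minimal sketch (the `layer_stats` helper and the 512-unit layer sizes are illustrative, not part of the network above) that pushes random data through a stack of ReLU layers and compares how the activation spread evolves under He scaling versus a tiny fixed scale:

```python
import numpy as np

np.random.seed(0)

def layer_stats(init_scale_fn, sizes=[512] * 6):
    """Push random data through ReLU layers; report activation std per layer."""
    a = np.random.randn(1000, sizes[0])
    stds = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        W = np.random.randn(n_in, n_out) * init_scale_fn(n_in)
        a = np.maximum(0, a @ W)  # ReLU
        stds.append(a.std())
    return stds

he = layer_stats(lambda n: np.sqrt(2.0 / n))  # He: spread stays roughly constant
naive = layer_stats(lambda n: 0.01)           # tiny init: activations collapse

print("He init stds:   ", [f"{s:.4f}" for s in he])
print("Naive init stds:", [f"{s:.6f}" for s in naive])
```

With the naive 0.01 scale, each layer shrinks the signal by a constant factor, so after a few layers the activations (and hence the gradients) are numerically negligible; He scaling keeps the spread near 1 at every depth.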
3. Softmax & Cross-Entropy Loss
In Page 1, we used sigmoid (output: 0 or 1). Now for multi-class classification (e.g., digits 0-9), we need softmax — converting outputs into a probability distribution that sums to 1.
```python
import numpy as np

# ===========================
# Softmax — turns scores into probabilities
# ===========================
def softmax(z):
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

# Example: raw scores from output layer
scores = np.array([[2.0, 1.0, 0.1]])  # 3 classes
probs = softmax(scores)
print(probs)  # [[0.659, 0.242, 0.099]] → probabilities! Sum = 1.0
print(f"Predicted class: {np.argmax(probs)}")  # 0

# ===========================
# Cross-Entropy Loss
# Better than MSE for classification!
# ===========================
def cross_entropy_loss(y_pred, y_true_onehot):
    """
    y_pred: softmax output (probabilities)
    y_true_onehot: one-hot encoded labels
    """
    m = y_pred.shape[0]
    # Clip to avoid log(0)
    y_pred = np.clip(y_pred, 1e-12, 1 - 1e-12)
    loss = -np.sum(y_true_onehot * np.log(y_pred)) / m
    return loss

# Example
y_true = np.array([[1, 0, 0]])        # true: class 0
y_pred = np.array([[0.7, 0.2, 0.1]])  # predicted probs
print(f"Loss: {cross_entropy_loss(y_pred, y_true):.4f}")  # 0.3567

# ===========================
# One-Hot Encoding helper
# ===========================
def one_hot(labels, num_classes):
    """Convert [0, 2, 1] → [[1,0,0], [0,0,1], [0,1,0]]"""
    m = labels.shape[0]
    encoded = np.zeros((m, num_classes))
    encoded[np.arange(m), labels] = 1
    return encoded

print(one_hot(np.array([0, 2, 1]), 3))
# [[1. 0. 0.]  → class 0
#  [0. 0. 1.]  → class 2
#  [0. 1. 0.]] → class 1
```
Why Cross-Entropy, not MSE?
Cross-entropy produces larger gradients when predictions are very wrong — the model learns faster from big mistakes. MSE can be "lazy" because gradients are small at sigmoid's extremes. Bonus: the softmax + cross-entropy gradient is beautifully simple: dz = ŷ - y.
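The dz = ŷ - y shortcut is easy to verify numerically. A small sketch (the `ce_loss` helper exists only for this check) compares the analytic gradient against central-difference estimates of the loss:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def ce_loss(z, y_onehot):
    """Cross-entropy computed directly from raw scores z."""
    p = softmax(z)
    return -np.sum(y_onehot * np.log(p)) / z.shape[0]

z = np.array([[2.0, 1.0, 0.1]])   # raw output-layer scores
y = np.array([[1.0, 0.0, 0.0]])   # true class: 0

# Analytic gradient of the loss w.r.t. z: (ŷ - y) / m
analytic = (softmax(z) - y) / z.shape[0]

# Numerical gradient via central differences
eps = 1e-6
numeric = np.zeros_like(z)
for j in range(z.shape[1]):
    zp, zm = z.copy(), z.copy()
    zp[0, j] += eps
    zm[0, j] -= eps
    numeric[0, j] = (ce_loss(zp, y) - ce_loss(zm, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```

This is exactly the `dz = self.cache[f'a{self.L}'] - y_onehot` line in the `backward` method above (the 1/m factor there is folded into `dW` and `db`).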
4. Data Preprocessing — Normalization & Encoding
Neural networks are sensitive to data scale. A feature ranging 0-1000 will dominate one ranging 0-1. Normalization puts all features on equal footing — training becomes faster and more stable.
```python
import numpy as np

# ===========================
# 1. Min-Max Normalization — scale to [0, 1]
# ===========================
def normalize(X):
    X_min = X.min(axis=0)
    X_max = X.max(axis=0)
    return (X - X_min) / (X_max - X_min + 1e-8)

# ===========================
# 2. Z-Score Standardization — mean=0, std=1
# ===========================
def standardize(X):
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + 1e-8)

# ===========================
# 3. Train/Test Split
# ===========================
def train_test_split(X, y, test_ratio=0.2, seed=42):
    np.random.seed(seed)
    indices = np.random.permutation(len(X))
    split = int(len(X) * (1 - test_ratio))
    train_idx, test_idx = indices[:split], indices[split:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Example
data = np.array([[150, 0.5], [200, 0.8], [100, 0.2]])
print("Before:", data[0])             # [150, 0.5]
print("After: ", normalize(data)[0])  # [0.5, 0.5] → same scale!
```
5. Mini-Batch Gradient Descent
In Page 1, we used full-batch GD — the entire dataset computed at once. This is slow for large data. The alternative: mini-batch gradient descent, which shuffles the data, splits it into small chunks (e.g., 32 samples), and updates the weights after each chunk.
```python
import numpy as np

def create_minibatches(X, y, batch_size=32):
    """Split data into mini-batches"""
    m = X.shape[0]
    indices = np.random.permutation(m)
    X_shuffled = X[indices]
    y_shuffled = y[indices]
    batches = []
    for i in range(0, m, batch_size):
        X_batch = X_shuffled[i:i+batch_size]
        y_batch = y_shuffled[i:i+batch_size]
        batches.append((X_batch, y_batch))
    return batches

# ===========================
# Training with mini-batches
# ===========================
def train_minibatch(model, X, y_onehot, epochs=20, lr=0.1, batch_size=32):
    for epoch in range(epochs):
        batches = create_minibatches(X, y_onehot, batch_size)
        epoch_loss = 0
        for X_batch, y_batch in batches:
            # Forward
            pred = model.forward(X_batch)
            # Loss
            pred_clipped = np.clip(pred, 1e-12, 1-1e-12)
            batch_loss = -np.sum(y_batch * np.log(pred_clipped)) / len(X_batch)
            epoch_loss += batch_loss
            # Backward + Update
            model.backward(y_batch, lr)
        if (epoch+1) % 5 == 0:
            avg_loss = epoch_loss / len(batches)
            print(f"  Epoch {epoch+1:>3} — Loss: {avg_loss:.4f}")
```
Tip: Batch Size
- 32 — a good default for most cases.
- 64-128 — if you have a GPU, larger sizes leverage parallelism.
- Powers of 2 (32, 64, 128, 256) — optimal for modern hardware.
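A quick back-of-envelope on what batch size means for update frequency, assuming the 60,000-sample MNIST training split used later on this page:

```python
import math

m = 60000  # MNIST training set size
for bs in (1, 32, 128, 60000):
    updates = math.ceil(m / bs)  # weight updates per epoch
    print(f"batch_size={bs:>6} -> {updates:>6} updates/epoch")
# batch_size=     1 ->  60000 updates/epoch (pure SGD: noisy, frequent)
# batch_size=    32 ->   1875 updates/epoch
# batch_size=   128 ->    469 updates/epoch
# batch_size= 60000 ->      1 updates/epoch (full-batch: stable, slow to iterate)
```

Smaller batches mean more (noisier) updates per pass over the data; larger batches mean fewer, smoother updates that vectorize better on modern hardware.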
6. Iris Classification — Your First Real Dataset
The Iris dataset contains measurements of 150 iris flowers with 4 features (sepal/petal length and width) and 3 classes (Setosa, Versicolor, Virginica). It's the "Hello World" of machine learning — small but enough to prove our model works on real data.
```python
import numpy as np
from sklearn.datasets import load_iris  # just for loading data

# ===========================
# 1. Load & Prepare Data
# ===========================
iris = load_iris()
X = iris.data.astype(np.float64)  # (150, 4)
y = iris.target                   # (150,) → values: 0, 1, 2

# Normalize
X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

# One-hot encode labels
def one_hot(labels, num_classes):
    enc = np.zeros((len(labels), num_classes))
    enc[np.arange(len(labels)), labels] = 1
    return enc

y_oh = one_hot(y, 3)  # (150, 3)

# Train/test split
np.random.seed(42)
idx = np.random.permutation(150)
X_train, X_test = X[idx[:120]], X[idx[120:]]
y_train, y_test = y_oh[idx[:120]], y_oh[idx[120:]]
y_test_labels = y[idx[120:]]

# ===========================
# 2. Create & Train Network
# ===========================
# Architecture: 4 input → 16 hidden → 8 hidden → 3 output
model = DeepNeuralNetwork([4, 16, 8, 3])

print("Training on Iris dataset...")
for epoch in range(200):
    pred = model.forward(X_train)
    model.backward(y_train, lr=0.1)
    if (epoch+1) % 50 == 0:
        loss = -np.sum(y_train * np.log(np.clip(pred, 1e-12, 1))) / len(X_train)
        acc = model.accuracy(X_test, y_test_labels)
        print(f"  Epoch {epoch+1:>3} — Loss: {loss:.4f} — Test Acc: {acc:.1f}%")

# ===========================
# 3. Final Results
# ===========================
print(f"\nFinal Test Accuracy: {model.accuracy(X_test, y_test_labels):.1f}%")
# Output: Final Test Accuracy: 96.7%+
```
96%+ Accuracy! Our neural network — built from scratch without any framework — successfully classifies iris flowers with high accuracy. The architecture is just [4, 16, 8, 3] with 200 epochs of training.
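To see where the remaining errors land, a confusion matrix is the standard tool. A minimal sketch with toy labels follows — the `confusion_matrix` helper is illustrative (not part of the page's network class); in practice you would pass `y_test_labels` and `model.predict(X_test)`:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """cm[i, j] = count of samples with true class i predicted as class j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy example with 3 classes (0=Setosa, 1=Versicolor, 2=Virginica)
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])  # one Versicolor mistaken for Virginica
print(confusion_matrix(y_true, y_pred, 3))
# [[2 0 0]
#  [0 1 1]
#  [0 0 2]]
```

The diagonal counts correct predictions per class; off-diagonal cells show exactly which classes get confused with which (on Iris, Versicolor/Virginica confusions are the typical failure mode, since Setosa is linearly separable from the rest).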
7. MNIST Classification — Handwritten Digits
MNIST is a legendary machine learning dataset — 70,000 handwritten digit images (28×28 pixels). Each image = 784 pixels = 784 input features. Task: classify into digits 0-9.
```python
import numpy as np
from sklearn.datasets import fetch_openml

# =====================================================
# 1. LOAD MNIST DATA
# =====================================================
print("Loading MNIST dataset...")
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist.data.astype(np.float64), mnist.target.astype(int)

# Normalize pixels: [0, 255] → [0, 1]
X = X / 255.0

# Split: 60k train, 10k test
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]

# One-hot encode
def one_hot(labels, nc):
    enc = np.zeros((len(labels), nc))
    enc[np.arange(len(labels)), labels] = 1
    return enc

y_train_oh = one_hot(y_train, 10)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")
# Train: (60000, 784), Test: (10000, 784)

# =====================================================
# 2. CREATE DEEP NETWORK
#    Architecture: 784 → 128 → 64 → 10
# =====================================================
model = DeepNeuralNetwork([784, 128, 64, 10])
print("Network: [784] → [128] → [64] → [10]")

# =====================================================
# 3. TRAIN WITH MINI-BATCHES
# =====================================================
epochs = 20
batch_size = 64
lr = 0.1

print(f"\nTraining for {epochs} epochs (batch={batch_size}, lr={lr})")
print("—" * 50)

for epoch in range(epochs):
    # Shuffle
    idx = np.random.permutation(60000)
    X_shuf = X_train[idx]
    y_shuf = y_train_oh[idx]
    # Mini-batch loop
    for i in range(0, 60000, batch_size):
        Xb = X_shuf[i:i+batch_size]
        yb = y_shuf[i:i+batch_size]
        model.forward(Xb)
        model.backward(yb, lr)
    # Evaluate every 5 epochs
    if (epoch+1) % 5 == 0:
        train_acc = model.accuracy(X_train[:5000], y_train[:5000])
        test_acc = model.accuracy(X_test, y_test)
        print(f"  Epoch {epoch+1:>2} — Train: {train_acc:.1f}% — Test: {test_acc:.1f}%")

# =====================================================
# 4. FINAL RESULTS
# =====================================================
final_acc = model.accuracy(X_test, y_test)
print(f"\nFinal Test Accuracy: {final_acc:.1f}%")
# Output: Final Test Accuracy: 97.2%+

# =====================================================
# 5. DEMO: Predict single digit
# =====================================================
sample_idx = 42
pred = model.predict(X_test[sample_idx:sample_idx+1])
print(f"\nSample #{sample_idx}: Predicted={pred[0]}, Actual={y_test[sample_idx]}")
```
97%+ Accuracy on MNIST!
Our handcrafted neural network — no framework — can recognize handwritten digits with 97%+ accuracy. Out of 10,000 test images, only ~300 are wrong. And this is with just 2 hidden layers and 20 epochs! Imagine what's possible with larger architectures (CNNs, covered in Page 3).
8. Page 2 Summary
| Concept | What It Is | Key Code |
|---|---|---|
| Deep Network | Multiple hidden layers → increasingly abstract patterns | DeepNN([784,128,64,10]) |
| ReLU | Hidden layer activation — fast, simple | np.maximum(0, z) |
| Softmax | Multi-class output → probabilities (sum=1) | exp(z) / sum(exp(z)) |
| Cross-Entropy | Loss function for classification | -sum(y * log(ŷ)) |
| One-Hot Encoding | Label → binary vector [0,0,1,0...] | enc[i, label] = 1 |
| Normalization | All features on the same scale | X / 255.0 |
| Mini-Batch GD | Update per data chunk (32/64/128) | for i in range(0,m,bs) |
| He Initialization | Optimal weight init for ReLU | randn() * sqrt(2/n) |
| MNIST | 60k digit images → 97%+ accuracy | [784,128,64,10] |
Page 1 — Neural Network from Scratch
Coming Next: Page 3 — Convolutional Neural Network (CNN)
Understanding convolution, pooling, and feature maps. Building a CNN from scratch for image classification, then comparing results with a regular network. MNIST → 99%+ accuracy. Stay tuned!