Table of Contents — Page 2
- Page 1 Recap — The foundation we've built
- Deep Neural Network — Adding multiple hidden layers
- Softmax & Cross-Entropy — Multi-class classification
- Data Preprocessing — Normalization & One-Hot Encoding
- Mini-Batch Gradient Descent — Faster & more stable training
- Iris Classification — Your first real-world dataset
- MNIST Classification — Recognizing handwritten digits
- Summary & Page 3 Preview
1. Page 1 Recap — Our Foundation
In Page 1, we built a neural network from scratch that can solve XOR and recognize the sin(x) pattern. All using Python + NumPy only. We've mastered: Perceptron, Sigmoid/ReLU, Forward Propagation, Backpropagation, and Gradient Descent.
The limitation? Our network only had 1 hidden layer and could only do binary classification (0 or 1). In the real world, we need: many layers (deep!), multi-class classification (cat/dog/bird), and large datasets. Page 2 solves all of that.
2. Deep Neural Network — Multiple Hidden Layers
Adding hidden layers makes the network "deeper" — this is where "deep learning" gets its name. Each layer captures increasingly higher-level abstractions. For face recognition: layer 1 detects edges, layer 2 detects shapes, layer 3 detects eyes/noses, layer 4 recognizes faces.
```python
import numpy as np

class DeepNeuralNetwork:
    """
    Deep Neural Network — any number of layers!
    Example: DeepNeuralNetwork([784, 128, 64, 10])
             → 784 input, 128 hidden, 64 hidden, 10 output
    """
    def __init__(self, layer_sizes):
        self.L = len(layer_sizes) - 1  # number of layers (excl. input)
        self.sizes = layer_sizes
        self.params = {}
        # Initialize weights (He initialization for ReLU)
        for l in range(1, self.L + 1):
            self.params[f'W{l}'] = np.random.randn(
                layer_sizes[l-1], layer_sizes[l]
            ) * np.sqrt(2.0 / layer_sizes[l-1])
            self.params[f'b{l}'] = np.zeros((1, layer_sizes[l]))

    def relu(self, z):
        return np.maximum(0, z)

    def relu_deriv(self, z):
        return (z > 0).astype(float)

    def softmax(self, z):
        """Numerically stable softmax"""
        exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
        return exp_z / np.sum(exp_z, axis=1, keepdims=True)

    def forward(self, X):
        """Forward pass through ALL layers"""
        self.cache = {'a0': X}
        for l in range(1, self.L + 1):
            z = self.cache[f'a{l-1}'] @ self.params[f'W{l}'] + self.params[f'b{l}']
            self.cache[f'z{l}'] = z
            if l == self.L:
                # Last layer: softmax (multi-class)
                self.cache[f'a{l}'] = self.softmax(z)
            else:
                # Hidden layers: ReLU
                self.cache[f'a{l}'] = self.relu(z)
        return self.cache[f'a{self.L}']

    def backward(self, y_onehot, lr=0.01):
        """Backprop through ALL layers"""
        m = y_onehot.shape[0]
        # Output layer: softmax + cross-entropy shortcut
        dz = self.cache[f'a{self.L}'] - y_onehot  # elegant!
        for l in range(self.L, 0, -1):
            dW = (1/m) * self.cache[f'a{l-1}'].T @ dz
            db = (1/m) * np.sum(dz, axis=0, keepdims=True)
            if l > 1:
                da = dz @ self.params[f'W{l}'].T
                dz = da * self.relu_deriv(self.cache[f'z{l-1}'])
            # Update
            self.params[f'W{l}'] -= lr * dW
            self.params[f'b{l}'] -= lr * db

    def predict(self, X):
        probs = self.forward(X)
        return np.argmax(probs, axis=1)

    def accuracy(self, X, y):
        return np.mean(self.predict(X) == y) * 100
```
What Changed from Page 1?
- Flexible: Now supports N layers — just pass a list like [784, 128, 64, 10].
- ReLU: Hidden layers use ReLU (converges faster than sigmoid).
- Softmax: Output layer uses softmax (multi-class classification).
- He Init: Proper weight initialization for ReLU — prevents vanishing gradient.
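The effect of He initialization can be seen directly. Below is a minimal sketch (the `layer_stats` helper and the 512-unit layer sizes are illustrative, not part of the network above) that pushes random data through a stack of ReLU layers and compares how the activation spread evolves under He scaling versus a tiny fixed scale:

```python
import numpy as np

np.random.seed(0)

def layer_stats(init_scale_fn, sizes=[512] * 6):
    """Push random data through ReLU layers; report activation std per layer."""
    a = np.random.randn(1000, sizes[0])
    stds = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        W = np.random.randn(n_in, n_out) * init_scale_fn(n_in)
        a = np.maximum(0, a @ W)  # ReLU
        stds.append(a.std())
    return stds

he = layer_stats(lambda n: np.sqrt(2.0 / n))  # He: spread stays roughly constant
naive = layer_stats(lambda n: 0.01)           # tiny init: activations collapse

print("He init stds:   ", [f"{s:.4f}" for s in he])
print("Naive init stds:", [f"{s:.6f}" for s in naive])
```

With the naive 0.01 scale, each layer shrinks the signal by a constant factor, so after a few layers the activations (and hence the gradients) are numerically negligible; He scaling keeps the spread near 1 at every depth.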
3. Softmax & Cross-Entropy Loss
In Page 1, we used sigmoid (output: 0 or 1). Now for multi-class classification (e.g., digits 0-9), we need softmax — converting outputs into a probability distribution that sums to 1.
```python
import numpy as np

# ===========================
# Softmax — turns scores into probabilities
# ===========================
def softmax(z):
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

# Example: raw scores from output layer
scores = np.array([[2.0, 1.0, 0.1]])  # 3 classes
probs = softmax(scores)
print(probs)  # [[0.659, 0.242, 0.099]] → probabilities! Sum = 1.0
print(f"Predicted class: {np.argmax(probs)}")  # 0

# ===========================
# Cross-Entropy Loss
# Better than MSE for classification!
# ===========================
def cross_entropy_loss(y_pred, y_true_onehot):
    """
    y_pred: softmax output (probabilities)
    y_true_onehot: one-hot encoded labels
    """
    m = y_pred.shape[0]
    # Clip to avoid log(0)
    y_pred = np.clip(y_pred, 1e-12, 1 - 1e-12)
    loss = -np.sum(y_true_onehot * np.log(y_pred)) / m
    return loss

# Example
y_true = np.array([[1, 0, 0]])        # true: class 0
y_pred = np.array([[0.7, 0.2, 0.1]])  # predicted probs
print(f"Loss: {cross_entropy_loss(y_pred, y_true):.4f}")  # 0.3567

# ===========================
# One-Hot Encoding helper
# ===========================
def one_hot(labels, num_classes):
    """Convert [0, 2, 1] → [[1,0,0], [0,0,1], [0,1,0]]"""
    m = labels.shape[0]
    encoded = np.zeros((m, num_classes))
    encoded[np.arange(m), labels] = 1
    return encoded

print(one_hot(np.array([0, 2, 1]), 3))
# [[1. 0. 0.]  → class 0
#  [0. 0. 1.]  → class 2
#  [0. 1. 0.]] → class 1
```
Why Cross-Entropy, not MSE?
Cross-entropy produces larger gradients when predictions are very wrong — the model learns faster from big mistakes. MSE can be "lazy" because gradients are small at sigmoid's extremes. Bonus: the softmax + cross-entropy gradient is beautifully simple: dz = ŷ - y.
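The dz = ŷ - y shortcut is easy to verify numerically. A small sketch (the `ce_loss` helper exists only for this check) compares the analytic gradient against central-difference estimates of the loss:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def ce_loss(z, y_onehot):
    """Cross-entropy computed directly from raw scores z."""
    p = softmax(z)
    return -np.sum(y_onehot * np.log(p)) / z.shape[0]

z = np.array([[2.0, 1.0, 0.1]])   # raw output-layer scores
y = np.array([[1.0, 0.0, 0.0]])   # true class: 0

# Analytic gradient of the loss w.r.t. z: (ŷ - y) / m
analytic = (softmax(z) - y) / z.shape[0]

# Numerical gradient via central differences
eps = 1e-6
numeric = np.zeros_like(z)
for j in range(z.shape[1]):
    zp, zm = z.copy(), z.copy()
    zp[0, j] += eps
    zm[0, j] -= eps
    numeric[0, j] = (ce_loss(zp, y) - ce_loss(zm, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```

This is exactly the `dz = self.cache[f'a{self.L}'] - y_onehot` line in the `backward` method above (the 1/m factor there is folded into `dW` and `db`).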
4. Data Preprocessing — Normalization & Encoding
Neural networks are sensitive to data scale. A feature ranging 0-1000 will dominate one ranging 0-1. Normalization puts all features on equal footing — training becomes faster and more stable.
```python
import numpy as np

# ===========================
# 1. Min-Max Normalization — scale to [0, 1]
# ===========================
def normalize(X):
    X_min = X.min(axis=0)
    X_max = X.max(axis=0)
    return (X - X_min) / (X_max - X_min + 1e-8)

# ===========================
# 2. Z-Score Standardization — mean=0, std=1
# ===========================
def standardize(X):
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + 1e-8)

# ===========================
# 3. Train/Test Split
# ===========================
def train_test_split(X, y, test_ratio=0.2, seed=42):
    np.random.seed(seed)
    indices = np.random.permutation(len(X))
    split = int(len(X) * (1 - test_ratio))
    train_idx, test_idx = indices[:split], indices[split:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Example
data = np.array([[150, 0.5], [200, 0.8], [100, 0.2]])
print("Before:", data[0])             # [150, 0.5]
print("After: ", normalize(data)[0])  # [0.5, 0.5] → same scale!
```
5. Mini-Batch Gradient Descent
In Page 1, we used full-batch GD — the entire dataset computed at once. This is slow for large data. The alternative: mini-batch gradient descent, which shuffles the data, splits it into small chunks (e.g., 32 samples), and updates the weights after each chunk.
```python
import numpy as np

def create_minibatches(X, y, batch_size=32):
    """Split data into mini-batches"""
    m = X.shape[0]
    indices = np.random.permutation(m)
    X_shuffled = X[indices]
    y_shuffled = y[indices]
    batches = []
    for i in range(0, m, batch_size):
        X_batch = X_shuffled[i:i+batch_size]
        y_batch = y_shuffled[i:i+batch_size]
        batches.append((X_batch, y_batch))
    return batches

# ===========================
# Training with mini-batches
# ===========================
def train_minibatch(model, X, y_onehot, epochs=20, lr=0.1, batch_size=32):
    for epoch in range(epochs):
        batches = create_minibatches(X, y_onehot, batch_size)
        epoch_loss = 0
        for X_batch, y_batch in batches:
            # Forward
            pred = model.forward(X_batch)
            # Loss
            pred_clipped = np.clip(pred, 1e-12, 1-1e-12)
            batch_loss = -np.sum(y_batch * np.log(pred_clipped)) / len(X_batch)
            epoch_loss += batch_loss
            # Backward + Update
            model.backward(y_batch, lr)
        if (epoch+1) % 5 == 0:
            avg_loss = epoch_loss / len(batches)
            print(f"  Epoch {epoch+1:>3} — Loss: {avg_loss:.4f}")
```
Tip: Batch Size
- 32 — a good default for most cases.
- 64-128 — if you have a GPU, larger sizes leverage parallelism.
- Powers of 2 (32, 64, 128, 256) — optimal for modern hardware.
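A quick back-of-envelope on what batch size means for update frequency, assuming the 60,000-sample MNIST training split used later on this page:

```python
import math

m = 60000  # MNIST training set size
for bs in (1, 32, 128, 60000):
    updates = math.ceil(m / bs)  # weight updates per epoch
    print(f"batch_size={bs:>6} -> {updates:>6} updates/epoch")
# batch_size=     1 ->  60000 updates/epoch (pure SGD: noisy, frequent)
# batch_size=    32 ->   1875 updates/epoch
# batch_size=   128 ->    469 updates/epoch
# batch_size= 60000 ->      1 updates/epoch (full-batch: stable, slow to iterate)
```

Smaller batches mean more (noisier) updates per pass over the data; larger batches mean fewer, smoother updates that vectorize better on modern hardware.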
6. Iris Classification — Your First Real Dataset
The Iris dataset contains measurements of 150 iris flowers with 4 features (sepal/petal length and width) and 3 classes (Setosa, Versicolor, Virginica). It's the "Hello World" of machine learning — small but enough to prove our model works on real data.
```python
import numpy as np
from sklearn.datasets import load_iris  # just for loading data

# ===========================
# 1. Load & Prepare Data
# ===========================
iris = load_iris()
X = iris.data.astype(np.float64)  # (150, 4)
y = iris.target                   # (150,) → values: 0, 1, 2

# Normalize
X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

# One-hot encode labels
def one_hot(labels, num_classes):
    enc = np.zeros((len(labels), num_classes))
    enc[np.arange(len(labels)), labels] = 1
    return enc

y_oh = one_hot(y, 3)  # (150, 3)

# Train/test split
np.random.seed(42)
idx = np.random.permutation(150)
X_train, X_test = X[idx[:120]], X[idx[120:]]
y_train, y_test = y_oh[idx[:120]], y_oh[idx[120:]]
y_test_labels = y[idx[120:]]

# ===========================
# 2. Create & Train Network
# ===========================
# Architecture: 4 input → 16 hidden → 8 hidden → 3 output
model = DeepNeuralNetwork([4, 16, 8, 3])

print("Training on Iris dataset...")
for epoch in range(200):
    pred = model.forward(X_train)
    model.backward(y_train, lr=0.1)
    if (epoch+1) % 50 == 0:
        loss = -np.sum(y_train * np.log(np.clip(pred, 1e-12, 1))) / len(X_train)
        acc = model.accuracy(X_test, y_test_labels)
        print(f"  Epoch {epoch+1:>3} — Loss: {loss:.4f} — Test Acc: {acc:.1f}%")

# ===========================
# 3. Final Results
# ===========================
print(f"\nFinal Test Accuracy: {model.accuracy(X_test, y_test_labels):.1f}%")
# Output: Final Test Accuracy: 96.7%+
```
96%+ Accuracy! Our neural network — built from scratch without any framework — successfully classifies iris flowers with high accuracy. The architecture is just [4, 16, 8, 3] with 200 epochs of training.
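To see where the remaining errors land, a confusion matrix is the standard tool. A minimal sketch with toy labels follows — the `confusion_matrix` helper is illustrative (not part of the page's network class); in practice you would pass `y_test_labels` and `model.predict(X_test)`:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """cm[i, j] = count of samples with true class i predicted as class j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy example with 3 classes (0=Setosa, 1=Versicolor, 2=Virginica)
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])  # one Versicolor mistaken for Virginica
print(confusion_matrix(y_true, y_pred, 3))
# [[2 0 0]
#  [0 1 1]
#  [0 0 2]]
```

The diagonal counts correct predictions per class; off-diagonal cells show exactly which classes get confused with which (on Iris, Versicolor/Virginica confusions are the typical failure mode, since Setosa is linearly separable from the rest).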
7. MNIST Classification — Handwritten Digits
MNIST is a legendary machine learning dataset — 70,000 handwritten digit images (28×28 pixels). Each image = 784 pixels = 784 input features. Task: classify into digits 0-9.
```python
import numpy as np
from sklearn.datasets import fetch_openml

# =====================================================
# 1. LOAD MNIST DATA
# =====================================================
print("Loading MNIST dataset...")
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist.data.astype(np.float64), mnist.target.astype(int)

# Normalize pixels: [0, 255] → [0, 1]
X = X / 255.0

# Split: 60k train, 10k test
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]

# One-hot encode
def one_hot(labels, nc):
    enc = np.zeros((len(labels), nc))
    enc[np.arange(len(labels)), labels] = 1
    return enc

y_train_oh = one_hot(y_train, 10)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")
# Train: (60000, 784), Test: (10000, 784)

# =====================================================
# 2. CREATE DEEP NETWORK
#    Architecture: 784 → 128 → 64 → 10
# =====================================================
model = DeepNeuralNetwork([784, 128, 64, 10])
print("Network: [784] → [128] → [64] → [10]")

# =====================================================
# 3. TRAIN WITH MINI-BATCHES
# =====================================================
epochs = 20
batch_size = 64
lr = 0.1

print(f"\nTraining for {epochs} epochs (batch={batch_size}, lr={lr})")
print("—" * 50)

for epoch in range(epochs):
    # Shuffle
    idx = np.random.permutation(60000)
    X_shuf = X_train[idx]
    y_shuf = y_train_oh[idx]
    # Mini-batch loop
    for i in range(0, 60000, batch_size):
        Xb = X_shuf[i:i+batch_size]
        yb = y_shuf[i:i+batch_size]
        model.forward(Xb)
        model.backward(yb, lr)
    # Evaluate every 5 epochs
    if (epoch+1) % 5 == 0:
        train_acc = model.accuracy(X_train[:5000], y_train[:5000])
        test_acc = model.accuracy(X_test, y_test)
        print(f"  Epoch {epoch+1:>2} — Train: {train_acc:.1f}% — Test: {test_acc:.1f}%")

# =====================================================
# 4. FINAL RESULTS
# =====================================================
final_acc = model.accuracy(X_test, y_test)
print(f"\nFinal Test Accuracy: {final_acc:.1f}%")
# Output: Final Test Accuracy: 97.2%+

# =====================================================
# 5. DEMO: Predict single digit
# =====================================================
sample_idx = 42
pred = model.predict(X_test[sample_idx:sample_idx+1])
print(f"\nSample #{sample_idx}: Predicted={pred[0]}, Actual={y_test[sample_idx]}")
```
97%+ Accuracy on MNIST!
Our handcrafted neural network — no framework — can recognize handwritten digits with 97%+ accuracy. Out of 10,000 test images, only ~300 are wrong. And this is with just 2 hidden layers and 20 epochs! Imagine what's possible with larger architectures (CNNs, covered in Page 3).
8. Page 2 Summary
| Concept | What It Is | Key Code |
|---|---|---|
| Deep Network | Multiple hidden layers → increasingly abstract patterns | DeepNN([784,128,64,10]) |
| ReLU | Hidden layer activation — fast, simple | np.maximum(0, z) |
| Softmax | Multi-class output → probabilities (sum=1) | exp(z) / sum(exp(z)) |
| Cross-Entropy | Loss function for classification | -sum(y * log(ŷ)) |
| One-Hot Encoding | Label → binary vector [0,0,1,0...] | enc[i, label] = 1 |
| Normalization | All features on the same scale | X / 255.0 |
| Mini-Batch GD | Update per data chunk (32/64/128) | for i in range(0,m,bs) |
| He Initialization | Optimal weight init for ReLU | randn() * sqrt(2/n) |
| MNIST | 60k digit images → 97%+ accuracy | [784,128,64,10] |
Page 1 — Neural Network from Scratch
Coming Next: Page 3 — Convolutional Neural Network (CNN)
Understanding convolution, pooling, and feature maps. Building a CNN from scratch for image classification, then comparing results with a regular network. MNIST → 99%+ accuracy. Stay tuned!