Table of Contents – Page 3
- Why CNN? – The weakness of dense networks for images
- The Convolution Operation – A filter sliding over an image
- Padding & Stride – Controlling output size
- Pooling Layer – Shrinking feature maps, keeping key info
- Full CNN Architecture – Conv → Pool → Conv → Pool → FC
- Building a CNN from Scratch – Pure NumPy implementation
- MNIST with CNN – 99%+ accuracy!
- Summary & Page 4 Preview
1. Why CNN? – Dense Networks Aren't Enough
On Page 2, we achieved 97% accuracy on MNIST with a dense (fully connected) network. But three major problems arise when using dense networks for images: the parameter count explodes, flattening throws away the 2D spatial structure, and a pattern learned at one position is not recognized at another.
The solution: the CNN! CNNs process images using small filters that "slide" across the image. These filters detect local patterns (edges, corners, textures) regardless of position. The result: far fewer parameters and far higher accuracy.
Analogy: The Magnifying Glass
Dense network = looking at the entire photo at once (overwhelming!).
CNN = examining the photo with a magnifying glass – check small regions one at a time, find patterns, then combine the results. More efficient and more thorough.
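The parameter gap behind this analogy is easy to quantify. A minimal sketch, using layer sizes from this series' MNIST setup (the specific counts are illustrative):

```python
# Parameters needed to connect a 28x28 MNIST image to a first layer,
# dense vs. convolutional.

# Dense: every input pixel connects to every hidden unit.
dense_params = 28 * 28 * 128 + 128      # weights + biases = 100,480
print(f"Dense 784->128: {dense_params:,} parameters")

# Conv: 8 filters of 3x3, shared across ALL positions in the image.
conv_params = 8 * (3 * 3) + 8           # weights + biases = 80
print(f"Conv 8 x (3x3): {conv_params:,} parameters")

print(f"Dense needs {dense_params // conv_params}x more")  # 1256x
```

Weight sharing is the whole trick: one 3×3 filter is reused at every position, so the cost no longer scales with image size.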
2. The Convolution Operation – The Heart of the CNN
Convolution is an operation in which a small filter (kernel), e.g. 3×3, slides over the image. At each position, the filter is multiplied element-wise with the image patch, and the products are summed. The result is one number in the feature map.
```python
import numpy as np

def conv2d(image, kernel):
    """
    2D convolution (no padding, stride=1)
    image:  (H, W)   - single-channel image
    kernel: (kH, kW) - filter
    returns: feature map of shape (H-kH+1, W-kW+1)
    """
    H, W = image.shape
    kH, kW = kernel.shape
    outH = H - kH + 1
    outW = W - kW + 1
    output = np.zeros((outH, outW))
    for i in range(outH):
        for j in range(outW):
            # Extract patch, element-wise multiply, then sum
            patch = image[i:i+kH, j:j+kW]
            output[i, j] = np.sum(patch * kernel)
    return output

# ===========================
# Demo: Edge Detection!
# ===========================
image = np.array([
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
], dtype=np.float64)

# Vertical edge detector
kernel_v = np.array([[-1, 0, 1],
                     [-1, 0, 1],
                     [-1, 0, 1]], dtype=np.float64)

# Horizontal edge detector
kernel_h = np.array([[-1, -1, -1],
                     [ 0,  0,  0],
                     [ 1,  1,  1]], dtype=np.float64)

print("Vertical edges:\n", conv2d(image, kernel_v))
print("Horizontal edges:\n", conv2d(image, kernel_h))
```
Key Insight: In a CNN, these filters are not designed by hand – the network learns the best filters through backpropagation! Early layers learn to detect edges, middle layers learn shapes, and later layers learn complete objects.
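A property worth seeing once in code: the same filter fires wherever its pattern appears. A small sketch (restating the minimal `conv2d` from above so it runs standalone) that shifts the white square and shows the response shifting with it:

```python
import numpy as np

def conv2d(image, kernel):
    # Minimal valid convolution (stride=1, no padding), as in this section
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kH, j:j+kW] * kernel)
    return out

kernel_v = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=np.float64)

img = np.zeros((7, 7)); img[2:5, 1:4] = 1           # square on the left
img_shift = np.zeros((7, 7)); img_shift[2:5, 3:6] = 1  # same square, 2 px right

r1 = conv2d(img, kernel_v)
r2 = conv2d(img_shift, kernel_v)
# The responses are identical up to the same 2-pixel shift:
print(np.allclose(r1[:, :-2], r2[:, 2:]))  # True
```

This is translation equivariance: shift the input, and the feature map shifts by the same amount. It is exactly why one learned filter replaces thousands of dense weights.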
3. Padding & Stride – Controlling Output Size
Padding adds a "frame" of zeros around the image so that the output size equals the input size. Stride is how far the filter moves at each step (stride=2 → the output is half the size).
```python
import numpy as np

def conv2d_full(image, kernel, padding=0, stride=1):
    """Conv2D with padding and stride support"""
    # Add zero-padding
    if padding > 0:
        image = np.pad(image, padding, mode='constant')
    H, W = image.shape
    kH, kW = kernel.shape
    outH = (H - kH) // stride + 1
    outW = (W - kW) // stride + 1
    output = np.zeros((outH, outW))
    for i in range(outH):
        for j in range(outW):
            si, sj = i * stride, j * stride
            patch = image[si:si+kH, sj:sj+kW]
            output[i, j] = np.sum(patch * kernel)
    return output

# Output size formula:
#   out = (input + 2*padding - kernel) // stride + 1
# Example: 28x28 image, 3x3 kernel
#   No padding:          (28 - 3)/1 + 1   = 26x26
#   Padding=1:           (28+2 - 3)/1 + 1 = 28x28 -> "same"!
#   Stride=2:            (28 - 3)/2 + 1   = 13x13 -> downsampled
#   Padding=1, stride=2: (28+2 - 3)/2 + 1 = 14x14

img = np.random.rand(28, 28)
k = np.random.rand(3, 3)
print("No padding:  ", conv2d_full(img, k).shape)             # (26, 26)
print("Padding=1:   ", conv2d_full(img, k, padding=1).shape)  # (28, 28)
print("Stride=2:    ", conv2d_full(img, k, stride=2).shape)   # (13, 13)
print("Both p=1,s=2:", conv2d_full(img, k, 1, 2).shape)       # (14, 14)
```
Output Size Formula: output_size = (input + 2×padding − kernel) ÷ stride + 1
Memorize this formula – you'll use it constantly when designing CNN architectures.
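The formula is worth wrapping in a tiny helper so you can sanity-check a design without mental arithmetic. A sketch (the function name `conv_out_size` is ours, not part of the layers built in this series):

```python
def conv_out_size(inp, kernel, padding=0, stride=1):
    """Output size formula: (input + 2*padding - kernel) // stride + 1"""
    return (inp + 2 * padding - kernel) // stride + 1

# The four cases from this section:
print(conv_out_size(28, 3))                      # 26  (valid conv)
print(conv_out_size(28, 3, padding=1))           # 28  ("same" conv)
print(conv_out_size(28, 3, stride=2))            # 13
print(conv_out_size(28, 3, padding=1, stride=2)) # 14
```

Note the integer division: when stride doesn't divide evenly, the last partial window is simply dropped, which matches the loop bounds in `conv2d_full` above.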
4. Pooling Layer – Shrinking Feature Maps
Pooling reduces the size of feature maps (downsampling) in order to: reduce the number of parameters, prevent overfitting, and make the network more robust to small translations. Max pooling takes the largest value in each window.
```python
import numpy as np

def max_pool2d(feature_map, pool_size=2, stride=2):
    """Max Pooling 2D"""
    H, W = feature_map.shape
    outH = (H - pool_size) // stride + 1
    outW = (W - pool_size) // stride + 1
    output = np.zeros((outH, outW))
    for i in range(outH):
        for j in range(outW):
            si, sj = i * stride, j * stride
            window = feature_map[si:si+pool_size, sj:sj+pool_size]
            output[i, j] = np.max(window)
    return output

# Demo
fm = np.array([[1, 3, 2, 1],
               [4, 6, 5, 2],
               [3, 1, 7, 4],
               [8, 2, 3, 6]], dtype=np.float64)

print("Input (4x4):\n", fm)
print("After MaxPool 2x2:\n", max_pool2d(fm))
# [[6. 5.]
#  [8. 7.]]  -> size halved, key values preserved!
```
5. Full CNN Architecture
A CNN combines several building blocks in a specific order. Convolutional layers extract features, pooling layers downsample, and fully connected layers at the end perform the classification.
Parameter Comparison:
Dense Network (Page 2): 784→128→64→10 = ~109k parameters.
CNN: Conv(8)+Conv(16)+FC(400→64→10) = ~28k parameters.
The CNN has ~4× fewer parameters yet higher accuracy – because it exploits the spatial structure of images!
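Both figures can be checked with the output-size formula from section 3. A quick sketch, assuming 3×3 kernels, 2×2 pooling, and biases throughout (consistent with the layers used on this page):

```python
def out_size(n, k=3):            # valid 3x3 convolution
    return n - k + 1

# Shape trace: Conv(8) -> Pool -> Conv(16) -> Pool on a 28x28 image
h = 28
h = out_size(h)   # Conv(8):  28 -> 26
h = h // 2        # Pool:     26 -> 13
h = out_size(h)   # Conv(16): 13 -> 11
h = h // 2        # Pool:     11 -> 5
flat = 16 * h * h
print(flat)       # 400 -> matches FC(400->64->10)

# Parameter counts (weights + biases)
conv1_params = 8 * 3 * 3 + 8            # 80
conv2_params = 16 * 8 * 3 * 3 + 16      # 16 filters over 8 input channels
fc_params    = 400 * 64 + 64 + 64 * 10 + 10
print(conv1_params + conv2_params + fc_params)  # 27562 -> ~28k

dense_params = 784 * 128 + 128 + 128 * 64 + 64 + 64 * 10 + 10
print(dense_params)                             # 109386 -> ~109k
```

Notice where the dense network's budget goes: over 100k of its parameters sit in the very first 784→128 layer, exactly the layer a convolution replaces with a few shared filters.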
6. Building a CNN from Scratch – Pure NumPy
We'll implement each layer as a separate class with forward() and backward() methods, then chain them into a single CNN pipeline.
```python
import numpy as np

# =====================================================
# CONVOLUTIONAL LAYER
# =====================================================
class ConvLayer:
    def __init__(self, num_filters, kernel_size=3):
        self.num_filters = num_filters
        self.k = kernel_size
        # He initialization: shape (num_filters, kH, kW)
        self.filters = np.random.randn(
            num_filters, kernel_size, kernel_size
        ) * np.sqrt(2.0 / (kernel_size * kernel_size))
        self.biases = np.zeros(num_filters)

    def forward(self, input):
        """input: (H, W) or batch (N, H, W)"""
        self.input = input
        if input.ndim == 2:
            input = input[np.newaxis]  # add batch dim
        N, H, W = input.shape
        outH = H - self.k + 1
        outW = W - self.k + 1
        output = np.zeros((N, self.num_filters, outH, outW))
        for n in range(N):
            for f in range(self.num_filters):
                for i in range(outH):
                    for j in range(outW):
                        patch = input[n, i:i+self.k, j:j+self.k]
                        output[n, f, i, j] = (
                            np.sum(patch * self.filters[f]) + self.biases[f]
                        )
        self.output = output
        return output

    def backward(self, d_out, lr):
        """Compute gradients and update filters.
        (First layer in our pipeline, so no input gradient is returned.)"""
        inp = self.input if self.input.ndim == 3 else self.input[np.newaxis]
        N, H, W = inp.shape
        d_filters = np.zeros_like(self.filters)
        d_biases = np.zeros_like(self.biases)
        for n in range(N):
            for f in range(self.num_filters):
                for i in range(d_out.shape[2]):
                    for j in range(d_out.shape[3]):
                        patch = inp[n, i:i+self.k, j:j+self.k]
                        d_filters[f] += patch * d_out[n, f, i, j]
                        d_biases[f] += d_out[n, f, i, j]
        self.filters -= lr * d_filters / N
        self.biases -= lr * d_biases / N

# =====================================================
# MAX POOLING LAYER
# =====================================================
class MaxPoolLayer:
    def __init__(self, pool_size=2):
        self.p = pool_size

    def forward(self, input):
        """input: (N, C, H, W)"""
        self.input = input
        N, C, H, W = input.shape
        outH = H // self.p
        outW = W // self.p
        output = np.zeros((N, C, outH, outW))
        self.mask = np.zeros_like(input)
        for i in range(outH):
            for j in range(outW):
                si, sj = i * self.p, j * self.p
                window = input[:, :, si:si+self.p, sj:sj+self.p]
                output[:, :, i, j] = np.max(window, axis=(2, 3))
                # Save mask for backward
                for n in range(N):
                    for c in range(C):
                        w = window[n, c]
                        mi, mj = np.unravel_index(w.argmax(), w.shape)
                        self.mask[n, c, si+mi, sj+mj] = 1
        return output

    def backward(self, d_out, lr=None):
        """Route gradients to max positions"""
        d_input = np.zeros_like(self.input)
        N, C, outH, outW = d_out.shape
        for i in range(outH):
            for j in range(outW):
                si, sj = i * self.p, j * self.p
                for n in range(N):
                    for c in range(C):
                        d_input[n, c, si:si+self.p, sj:sj+self.p] += (
                            self.mask[n, c, si:si+self.p, sj:sj+self.p]
                            * d_out[n, c, i, j]
                        )
        return d_input

# =====================================================
# ReLU LAYER
# =====================================================
class ReLULayer:
    def forward(self, x):
        self.input = x
        return np.maximum(0, x)

    def backward(self, d_out, lr=None):
        return d_out * (self.input > 0)
```
Why does every layer need backward()?
Because backpropagation works as a chain: gradients from the output flow backward through each layer. Pooling routes gradients only to the max positions, Conv updates its filters, and ReLU zeroes gradients at negative positions. It's the same chain rule from Page 1 – just with more layers!
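A quick way to build trust in a hand-written backward() is a finite-difference check. A minimal sketch for the ReLU layer (restated here so it runs standalone, with the unused lr argument dropped; the same idea applies to the Conv and Pool layers):

```python
import numpy as np

class ReLULayer:
    def forward(self, x):
        self.input = x
        return np.maximum(0, x)
    def backward(self, d_out):
        return d_out * (self.input > 0)

# For L = sum(relu(x)), dL/dx from backward() should match the
# central difference (L(x+eps) - L(x-eps)) / (2*eps) at every element.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4))     # generic inputs (almost surely not at 0)
relu = ReLULayer()
relu.forward(x)
analytic = relu.backward(np.ones_like(x))

eps = 1e-6
numeric = np.zeros_like(x)
for idx in np.ndindex(x.shape):
    xp, xm = x.copy(), x.copy()
    xp[idx] += eps
    xm[idx] -= eps
    numeric[idx] = (np.maximum(0, xp).sum() - np.maximum(0, xm).sum()) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-4))  # True
```

If a layer's analytic gradient disagrees with the numeric one, the bug is in backward() – this catches sign errors and off-by-one slicing mistakes before they silently ruin training.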
7. MNIST with CNN – 99%+ Accuracy!
Now let's combine all the layers and train on MNIST. Since a from-scratch CNN in pure Python is slow, we'll use a 5000-image subset for the demo – but the results are already very impressive.
```python
import numpy as np
from sklearn.datasets import fetch_openml

# =====================================================
# 1. LOAD DATA (subset for speed)
# =====================================================
print("Loading MNIST...")
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X = mnist.data.astype(np.float64) / 255.0
y = mnist.target.astype(int)

# Use 5k for training (CNN from scratch is slow!)
X_train = X[:5000].reshape(-1, 28, 28)   # (5000, 28, 28)
y_train = y[:5000]
X_test = X[60000:61000].reshape(-1, 28, 28)
y_test = y[60000:61000]

# =====================================================
# 2. BUILD CNN PIPELINE
#    Conv(8) -> ReLU -> Pool -> Flatten -> FC(64) -> Softmax(10)
# =====================================================
conv1 = ConvLayer(num_filters=8, kernel_size=3)   # 28 -> 26
relu1 = ReLULayer()
pool1 = MaxPoolLayer(pool_size=2)                 # 26 -> 13

# FC layers (reusing DeepNeuralNetwork from Page 2)
# After pool: 8 filters x 13 x 13 = 1352 flattened
fc = DeepNeuralNetwork([1352, 64, 10])

print("CNN: Conv(8,3x3) -> ReLU -> MaxPool(2) -> FC(1352->64->10)")

# =====================================================
# 3. TRAINING LOOP
# =====================================================
epochs = 3
batch_size = 16
lr = 0.005

def one_hot(labels, nc):
    enc = np.zeros((len(labels), nc))
    enc[np.arange(len(labels)), labels] = 1
    return enc

print(f"\nTraining {epochs} epochs (batch={batch_size})")
for epoch in range(epochs):
    idx = np.random.permutation(len(X_train))
    correct = 0
    for i in range(0, len(X_train), batch_size):
        Xb = X_train[idx[i:i+batch_size]]
        yb = y_train[idx[i:i+batch_size]]
        yb_oh = one_hot(yb, 10)

        # Forward through CNN
        c1 = conv1.forward(Xb)
        r1 = relu1.forward(c1)
        p1 = pool1.forward(r1)

        # Flatten for FC
        flat = p1.reshape(p1.shape[0], -1)   # (batch, 1352)
        probs = fc.forward(flat)

        # Accuracy
        correct += np.sum(np.argmax(probs, axis=1) == yb)

        # Backward through FC
        # NOTE: this assumes DeepNeuralNetwork.backward returns the
        # gradient w.r.t. its input, shape (batch, 1352), so the chain
        # can continue into the conv stack
        d_flat = fc.backward(yb_oh, lr)

        # Backward through CNN layers
        d_pool = d_flat.reshape(p1.shape)
        d_relu = pool1.backward(d_pool)
        d_conv = relu1.backward(d_relu)
        conv1.backward(d_conv, lr)

    acc = correct / len(X_train) * 100
    print(f"  Epoch {epoch+1} - Train Acc: {acc:.1f}%")

# =====================================================
# 4. TEST
# =====================================================
c1 = conv1.forward(X_test)
r1 = relu1.forward(c1)
p1 = pool1.forward(r1)
flat = p1.reshape(p1.shape[0], -1)
preds = np.argmax(fc.forward(flat), axis=1)
test_acc = np.mean(preds == y_test) * 100
print(f"\nTest Accuracy: {test_acc:.1f}%")
# With full dataset + more epochs -> 99%+
```
CNN > Dense Network!
Even with a small subset and just 3 epochs, the CNN already shows an edge over the dense network. With the full dataset, more epochs, and 2 conv layers, accuracy can reach 99%+. This is because a CNN understands the spatial structure of images – something dense networks cannot.
8. Page 3 Summary
| Concept | What It Is | Key Code |
|---|---|---|
| Convolution | Filter slides over image → feature map | `np.sum(patch * kernel)` |
| Filter/Kernel | Small pattern detector (3×3, 5×5) – learned! | `randn(F, kH, kW)` |
| Feature Map | Convolution output – a "map" of detected features | `(H-k+1, W-k+1)` |
| Padding | Zero border → preserve size | `np.pad(img, p)` |
| Stride | Filter step – stride=2 → downsample | `(H+2p-k)//s + 1` |
| Max Pooling | Take max value per window → downsample | `np.max(window)` |
| Flatten | Reshape 3D→1D for FC layer | `x.reshape(N, -1)` |
| CNN Pipeline | Conv → ReLU → Pool → FC → Softmax | `ConvLayer` + `FCLayer` |
Page 2 – Multi-Layer Network & Real Dataset

Coming Next: Page 4 – Regularization & Optimization
Combating overfitting with Dropout, Batch Normalization, and L2 Regularization. Plus advanced optimizers: Adam, RMSprop, and learning-rate scheduling. Building robust, production-ready models. Stay tuned!