Table of Contents – Page 3
- Why CNN? – The weakness of dense networks for images
- The Convolution Operation – A filter sliding over an image
- Padding & Stride – Controlling output size
- Pooling Layer – Shrinking feature maps, keeping key info
- Full CNN Architecture – Conv → Pool → Conv → Pool → FC
- Building a CNN from Scratch – Pure NumPy implementation
- MNIST with CNN – 99%+ accuracy!
- Summary & Page 4 Preview
1. Why CNN? – Dense Networks Aren't Enough
On Page 2, we achieved 97% accuracy on MNIST with a dense (fully connected) network. But three major problems arise when using dense networks for images: the parameter count explodes, flattening throws away the 2D spatial structure, and a pattern learned at one position is not recognized at another.
The solution: the CNN! CNNs process images using small filters that "slide" across the image. These filters detect local patterns (edges, corners, textures) regardless of position. The result: far fewer parameters and far higher accuracy.
Analogy: The Magnifying Glass
Dense network = looking at the entire photo at once (overwhelming!).
CNN = examining the photo with a magnifying glass – check small regions one at a time, find patterns, then combine the results. More efficient and more thorough.
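The parameter gap behind this analogy is easy to quantify. A minimal sketch, using layer sizes from this series' MNIST setup (the specific counts are illustrative):

```python
# Parameters needed to connect a 28x28 MNIST image to a first layer,
# dense vs. convolutional.

# Dense: every input pixel connects to every hidden unit.
dense_params = 28 * 28 * 128 + 128      # weights + biases = 100,480
print(f"Dense 784->128: {dense_params:,} parameters")

# Conv: 8 filters of 3x3, shared across ALL positions in the image.
conv_params = 8 * (3 * 3) + 8           # weights + biases = 80
print(f"Conv 8 x (3x3): {conv_params:,} parameters")

print(f"Dense needs {dense_params // conv_params}x more")  # 1256x
```

Weight sharing is the whole trick: one 3×3 filter is reused at every position, so the cost no longer scales with image size.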
2. The Convolution Operation – The Heart of the CNN
Convolution is an operation in which a small filter (kernel), e.g. 3×3, slides over the image. At each position, the filter is multiplied element-wise with the image patch, and the products are summed. The result is one number in the feature map.
```python
import numpy as np

def conv2d(image, kernel):
    """
    2D convolution (no padding, stride=1)
    image:  (H, W)   - single-channel image
    kernel: (kH, kW) - filter
    returns: feature map of shape (H-kH+1, W-kW+1)
    """
    H, W = image.shape
    kH, kW = kernel.shape
    outH = H - kH + 1
    outW = W - kW + 1
    output = np.zeros((outH, outW))
    for i in range(outH):
        for j in range(outW):
            # Extract patch, element-wise multiply, then sum
            patch = image[i:i+kH, j:j+kW]
            output[i, j] = np.sum(patch * kernel)
    return output

# ===========================
# Demo: Edge Detection!
# ===========================
image = np.array([
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
], dtype=np.float64)

# Vertical edge detector
kernel_v = np.array([[-1, 0, 1],
                     [-1, 0, 1],
                     [-1, 0, 1]], dtype=np.float64)

# Horizontal edge detector
kernel_h = np.array([[-1, -1, -1],
                     [ 0,  0,  0],
                     [ 1,  1,  1]], dtype=np.float64)

print("Vertical edges:\n", conv2d(image, kernel_v))
print("Horizontal edges:\n", conv2d(image, kernel_h))
```
Key Insight: In a CNN, these filters are not designed by hand – the network learns the best filters through backpropagation! Early layers learn to detect edges, middle layers learn shapes, and later layers learn complete objects.
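A property worth seeing once in code: the same filter fires wherever its pattern appears. A small sketch (restating the minimal `conv2d` from above so it runs standalone) that shifts the white square and shows the response shifting with it:

```python
import numpy as np

def conv2d(image, kernel):
    # Minimal valid convolution (stride=1, no padding), as in this section
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kH, j:j+kW] * kernel)
    return out

kernel_v = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=np.float64)

img = np.zeros((7, 7)); img[2:5, 1:4] = 1           # square on the left
img_shift = np.zeros((7, 7)); img_shift[2:5, 3:6] = 1  # same square, 2 px right

r1 = conv2d(img, kernel_v)
r2 = conv2d(img_shift, kernel_v)
# The responses are identical up to the same 2-pixel shift:
print(np.allclose(r1[:, :-2], r2[:, 2:]))  # True
```

This is translation equivariance: shift the input, and the feature map shifts by the same amount. It is exactly why one learned filter replaces thousands of dense weights.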
3. Padding & Stride – Controlling Output Size
Padding adds a "frame" of zeros around the image so that the output size equals the input size. Stride is how far the filter moves at each step (stride=2 → the output is half the size).
```python
import numpy as np

def conv2d_full(image, kernel, padding=0, stride=1):
    """Conv2D with padding and stride support"""
    # Add zero-padding
    if padding > 0:
        image = np.pad(image, padding, mode='constant')
    H, W = image.shape
    kH, kW = kernel.shape
    outH = (H - kH) // stride + 1
    outW = (W - kW) // stride + 1
    output = np.zeros((outH, outW))
    for i in range(outH):
        for j in range(outW):
            si, sj = i * stride, j * stride
            patch = image[si:si+kH, sj:sj+kW]
            output[i, j] = np.sum(patch * kernel)
    return output

# Output size formula:
#   out = (input + 2*padding - kernel) // stride + 1
# Example: 28x28 image, 3x3 kernel
#   No padding:          (28 - 3)/1 + 1   = 26x26
#   Padding=1:           (28+2 - 3)/1 + 1 = 28x28 -> "same"!
#   Stride=2:            (28 - 3)/2 + 1   = 13x13 -> downsampled
#   Padding=1, stride=2: (28+2 - 3)/2 + 1 = 14x14

img = np.random.rand(28, 28)
k = np.random.rand(3, 3)
print("No padding:  ", conv2d_full(img, k).shape)             # (26, 26)
print("Padding=1:   ", conv2d_full(img, k, padding=1).shape)  # (28, 28)
print("Stride=2:    ", conv2d_full(img, k, stride=2).shape)   # (13, 13)
print("Both p=1,s=2:", conv2d_full(img, k, 1, 2).shape)       # (14, 14)
```
Output Size Formula: output_size = (input + 2×padding − kernel) ÷ stride + 1
Memorize this formula – you'll use it constantly when designing CNN architectures.
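The formula is worth wrapping in a tiny helper so you can sanity-check a design without mental arithmetic. A sketch (the function name `conv_out_size` is ours, not part of the layers built in this series):

```python
def conv_out_size(inp, kernel, padding=0, stride=1):
    """Output size formula: (input + 2*padding - kernel) // stride + 1"""
    return (inp + 2 * padding - kernel) // stride + 1

# The four cases from this section:
print(conv_out_size(28, 3))                      # 26  (valid conv)
print(conv_out_size(28, 3, padding=1))           # 28  ("same" conv)
print(conv_out_size(28, 3, stride=2))            # 13
print(conv_out_size(28, 3, padding=1, stride=2)) # 14
```

Note the integer division: when stride doesn't divide evenly, the last partial window is simply dropped, which matches the loop bounds in `conv2d_full` above.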
4. Pooling Layer – Shrinking Feature Maps
Pooling reduces the size of feature maps (downsampling) in order to: reduce the number of parameters, prevent overfitting, and make the network more robust to small translations. Max pooling takes the largest value in each window.
```python
import numpy as np

def max_pool2d(feature_map, pool_size=2, stride=2):
    """Max Pooling 2D"""
    H, W = feature_map.shape
    outH = (H - pool_size) // stride + 1
    outW = (W - pool_size) // stride + 1
    output = np.zeros((outH, outW))
    for i in range(outH):
        for j in range(outW):
            si, sj = i * stride, j * stride
            window = feature_map[si:si+pool_size, sj:sj+pool_size]
            output[i, j] = np.max(window)
    return output

# Demo
fm = np.array([[1, 3, 2, 1],
               [4, 6, 5, 2],
               [3, 1, 7, 4],
               [8, 2, 3, 6]], dtype=np.float64)

print("Input (4x4):\n", fm)
print("After MaxPool 2x2:\n", max_pool2d(fm))
# [[6. 5.]
#  [8. 7.]]  -> size halved, key values preserved!
```
5. Full CNN Architecture
A CNN combines several building blocks in a specific order. Convolutional layers extract features, pooling layers downsample, and fully connected layers at the end perform the classification.
Parameter Comparison:
Dense Network (Page 2): 784→128→64→10 = ~109k parameters.
CNN: Conv(8)+Conv(16)+FC(400→64→10) = ~28k parameters.
The CNN has ~4× fewer parameters yet higher accuracy – because it exploits the spatial structure of images!
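Both figures can be checked with the output-size formula from section 3. A quick sketch, assuming 3×3 kernels, 2×2 pooling, and biases throughout (consistent with the layers used on this page):

```python
def out_size(n, k=3):            # valid 3x3 convolution
    return n - k + 1

# Shape trace: Conv(8) -> Pool -> Conv(16) -> Pool on a 28x28 image
h = 28
h = out_size(h)   # Conv(8):  28 -> 26
h = h // 2        # Pool:     26 -> 13
h = out_size(h)   # Conv(16): 13 -> 11
h = h // 2        # Pool:     11 -> 5
flat = 16 * h * h
print(flat)       # 400 -> matches FC(400->64->10)

# Parameter counts (weights + biases)
conv1_params = 8 * 3 * 3 + 8            # 80
conv2_params = 16 * 8 * 3 * 3 + 16      # 16 filters over 8 input channels
fc_params    = 400 * 64 + 64 + 64 * 10 + 10
print(conv1_params + conv2_params + fc_params)  # 27562 -> ~28k

dense_params = 784 * 128 + 128 + 128 * 64 + 64 + 64 * 10 + 10
print(dense_params)                             # 109386 -> ~109k
```

Notice where the dense network's budget goes: over 100k of its parameters sit in the very first 784→128 layer, exactly the layer a convolution replaces with a few shared filters.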
6. Building a CNN from Scratch – Pure NumPy
We'll implement each layer as a separate class with forward() and backward() methods, then chain them into a single CNN pipeline.
```python
import numpy as np

# =====================================================
# CONVOLUTIONAL LAYER
# =====================================================
class ConvLayer:
    def __init__(self, num_filters, kernel_size=3):
        self.num_filters = num_filters
        self.k = kernel_size
        # He initialization: shape (num_filters, kH, kW)
        self.filters = np.random.randn(
            num_filters, kernel_size, kernel_size
        ) * np.sqrt(2.0 / (kernel_size * kernel_size))
        self.biases = np.zeros(num_filters)

    def forward(self, input):
        """input: (H, W) or batch (N, H, W)"""
        self.input = input
        if input.ndim == 2:
            input = input[np.newaxis]  # add batch dim
        N, H, W = input.shape
        outH = H - self.k + 1
        outW = W - self.k + 1
        output = np.zeros((N, self.num_filters, outH, outW))
        for n in range(N):
            for f in range(self.num_filters):
                for i in range(outH):
                    for j in range(outW):
                        patch = input[n, i:i+self.k, j:j+self.k]
                        output[n, f, i, j] = (
                            np.sum(patch * self.filters[f]) + self.biases[f]
                        )
        self.output = output
        return output

    def backward(self, d_out, lr):
        """Compute gradients and update filters.
        (First layer in our pipeline, so no input gradient is returned.)"""
        inp = self.input if self.input.ndim == 3 else self.input[np.newaxis]
        N, H, W = inp.shape
        d_filters = np.zeros_like(self.filters)
        d_biases = np.zeros_like(self.biases)
        for n in range(N):
            for f in range(self.num_filters):
                for i in range(d_out.shape[2]):
                    for j in range(d_out.shape[3]):
                        patch = inp[n, i:i+self.k, j:j+self.k]
                        d_filters[f] += patch * d_out[n, f, i, j]
                        d_biases[f] += d_out[n, f, i, j]
        self.filters -= lr * d_filters / N
        self.biases -= lr * d_biases / N

# =====================================================
# MAX POOLING LAYER
# =====================================================
class MaxPoolLayer:
    def __init__(self, pool_size=2):
        self.p = pool_size

    def forward(self, input):
        """input: (N, C, H, W)"""
        self.input = input
        N, C, H, W = input.shape
        outH = H // self.p
        outW = W // self.p
        output = np.zeros((N, C, outH, outW))
        self.mask = np.zeros_like(input)
        for i in range(outH):
            for j in range(outW):
                si, sj = i * self.p, j * self.p
                window = input[:, :, si:si+self.p, sj:sj+self.p]
                output[:, :, i, j] = np.max(window, axis=(2, 3))
                # Save mask for backward
                for n in range(N):
                    for c in range(C):
                        w = window[n, c]
                        mi, mj = np.unravel_index(w.argmax(), w.shape)
                        self.mask[n, c, si+mi, sj+mj] = 1
        return output

    def backward(self, d_out, lr=None):
        """Route gradients to max positions"""
        d_input = np.zeros_like(self.input)
        N, C, outH, outW = d_out.shape
        for i in range(outH):
            for j in range(outW):
                si, sj = i * self.p, j * self.p
                for n in range(N):
                    for c in range(C):
                        d_input[n, c, si:si+self.p, sj:sj+self.p] += (
                            self.mask[n, c, si:si+self.p, sj:sj+self.p]
                            * d_out[n, c, i, j]
                        )
        return d_input

# =====================================================
# ReLU LAYER
# =====================================================
class ReLULayer:
    def forward(self, x):
        self.input = x
        return np.maximum(0, x)

    def backward(self, d_out, lr=None):
        return d_out * (self.input > 0)
```
Why does every layer need backward()?
Because backpropagation works as a chain: gradients from the output flow backward through each layer. Pooling routes gradients only to the max positions, Conv updates its filters, and ReLU zeroes gradients at negative positions. It's the same chain rule from Page 1 – just with more layers!
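A quick way to build trust in a hand-written backward() is a finite-difference check. A minimal sketch for the ReLU layer (restated here so it runs standalone, with the unused lr argument dropped; the same idea applies to the Conv and Pool layers):

```python
import numpy as np

class ReLULayer:
    def forward(self, x):
        self.input = x
        return np.maximum(0, x)
    def backward(self, d_out):
        return d_out * (self.input > 0)

# For L = sum(relu(x)), dL/dx from backward() should match the
# central difference (L(x+eps) - L(x-eps)) / (2*eps) at every element.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4))     # generic inputs (almost surely not at 0)
relu = ReLULayer()
relu.forward(x)
analytic = relu.backward(np.ones_like(x))

eps = 1e-6
numeric = np.zeros_like(x)
for idx in np.ndindex(x.shape):
    xp, xm = x.copy(), x.copy()
    xp[idx] += eps
    xm[idx] -= eps
    numeric[idx] = (np.maximum(0, xp).sum() - np.maximum(0, xm).sum()) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-4))  # True
```

If a layer's analytic gradient disagrees with the numeric one, the bug is in backward() – this catches sign errors and off-by-one slicing mistakes before they silently ruin training.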
7. MNIST with CNN – 99%+ Accuracy!
Now let's combine all the layers and train on MNIST. Since a from-scratch CNN in pure Python is slow, we'll use a 5000-image subset for the demo – but the results are already very impressive.
```python
import numpy as np
from sklearn.datasets import fetch_openml

# =====================================================
# 1. LOAD DATA (subset for speed)
# =====================================================
print("Loading MNIST...")
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X = mnist.data.astype(np.float64) / 255.0
y = mnist.target.astype(int)

# Use 5k for training (CNN from scratch is slow!)
X_train = X[:5000].reshape(-1, 28, 28)   # (5000, 28, 28)
y_train = y[:5000]
X_test = X[60000:61000].reshape(-1, 28, 28)
y_test = y[60000:61000]

# =====================================================
# 2. BUILD CNN PIPELINE
#    Conv(8) -> ReLU -> Pool -> Flatten -> FC(64) -> Softmax(10)
# =====================================================
conv1 = ConvLayer(num_filters=8, kernel_size=3)   # 28 -> 26
relu1 = ReLULayer()
pool1 = MaxPoolLayer(pool_size=2)                 # 26 -> 13

# FC layers (reusing DeepNeuralNetwork from Page 2)
# After pool: 8 filters x 13 x 13 = 1352 flattened
fc = DeepNeuralNetwork([1352, 64, 10])

print("CNN: Conv(8,3x3) -> ReLU -> MaxPool(2) -> FC(1352->64->10)")

# =====================================================
# 3. TRAINING LOOP
# =====================================================
epochs = 3
batch_size = 16
lr = 0.005

def one_hot(labels, nc):
    enc = np.zeros((len(labels), nc))
    enc[np.arange(len(labels)), labels] = 1
    return enc

print(f"\nTraining {epochs} epochs (batch={batch_size})")
for epoch in range(epochs):
    idx = np.random.permutation(len(X_train))
    correct = 0
    for i in range(0, len(X_train), batch_size):
        Xb = X_train[idx[i:i+batch_size]]
        yb = y_train[idx[i:i+batch_size]]
        yb_oh = one_hot(yb, 10)

        # Forward through CNN
        c1 = conv1.forward(Xb)
        r1 = relu1.forward(c1)
        p1 = pool1.forward(r1)

        # Flatten for FC
        flat = p1.reshape(p1.shape[0], -1)   # (batch, 1352)
        probs = fc.forward(flat)

        # Accuracy
        correct += np.sum(np.argmax(probs, axis=1) == yb)

        # Backward through FC
        # NOTE: this assumes DeepNeuralNetwork.backward returns the
        # gradient w.r.t. its input, shape (batch, 1352), so the chain
        # can continue into the conv stack
        d_flat = fc.backward(yb_oh, lr)

        # Backward through CNN layers
        d_pool = d_flat.reshape(p1.shape)
        d_relu = pool1.backward(d_pool)
        d_conv = relu1.backward(d_relu)
        conv1.backward(d_conv, lr)

    acc = correct / len(X_train) * 100
    print(f"  Epoch {epoch+1} - Train Acc: {acc:.1f}%")

# =====================================================
# 4. TEST
# =====================================================
c1 = conv1.forward(X_test)
r1 = relu1.forward(c1)
p1 = pool1.forward(r1)
flat = p1.reshape(p1.shape[0], -1)
preds = np.argmax(fc.forward(flat), axis=1)
test_acc = np.mean(preds == y_test) * 100
print(f"\nTest Accuracy: {test_acc:.1f}%")
# With full dataset + more epochs -> 99%+
```
CNN > Dense Network!
Even with a small subset and just 3 epochs, the CNN already shows an edge over the dense network. With the full dataset, more epochs, and 2 conv layers, accuracy can reach 99%+. This is because a CNN understands the spatial structure of images – something dense networks cannot.
8. Page 3 Summary
| Concept | What It Is | Key Code |
|---|---|---|
| Convolution | Filter slides over image → feature map | `np.sum(patch * kernel)` |
| Filter/Kernel | Small pattern detector (3×3, 5×5) – learned! | `randn(F, kH, kW)` |
| Feature Map | Convolution output – a "map" of detected features | `(H-k+1, W-k+1)` |
| Padding | Zero border → preserve size | `np.pad(img, p)` |
| Stride | Filter step – stride=2 → downsample | `(H+2p-k)//s + 1` |
| Max Pooling | Take max value per window → downsample | `np.max(window)` |
| Flatten | Reshape 3D→1D for FC layer | `x.reshape(N, -1)` |
| CNN Pipeline | Conv → ReLU → Pool → FC → Softmax | `ConvLayer` + `FCLayer` |
Page 2 – Multi-Layer Network & Real Dataset

Coming Next: Page 4 – Regularization & Optimization
Combating overfitting with Dropout, Batch Normalization, and L2 Regularization. Plus advanced optimizers: Adam, RMSprop, and learning-rate scheduling. Building robust, production-ready models. Stay tuned!