Table of Contents – Page 5
- Why RNN? – Sequential data needs "memory"
- Vanilla RNN – Basic architecture and forward pass
- Backpropagation Through Time – BPTT and vanishing gradients
- LSTM – Long Short-Term Memory: solving vanishing gradients
- GRU – Gated Recurrent Unit: a leaner LSTM
- Character-Level Text Generator – Writing text one character at a time
- Sentiment Analysis – Classifying emotions from text
- Summary & Page 6 Preview
1. Why RNN? – Sequential Data Needs Memory
CNNs and Dense networks treat each input independently. But much real-world data is sequential – the meaning of a word depends on previous words, today's stock price depends on yesterday's. We need a network with memory.
Analogy: Watching a Movie
Feedforward = seeing one frame without knowing the previous frames. You can't understand the story.
RNN = watching a movie sequentially – you remember what happened before and it affects your understanding of the current scene.
2. Vanilla RNN – Basic Architecture
An RNN has a hidden state h(t) that acts as "memory". At each timestep, the hidden state is updated based on the new input and the previous hidden state: h(t) = tanh(W_h · h(t-1) + W_x · x(t) + b).
```python
import numpy as np

class VanillaRNN:
    """Simple RNN cell – from scratch"""

    def __init__(self, input_size, hidden_size, output_size):
        self.hidden_size = hidden_size
        # Weight matrices
        scale = 0.01
        self.Wxh = np.random.randn(input_size, hidden_size) * scale
        self.Whh = np.random.randn(hidden_size, hidden_size) * scale
        self.Why = np.random.randn(hidden_size, output_size) * scale
        self.bh = np.zeros((1, hidden_size))
        self.by = np.zeros((1, output_size))

    def forward(self, inputs, h_prev=None):
        """
        inputs: list of input vectors (one per timestep)
        h_prev: initial hidden state
        returns: outputs, hidden_states
        """
        if h_prev is None:
            h_prev = np.zeros((1, self.hidden_size))
        self.inputs = inputs
        self.hs = {-1: h_prev}  # hidden states per timestep
        self.outputs = []
        for t, x in enumerate(inputs):
            # Core RNN equation!
            self.hs[t] = np.tanh(
                x @ self.Wxh +             # input contribution
                self.hs[t-1] @ self.Whh +  # memory contribution
                self.bh                    # bias
            )
            # Output at this timestep
            y = self.hs[t] @ self.Why + self.by
            self.outputs.append(y)
        return self.outputs, self.hs

# Demo: process a sequence of 4 timesteps
rnn = VanillaRNN(input_size=10, hidden_size=32, output_size=10)
inputs = [np.random.randn(1, 10) for _ in range(4)]
outputs, hidden_states = rnn.forward(inputs)
print(f"Processed {len(inputs)} timesteps")
print(f"Last hidden state shape: {hidden_states[3].shape}")
print(f"Last output shape: {outputs[-1].shape}")
```
3. Backprop Through Time & Vanishing Gradient
BPTT (Backpropagation Through Time) = regular backprop, but "unrolled" through time. The problem: as gradients flow backward through many timesteps, they are repeatedly multiplied by W_hh. If |W_hh| < 1 → gradients vanish. If |W_hh| > 1 → gradients explode.
```python
import numpy as np

def clip_gradients(gradients, max_norm=5.0):
    """Clip gradients to prevent exploding gradients"""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in gradients))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        gradients = [g * scale for g in gradients]
    return gradients

# This fixes EXPLODING gradients.
# But VANISHING gradients need a different solution – LSTM!
```
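The repeated multiplication behind both failure modes can be shown with a one-dimensional toy recurrent weight. This is an illustrative sketch only: `w_small` and `w_big` are made-up stand-ins for the per-step factor |W_hh · tanh'|.

```python
# Toy 1-D illustration of the BPTT product: each backward step through
# time multiplies the gradient by the recurrent factor once more.
w_small, w_big = 0.9, 1.1
grad_vanish, grad_explode = 1.0, 1.0
for _ in range(100):
    grad_vanish *= w_small   # |w| < 1: shrinks every step
    grad_explode *= w_big    # |w| > 1: grows every step

print(f"after 100 steps, w=0.9: {grad_vanish:.2e}")   # ~2.66e-05
print(f"after 100 steps, w=1.1: {grad_explode:.2e}")  # ~1.38e+04
```

The same 100-step sequence either wipes the gradient out or blows it up; clipping only handles the second case.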
4. LSTM – Long Short-Term Memory
LSTM solves vanishing gradients by adding a cell state (an information "highway") and 3 gates that control what to remember, forget, and output. Gates are sigmoids (0-1) that act like valves.
```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

class LSTMCell:
    """Single LSTM cell – from scratch"""

    def __init__(self, input_size, hidden_size):
        n = input_size + hidden_size
        s = np.sqrt(2.0 / n)
        self.hidden_size = hidden_size
        # Combined weights for all 4 gates (efficiency!)
        # [forget, input, candidate, output]
        self.W = np.random.randn(n, 4 * hidden_size) * s
        self.b = np.zeros((1, 4 * hidden_size))
        # Initialize forget gate bias to 1 (remember by default)
        self.b[0, :hidden_size] = 1.0

    def forward(self, x, h_prev, c_prev):
        """One timestep forward"""
        H = self.hidden_size
        # Concatenate previous hidden state and input
        combined = np.concatenate([h_prev, x], axis=1)
        # Compute all 4 gates at once
        gates = combined @ self.W + self.b
        # Split into individual gates
        f = sigmoid(gates[:, :H])             # Forget gate
        i = sigmoid(gates[:, H:2*H])          # Input gate
        c_tilde = np.tanh(gates[:, 2*H:3*H])  # Candidate
        o = sigmoid(gates[:, 3*H:])           # Output gate
        # Update cell state: forget old + add new
        c_new = f * c_prev + i * c_tilde
        # Compute hidden state: filtered output
        h_new = o * np.tanh(c_new)
        # Cache for backprop
        self.cache = (x, h_prev, c_prev, f, i, c_tilde, o, c_new)
        return h_new, c_new

# Demo: process 5 timesteps
lstm = LSTMCell(input_size=10, hidden_size=32)
h = np.zeros((1, 32))
c = np.zeros((1, 32))
for t in range(5):
    x = np.random.randn(1, 10)
    h, c = lstm.forward(x, h, c)
    print(f"t={t}: h_norm={np.linalg.norm(h):.4f}, c_norm={np.linalg.norm(c):.4f}")
```
Why Does LSTM Solve Vanishing Gradients?
The cell state c(t) is a "highway" – information flows through multiplication and addition only (no repeated tanh). A forget gate near 1 allows gradients to flow without shrinking. It's like creating a special "express lane" for important information.
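A numeric sketch of the highway effect. This uses an assumed, simplified model: along the cell-state path, dc(T)/dc(0) reduces to the product of forget-gate values (since c(t) = f·c(t-1) + i·c̃), while the vanilla RNN path is modeled as a constant 0.9 factor per step.

```python
import numpy as np

np.random.seed(0)
# Along the cell state, dc(T)/dc(0) is the product of forget gates only.
forget_gates = np.random.uniform(0.95, 1.0, size=100)  # gates near 1
cell_path = np.prod(forget_gates)

# Vanilla RNN path: repeatedly squashed, modeled here as a 0.9 factor
tanh_path = 0.9 ** 100

print(f"cell-state path after 100 steps:  {cell_path:.4f}")
print(f"vanilla RNN path after 100 steps: {tanh_path:.2e}")
```

With gates near 1, the cell-state gradient stays orders of magnitude larger than the tanh-squashed path after the same 100 steps.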
5. GRU – Gated Recurrent Unit
GRU simplifies LSTM: only 2 gates (reset and update), no separate cell state. Faster to train, fewer parameters, and often performs comparably to LSTM.
```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

class GRUCell:
    """GRU: 2 gates, no separate cell state"""

    def __init__(self, input_size, hidden_size):
        n = input_size + hidden_size
        s = np.sqrt(2.0 / n)
        self.H = hidden_size
        # Weights for the update & reset gates plus the candidate
        self.Wz = np.random.randn(n, hidden_size) * s  # update gate
        self.Wr = np.random.randn(n, hidden_size) * s  # reset gate
        self.Wh = np.random.randn(n, hidden_size) * s  # candidate
        self.bz = np.zeros((1, hidden_size))
        self.br = np.zeros((1, hidden_size))
        self.bh = np.zeros((1, hidden_size))

    def forward(self, x, h_prev):
        combined = np.concatenate([h_prev, x], axis=1)
        # Update gate: how much of the old state to keep
        z = sigmoid(combined @ self.Wz + self.bz)
        # Reset gate: how much of the old state to use for the candidate
        r = sigmoid(combined @ self.Wr + self.br)
        # Candidate hidden state
        combined_r = np.concatenate([r * h_prev, x], axis=1)
        h_tilde = np.tanh(combined_r @ self.Wh + self.bh)
        # Final hidden state: interpolate between old and new
        h_new = z * h_prev + (1 - z) * h_tilde
        return h_new

# GRU vs LSTM comparison (n = input size, H = hidden size):
# LSTM: 3 gates + cell state -> 4*(n+H)*H gate weights
# GRU:  2 gates + candidate  -> 3*(n+H)*H gate weights -> 25% fewer!
```
LSTM vs GRU – When to Use Which?
LSTM → very long sequences, need precise long-term memory.
GRU → smaller datasets, need faster training, short-to-medium sequences.
Rule of thumb: start with GRU. If it's not enough, try LSTM.
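The "25% fewer parameters" figure from the comparison comment can be checked directly. A small sketch that counts only the gate weight matrices (biases ignored), using the shapes from the cell classes above:

```python
def gate_weight_counts(input_size, hidden_size):
    """Gate-weight parameter counts for the LSTM/GRU cells above."""
    n = input_size + hidden_size      # concatenated [h_prev, x]
    lstm = 4 * n * hidden_size        # forget, input, candidate, output
    gru = 3 * n * hidden_size         # update, reset, candidate
    return lstm, gru

lstm_p, gru_p = gate_weight_counts(input_size=10, hidden_size=32)
print(f"LSTM: {lstm_p}  GRU: {gru_p}  savings: {1 - gru_p / lstm_p:.0%}")
# LSTM: 5376  GRU: 4032  savings: 25%
```

Because both cells share the same (n, H) weight shape per gate, the ratio is exactly 3/4 regardless of the sizes chosen.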
6. Character-Level Text Generator
We'll train an LSTM to predict the next character from a text sequence. After training, the model can generate new text that mimics the writing style of the training data – one character at a time.
```python
import numpy as np
# Uses the LSTMCell class (and sigmoid) defined in section 4.

# =====================================================
# 1. PREPARE TEXT DATA
# =====================================================
text = """To be or not to be that is the question
Whether tis nobler in the mind to suffer
The slings and arrows of outrageous fortune
Or to take arms against a sea of troubles"""

# Build character vocabulary
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for c, i in char_to_idx.items()}
vocab_size = len(chars)
print(f"Vocab size: {vocab_size} chars")

# One-hot encode
def one_hot_char(idx, size):
    v = np.zeros((1, size))
    v[0, idx] = 1
    return v

# =====================================================
# 2. TRAINING LOOP
# =====================================================
hidden_size = 64
seq_length = 25   # chars per training sequence
lr = 0.01

lstm = LSTMCell(vocab_size, hidden_size)
Why = np.random.randn(hidden_size, vocab_size) * 0.01
by = np.zeros((1, vocab_size))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

print("Training text generator...")
for iteration in range(1000):
    # Random starting position
    start = np.random.randint(0, len(text) - seq_length - 1)
    inputs = [char_to_idx[c] for c in text[start:start+seq_length]]
    targets = [char_to_idx[c] for c in text[start+1:start+seq_length+1]]

    # Forward pass through the sequence
    h = np.zeros((1, hidden_size))
    c = np.zeros((1, hidden_size))
    loss = 0
    for t in range(seq_length):
        x = one_hot_char(inputs[t], vocab_size)
        h, c = lstm.forward(x, h, c)
        logits = h @ Why + by
        probs = softmax(logits)
        loss -= np.log(probs[0, targets[t]] + 1e-12)

    # NOTE: the backward pass (BPTT through the LSTM) and the weight
    # updates are omitted here for brevity; without them the loss will
    # not actually decrease. In practice, use an autodiff framework.

    if (iteration + 1) % 200 == 0:
        print(f"  Iter {iteration+1:>4} – Loss: {loss/seq_length:.4f}")

# =====================================================
# 3. GENERATE TEXT!
# =====================================================
def generate(seed_char, length=100):
    h = np.zeros((1, hidden_size))
    c = np.zeros((1, hidden_size))
    idx = char_to_idx[seed_char]
    result = seed_char
    for _ in range(length):
        x = one_hot_char(idx, vocab_size)
        h, c = lstm.forward(x, h, c)
        logits = h @ Why + by
        probs = softmax(logits).flatten()
        # Sample from the probability distribution
        idx = np.random.choice(vocab_size, p=probs)
        result += idx_to_char[idx]
    return result

print("\nGenerated text:")
print(generate("T", 150))
```
The Model Writes by Itself! After enough training, the model can generate Shakespeare-like text – learning spelling patterns, spacing, even sentence structure, just from predicting the next character one at a time.
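A common refinement not shown in the generator above is temperature sampling. This is an assumed addition (the `temperature` parameter is not part of the original code): scaling the logits before softmax controls how adventurous the samples are.

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0):
    """T < 1: conservative (sharper distribution); T > 1: more varied."""
    z = logits / temperature
    z = z - np.max(z)                      # numerical stability
    probs = np.exp(z) / np.sum(np.exp(z))
    return np.random.choice(len(probs), p=probs)

logits = np.array([2.0, 1.0, 0.1])
np.random.seed(0)
samples_cold = [sample_with_temperature(logits, 0.5) for _ in range(20)]
samples_hot = [sample_with_temperature(logits, 2.0) for _ in range(20)]
print("T=0.5:", samples_cold)  # typically dominated by the top index
print("T=2.0:", samples_hot)   # noticeably more varied
```

Inside `generate`, this would replace the plain `np.random.choice(vocab_size, p=probs)` call, with `logits` divided by the chosen temperature first.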
7. Sentiment Analysis – Text Emotion Classification
The most popular RNN application: read the entire sentence, take the last hidden state as a "summary", then classify the sentiment. The last hidden state contains information from all previous words.
```python
import numpy as np
# Uses the LSTMCell class (and sigmoid) defined in section 4.

# =====================================================
# 1. TOY DATASET
# =====================================================
data = [
    ("this movie is great", 1),      # positive
    ("terrible waste of time", 0),   # negative
    ("i love this film", 1),
    ("worst movie ever", 0),
    ("absolutely wonderful", 1),
    ("boring and awful", 0),
    ("amazing performance", 1),
    ("horrible acting", 0),
]

# Build word vocabulary (index 0 reserved for padding)
all_words = set()
for text, _ in data:
    all_words.update(text.split())
word2idx = {w: i + 1 for i, w in enumerate(sorted(all_words))}
word2idx["<pad>"] = 0
vocab_size = len(word2idx)

# =====================================================
# 2. ENCODE SENTENCES
# =====================================================
def encode_sentence(text, max_len=5):
    words = text.split()
    indices = [word2idx.get(w, 0) for w in words]
    # Pad to fixed length
    while len(indices) < max_len:
        indices.append(0)
    return indices[:max_len]

# Simple word embedding (random; would be learned during training)
embed_dim = 8
embeddings = np.random.randn(vocab_size, embed_dim) * 0.1

# =====================================================
# 3. SENTIMENT CLASSIFIER
#    LSTM -> last hidden -> sigmoid -> pos/neg
# =====================================================
lstm = LSTMCell(input_size=embed_dim, hidden_size=16)
W_out = np.random.randn(16, 1) * 0.1
b_out = np.zeros((1, 1))

print("Training sentiment classifier...")
for epoch in range(200):
    total_loss = 0
    for text, label in data:
        indices = encode_sentence(text)
        h = np.zeros((1, 16))
        c = np.zeros((1, 16))
        # Forward through each word
        for idx in indices:
            x = embeddings[idx:idx+1]   # (1, 8)
            h, c = lstm.forward(x, h, c)
        # Classify using the last hidden state
        logit = h @ W_out + b_out
        pred = sigmoid(logit)
        # Binary cross-entropy loss
        loss = -(label * np.log(pred + 1e-12)
                 + (1 - label) * np.log(1 - pred + 1e-12))
        total_loss += loss.item()
    # NOTE: the backward pass and weight updates are omitted for
    # brevity; without them the loss will not actually decrease.
    if (epoch + 1) % 50 == 0:
        print(f"  Epoch {epoch+1:>3} – Loss: {total_loss/len(data):.4f}")

# =====================================================
# 4. TEST
# =====================================================
print("\nPredictions:")
for text, label in data:
    indices = encode_sentence(text)
    h, c = np.zeros((1, 16)), np.zeros((1, 16))
    for idx in indices:
        h, c = lstm.forward(embeddings[idx:idx+1], h, c)
    pred = sigmoid(h @ W_out + b_out).item()
    mark = "pos" if pred > 0.5 else "neg"
    print(f"  {mark} {pred:.2f} – {text}")
```
RNN Understands Context!
The model reads word by word, building "understanding" in the hidden state, then classifies emotion based on the entire sentence. Words like "love" and "great" push toward positive, "terrible" and "worst" toward negative – and the model learns this on its own from data!
8. Page 5 Summary
| Concept | What It Is | Key Code |
|---|---|---|
| RNN | Network with hidden state (memory) | h = tanh(Wx·x + Wh·h + b) |
| Hidden State | "Memory" vector continuously updated | h(t) = f(h(t-1), x(t)) |
| BPTT | Backprop unrolled through time | Σ dL/dW per timestep |
| Vanishing Gradient | Gradients vanish in long sequences | 0.9¹⁰⁰ ≈ 0 |
| LSTM | 3 gates + cell state = long-term memory | c = f*c + i*c̃ |
| GRU | 2 gates, leaner than LSTM | h = z*h + (1-z)*h̃ |
| Text Generation | Predict next char → generate | sample(softmax(logits)) |
| Sentiment Analysis | Read sentence → classify emotion | sigmoid(h_last @ W) |
| Gradient Clipping | Prevent exploding gradients | g * (max/norm) |
Page 4 – Regularization & Advanced Optimization
Coming Next: Page 6 – Word Embeddings & NLP Pipeline
From one-hot to word vectors: Word2Vec, GloVe, and how to represent words as meaningful vectors. Building a complete NLP pipeline: tokenization, embedding, model, and evaluation. Stay tuned!