Table of Contents – Page 5
- Why RNN? – Sequential data needs "memory"
- Vanilla RNN – Basic architecture and forward pass
- Backpropagation Through Time – BPTT and vanishing gradients
- LSTM – Long Short-Term Memory: solving vanishing gradients
- GRU – Gated Recurrent Unit: a leaner LSTM
- Character-Level Text Generator – Writing text one character at a time
- Sentiment Analysis – Classifying emotions from text
- Summary & Page 6 Preview
1. Why RNN? – Sequential Data Needs Memory
CNNs and Dense networks treat each input independently. But much real-world data is sequential – the meaning of a word depends on previous words, today's stock price depends on yesterday's. We need a network with memory.
Analogy: Watching a Movie
Feedforward = seeing one frame without knowing the previous frames. You can't understand the story.
RNN = watching a movie sequentially – you remember what happened before and it affects your understanding of the current scene.
2. Vanilla RNN – Basic Architecture
An RNN has a hidden state h(t) that acts as "memory". At each timestep, the hidden state is updated based on the new input and the previous hidden state: h(t) = tanh(W_h · h(t-1) + W_x · x(t) + b).
```python
import numpy as np

class VanillaRNN:
    """Simple RNN cell – from scratch"""

    def __init__(self, input_size, hidden_size, output_size):
        self.hidden_size = hidden_size
        # Weight matrices
        scale = 0.01
        self.Wxh = np.random.randn(input_size, hidden_size) * scale
        self.Whh = np.random.randn(hidden_size, hidden_size) * scale
        self.Why = np.random.randn(hidden_size, output_size) * scale
        self.bh = np.zeros((1, hidden_size))
        self.by = np.zeros((1, output_size))

    def forward(self, inputs, h_prev=None):
        """
        inputs: list of input vectors (one per timestep)
        h_prev: initial hidden state
        returns: outputs, hidden_states
        """
        if h_prev is None:
            h_prev = np.zeros((1, self.hidden_size))
        self.inputs = inputs
        self.hs = {-1: h_prev}  # hidden states per timestep
        self.outputs = []
        for t, x in enumerate(inputs):
            # Core RNN equation!
            self.hs[t] = np.tanh(
                x @ self.Wxh +             # input contribution
                self.hs[t-1] @ self.Whh +  # memory contribution
                self.bh                    # bias
            )
            # Output at this timestep
            y = self.hs[t] @ self.Why + self.by
            self.outputs.append(y)
        return self.outputs, self.hs

# Demo: process a sequence of 4 timesteps
rnn = VanillaRNN(input_size=10, hidden_size=32, output_size=10)
inputs = [np.random.randn(1, 10) for _ in range(4)]
outputs, hidden_states = rnn.forward(inputs)
print(f"Processed {len(inputs)} timesteps")
print(f"Last hidden state shape: {hidden_states[3].shape}")
print(f"Last output shape: {outputs[-1].shape}")
```
3. Backprop Through Time & Vanishing Gradient
BPTT (Backpropagation Through Time) = regular backprop, but "unrolled" through time. The problem: as gradients flow backward through many timesteps, they are repeatedly multiplied by W_hh. If |W_hh| < 1 → gradients vanish. If |W_hh| > 1 → gradients explode.
```python
import numpy as np

def clip_gradients(gradients, max_norm=5.0):
    """Clip gradients to prevent exploding gradients"""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in gradients))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        gradients = [g * scale for g in gradients]
    return gradients

# This fixes EXPLODING gradients.
# But VANISHING gradients need a different solution – LSTM!
```
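The repeated multiplication behind both failure modes can be shown with a one-dimensional toy recurrent weight. This is an illustrative sketch only: `w_small` and `w_big` are made-up stand-ins for the per-step factor |W_hh · tanh'|.

```python
# Toy 1-D illustration of the BPTT product: each backward step through
# time multiplies the gradient by the recurrent factor once more.
w_small, w_big = 0.9, 1.1
grad_vanish, grad_explode = 1.0, 1.0
for _ in range(100):
    grad_vanish *= w_small   # |w| < 1: shrinks every step
    grad_explode *= w_big    # |w| > 1: grows every step

print(f"after 100 steps, w=0.9: {grad_vanish:.2e}")   # ~2.66e-05
print(f"after 100 steps, w=1.1: {grad_explode:.2e}")  # ~1.38e+04
```

The same 100-step sequence either wipes the gradient out or blows it up; clipping only handles the second case.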
4. LSTM – Long Short-Term Memory
LSTM solves vanishing gradients by adding a cell state (an information "highway") and 3 gates that control what to remember, forget, and output. Gates are sigmoids (0-1) that act like valves.
```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

class LSTMCell:
    """Single LSTM cell – from scratch"""

    def __init__(self, input_size, hidden_size):
        n = input_size + hidden_size
        s = np.sqrt(2.0 / n)
        self.hidden_size = hidden_size
        # Combined weights for all 4 gates (efficiency!)
        # [forget, input, candidate, output]
        self.W = np.random.randn(n, 4 * hidden_size) * s
        self.b = np.zeros((1, 4 * hidden_size))
        # Initialize forget gate bias to 1 (remember by default)
        self.b[0, :hidden_size] = 1.0

    def forward(self, x, h_prev, c_prev):
        """One timestep forward"""
        H = self.hidden_size
        # Concatenate previous hidden state and input
        combined = np.concatenate([h_prev, x], axis=1)
        # Compute all 4 gates at once
        gates = combined @ self.W + self.b
        # Split into individual gates
        f = sigmoid(gates[:, :H])             # Forget gate
        i = sigmoid(gates[:, H:2*H])          # Input gate
        c_tilde = np.tanh(gates[:, 2*H:3*H])  # Candidate
        o = sigmoid(gates[:, 3*H:])           # Output gate
        # Update cell state: forget old + add new
        c_new = f * c_prev + i * c_tilde
        # Compute hidden state: filtered output
        h_new = o * np.tanh(c_new)
        # Cache for backprop
        self.cache = (x, h_prev, c_prev, f, i, c_tilde, o, c_new)
        return h_new, c_new

# Demo: process 5 timesteps
lstm = LSTMCell(input_size=10, hidden_size=32)
h = np.zeros((1, 32))
c = np.zeros((1, 32))
for t in range(5):
    x = np.random.randn(1, 10)
    h, c = lstm.forward(x, h, c)
    print(f"t={t}: h_norm={np.linalg.norm(h):.4f}, c_norm={np.linalg.norm(c):.4f}")
```
Why Does LSTM Solve Vanishing Gradients?
The cell state c(t) is a "highway" – information flows through multiplication and addition only (no repeated tanh). A forget gate near 1 allows gradients to flow without shrinking. It's like creating a special "express lane" for important information.
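A numeric sketch of the highway effect. This uses an assumed, simplified model: along the cell-state path, dc(T)/dc(0) reduces to the product of forget-gate values (since c(t) = f·c(t-1) + i·c̃), while the vanilla RNN path is modeled as a constant 0.9 factor per step.

```python
import numpy as np

np.random.seed(0)
# Along the cell state, dc(T)/dc(0) is the product of forget gates only.
forget_gates = np.random.uniform(0.95, 1.0, size=100)  # gates near 1
cell_path = np.prod(forget_gates)

# Vanilla RNN path: repeatedly squashed, modeled here as a 0.9 factor
tanh_path = 0.9 ** 100

print(f"cell-state path after 100 steps:  {cell_path:.4f}")
print(f"vanilla RNN path after 100 steps: {tanh_path:.2e}")
```

With gates near 1, the cell-state gradient stays orders of magnitude larger than the tanh-squashed path after the same 100 steps.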
5. GRU – Gated Recurrent Unit
GRU simplifies LSTM: only 2 gates (reset and update), no separate cell state. Faster to train, fewer parameters, and often performs comparably to LSTM.
```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

class GRUCell:
    """GRU: 2 gates, no separate cell state"""

    def __init__(self, input_size, hidden_size):
        n = input_size + hidden_size
        s = np.sqrt(2.0 / n)
        self.H = hidden_size
        # Weights for the update & reset gates plus the candidate
        self.Wz = np.random.randn(n, hidden_size) * s  # update gate
        self.Wr = np.random.randn(n, hidden_size) * s  # reset gate
        self.Wh = np.random.randn(n, hidden_size) * s  # candidate
        self.bz = np.zeros((1, hidden_size))
        self.br = np.zeros((1, hidden_size))
        self.bh = np.zeros((1, hidden_size))

    def forward(self, x, h_prev):
        combined = np.concatenate([h_prev, x], axis=1)
        # Update gate: how much of the old state to keep
        z = sigmoid(combined @ self.Wz + self.bz)
        # Reset gate: how much of the old state to use for the candidate
        r = sigmoid(combined @ self.Wr + self.br)
        # Candidate hidden state
        combined_r = np.concatenate([r * h_prev, x], axis=1)
        h_tilde = np.tanh(combined_r @ self.Wh + self.bh)
        # Final hidden state: interpolate between old and new
        h_new = z * h_prev + (1 - z) * h_tilde
        return h_new

# GRU vs LSTM comparison (n = input size, H = hidden size):
# LSTM: 3 gates + cell state -> 4*(n+H)*H gate weights
# GRU:  2 gates + candidate  -> 3*(n+H)*H gate weights -> 25% fewer!
```
LSTM vs GRU – When to Use Which?
LSTM → very long sequences, need precise long-term memory.
GRU → smaller datasets, need faster training, short-to-medium sequences.
Rule of thumb: start with GRU. If it's not enough, try LSTM.
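The "25% fewer parameters" figure from the comparison comment can be checked directly. A small sketch that counts only the gate weight matrices (biases ignored), using the shapes from the cell classes above:

```python
def gate_weight_counts(input_size, hidden_size):
    """Gate-weight parameter counts for the LSTM/GRU cells above."""
    n = input_size + hidden_size      # concatenated [h_prev, x]
    lstm = 4 * n * hidden_size        # forget, input, candidate, output
    gru = 3 * n * hidden_size         # update, reset, candidate
    return lstm, gru

lstm_p, gru_p = gate_weight_counts(input_size=10, hidden_size=32)
print(f"LSTM: {lstm_p}  GRU: {gru_p}  savings: {1 - gru_p / lstm_p:.0%}")
# LSTM: 5376  GRU: 4032  savings: 25%
```

Because both cells share the same (n, H) weight shape per gate, the ratio is exactly 3/4 regardless of the sizes chosen.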
6. Character-Level Text Generator
We'll train an LSTM to predict the next character from a text sequence. After training, the model can generate new text that mimics the writing style of the training data – one character at a time.
```python
import numpy as np
# Uses the LSTMCell class (and sigmoid) defined in section 4.

# =====================================================
# 1. PREPARE TEXT DATA
# =====================================================
text = """To be or not to be that is the question
Whether tis nobler in the mind to suffer
The slings and arrows of outrageous fortune
Or to take arms against a sea of troubles"""

# Build character vocabulary
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for c, i in char_to_idx.items()}
vocab_size = len(chars)
print(f"Vocab size: {vocab_size} chars")

# One-hot encode
def one_hot_char(idx, size):
    v = np.zeros((1, size))
    v[0, idx] = 1
    return v

# =====================================================
# 2. TRAINING LOOP
# =====================================================
hidden_size = 64
seq_length = 25   # chars per training sequence
lr = 0.01

lstm = LSTMCell(vocab_size, hidden_size)
Why = np.random.randn(hidden_size, vocab_size) * 0.01
by = np.zeros((1, vocab_size))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

print("Training text generator...")
for iteration in range(1000):
    # Random starting position
    start = np.random.randint(0, len(text) - seq_length - 1)
    inputs = [char_to_idx[c] for c in text[start:start+seq_length]]
    targets = [char_to_idx[c] for c in text[start+1:start+seq_length+1]]

    # Forward pass through the sequence
    h = np.zeros((1, hidden_size))
    c = np.zeros((1, hidden_size))
    loss = 0
    for t in range(seq_length):
        x = one_hot_char(inputs[t], vocab_size)
        h, c = lstm.forward(x, h, c)
        logits = h @ Why + by
        probs = softmax(logits)
        loss -= np.log(probs[0, targets[t]] + 1e-12)

    # NOTE: the backward pass (BPTT through the LSTM) and the weight
    # updates are omitted here for brevity; without them the loss will
    # not actually decrease. In practice, use an autodiff framework.

    if (iteration + 1) % 200 == 0:
        print(f"  Iter {iteration+1:>4} – Loss: {loss/seq_length:.4f}")

# =====================================================
# 3. GENERATE TEXT!
# =====================================================
def generate(seed_char, length=100):
    h = np.zeros((1, hidden_size))
    c = np.zeros((1, hidden_size))
    idx = char_to_idx[seed_char]
    result = seed_char
    for _ in range(length):
        x = one_hot_char(idx, vocab_size)
        h, c = lstm.forward(x, h, c)
        logits = h @ Why + by
        probs = softmax(logits).flatten()
        # Sample from the probability distribution
        idx = np.random.choice(vocab_size, p=probs)
        result += idx_to_char[idx]
    return result

print("\nGenerated text:")
print(generate("T", 150))
```
The Model Writes by Itself! After enough training, the model can generate Shakespeare-like text – learning spelling patterns, spacing, even sentence structure, just from predicting the next character one at a time.
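A common refinement not shown in the generator above is temperature sampling. This is an assumed addition (the `temperature` parameter is not part of the original code): scaling the logits before softmax controls how adventurous the samples are.

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0):
    """T < 1: conservative (sharper distribution); T > 1: more varied."""
    z = logits / temperature
    z = z - np.max(z)                      # numerical stability
    probs = np.exp(z) / np.sum(np.exp(z))
    return np.random.choice(len(probs), p=probs)

logits = np.array([2.0, 1.0, 0.1])
np.random.seed(0)
samples_cold = [sample_with_temperature(logits, 0.5) for _ in range(20)]
samples_hot = [sample_with_temperature(logits, 2.0) for _ in range(20)]
print("T=0.5:", samples_cold)  # typically dominated by the top index
print("T=2.0:", samples_hot)   # noticeably more varied
```

Inside `generate`, this would replace the plain `np.random.choice(vocab_size, p=probs)` call, with `logits` divided by the chosen temperature first.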
7. Sentiment Analysis – Text Emotion Classification
The most popular RNN application: read the entire sentence, take the last hidden state as a "summary", then classify the sentiment. The last hidden state contains information from all previous words.
```python
import numpy as np
# Uses the LSTMCell class (and sigmoid) defined in section 4.

# =====================================================
# 1. TOY DATASET
# =====================================================
data = [
    ("this movie is great", 1),      # positive
    ("terrible waste of time", 0),   # negative
    ("i love this film", 1),
    ("worst movie ever", 0),
    ("absolutely wonderful", 1),
    ("boring and awful", 0),
    ("amazing performance", 1),
    ("horrible acting", 0),
]

# Build word vocabulary (index 0 reserved for padding)
all_words = set()
for text, _ in data:
    all_words.update(text.split())
word2idx = {w: i + 1 for i, w in enumerate(sorted(all_words))}
word2idx["<pad>"] = 0
vocab_size = len(word2idx)

# =====================================================
# 2. ENCODE SENTENCES
# =====================================================
def encode_sentence(text, max_len=5):
    words = text.split()
    indices = [word2idx.get(w, 0) for w in words]
    # Pad to fixed length
    while len(indices) < max_len:
        indices.append(0)
    return indices[:max_len]

# Simple word embedding (random; would be learned during training)
embed_dim = 8
embeddings = np.random.randn(vocab_size, embed_dim) * 0.1

# =====================================================
# 3. SENTIMENT CLASSIFIER
#    LSTM -> last hidden -> sigmoid -> pos/neg
# =====================================================
lstm = LSTMCell(input_size=embed_dim, hidden_size=16)
W_out = np.random.randn(16, 1) * 0.1
b_out = np.zeros((1, 1))

print("Training sentiment classifier...")
for epoch in range(200):
    total_loss = 0
    for text, label in data:
        indices = encode_sentence(text)
        h = np.zeros((1, 16))
        c = np.zeros((1, 16))
        # Forward through each word
        for idx in indices:
            x = embeddings[idx:idx+1]   # (1, 8)
            h, c = lstm.forward(x, h, c)
        # Classify using the last hidden state
        logit = h @ W_out + b_out
        pred = sigmoid(logit)
        # Binary cross-entropy loss
        loss = -(label * np.log(pred + 1e-12)
                 + (1 - label) * np.log(1 - pred + 1e-12))
        total_loss += loss.item()
    # NOTE: the backward pass and weight updates are omitted for
    # brevity; without them the loss will not actually decrease.
    if (epoch + 1) % 50 == 0:
        print(f"  Epoch {epoch+1:>3} – Loss: {total_loss/len(data):.4f}")

# =====================================================
# 4. TEST
# =====================================================
print("\nPredictions:")
for text, label in data:
    indices = encode_sentence(text)
    h, c = np.zeros((1, 16)), np.zeros((1, 16))
    for idx in indices:
        h, c = lstm.forward(embeddings[idx:idx+1], h, c)
    pred = sigmoid(h @ W_out + b_out).item()
    mark = "pos" if pred > 0.5 else "neg"
    print(f"  {mark} {pred:.2f} – {text}")
```
RNN Understands Context!
The model reads word by word, building "understanding" in the hidden state, then classifies emotion based on the entire sentence. Words like "love" and "great" push toward positive, "terrible" and "worst" toward negative – and the model learns this on its own from data!
8. Page 5 Summary
| Concept | What It Is | Key Code |
|---|---|---|
| RNN | Network with hidden state (memory) | h = tanh(Wx·x + Wh·h + b) |
| Hidden State | "Memory" vector continuously updated | h(t) = f(h(t-1), x(t)) |
| BPTT | Backprop unrolled through time | Σ dL/dW per timestep |
| Vanishing Gradient | Gradients vanish in long sequences | 0.9¹⁰⁰ ≈ 0 |
| LSTM | 3 gates + cell state = long-term memory | c = f*c + i*c̃ |
| GRU | 2 gates, leaner than LSTM | h = z*h + (1-z)*h̃ |
| Text Generation | Predict next char → generate | sample(softmax(logits)) |
| Sentiment Analysis | Read sentence → classify emotion | sigmoid(h_last @ W) |
| Gradient Clipping | Prevent exploding gradients | g * (max/norm) |
Page 4 – Regularization & Advanced Optimization
Coming Next: Page 6 – Word Embeddings & NLP Pipeline
From one-hot to word vectors: Word2Vec, GloVe, and how to represent words as meaningful vectors. Building a complete NLP pipeline: tokenization, embedding, model, and evaluation. Stay tuned!