Table of Contents – Page 6
- One-Hot Problem – Why we need embeddings
- Word2Vec – Skip-gram & CBOW from scratch
- GloVe & Pre-trained – Download and use
- Tokenization – Complete text preprocessing
- Embedding Layer – Trainable lookup table
- Complete NLP Pipeline – End-to-end classifier
- Summary & Page 7 Preview
1. The Problem with One-Hot Encoding for Text
On Page 5, we used one-hot encoding to represent words. The problem: the vectors are huge (a 50k-word vocabulary means 50k dimensions!), and all words are "equally distant" – "cat" and "dog" are as far apart as "cat" and "airplane". We need a smarter representation.
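To make this concrete, here is a tiny sketch (using a made-up five-word vocabulary) showing that one-hot vectors carry no similarity information: every pair of distinct words is exactly the same distance apart.

```python
import numpy as np

# Hypothetical 5-word vocabulary, purely for illustration
vocab = ["cat", "dog", "airplane", "runs", "flies"]
V = len(vocab)

# One-hot: each word is a V-dimensional vector with a single 1
one_hot = np.eye(V)

def euclidean(a, b):
    return np.linalg.norm(a - b)

# Every pair of distinct words is exactly sqrt(2) apart
d_cat_dog = euclidean(one_hot[0], one_hot[1])
d_cat_plane = euclidean(one_hot[0], one_hot[2])
print(d_cat_dog, d_cat_plane)  # both ~1.414: "cat" is no closer to "dog" than to "airplane"
```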
2. Word2Vec – Skip-gram & CBOW
Word2Vec learns embeddings from millions of sentences. Two architectures: CBOW (predict word from context) and Skip-gram (predict context from word). Result: words appearing in similar contexts will have similar vectors.
import numpy as np

class SkipGram:
    """Word2Vec Skip-gram: predict context from center word"""
    def __init__(self, vocab_size, embed_dim=50):
        # Two embedding matrices
        self.W_in = np.random.randn(vocab_size, embed_dim) * 0.01
        self.W_out = np.random.randn(embed_dim, vocab_size) * 0.01

    def forward(self, center_idx):
        """Get embedding for center word, score it against the vocabulary"""
        self.h = self.W_in[center_idx]              # (embed_dim,)
        scores = self.h @ self.W_out                # (vocab_size,)
        # Softmax
        exp_s = np.exp(scores - np.max(scores))
        self.probs = exp_s / exp_s.sum()
        return self.probs

    def train_pair(self, center_idx, context_idx, lr=0.01):
        """Train on one (center, context) pair"""
        probs = self.forward(center_idx)
        # Gradient of cross-entropy w.r.t. the scores
        grad = probs.copy()
        grad[context_idx] -= 1                      # softmax - one_hot
        # Gradient w.r.t. h must use W_out from BEFORE its update
        dh = self.W_out @ grad
        # Update
        self.W_out -= lr * np.outer(self.h, grad)
        self.W_in[center_idx] -= lr * dh

# After training: W_in contains word embeddings!
# Famous result: king - man + woman ≈ queen
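The text names CBOW as the second architecture but only Skip-gram is shown above. Here is a matching CBOW sketch under the same conventions; the class name and `train_step` signature are illustrative, not from the original.

```python
import numpy as np

class CBOW:
    """Word2Vec CBOW sketch: predict the center word from averaged context
    embeddings. Mirrors the SkipGram class; a minimal illustration, not an
    optimized trainer."""
    def __init__(self, vocab_size, embed_dim=50):
        self.W_in = np.random.randn(vocab_size, embed_dim) * 0.01
        self.W_out = np.random.randn(embed_dim, vocab_size) * 0.01

    def forward(self, context_idxs):
        # Average the embeddings of all context words
        self.context_idxs = context_idxs
        self.h = self.W_in[context_idxs].mean(axis=0)   # (embed_dim,)
        scores = self.h @ self.W_out                    # (vocab_size,)
        exp_s = np.exp(scores - np.max(scores))
        self.probs = exp_s / exp_s.sum()
        return self.probs

    def train_step(self, context_idxs, center_idx, lr=0.01):
        probs = self.forward(context_idxs)
        grad = probs.copy()
        grad[center_idx] -= 1                           # softmax - one_hot
        dh = self.W_out @ grad                          # grad w.r.t. h (pre-update)
        self.W_out -= lr * np.outer(self.h, grad)
        # Averaging spreads the gradient equally over the context words
        for idx in context_idxs:
            self.W_in[idx] -= lr * dh / len(context_idxs)
        return -np.log(probs[center_idx])               # cross-entropy loss
```

Training on the same (context, center) pair repeatedly should drive the loss toward zero, since the softmax learns to put all its mass on the center word.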
Word2Vec's Magic Analogies: vec("king") - vec("man") + vec("woman") ≈ vec("queen")
Embeddings capture semantic relationships! This works because "king" and "queen" appear in similar contexts, as do "man" and "woman".
3. GloVe & Pre-trained Embeddings
GloVe (Global Vectors) learns embeddings from global co-occurrence statistics, not local windows like Word2Vec. Both are available pre-trained – you can download and use them directly without training.
import numpy as np

def load_glove(filepath, vocab=None):
    """Load GloVe embeddings from a text file"""
    embeddings = {}
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split()
            word = parts[0]
            if vocab and word not in vocab:
                continue
            vec = np.array(parts[1:], dtype=np.float64)
            embeddings[word] = vec
    return embeddings

# Usage: download glove.6B.50d.txt from nlp.stanford.edu
# glove = load_glove('glove.6B.50d.txt')
# print(glove['king'].shape)  # (50,)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def find_analogy(embeddings, a, b, c):
    """a is to b as c is to ???"""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    best_word, best_sim = None, -1
    for word, vec in embeddings.items():
        if word in {a, b, c}:
            continue
        sim = cosine_similarity(target, vec)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# find_analogy(glove, "man", "king", "woman") ≈ "queen"
4. Tokenization & Text Preprocessing
Before entering the model, text must go through a preprocessing pipeline: lowercasing, tokenization (split into words/subwords), building a vocabulary, converting to indices, padding to equal length, then embedding lookup.
import numpy as np
import re

class TextPipeline:
    """Complete text preprocessing pipeline"""
    def __init__(self, max_vocab=10000, max_len=50):
        self.max_vocab = max_vocab
        self.max_len = max_len
        self.word2idx = {"<PAD>": 0, "<UNK>": 1}
        self.idx2word = {0: "<PAD>", 1: "<UNK>"}

    def tokenize(self, text):
        text = text.lower().strip()
        text = re.sub(r'[^a-z0-9\s]', '', text)
        return text.split()

    def build_vocab(self, texts):
        word_counts = {}
        for text in texts:
            for word in self.tokenize(text):
                word_counts[word] = word_counts.get(word, 0) + 1
        top_words = sorted(word_counts, key=word_counts.get, reverse=True)
        for w in top_words[:self.max_vocab - 2]:
            idx = len(self.word2idx)
            self.word2idx[w] = idx
            self.idx2word[idx] = w

    def encode(self, text):
        tokens = self.tokenize(text)
        indices = [self.word2idx.get(t, 1) for t in tokens]
        # Pad or truncate to max_len
        if len(indices) < self.max_len:
            indices += [0] * (self.max_len - len(indices))
        return np.array(indices[:self.max_len])

    def encode_batch(self, texts):
        return np.array([self.encode(t) for t in texts])

# Usage
pipe = TextPipeline(max_vocab=5000, max_len=20)
texts = ["I love this movie", "Terrible film"]
pipe.build_vocab(texts)
encoded = pipe.encode_batch(texts)
print(encoded.shape)  # (2, 20)
5. Embedding Layer from Scratch
An embedding layer is a large matrix (vocab_size × embed_dim). To get a word's vector, just fetch the corresponding row. This matrix is trained alongside the model – so the embeddings adapt to the task.
import numpy as np

class EmbeddingLayer:
    def __init__(self, vocab_size, embed_dim, pretrained=None):
        if pretrained is not None:
            self.W = pretrained.copy()
        else:
            self.W = np.random.randn(vocab_size, embed_dim) * 0.01

    def forward(self, indices):
        """indices: (batch, seq_len) → (batch, seq_len, embed_dim)"""
        self.indices = indices
        return self.W[indices]

    def backward(self, d_out, lr=0.01):
        """Update embeddings for used words only"""
        np.add.at(self.W, self.indices, -lr * d_out)

# Full NLP model: TextPipeline → Embedding → LSTM → FC → Softmax
6. Complete NLP Pipeline – Sentiment Classifier
Now we combine everything: TextPipeline → EmbeddingLayer → LSTM → FC + Sigmoid. This was the standard NLP architecture before the Transformer era (and is still used in many cases!).
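A compact, self-contained sketch of such an end-to-end classifier on toy data. To keep it short, the LSTM is replaced with mean pooling over the embeddings – a deliberate simplification of the architecture described above – and all names (`TinySentimentClassifier`, etc.) are illustrative, not from the original.

```python
import numpy as np
import re

def tokenize(text):
    return re.sub(r'[^a-z0-9\s]', '', text.lower()).split()

class TinySentimentClassifier:
    """Tokenize → index → embed → mean-pool → FC + sigmoid (LSTM omitted for brevity)."""
    def __init__(self, texts, embed_dim=16, max_len=10):
        # Build vocabulary: index 0 = <PAD>, 1 = <UNK>
        words = sorted({w for t in texts for w in tokenize(t)})
        self.word2idx = {w: i + 2 for i, w in enumerate(words)}
        self.max_len = max_len
        rng = np.random.default_rng(0)
        self.E = rng.normal(0, 0.01, (len(words) + 2, embed_dim))  # embedding table
        self.w = rng.normal(0, 0.01, embed_dim)                    # FC weights
        self.b = 0.0

    def encode(self, text):
        idxs = [self.word2idx.get(t, 1) for t in tokenize(text)][:self.max_len]
        return np.array(idxs + [0] * (self.max_len - len(idxs)))

    def forward(self, idxs):
        mask = (idxs != 0)[:, None]                 # ignore padding positions
        h = (self.E[idxs] * mask).sum(axis=0) / max(mask.sum(), 1)
        z = h @ self.w + self.b
        return 1 / (1 + np.exp(-z)), h, mask

    def train(self, texts, labels, lr=0.5, epochs=200):
        data = [(self.encode(t), y) for t, y in zip(texts, labels)]
        for _ in range(epochs):
            for idxs, y in data:
                p, h, mask = self.forward(idxs)
                dz = p - y                          # sigmoid + BCE gradient
                dh = dz * self.w                    # compute BEFORE updating w
                self.w -= lr * dz * h
                self.b -= lr * dz
                # Spread the pooled gradient back over the real (non-pad) tokens
                np.add.at(self.E, idxs, -lr * (dh[None, :] * mask) / max(mask.sum(), 1))

    def predict(self, text):
        p, _, _ = self.forward(self.encode(text))
        return p

# Toy usage: four labeled sentences (1 = positive, 0 = negative)
texts = ["i love this movie", "great film", "terrible film", "i hate this movie"]
labels = [1, 1, 0, 0]
clf = TinySentimentClassifier(texts)
clf.train(texts, labels)
```

After training, `clf.predict` should score sentences containing "love" or "great" above 0.5 and those containing "terrible" or "hate" below it, because only those words separate the two classes in the toy data.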
Complete NLP Pipeline! You can now build an end-to-end NLP system: from raw text to sentiment prediction. All from scratch, no NLP libraries.
7. Page 6 Summary
| Concept | What It Is | Key Code |
|---|---|---|
| Word Embedding | Words as dense, meaningful vectors | W[word_idx] |
| Word2Vec | Learn embeddings from word context | SkipGram / CBOW |
| GloVe | Embeddings from global co-occurrence stats | load_glove(file) |
| Cosine Similarity | Similarity measure between vectors | dot(a,b)/(|a|·|b|) |
| Tokenization | Split text into tokens (words/subwords) | text.lower().split() |
| Embedding Layer | Trainable word→vector lookup table | W[indices] |
| NLP Pipeline | Token→Index→Embed→Model→Predict | pipe→embed→lstm→fc |
Page 5 – RNN, LSTM & Sequence Data
Coming Next: Page 7 – Generative Adversarial Networks (GAN)
Two networks compete: the Generator creates fake data, the Discriminator detects it. We'll build a GAN from scratch to generate images. Stay tuned!