Table of Contents – Page 6
- One-Hot Problem – Why we need embeddings
- Word2Vec – Skip-gram & CBOW from scratch
- GloVe & Pre-trained – Download and use
- Tokenization – Complete text preprocessing
- Embedding Layer – Trainable lookup table
- Complete NLP Pipeline – End-to-end classifier
- Summary & Page 7 Preview
1. The Problem with One-Hot Encoding for Text
On Page 5, we used one-hot encoding to represent words. The problem: the vectors are huge (a 50k-word vocabulary means 50k dimensions!), and all words are "equally distant" – "cat" and "dog" are as far apart as "cat" and "airplane". We need a smarter representation.
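To make this concrete, here is a tiny sketch (using a made-up five-word vocabulary) showing that one-hot vectors carry no similarity information: every pair of distinct words is exactly the same distance apart.

```python
import numpy as np

# Hypothetical 5-word vocabulary, purely for illustration
vocab = ["cat", "dog", "airplane", "runs", "flies"]
V = len(vocab)

# One-hot: each word is a V-dimensional vector with a single 1
one_hot = np.eye(V)

def euclidean(a, b):
    return np.linalg.norm(a - b)

# Every pair of distinct words is exactly sqrt(2) apart
d_cat_dog = euclidean(one_hot[0], one_hot[1])
d_cat_plane = euclidean(one_hot[0], one_hot[2])
print(d_cat_dog, d_cat_plane)  # both ~1.414: "cat" is no closer to "dog" than to "airplane"
```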
2. Word2Vec – Skip-gram & CBOW
Word2Vec learns embeddings from millions of sentences. Two architectures: CBOW (predict word from context) and Skip-gram (predict context from word). Result: words appearing in similar contexts will have similar vectors.
import numpy as np

class SkipGram:
    """Word2Vec Skip-gram: predict context from center word"""
    def __init__(self, vocab_size, embed_dim=50):
        # Two embedding matrices
        self.W_in = np.random.randn(vocab_size, embed_dim) * 0.01
        self.W_out = np.random.randn(embed_dim, vocab_size) * 0.01

    def forward(self, center_idx):
        """Get embedding for center word, score it against the vocabulary"""
        self.h = self.W_in[center_idx]              # (embed_dim,)
        scores = self.h @ self.W_out                # (vocab_size,)
        # Softmax
        exp_s = np.exp(scores - np.max(scores))
        self.probs = exp_s / exp_s.sum()
        return self.probs

    def train_pair(self, center_idx, context_idx, lr=0.01):
        """Train on one (center, context) pair"""
        probs = self.forward(center_idx)
        # Gradient of cross-entropy w.r.t. the scores
        grad = probs.copy()
        grad[context_idx] -= 1                      # softmax - one_hot
        # Gradient w.r.t. h must use W_out from BEFORE its update
        dh = self.W_out @ grad
        # Update
        self.W_out -= lr * np.outer(self.h, grad)
        self.W_in[center_idx] -= lr * dh

# After training: W_in contains word embeddings!
# Famous result: king - man + woman ≈ queen
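The text names CBOW as the second architecture but only Skip-gram is shown above. Here is a matching CBOW sketch under the same conventions; the class name and `train_step` signature are illustrative, not from the original.

```python
import numpy as np

class CBOW:
    """Word2Vec CBOW sketch: predict the center word from averaged context
    embeddings. Mirrors the SkipGram class; a minimal illustration, not an
    optimized trainer."""
    def __init__(self, vocab_size, embed_dim=50):
        self.W_in = np.random.randn(vocab_size, embed_dim) * 0.01
        self.W_out = np.random.randn(embed_dim, vocab_size) * 0.01

    def forward(self, context_idxs):
        # Average the embeddings of all context words
        self.context_idxs = context_idxs
        self.h = self.W_in[context_idxs].mean(axis=0)   # (embed_dim,)
        scores = self.h @ self.W_out                    # (vocab_size,)
        exp_s = np.exp(scores - np.max(scores))
        self.probs = exp_s / exp_s.sum()
        return self.probs

    def train_step(self, context_idxs, center_idx, lr=0.01):
        probs = self.forward(context_idxs)
        grad = probs.copy()
        grad[center_idx] -= 1                           # softmax - one_hot
        dh = self.W_out @ grad                          # grad w.r.t. h (pre-update)
        self.W_out -= lr * np.outer(self.h, grad)
        # Averaging spreads the gradient equally over the context words
        for idx in context_idxs:
            self.W_in[idx] -= lr * dh / len(context_idxs)
        return -np.log(probs[center_idx])               # cross-entropy loss
```

Training on the same (context, center) pair repeatedly should drive the loss toward zero, since the softmax learns to put all its mass on the center word.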
Word2Vec's Magic Analogies: vec("king") - vec("man") + vec("woman") ≈ vec("queen")
Embeddings capture semantic relationships! This works because "king" and "queen" appear in similar contexts, as do "man" and "woman".
3. GloVe & Pre-trained Embeddings
GloVe (Global Vectors) learns embeddings from global co-occurrence statistics, not local windows like Word2Vec. Both are available pre-trained – you can download and use them directly without training.
import numpy as np

def load_glove(filepath, vocab=None):
    """Load GloVe embeddings from a text file"""
    embeddings = {}
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split()
            word = parts[0]
            if vocab and word not in vocab:
                continue
            vec = np.array(parts[1:], dtype=np.float64)
            embeddings[word] = vec
    return embeddings

# Usage: download glove.6B.50d.txt from nlp.stanford.edu
# glove = load_glove('glove.6B.50d.txt')
# print(glove['king'].shape)  # (50,)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def find_analogy(embeddings, a, b, c):
    """a is to b as c is to ???"""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    best_word, best_sim = None, -1
    for word, vec in embeddings.items():
        if word in {a, b, c}:
            continue
        sim = cosine_similarity(target, vec)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# find_analogy(glove, "man", "king", "woman") ≈ "queen"
4. Tokenization & Text Preprocessing
Before entering the model, text must go through a preprocessing pipeline: lowercasing, tokenization (split into words/subwords), building a vocabulary, converting to indices, padding to equal length, then embedding lookup.
import numpy as np
import re

class TextPipeline:
    """Complete text preprocessing pipeline"""
    def __init__(self, max_vocab=10000, max_len=50):
        self.max_vocab = max_vocab
        self.max_len = max_len
        self.word2idx = {"<PAD>": 0, "<UNK>": 1}
        self.idx2word = {0: "<PAD>", 1: "<UNK>"}

    def tokenize(self, text):
        text = text.lower().strip()
        text = re.sub(r'[^a-z0-9\s]', '', text)
        return text.split()

    def build_vocab(self, texts):
        word_counts = {}
        for text in texts:
            for word in self.tokenize(text):
                word_counts[word] = word_counts.get(word, 0) + 1
        top_words = sorted(word_counts, key=word_counts.get, reverse=True)
        for w in top_words[:self.max_vocab - 2]:
            idx = len(self.word2idx)
            self.word2idx[w] = idx
            self.idx2word[idx] = w

    def encode(self, text):
        tokens = self.tokenize(text)
        indices = [self.word2idx.get(t, 1) for t in tokens]
        # Pad or truncate to max_len
        if len(indices) < self.max_len:
            indices += [0] * (self.max_len - len(indices))
        return np.array(indices[:self.max_len])

    def encode_batch(self, texts):
        return np.array([self.encode(t) for t in texts])

# Usage
pipe = TextPipeline(max_vocab=5000, max_len=20)
texts = ["I love this movie", "Terrible film"]
pipe.build_vocab(texts)
encoded = pipe.encode_batch(texts)
print(encoded.shape)  # (2, 20)
5. Embedding Layer from Scratch
An embedding layer is a large matrix (vocab_size × embed_dim). To get a word's vector, just fetch the corresponding row. This matrix is trained alongside the model – so the embeddings adapt to the task.
import numpy as np

class EmbeddingLayer:
    def __init__(self, vocab_size, embed_dim, pretrained=None):
        if pretrained is not None:
            self.W = pretrained.copy()
        else:
            self.W = np.random.randn(vocab_size, embed_dim) * 0.01

    def forward(self, indices):
        """indices: (batch, seq_len) → (batch, seq_len, embed_dim)"""
        self.indices = indices
        return self.W[indices]

    def backward(self, d_out, lr=0.01):
        """Update embeddings for used words only"""
        np.add.at(self.W, self.indices, -lr * d_out)

# Full NLP model: TextPipeline → Embedding → LSTM → FC → Softmax
6. Complete NLP Pipeline – Sentiment Classifier
Now we combine everything: TextPipeline → EmbeddingLayer → LSTM → FC + Sigmoid. This was the standard NLP architecture before the Transformer era (and is still used in many cases!).
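A compact, self-contained sketch of such an end-to-end classifier on toy data. To keep it short, the LSTM is replaced with mean pooling over the embeddings – a deliberate simplification of the architecture described above – and all names (`TinySentimentClassifier`, etc.) are illustrative, not from the original.

```python
import numpy as np
import re

def tokenize(text):
    return re.sub(r'[^a-z0-9\s]', '', text.lower()).split()

class TinySentimentClassifier:
    """Tokenize → index → embed → mean-pool → FC + sigmoid (LSTM omitted for brevity)."""
    def __init__(self, texts, embed_dim=16, max_len=10):
        # Build vocabulary: index 0 = <PAD>, 1 = <UNK>
        words = sorted({w for t in texts for w in tokenize(t)})
        self.word2idx = {w: i + 2 for i, w in enumerate(words)}
        self.max_len = max_len
        rng = np.random.default_rng(0)
        self.E = rng.normal(0, 0.01, (len(words) + 2, embed_dim))  # embedding table
        self.w = rng.normal(0, 0.01, embed_dim)                    # FC weights
        self.b = 0.0

    def encode(self, text):
        idxs = [self.word2idx.get(t, 1) for t in tokenize(text)][:self.max_len]
        return np.array(idxs + [0] * (self.max_len - len(idxs)))

    def forward(self, idxs):
        mask = (idxs != 0)[:, None]                 # ignore padding positions
        h = (self.E[idxs] * mask).sum(axis=0) / max(mask.sum(), 1)
        z = h @ self.w + self.b
        return 1 / (1 + np.exp(-z)), h, mask

    def train(self, texts, labels, lr=0.5, epochs=200):
        data = [(self.encode(t), y) for t, y in zip(texts, labels)]
        for _ in range(epochs):
            for idxs, y in data:
                p, h, mask = self.forward(idxs)
                dz = p - y                          # sigmoid + BCE gradient
                dh = dz * self.w                    # compute BEFORE updating w
                self.w -= lr * dz * h
                self.b -= lr * dz
                # Spread the pooled gradient back over the real (non-pad) tokens
                np.add.at(self.E, idxs, -lr * (dh[None, :] * mask) / max(mask.sum(), 1))

    def predict(self, text):
        p, _, _ = self.forward(self.encode(text))
        return p

# Toy usage: four labeled sentences (1 = positive, 0 = negative)
texts = ["i love this movie", "great film", "terrible film", "i hate this movie"]
labels = [1, 1, 0, 0]
clf = TinySentimentClassifier(texts)
clf.train(texts, labels)
```

After training, `clf.predict` should score sentences containing "love" or "great" above 0.5 and those containing "terrible" or "hate" below it, because only those words separate the two classes in the toy data.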
Complete NLP Pipeline! You can now build an end-to-end NLP system: from raw text to sentiment prediction. All from scratch, no NLP libraries.
7. Page 6 Summary
| Concept | What It Is | Key Code |
|---|---|---|
| Word Embedding | Words as dense, meaningful vectors | W[word_idx] |
| Word2Vec | Learn embeddings from word context | SkipGram / CBOW |
| GloVe | Embeddings from global co-occurrence stats | load_glove(file) |
| Cosine Similarity | Similarity measure between vectors | dot(a,b)/(|a|·|b|) |
| Tokenization | Split text into tokens (words/subwords) | text.lower().split() |
| Embedding Layer | Trainable word→vector lookup table | W[indices] |
| NLP Pipeline | Token→Index→Embed→Model→Predict | pipe→embed→lstm→fc |
Page 5 – RNN, LSTM & Sequence Data
Coming Next: Page 7 – Generative Adversarial Networks (GAN)
Two networks compete: the Generator creates fake data, the Discriminator detects it. We'll build a GAN from scratch to generate images. Stay tuned!