πŸ“ Artikel ini ditulis dalam Bahasa Indonesia & English
πŸ“ This article is available in English & Bahasa Indonesia

πŸ“ Tutorial Neural Network β€” Page 6Neural Network Tutorial β€” Page 6

Word Embeddings &
NLP Pipeline



From one-hot to meaningful word vectors. Page 6 covers: Word2Vec (Skip-gram & CBOW), GloVe, pre-trained embeddings, tokenization, and building a complete NLP pipeline from raw text to sentiment classification.

πŸ“… March 2026 ⏱ 24 min read
🏷 Word2Vec · GloVe · Embeddings · Tokenization · NLP Pipeline
πŸ“š Neural Network Tutorial Series:

πŸ“‘ Table of Contents β€” Page 6

  1. One-Hot Problem β€” Why we need embeddings
  2. Word2Vec β€” Skip-gram & CBOW from scratch
  3. GloVe & Pre-trained β€” Download and use
  4. Tokenization β€” Complete text preprocessing
  5. Embedding Layer β€” Trainable lookup table
  6. Complete NLP Pipeline β€” End-to-end classifier
  7. Summary & Page 7 Preview
πŸ€”


1. The Problem with One-Hot Encoding for Text

10,000 words = 10,000-dimensional vectors β€” inefficient and meaningless


On Page 5, we used one-hot encoding to represent words. The problem: the vectors are huge (a 50k-word vocabulary means 50k dimensions!), and all words are "equally distant" β€” "cat" and "dog" are as far apart as "cat" and "airplane". We need a smarter representation.

One-Hot vs Word Embedding

One-Hot (sparse, meaningless distances):
  "cat"  = [1, 0, 0, 0, 0, ...]   (10,000 dims)
  "dog"  = [0, 1, 0, 0, 0, ...]
  "king" = [0, 0, 1, 0, 0, ...]
  dist(cat, dog) = dist(cat, king) = √2  ← the SAME!

Word Embedding (dense, meaningful):
  "cat"  = [0.2, -0.4, 0.7, 0.1]   (50-300 dims)
  "dog"  = [0.3, -0.3, 0.6, 0.2]   ← close to cat!
  "king" = [-0.5, 0.8, 0.1, 0.9]   ← far from cat
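The distance claim above is easy to verify numerically. A quick sketch β€” the dense vectors here are the made-up example values from the comparison, not trained embeddings:

```python
import numpy as np

def one_hot(i, dim):
    v = np.zeros(dim)
    v[i] = 1.0
    return v

# One-hot: every pair of distinct words is exactly sqrt(2) apart
cat, dog, king = (one_hot(i, 10_000) for i in range(3))
print(np.linalg.norm(cat - dog))   # 1.4142...
print(np.linalg.norm(cat - king))  # 1.4142... β€” identical for every pair

# Dense embeddings: distances can reflect meaning
cat_e  = np.array([0.2, -0.4, 0.7, 0.1])
dog_e  = np.array([0.3, -0.3, 0.6, 0.2])
king_e = np.array([-0.5, 0.8, 0.1, 0.9])
print(np.linalg.norm(cat_e - dog_e))   # small: cat and dog are close
print(np.linalg.norm(cat_e - king_e))  # large: cat and king are far
```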
πŸ“

2. Word2Vec β€” Skip-gram & CBOW

Learning word representations from context β€” "you are known by the company you keep"


Word2Vec learns embeddings from millions of sentences. Two architectures: CBOW (predict word from context) and Skip-gram (predict context from word). Result: words appearing in similar contexts will have similar vectors.

29_word2vec_skipgram.py β€” Skip-gram from Scratch (Python)
import numpy as np

class SkipGram:
    """Word2Vec Skip-gram: predict context from center word"""

    def __init__(self, vocab_size, embed_dim=50):
        # Two embedding matrices
        self.W_in = np.random.randn(vocab_size, embed_dim) * 0.01
        self.W_out = np.random.randn(embed_dim, vocab_size) * 0.01

    def forward(self, center_idx):
        """Get embedding for center word"""
        self.h = self.W_in[center_idx]  # (embed_dim,)
        scores = self.h @ self.W_out     # (vocab_size,)
        # Softmax
        exp_s = np.exp(scores - np.max(scores))
        self.probs = exp_s / exp_s.sum()
        return self.probs

    def train_pair(self, center_idx, context_idx, lr=0.01):
        """Train on one (center, context) pair"""
        probs = self.forward(center_idx)
        # Gradient of cross-entropy w.r.t. scores: softmax - one_hot
        grad = probs.copy()
        grad[context_idx] -= 1
        # Gradient w.r.t. the hidden vector must use W_out *before* updating it
        grad_h = self.W_out @ grad
        # Update both matrices
        self.W_out -= lr * np.outer(self.h, grad)
        self.W_in[center_idx] -= lr * grad_h

# After training: W_in contains word embeddings!
# Famous result: king - man + woman β‰ˆ queen
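The (center, context) pairs fed to train_pair come from sliding a window over the corpus. A minimal standalone sketch (make_skipgram_pairs is a hypothetical helper, not part of the class above); it assumes tokens have already been mapped to vocabulary indices:

```python
def make_skipgram_pairs(tokens, window=2):
    """Generate (center, context) index pairs for Skip-gram training."""
    pairs = []
    for i, center in enumerate(tokens):
        # Every neighbor within the window (excluding the center itself)
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = make_skipgram_pairs([0, 1, 2, 3], window=1)
print(pairs)  # [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
```

Each pair would then be passed to model.train_pair(center, context) for one update step.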


πŸŽ“ Word2Vec's Magic Analogies:
vec("king") - vec("man") + vec("woman") β‰ˆ vec("queen")
Embeddings capture semantic relationships! This works because "king" and "queen" appear in similar contexts, as do "man" and "woman".

🌍

3. GloVe & Pre-trained Embeddings

No need to train your own β€” use embeddings pre-trained on billions of words


GloVe (Global Vectors) learns embeddings from global co-occurrence statistics, not local windows like Word2Vec. Both are available pre-trained β€” you can download and use them directly without training.

30_load_pretrained.py β€” Loading GloVe Embeddings (Python)
import numpy as np

def load_glove(filepath, vocab=None):
    """Load GloVe embeddings from text file"""
    embeddings = {}
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split()
            word = parts[0]
            if vocab and word not in vocab:
                continue
            vec = np.array(parts[1:], dtype=np.float64)
            embeddings[word] = vec
    return embeddings

# Usage: download glove.6B.50d.txt from nlp.stanford.edu
# glove = load_glove('glove.6B.50d.txt')
# print(glove['king'].shape)  # (50,)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def find_analogy(embeddings, a, b, c):
    """a is to b as c is to ???"""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    best_word, best_sim = None, -1
    for word, vec in embeddings.items():
        if word in {a, b, c}: continue
        sim = cosine_similarity(target, vec)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# find_analogy(glove, "man", "king", "woman")  β†’ "queen"
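The analogy search can be exercised without downloading GloVe. A toy self-contained version with hand-crafted 2-D vectors (one axis for "royalty", one for "gender" β€” these values are made up for illustration, not real embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 2-D "embeddings": axis 0 = royalty, axis 1 = gender
emb = {
    "king":  np.array([0.9,  0.9]),
    "queen": np.array([0.9, -0.9]),
    "man":   np.array([0.1,  0.9]),
    "woman": np.array([0.1, -0.9]),
    "apple": np.array([-0.9, 0.0]),
}

# man : king :: woman : ???
target = emb["king"] - emb["man"] + emb["woman"]   # β‰ˆ [0.9, -0.9]
best = max((w for w in emb if w not in {"man", "king", "woman"}),
           key=lambda w: cosine_similarity(target, emb[w]))
print(best)  # queen
```

The same loop is what find_analogy does over the full GloVe vocabulary.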
πŸ”§

4. Tokenization & Text Preprocessing

Raw text β†’ tokens β†’ indices β†’ embeddings β†’ ready for the model


Before entering the model, text must go through a preprocessing pipeline: lowercasing, tokenization (split into words/subwords), building vocabulary, converting to indices, padding to equal length, then embedding lookup.

31_nlp_pipeline.py β€” Complete NLP Preprocessing (Python)
import numpy as np
import re

class TextPipeline:
    """Complete text preprocessing pipeline"""

    def __init__(self, max_vocab=10000, max_len=50):
        self.max_vocab = max_vocab
        self.max_len = max_len
        self.word2idx = {"<PAD>": 0, "<UNK>": 1}
        self.idx2word = {0: "<PAD>", 1: "<UNK>"}

    def tokenize(self, text):
        text = text.lower().strip()
        text = re.sub(r'[^a-z0-9\s]', '', text)
        return text.split()

    def build_vocab(self, texts):
        word_counts = {}
        for text in texts:
            for word in self.tokenize(text):
                word_counts[word] = word_counts.get(word, 0) + 1
        top_words = sorted(word_counts, key=word_counts.get, reverse=True)
        for w in top_words[:self.max_vocab - 2]:
            idx = len(self.word2idx)
            self.word2idx[w] = idx
            self.idx2word[idx] = w

    def encode(self, text):
        tokens = self.tokenize(text)
        indices = [self.word2idx.get(t, 1) for t in tokens]
        # Pad or truncate
        if len(indices) < self.max_len:
            indices += [0] * (self.max_len - len(indices))
        return np.array(indices[:self.max_len])

    def encode_batch(self, texts):
        return np.array([self.encode(t) for t in texts])

# Usage
pipe = TextPipeline(max_vocab=5000, max_len=20)
texts = ["I love this movie", "Terrible film"]
pipe.build_vocab(texts)
encoded = pipe.encode_batch(texts)
print(encoded.shape)  # (2, 20)
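The pad-or-truncate step inside encode is worth seeing in isolation, since it is what guarantees every row of the batch has the same length. A standalone helper (pad_or_truncate is illustrative, not part of the TextPipeline class above):

```python
def pad_or_truncate(indices, max_len, pad_idx=0):
    """Force a list of token indices to exactly max_len items."""
    # Append enough padding, then cut to length β€” handles both cases
    return (indices + [pad_idx] * max_len)[:max_len]

print(pad_or_truncate([5, 7, 9], 5))           # [5, 7, 9, 0, 0]
print(pad_or_truncate([5, 7, 9, 2, 4, 6], 5))  # [5, 7, 9, 2, 4]
```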
πŸ—οΈ


5. Embedding Layer from Scratch

A trainable lookup table β€” the bridge between words and vectors


An embedding layer is a large matrix (vocab_size Γ— embed_dim). To get a word's vector, just fetch the corresponding row. This matrix is trained alongside the model β€” so embeddings adapt to the task.

32_embedding_layer.py β€” Trainable Embedding Layer (Python)
import numpy as np

class EmbeddingLayer:
    def __init__(self, vocab_size, embed_dim, pretrained=None):
        if pretrained is not None:
            self.W = pretrained.copy()
        else:
            self.W = np.random.randn(vocab_size, embed_dim) * 0.01

    def forward(self, indices):
        """indices: (batch, seq_len) β†’ (batch, seq_len, embed_dim)"""
        self.indices = indices
        return self.W[indices]

    def backward(self, d_out, lr=0.01):
        """Update embeddings for used words only"""
        np.add.at(self.W, self.indices, -lr * d_out)

# Full NLP model: TextPipeline β†’ Embedding β†’ LSTM β†’ FC β†’ Softmax
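The lookup in forward is plain NumPy fancy indexing: indexing a matrix with an integer array selects whole rows. A quick shape check with made-up sizes:

```python
import numpy as np

vocab_size, embed_dim = 100, 8
W = np.random.randn(vocab_size, embed_dim) * 0.01  # the embedding matrix
indices = np.array([[3, 7, 0], [1, 1, 42]])        # (batch=2, seq_len=3)

out = W[indices]          # fancy indexing = per-position row lookup
print(out.shape)          # (2, 3, 8)
assert np.allclose(out[0, 1], W[7])  # each position is just a row of W
```

This is also why backward uses np.add.at: the same row can be selected several times in one batch (index 1 above appears twice), and np.add.at accumulates all of its gradient contributions instead of overwriting them.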
πŸ”—


6. Complete NLP Pipeline β€” Sentiment Classifier

Text β†’ Tokenize β†’ Embed β†’ LSTM β†’ Classify β†’ "Positive!"


Now we combine everything: TextPipeline β†’ EmbeddingLayer β†’ LSTM β†’ FC + Sigmoid. This is the standard NLP architecture before the Transformer era (and still used for many cases!).
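The wiring of that pipeline can be sketched end to end. To keep the sketch short and self-contained, mean pooling over the sequence stands in for the LSTM from Page 5, and the weights (W_embed, W_fc, b_fc β€” hypothetical names) are random rather than trained:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

vocab_size, embed_dim, max_len = 5000, 50, 20
rng = np.random.default_rng(0)

# Stand-ins for the trained components
W_embed = rng.standard_normal((vocab_size, embed_dim)) * 0.01
W_fc = rng.standard_normal((embed_dim, 1)) * 0.01
b_fc = np.zeros(1)

def predict_sentiment(indices):
    """indices: (batch, max_len) padded token ids β†’ P(positive) per text"""
    embedded = W_embed[indices]        # (batch, max_len, embed_dim)
    pooled = embedded.mean(axis=1)     # mean pooling in place of the LSTM
    logits = pooled @ W_fc + b_fc      # (batch, 1)
    return sigmoid(logits).ravel()

# Fake batch, as produced by TextPipeline.encode_batch
batch = rng.integers(0, vocab_size, size=(2, max_len))
probs = predict_sentiment(batch)
print(probs.shape)  # (2,)
```

In the full architecture you would replace the pooling line with the LSTM's forward pass and train all the weights together with cross-entropy loss.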


πŸŽ‰ Complete NLP Pipeline! You can now build an end-to-end NLP system: from raw text to sentiment prediction. All from scratch, no NLP libraries.

πŸ“


7. Page 6 Summary

What we've learned
Concept           | What It Is                                 | Key Code
Word Embedding    | Words as dense, meaningful vectors         | W[word_idx]
Word2Vec          | Learn embeddings from word context         | SkipGram / CBOW
GloVe             | Embeddings from global co-occurrence stats | load_glove(file)
Cosine Similarity | Similarity measure between vectors         | dot(a,b)/(|a|·|b|)
Tokenization      | Split text into tokens                     | text.lower().split()
Embedding Layer   | Trainable word→vector lookup table         | W[indices]
NLP Pipeline      | Token→Index→Embed→Model→Predict            | pipe→embed→lstm→fc
← Previous Page

Page 5 β€” RNN, LSTM & Sequence Data


πŸ“˜

Coming Next: Page 7 β€” Generative Adversarial Network (GAN)

Two networks compete: Generator creates fake data, Discriminator detects it. Building a GAN from scratch to generate images. Stay tuned!