๐Ÿ“ Artikel ini ditulis dalam Bahasa Indonesia & English
๐Ÿ“ This article is available in English & Bahasa Indonesia

๐Ÿ” Belajar Hugging Face โ€” Page 6Learn Hugging Face โ€” Page 6

Sentence Embeddings
& Semantic Search


Ubah kalimat menjadi vektor bermakna — fondasi RAG, search engine, dan recommendation. Page 6 membahas super detail: kenapa BERT biasa TIDAK bisa dipakai langsung untuk semantic similarity dan apa solusinya, Sentence Transformers library — cara kerja dan instalasi, cara mengubah kalimat menjadi vektor 384-768 dimensi, cosine similarity — mengukur kedekatan makna antar kalimat, bi-encoder vs cross-encoder (kapan pakai mana dan tradeoff-nya), FAISS — vector search engine untuk jutaan dokumen dalam milidetik, membangun semantic search engine dari nol, fine-tuning embedding model pada domain Anda sendiri, Retrieval-Augmented Generation (RAG) foundations — menggabungkan search + LLM, pilihan model embedding (all-MiniLM, BGE, E5, multilingual), di mana menjalankan (CPU cukup untuk inference!), dan production deployment tips.

Turn sentences into meaningful vectors — the foundation of RAG, search engines, and recommendations. Page 6 covers in super detail: why plain BERT CANNOT be used directly for semantic similarity and what the solution is, Sentence Transformers library — how it works and installation, how to turn sentences into 384-768 dimensional vectors, cosine similarity — measuring semantic closeness between sentences, bi-encoder vs cross-encoder (when to use which and tradeoffs), FAISS — vector search engine for millions of documents in milliseconds, building a semantic search engine from scratch, fine-tuning embedding models on your own domain, Retrieval-Augmented Generation (RAG) foundations — combining search + LLM, embedding model choices (all-MiniLM, BGE, E5, multilingual), where to run (CPU is enough for inference!), and production deployment tips.

📅 Maret / March 2026 · ⏱ 42 menit baca / 42 min read
๐Ÿท EmbeddingsSentence TransformersCosine SimilarityFAISSSemantic SearchRAGBi-EncoderCross-Encoder
📚 Seri Belajar Hugging Face / Learn Hugging Face Series

📑 Daftar Isi — Page 6

📑 Table of Contents — Page 6

  1. Kenapa BERT Biasa Gagal — Masalah [CLS] untuk similarity
  2. Sentence Transformers — Library & cara kerja
  3. Encode Kalimat → Vektor — Praktik langsung
  4. Cosine Similarity — Mengukur kedekatan makna
  5. Bi-Encoder vs Cross-Encoder — Speed vs accuracy tradeoff
  6. FAISS — Vector search untuk jutaan dokumen
  7. Proyek: Semantic Search Engine — Dari nol ke production
  8. Fine-Tune Embedding Model — Domain-specific embeddings
  9. RAG Foundations — Search + LLM = powerful QA
  10. Pilihan Model Embedding — MiniLM, BGE, E5, multilingual
  11. Di Mana Jalankan? — CPU cukup untuk inference!
  12. Ringkasan & Preview Page 7
  1. Why Plain BERT Fails — The [CLS] problem for similarity
  2. Sentence Transformers — Library & how it works
  3. Encode Sentences → Vectors — Hands-on practice
  4. Cosine Similarity — Measuring semantic closeness
  5. Bi-Encoder vs Cross-Encoder — Speed vs accuracy tradeoff
  6. FAISS — Vector search for millions of documents
  7. Project: Semantic Search Engine — From scratch to production
  8. Fine-Tune Embedding Model — Domain-specific embeddings
  9. RAG Foundations — Search + LLM = powerful QA
  10. Embedding Model Choices — MiniLM, BGE, E5, multilingual
  11. Where to Run? — CPU is enough for inference!
  12. Summary & Page 7 Preview
โŒ

1. Kenapa BERT Biasa Gagal untuk Similarity — Masalah [CLS]

1. Why Plain BERT Fails for Similarity — The [CLS] Problem

BERT menghasilkan embedding per-TOKEN, bukan per-KALIMAT. [CLS] token ternyata TIDAK bagus untuk similarity.
BERT produces per-TOKEN embeddings, not per-SENTENCE. The [CLS] token is actually NOT good for similarity.

Intuisi awal banyak orang: "BERT punya [CLS] token yang merepresentasikan seluruh kalimat, jadi saya bisa pakai [CLS] embedding untuk menghitung similarity antar kalimat." Ini SALAH! BERT biasa (tanpa fine-tuning untuk similarity) menghasilkan [CLS] embedding yang hampir tidak bermakna untuk perbandingan semantik. Penelitian menunjukkan bahwa bahkan rata-rata GloVe embeddings lebih baik dari [CLS] BERT untuk similarity tasks!

Many people's initial intuition: "BERT has a [CLS] token that represents the entire sentence, so I can use [CLS] embedding to compute similarity between sentences." This is WRONG! Plain BERT (without fine-tuning for similarity) produces [CLS] embeddings that are nearly meaningless for semantic comparison. Research shows that even averaged GloVe embeddings perform better than BERT [CLS] for similarity tasks!

Kenapa [CLS] BERT Gagal untuk Similarity

Masalah 1: [CLS] tidak di-train untuk similarity
BERT di-pre-train untuk Masked LM dan Next Sentence Prediction. [CLS] token di-optimize untuk NSP (apakah 2 kalimat berurutan?) → BUKAN untuk "apakah 2 kalimat bermakna sama?"

Masalah 2: Anisotropic embedding space
BERT embeddings menempati "cone" sempit di ruang vektor. SEMUA kalimat punya cosine similarity > 0.6 satu sama lain!
→ "I love cats" vs "Stock prices fell" = 0.72 ← TINGGI tapi tidak terkait!

Masalah 3: Kecepatan
Untuk membandingkan 10,000 kalimat:
BERT cross-encoding: 10,000 × 10,000 = 100 JUTA forward passes ❌
Sentence embedding: 10,000 encodings + cosine matrix = DETIK ✅

Solusi: Sentence Transformers!
Fine-tune BERT khusus untuk menghasilkan sentence embeddings yang bermakna → kalimat serupa = vektor berdekatan.
"I love cats"       → [0.12, -0.34, 0.56, ...]  ← 384-dim vector
"I adore kittens"   → [0.11, -0.33, 0.55, ...]  ← SANGAT mirip! ✅
"Stock prices fell" → [-0.78, 0.22, -0.15, ...] ← JAUH berbeda! ✅
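A quick way to see the problem yourself: the sketch below compares plain BERT's [CLS] similarity against a Sentence Transformers model. The exact numbers vary by model version, so treat the printed scores as illustrative:

import torch
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer, util

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def cls_embedding(text):
    # Plain BERT: take the [CLS] vector from the last hidden layer
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = bert(**inputs)
    return out.last_hidden_state[:, 0]

a = cls_embedding("I love cats")
b = cls_embedding("Stock prices fell")
print(torch.cosine_similarity(a, b).item())  # typically > 0.6 (misleadingly high!)

st = SentenceTransformer("all-MiniLM-L6-v2")
emb = st.encode(["I love cats", "Stock prices fell"], convert_to_tensor=True)
print(util.cos_sim(emb[0], emb[1]).item())   # near 0 (correctly unrelated)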

🎓 Analogi: BERT biasa vs Sentence Transformers
BERT biasa = kamus yang bisa menjelaskan arti setiap KATA, tapi tidak bisa menilai apakah dua KALIMAT bermakna sama.
Sentence Transformers = penerjemah yang bisa mengubah kalimat utuh menjadi "sidik jari makna" — dua kalimat dengan makna serupa punya sidik jari yang mirip, meskipun kata-katanya berbeda.
"The cat sat on the mat" ≈ "A feline was resting on a rug" → sidik jari mirip!
"The cat sat on the mat" ≠ "Financial markets crashed" → sidik jari jauh!

🎓 Analogy: Plain BERT vs Sentence Transformers
Plain BERT = a dictionary that explains the meaning of each WORD, but can't judge if two SENTENCES mean the same thing.
Sentence Transformers = a translator that converts whole sentences into "meaning fingerprints" — two sentences with similar meanings have similar fingerprints, even with different words.
"The cat sat on the mat" ≈ "A feline was resting on a rug" → similar fingerprints!
"The cat sat on the mat" ≠ "Financial markets crashed" → distant fingerprints!


2. Sentence Transformers — Library & Cara Kerja

2. Sentence Transformers — Library & How It Works

Library dari UKP Lab yang menjadi standar industri untuk sentence embeddings
Library from UKP Lab that became the industry standard for sentence embeddings
44_sentence_transformers_setup.py — Install & First Embedding (python)
# ===========================
# Install
# ===========================
# pip install sentence-transformers
# (auto-installs transformers, torch, huggingface-hub)

from sentence_transformers import SentenceTransformer

# ===========================
# Load model (downloads from Hub, cached locally)
# ===========================
model = SentenceTransformer("all-MiniLM-L6-v2")
# all-MiniLM-L6-v2:
# - 22M parameters (TINY! DistilBERT=66M, BERT=110M)
# - 384-dimensional embeddings
# - Trained on 1 BILLION sentence pairs
# - Inference: ~14,000 sentences/second on GPU!
# - Best speed/quality ratio for English

print(f"Model loaded: {model}")
print(f"Max sequence length: {model.max_seq_length}")  # 256
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")  # 384

# ===========================
# How Sentence Transformers works internally:
# ===========================
# 1. Tokenize sentence with BERT tokenizer
# 2. Pass through BERT/MiniLM model → get ALL token embeddings
# 3. POOL token embeddings into ONE sentence vector
#    → Mean pooling (average all token embeddings) ← most common
#    → CLS pooling (use [CLS] token only)
#    → Max pooling (take max across tokens)
# 4. NORMALIZE to unit length (for cosine similarity)
# 5. Return: 1 vector per sentence (384 or 768 dimensions)
#
# KEY DIFFERENCE from plain BERT:
# Model is FINE-TUNED on millions of sentence pairs
# using contrastive learning (similar pairs close, dissimilar far)
# → embedding space is MEANINGFUL for similarity!
Sentence Transformers — Dari Kalimat ke Vektor (Internal Flow)

Input: "Jakarta is the capital of Indonesia"
        │
        ▼
┌──────────────────────────────────┐
│ BERT/MiniLM Tokenizer            │
│ → [CLS] Jakarta is the capital   │
│   of Indonesia [SEP]             │
└──────────┬───────────────────────┘
           ▼
┌──────────────────────────────────┐
│ Transformer Encoder (6 layers)   │
│ → Token embeddings:              │
│ [CLS]=[0.12,...] Jakarta=[0.34,.]│
│ is=[-0.11,...] the=[0.05,...]    │
│ capital=[0.67,...] ...           │
│ Shape: (8 tokens, 384 dim)       │
└──────────┬───────────────────────┘
           ▼
┌──────────────────────────────────┐
│ Mean Pooling                     │
│ average ALL token embeddings     │
│ (ignoring [PAD] via attn mask)   │
│ (8, 384) → (1, 384)              │
└──────────┬───────────────────────┘
           ▼
┌──────────────────────────────────┐
│ L2 Normalize                     │
│ vector / ||vector||              │
│ → unit length (norm = 1.0)       │
└──────────┬───────────────────────┘
           ▼
Output: [0.032, -0.018, 0.071, ..., 0.045]  (384 dimensions)
        ↑ ini adalah "sidik jari makna" dari kalimat!
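To verify the flow above, here is a small sketch that reproduces it with plain transformers. It mirrors the mean-pooling recipe published on the all-MiniLM-L6-v2 model card; sentence-transformers performs the same steps internally:

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
enc = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

inputs = tok("Jakarta is the capital of Indonesia", return_tensors="pt")
with torch.no_grad():
    token_embs = enc(**inputs).last_hidden_state         # (1, n_tokens, 384)

mask = inputs["attention_mask"].unsqueeze(-1).float()    # ignore [PAD] tokens
sentence_emb = (token_embs * mask).sum(1) / mask.sum(1)  # step 3: mean pooling
sentence_emb = F.normalize(sentence_emb, p=2, dim=1)     # step 4: unit length
print(sentence_emb.shape)                                # torch.Size([1, 384])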

3. Encode Kalimat → Vektor — Praktik Langsung

3. Encode Sentences → Vectors — Hands-on Practice

Satu baris kode: kalimat apapun → vektor 384 dimensi yang bermakna
One line of code: any sentence → a meaningful 384-dimensional vector
45_encode_sentences.py — Encoding in Practice (python)
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# ===========================
# 1. Single sentence
# ===========================
embedding = model.encode("Jakarta is the capital of Indonesia")
print(f"Type: {type(embedding)}")           # numpy.ndarray
print(f"Shape: {embedding.shape}")           # (384,)
print(f"First 5 values: {embedding[:5]}")    # [0.032, -0.018, ...]
print(f"Norm: {np.linalg.norm(embedding):.4f}")  # 1.0000 (normalized!)

# ===========================
# 2. Batch encoding (MUCH faster!)
# ===========================
sentences = [
    "I love machine learning",
    "Deep learning is fascinating",
    "The weather is beautiful today",
    "I enjoy artificial intelligence",
    "It's raining cats and dogs",
]

embeddings = model.encode(sentences, show_progress_bar=True, batch_size=32)
print(f"Batch shape: {embeddings.shape}")  # (5, 384)
# 5 sentences → 5 vectors, each 384 dimensions

# ===========================
# 3. GPU acceleration
# ===========================
model_gpu = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
embeddings = model_gpu.encode(sentences)  # ~14,000 sentences/sec on T4!

# ===========================
# 4. Return as PyTorch tensors
# ===========================
embeddings_pt = model.encode(sentences, convert_to_tensor=True)
print(f"Tensor device: {embeddings_pt.device}")  # cuda:0 (if GPU)

# ===========================
# 5. Speed benchmark
# ===========================
import time
big_corpus = [f"This is sentence number {i}" for i in range(10000)]
start = time.time()
_ = model.encode(big_corpus, batch_size=256, show_progress_bar=False)
elapsed = time.time() - start
print(f"10,000 sentences in {elapsed:.1f}s ({10000/elapsed:.0f} sent/sec)")
# GPU: 10,000 sentences in 0.7s (14,285 sent/sec)
# CPU: 10,000 sentences in 8.2s (1,220 sent/sec)

4. Cosine Similarity — Mengukur Kedekatan Makna

4. Cosine Similarity — Measuring Semantic Closeness

cos(A,B) = 1.0 → identik, 0.0 → tidak terkait, -1.0 → berlawanan
cos(A,B) = 1.0 → identical, 0.0 → unrelated, -1.0 → opposite
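For reference, cosine similarity is the dot product of the two vectors divided by the product of their norms. A minimal numpy sketch (nothing assumed beyond numpy itself):

import numpy as np

def cosine(a, b):
    # cos(A,B) = A.B / (||A|| * ||B||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Sentence Transformers vectors are already unit-length, so the denominator
# is 1 and cosine similarity reduces to a plain dot product.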
46_cosine_similarity.py — Semantic Similarity 🔬 (python)
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# ===========================
# 1. Pairwise similarity
# ===========================
sent_a = "I love machine learning"
sent_b = "Deep learning is my passion"
sent_c = "The weather is terrible today"

emb_a = model.encode(sent_a, convert_to_tensor=True)
emb_b = model.encode(sent_b, convert_to_tensor=True)
emb_c = model.encode(sent_c, convert_to_tensor=True)

sim_ab = util.cos_sim(emb_a, emb_b).item()
sim_ac = util.cos_sim(emb_a, emb_c).item()
sim_bc = util.cos_sim(emb_b, emb_c).item()

print(f"'{sent_a}' vs '{sent_b}': {sim_ab:.3f}")  # 0.782 (TINGGI โ€” terkait!)
print(f"'{sent_a}' vs '{sent_c}': {sim_ac:.3f}")  # 0.094 (RENDAH โ€” tidak terkait)
print(f"'{sent_b}' vs '{sent_c}': {sim_bc:.3f}")  # 0.051 (RENDAH โ€” tidak terkait)

# ===========================
# 2. Similarity matrix (all pairs!)
# ===========================
sentences = [
    "I love cats",
    "I adore kittens",
    "Dogs are great pets",
    "The stock market crashed",
    "Financial markets are volatile",
]
embeddings = model.encode(sentences, convert_to_tensor=True)
sim_matrix = util.cos_sim(embeddings, embeddings)
print(f"Similarity matrix shape: {sim_matrix.shape}")  # (5, 5)

# Pretty print
for i in range(len(sentences)):
    for j in range(len(sentences)):
        print(f"{sim_matrix[i][j]:.2f}", end="  ")
    print(f"  โ† {sentences[i][:25]}")
# 1.00  0.83  0.45  0.02  0.05  โ† I love cats
# 0.83  1.00  0.48  0.01  0.04  โ† I adore kittens
# 0.45  0.48  1.00  0.03  0.06  โ† Dogs are great pets
# 0.02  0.01  0.03  1.00  0.79  โ† The stock market crashed
# 0.05  0.04  0.06  0.79  1.00  โ† Financial markets volatile

# PERFECT! Two clusters clearly visible:
# Cluster 1: animals (cats, kittens, dogs) โ€” high similarity
# Cluster 2: finance (stock market, financial) โ€” high similarity
# Cross-cluster: near zero โ€” correctly unrelated!

# ===========================
# 3. Find most similar pair
# ===========================
pairs = util.paraphrase_mining(model, sentences, top_k=3)
for score, i, j in pairs:
    print(f"  {score:.3f}: '{sentences[i]}' โ†” '{sentences[j]}'")
# 0.831: 'I love cats' โ†” 'I adore kittens'
# 0.789: 'The stock market crashed' โ†” 'Financial markets are volatile'
# 0.478: 'I adore kittens' โ†” 'Dogs are great pets'

5. Bi-Encoder vs Cross-Encoder — Speed vs Accuracy

5. Bi-Encoder vs Cross-Encoder — Speed vs Accuracy

Bi-encoder = cepat (search jutaan). Cross-encoder = akurat (re-ranking top-K).
Bi-encoder = fast (search millions). Cross-encoder = accurate (re-rank top-K).
Bi-Encoder vs Cross-Encoder — Dua Pendekatan Similarity

Bi-Encoder (Sentence Transformers default)
────────────────────────────────────────
Sent A → [Encoder] → Vector A ─┐
                               ├─→ cosine_sim(A, B) = 0.83
Sent B → [Encoder] → Vector B ─┘
✅ CEPAT: encode sekali, bandingkan jutaan kali (vektor pre-computed!)
✅ Scalable: FAISS index → search 1M docs dalam 5ms
❌ Less accurate: sentence diproses TERPISAH (no cross-attention)

Cross-Encoder (more accurate, slower)
────────────────────────────────────────
[Sent A] [SEP] [Sent B] → [Encoder] → Score: 0.91
✅ AKURAT: kedua kalimat diproses BERSAMA (full cross-attention)
❌ LAMBAT: O(n²) — harus run model untuk SETIAP pasangan!
❌ Not scalable: 10,000 docs = 10,000 forward passes per query

Best Practice: COMBINE both!
Step 1: Bi-encoder → retrieve top 100 candidates (fast, 5ms)
Step 2: Cross-encoder → re-rank top 100 → pick best 10 (accurate, 200ms)
→ Fast AND accurate! This is how production search works.
47_cross_encoder.py — Cross-Encoder for Re-ranking (python)
from sentence_transformers import CrossEncoder

# ===========================
# Cross-encoder: score PAIRS directly
# ===========================
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

pairs = [
    ["How many people live in Jakarta?", "Jakarta has a population of 10 million."],
    ["How many people live in Jakarta?", "Jakarta is the capital of Indonesia."],
    ["How many people live in Jakarta?", "The weather in Jakarta is tropical."],
]

scores = cross_encoder.predict(pairs)
for pair, score in zip(pairs, scores):
    print(f"  {score:+.3f}: Q: {pair[0][:40]} | D: {pair[1][:40]}")
# +7.234: Q: How many people live in Jakarta?  | D: Jakarta has a population of 10 million  ← BEST!
# +1.123: Q: How many people live in Jakarta?  | D: Jakarta is the capital of Indonesia.
# -3.456: Q: How many people live in Jakarta?  | D: The weather in Jakarta is tropical.
Aspek       | Bi-Encoder                          | Cross-Encoder
Kecepatan   | ~14,000 sent/sec (encode)           | ~300 pairs/sec
Scalability | ✅ Jutaan dokumen                   | ❌ Ratusan dokumen
Akurasi     | Good (~85%)                         | Best (~92%)
Pre-compute | ✅ Embed sekali, query berkali-kali | ❌ Harus run per query-doc pair
Use Case    | Retrieval (cari top-K dari jutaan)  | Re-ranking (sortir top-K kandidat)
Production  | Step 1: retrieve                    | Step 2: re-rank
Aspect      | Bi-Encoder                          | Cross-Encoder
Speed       | ~14,000 sent/sec (encode)           | ~300 pairs/sec
Scalability | ✅ Millions of docs                 | ❌ Hundreds of docs
Accuracy    | Good (~85%)                         | Best (~92%)
Pre-compute | ✅ Embed once, query many times     | ❌ Must run per query-doc pair
Use Case    | Retrieval (find top-K from millions)| Re-ranking (sort top-K candidates)
Production  | Step 1: retrieve                    | Step 2: re-rank

6. FAISS — Vector Search untuk Jutaan Dokumen

6. FAISS — Vector Search for Millions of Documents

Facebook AI Similarity Search — cari dokumen paling mirip dalam milidetik
Facebook AI Similarity Search — find most similar documents in milliseconds
48_faiss_search.py — FAISS Vector Search 🔥 (python)
# pip install faiss-cpu  (atau faiss-gpu untuk GPU)
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# ===========================
# 1. Create document corpus
# ===========================
documents = [
    "Python is a popular programming language",
    "Jakarta is the capital of Indonesia",
    "Machine learning uses data to make predictions",
    "The Eiffel Tower is in Paris, France",
    "TensorFlow is a deep learning framework",
    "Nasi goreng is a famous Indonesian dish",
    "Neural networks are inspired by the brain",
    "Mount Bromo is a volcano in East Java",
    "PyTorch was developed by Facebook AI",
    "Bali is a popular tourist destination",
]

# Encode all documents (ONE TIME — then saved!)
doc_embeddings = model.encode(documents, convert_to_numpy=True)
print(f"Embeddings shape: {doc_embeddings.shape}")  # (10, 384)

# ===========================
# 2. Build FAISS index
# ===========================
dimension = doc_embeddings.shape[1]  # 384

# Exact search (small datasets < 100k)
index = faiss.IndexFlatIP(dimension)  # Inner Product (= cosine sim for normalized vectors)
# IndexFlatIP = exact brute-force search
# IndexFlatL2 = L2 distance (use for non-normalized vectors)

# Normalize embeddings (required for cosine similarity with IndexFlatIP)
faiss.normalize_L2(doc_embeddings)
index.add(doc_embeddings)
print(f"Index size: {index.ntotal} vectors")  # 10

# ===========================
# 3. Search!
# ===========================
query = "What programming language is good for AI?"
query_embedding = model.encode([query], convert_to_numpy=True)
faiss.normalize_L2(query_embedding)

k = 3  # return top 3 results
scores, indices = index.search(query_embedding, k)

print(f"\\nQuery: '{query}'")
print(f"Top {k} results:")
for rank, (score, idx) in enumerate(zip(scores[0], indices[0])):
    print(f"  #{rank+1} (sim={score:.3f}): {documents[idx]}")
# #1 (sim=0.623): Python is a popular programming language
# #2 (sim=0.541): TensorFlow is a deep learning framework
# #3 (sim=0.502): Machine learning uses data to make predictions

# ===========================
# 4. For LARGE datasets (1M+ docs): use approximate index
# ===========================
# nlist = 100  # number of Voronoi cells
# quantizer = faiss.IndexFlatIP(dimension)
# index = faiss.IndexIVFFlat(quantizer, dimension, nlist)
# index.train(doc_embeddings)  # train Voronoi cells
# index.add(doc_embeddings)
# index.nprobe = 10  # search 10 nearest cells (speed/accuracy tradeoff)
# → 1M docs: ~5ms per query (vs 50ms for brute-force)

# ===========================
# 5. Save/load index
# ===========================
faiss.write_index(index, "my_search_index.faiss")
loaded_index = faiss.read_index("my_search_index.faiss")
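To make the commented IVF recipe above concrete, here is a self-contained sketch on synthetic vectors. The nlist and nprobe values are typical starting points, not tuned settings:

import faiss
import numpy as np

d, n = 384, 100_000
xb = np.random.rand(n, d).astype("float32")
faiss.normalize_L2(xb)

nlist = 256                                   # number of Voronoi cells
quantizer = faiss.IndexFlatIP(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
ivf.train(xb)                                 # learn cell centroids
ivf.add(xb)
ivf.nprobe = 10                               # cells visited per query

q = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(q)
scores, ids = ivf.search(q, 5)
print(ids[0])                                 # approximate top-5 neighbors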

7. Proyek: Semantic Search Engine — Dari Nol ke Production

7. Project: Semantic Search Engine — From Scratch to Production

Gabungkan semua: embedding + FAISS + cross-encoder re-ranking = production search
Combine all: embedding + FAISS + cross-encoder re-ranking = production search
49_semantic_search_engine.py — Complete Search Engine 🔥🔥🔥 (python)
import faiss
import numpy as np
import json
from sentence_transformers import SentenceTransformer, CrossEncoder

class SemanticSearchEngine:
    """Production semantic search: bi-encoder retrieval + cross-encoder re-ranking."""

    def __init__(self, bi_model="all-MiniLM-L6-v2",
                 cross_model="cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.bi_encoder = SentenceTransformer(bi_model)
        self.cross_encoder = CrossEncoder(cross_model)
        self.documents = []
        self.index = None
        self.dim = self.bi_encoder.get_sentence_embedding_dimension()

    def index_documents(self, documents):
        """Encode documents and build FAISS index."""
        self.documents = documents
        embeddings = self.bi_encoder.encode(documents, convert_to_numpy=True,
                                             show_progress_bar=True, batch_size=256)
        faiss.normalize_L2(embeddings)
        self.index = faiss.IndexFlatIP(self.dim)
        self.index.add(embeddings)
        print(f"Indexed {self.index.ntotal} documents")

    def search(self, query, top_k=5, rerank_top=20):
        """Two-stage search: retrieve + re-rank."""
        # Stage 1: Bi-encoder retrieval (fast!)
        q_emb = self.bi_encoder.encode([query], convert_to_numpy=True)
        faiss.normalize_L2(q_emb)
        scores, indices = self.index.search(q_emb, rerank_top)

        candidates = [(self.documents[idx], score)
                      for idx, score in zip(indices[0], scores[0]) if idx >= 0]

        # Stage 2: Cross-encoder re-ranking (accurate!)
        pairs = [[query, doc] for doc, _ in candidates]
        rerank_scores = self.cross_encoder.predict(pairs)

        # Sort by cross-encoder score
        results = sorted(zip(candidates, rerank_scores),
                         key=lambda x: x[1], reverse=True)[:top_k]

        return [{"document": doc, "bi_score": float(bi_s),
                 "rerank_score": float(re_s)}
                for (doc, bi_s), re_s in results]

    def save(self, path):
        faiss.write_index(self.index, f"{path}/index.faiss")
        with open(f"{path}/docs.json", "w") as f:
            json.dump(self.documents, f)

# ===========================
# Use it!
# ===========================
engine = SemanticSearchEngine()
engine.index_documents([
    "Python is a popular programming language for data science",
    "Jakarta is the capital and largest city of Indonesia",
    "TensorFlow and PyTorch are deep learning frameworks",
    "Machine learning models learn patterns from data",
    "Indonesia has over 17,000 islands",
    # ... add thousands of documents!
])

results = engine.search("What language is best for AI?", top_k=3)
for i, r in enumerate(results):
    print(f"  #{i+1} (rerank={r['rerank_score']:.2f}): {r['document'][:60]}...")

8. Fine-Tune Embedding Model — Domain-Specific

8. Fine-Tune Embedding Model — Domain-Specific

Model generic bagus, tapi domain-specific LEBIH BAIK — medical, legal, Indonesian
Generic models are good, but domain-specific is BETTER — medical, legal, Indonesian
50_finetune_embeddings.py — Train Your Own Embedding Model (python)
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# ===========================
# 1. Prepare training data (pairs + similarity score)
# ===========================
train_examples = [
    InputExample(texts=["I love cats", "I adore kittens"], label=0.9),
    InputExample(texts=["I love cats", "Stock market crashed"], label=0.0),
    InputExample(texts=["Python is great", "I enjoy coding in Python"], label=0.85),
    InputExample(texts=["Jakarta is busy", "The capital is crowded"], label=0.8),
    # ... hundreds/thousands of pairs
]

# ===========================
# 2. Fine-tune!
# ===========================
model = SentenceTransformer("all-MiniLM-L6-v2")  # start from pre-trained
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./my-embedding-model",
    show_progress_bar=True,
)
# → Domain-specific embedding model! Push to Hub if you want.

# ===========================
# 3. Alternative: fine-tune with NLI data (triplets)
# ===========================
# anchor, positive, negative
triplet_examples = [
    InputExample(texts=["I love cats", "Kittens are adorable", "Stock prices fell"]),
    # ...
]
# train_loss = losses.TripletLoss(model)
# or losses.MultipleNegativesRankingLoss(model) ← BEST for retrieval!
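Before shipping a fine-tuned model, it helps to sanity-check it on held-out pairs. A hedged sketch using the library's EmbeddingSimilarityEvaluator; the sentences and labels here are illustrative:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

eval_s1 = ["I love cats", "Python is great"]
eval_s2 = ["I adore kittens", "The weather is bad"]
gold = [0.9, 0.1]  # illustrative human labels

tuned = SentenceTransformer("./my-embedding-model")
evaluator = EmbeddingSimilarityEvaluator(eval_s1, eval_s2, gold)
print(evaluator(tuned))  # correlation between model scores and labels
# (return format varies by sentence-transformers version: float or dict)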

9. RAG Foundations โ€” Search + LLM = Powerful QA

9. RAG Foundations โ€” Search + LLM = Powerful QA

Retrieval-Augmented Generation: cari dokumen relevan, lalu berikan ke LLM sebagai konteks
Retrieval-Augmented Generation: find relevant docs, then give them to LLM as context

RAG = menggabungkan semantic search (Page 6 ini) dengan text generation (Page 3). Alih-alih mengandalkan LLM untuk "mengingat" segalanya, kita cari dokumen relevan dulu, lalu berikan ke LLM sebagai konteks. Hasilnya: jawaban akurat yang berbasis data terkini.

RAG = combining semantic search (this Page 6) with text generation (Page 3). Instead of relying on the LLM to "remember" everything, we search for relevant documents first, then provide them to the LLM as context. Result: accurate answers grounded in current data.

RAG Pipeline — Retrieval-Augmented Generation

User Question: "Berapa penduduk Jakarta tahun 2024?"

Step 1: RETRIEVE (semantic search — Page 6!)
Query → Bi-Encoder → FAISS search → Top 3 documents:
┌──────────────────────────────────────────────────────────┐
│ Doc 1: "Jakarta memiliki populasi sekitar 10.56 juta     │
│        jiwa pada tahun 2024 berdasarkan data BPS."       │
│ Doc 2: "Jakarta adalah ibukota Indonesia yang terletak   │
│        di pulau Jawa."                                   │
│ Doc 3: "Pertumbuhan penduduk Jakarta melambat menjadi    │
│        0.73% per tahun."                                 │
└──────────────────────────────────────────────────────────┘

Step 2: AUGMENT (masukkan docs ke prompt)
┌──────────────────────────────────────────────────────────┐
│ System: Answer based on the following context only.      │
│ Context: {Doc 1} {Doc 2} {Doc 3}                         │
│ Question: Berapa penduduk Jakarta tahun 2024?            │
└──────────────────────────────────────────────────────────┘

Step 3: GENERATE (LLM menjawab berdasarkan context)
┌──────────────────────────────────────────────────────────┐
│ LLM: "Berdasarkan data BPS, penduduk Jakarta pada        │
│       tahun 2024 sekitar 10.56 juta jiwa, dengan         │
│       pertumbuhan 0.73% per tahun."                      │
└──────────────────────────────────────────────────────────┘

RAG advantages vs plain LLM:
✅ Factual (berbasis dokumen, bukan halusinasi)
✅ Up-to-date (dokumen bisa di-update tanpa retrain model)
✅ Verifiable (bisa tunjukkan sumber/dokumen)
✅ Domain-specific (index dokumen internal perusahaan)
51_rag_simple.py — Simple RAG Pipeline (python)
from sentence_transformers import SentenceTransformer
from transformers import pipeline
import faiss, numpy as np

# ===========================
# 1. Setup retriever + generator
# ===========================
retriever = SentenceTransformer("all-MiniLM-L6-v2")
generator = pipeline("text2text-generation", model="google/flan-t5-small", device=0)  # device=0 = GPU; omit for CPU

# ===========================
# 2. Index knowledge base
# ===========================
knowledge_base = [
    "Jakarta has a population of 10.56 million people in 2024.",
    "Indonesia declared independence on August 17, 1945.",
    "Mount Bromo is an active volcano in East Java, Indonesia.",
    "Python was created by Guido van Rossum in 1991.",
    "TensorFlow was developed by Google Brain team.",
]

kb_embeddings = retriever.encode(knowledge_base, convert_to_numpy=True)
faiss.normalize_L2(kb_embeddings)
index = faiss.IndexFlatIP(kb_embeddings.shape[1])
index.add(kb_embeddings)

# ===========================
# 3. RAG function
# ===========================
def rag_answer(question, top_k=2):
    # Retrieve
    q_emb = retriever.encode([question], convert_to_numpy=True)
    faiss.normalize_L2(q_emb)
    scores, indices = index.search(q_emb, top_k)
    context = " ".join([knowledge_base[i] for i in indices[0]])

    # Generate
    prompt = f"Answer based on this context: {context}\n\nQuestion: {question}\nAnswer:"
    result = generator(prompt, max_length=100)
    return result[0]["generated_text"], context

answer, ctx = rag_answer("What is the population of Jakarta?")
print(f"Answer: {answer}")
# "10.56 million people" โ† grounded in context! โœ…

🧩 RAG = Fondasi Chatbot Production Modern!
ChatGPT, Perplexity, Google Gemini — semuanya menggunakan varian RAG. Anda sudah punya semua building blocks:
• Page 3: Text generation (GPT/T5)
• Page 6 (ini): Embeddings + FAISS search
• Combine → RAG!
Page 7 akan membahas Hugging Face Spaces untuk deploy RAG app dengan Gradio.

🧩 RAG = Foundation of Modern Production Chatbots!
ChatGPT, Perplexity, Google Gemini — all use RAG variants. You already have all the building blocks:
• Page 3: Text generation (GPT/T5)
• Page 6 (this): Embeddings + FAISS search
• Combine → RAG!
Page 7 will cover Hugging Face Spaces for deploying RAG apps with Gradio.

๐Ÿ†

10. Pilihan Model Embedding — Mana yang Terbaik?

10. Embedding Model Choices — Which is Best?

Banyak model, masing-masing punya kelebihan — pilih berdasarkan kebutuhan
Many models, each with strengths — choose based on your needs
Model                   | Dim  | Speed | Quality | Bahasa      | Best For
all-MiniLM-L6-v2        | 384  | ⚡⚡⚡ | ⭐⭐⭐     | English     | Speed priority, prototyping ⭐
all-mpnet-base-v2       | 768  | ⚡⚡   | ⭐⭐⭐⭐   | English     | Best English quality
BGE-small-en-v1.5       | 384  | ⚡⚡⚡ | ⭐⭐⭐⭐   | English     | MTEB leaderboard top ⭐
BGE-base-en-v1.5        | 768  | ⚡⚡   | ⭐⭐⭐⭐⭐ | English     | Best overall English
E5-large-v2             | 1024 | ⚡     | ⭐⭐⭐⭐⭐ | English     | Max quality
multilingual-e5-base    | 768  | ⚡⚡   | ⭐⭐⭐⭐   | 100+ bahasa | Multilingual + Indonesian ⭐
paraphrase-multilingual | 768  | ⚡⚡   | ⭐⭐⭐     | 50+ bahasa  | Multilingual general
OpenAI text-embedding-3 | 3072 | API   | ⭐⭐⭐⭐⭐ | Multi       | Best quality (paid API)
Model                   | Dim  | Speed | Quality | Language   | Best For
all-MiniLM-L6-v2        | 384  | ⚡⚡⚡ | ⭐⭐⭐     | English    | Speed priority, prototyping ⭐
all-mpnet-base-v2       | 768  | ⚡⚡   | ⭐⭐⭐⭐   | English    | Best English quality
BGE-small-en-v1.5       | 384  | ⚡⚡⚡ | ⭐⭐⭐⭐   | English    | MTEB leaderboard top ⭐
BGE-base-en-v1.5        | 768  | ⚡⚡   | ⭐⭐⭐⭐⭐ | English    | Best overall English
E5-large-v2             | 1024 | ⚡     | ⭐⭐⭐⭐⭐ | English    | Max quality
multilingual-e5-base    | 768  | ⚡⚡   | ⭐⭐⭐⭐   | 100+ langs | Multilingual + Indonesian ⭐
paraphrase-multilingual | 768  | ⚡⚡   | ⭐⭐⭐     | 50+ langs  | Multilingual general
OpenAI text-embedding-3 | 3072 | API   | ⭐⭐⭐⭐⭐ | Multi      | Best quality (paid API)

🎓 Rekomendasi Cepat:
English + speed: all-MiniLM-L6-v2 (22M params, 384 dim) → SentenceTransformer("all-MiniLM-L6-v2")
English + quality: BGE-base-en-v1.5 (110M, 768 dim) → SentenceTransformer("BAAI/bge-base-en-v1.5")
Indonesian / multilingual: multilingual-e5-base → SentenceTransformer("intfloat/multilingual-e5-base")
Max quality (paid): OpenAI text-embedding-3-large → API call
Page 6 ini: all-MiniLM-L6-v2 (fastest, free, Colab-friendly)

🎓 Quick Recommendations:
English + speed: all-MiniLM-L6-v2 (22M params, 384 dim) → SentenceTransformer("all-MiniLM-L6-v2")
English + quality: BGE-base-en-v1.5 (110M, 768 dim) → SentenceTransformer("BAAI/bge-base-en-v1.5")
Indonesian / multilingual: multilingual-e5-base → SentenceTransformer("intfloat/multilingual-e5-base")
Max quality (paid): OpenAI text-embedding-3-large → API call
This Page 6: all-MiniLM-L6-v2 (fastest, free, Colab-friendly)
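One caveat when switching to the E5 family: per the intfloat model cards, E5 models are trained with "query: " and "passage: " prefixes, and omitting them noticeably hurts retrieval quality. A minimal sketch:

from sentence_transformers import SentenceTransformer

e5 = SentenceTransformer("intfloat/multilingual-e5-base")
q_emb = e5.encode("query: ibukota Indonesia")
d_emb = e5.encode(["passage: Jakarta adalah ibukota Indonesia.",
                   "passage: Nasi goreng adalah makanan khas Indonesia."])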


11. Di Mana Jalankan? — CPU Cukup untuk Inference!

11. Where to Run? — CPU Is Enough for Inference!

Kabar baik: embedding inference bisa di CPU! GPU hanya untuk fine-tuning dan batch besar.
Good news: embedding inference works on CPU! GPU only needed for fine-tuning and large batches.
Task                | CPU       | GPU (T4)    | Rekomendasi
Encode 1 query      | ~5ms ✅   | ~1ms        | CPU cukup!
Encode 100 docs     | ~500ms ✅ | ~50ms       | CPU OK
Encode 10k docs     | ~8 sec    | ~0.7 sec ✅ | GPU lebih baik
Encode 1M docs      | ~13 min   | ~1.2 min ✅ | GPU wajib
FAISS search 1M     | ~5ms ✅   | ~1ms        | CPU cukup!
Fine-tune embedding | Lambat    | ✅ GPU      | Colab T4
Task                | CPU       | GPU (T4)    | Recommendation
Encode 1 query      | ~5ms ✅   | ~1ms        | CPU is enough!
Encode 100 docs     | ~500ms ✅ | ~50ms       | CPU OK
Encode 10k docs     | ~8 sec    | ~0.7 sec ✅ | GPU better
Encode 1M docs      | ~13 min   | ~1.2 min ✅ | GPU required
FAISS search 1M     | ~5ms ✅   | ~1ms        | CPU is enough!
Fine-tune embedding | Slow      | ✅ GPU      | Colab T4

🎉 Kabar Baik: Untuk production search engine, CPU sudah cukup! Encode query (5ms) + FAISS search (5ms) = total 10ms per query pada CPU. Dokumen di-encode sekali (bisa offline di GPU), lalu FAISS search berjalan di CPU. Anda tidak perlu GPU mahal untuk serving. Deploy di VPS $5/bulan pun bisa!

🎉 Good News: For production search engines, CPU is enough! Encode query (5ms) + FAISS search (5ms) = total 10ms per query on CPU. Documents are encoded once (can be done offline on GPU), then FAISS search runs on CPU. You don't need expensive GPUs for serving. Even a $5/month VPS works!
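You can check the ~10ms figure on your own machine with a quick timing sketch. This assumes the model, index, and documents objects from 48_faiss_search.py are already in scope:

import time
import faiss

start = time.time()
q = model.encode(["volcanoes in Indonesia"], convert_to_numpy=True)
faiss.normalize_L2(q)
scores, ids = index.search(q, 3)
print(f"{(time.time() - start) * 1000:.1f} ms")  # encode + search, one query on CPU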


12. Ringkasan Page 6

12. Page 6 Summary

Konsep               | Apa Itu                           | Kode Kunci
Sentence Embedding   | Kalimat → vektor bermakna         | model.encode("text")
SentenceTransformer  | Library untuk sentence embeddings | SentenceTransformer("all-MiniLM-L6-v2")
Cosine Similarity    | Ukur kedekatan makna (0-1)        | util.cos_sim(emb_a, emb_b)
Bi-Encoder           | Encode terpisah, cepat            | SentenceTransformer default
Cross-Encoder        | Encode bersama, akurat            | CrossEncoder("ms-marco-MiniLM")
FAISS                | Vector search jutaan docs (ms!)   | faiss.IndexFlatIP(dim)
Semantic Search      | Retrieve + Re-rank                | Bi-encoder → FAISS → Cross-encoder
Fine-Tune Embeddings | Domain-specific similarity        | model.fit(train_objectives)
RAG                  | Search + LLM = factual QA         | Retrieve context → prompt LLM
Concept              | What It Is                           | Key Code
Sentence Embedding   | Sentence → meaningful vector         | model.encode("text")
SentenceTransformer  | Library for sentence embeddings      | SentenceTransformer("all-MiniLM-L6-v2")
Cosine Similarity    | Measure semantic closeness (0-1)     | util.cos_sim(emb_a, emb_b)
Bi-Encoder           | Encode separately, fast              | SentenceTransformer default
Cross-Encoder        | Encode together, accurate            | CrossEncoder("ms-marco-MiniLM")
FAISS                | Vector search millions of docs (ms!) | faiss.IndexFlatIP(dim)
Semantic Search      | Retrieve + Re-rank                   | Bi-encoder → FAISS → Cross-encoder
Fine-Tune Embeddings | Domain-specific similarity           | model.fit(train_objectives)
RAG                  | Search + LLM = factual QA            | Retrieve context → prompt LLM
โ† Page Sebelumnyaโ† Previous Page

Page 5 — Question Answering & Seq2Seq (T5)


Coming Next: Page 7 — Hugging Face Spaces & Gradio Apps

Deploy model Anda sebagai web app! Page 7 membahas: Gradio library untuk UI interaktif, membangun demo app untuk setiap model dari Page 2-6, deploy ke HF Spaces (URL publik gratis!), Streamlit integration, sharing models dan apps, dan building a complete RAG chatbot app.


Coming Next: Page 7 — Hugging Face Spaces & Gradio Apps

Deploy your model as a web app! Page 7 covers: Gradio library for interactive UI, building demo apps for every model from Pages 2-6, deploying to HF Spaces (free public URL!), Streamlit integration, sharing models and apps, and building a complete RAG chatbot app.