๐Ÿ“ Artikel ini ditulis dalam Bahasa Indonesia & English
๐Ÿ“ This article is available in English & Bahasa Indonesia

๐Ÿ” Belajar Hugging Face โ€” Page 6Learn Hugging Face โ€” Page 6

Sentence Embeddings
& Semantic Search


Ubah kalimat menjadi vektor bermakna — fondasi RAG, search engine, dan recommendation. Page 6 membahas super detail: kenapa BERT biasa TIDAK bisa dipakai langsung untuk semantic similarity dan apa solusinya, Sentence Transformers library — cara kerja dan instalasi, cara mengubah kalimat menjadi vektor 384-768 dimensi, cosine similarity — mengukur kedekatan makna antar kalimat, bi-encoder vs cross-encoder (kapan pakai mana dan tradeoff-nya), FAISS — vector search engine untuk jutaan dokumen dalam milidetik, membangun semantic search engine dari nol, fine-tuning embedding model pada domain Anda sendiri, Retrieval-Augmented Generation (RAG) foundations — menggabungkan search + LLM, pilihan model embedding (all-MiniLM, BGE, E5, multilingual), di mana menjalankan (CPU cukup untuk inference!), dan production deployment tips.

Turn sentences into meaningful vectors — the foundation of RAG, search engines, and recommendations. Page 6 covers in super detail: why plain BERT CANNOT be used directly for semantic similarity and what the solution is, Sentence Transformers library — how it works and installation, how to turn sentences into 384-768 dimensional vectors, cosine similarity — measuring semantic closeness between sentences, bi-encoder vs cross-encoder (when to use which and tradeoffs), FAISS — vector search engine for millions of documents in milliseconds, building a semantic search engine from scratch, fine-tuning embedding models on your own domain, Retrieval-Augmented Generation (RAG) foundations — combining search + LLM, embedding model choices (all-MiniLM, BGE, E5, multilingual), where to run (CPU is enough for inference!), and production deployment tips.

📅 Maret / March 2026 · ⏱ 42 menit baca / 42 min read
๐Ÿท EmbeddingsSentence TransformersCosine SimilarityFAISSSemantic SearchRAGBi-EncoderCross-Encoder
📚 Seri Belajar Hugging Face / Learn Hugging Face Series

📑 Daftar Isi — Page 6

📑 Table of Contents — Page 6

  1. Kenapa BERT Biasa Gagal — Masalah [CLS] untuk similarity
  2. Sentence Transformers — Library & cara kerja
  3. Encode Kalimat → Vektor — Praktik langsung
  4. Cosine Similarity — Mengukur kedekatan makna
  5. Bi-Encoder vs Cross-Encoder — Speed vs accuracy tradeoff
  6. FAISS — Vector search untuk jutaan dokumen
  7. Proyek: Semantic Search Engine — Dari nol ke production
  8. Fine-Tune Embedding Model — Domain-specific embeddings
  9. RAG Foundations — Search + LLM = powerful QA
  10. Pilihan Model Embedding — MiniLM, BGE, E5, multilingual
  11. Di Mana Jalankan? — CPU cukup untuk inference!
  12. Ringkasan & Preview Page 7
  1. Why Plain BERT Fails — The [CLS] problem for similarity
  2. Sentence Transformers — Library & how it works
  3. Encode Sentences → Vectors — Hands-on practice
  4. Cosine Similarity — Measuring semantic closeness
  5. Bi-Encoder vs Cross-Encoder — Speed vs accuracy tradeoff
  6. FAISS — Vector search for millions of documents
  7. Project: Semantic Search Engine — From scratch to production
  8. Fine-Tune Embedding Model — Domain-specific embeddings
  9. RAG Foundations — Search + LLM = powerful QA
  10. Embedding Model Choices — MiniLM, BGE, E5, multilingual
  11. Where to Run? — CPU is enough for inference!
  12. Summary & Page 7 Preview
โŒ

1. Kenapa BERT Biasa Gagal untuk Similarity — Masalah [CLS]

1. Why Plain BERT Fails for Similarity — The [CLS] Problem

BERT menghasilkan embedding per-TOKEN, bukan per-KALIMAT. [CLS] token ternyata TIDAK bagus untuk similarity.
BERT produces per-TOKEN embeddings, not per-SENTENCE. The [CLS] token is actually NOT good for similarity.

Intuisi awal banyak orang: "BERT punya [CLS] token yang merepresentasikan seluruh kalimat, jadi saya bisa pakai [CLS] embedding untuk menghitung similarity antar kalimat." Ini SALAH! BERT biasa (tanpa fine-tuning untuk similarity) menghasilkan [CLS] embedding yang hampir tidak bermakna untuk perbandingan semantik. Penelitian menunjukkan bahwa bahkan rata-rata GloVe embeddings lebih baik dari [CLS] BERT untuk similarity tasks!

Many people's initial intuition: "BERT has a [CLS] token that represents the entire sentence, so I can use [CLS] embedding to compute similarity between sentences." This is WRONG! Plain BERT (without fine-tuning for similarity) produces [CLS] embeddings that are nearly meaningless for semantic comparison. Research shows that even averaged GloVe embeddings perform better than BERT [CLS] for similarity tasks!

Kenapa [CLS] BERT Gagal untuk Similarity

Masalah 1: [CLS] tidak di-train untuk similarity
BERT di-pre-train untuk Masked LM dan Next Sentence Prediction. [CLS] token di-optimize untuk NSP (apakah 2 kalimat berurutan?) → BUKAN untuk "apakah 2 kalimat bermakna sama?"

Masalah 2: Anisotropic embedding space
BERT embeddings menempati "cone" sempit di ruang vektor. SEMUA kalimat punya cosine similarity > 0.6 satu sama lain!
→ "I love cats" vs "Stock prices fell" = 0.72 ← TINGGI tapi tidak terkait!

Masalah 3: Kecepatan
Untuk membandingkan 10,000 kalimat:
BERT cross-encoding: 10,000 × 10,000 = 100 JUTA forward passes ❌
Sentence embedding: 10,000 encodings + cosine matrix = DETIK ✅

Solusi: Sentence Transformers!
Fine-tune BERT khusus untuk menghasilkan sentence embeddings yang bermakna → kalimat serupa = vektor berdekatan.
"I love cats"       → [0.12, -0.34, 0.56, ...]  ← 384-dim vector
"I adore kittens"   → [0.11, -0.33, 0.55, ...]  ← SANGAT mirip! ✅
"Stock prices fell" → [-0.78, 0.22, -0.15, ...] ← JAUH berbeda! ✅
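A quick way to see the problem yourself: the sketch below compares plain BERT's [CLS] similarity against a Sentence Transformers model. The exact numbers vary by model version, so treat the printed scores as illustrative:

import torch
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer, util

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def cls_embedding(text):
    # Plain BERT: take the [CLS] vector from the last hidden layer
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = bert(**inputs)
    return out.last_hidden_state[:, 0]

a = cls_embedding("I love cats")
b = cls_embedding("Stock prices fell")
print(torch.cosine_similarity(a, b).item())  # typically > 0.6 (misleadingly high!)

st = SentenceTransformer("all-MiniLM-L6-v2")
emb = st.encode(["I love cats", "Stock prices fell"], convert_to_tensor=True)
print(util.cos_sim(emb[0], emb[1]).item())   # near 0 (correctly unrelated)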

🎓 Analogi: BERT biasa vs Sentence Transformers
BERT biasa = kamus yang bisa menjelaskan arti setiap KATA, tapi tidak bisa menilai apakah dua KALIMAT bermakna sama.
Sentence Transformers = penerjemah yang bisa mengubah kalimat utuh menjadi "sidik jari makna" — dua kalimat dengan makna serupa punya sidik jari yang mirip, meskipun kata-katanya berbeda.
"The cat sat on the mat" ≈ "A feline was resting on a rug" → sidik jari mirip!
"The cat sat on the mat" ≠ "Financial markets crashed" → sidik jari jauh!

🎓 Analogy: Plain BERT vs Sentence Transformers
Plain BERT = a dictionary that explains the meaning of each WORD, but can't judge if two SENTENCES mean the same thing.
Sentence Transformers = a translator that converts whole sentences into "meaning fingerprints" — two sentences with similar meanings have similar fingerprints, even with different words.
"The cat sat on the mat" ≈ "A feline was resting on a rug" → similar fingerprints!
"The cat sat on the mat" ≠ "Financial markets crashed" → distant fingerprints!


2. Sentence Transformers — Library & Cara Kerja

2. Sentence Transformers — Library & How It Works

Library dari UKP Lab yang menjadi standar industri untuk sentence embeddings
Library from UKP Lab that became the industry standard for sentence embeddings
44_sentence_transformers_setup.py — Install & First Embedding (python)
# ===========================
# Install
# ===========================
# pip install sentence-transformers
# (auto-installs transformers, torch, huggingface-hub)

from sentence_transformers import SentenceTransformer

# ===========================
# Load model (downloads from Hub, cached locally)
# ===========================
model = SentenceTransformer("all-MiniLM-L6-v2")
# all-MiniLM-L6-v2:
# - 22M parameters (TINY! DistilBERT=66M, BERT=110M)
# - 384-dimensional embeddings
# - Trained on 1 BILLION sentence pairs
# - Inference: ~14,000 sentences/second on GPU!
# - Best speed/quality ratio for English

print(f"Model loaded: {model}")
print(f"Max sequence length: {model.max_seq_length}")  # 256
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")  # 384

# ===========================
# How Sentence Transformers works internally:
# ===========================
# 1. Tokenize sentence with BERT tokenizer
# 2. Pass through BERT/MiniLM model → get ALL token embeddings
# 3. POOL token embeddings into ONE sentence vector
#    → Mean pooling (average all token embeddings) ← most common
#    → CLS pooling (use [CLS] token only)
#    → Max pooling (take max across tokens)
# 4. NORMALIZE to unit length (for cosine similarity)
# 5. Return: 1 vector per sentence (384 or 768 dimensions)
#
# KEY DIFFERENCE from plain BERT:
# Model is FINE-TUNED on millions of sentence pairs
# using contrastive learning (similar pairs close, dissimilar far)
# → embedding space is MEANINGFUL for similarity!
Sentence Transformers — Dari Kalimat ke Vektor (Internal Flow)

Input: "Jakarta is the capital of Indonesia"
        │
        ▼
┌──────────────────────────────────┐
│ BERT/MiniLM Tokenizer            │
│ → [CLS] Jakarta is the capital   │
│   of Indonesia [SEP]             │
└──────────┬───────────────────────┘
           ▼
┌──────────────────────────────────┐
│ Transformer Encoder (6 layers)   │
│ → Token embeddings:              │
│ [CLS]=[0.12,...] Jakarta=[0.34,.]│
│ is=[-0.11,...] the=[0.05,...]    │
│ capital=[0.67,...] ...           │
│ Shape: (8 tokens, 384 dim)       │
└──────────┬───────────────────────┘
           ▼
┌──────────────────────────────────┐
│ Mean Pooling                     │
│ average ALL token embeddings     │
│ (ignoring [PAD] via attn mask)   │
│ (8, 384) → (1, 384)              │
└──────────┬───────────────────────┘
           ▼
┌──────────────────────────────────┐
│ L2 Normalize                     │
│ vector / ||vector||              │
│ → unit length (norm = 1.0)       │
└──────────┬───────────────────────┘
           ▼
Output: [0.032, -0.018, 0.071, ..., 0.045]  (384 dimensions)
        ↑ ini adalah "sidik jari makna" dari kalimat!
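To verify the flow above, here is a small sketch that reproduces it with plain transformers. It mirrors the mean-pooling recipe published on the all-MiniLM-L6-v2 model card; sentence-transformers performs the same steps internally:

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
enc = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

inputs = tok("Jakarta is the capital of Indonesia", return_tensors="pt")
with torch.no_grad():
    token_embs = enc(**inputs).last_hidden_state         # (1, n_tokens, 384)

mask = inputs["attention_mask"].unsqueeze(-1).float()    # ignore [PAD] tokens
sentence_emb = (token_embs * mask).sum(1) / mask.sum(1)  # step 3: mean pooling
sentence_emb = F.normalize(sentence_emb, p=2, dim=1)     # step 4: unit length
print(sentence_emb.shape)                                # torch.Size([1, 384])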

3. Encode Kalimat → Vektor — Praktik Langsung

3. Encode Sentences → Vectors — Hands-on Practice

Satu baris kode: kalimat apapun → vektor 384 dimensi yang bermakna
One line of code: any sentence → a meaningful 384-dimensional vector
45_encode_sentences.py — Encoding in Practice (python)
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# ===========================
# 1. Single sentence
# ===========================
embedding = model.encode("Jakarta is the capital of Indonesia")
print(f"Type: {type(embedding)}")           # numpy.ndarray
print(f"Shape: {embedding.shape}")           # (384,)
print(f"First 5 values: {embedding[:5]}")    # [0.032, -0.018, ...]
print(f"Norm: {np.linalg.norm(embedding):.4f}")  # 1.0000 (normalized!)

# ===========================
# 2. Batch encoding (MUCH faster!)
# ===========================
sentences = [
    "I love machine learning",
    "Deep learning is fascinating",
    "The weather is beautiful today",
    "I enjoy artificial intelligence",
    "It's raining cats and dogs",
]

embeddings = model.encode(sentences, show_progress_bar=True, batch_size=32)
print(f"Batch shape: {embeddings.shape}")  # (5, 384)
# 5 sentences → 5 vectors, each 384 dimensions

# ===========================
# 3. GPU acceleration
# ===========================
model_gpu = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
embeddings = model_gpu.encode(sentences)  # ~14,000 sentences/sec on T4!

# ===========================
# 4. Return as PyTorch tensors
# ===========================
embeddings_pt = model.encode(sentences, convert_to_tensor=True)
print(f"Tensor device: {embeddings_pt.device}")  # cuda:0 (if GPU)

# ===========================
# 5. Speed benchmark
# ===========================
import time
big_corpus = [f"This is sentence number {i}" for i in range(10000)]
start = time.time()
_ = model.encode(big_corpus, batch_size=256, show_progress_bar=False)
elapsed = time.time() - start
print(f"10,000 sentences in {elapsed:.1f}s ({10000/elapsed:.0f} sent/sec)")
# GPU: 10,000 sentences in 0.7s (14,285 sent/sec)
# CPU: 10,000 sentences in 8.2s (1,220 sent/sec)

4. Cosine Similarity — Mengukur Kedekatan Makna

4. Cosine Similarity — Measuring Semantic Closeness

cos(A,B) = 1.0 → identik, 0.0 → tidak terkait, -1.0 → berlawanan
cos(A,B) = 1.0 → identical, 0.0 → unrelated, -1.0 → opposite
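For reference, cosine similarity is the dot product of the two vectors divided by the product of their norms. A minimal numpy sketch (nothing assumed beyond numpy itself):

import numpy as np

def cosine(a, b):
    # cos(A,B) = A.B / (||A|| * ||B||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Sentence Transformers vectors are already unit-length, so the denominator
# is 1 and cosine similarity reduces to a plain dot product.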
46_cosine_similarity.py — Semantic Similarity 🔬 (python)
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# ===========================
# 1. Pairwise similarity
# ===========================
sent_a = "I love machine learning"
sent_b = "Deep learning is my passion"
sent_c = "The weather is terrible today"

emb_a = model.encode(sent_a, convert_to_tensor=True)
emb_b = model.encode(sent_b, convert_to_tensor=True)
emb_c = model.encode(sent_c, convert_to_tensor=True)

sim_ab = util.cos_sim(emb_a, emb_b).item()
sim_ac = util.cos_sim(emb_a, emb_c).item()
sim_bc = util.cos_sim(emb_b, emb_c).item()

print(f"'{sent_a}' vs '{sent_b}': {sim_ab:.3f}")  # 0.782 (TINGGI โ€” terkait!)
print(f"'{sent_a}' vs '{sent_c}': {sim_ac:.3f}")  # 0.094 (RENDAH โ€” tidak terkait)
print(f"'{sent_b}' vs '{sent_c}': {sim_bc:.3f}")  # 0.051 (RENDAH โ€” tidak terkait)

# ===========================
# 2. Similarity matrix (all pairs!)
# ===========================
sentences = [
    "I love cats",
    "I adore kittens",
    "Dogs are great pets",
    "The stock market crashed",
    "Financial markets are volatile",
]
embeddings = model.encode(sentences, convert_to_tensor=True)
sim_matrix = util.cos_sim(embeddings, embeddings)
print(f"Similarity matrix shape: {sim_matrix.shape}")  # (5, 5)

# Pretty print
for i in range(len(sentences)):
    for j in range(len(sentences)):
        print(f"{sim_matrix[i][j]:.2f}", end="  ")
    print(f"  โ† {sentences[i][:25]}")
# 1.00  0.83  0.45  0.02  0.05  โ† I love cats
# 0.83  1.00  0.48  0.01  0.04  โ† I adore kittens
# 0.45  0.48  1.00  0.03  0.06  โ† Dogs are great pets
# 0.02  0.01  0.03  1.00  0.79  โ† The stock market crashed
# 0.05  0.04  0.06  0.79  1.00  โ† Financial markets volatile

# PERFECT! Two clusters clearly visible:
# Cluster 1: animals (cats, kittens, dogs) โ€” high similarity
# Cluster 2: finance (stock market, financial) โ€” high similarity
# Cross-cluster: near zero โ€” correctly unrelated!

# ===========================
# 3. Find most similar pair
# ===========================
pairs = util.paraphrase_mining(model, sentences, top_k=3)
for score, i, j in pairs:
    print(f"  {score:.3f}: '{sentences[i]}' โ†” '{sentences[j]}'")
# 0.831: 'I love cats' โ†” 'I adore kittens'
# 0.789: 'The stock market crashed' โ†” 'Financial markets are volatile'
# 0.478: 'I adore kittens' โ†” 'Dogs are great pets'

5. Bi-Encoder vs Cross-Encoder — Speed vs Accuracy

5. Bi-Encoder vs Cross-Encoder — Speed vs Accuracy

Bi-encoder = cepat (search jutaan). Cross-encoder = akurat (re-ranking top-K).
Bi-encoder = fast (search millions). Cross-encoder = accurate (re-rank top-K).
Bi-Encoder vs Cross-Encoder — Dua Pendekatan Similarity

Bi-Encoder (Sentence Transformers default)
────────────────────────────────────────
Sent A → [Encoder] → Vector A ─┐
                               ├─→ cosine_sim(A, B) = 0.83
Sent B → [Encoder] → Vector B ─┘
✅ CEPAT: encode sekali, bandingkan jutaan kali (vektor pre-computed!)
✅ Scalable: FAISS index → search 1M docs dalam 5ms
❌ Less accurate: sentence diproses TERPISAH (no cross-attention)

Cross-Encoder (more accurate, slower)
────────────────────────────────────────
[Sent A] [SEP] [Sent B] → [Encoder] → Score: 0.91
✅ AKURAT: kedua kalimat diproses BERSAMA (full cross-attention)
❌ LAMBAT: O(n²) — harus run model untuk SETIAP pasangan!
❌ Not scalable: 10,000 docs = 10,000 forward passes per query

Best Practice: COMBINE both!
Step 1: Bi-encoder → retrieve top 100 candidates (fast, 5ms)
Step 2: Cross-encoder → re-rank top 100 → pick best 10 (accurate, 200ms)
→ Fast AND accurate! This is how production search works.
47_cross_encoder.py — Cross-Encoder for Re-ranking (python)
from sentence_transformers import CrossEncoder

# ===========================
# Cross-encoder: score PAIRS directly
# ===========================
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

pairs = [
    ["How many people live in Jakarta?", "Jakarta has a population of 10 million."],
    ["How many people live in Jakarta?", "Jakarta is the capital of Indonesia."],
    ["How many people live in Jakarta?", "The weather in Jakarta is tropical."],
]

scores = cross_encoder.predict(pairs)
for pair, score in zip(pairs, scores):
    print(f"  {score:+.3f}: Q: {pair[0][:40]} | D: {pair[1][:40]}")
# +7.234: Q: How many people live in Jakarta?  | D: Jakarta has a population of 10 million  ← BEST!
# +1.123: Q: How many people live in Jakarta?  | D: Jakarta is the capital of Indonesia.
# -3.456: Q: How many people live in Jakarta?  | D: The weather in Jakarta is tropical.
Aspek       | Bi-Encoder                          | Cross-Encoder
Kecepatan   | ~14,000 sent/sec (encode)           | ~300 pairs/sec
Scalability | ✅ Jutaan dokumen                   | ❌ Ratusan dokumen
Akurasi     | Good (~85%)                         | Best (~92%)
Pre-compute | ✅ Embed sekali, query berkali-kali | ❌ Harus run per query-doc pair
Use Case    | Retrieval (cari top-K dari jutaan)  | Re-ranking (sortir top-K kandidat)
Production  | Step 1: retrieve                    | Step 2: re-rank
Aspect      | Bi-Encoder                          | Cross-Encoder
Speed       | ~14,000 sent/sec (encode)           | ~300 pairs/sec
Scalability | ✅ Millions of docs                 | ❌ Hundreds of docs
Accuracy    | Good (~85%)                         | Best (~92%)
Pre-compute | ✅ Embed once, query many times     | ❌ Must run per query-doc pair
Use Case    | Retrieval (find top-K from millions)| Re-ranking (sort top-K candidates)
Production  | Step 1: retrieve                    | Step 2: re-rank

6. FAISS — Vector Search untuk Jutaan Dokumen

6. FAISS — Vector Search for Millions of Documents

Facebook AI Similarity Search — cari dokumen paling mirip dalam milidetik
Facebook AI Similarity Search — find most similar documents in milliseconds
48_faiss_search.py — FAISS Vector Search 🔥 (python)
# pip install faiss-cpu  (atau faiss-gpu untuk GPU)
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# ===========================
# 1. Create document corpus
# ===========================
documents = [
    "Python is a popular programming language",
    "Jakarta is the capital of Indonesia",
    "Machine learning uses data to make predictions",
    "The Eiffel Tower is in Paris, France",
    "TensorFlow is a deep learning framework",
    "Nasi goreng is a famous Indonesian dish",
    "Neural networks are inspired by the brain",
    "Mount Bromo is a volcano in East Java",
    "PyTorch was developed by Facebook AI",
    "Bali is a popular tourist destination",
]

# Encode all documents (ONE TIME — then saved!)
doc_embeddings = model.encode(documents, convert_to_numpy=True)
print(f"Embeddings shape: {doc_embeddings.shape}")  # (10, 384)

# ===========================
# 2. Build FAISS index
# ===========================
dimension = doc_embeddings.shape[1]  # 384

# Exact search (small datasets < 100k)
index = faiss.IndexFlatIP(dimension)  # Inner Product (= cosine sim for normalized vectors)
# IndexFlatIP = exact brute-force search
# IndexFlatL2 = L2 distance (use for non-normalized vectors)

# Normalize embeddings (required for cosine similarity with IndexFlatIP)
faiss.normalize_L2(doc_embeddings)
index.add(doc_embeddings)
print(f"Index size: {index.ntotal} vectors")  # 10

# ===========================
# 3. Search!
# ===========================
query = "What programming language is good for AI?"
query_embedding = model.encode([query], convert_to_numpy=True)
faiss.normalize_L2(query_embedding)

k = 3  # return top 3 results
scores, indices = index.search(query_embedding, k)

print(f"\\nQuery: '{query}'")
print(f"Top {k} results:")
for rank, (score, idx) in enumerate(zip(scores[0], indices[0])):
    print(f"  #{rank+1} (sim={score:.3f}): {documents[idx]}")
# #1 (sim=0.623): Python is a popular programming language
# #2 (sim=0.541): TensorFlow is a deep learning framework
# #3 (sim=0.502): Machine learning uses data to make predictions

# ===========================
# 4. For LARGE datasets (1M+ docs): use approximate index
# ===========================
# nlist = 100  # number of Voronoi cells
# quantizer = faiss.IndexFlatIP(dimension)
# index = faiss.IndexIVFFlat(quantizer, dimension, nlist)
# index.train(doc_embeddings)  # train Voronoi cells
# index.add(doc_embeddings)
# index.nprobe = 10  # search 10 nearest cells (speed/accuracy tradeoff)
# → 1M docs: ~5ms per query (vs 50ms for brute-force)

# ===========================
# 5. Save/load index
# ===========================
faiss.write_index(index, "my_search_index.faiss")
loaded_index = faiss.read_index("my_search_index.faiss")
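To make the commented IVF recipe above concrete, here is a self-contained sketch on synthetic vectors. The nlist and nprobe values are typical starting points, not tuned settings:

import faiss
import numpy as np

d, n = 384, 100_000
xb = np.random.rand(n, d).astype("float32")
faiss.normalize_L2(xb)

nlist = 256                                   # number of Voronoi cells
quantizer = faiss.IndexFlatIP(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
ivf.train(xb)                                 # learn cell centroids
ivf.add(xb)
ivf.nprobe = 10                               # cells visited per query

q = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(q)
scores, ids = ivf.search(q, 5)
print(ids[0])                                 # approximate top-5 neighbors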

7. Proyek: Semantic Search Engine — Dari Nol ke Production

7. Project: Semantic Search Engine — From Scratch to Production

Gabungkan semua: embedding + FAISS + cross-encoder re-ranking = production search
Combine all: embedding + FAISS + cross-encoder re-ranking = production search
49_semantic_search_engine.py — Complete Search Engine 🔥🔥🔥 (python)
import faiss
import numpy as np
import json
from sentence_transformers import SentenceTransformer, CrossEncoder

class SemanticSearchEngine:
    """Production semantic search: bi-encoder retrieval + cross-encoder re-ranking."""

    def __init__(self, bi_model="all-MiniLM-L6-v2",
                 cross_model="cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.bi_encoder = SentenceTransformer(bi_model)
        self.cross_encoder = CrossEncoder(cross_model)
        self.documents = []
        self.index = None
        self.dim = self.bi_encoder.get_sentence_embedding_dimension()

    def index_documents(self, documents):
        """Encode documents and build FAISS index."""
        self.documents = documents
        embeddings = self.bi_encoder.encode(documents, convert_to_numpy=True,
                                             show_progress_bar=True, batch_size=256)
        faiss.normalize_L2(embeddings)
        self.index = faiss.IndexFlatIP(self.dim)
        self.index.add(embeddings)
        print(f"Indexed {self.index.ntotal} documents")

    def search(self, query, top_k=5, rerank_top=20):
        """Two-stage search: retrieve + re-rank."""
        # Stage 1: Bi-encoder retrieval (fast!)
        q_emb = self.bi_encoder.encode([query], convert_to_numpy=True)
        faiss.normalize_L2(q_emb)
        scores, indices = self.index.search(q_emb, rerank_top)

        candidates = [(self.documents[idx], score)
                      for idx, score in zip(indices[0], scores[0]) if idx >= 0]

        # Stage 2: Cross-encoder re-ranking (accurate!)
        pairs = [[query, doc] for doc, _ in candidates]
        rerank_scores = self.cross_encoder.predict(pairs)

        # Sort by cross-encoder score
        results = sorted(zip(candidates, rerank_scores),
                         key=lambda x: x[1], reverse=True)[:top_k]

        return [{"document": doc, "bi_score": float(bi_s),
                 "rerank_score": float(re_s)}
                for (doc, bi_s), re_s in results]

    def save(self, path):
        faiss.write_index(self.index, f"{path}/index.faiss")
        with open(f"{path}/docs.json", "w") as f:
            json.dump(self.documents, f)

# ===========================
# Use it!
# ===========================
engine = SemanticSearchEngine()
engine.index_documents([
    "Python is a popular programming language for data science",
    "Jakarta is the capital and largest city of Indonesia",
    "TensorFlow and PyTorch are deep learning frameworks",
    "Machine learning models learn patterns from data",
    "Indonesia has over 17,000 islands",
    # ... add thousands of documents!
])

results = engine.search("What language is best for AI?", top_k=3)
for i, r in enumerate(results):
    print(f"  #{i+1} (rerank={r['rerank_score']:.2f}): {r['document'][:60]}...")

8. Fine-Tune Embedding Model — Domain-Specific

8. Fine-Tune Embedding Model — Domain-Specific

Model generic bagus, tapi domain-specific LEBIH BAIK — medical, legal, Indonesian
Generic models are good, but domain-specific is BETTER — medical, legal, Indonesian
50_finetune_embeddings.py — Train Your Own Embedding Model (python)
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# ===========================
# 1. Prepare training data (pairs + similarity score)
# ===========================
train_examples = [
    InputExample(texts=["I love cats", "I adore kittens"], label=0.9),
    InputExample(texts=["I love cats", "Stock market crashed"], label=0.0),
    InputExample(texts=["Python is great", "I enjoy coding in Python"], label=0.85),
    InputExample(texts=["Jakarta is busy", "The capital is crowded"], label=0.8),
    # ... hundreds/thousands of pairs
]

# ===========================
# 2. Fine-tune!
# ===========================
model = SentenceTransformer("all-MiniLM-L6-v2")  # start from pre-trained
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./my-embedding-model",
    show_progress_bar=True,
)
# → Domain-specific embedding model! Push to Hub if you want.

# ===========================
# 3. Alternative: fine-tune with NLI data (triplets)
# ===========================
# anchor, positive, negative
triplet_examples = [
    InputExample(texts=["I love cats", "Kittens are adorable", "Stock prices fell"]),
    # ...
]
# train_loss = losses.TripletLoss(model)
# or losses.MultipleNegativesRankingLoss(model) ← BEST for retrieval!
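Before shipping a fine-tuned model, it helps to sanity-check it on held-out pairs. A hedged sketch using the library's EmbeddingSimilarityEvaluator; the sentences and labels here are illustrative:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

eval_s1 = ["I love cats", "Python is great"]
eval_s2 = ["I adore kittens", "The weather is bad"]
gold = [0.9, 0.1]  # illustrative human labels

tuned = SentenceTransformer("./my-embedding-model")
evaluator = EmbeddingSimilarityEvaluator(eval_s1, eval_s2, gold)
print(evaluator(tuned))  # correlation between model scores and labels
# (return format varies by sentence-transformers version: float or dict)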

9. RAG Foundations โ€” Search + LLM = Powerful QA

9. RAG Foundations โ€” Search + LLM = Powerful QA

Retrieval-Augmented Generation: cari dokumen relevan, lalu berikan ke LLM sebagai konteks
Retrieval-Augmented Generation: find relevant docs, then give them to LLM as context

RAG = menggabungkan semantic search (Page 6 ini) dengan text generation (Page 3). Alih-alih mengandalkan LLM untuk "mengingat" segalanya, kita cari dokumen relevan dulu, lalu berikan ke LLM sebagai konteks. Hasilnya: jawaban akurat yang berbasis data terkini.

RAG = combining semantic search (this Page 6) with text generation (Page 3). Instead of relying on the LLM to "remember" everything, we search for relevant documents first, then provide them to the LLM as context. Result: accurate answers grounded in current data.

RAG Pipeline — Retrieval-Augmented Generation

User Question: "Berapa penduduk Jakarta tahun 2024?"

Step 1: RETRIEVE (semantic search — Page 6!)
Query → Bi-Encoder → FAISS search → Top 3 documents:
┌──────────────────────────────────────────────────────────┐
│ Doc 1: "Jakarta memiliki populasi sekitar 10.56 juta     │
│        jiwa pada tahun 2024 berdasarkan data BPS."       │
│ Doc 2: "Jakarta adalah ibukota Indonesia yang terletak   │
│        di pulau Jawa."                                   │
│ Doc 3: "Pertumbuhan penduduk Jakarta melambat menjadi    │
│        0.73% per tahun."                                 │
└──────────────────────────────────────────────────────────┘

Step 2: AUGMENT (masukkan docs ke prompt)
┌──────────────────────────────────────────────────────────┐
│ System: Answer based on the following context only.      │
│ Context: {Doc 1} {Doc 2} {Doc 3}                         │
│ Question: Berapa penduduk Jakarta tahun 2024?            │
└──────────────────────────────────────────────────────────┘

Step 3: GENERATE (LLM menjawab berdasarkan context)
┌──────────────────────────────────────────────────────────┐
│ LLM: "Berdasarkan data BPS, penduduk Jakarta pada        │
│       tahun 2024 sekitar 10.56 juta jiwa, dengan         │
│       pertumbuhan 0.73% per tahun."                      │
└──────────────────────────────────────────────────────────┘

RAG advantages vs plain LLM:
✅ Factual (berbasis dokumen, bukan halusinasi)
✅ Up-to-date (dokumen bisa di-update tanpa retrain model)
✅ Verifiable (bisa tunjukkan sumber/dokumen)
✅ Domain-specific (index dokumen internal perusahaan)
51_rag_simple.py — Simple RAG Pipeline (python)
from sentence_transformers import SentenceTransformer
from transformers import pipeline
import faiss, numpy as np

# ===========================
# 1. Setup retriever + generator
# ===========================
retriever = SentenceTransformer("all-MiniLM-L6-v2")
generator = pipeline("text2text-generation", model="google/flan-t5-small", device=0)  # device=0 = GPU; omit for CPU

# ===========================
# 2. Index knowledge base
# ===========================
knowledge_base = [
    "Jakarta has a population of 10.56 million people in 2024.",
    "Indonesia declared independence on August 17, 1945.",
    "Mount Bromo is an active volcano in East Java, Indonesia.",
    "Python was created by Guido van Rossum in 1991.",
    "TensorFlow was developed by Google Brain team.",
]

kb_embeddings = retriever.encode(knowledge_base, convert_to_numpy=True)
faiss.normalize_L2(kb_embeddings)
index = faiss.IndexFlatIP(kb_embeddings.shape[1])
index.add(kb_embeddings)

# ===========================
# 3. RAG function
# ===========================
def rag_answer(question, top_k=2):
    # Retrieve
    q_emb = retriever.encode([question], convert_to_numpy=True)
    faiss.normalize_L2(q_emb)
    scores, indices = index.search(q_emb, top_k)
    context = " ".join([knowledge_base[i] for i in indices[0]])

    # Generate
    prompt = f"Answer based on this context: {context}\n\nQuestion: {question}\nAnswer:"
    result = generator(prompt, max_length=100)
    return result[0]["generated_text"], context

answer, ctx = rag_answer("What is the population of Jakarta?")
print(f"Answer: {answer}")
# "10.56 million people" โ† grounded in context! โœ…

🧩 RAG = Fondasi Chatbot Production Modern!
ChatGPT, Perplexity, Google Gemini — semuanya menggunakan varian RAG. Anda sudah punya semua building blocks:
• Page 3: Text generation (GPT/T5)
• Page 6 (ini): Embeddings + FAISS search
• Combine → RAG!
Page 7 akan membahas Hugging Face Spaces untuk deploy RAG app dengan Gradio.

🧩 RAG = Foundation of Modern Production Chatbots!
ChatGPT, Perplexity, Google Gemini — all use RAG variants. You already have all the building blocks:
• Page 3: Text generation (GPT/T5)
• Page 6 (this): Embeddings + FAISS search
• Combine → RAG!
Page 7 will cover Hugging Face Spaces for deploying RAG apps with Gradio.

๐Ÿ†

10. Pilihan Model Embedding — Mana yang Terbaik?

10. Embedding Model Choices — Which is Best?

Banyak model, masing-masing punya kelebihan — pilih berdasarkan kebutuhan
Many models, each with strengths — choose based on your needs
Model                   | Dim  | Speed | Quality | Bahasa      | Best For
all-MiniLM-L6-v2        | 384  | ⚡⚡⚡ | ⭐⭐⭐     | English     | Speed priority, prototyping ⭐
all-mpnet-base-v2       | 768  | ⚡⚡   | ⭐⭐⭐⭐   | English     | Best English quality
BGE-small-en-v1.5       | 384  | ⚡⚡⚡ | ⭐⭐⭐⭐   | English     | MTEB leaderboard top ⭐
BGE-base-en-v1.5        | 768  | ⚡⚡   | ⭐⭐⭐⭐⭐ | English     | Best overall English
E5-large-v2             | 1024 | ⚡     | ⭐⭐⭐⭐⭐ | English     | Max quality
multilingual-e5-base    | 768  | ⚡⚡   | ⭐⭐⭐⭐   | 100+ bahasa | Multilingual + Indonesian ⭐
paraphrase-multilingual | 768  | ⚡⚡   | ⭐⭐⭐     | 50+ bahasa  | Multilingual general
OpenAI text-embedding-3 | 3072 | API   | ⭐⭐⭐⭐⭐ | Multi       | Best quality (paid API)
Model                   | Dim  | Speed | Quality | Language   | Best For
all-MiniLM-L6-v2        | 384  | ⚡⚡⚡ | ⭐⭐⭐     | English    | Speed priority, prototyping ⭐
all-mpnet-base-v2       | 768  | ⚡⚡   | ⭐⭐⭐⭐   | English    | Best English quality
BGE-small-en-v1.5       | 384  | ⚡⚡⚡ | ⭐⭐⭐⭐   | English    | MTEB leaderboard top ⭐
BGE-base-en-v1.5        | 768  | ⚡⚡   | ⭐⭐⭐⭐⭐ | English    | Best overall English
E5-large-v2             | 1024 | ⚡     | ⭐⭐⭐⭐⭐ | English    | Max quality
multilingual-e5-base    | 768  | ⚡⚡   | ⭐⭐⭐⭐   | 100+ langs | Multilingual + Indonesian ⭐
paraphrase-multilingual | 768  | ⚡⚡   | ⭐⭐⭐     | 50+ langs  | Multilingual general
OpenAI text-embedding-3 | 3072 | API   | ⭐⭐⭐⭐⭐ | Multi      | Best quality (paid API)

🎓 Rekomendasi Cepat:
English + speed: all-MiniLM-L6-v2 (22M params, 384 dim) → SentenceTransformer("all-MiniLM-L6-v2")
English + quality: BGE-base-en-v1.5 (110M, 768 dim) → SentenceTransformer("BAAI/bge-base-en-v1.5")
Indonesian / multilingual: multilingual-e5-base → SentenceTransformer("intfloat/multilingual-e5-base")
Max quality (paid): OpenAI text-embedding-3-large → API call
Page 6 ini: all-MiniLM-L6-v2 (fastest, free, Colab-friendly)

🎓 Quick Recommendations:
English + speed: all-MiniLM-L6-v2 (22M params, 384 dim) → SentenceTransformer("all-MiniLM-L6-v2")
English + quality: BGE-base-en-v1.5 (110M, 768 dim) → SentenceTransformer("BAAI/bge-base-en-v1.5")
Indonesian / multilingual: multilingual-e5-base → SentenceTransformer("intfloat/multilingual-e5-base")
Max quality (paid): OpenAI text-embedding-3-large → API call
This Page 6: all-MiniLM-L6-v2 (fastest, free, Colab-friendly)
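One caveat when switching to the E5 family: per the intfloat model cards, E5 models are trained with "query: " and "passage: " prefixes, and omitting them noticeably hurts retrieval quality. A minimal sketch:

from sentence_transformers import SentenceTransformer

e5 = SentenceTransformer("intfloat/multilingual-e5-base")
q_emb = e5.encode("query: ibukota Indonesia")
d_emb = e5.encode(["passage: Jakarta adalah ibukota Indonesia.",
                   "passage: Nasi goreng adalah makanan khas Indonesia."])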


11. Di Mana Jalankan? — CPU Cukup untuk Inference!

11. Where to Run? — CPU Is Enough for Inference!

Kabar baik: embedding inference bisa di CPU! GPU hanya untuk fine-tuning dan batch besar.
Good news: embedding inference works on CPU! GPU only needed for fine-tuning and large batches.
Task                | CPU       | GPU (T4)    | Rekomendasi
Encode 1 query      | ~5ms ✅   | ~1ms        | CPU cukup!
Encode 100 docs     | ~500ms ✅ | ~50ms       | CPU OK
Encode 10k docs     | ~8 sec    | ~0.7 sec ✅ | GPU lebih baik
Encode 1M docs      | ~13 min   | ~1.2 min ✅ | GPU wajib
FAISS search 1M     | ~5ms ✅   | ~1ms        | CPU cukup!
Fine-tune embedding | Lambat    | ✅ GPU      | Colab T4
Task                | CPU       | GPU (T4)    | Recommendation
Encode 1 query      | ~5ms ✅   | ~1ms        | CPU is enough!
Encode 100 docs     | ~500ms ✅ | ~50ms       | CPU OK
Encode 10k docs     | ~8 sec    | ~0.7 sec ✅ | GPU better
Encode 1M docs      | ~13 min   | ~1.2 min ✅ | GPU required
FAISS search 1M     | ~5ms ✅   | ~1ms        | CPU is enough!
Fine-tune embedding | Slow      | ✅ GPU      | Colab T4

🎉 Kabar Baik: Untuk production search engine, CPU sudah cukup! Encode query (5ms) + FAISS search (5ms) = total 10ms per query pada CPU. Dokumen di-encode sekali (bisa offline di GPU), lalu FAISS search berjalan di CPU. Anda tidak perlu GPU mahal untuk serving. Deploy di VPS $5/bulan pun bisa!

🎉 Good News: For production search engines, CPU is enough! Encode query (5ms) + FAISS search (5ms) = total 10ms per query on CPU. Documents are encoded once (can be done offline on GPU), then FAISS search runs on CPU. You don't need expensive GPUs for serving. Even a $5/month VPS works!
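You can check the ~10ms figure on your own machine with a quick timing sketch. This assumes the model, index, and documents objects from 48_faiss_search.py are already in scope:

import time
import faiss

start = time.time()
q = model.encode(["volcanoes in Indonesia"], convert_to_numpy=True)
faiss.normalize_L2(q)
scores, ids = index.search(q, 3)
print(f"{(time.time() - start) * 1000:.1f} ms")  # encode + search, one query on CPU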


12. Ringkasan Page 6

12. Page 6 Summary

Konsep               | Apa Itu                           | Kode Kunci
Sentence Embedding   | Kalimat → vektor bermakna         | model.encode("text")
SentenceTransformer  | Library untuk sentence embeddings | SentenceTransformer("all-MiniLM-L6-v2")
Cosine Similarity    | Ukur kedekatan makna (0-1)        | util.cos_sim(emb_a, emb_b)
Bi-Encoder           | Encode terpisah, cepat            | SentenceTransformer default
Cross-Encoder        | Encode bersama, akurat            | CrossEncoder("ms-marco-MiniLM")
FAISS                | Vector search jutaan docs (ms!)   | faiss.IndexFlatIP(dim)
Semantic Search      | Retrieve + Re-rank                | Bi-encoder → FAISS → Cross-encoder
Fine-Tune Embeddings | Domain-specific similarity        | model.fit(train_objectives)
RAG                  | Search + LLM = factual QA         | Retrieve context → prompt LLM
Concept              | What It Is                           | Key Code
Sentence Embedding   | Sentence → meaningful vector         | model.encode("text")
SentenceTransformer  | Library for sentence embeddings      | SentenceTransformer("all-MiniLM-L6-v2")
Cosine Similarity    | Measure semantic closeness (0-1)     | util.cos_sim(emb_a, emb_b)
Bi-Encoder           | Encode separately, fast              | SentenceTransformer default
Cross-Encoder        | Encode together, accurate            | CrossEncoder("ms-marco-MiniLM")
FAISS                | Vector search millions of docs (ms!) | faiss.IndexFlatIP(dim)
Semantic Search      | Retrieve + Re-rank                   | Bi-encoder → FAISS → Cross-encoder
Fine-Tune Embeddings | Domain-specific similarity           | model.fit(train_objectives)
RAG                  | Search + LLM = factual QA            | Retrieve context → prompt LLM
โ† Page Sebelumnyaโ† Previous Page

Page 5 — Question Answering & Seq2Seq (T5)


Coming Next: Page 7 — Hugging Face Spaces & Gradio Apps

Deploy model Anda sebagai web app! Page 7 membahas: Gradio library untuk UI interaktif, membangun demo app untuk setiap model dari Page 2-6, deploy ke HF Spaces (URL publik gratis!), Streamlit integration, sharing models dan apps, dan building a complete RAG chatbot app.


Coming Next: Page 7 — Hugging Face Spaces & Gradio Apps

Deploy your model as a web app! Page 7 covers: Gradio library for interactive UI, building demo apps for every model from Pages 2-6, deploying to HF Spaces (free public URL!), Streamlit integration, sharing models and apps, and building a complete RAG chatbot app.