Table of Contents — Page 6
- Why Plain BERT Fails — The [CLS] problem for similarity
- Sentence Transformers — Library & how it works
- Encode Sentences → Vectors — Hands-on practice
- Cosine Similarity — Measuring semantic closeness
- Bi-Encoder vs Cross-Encoder — Speed vs accuracy tradeoff
- FAISS — Vector search for millions of documents
- Project: Semantic Search Engine — From scratch to production
- Fine-Tune Embedding Model — Domain-specific embeddings
- RAG Foundations — Search + LLM = powerful QA
- Embedding Model Choices — MiniLM, BGE, E5, multilingual
- Where to Run? — CPU is enough for inference!
- Summary & Page 7 Preview
1. Why Plain BERT Fails for Similarity — The [CLS] Problem
Many people's initial intuition: "BERT has a [CLS] token that represents the entire sentence, so I can use [CLS] embedding to compute similarity between sentences." This is WRONG! Plain BERT (without fine-tuning for similarity) produces [CLS] embeddings that are nearly meaningless for semantic comparison. Research shows that even averaged GloVe embeddings perform better than BERT [CLS] for similarity tasks!
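Why are untuned [CLS] similarities so uninformative? One known culprit is anisotropy: plain BERT's sentence vectors all share a large common direction, so every pair looks "similar" under cosine. Here is a toy NumPy illustration of that effect — the numbers are simulated, not real BERT outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def cos(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy model of an anisotropic embedding space: every "sentence vector"
# is one dominant shared component plus a small content-specific part.
common = rng.normal(size=384) * 10.0   # large component shared by ALL sentences
content = rng.normal(size=(3, 384))    # the part that actually differs
cls_like = common + content            # three "[CLS]-like" vectors

# All pairwise similarities are high, regardless of content:
print(round(cos(cls_like[0], cls_like[1]), 2))  # close to 1.0
print(round(cos(cls_like[0], cls_like[2]), 2))  # close to 1.0

# Remove the shared component (roughly what contrastive fine-tuning
# achieves) and similarity reflects content again:
print(round(cos(content[0], content[1]), 2))    # near 0 for unrelated content
```

With every pair scoring near 1.0, ranking documents by similarity becomes meaningless — which is exactly the problem Sentence Transformers fixes by fine-tuning the embedding space.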
Analogy: Plain BERT vs Sentence Transformers
Plain BERT = a dictionary that explains the meaning of each WORD, but can't judge if two SENTENCES mean the same thing.
Sentence Transformers = a translator that converts whole sentences into "meaning fingerprints" — two sentences with similar meanings have similar fingerprints, even with different words.
"The cat sat on the mat" vs "A feline was resting on a rug" → similar fingerprints!
"The cat sat on the mat" vs "Financial markets crashed" → distant fingerprints!
2. Sentence Transformers — Library & How It Works
# ===========================
# Install
# ===========================
# pip install sentence-transformers
# (auto-installs transformers, torch, huggingface-hub)

from sentence_transformers import SentenceTransformer

# ===========================
# Load model (downloads from Hub, cached locally)
# ===========================
model = SentenceTransformer("all-MiniLM-L6-v2")

# all-MiniLM-L6-v2:
# - 22M parameters (TINY! DistilBERT=66M, BERT=110M)
# - 384-dimensional embeddings
# - Trained on 1 BILLION sentence pairs
# - Inference: ~14,000 sentences/second on GPU!
# - Best speed/quality ratio for English

print(f"Model loaded: {model}")
print(f"Max sequence length: {model.max_seq_length}")                      # 256
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")  # 384

# ===========================
# How Sentence Transformers works internally:
# ===========================
# 1. Tokenize sentence with BERT tokenizer
# 2. Pass through BERT/MiniLM model → get ALL token embeddings
# 3. POOL token embeddings into ONE sentence vector
#    → Mean pooling (average all token embeddings) → most common
#    → CLS pooling (use [CLS] token only)
#    → Max pooling (take max across tokens)
# 4. NORMALIZE to unit length (for cosine similarity)
# 5. Return: 1 vector per sentence (384 or 768 dimensions)
#
# KEY DIFFERENCE from plain BERT:
# Model is FINE-TUNED on millions of sentence pairs
# using contrastive learning (similar pairs close, dissimilar far)
# → embedding space is MEANINGFUL for similarity!
3. Encode Sentences → Vectors — Hands-on Practice
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# ===========================
# 1. Single sentence
# ===========================
embedding = model.encode("Jakarta is the capital of Indonesia")
print(f"Type: {type(embedding)}")                # numpy.ndarray
print(f"Shape: {embedding.shape}")               # (384,)
print(f"First 5 values: {embedding[:5]}")        # [0.032, -0.018, ...]
print(f"Norm: {np.linalg.norm(embedding):.4f}")  # 1.0000 (normalized!)

# ===========================
# 2. Batch encoding (MUCH faster!)
# ===========================
sentences = [
    "I love machine learning",
    "Deep learning is fascinating",
    "The weather is beautiful today",
    "I enjoy artificial intelligence",
    "It's raining cats and dogs",
]
embeddings = model.encode(sentences, show_progress_bar=True, batch_size=32)
print(f"Batch shape: {embeddings.shape}")  # (5, 384)
# 5 sentences → 5 vectors, each 384 dimensions

# ===========================
# 3. GPU acceleration
# ===========================
model_gpu = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
embeddings = model_gpu.encode(sentences)  # ~14,000 sentences/sec on T4!

# ===========================
# 4. Return as PyTorch tensors
# ===========================
embeddings_pt = model.encode(sentences, convert_to_tensor=True)
print(f"Tensor device: {embeddings_pt.device}")  # cuda:0 (if GPU)

# ===========================
# 5. Speed benchmark
# ===========================
import time

big_corpus = [f"This is sentence number {i}" for i in range(10000)]
start = time.time()
_ = model.encode(big_corpus, batch_size=256, show_progress_bar=False)
elapsed = time.time() - start
print(f"10,000 sentences in {elapsed:.1f}s ({10000/elapsed:.0f} sent/sec)")
# GPU: 10,000 sentences in 0.7s (14,285 sent/sec)
# CPU: 10,000 sentences in 8.2s (1,220 sent/sec)
4. Cosine Similarity — Measuring Semantic Closeness
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# ===========================
# 1. Pairwise similarity
# ===========================
sent_a = "I love machine learning"
sent_b = "Deep learning is my passion"
sent_c = "The weather is terrible today"

emb_a = model.encode(sent_a, convert_to_tensor=True)
emb_b = model.encode(sent_b, convert_to_tensor=True)
emb_c = model.encode(sent_c, convert_to_tensor=True)

sim_ab = util.cos_sim(emb_a, emb_b).item()
sim_ac = util.cos_sim(emb_a, emb_c).item()
sim_bc = util.cos_sim(emb_b, emb_c).item()

print(f"'{sent_a}' vs '{sent_b}': {sim_ab:.3f}")  # 0.782 (HIGH → related!)
print(f"'{sent_a}' vs '{sent_c}': {sim_ac:.3f}")  # 0.094 (LOW → unrelated)
print(f"'{sent_b}' vs '{sent_c}': {sim_bc:.3f}")  # 0.051 (LOW → unrelated)

# ===========================
# 2. Similarity matrix (all pairs!)
# ===========================
sentences = [
    "I love cats",
    "I adore kittens",
    "Dogs are great pets",
    "The stock market crashed",
    "Financial markets are volatile",
]
embeddings = model.encode(sentences, convert_to_tensor=True)
sim_matrix = util.cos_sim(embeddings, embeddings)
print(f"Similarity matrix shape: {sim_matrix.shape}")  # (5, 5)

# Pretty print
for i in range(len(sentences)):
    for j in range(len(sentences)):
        print(f"{sim_matrix[i][j]:.2f}", end=" ")
    print(f" ← {sentences[i][:25]}")

# 1.00 0.83 0.45 0.02 0.05  ← I love cats
# 0.83 1.00 0.48 0.01 0.04  ← I adore kittens
# 0.45 0.48 1.00 0.03 0.06  ← Dogs are great pets
# 0.02 0.01 0.03 1.00 0.79  ← The stock market crashed
# 0.05 0.04 0.06 0.79 1.00  ← Financial markets volatile

# PERFECT! Two clusters clearly visible:
#   Cluster 1: animals (cats, kittens, dogs) → high similarity
#   Cluster 2: finance (stock market, financial) → high similarity
#   Cross-cluster: near zero → correctly unrelated!

# ===========================
# 3. Find most similar pairs
# ===========================
pairs = util.paraphrase_mining(model, sentences, top_k=3)
for score, i, j in pairs:
    print(f"  {score:.3f}: '{sentences[i]}' ↔ '{sentences[j]}'")
# 0.831: 'I love cats' ↔ 'I adore kittens'
# 0.789: 'The stock market crashed' ↔ 'Financial markets are volatile'
# 0.478: 'I adore kittens' ↔ 'Dogs are great pets'
5. Bi-Encoder vs Cross-Encoder — Speed vs Accuracy
from sentence_transformers import CrossEncoder

# ===========================
# Cross-encoder: score PAIRS directly
# ===========================
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

pairs = [
    ["How many people live in Jakarta?", "Jakarta has a population of 10 million."],
    ["How many people live in Jakarta?", "Jakarta is the capital of Indonesia."],
    ["How many people live in Jakarta?", "The weather in Jakarta is tropical."],
]
scores = cross_encoder.predict(pairs)
for pair, score in zip(pairs, scores):
    print(f"  {score:+.3f}: Q: {pair[0][:40]} | D: {pair[1][:40]}")
# +7.234: Q: How many people live in Jakarta? | D: Jakarta has a population of 10 million ← BEST!
# +1.123: Q: How many people live in Jakarta? | D: Jakarta is the capital of Indonesia.
# -3.456: Q: How many people live in Jakarta? | D: The weather in Jakarta is tropical.
| Aspect | Bi-Encoder | Cross-Encoder |
|---|---|---|
| Speed | ~14,000 sent/sec (encode) | ~300 pairs/sec |
| Scalability | ✅ Millions of docs | ❌ Hundreds of docs |
| Accuracy | Good (~85%) | Best (~92%) |
| Pre-compute | ✅ Embed once, query many times | ❌ Must run per query-doc pair |
| Use Case | Retrieval (find top-K from millions) | Re-ranking (sort top-K candidates) |
| Production | Step 1: retrieve | Step 2: re-rank |
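The scaling asymmetry in the table follows directly from the shapes involved: a bi-encoder turns each document into a fixed vector once, so a query costs one encoding plus one matrix product, while a cross-encoder has nothing to precompute and must run a full forward pass per (query, document) pair. A small NumPy sketch of the bi-encoder side, with random unit vectors standing in for real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_docs = 384, 100_000

# Precomputed ONCE and stored: one unit vector per document.
doc_matrix = rng.normal(size=(n_docs, dim)).astype(np.float32)
doc_matrix /= np.linalg.norm(doc_matrix, axis=1, keepdims=True)

# Per query: encode once, then a single matrix-vector product scores
# ALL documents at the same time (this is what faiss.IndexFlatIP does).
query = rng.normal(size=dim).astype(np.float32)
query /= np.linalg.norm(query)

scores = doc_matrix @ query         # cosine similarity (unit vectors)
top_k = np.argsort(-scores)[:10]    # indices of the 10 best candidates
print(scores.shape, top_k.shape)    # (100000,) (10,)

# A cross-encoder would instead need 100,000 transformer forward passes
# here — which is why it is reserved for re-ranking the top-K only.
```

This is exactly the two-stage "retrieve, then re-rank" pattern built out in the project section below.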
6. FAISS — Vector Search for Millions of Documents
# pip install faiss-cpu  (or faiss-gpu for GPU)
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# ===========================
# 1. Create document corpus
# ===========================
documents = [
    "Python is a popular programming language",
    "Jakarta is the capital of Indonesia",
    "Machine learning uses data to make predictions",
    "The Eiffel Tower is in Paris, France",
    "TensorFlow is a deep learning framework",
    "Nasi goreng is a famous Indonesian dish",
    "Neural networks are inspired by the brain",
    "Mount Bromo is a volcano in East Java",
    "PyTorch was developed by Facebook AI",
    "Bali is a popular tourist destination",
]

# Encode all documents (ONE TIME → then saved!)
doc_embeddings = model.encode(documents, convert_to_numpy=True)
print(f"Embeddings shape: {doc_embeddings.shape}")  # (10, 384)

# ===========================
# 2. Build FAISS index
# ===========================
dimension = doc_embeddings.shape[1]  # 384

# Exact search (small datasets < 100k)
index = faiss.IndexFlatIP(dimension)  # Inner Product (= cosine sim for normalized vectors)
# IndexFlatIP = exact brute-force search
# IndexFlatL2 = L2 distance (use for non-normalized vectors)

# Normalize embeddings (required for cosine similarity with IndexFlatIP)
faiss.normalize_L2(doc_embeddings)
index.add(doc_embeddings)
print(f"Index size: {index.ntotal} vectors")  # 10

# ===========================
# 3. Search!
# ===========================
query = "What programming language is good for AI?"
query_embedding = model.encode([query], convert_to_numpy=True)
faiss.normalize_L2(query_embedding)

k = 3  # return top 3 results
scores, indices = index.search(query_embedding, k)

print(f"\nQuery: '{query}'")
print(f"Top {k} results:")
for rank, (score, idx) in enumerate(zip(scores[0], indices[0])):
    print(f"  #{rank+1} (sim={score:.3f}): {documents[idx]}")
# #1 (sim=0.623): Python is a popular programming language
# #2 (sim=0.541): TensorFlow is a deep learning framework
# #3 (sim=0.502): Machine learning uses data to make predictions

# ===========================
# 4. For LARGE datasets (1M+ docs): use approximate index
# ===========================
# nlist = 100  # number of Voronoi cells
# quantizer = faiss.IndexFlatIP(dimension)
# index = faiss.IndexIVFFlat(quantizer, dimension, nlist)
# index.train(doc_embeddings)  # train Voronoi cells
# index.add(doc_embeddings)
# index.nprobe = 10  # search 10 nearest cells (speed/accuracy tradeoff)
# → 1M docs: ~5ms per query (vs 50ms for brute-force)

# ===========================
# 5. Save/load index
# ===========================
faiss.write_index(index, "my_search_index.faiss")
loaded_index = faiss.read_index("my_search_index.faiss")
7. Project: Semantic Search Engine — From Scratch to Production
import faiss
import numpy as np
import json
from sentence_transformers import SentenceTransformer, CrossEncoder


class SemanticSearchEngine:
    """Production semantic search: bi-encoder retrieval + cross-encoder re-ranking."""

    def __init__(self, bi_model="all-MiniLM-L6-v2",
                 cross_model="cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.bi_encoder = SentenceTransformer(bi_model)
        self.cross_encoder = CrossEncoder(cross_model)
        self.documents = []
        self.index = None
        self.dim = self.bi_encoder.get_sentence_embedding_dimension()

    def index_documents(self, documents):
        """Encode documents and build FAISS index."""
        self.documents = documents
        embeddings = self.bi_encoder.encode(documents, convert_to_numpy=True,
                                            show_progress_bar=True, batch_size=256)
        faiss.normalize_L2(embeddings)
        self.index = faiss.IndexFlatIP(self.dim)
        self.index.add(embeddings)
        print(f"Indexed {self.index.ntotal} documents")

    def search(self, query, top_k=5, rerank_top=20):
        """Two-stage search: retrieve + re-rank."""
        # Stage 1: Bi-encoder retrieval (fast!)
        q_emb = self.bi_encoder.encode([query], convert_to_numpy=True)
        faiss.normalize_L2(q_emb)
        scores, indices = self.index.search(q_emb, rerank_top)
        candidates = [(self.documents[idx], score)
                      for idx, score in zip(indices[0], scores[0]) if idx >= 0]

        # Stage 2: Cross-encoder re-ranking (accurate!)
        pairs = [[query, doc] for doc, _ in candidates]
        rerank_scores = self.cross_encoder.predict(pairs)

        # Sort by cross-encoder score
        results = sorted(zip(candidates, rerank_scores),
                         key=lambda x: x[1], reverse=True)[:top_k]
        return [{"document": doc, "bi_score": float(bi_s), "rerank_score": float(re_s)}
                for (doc, bi_s), re_s in results]

    def save(self, path):
        faiss.write_index(self.index, f"{path}/index.faiss")
        with open(f"{path}/docs.json", "w") as f:
            json.dump(self.documents, f)


# ===========================
# Use it!
# ===========================
engine = SemanticSearchEngine()
engine.index_documents([
    "Python is a popular programming language for data science",
    "Jakarta is the capital and largest city of Indonesia",
    "TensorFlow and PyTorch are deep learning frameworks",
    "Machine learning models learn patterns from data",
    "Indonesia has over 17,000 islands",
    # ... add thousands of documents!
])

results = engine.search("What language is best for AI?", top_k=3)
for i, r in enumerate(results):
    print(f"  #{i+1} (rerank={r['rerank_score']:.2f}): {r['document'][:60]}...")
8. Fine-Tune Embedding Model — Domain-Specific
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# ===========================
# 1. Prepare training data (pairs + similarity score)
# ===========================
train_examples = [
    InputExample(texts=["I love cats", "I adore kittens"], label=0.9),
    InputExample(texts=["I love cats", "Stock market crashed"], label=0.0),
    InputExample(texts=["Python is great", "I enjoy coding in Python"], label=0.85),
    InputExample(texts=["Jakarta is busy", "The capital is crowded"], label=0.8),
    # ... hundreds/thousands of pairs
]

# ===========================
# 2. Fine-tune!
# ===========================
model = SentenceTransformer("all-MiniLM-L6-v2")  # start from pre-trained
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./my-embedding-model",
    show_progress_bar=True,
)
# → Domain-specific embedding model! Push to Hub if you want.

# ===========================
# 3. Alternative: fine-tune with NLI data (triplets)
# ===========================
# anchor, positive, negative
triplet_examples = [
    InputExample(texts=["I love cats", "Kittens are adorable", "Stock prices fell"]),
    # ...
]
# train_loss = losses.TripletLoss(model)
# or losses.MultipleNegativesRankingLoss(model) → BEST for retrieval!
9. RAG Foundations — Search + LLM = Powerful QA
RAG = combining semantic search (this page) with text generation (Page 3). Instead of relying on the LLM to "remember" everything, we first retrieve relevant documents, then pass them to the LLM as context. The result: accurate answers grounded in up-to-date data.
from sentence_transformers import SentenceTransformer
from transformers import pipeline
import faiss
import numpy as np

# ===========================
# 1. Setup retriever + generator
# ===========================
retriever = SentenceTransformer("all-MiniLM-L6-v2")
generator = pipeline("text2text-generation", model="google/flan-t5-small", device=0)

# ===========================
# 2. Index knowledge base
# ===========================
knowledge_base = [
    "Jakarta has a population of 10.56 million people in 2024.",
    "Indonesia declared independence on August 17, 1945.",
    "Mount Bromo is an active volcano in East Java, Indonesia.",
    "Python was created by Guido van Rossum in 1991.",
    "TensorFlow was developed by Google Brain team.",
]
kb_embeddings = retriever.encode(knowledge_base, convert_to_numpy=True)
faiss.normalize_L2(kb_embeddings)
index = faiss.IndexFlatIP(kb_embeddings.shape[1])
index.add(kb_embeddings)

# ===========================
# 3. RAG function
# ===========================
def rag_answer(question, top_k=2):
    # Retrieve
    q_emb = retriever.encode([question], convert_to_numpy=True)
    faiss.normalize_L2(q_emb)
    scores, indices = index.search(q_emb, top_k)
    context = " ".join([knowledge_base[i] for i in indices[0]])
    # Generate
    prompt = f"Answer based on this context: {context}\n\nQuestion: {question}\nAnswer:"
    result = generator(prompt, max_length=100)
    return result[0]["generated_text"], context

answer, ctx = rag_answer("What is the population of Jakarta?")
print(f"Answer: {answer}")
# "10.56 million people" → grounded in context! ✅
🧩 RAG = Foundation of Modern Production Chatbots!
ChatGPT, Perplexity, Google Gemini — all use RAG variants. You already have all the building blocks:
• Page 3: Text generation (GPT/T5)
• Page 6 (this one): Embeddings + FAISS search
• Combine → RAG!
Page 7 will cover Hugging Face Spaces for deploying RAG apps with Gradio.
10. Embedding Model Choices — Which Is Best?
| Model | Dim | Speed | Quality | Language | Best For |
|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | ⚡⚡⚡ | ⭐⭐⭐ | English | Speed priority, prototyping ⭐ |
| all-mpnet-base-v2 | 768 | ⚡⚡ | ⭐⭐⭐⭐ | English | Best English quality |
| BGE-small-en-v1.5 | 384 | ⚡⚡⚡ | ⭐⭐⭐⭐ | English | MTEB leaderboard top ⭐ |
| BGE-base-en-v1.5 | 768 | ⚡⚡ | ⭐⭐⭐⭐⭐ | English | Best overall English |
| E5-large-v2 | 1024 | ⚡ | ⭐⭐⭐⭐⭐ | English | Max quality |
| multilingual-e5-base | 768 | ⚡⚡ | ⭐⭐⭐⭐ | 100+ langs | Multilingual + Indonesian ⭐ |
| paraphrase-multilingual | 768 | ⚡⚡ | ⭐⭐⭐ | 50+ langs | Multilingual general |
| OpenAI text-embedding-3 | 3072 | API | ⭐⭐⭐⭐⭐ | Multi | Best quality (paid API) |
Quick Recommendations:
English + speed: all-MiniLM-L6-v2 (22M params, 384 dim) → SentenceTransformer("all-MiniLM-L6-v2")
English + quality: BGE-base-en-v1.5 (110M, 768 dim) → SentenceTransformer("BAAI/bge-base-en-v1.5")
Indonesian / multilingual: multilingual-e5-base → SentenceTransformer("intfloat/multilingual-e5-base")
Max quality (paid): OpenAI text-embedding-3-large → API call
This page uses all-MiniLM-L6-v2 (fastest, free, Colab-friendly).
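One practical gotcha when swapping in the E5 family recommended above: per the intfloat model cards, E5 models were trained with instruction prefixes, so queries and passages should be prefixed before encoding or retrieval quality degrades. A minimal sketch (the example sentences are illustrative):

```python
# E5-family models (e.g. intfloat/multilingual-e5-base) expect these
# prefixes on the text they encode:
question = "What is the population of Jakarta?"
document = "Jakarta has a population of about 10.5 million people."

e5_query = f"query: {question}"      # prefix for search queries
e5_passage = f"passage: {document}"  # prefix for indexed documents

print(e5_query)    # query: What is the population of Jakarta?
print(e5_passage)  # passage: Jakarta has a population of about 10.5 million people.

# Then encode as usual, e.g.:
# model = SentenceTransformer("intfloat/multilingual-e5-base")
# q_emb = model.encode([e5_query], normalize_embeddings=True)
# d_emb = model.encode([e5_passage], normalize_embeddings=True)
```

MiniLM and MPNet models need no such prefixes, which is one reason they are easier to start with.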
11. Where to Run? — CPU Is Enough for Inference!
| Task | CPU | GPU (T4) | Recommendation |
|---|---|---|---|
| Encode 1 query | ~5 ms ✅ | ~1 ms | CPU is enough! |
| Encode 100 docs | ~500 ms ✅ | ~50 ms | CPU OK |
| Encode 10k docs | ~8 sec | ~0.7 sec ✅ | GPU better |
| Encode 1M docs | ~13 min | ~1.2 min ✅ | GPU required |
| FAISS search 1M | ~5 ms ✅ | ~1 ms | CPU is enough! |
| Fine-tune embedding | Slow | ✅ GPU | Colab T4 |
Good News: For a production search engine, CPU is enough! Encoding the query (~5 ms) + FAISS search (~5 ms) = ~10 ms total per query on CPU. Documents are encoded once (this can happen offline, on a GPU if you like), and from then on FAISS search runs on CPU. You don't need an expensive GPU for serving; even a $5/month VPS works!
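The claim above is just arithmetic; a quick sanity check of the per-query latency budget, using the rough figures from the table:

```python
# Rough CPU serving budget (illustrative figures from the table above)
encode_ms = 5                    # bi-encoder: encode one query
search_ms = 5                    # FAISS: search a ~1M-vector index
total_ms = encode_ms + search_ms

qps_per_core = 1000 / total_ms   # queries per second, single core
print(total_ms, qps_per_core)    # 10 100.0
```

So a single CPU core sustains on the order of 100 queries/second; most small deployments never come close to that load.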
12. Page 6 Summary
| Concept | What It Is | Key Code |
|---|---|---|
| Sentence Embedding | Sentence → meaningful vector | model.encode("text") |
| SentenceTransformer | Library for sentence embeddings | SentenceTransformer("all-MiniLM-L6-v2") |
| Cosine Similarity | Measure semantic closeness (0-1) | util.cos_sim(emb_a, emb_b) |
| Bi-Encoder | Encode separately, fast | SentenceTransformer default |
| Cross-Encoder | Encode together, accurate | CrossEncoder("ms-marco-MiniLM") |
| FAISS | Vector search millions of docs (ms!) | faiss.IndexFlatIP(dim) |
| Semantic Search | Retrieve + Re-rank | Bi-encoder → FAISS → Cross-encoder |
| Fine-Tune Embeddings | Domain-specific similarity | model.fit(train_objectives) |
| RAG | Search + LLM = factual QA | Retrieve context → prompt LLM |
Page 5 — Question Answering & Seq2Seq (T5)
Coming Next: Page 7 — Hugging Face Spaces & Gradio Apps
Deploy your model as a web app! Page 7 covers: Gradio library for interactive UI, building demo apps for every model from Pages 2-6, deploying to HF Spaces (free public URL!), Streamlit integration, sharing models and apps, and building a complete RAG chatbot app.