Table of Contents
- Why Evaluate? — Vibes-based testing is not enough for production
- Key Benchmarks 2026 — MMLU, HumanEval, MATH, LMSys Arena
- LLM-as-Judge — Using an LLM to grade another LLM
- Custom Eval Suite — Building evaluations for your use case
- Metrics — Accuracy, latency, cost, safety
- Leaderboards — LMSys Chatbot Arena, Open LLM
- Evaluation Tools — W&B Weave, LangSmith, Braintrust
- Summary
📊
1. Why Does Evaluation Matter?
Vibes-based evaluation: the model feels great in the demo, then fails in production.
Without systematic evaluation, we rely on vibes: try a few prompts, it feels good, deploy. This is dangerous: a model can be excellent in a demo yet fail on edge cases. Benchmarks provide objective numbers. A custom eval suite verifies that the model works for your specific use case. LLM-as-Judge enables automated evaluation without human labelers.
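The "systematic instead of vibes" idea boils down to a fixed test set scored the same way on every run. A minimal sketch, where the `model` function is a hypothetical stand-in for a real LLM API call:

```python
# Minimal systematic eval: a fixed test set replaces ad-hoc "vibes" checks.

def model(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM API call.
    canned = {"What is 2 + 2?": "4", "Capital of France?": "Lyon"}
    return canned.get(prompt, "")

TEST_CASES = [
    {"prompt": "What is 2 + 2?", "expected": "4"},
    # An edge case the demo never exercised:
    {"prompt": "Capital of France?", "expected": "Paris"},
]

def run_eval(test_cases) -> float:
    """Fraction of test cases where the model output exactly matches."""
    passed = sum(model(tc["prompt"]).strip() == tc["expected"]
                 for tc in test_cases)
    return passed / len(test_cases)

print(f"accuracy: {run_eval(TEST_CASES):.0%}")  # the fake model fails the edge case
```

Exact match is the simplest scoring rule; freer-form outputs need the LLM-as-Judge approach covered below.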
🏆
2. Key Benchmarks (2026)
What each benchmark measures
| Benchmark | Measures | Format | Top Score (2026) | Notes |
|---|---|---|---|---|
| MMLU | Knowledge across 57 subjects | Multiple choice | 95%+ (frontier) | Standard knowledge test |
| HumanEval | Python code generation | Function completion | 92%+ (Claude/GPT) | Coding benchmark |
| MATH | Mathematical reasoning | Problem solving | 95%+ (o3/R1) | With CoT reasoning |
| GPQA Diamond | PhD-level science QA | Expert questions | ~65% (o3) | Very hard, expert-level |
| LMSys Arena | Overall human preference | Blind A/B voting | Claude 4.6 / GPT-4.5 | Most trusted overall |
| SWE-bench | Real GitHub bug fixing | Code patches | 72%+ (Claude) | Practical coding |
| BrowseComp | Web browsing research | Multi-step tasks | 78%+ (Deep Research) | Agentic capability |
| ARC-AGI | Novel reasoning | Pattern puzzles | ~50% (o3 high) | AGI-oriented |
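Several of the benchmarks above (MMLU, GPQA) reduce to the same scoring loop: the model picks an answer, which is compared against a gold label. A simplified sketch with made-up items (not actual benchmark questions):

```python
# Scoring an MMLU-style multiple-choice benchmark (illustrative items only).

ITEMS = [
    {"question": "Which planet is closest to the Sun?",
     "choices": {"A": "Venus", "B": "Mercury", "C": "Mars", "D": "Earth"},
     "answer": "B"},
    {"question": "What is the derivative of x^2?",
     "choices": {"A": "x", "B": "2", "C": "2x", "D": "x^2"},
     "answer": "C"},
]

def score_multiple_choice(predicted_letters, items) -> float:
    """Accuracy = correctly chosen letters / total items."""
    correct = sum(pred == item["answer"]
                  for pred, item in zip(predicted_letters, items))
    return correct / len(items)

print(score_multiple_choice(["B", "A"], ITEMS))  # one of two correct -> 0.5
```

Open-ended benchmarks like SWE-bench or LMSys Arena need heavier machinery (test execution, human voting), but the reported number is still a pass rate over a fixed item set.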
🧪
3. LLM-as-Judge
Use another LLM to grade your model's output: scalable evaluation.

```python
# LLM-as-Judge: automated evaluation
judge_prompt = """Rate this response on a 1-10 scale.
Criteria: accuracy, helpfulness, safety, clarity.
Question: {question}
Response: {response}
Provide: score (1-10), reasoning, specific issues."""

# Run across 100+ test cases automatically
# Tools: W&B Weave, LangSmith, Braintrust, RAGAS
# Track over time: accuracy, latency, cost, safety rate
```
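In practice the judge's free-text reply still has to be parsed into a number. A sketch, assuming a hypothetical `call_llm` function and repeating the prompt template so the snippet is self-contained:

```python
import re

# Same template as above, repeated so this snippet runs on its own.
JUDGE_PROMPT = """Rate this response on a 1-10 scale.
Criteria: accuracy, helpfulness, safety, clarity.
Question: {question}
Response: {response}
Provide: score (1-10), reasoning, specific issues."""

def parse_score(judge_output: str) -> int:
    """Extract the first integer in 1-10 from the judge's reply."""
    match = re.search(r"\b(10|[1-9])\b", judge_output)
    if match is None:
        raise ValueError("judge returned no parseable score")
    return int(match.group(1))

def judge(question: str, response: str, call_llm) -> int:
    """call_llm is a hypothetical function wrapping your judge-model API."""
    prompt = JUDGE_PROMPT.format(question=question, response=response)
    return parse_score(call_llm(prompt))

# A fake judge reply stands in for a real API call in this sketch:
fake_llm = lambda prompt: "score: 8 - accurate and clear, no safety issues"
print(judge("What is RAG?", "Retrieval-Augmented Generation ...", fake_llm))
```

Production judge setups usually request structured (e.g. JSON) output instead of regex parsing; tools like W&B Weave and LangSmith handle this plumbing for you.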
📏
4. Custom Eval Suite
Build evaluations for your specific use case
| Metric | What It Measures | How to Measure |
|---|---|---|
| Accuracy | Correct answers / total | Automated: exact match, LLM-judge |
| Faithfulness | Answer grounded in retrieved context (RAG) | RAGAS framework, citation verification |
| Latency (p50/p95/p99) | Response time distribution | Logging + aggregation |
| Cost per query | Token usage x price | API billing, token counter |
| Safety rate | % responses passing safety filter | Automated safety classifier |
| User satisfaction | Thumbs up/down rate | In-app feedback tracking |
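The latency percentiles and cost-per-query metrics in the table can be computed with nothing beyond the standard library. A sketch with illustrative numbers; the per-token prices are assumptions, not real vendor quotes:

```python
# Latency percentiles (nearest-rank method) and cost per query.

latencies_ms = [120, 135, 150, 180, 210, 250, 320, 400, 650, 1200]

def percentile(values, pct: float):
    """Nearest-rank percentile, e.g. pct=95 for p95."""
    ordered = sorted(values)
    rank = max(0, round(pct / 100 * len(ordered)) - 1)
    return ordered[rank]

p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))

# Cost per query = tokens used x price per token (assumed USD prices).
PRICE_PER_1K_INPUT = 0.003
PRICE_PER_1K_OUTPUT = 0.015

def cost_per_query(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT \
         + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

print(f"p50={p50}ms p95={p95}ms p99={p99}ms")
print(f"cost: ${cost_per_query(1500, 400):.4f} per query")
```

Note how p95/p99 expose the slow tail that an average would hide, which is exactly why the table tracks the full latency distribution rather than a single mean.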
Next: Part 9 — LLM Security & Safety
Prompt Injection, Guardrails, Hallucination. Threats and defenses.
Tech Review Desk — LLM Learning Series
Sources: Sebastian Raschka, Anthropic, OpenAI, Hugging Face, LLMOrbit, DeepSeek technical reports.