LLM Learning Series – Part 8

LLM Evaluation & Benchmarks

MMLU, HumanEval, LMSys Arena, SWE-bench: how to measure LLM capability objectively. Part 8 covers the key benchmarks of 2026, LLM-as-Judge evaluation, custom eval suites, and why vibes-based evaluation is not enough for production.

March 2026 • 30 min read • Benchmark • MMLU • HumanEval • LMSys • Evaluation

Table of Contents

  1. Why Evaluate? – Vibes-based evaluation is not enough for production
  2. Key Benchmarks 2026 – MMLU, HumanEval, MATH, LMSys Arena
  3. LLM-as-Judge – Use an LLM to grade another LLM
  4. Custom Eval Suite – Build an evaluation suite for your use case
  5. Metrics – Accuracy, latency, cost, safety
  6. Leaderboards – LMSys Chatbot Arena, Open LLM Leaderboard
  7. Evaluation Tools – W&B Weave, LangSmith, Braintrust
  8. Summary
📊

1. Why Does Evaluation Matter?

Vibes-based evaluation: the model feels great in the demo, then fails in production

Without systematic evaluation we rely on vibes: try a few prompts, they look good, ship it. This is dangerous, because a model can be excellent in the demo and still fail on edge cases. Benchmarks give you objective numbers. A custom eval suite verifies that the model works for your specific use case. LLM-as-Judge enables automated evaluation without human labelers.

🏆

2. Key Benchmarks (2026)

What each benchmark measures
| Benchmark | Measures | Format | Top Score (2026) | Notes |
| --- | --- | --- | --- | --- |
| MMLU | Knowledge across 57 subjects | Multiple choice | 95%+ (frontier) | Standard knowledge test |
| HumanEval | Python code generation | Function completion | 92%+ (Claude/GPT) | Coding benchmark |
| MATH | Mathematical reasoning | Problem solving | 95%+ (o3/R1) | With CoT reasoning |
| GPQA Diamond | PhD-level science QA | Expert questions | ~65% (o3) | Very hard, expert-level |
| LMSys Arena | Overall human preference | Blind A/B voting | Claude 4.6 / GPT-4.5 | Most trusted overall |
| SWE-bench | Real GitHub bug fixing | Code patches | 72%+ (Claude) | Practical coding |
| BrowseComp | Web browsing research | Multi-step tasks | 78%+ (Deep Research) | Agentic capability |
| ARC-AGI | Novel reasoning | Pattern puzzles | ~50% (o3 high) | AGI-oriented |
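Most knowledge benchmarks in the table, MMLU included, reduce to a multiple-choice accuracy loop: prompt the model, parse the answer letter, and compare it against the gold label. The sketch below shows the shape of that loop; the ask_model stub and the two toy items are placeholders, not the real MMLU data or an official harness.

# Toy multiple-choice accuracy loop in the MMLU style (not the official harness).

def ask_model(question: str, choices: list[str]) -> str:
    """Hypothetical stand-in: return the letter (A-D) the model picks."""
    prompt = question + "\n" + "\n".join(
        f"{letter}. {text}" for letter, text in zip("ABCD", choices)
    )
    # In a real run, send `prompt` to your model and parse the answer letter.
    return "A"  # stub answer

# Two toy items in MMLU-like shape: question, four choices, gold letter.
dataset = [
    {"question": "2 + 2 = ?", "choices": ["4", "3", "5", "22"], "answer": "A"},
    {"question": "H2O is commonly known as?", "choices": ["Salt", "Water", "Acid", "Sand"], "answer": "B"},
]

correct = sum(ask_model(x["question"], x["choices"]) == x["answer"] for x in dataset)
print(f"accuracy: {correct / len(dataset):.1%}")

Code-oriented benchmarks such as HumanEval and SWE-bench replace the letter comparison with running unit tests against the generated code, but the outer loop looks the same.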
🧪

3. LLM-as-Judge

Use another LLM to grade your model's outputs: scalable evaluation
12_llm_judge.py
# LLM-as-Judge: automated evaluation
judge_prompt = """Rate this response on 1-10 scale.
Criteria: accuracy, helpfulness, safety, clarity.
Question: {question}
Response: {response}
Provide: score (1-10), reasoning, specific issues."""

# Run across 100+ test cases automatically
# Tools: W&B Weave, LangSmith, Braintrust, RAGAS
# Track over time: accuracy, latency, cost, safety rate
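To turn the prompt above into an actual evaluation run, you need a loop that calls a judge model on every test case and aggregates the scores. The sketch below assumes the OpenAI Python SDK and uses gpt-4o as a placeholder judge model; the test cases are illustrative. The tools listed above (W&B Weave, LangSmith, Braintrust, RAGAS) add logging, versioning, and dashboards on top of this kind of loop.

# Minimal LLM-as-Judge loop (sketch, assumes the OpenAI Python SDK).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """Rate this response on a 1-10 scale.
Criteria: accuracy, helpfulness, safety, clarity.
Question: {question}
Response: {response}
Reply with only the integer score."""

# Illustrative test cases; in practice, load your 100+ cases from a file.
test_cases = [
    {"question": "What is the capital of France?", "response": "Paris."},
    {"question": "How do I reset my password?", "response": "Click 'Forgot password' on the login page."},
]

scores = []
for case in test_cases:
    completion = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model; pick a strong model you trust
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(**case)}],
        temperature=0,
    )
    scores.append(int(completion.choices[0].message.content.strip()))

print(f"mean judge score: {sum(scores) / len(scores):.2f}")

Asking for a bare integer at temperature 0 keeps parsing trivial; for a richer signal you would also request the reasoning and specific issues, as the prompt in 12_llm_judge.py does, and log both.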
📏

4. Custom Eval Suite

Build evaluations for your specific use case
| Metric | What It Is | How to Measure |
| --- | --- | --- |
| Accuracy | Correct answers vs. total | Automated: exact match, LLM-judge |
| Faithfulness | Answer grounded in the context (RAG) | RAGAS framework, citation verification |
| Latency (p50/p95/p99) | Response time distribution | Logging + aggregation |
| Cost per query | Token usage × price | API billing, token counter |
| Safety rate | % of responses passing the safety filter | Automated safety classifier |
| User satisfaction | Thumbs up/down rate | In-app feedback tracking |
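As a concrete example of the latency and cost rows above, the sketch below computes p50/p95/p99 latency and average cost per query from simple request logs. The log records and per-token prices are made-up placeholders; substitute your own logging pipeline and your provider's actual rates.

# Aggregate latency percentiles and cost per query from request logs (sketch).
from statistics import quantiles

# One record per request; in practice these come from your logging pipeline.
logs = [
    {"latency_s": 0.84, "input_tokens": 420, "output_tokens": 150},
    {"latency_s": 1.31, "input_tokens": 900, "output_tokens": 310},
    {"latency_s": 0.66, "input_tokens": 200, "output_tokens": 90},
]

PRICE_IN = 3.00 / 1_000_000    # $ per input token (placeholder rate)
PRICE_OUT = 15.00 / 1_000_000  # $ per output token (placeholder rate)

latencies = [r["latency_s"] for r in logs]
pct = quantiles(latencies, n=100)  # 99 cut points; index 49 = p50, 94 = p95, 98 = p99
print(f"p50={pct[49]:.2f}s  p95={pct[94]:.2f}s  p99={pct[98]:.2f}s")

total_cost = sum(r["input_tokens"] * PRICE_IN + r["output_tokens"] * PRICE_OUT for r in logs)
print(f"cost per query: ${total_cost / len(logs):.5f}")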
Tech Review Desk – LLM Learning Series
Sources: Sebastian Raschka, Anthropic, OpenAI, Hugging Face, LLMOrbit, DeepSeek technical reports.
rominur@gmail.com  •  t.me/Jekardah_AI