Table of Contents
- Why Local LLMs? — Privacy, cost, speed, offline
- Ollama — Install and run in 2 commands
- Popular Models 2026 — Qwen3, DeepSeek-R1, LLaMA4, Gemma3
- Quantization — FP16, Q8, Q5, Q4, Q3, Q2 — trade-offs
- Hardware Guide — RAM, GPU, CPU requirements
- Python Integration — Ollama API + LangChain
- Local vs Cloud Comparison — When to use which
- Summary
💻
1. Why Local LLMs?
100% privacy, zero cost, instant, offline. The four main reasons.
Cloud LLM APIs (GPT-4, Claude) are powerful but come with trade-offs: your data is sent to third-party servers (a privacy concern), per-token costs that add up quickly at high volume, network latency, and a hard dependency on an internet connection. A local LLM runs the model directly on your own hardware: data never leaves the machine, it is free after the initial download, latency is minimal, and it works offline. In 2026, a quantized 8B-parameter model runs smoothly on a laptop with 8 GB of RAM.
| Aspect | Cloud API (GPT-4/Claude) | Local LLM (Ollama) |
|---|---|---|
| Privacy | Data sent to a server | Data never leaves your machine |
| Cost | $0.01-0.06 per 1K tokens | Free after download |
| Latency | 200-2000 ms (network) | 50-200 ms (local) |
| Offline | Needs internet | Works 100% offline |
| Quality | Frontier-level | Good (8B) to great (70B) |
| Setup | API key, done | Install Ollama, download a model |
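The cost row in the table is easy to make concrete with a back-of-envelope calculation. This is an illustrative sketch: the $0.03-per-1K-token price and the 10M-token volume are assumptions for the sake of the arithmetic, not current vendor pricing.

```python
# Break-even sketch: cloud API cost vs. a local model's zero marginal cost.
# The per-1K-token price is an assumed mid-range figure for illustration.

def cloud_cost_usd(tokens: int, price_per_1k: float = 0.03) -> float:
    """Monthly cloud API spend at a flat per-1K-token price."""
    return tokens / 1000 * price_per_1k

# At ~$0.03 per 1K tokens, 10M tokens/month costs $300 every month,
# while a local model's marginal cost per token is effectively zero.
print(f"10M tokens/month on a cloud API: ${cloud_cost_usd(10_000_000):.0f}")
```

The break-even point depends on your hardware: if the machine that runs the model is one you already own, any sustained volume tips the comparison toward local.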
🦙
2. Ollama — LLM in 2 Commands
Install and run, as easy as Docker
# Install Ollama (Mac/Linux/Windows)
$ curl -fsSL https://ollama.ai/install.sh | sh
# Download and run a model — ONE command!
$ ollama run llama3.2
>>> What is machine learning?
Machine learning is a branch of AI that enables...
# Popular models, March 2026
$ ollama run qwen3:8b        # Alibaba, strong on Asian languages
$ ollama run deepseek-r1:8b  # Reasoning model
$ ollama run llama4-scout    # Meta's latest
$ ollama run gemma3:12b      # Google, compact
$ ollama run codestral       # Mistral, coding
$ ollama run phi4            # Microsoft, small but mighty
# The API is OpenAI-compatible!
$ curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3:8b","messages":[{"role":"user","content":"Hello!"}]}'
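The same OpenAI-compatible endpoint can be called from Python with nothing but the standard library. A minimal sketch, assuming Ollama is serving on its default port 11434 and that the model tag you pass (here `qwen3:8b`) has already been pulled:

```python
# Minimal chat call against Ollama's OpenAI-compatible endpoint,
# using only the Python standard library (no SDK required).
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_chat_request(model: str, prompt: str) -> dict:
    """Assemble an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(model: str, prompt: str) -> str:
    """POST the payload to the local Ollama server and return the reply text."""
    body = json.dumps(build_chat_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

# chat("qwen3:8b", "Hello!")  # requires a running `ollama serve`
```

Because the endpoint speaks the OpenAI wire format, the official `openai` Python client also works against it if you point its `base_url` at `http://localhost:11434/v1`.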
📐
3. Quantization — Smaller Models
From Float16 to Int4: 4x smaller, small enough to run on a laptop
| Quant Level | Bits | Size (8B model) | Quality | Min RAM | Recommended For |
|---|---|---|---|---|---|
| FP16 | 16-bit | ~16 GB | Full quality | 20+ GB | Server/GPU |
| Q8_0 | 8-bit | ~8.5 GB | Near-perfect (99%) | 12 GB | Desktop GPU |
| Q5_K_M | 5-bit | ~5.7 GB | Very good (97%) | 8 GB | Good laptop |
| Q4_K_M | 4-bit | ~4.9 GB | Good (95%) | 8 GB | RECOMMENDED |
| Q3_K_M | 3-bit | ~3.5 GB | Acceptable (90%) | 6 GB | Low-end laptop |
| Q2_K | 2-bit | ~2.7 GB | Degraded (80%) | 4 GB | Desperate only |
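The file sizes in the table follow from simple arithmetic: parameters × bits per weight ÷ 8 bytes. A sketch of that calculation, where the effective-bits values are assumptions back-derived from the table's sizes (K-quants mix precisions across tensors, so the average can sit above or below the nominal bit width):

```python
# Back-of-envelope model file size: params * effective bits per weight / 8.
# Effective bits here are implied by the table's sizes, not official specs.
EFFECTIVE_BITS = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.9,
    "Q3_K_M": 3.5,
    "Q2_K": 2.7,
}

def size_gb(params_billion: float, quant: str) -> float:
    """Approximate on-disk size in GB for a model at a given quant level."""
    bits = EFFECTIVE_BITS[quant]
    return params_billion * 1e9 * bits / 8 / 1e9

for q in EFFECTIVE_BITS:
    print(f"8B @ {q}: ~{size_gb(8, q):.1f} GB")
```

The same formula explains the headline claim: going from 16-bit FP16 to ~4-bit Q4 cuts the file roughly 4x, which is what moves an 8B model from server territory into laptop RAM.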
💻
4. Hardware Guide
How much RAM and GPU you need
| Model Size | Min RAM | Recommended GPU | Speed (tok/s) | Use Case |
|---|---|---|---|---|
| 1-3B | 4 GB | CPU only | 20-40 | Simple tasks, edge devices |
| 7-8B | 8 GB | CPU or 8GB GPU | 15-30 | General purpose, coding |
| 13-14B | 16 GB | 12-16GB GPU | 10-20 | Better quality, analysis |
| 32-34B | 32 GB | 24GB GPU (RTX 4090) | 5-15 | Near-frontier quality |
| 70B | 48-64 GB | 2x 24GB GPUs | 3-8 | Best open-source quality |
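The table can double as a sizing rule of thumb. A hypothetical helper that maps available RAM to the largest model tier from the table; the thresholds mirror the "Min RAM" column, and real headroom also depends on context length and OS overhead:

```python
# Illustrative helper: pick the largest model size class that fits in RAM.
# Thresholds copied from the hardware table's "Min RAM" column.
TIERS = [(4, "1-3B"), (8, "7-8B"), (16, "13-14B"), (32, "32-34B"), (48, "70B")]

def largest_tier(ram_gb: int) -> str:
    """Return the biggest model tier whose minimum RAM requirement fits."""
    fitting = [name for min_ram, name in TIERS if ram_gb >= min_ram]
    return fitting[-1] if fitting else "none"

print(largest_tier(8))   # typical laptop -> 7-8B class
print(largest_tier(32))  # workstation   -> 32-34B class
```

In practice you would also factor in quant level: 16 GB of RAM fits a 13-14B model at Q4, but the same 16 GB barely fits an 8B model at FP16.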
📕
Next: Part 7 — Practical Fine-tuning
LoRA, QLoRA, PEFT. Adapt an LLM to your domain on a consumer GPU.
Tech Review Desk — LLM Learning Series
Sources: Sebastian Raschka, Anthropic, OpenAI, Hugging Face, LLMOrbit, DeepSeek technical reports.