šŸ“ Artikel ini ditulis dalam Bahasa Indonesia & English
šŸ“ This article is available in English & Bahasa Indonesia

šŸŽÆ Belajar Hugging Face — Page 9 / Learn Hugging Face — Page 9

RLHF, DPO &
Model Alignment

Teknik yang mengubah GPT-3 menjadi ChatGPT. Page 9 membahas super detail: kenapa SFT saja tidak cukup (model bisa toxic, berbohong, tidak helpful), 3 tahap training ChatGPT (pre-training → SFT → RLHF), Reinforcement Learning from Human Feedback — konsep lengkap, Reward Modeling — melatih model untuk menilai kualitas respons, PPO (Proximal Policy Optimization) — algoritma RL klasik untuk alignment, DPO (Direct Preference Optimization) — alternatif RLHF yang lebih sederhana dan stabil, TRL library untuk semua teknik alignment, proyek: DPO fine-tuning dengan preference dataset, ORPO dan SimPO sebagai alternatif terbaru, safety dan Constitutional AI, dan di mana menjalankan (QLoRA + DPO di Colab T4).

The technique that turned GPT-3 into ChatGPT. Page 9 covers in super detail: why SFT alone isn't enough (model can be toxic, lie, be unhelpful), 3 stages of ChatGPT training (pre-training → SFT → RLHF), Reinforcement Learning from Human Feedback — complete concept, Reward Modeling — training a model to judge response quality, PPO (Proximal Policy Optimization) — classic RL algorithm for alignment, DPO (Direct Preference Optimization) — simpler and more stable RLHF alternative, TRL library for all alignment techniques, project: DPO fine-tuning with preference dataset, ORPO and SimPO as newest alternatives, safety and Constitutional AI, and where to run (QLoRA + DPO on Colab T4).

šŸ“… Maret / March 2026 Ā· ā± 40 menit baca / 40 min read
šŸ· RLHFDPOPPOAlignmentTRLReward ModelPreferenceSafety
šŸ“š Seri Belajar Hugging Face / Learn Hugging Face Series

šŸ“‘ Daftar Isi — Page 9

šŸ“‘ Table of Contents — Page 9

  1. Kenapa SFT Tidak Cukup — Masalah model yang hanya di-SFT
  2. 3 Tahap Training ChatGPT — Pre-train → SFT → RLHF
  3. RLHF Explained — Konsep lengkap step by step
  4. Reward Modeling — Melatih "hakim" kualitas respons
  5. PPO — Proximal Policy Optimization untuk alignment
  6. DPO — Direct Preference Optimization (lebih mudah!)
  7. TRL Library — Satu library untuk semua alignment
  8. Proyek: DPO Fine-Tuning — Preference dataset + QLoRA
  9. ORPO & SimPO — Alternatif terbaru yang lebih efisien
  10. Safety & Constitutional AI — Membuat model aman
  11. Ringkasan & Preview Page 10
  1. Why SFT Isn't Enough — Problems with SFT-only models
  2. 3 Stages of ChatGPT Training — Pre-train → SFT → RLHF
  3. RLHF Explained — Complete concept step by step
  4. Reward Modeling — Training a response quality "judge"
  5. PPO — Proximal Policy Optimization for alignment
  6. DPO — Direct Preference Optimization (easier!)
  7. TRL Library — One library for all alignment
  8. Project: DPO Fine-Tuning — Preference dataset + QLoRA
  9. ORPO & SimPO — Newest more efficient alternatives
  10. Safety & Constitutional AI — Making models safe
  11. Summary & Page 10 Preview
āš ļø

1. Kenapa SFT Tidak Cukup — Model Bisa Berbahaya

1. Why SFT Isn't Enough — Models Can Be Harmful

SFT mengajarkan FORMAT jawaban, tapi BUKAN kualitas, keamanan, atau kejujuran
SFT teaches answer FORMAT, but NOT quality, safety, or honesty

Di Page 3 dan 8, kita melakukan Supervised Fine-Tuning (SFT) — mengajarkan model menjawab instruksi dengan format yang benar. Tapi SFT punya masalah besar: model bisa menghasilkan jawaban yang formatnya benar tapi isinya berbahaya, tidak akurat, atau tidak helpful.

In Pages 3 and 8, we did Supervised Fine-Tuning (SFT) — teaching models to answer instructions in the correct format. But SFT has a big problem: models can produce answers that are correctly formatted but harmful, inaccurate, or unhelpful.

SFT vs RLHF — Kenapa Alignment Dibutuhkan

Model setelah SFT saja:
  User: "Bagaimana cara membobol WiFi?"
  Model: "Langkah 1: Install aircrack-ng. Langkah 2: ..."
  → Format BENAR (ikuti instruksi āœ…) tapi BERBAHAYA! āŒ

  User: "Siapa presiden Indonesia tahun 2030?"
  Model: "Presiden Indonesia tahun 2030 adalah Prabowo Subianto."
  → Format BENAR tapi HALUSINASI (confident tapi salah)! āŒ

  User: "Tulis puisi tentang cinta."
  Model: "Cinta itu indah. Cinta itu baik. Cinta itu bagus. Cinta..."
  → Format BENAR tapi BORING dan repetitive! āŒ

Model setelah SFT + RLHF/DPO:
  User: "Bagaimana cara membobol WiFi?"
  Model: "Saya tidak bisa membantu membobol WiFi karena itu ilegal. Jika Anda lupa password WiFi sendiri, coba reset router..."
  → AMAN āœ… dan tetap HELPFUL āœ…

  User: "Siapa presiden Indonesia tahun 2030?"
  Model: "Saya tidak memiliki informasi tentang masa depan. Presiden Indonesia saat ini adalah..."
  → JUJUR āœ…, mengakui ketidaktahuan

  User: "Tulis puisi tentang cinta."
  Model: "Di antara dua hati yang berdetak / Cinta tumbuh bagai bunga yang mekar di pagi hari / Lembut namun kuat..."
  → KREATIF āœ… dan BERKUALITAS āœ…

RLHF/DPO mengajarkan model untuk prefer respons yang:
  āœ… Helpful (berguna bagi user)
  āœ… Harmless (tidak berbahaya)
  āœ… Honest (jujur, mengakui ketidaktahuan)
šŸ”„

2. 3 Tahap Training ChatGPT — Pipeline Lengkap

2. 3 Stages of ChatGPT Training — Complete Pipeline

Pre-training → Supervised Fine-Tuning → RLHF — ini yang membuat ChatGPT dari GPT-3
Pre-training → Supervised Fine-Tuning → RLHF — this is what turned GPT-3 into ChatGPT
3 Tahap Training: Dari GPT-3 ke ChatGPT

Stage 1: PRE-TRAINING (dilakukan Google/Meta/OpenAI — BUKAN kita)
  Data: Internet, buku, Wikipedia (triliunan token)
  Task: next token prediction (causal LM)
  Cost: $1M - $100M+
  Result: GPT-3, LLaMA, Mistral (base model)
  → Model bisa melanjutkan teks, tapi BUKAN assistant
      ↓
Stage 2: SFT — Supervised Fine-Tuning (Page 3 & 8 seri ini!)
  Data: 10k-100k instruction-response pairs
  Task: follow instructions, answer in correct format
  Cost: $0 (Colab + QLoRA)
  Result: model yang bisa mengikuti instruksi
  → Tapi belum "aligned" (bisa toxic, halusinasi)
      ↓
Stage 3: RLHF / DPO — Alignment (Page 9 ini! ⭐)
  Data: human preference data (chosen vs rejected)
  Task: prefer helpful, harmless, honest responses
  Cost: $0 (Colab + QLoRA + DPO)
  Result: ChatGPT-like assistant! šŸŽ‰
  → Helpful + Harmless + Honest

Di seri ini, Anda sudah melakukan Stage 2 (Page 3, 8). Sekarang Page 9: Stage 3 — alignment dengan DPO!
šŸŽÆ

3. RLHF Explained — Konsep Lengkap

3. RLHF Explained — Complete Concept

Human menilai respons → model belajar menghasilkan respons yang disukai human
Humans rate responses → model learns to generate responses humans prefer
RLHF Pipeline — 3 Sub-steps

Step 1: Collect Human Preferences
  Prompt: "Jelaskan machine learning"
  Response A: "ML adalah cabang AI yang memungkinkan komputer belajar..."
  Response B: "ML itu kayak neural network gitu deh, pokoknya bagus..."
  Human: "A lebih baik!" → A = chosen, B = rejected
  (Kumpulkan ribuan pasangan seperti ini)

Step 2: Train Reward Model
  Input: prompt + response → Output: quality score (scalar)
  Train pada preference data: reward(prompt, A) > reward(prompt, B) untuk semua pasangan
  → Reward model bisa "menilai" kualitas respons apapun!

Step 3: Optimize Policy with RL (PPO)
  1. LLM generates a response
  2. Reward model scores it
  3. PPO updates the LLM to maximize reward
  4. KL penalty: jangan terlalu jauh dari SFT model (prevent reward hacking)
  Repeat → LLM semakin menghasilkan respons berkualitas tinggi!

MASALAH RLHF: sangat kompleks (3 model: LLM + reward + reference), tidak stabil (reward hacking, mode collapse), sulit di-tune.
SOLUSI: DPO — skip reward model, langsung optimize! (Section 6)
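To make Step 1 concrete, here is a minimal sketch (not from the original article) of how a handful of human-labeled pairs could be turned into a Hugging Face dataset; the record layout and the prompt/chosen/rejected field names are assumptions that match the convention used later with DPOTrainer.

from datasets import Dataset

# Hand-labeled preference pairs; texts are taken from the example above
preference_records = [
    {
        "prompt": "Jelaskan machine learning",
        "chosen": "ML adalah cabang AI yang memungkinkan komputer belajar...",
        "rejected": "ML itu kayak neural network gitu deh, pokoknya bagus...",
    },
    # ...collect thousands of pairs like this from human annotators
]

pref_dataset = Dataset.from_list(preference_records)
print(pref_dataset)  # Dataset with columns: prompt, chosen, rejected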
āš–ļø

4. Reward Modeling — Melatih "Hakim" Kualitas

4. Reward Modeling — Training a Quality "Judge"

65_reward_model.py — Reward Model Concept (Python)
from trl import RewardTrainer, RewardConfig
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from datasets import load_dataset

# ===========================
# Reward Model = classifier: (prompt, response) → score
# Trained on human preferences: score(chosen) > score(rejected)
# ===========================

# 1. Load preference dataset
# Format: {"prompt": "...", "chosen": "good response", "rejected": "bad response"}
dataset = load_dataset("Anthropic/hh-rlhf", split="train[:5000]")

# 2. Load base model as reward model
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=1)  # 1 output = scalar score
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# 3. Train: reward(chosen) > reward(rejected)
# Loss = -log(sigmoid(reward(chosen) - reward(rejected)))
# → Bradley-Terry model of preferences

# trainer = RewardTrainer(
#     model=model, args=RewardConfig(...),
#     train_dataset=dataset, tokenizer=tokenizer)
# trainer.train()

# CATATAN: Di praktik, SKIP reward model dan gunakan DPO langsung!
# DPO = implicit reward model (lebih sederhana, lebih stabil)
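To make the Bradley-Terry loss in the comments above concrete, here is a tiny standalone sketch with made-up reward values (illustrative only, not TRL's internal implementation):

import torch
import torch.nn.functional as F

# Illustrative scalar scores the reward model might assign to one pair
reward_chosen = torch.tensor([1.8])    # score for the preferred response
reward_rejected = torch.tensor([0.3])  # score for the rejected response

# Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected))
# Minimizing it pushes reward(chosen) above reward(rejected)
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
print(round(loss.item(), 4))  # small when chosen already outscores rejected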
šŸŽ®

5. PPO — Proximal Policy Optimization

5. PPO — Proximal Policy Optimization

Algoritma RL yang dipakai OpenAI untuk ChatGPT — powerful tapi kompleks
RL algorithm used by OpenAI for ChatGPT — powerful but complex
66_ppo_concept.py — PPO with TRL (Python)
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead

# ===========================
# PPO for RLHF — conceptual code
# ===========================
# PPO needs 3-4 models in VRAM simultaneously:
# 1. Policy model (LLM being trained)
# 2. Reference model (frozen copy — for KL penalty)
# 3. Reward model (scores responses)
# 4. Value head (estimates future rewards)
# → VERY expensive! 4Ɨ VRAM of single model

# config = PPOConfig(
#     model_name="gpt2",
#     learning_rate=1e-5,
#     batch_size=16,
#     mini_batch_size=4,
#     ppo_epochs=4,           # PPO optimization epochs per batch
#     kl_penalty="kl",        # KL divergence from reference model
#     target_kl=0.1,          # target KL (stops if exceeded)
# )

# PPO Training Loop:
# for batch in dataloader:
#     1. Generate responses: policy_model(prompts) → responses
#     2. Score responses: reward_model(prompts, responses) → rewards
#     3. Compute KL penalty: KL(policy || reference) → penalty
#     4. Adjusted reward = reward - β Ɨ KL_penalty
#     5. PPO update: maximize adjusted reward

# MASALAH PPO:
# āŒ Butuh 3-4 model di VRAM → sangat mahal
# āŒ Tidak stabil (reward hacking, mode collapse)
# āŒ Banyak hyperparameters (KL target, clip ratio, etc.)
# āŒ Sulit di-reproduce
# → SOLUSI: DPO! (Section 6)
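Steps 3-4 of the loop above can be illustrated with a small numeric sketch. This is a simplified approximation with invented log-probabilities, not the exact PPOTrainer code:

import torch

# Made-up per-token log-probs of one generated response under the policy
# being trained and under the frozen reference (SFT) model
logp_policy = torch.tensor([-1.2, -0.8, -2.1])
logp_reference = torch.tensor([-1.5, -0.9, -1.7])

reward = 0.9   # scalar score from the reward model
beta = 0.05    # KL penalty coefficient

# Simple KL estimate on this sample: sum over tokens of
# log pi(y_t|x) - log pi_ref(y_t|x)
kl = (logp_policy - logp_reference).sum()

# The quantity PPO actually maximizes: reward minus the KL penalty,
# which keeps the policy from drifting too far from the SFT model
adjusted_reward = reward - beta * kl.item()
print(adjusted_reward)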
⚔

6. DPO — Direct Preference Optimization (Game-Changer!)

6. DPO — Direct Preference Optimization (Game-Changer!)

Skip reward model, skip RL — langsung optimize dari preference data!
Skip reward model, skip RL — directly optimize from preference data!

DPO (Rafailov et al., 2023) adalah breakthrough yang menyederhanakan RLHF secara drastis. Alih-alih melatih reward model terpisah lalu melakukan RL (PPO), DPO langsung mengoptimasi LLM menggunakan preference data. Hasilnya: sama bagusnya dengan RLHF, tapi jauh lebih mudah dan stabil.

DPO (Rafailov et al., 2023) is a breakthrough that drastically simplifies RLHF. Instead of training a separate reward model then doing RL (PPO), DPO directly optimizes the LLM using preference data. Result: as good as RLHF, but much easier and more stable.

PPO vs DPO — Kenapa DPO Menang

RLHF/PPO (traditional, complex):
  Preference data → [Train Reward Model] → [PPO with RL] → Aligned LLM
                      ↑ model terpisah!      ↑ 3-4 models di VRAM!
  • 3-4 models needed simultaneously
  • Unstable training (reward hacking)
  • Many hyperparameters to tune
  • Hard to reproduce

DPO (modern, simple):
  Preference data → [DPO Loss on LLM directly] → Aligned LLM
                      ↑ HANYA 2 models (policy + reference)!
  • NO reward model needed
  • NO RL needed
  • Stable training (standard cross-entropy-like loss)
  • Few hyperparameters (mainly β)
  • Easy to reproduce
  • SAME or BETTER results than PPO!

DPO Loss (simplified):
  L = -log σ( β Ā· [ (log Ļ€(chosen) āˆ’ log Ļ€_ref(chosen)) āˆ’ (log Ļ€(rejected) āˆ’ log Ļ€_ref(rejected)) ] )

Intuisi: naikkan probability respons "chosen", turunkan probability respons "rejected", tapi jangan terlalu jauh dari reference model (β controls this).
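As a sanity check of the loss formula above, here is a toy numeric sketch with invented log-probabilities (DPOTrainer computes all of this internally; the sketch only illustrates the math):

import torch
import torch.nn.functional as F

beta = 0.1  # same role as DPOConfig(beta=...) in the project code later

# Made-up summed log-probs of the full chosen/rejected responses under the
# policy being trained (pi) and the frozen reference model (pi_ref)
logp_chosen, logp_rejected = torch.tensor(-45.0), torch.tensor(-52.0)
ref_chosen, ref_rejected = torch.tensor(-48.0), torch.tensor(-50.0)

# DPO loss: -log sigmoid(beta * [(log pi(chosen) - log pi_ref(chosen))
#                                - (log pi(rejected) - log pi_ref(rejected))])
margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
loss = -F.logsigmoid(beta * margin)
print(round(loss.item(), 4))  # smaller when the policy prefers "chosen" more than the reference does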
Aspek | PPO (RLHF) | DPO
Models di VRAM | 3-4 (policy, ref, reward, value) | 2 (policy, ref)
Reward model | Perlu train terpisah | Tidak perlu!
Stabilitas | Sering unstable | Stabil
Hyperparameters | Banyak (KL, clip, etc.) | Sedikit (β)
Kualitas | Sangat baik | Sama/lebih baik
VRAM (7B + QLoRA) | ~24+ GB | ~14 GB
Colab T4? | āŒ Sulit | āœ… Bisa dengan QLoRA!
Rekomendasi | Research, large-scale | Default choice ⭐
Aspect | PPO (RLHF) | DPO
Models in VRAM | 3-4 (policy, ref, reward, value) | 2 (policy, ref)
Reward model | Needs separate training | Not needed!
Stability | Often unstable | Stable
Hyperparameters | Many (KL, clip, etc.) | Few (β)
Quality | Very good | Same/better
VRAM (7B + QLoRA) | ~24+ GB | ~14 GB
Colab T4? | āŒ Difficult | āœ… Works with QLoRA!
Recommendation | Research, large-scale | Default choice ⭐
šŸ“š

7. TRL Library — Satu Library untuk Semua Alignment

7. TRL Library — One Library for All Alignment

SFTTrainer, RewardTrainer, PPOTrainer, DPOTrainer — semuanya dari TRL
SFTTrainer, RewardTrainer, PPOTrainer, DPOTrainer — all from TRL
67_trl_overview.py — TRL Library Overview (Python)
# pip install trl
# TRL = Transformer Reinforcement Learning
# By Hugging Face — the standard library for LLM alignment

from trl import (
    SFTTrainer,      # Supervised Fine-Tuning (Page 8)
    RewardTrainer,   # Train reward model
    PPOTrainer,      # RLHF with PPO
    DPOTrainer,      # Direct Preference Optimization ⭐
    ORPOTrainer,     # Odds Ratio Preference Optimization (newest)
    # KTOTrainer,    # Kahneman-Tversky Optimization
)

# Training pipeline:
# Step 1: SFTTrainer → instruction-following model (Page 8 āœ… done!)
# Step 2: DPOTrainer → aligned model (Page 9 this! ⭐)
# That's it! 2 steps = ChatGPT-like assistant.
šŸ”„

8. Proyek: DPO Fine-Tuning — Preference Dataset + QLoRA

8. Project: DPO Fine-Tuning — Preference Dataset + QLoRA

Align model dari Page 8 menggunakan human preferences — di Colab T4!
Align the model from Page 8 using human preferences — on Colab T4!
68_dpo_finetune.py — DPO with QLoRA šŸ”„šŸ”„šŸ”„ (Python)
#!/usr/bin/env python3
"""
šŸŽÆ DPO Fine-Tuning: Align LLM with Human Preferences
Uses QLoRA (Page 8) + DPO → ChatGPT-like alignment on Colab T4!
"""

# !pip install -q transformers datasets accelerate peft bitsandbytes trl

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset

# ═══════════════════════════════════════
# STEP 1: LOAD MODEL (4-bit, same as Page 8!)
# ═══════════════════════════════════════
model_name = "mistralai/Mistral-7B-v0.3"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16, bnb_4bit_use_double_quant=True)

model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Reference model = same base (DPO needs it for KL penalty)
# DPOTrainer automatically creates reference model internally!

# ═══════════════════════════════════════
# STEP 2: LOAD PREFERENCE DATASET
# ═══════════════════════════════════════
# Format: {"prompt": "...", "chosen": "good response", "rejected": "bad response"}
dataset = load_dataset("Intel/orca_dpo_pairs", split="train[:5000]")

# Inspect
print(dataset[0].keys())  # dict_keys(['system', 'question', 'chosen', 'rejected'])
print(f"Chosen:   {dataset[0]['chosen'][:100]}...")
print(f"Rejected: {dataset[0]['rejected'][:100]}...")

# Format for DPOTrainer
def format_dpo(example):
    return {
        "prompt": f"### Instruction:\n{example['question']}\n\n### Response:\n",
        "chosen": example["chosen"],
        "rejected": example["rejected"],
    }

dataset = dataset.map(format_dpo, remove_columns=dataset.column_names)  # keep only prompt/chosen/rejected

# ═══════════════════════════════════════
# STEP 3: CONFIGURE LORA + DPO
# ═══════════════════════════════════════
peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none", task_type="CAUSAL_LM")

dpo_config = DPOConfig(
    output_dir="./dpo-mistral",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,          # lower than SFT (more careful!)
    beta=0.1,                    # DPO temperature (0.1-0.5)
    # β controls how much to deviate from reference model
    # β=0.1 → stay close to reference (conservative)
    # β=0.5 → deviate more (aggressive optimization)
    fp16=True,
    gradient_checkpointing=True,
    optim="paged_adamw_8bit",
    max_length=512,
    max_prompt_length=256,
    logging_steps=25,
    save_steps=200,
    report_to="none",
)

# ═══════════════════════════════════════
# STEP 4: TRAIN WITH DPO!
# ═══════════════════════════════════════
trainer = DPOTrainer(
    model=model,
    args=dpo_config,
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
    # ref_model=None → auto-created from base model!
)

print("šŸŽÆ Training DPO...")
trainer.train()
trainer.save_model("./dpo-mistral-final")

# Test: model should now prefer helpful, safe responses!
print("\\nšŸ† DPO alignment complete!")

šŸŽÆ Anda Baru Saja Melakukan Alignment — Teknik yang Membuat ChatGPT!
Pipeline lengkap yang Anda sudah kuasai di seri ini:
• Page 8 (SFT): Base model → instruction-following model (QLoRA)
• Page 9 (DPO): SFT model → aligned, helpful, safe assistant (QLoRA + DPO)
Keduanya berjalan di Colab T4 gratis! Dengan model 7B parameters!
Ini exact same pipeline (simplified) yang dipakai untuk membuat ChatGPT, Claude, Gemini.

šŸŽÆ You Just Did Alignment — The Technique That Made ChatGPT!
Complete pipeline you've mastered in this series:
• Page 8 (SFT): Base model → instruction-following model (QLoRA)
• Page 9 (DPO): SFT model → aligned, helpful, safe assistant (QLoRA + DPO)
Both run on free Colab T4! With 7B parameter models!
This is the exact same pipeline (simplified) used to make ChatGPT, Claude, Gemini.

šŸ†•

9. ORPO & SimPO — Alternatif Terbaru

9. ORPO & SimPO — Newest Alternatives

Lebih baru dari DPO — gabungkan SFT + alignment dalam 1 step!
Newer than DPO — combine SFT + alignment in 1 step!
69_orpo.py — ORPO: SFT + DPO in One Step! (Python)
from trl import ORPOTrainer, ORPOConfig

# ===========================
# ORPO (Odds Ratio Preference Optimization, 2024)
# Combines SFT + DPO into ONE training step!
# → No need for separate SFT then DPO
# → Simpler pipeline, competitive results
# ===========================

# Same dataset format as DPO: prompt, chosen, rejected
# trainer = ORPOTrainer(
#     model=model, args=ORPOConfig(beta=0.1, ...),
#     train_dataset=dataset, peft_config=peft_config)
# trainer.train()

# Evolution of alignment methods:
# 2022: RLHF/PPO (OpenAI)    → complex, 3-4 models
# 2023: DPO (Stanford)        → simple, 2 models ← RECOMMENDED
# 2024: ORPO (KAIST)          → simpler, SFT+DPO in 1 step
# 2024: SimPO (Princeton)     → even simpler, no reference model
# Trend: semakin sederhana, semakin efisien, hasil tetap bagus!
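For intuition on what "SFT + DPO in one step" means, here is a rough numeric sketch of the ORPO objective as described in the paper (my reading, not TRL's exact implementation): the usual negative log-likelihood on the chosen response, plus an odds-ratio term that pushes the odds of the chosen response above the odds of the rejected one. All numbers are invented.

import torch
import torch.nn.functional as F

lam = 0.1  # weight of the odds-ratio term (plays the role of ORPOConfig beta)

# Made-up average per-token log-probs of the two full responses under the model
logp_chosen, logp_rejected = torch.tensor(-1.1), torch.tensor(-1.6)

def log_odds(logp):
    # odds(y|x) = P(y|x) / (1 - P(y|x)), here computed from the average token log-prob
    return logp - torch.log1p(-torch.exp(logp))

sft_loss = -logp_chosen  # standard SFT / NLL term on the chosen response
or_loss = -F.logsigmoid(log_odds(logp_chosen) - log_odds(logp_rejected))
orpo_loss = sft_loss + lam * or_loss
print(round(orpo_loss.item(), 4))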
šŸ›”ļø

10. Safety & Constitutional AI

10. Safety & Constitutional AI

Alignment bukan hanya kualitas — juga keamanan dan etika
Alignment isn't just quality — also safety and ethics

šŸ›”ļø Prinsip Alignment — HHH Framework (Anthropic):
Helpful: Memberikan jawaban yang berguna dan informatif. Tidak menolak tanpa alasan.
Harmless: Menolak membuat konten berbahaya (weapon instructions, hate speech, dll). Tapi tetap bisa diskusi edukatif tentang topik sensitif.
Honest: Mengakui ketidaktahuan ("Saya tidak tahu"), tidak halusinasi fakta palsu, membedakan opini dari fakta.

Constitutional AI (Anthropic): Alih-alih human feedback, model mengevaluasi responsnya sendiri berdasarkan "constitution" (prinsip/aturan). Self-critique → revise → improved response. Lebih scalable dari human annotation.

šŸ›”ļø Alignment Principles — HHH Framework (Anthropic):
Helpful: Provides useful and informative answers. Doesn't refuse without reason.
Harmless: Refuses to create harmful content (weapon instructions, hate speech, etc). But can still have educational discussions about sensitive topics.
Honest: Admits uncertainty ("I don't know"), doesn't hallucinate fake facts, distinguishes opinion from fact.

Constitutional AI (Anthropic): Instead of human feedback, the model evaluates its own responses based on a "constitution" (principles/rules). Self-critique → revise → improved response. More scalable than human annotation.
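As a toy sketch of that critique-and-revise loop (the model checkpoint, prompts, and one-line constitution are all illustrative assumptions, not Anthropic's actual setup):

from transformers import pipeline

# Any instruction-tuned chat model works for the sketch; this checkpoint is an assumption
generate = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta",
                    max_new_tokens=200, return_full_text=False)

constitution = "Responses must be helpful, harmless, and honest."
prompt = "Bagaimana cara membobol WiFi?"

# 1. Draft an initial response
draft = generate(f"User: {prompt}\nAssistant:")[0]["generated_text"]

# 2. Self-critique the draft against the constitution
critique = generate(
    f"Constitution: {constitution}\nResponse: {draft}\n"
    f"Critique the response above according to the constitution:")[0]["generated_text"]

# 3. Revise based on the critique — the revised answers become new training data
revision = generate(
    f"Constitution: {constitution}\nOriginal response: {draft}\nCritique: {critique}\n"
    f"Rewrite the response so it follows the constitution:")[0]["generated_text"]
print(revision)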

šŸ“

11. Ringkasan Page 9

11. Page 9 Summary

Konsep | Apa Itu | Kode Kunci
RLHF | RL dari feedback manusia | Reward model + PPO
DPO | Direct Preference Optimization | DPOTrainer(model, args, dataset)
Preference Data | prompt + chosen + rejected | load_dataset("Intel/orca_dpo_pairs")
β (beta) | Seberapa jauh dari reference | DPOConfig(beta=0.1)
TRL | Library untuk semua alignment | pip install trl
ORPO | SFT + DPO dalam 1 step | ORPOTrainer
HHH | Helpful, Harmless, Honest | Prinsip alignment
Concept | What It Is | Key Code
RLHF | RL from Human Feedback | Reward model + PPO
DPO | Direct Preference Optimization | DPOTrainer(model, args, dataset)
Preference Data | prompt + chosen + rejected | load_dataset("Intel/orca_dpo_pairs")
β (beta) | How far from reference | DPOConfig(beta=0.1)
TRL | Library for all alignment | pip install trl
ORPO | SFT + DPO in 1 step | ORPOTrainer
HHH | Helpful, Harmless, Honest | Alignment principles
← Page Sebelumnya / Previous Page

Page 8 — LoRA, QLoRA & Efficient Fine-Tuning

šŸ†

Coming Next: Page 10 — Capstone: End-to-End LLM Project

Grand finale! Gabungkan SEMUA dari 9 pages sebelumnya dalam satu proyek: base model → QLoRA SFT → DPO alignment → Gradio chatbot → deploy ke HF Spaces. Plus roadmap lanjutan dan career paths.

šŸ†

Coming Next: Page 10 — Capstone: End-to-End LLM Project

Grand finale! Combine EVERYTHING from the previous 9 pages in one project: base model → QLoRA SFT → DPO alignment → Gradio chatbot → deploy to HF Spaces. Plus advanced roadmap and career paths.