Table of Contents – Page 9
- Why SFT Isn't Enough – Problems with SFT-only models
- 3 Stages of ChatGPT Training – Pre-train → SFT → RLHF
- RLHF Explained – Complete concept, step by step
- Reward Modeling – Training a response-quality "judge"
- PPO – Proximal Policy Optimization for alignment
- DPO – Direct Preference Optimization (much easier!)
- TRL Library – One library for all alignment
- Project: DPO Fine-Tuning – Preference dataset + QLoRA
- ORPO & SimPO – Newer, more efficient alternatives
- Safety & Constitutional AI – Making models safe
- Summary & Page 10 Preview
1. Why SFT Isn't Enough – Models Can Be Harmful
In Pages 3 and 8 we did Supervised Fine-Tuning (SFT): teaching the model to answer instructions in the correct format. But SFT has a big problem: a model can produce answers that are correctly formatted yet harmful, inaccurate, or unhelpful.
2. 3 Stages of ChatGPT Training – The Complete Pipeline
3. RLHF Explained – The Complete Concept
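In its standard form (as in InstructGPT-style setups), RLHF maximizes expected reward from a learned reward model r while a KL term keeps the trained policy close to the frozen reference (SFT) model:

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\bigl[ r(x, y) \bigr]
\;-\;
\beta \, \mathbb{D}_{\mathrm{KL}}\!\bigl[ \pi_\theta(y \mid x) \,\big\|\, \pi_{\mathrm{ref}}(y \mid x) \bigr]
```

Here π_θ is the policy being trained, π_ref the frozen SFT model, and β weights the KL penalty: without it, the policy can drift into degenerate outputs that fool the reward model (reward hacking).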
4. Reward Modeling – Training a Quality "Judge"
```python
from trl import RewardTrainer, RewardConfig
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from datasets import load_dataset

# ===========================
# Reward Model = classifier: (prompt, response) -> score
# Trained on human preferences: score(chosen) > score(rejected)
# ===========================

# 1. Load preference dataset
# Format: {"prompt": "...", "chosen": "good response", "rejected": "bad response"}
dataset = load_dataset("Anthropic/hh-rlhf", split="train[:5000]")

# 2. Load base model as reward model
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=1)  # 1 output = scalar score
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# 3. Train: reward(chosen) > reward(rejected)
# Loss = -log(sigmoid(reward(chosen) - reward(rejected)))
# -> Bradley-Terry model of preferences
# trainer = RewardTrainer(
#     model=model, args=RewardConfig(...),
#     train_dataset=dataset, tokenizer=tokenizer)
# trainer.train()

# NOTE: In practice, skip the reward model and use DPO directly!
# DPO = implicit reward model (simpler, more stable)
```
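The pairwise loss in the comments above is simple enough to compute by hand. A minimal pure-Python sketch (the reward scores are made-up scalars standing in for reward-model outputs):

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise reward-model loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = r_chosen - r_rejected
    return math.log(1.0 + math.exp(-margin))  # identical to -log(sigmoid(margin))

# The loss shrinks as the reward model ranks the chosen response higher:
print(round(bradley_terry_loss(2.0, 1.0), 4))  # -> 0.3133 (correct ranking, small margin)
print(round(bradley_terry_loss(5.0, 1.0), 4))  # -> 0.0181 (correct ranking, large margin)
print(round(bradley_terry_loss(1.0, 2.0), 4))  # -> 1.3133 (wrong ranking -> penalized)
```

Driving this loss down forces the scalar head to assign higher scores to chosen responses, which is all the "judge" needs to do.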
5. PPO – Proximal Policy Optimization
```python
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead

# ===========================
# PPO for RLHF -- conceptual code
# ===========================
# PPO needs 3-4 models in VRAM simultaneously:
#   1. Policy model (LLM being trained)
#   2. Reference model (frozen copy -- for KL penalty)
#   3. Reward model (scores responses)
#   4. Value head (estimates future rewards)
# -> VERY expensive! Up to 4x the VRAM of a single model

# config = PPOConfig(
#     model_name="gpt2",
#     learning_rate=1e-5,
#     batch_size=16,
#     mini_batch_size=4,
#     ppo_epochs=4,      # PPO optimization epochs per batch
#     kl_penalty="kl",   # KL divergence from reference model
#     target_kl=0.1,     # target KL (stops if exceeded)
# )

# PPO training loop:
# for batch in dataloader:
#     1. Generate responses: policy_model(prompts) -> responses
#     2. Score responses: reward_model(prompts, responses) -> rewards
#     3. Compute KL penalty: KL(policy || reference) -> penalty
#     4. Adjusted reward = reward - beta * KL_penalty
#     5. PPO update: maximize adjusted reward

# PPO'S PROBLEMS:
#   - Needs 3-4 models in VRAM -> very expensive
#   - Unstable (reward hacking, mode collapse)
#   - Many hyperparameters (KL target, clip ratio, etc.)
#   - Hard to reproduce
# -> THE FIX: DPO! (Section 6)
```
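Step 4 of the loop above can be made concrete in a few lines. This sketch uses the common single-sample KL approximation (the log-probability ratio); the log-probs and β below are made-up illustrative numbers, not real model outputs:

```python
def kl_adjusted_reward(reward: float, logp_policy: float, logp_ref: float,
                       beta: float = 0.1) -> float:
    """Adjusted reward = reward - beta * KL penalty.

    Here KL is approximated per sample by log pi_policy(y|x) - log pi_ref(y|x).
    """
    kl = logp_policy - logp_ref
    return reward - beta * kl

# If the policy drifts far from the reference (large KL), the effective
# reward drops -- this is what discourages reward hacking:
print(kl_adjusted_reward(reward=1.0, logp_policy=-2.0, logp_ref=-2.5))  # small drift
print(kl_adjusted_reward(reward=1.0, logp_policy=-0.5, logp_ref=-2.5))  # large drift
```

The same raw reward thus yields a lower effective reward when the policy has moved further from the reference model.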
6. DPO – Direct Preference Optimization (a Game-Changer!)
DPO (Rafailov et al., 2023) is a breakthrough that drastically simplifies RLHF. Instead of training a separate reward model and then doing RL (PPO), DPO optimizes the LLM directly on preference data. The result: quality on par with RLHF, but much easier to run and far more stable.
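The DPO loss itself fits in a few lines of plain Python: it compares how much the policy favors the chosen over the rejected response, relative to the reference model. The sequence log-probabilities below are illustrative numbers, not real model outputs:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair (Rafailov et al., 2023):
    -log sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r)))."""
    chosen_ratio = logp_chosen - ref_logp_chosen      # implicit reward of chosen
    rejected_ratio = logp_rejected - ref_logp_rejected  # implicit reward of rejected
    logits = beta * (chosen_ratio - rejected_ratio)
    return math.log(1.0 + math.exp(-logits))  # = -log(sigmoid(logits))

# Policy favors the chosen response relative to the reference -> low loss:
print(round(dpo_loss(-10.0, -14.0, -12.0, -12.0), 4))  # -> 0.513
# Policy favors the rejected response instead -> high loss:
print(round(dpo_loss(-14.0, -10.0, -12.0, -12.0), 4))  # -> 0.913
```

Note that no reward model appears anywhere: the log-probability ratios act as an implicit reward, which is exactly why DPO needs only two models (policy and reference) in memory.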
| Aspect | PPO (RLHF) | DPO |
|---|---|---|
| Models in VRAM | 3-4 (policy, ref, reward, value) | 2 (policy, ref) |
| Reward model | Needs separate training | Not needed! |
| Stability | Often unstable | Stable |
| Hyperparameters | Many (KL, clip, etc.) | Few (β) |
| Quality | Very good | Same or better |
| VRAM (7B + QLoRA) | ~24+ GB | ~14 GB |
| Colab T4? | ❌ Difficult | ✅ Works with QLoRA! |
| Recommendation | Research, large-scale | Default choice ✅ |
7. TRL Library – One Library for All Alignment
```python
# pip install trl
# TRL = Transformer Reinforcement Learning
# By Hugging Face -- the standard library for LLM alignment

from trl import (
    SFTTrainer,     # Supervised Fine-Tuning (Page 8)
    RewardTrainer,  # Train a reward model
    PPOTrainer,     # RLHF with PPO
    DPOTrainer,     # Direct Preference Optimization (recommended)
    ORPOTrainer,    # Odds Ratio Preference Optimization (newer)
    # KTOTrainer,   # Kahneman-Tversky Optimization
)

# Training pipeline:
#   Step 1: SFTTrainer -> instruction-following model (Page 8 -- done!)
#   Step 2: DPOTrainer -> aligned model (Page 9 -- this one!)
# That's it! 2 steps = a ChatGPT-like assistant.
```
8. Project: DPO Fine-Tuning – Preference Dataset + QLoRA
```python
#!/usr/bin/env python3
"""
DPO Fine-Tuning: Align an LLM with Human Preferences
Uses QLoRA (Page 8) + DPO -- ChatGPT-like alignment on a Colab T4!
"""
# !pip install -q transformers datasets accelerate peft bitsandbytes trl

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset

# ======================================
# STEP 1: LOAD MODEL (4-bit, same as Page 8!)
# ======================================
model_name = "mistralai/Mistral-7B-v0.3"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
# Reference model = same base model (DPO needs it for the KL penalty).
# DPOTrainer creates the reference model internally!

# ======================================
# STEP 2: LOAD PREFERENCE DATASET
# ======================================
# Format: {"prompt": "...", "chosen": "good response", "rejected": "bad response"}
dataset = load_dataset("Intel/orca_dpo_pairs", split="train[:5000]")

# Inspect
print(dataset[0].keys())
# dict_keys(['system', 'question', 'chosen', 'rejected'])
print(f"Chosen: {dataset[0]['chosen'][:100]}...")
print(f"Rejected: {dataset[0]['rejected'][:100]}...")

# Format for DPOTrainer
def format_dpo(example):
    return {
        "prompt": f"### Instruction:\n{example['question']}\n\n### Response:\n",
        "chosen": example["chosen"],
        "rejected": example["rejected"],
    }

dataset = dataset.map(format_dpo)

# ======================================
# STEP 3: CONFIGURE LORA + DPO
# ======================================
peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none", task_type="CAUSAL_LM")

dpo_config = DPOConfig(
    output_dir="./dpo-mistral",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,  # lower than for SFT (be more careful!)
    beta=0.1,            # DPO temperature (typically 0.1-0.5)
    # beta controls how far the policy may deviate from the reference model:
    #   beta=0.1 -> stay close to the reference (conservative)
    #   beta=0.5 -> deviate more (aggressive optimization)
    fp16=True,
    gradient_checkpointing=True,
    optim="paged_adamw_8bit",
    max_length=512,
    max_prompt_length=256,
    logging_steps=25,
    save_steps=200,
    report_to="none",
)

# ======================================
# STEP 4: TRAIN WITH DPO!
# ======================================
trainer = DPOTrainer(
    model=model,
    args=dpo_config,
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
    # ref_model=None -> auto-created from the base model!
)

print("Training DPO...")
trainer.train()
trainer.save_model("./dpo-mistral-final")

# Test: the model should now prefer helpful, safe responses!
print("\nDPO alignment complete!")
```
You Just Did Alignment – the Technique That Made ChatGPT!
The complete pipeline you've mastered in this series:
• Page 8 (SFT): base model → instruction-following model (QLoRA)
• Page 9 (DPO): SFT model → aligned, helpful, safe assistant (QLoRA + DPO)
Both run on a free Colab T4, with 7B-parameter models!
This is the same pipeline (simplified) used to build ChatGPT, Claude, and Gemini.
9. ORPO & SimPO – Newer Alternatives
```python
from trl import ORPOTrainer, ORPOConfig

# ===========================
# ORPO (Odds Ratio Preference Optimization, 2024)
# Combines SFT + DPO into ONE training step!
#   -> No separate SFT run followed by DPO
#   -> Simpler pipeline, competitive results
# ===========================
# Same dataset format as DPO: prompt, chosen, rejected
# trainer = ORPOTrainer(
#     model=model, args=ORPOConfig(beta=0.1, ...),
#     train_dataset=dataset, peft_config=peft_config)
# trainer.train()

# Evolution of alignment methods:
#   2022: RLHF/PPO (OpenAI)  -> complex, 3-4 models
#   2023: DPO (Stanford)     -> simple, 2 models -- RECOMMENDED
#   2024: ORPO (KAIST)       -> simpler, SFT+DPO in 1 step
#   2024: SimPO (Princeton)  -> even simpler, no reference model
# Trend: ever simpler and more efficient, with results that stay strong!
```
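To make "no reference model" concrete, here is a minimal sketch of the SimPO loss: the implicit reward is a length-normalized sequence log-probability, with a target margin γ instead of a reference model. All inputs below are made-up illustrative numbers:

```python
import math

def simpo_loss(logp_chosen: float, len_chosen: int,
               logp_rejected: float, len_rejected: int,
               beta: float = 2.0, gamma: float = 0.5) -> float:
    """SimPO loss sketch: -log sigmoid(r_chosen - r_rejected - gamma),
    where r(y) = beta * log p(y|x) / |y| (length-normalized, no reference)."""
    r_chosen = beta * logp_chosen / len_chosen
    r_rejected = beta * logp_rejected / len_rejected
    logits = r_chosen - r_rejected - gamma
    return math.log(1.0 + math.exp(-logits))  # = -log(sigmoid(logits))

# Equal-length responses where the chosen one barely clears the margin:
print(round(simpo_loss(-20.0, 40, -30.0, 40), 4))  # -> 0.6931
```

Length normalization removes the usual bias toward longer responses, and dropping the reference model halves the memory footprint compared to DPO.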
10. Safety & Constitutional AI
Alignment Principles – the HHH Framework (Anthropic):
Helpful: gives useful, informative answers; doesn't refuse without reason.
Harmless: refuses to create harmful content (weapon instructions, hate speech, etc.), but can still discuss sensitive topics in an educational way.
Honest: admits uncertainty ("I don't know"), doesn't hallucinate fake facts, and distinguishes opinion from fact.
Constitutional AI (Anthropic): instead of human feedback, the model evaluates its own responses against a "constitution" (a set of principles/rules). Self-critique → revise → improved response. More scalable than human annotation.
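The self-critique → revise loop can be sketched as plain control flow. Here `generate` is a hypothetical stand-in for any chat-model call (not a real API), and the one-principle constitution is a simplified example:

```python
# Simplified example principle -- real constitutions contain many such rules.
CONSTITUTION = [
    "Choose the response that is most helpful while avoiding harmful content.",
]

def constitutional_revision(generate, prompt: str) -> str:
    """One round of Constitutional AI: generate, then critique and revise
    the response once per principle. `generate` is any str -> str model call."""
    response = generate(prompt)
    for principle in CONSTITUTION:
        critique = generate(
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}\n"
            "Critique the response against the principle.")
        response = generate(
            f"Rewrite the response to address this critique:\n{critique}\n"
            f"Original response: {response}")
    return response  # revised responses become SFT / preference training data

# Dummy generator just to show the control flow (2 extra calls per principle):
print(constitutional_revision(lambda p: f"[model output for: {p[:30]}...]", "Explain RLHF"))
```

In the full recipe the revised responses are used as training data (and AI-generated preference labels replace human ones, "RLAIF"), which is what makes the approach more scalable than human annotation.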
11. Page 9 Summary
| Concept | What It Is | Key Code |
|---|---|---|
| RLHF | RL from Human Feedback | Reward model + PPO |
| DPO | Direct Preference Optimization | DPOTrainer(model, args, dataset) |
| Preference data | prompt + chosen + rejected | load_dataset("Intel/orca_dpo_pairs") |
| β (beta) | How far to deviate from the reference | DPOConfig(beta=0.1) |
| TRL | Library for all alignment methods | pip install trl |
| ORPO | SFT + DPO in 1 step | ORPOTrainer |
| HHH | Helpful, Harmless, Honest | Alignment principles |
Previous: Page 8 – LoRA, QLoRA & Efficient Fine-Tuning
Coming Next: Page 10 – Capstone: End-to-End LLM Project
Grand finale! Combine EVERYTHING from the previous 9 pages in one project: base model → QLoRA SFT → DPO alignment → Gradio chatbot → deploy to HF Spaces. Plus an advanced-learning roadmap and career paths.