šŸ“ Artikel ini ditulis dalam Bahasa Indonesia & English
šŸ“ This article is available in English & Bahasa Indonesia

⚔ Learn Hugging Face — Page 8

LoRA, QLoRA &
Efficient Fine-Tuning

Fine-tune LLaMA 7B on free Colab — previously impossible! Page 8 covers in super detail: why full fine-tuning 7B+ models is impossible on consumer GPUs (VRAM calculations), PEFT (Parameter-Efficient Fine-Tuning) — train only 0.1-1% of parameters, LoRA (Low-Rank Adaptation) — the math behind it and why it works, LoRA implementation with PEFT library, QLoRA — 4-bit quantization + LoRA (game-changer!), bitsandbytes library for quantization, complete project: fine-tune Mistral-7B on custom instruction dataset on free Colab T4, merging LoRA adapters into base model for deployment, comparison: full fine-tune vs LoRA vs QLoRA vs prefix tuning, and production LoRA tips.

šŸ“… March 2026 · ā± 42 min read
šŸ· LoRAQLoRAPEFT4-bitbitsandbytesLLaMAMistralEfficient
šŸ“š Learn Hugging Face Series

šŸ“‘ Table of Contents — Page 8

  1. The VRAM Problem — Why full fine-tuning 7B is impossible on Colab
  2. PEFT — Train 0.1% parameters, get 95%+ of full performance
  3. LoRA Explained — Math & intuition behind it
  4. LoRA with PEFT Library — Practical implementation
  5. Quantization — 4-bit & 8-bit model loading
  6. QLoRA — 4-bit base + LoRA adapters = game-changer
  7. Where to Run? — VRAM for QLoRA vs LoRA vs Full
  8. Project: Fine-Tune LLM with QLoRA — Complete pipeline
  9. Merge LoRA into Base Model — For deployment
  10. Method Comparison — Full vs LoRA vs QLoRA vs Prefix
  11. Summary & Page 9 Preview
šŸ’„

1. The VRAM Problem — Why Full Fine-Tuning 7B Is Impossible on Colab

The math: a 7B model needs ~120GB VRAM. Colab T4 = 16GB. A 7.5Ɨ shortfall!

VRAM Breakdown — Why a 7B Model Needs ~120GB for Full Fine-Tuning

Model: Mistral-7B (7.24 BILLION parameters)

1. Model weights: 7.24B params Ɨ 2 bytes (FP16) = ~14.5 GB
2. Gradients (same size as the weights): 7.24B Ɨ 2 bytes = ~14.5 GB
3. Optimizer states (AdamW: 2 states per parameter!): 7.24B Ɨ 2 Ɨ 4 bytes (FP32) = ~58 GB (momentum + variance, kept in FP32 for numerical stability)
4. Activations (depends on batch size & sequence length): ~20-40 GB (batch=1, seq=512)

TOTAL: ~107-127 GB VRAM for full fine-tuning!

Available GPUs:
• Colab T4: 16 GB → āŒ nowhere near enough (needs ~8Ɨ more!)
• RTX 4090: 24 GB → āŒ still not enough
• A100 40GB: 40 GB → āŒ still short
• A100 80GB: 80 GB → āš ļø tight (batch=1, short sequences)
• 2Ɨ A100 80GB: 160 GB → āœ… finally comfortable

The solution: LoRA + quantization → fine-tune 7B on a 16GB Colab T4! šŸŽ‰

With QLoRA:
• Model: 7.24B Ɨ 0.5 bytes (4-bit) = ~3.6 GB (not 14.5!)
• LoRA adapters: ~10M params Ɨ 2 bytes = ~0.02 GB
• Gradients (LoRA only): ~10M Ɨ 2 bytes = ~0.02 GB
• Optimizer (LoRA only): ~10M Ɨ 8 bytes = ~0.08 GB
• Activations: ~8 GB (with gradient checkpointing)

QLoRA TOTAL: ~12 GB → fits the Colab T4 (16GB) āœ…


šŸŽ“ Why Are Optimizer States So Large?
AdamW (the default optimizer) stores 2 states per parameter: momentum (running mean of gradients) and variance (running mean of squared gradients). Both are stored in FP32 (not FP16) for numerical stability. So the optimizer needs 2 Ɨ 4 = 8 bytes per parameter — 4Ɨ the size of the FP16 model!
This is the "silent VRAM killer". A 14.5GB FP16 model needs ~58GB just for the optimizer.
LoRA's solution: optimizer only for LoRA parameters (~10M, not 7B) → optimizer is just ~80MB.
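The arithmetic above can be sketched in a few lines of plain Python (a rough back-of-the-envelope estimate; activations and framework overhead are deliberately ignored):

```python
def full_finetune_vram_gb(params_b: float) -> dict:
    """Rough VRAM estimate for full fine-tuning with AdamW in FP16.

    params_b: parameter count in billions.
    """
    p = params_b * 1e9
    weights   = p * 2 / 1e9          # FP16 weights: 2 bytes/param
    gradients = p * 2 / 1e9          # FP16 gradients: 2 bytes/param
    optimizer = p * 2 * 4 / 1e9      # AdamW: 2 FP32 states = 8 bytes/param
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total_no_activations": weights + gradients + optimizer,
    }

est = full_finetune_vram_gb(7.24)  # Mistral-7B
print({k: round(v, 1) for k, v in est.items()})
# → {'weights': 14.5, 'gradients': 14.5, 'optimizer': 57.9, 'total_no_activations': 86.9}
# Add ~20-40 GB of activations on top and you land in the ~107-127 GB range.
```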

šŸŽÆ

2. PEFT — Train 0.1% Parameters, Get 95%+ Performance

Parameter-Efficient Fine-Tuning: freeze base model, train only small adapters

PEFT = a family of techniques that enable fine-tuning large models by training only a small fraction of parameters. The base model is frozen (unchanged), then small adapters are added and trained. Result: 95-99% of full fine-tuning performance, but with 10-100Ɨ less VRAM and training time.

PEFT Method | Params Trained | How It Works | Perf vs Full
LoRA | 0.1-1% | Low-rank matrices in attention layers | ~97% ⭐
QLoRA | 0.1-1% | LoRA + 4-bit quantized base | ~96% ⭐⭐
Prefix Tuning | 0.1% | Learnable prefix tokens | ~90%
Prompt Tuning | 0.01% | Learnable soft prompts | ~85%
IA3 | 0.01% | Learned rescaling vectors | ~93%
Adapters | 1-5% | MLP layers in each block | ~95%
Full Fine-Tuning | 100% | Update all weights | 100% (baseline)
🧮

3. LoRA Explained — Math & Intuition

Low-Rank Adaptation: decompose weight updates into 2 small matrices
LoRA — The Core Idea (from the paper "LoRA" by Hu et al., 2021)

Full fine-tuning:
  W_new = W_original + Ī”W
  W_original: (4096 Ɨ 4096) = 16.7 MILLION parameters ← FROZEN
  Ī”W: (4096 Ɨ 4096) = 16.7 MILLION parameters ← must be trained!
  Total trainable: 16.7M per layer

LoRA fine-tuning:
  W_new = W_original + A Ɨ B
  W_original: (4096 Ɨ 4096) ← FROZEN (untouched!)
  A: (4096 Ɨ r), where r = rank (typically 8-64)
  B: (r Ɨ 4096)
  A Ɨ B: (4096 Ɨ 4096) ← same shape as Ī”W, but built from 2 SMALL matrices!
  (The paper writes the update as B Ɨ A with B ∈ R^{dƗr}, A ∈ R^{rƗd} — the same idea, just the opposite naming of the two factors.)

If r = 16:
  A: 4096 Ɨ 16 = 65,536 params
  B: 16 Ɨ 4096 = 65,536 params
  Total trainable per layer: 131,072 (not 16.7M!)
  Reduction: 16.7M → 131K = 128Ɨ fewer parameters!

Why does this work? Research shows that the weight updates (Ī”W) during fine-tuning have a LOW "intrinsic rank" — meaning Ī”W can be well approximated by the product of 2 small matrices. The analogy: fine-tuning doesn't change "everything" in the model — it only makes small adjustments that can be represented compactly.
60_lora_math.py — LoRA Mathematics Visualized
import torch

# ===========================
# LoRA: W_new = W_frozen + B @ A
# ===========================
d = 4096  # hidden dimension (e.g., LLaMA 7B)
r = 16    # LoRA rank (hyperparameter)

# Original weight (FROZEN — no gradient!)
W_frozen = torch.randn(d, d, requires_grad=False)

# LoRA adapters (TRAINABLE — small!)
A = torch.randn(d, r, requires_grad=True)   # "down projection"
B = torch.zeros(r, d, requires_grad=True)   # "up projection" (init zeros!)
# B initialized to ZEROS → at start, A@B = 0 → model unchanged!
# This is important: training STARTS from original model behavior.

# Forward pass
x = torch.randn(1, d)  # input
output = x @ W_frozen + x @ A @ B  # original + LoRA delta
#         ↑ frozen       ↑ trainable (tiny!)

# Parameter count comparison
full_params = d * d                  # 16,777,216
lora_params = d * r + r * d          # 131,072
reduction = full_params / lora_params
print(f"Full:  {full_params:>12,} params per weight matrix")
print(f"LoRA:  {lora_params:>12,} params (r={r})")
print(f"Reduction: {reduction:.0f}Ɨ fewer parameters!")
# Full:    16,777,216 params
# LoRA:       131,072 params (r=16)
# Reduction: 128Ɨ fewer parameters!

# For entire Mistral-7B (applied to Q, K, V, O projections):
# Full trainable: ~7.24B parameters
# LoRA trainable: ~10M parameters (0.14% of total!)
# Saved model size: full=14.5GB vs LoRA adapter=~40MB


šŸŽ“ Important LoRA Hyperparameters:
r (rank): LoRA adapter dimension. Default=8-16. Larger = more expressive but more params. r=16 is enough for most tasks. r=64 for complex tasks.
alpha: Scaling factor. Default = r or 2Ɨr. LoRA output is scaled by alpha/r. Alpha=16 with r=16 → scale=1.0. Alpha=32 with r=16 → scale=2.0 (more aggressive).
target_modules: Which layers get LoRA. Default: q_proj, v_proj (attention). Can add: k_proj, o_proj, gate_proj, up_proj, down_proj for better performance.
dropout: LoRA dropout, default=0.05-0.1. Prevents overfitting on small datasets.
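The alpha/r scaling can be seen in a toy forward pass (a simplified sketch of the idea, not peft's internal code; the shapes and init values here are illustrative):

```python
import torch

d, r = 64, 16
lora_alpha = 32
scaling = lora_alpha / r          # peft scales the LoRA delta by alpha/r

A = torch.randn(d, r) * 0.01      # down projection (small random init)
B = torch.zeros(r, d)             # up projection (zeros → delta starts at 0)
x = torch.randn(1, d)

delta = (x @ A @ B) * scaling     # the adapter's contribution to the output
print(scaling)                    # 2.0 → alpha=32 with r=16 doubles the adapter's effect
print(delta.abs().max().item())   # 0.0 at initialization — model starts unchanged
```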

šŸ”§

4. LoRA with PEFT Library — Practical Implementation

pip install peft → add LoRA to any model in 5 lines
61_lora_peft.py — LoRA with PEFT Library
# pip install peft
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer

# ===========================
# 1. Load base model
# ===========================
model_name = "mistralai/Mistral-7B-v0.3"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(f"Base model params: {model.num_parameters():,}")
# Base model params: 7,241,732,096 (7.24B!)

# ===========================
# 2. Configure LoRA
# ===========================
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # or SEQ_CLS, TOKEN_CLS, SEQ_2_SEQ_LM
    r=16,                         # rank (8-64, sweet spot: 16)
    lora_alpha=32,                # scaling (usually 2Ɨr)
    lora_dropout=0.05,            # dropout
    target_modules=[              # which layers get LoRA
        "q_proj", "k_proj",        # attention query & key
        "v_proj", "o_proj",        # attention value & output
        # "gate_proj", "up_proj", "down_proj",  # MLP layers (optional, more params)
    ],
    bias="none",                  # don't train biases
)

# ===========================
# 3. Apply LoRA to model
# ===========================
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13,631,488 || all params: 7,255,363,584 || trainable%: 0.188%
# → Only 13.6M trainable out of 7.24B! (0.188%)
# → Saved adapter size: ~55MB (not 14.5GB!)

# ===========================
# 4. Train with standard Trainer (IDENTICAL to Page 2!)
# ===========================
# trainer = Trainer(model=model, args=args, ...)
# trainer.train()  # ← same API! PEFT is transparent to Trainer.

# ===========================
# 5. Save LoRA adapter (tiny!)
# ===========================
model.save_pretrained("./lora-adapter")
# Saves ONLY the LoRA weights (~55MB)
# NOT the full 14.5GB base model!

# ===========================
# 6. Load LoRA adapter later
# ===========================
from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained(model_name)
model_with_lora = PeftModel.from_pretrained(base, "./lora-adapter")
# → base model + LoRA adapter = fine-tuned model!
šŸ“¦

5. Quantization — 4-bit & 8-bit Model Loading

Compress a 14.5GB model → 3.6GB (4-bit) with minimal quality loss
62_quantization.py — BitsAndBytes Quantization
# pip install bitsandbytes
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# ===========================
# 1. 4-bit quantization config (QLoRA standard!)
# ===========================
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # load model in 4-bit!
    bnb_4bit_quant_type="nf4",            # NormalFloat4 (best for LLMs)
    bnb_4bit_compute_dtype=torch.float16, # compute in FP16 (speed)
    bnb_4bit_use_double_quant=True,       # quantize the quantization constants!
    # → extra 0.4GB savings for 7B model
)

# ===========================
# 2. Load 7B model in 4-bit (~3.6GB instead of 14.5GB!)
# ===========================
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    quantization_config=bnb_config,
    device_map="auto",  # auto-place layers across GPU/CPU
)

print(f"Model loaded! Memory: {model.get_memory_footprint()/1e9:.1f} GB")
# Model loaded! Memory: 3.6 GB (not 14.5GB!)
# → Fits on Colab T4 (16GB)! ~12GB left over for training.

# ===========================
# 3. Comparison: precision vs VRAM
# ===========================
# FP32 (32-bit): 7B Ɨ 4 bytes  = ~29 GB   ← does NOT fit on an A100 40GB
# FP16 (16-bit): 7B Ɨ 2 bytes  = ~14.5 GB ← fits on an A100, not a T4
# INT8 (8-bit):  7B Ɨ 1 byte   = ~7.2 GB  ← fits on a T4, tight
# NF4 (4-bit):   7B Ɨ 0.5 byte = ~3.6 GB  ← fits COMFORTABLY on a T4! ⭐
Precision | Bytes/Param | 7B Model Size | Quality Loss | Colab T4?
FP32 | 4 | ~29 GB | 0% (baseline) | āŒ
FP16/BF16 | 2 | ~14.5 GB | ~0% | āš ļø Inference only
INT8 | 1 | ~7.2 GB | ~0.5% | āš ļø Tight + LoRA
NF4 (4-bit) | 0.5 | ~3.6 GB | ~1-2% | āœ… Comfortable + QLoRA ⭐
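The core trick behind blockwise quantization can be illustrated in plain torch. This is a toy absmax scheme, NOT bitsandbytes' actual NF4 (which uses a fixed 16-value codebook fitted to a normal weight distribution), but it shows why 4 bits per weight plus one scale per block preserves most of the information:

```python
import torch

def quantize_4bit_absmax(w: torch.Tensor, block: int = 64):
    """Toy blockwise 4-bit absmax quantization (illustration only)."""
    flat = w.flatten().reshape(-1, block)
    scale = flat.abs().max(dim=1, keepdim=True).values   # one FP scale per block
    q = torch.clamp((flat / scale * 7).round(), -8, 7)   # 16 levels → 4 bits
    return q.to(torch.int8), scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (q.float() / 7 * scale).flatten()

w = torch.randn(4096)                 # pretend this is one weight row
q, scale = quantize_4bit_absmax(w)
w_hat = dequantize(q, scale)
err = (w - w_hat).abs().mean().item()
print(f"mean abs error: {err:.4f}")   # small — most of the signal survives 4 bits
```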
⚔

6. QLoRA — 4-bit Base + LoRA = Game-Changer

Combine quantization (Section 5) + LoRA (Section 3-4) = fine-tune 7B on Colab!
QLoRA = Quantization + LoRA — Best of Both Worlds

Base model: 4-bit quantized (NF4)
  7.24B params Ɨ 0.5 bytes = 3.6 GB (FROZEN!)

LoRA adapters: FP16 (trainable)
  ~13M params Ɨ 2 bytes = ~0.05 GB (TRAINED!)
  Applied to: q_proj, k_proj, v_proj, o_proj

Compute: FP16
  4-bit weights → dequantized to FP16 → forward pass
  LoRA: FP16 forward + backward (gradients)
  Optimizer: only for LoRA params → tiny!

VRAM breakdown (Mistral-7B, QLoRA, batch=1, seq=512):
• 4-bit model: ~3.6 GB
• LoRA params: ~0.05 GB
• LoRA optimizer: ~0.08 GB
• Gradients: ~0.05 GB
• Activations: ~6-8 GB (with gradient checkpointing)
Total: ~10-12 GB → fits the Colab T4 (16GB) āœ…
šŸ’»

7. Where to Run? — VRAM Comparison

Model | Full Fine-Tune | LoRA (FP16) | QLoRA (4-bit) | Colab T4 (16GB)?
LLaMA-3.2 1B | ~16 GB | ~6 GB | ~4 GB | āœ… QLoRA comfortable
Gemma-2 2B | ~24 GB | ~8 GB | ~5 GB | āœ… QLoRA comfortable
Mistral 7B | ~120 GB | ~20 GB | ~12 GB | āœ… QLoRA ⭐
LLaMA-3.1 8B | ~130 GB | ~22 GB | ~13 GB | āš ļø QLoRA tight
LLaMA-3.1 70B | ~1 TB+ | ~160 GB | ~40 GB | āŒ Needs A100 80GB

šŸŽ‰ This Page 8 uses Mistral-7B + QLoRA on free Colab T4! Full fine-tuning Mistral needs ~120GB (8Ɨ A100). With QLoRA: only ~12GB. Colab T4 = 16GB → ENOUGH! From "impossible" to "free".

šŸ”„

8. Project: Fine-Tune LLM with QLoRA — Complete Pipeline

Fine-tune Mistral-7B on an instruction dataset — on a free Colab T4!
63_qlora_finetune.py — Complete QLoRA Fine-Tuning šŸ”„šŸ”„šŸ”„
#!/usr/bin/env python3
"""
šŸ”„ Fine-Tune Mistral-7B with QLoRA on Google Colab T4 (free!)
From "impossible" (120GB VRAM) to "free" (12GB VRAM) with QLoRA.
"""

# ═══════════════════════════════════════
# STEP 0: INSTALL (run in Colab!)
# ═══════════════════════════════════════
# !pip install -q transformers datasets accelerate peft bitsandbytes trl

import torch
from transformers import (
    AutoModelForCausalLM, AutoTokenizer,
    BitsAndBytesConfig, TrainingArguments
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
from trl import SFTTrainer  # Supervised Fine-Tuning Trainer

# ═══════════════════════════════════════
# STEP 1: LOAD MODEL IN 4-BIT
# ═══════════════════════════════════════
model_name = "mistralai/Mistral-7B-v0.3"
# Alternatives: "meta-llama/Llama-3.2-1B" (smaller, easier)
#               "Qwen/Qwen2.5-7B" (multilingual)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config,
    device_map="auto", trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

print(f"Model VRAM: {model.get_memory_footprint()/1e9:.1f} GB")
# Model VRAM: 3.6 GB ← 4-bit! (was 14.5 GB in FP16)

# ═══════════════════════════════════════
# STEP 2: PREPARE MODEL FOR TRAINING
# ═══════════════════════════════════════
model = prepare_model_for_kbit_training(model)
# Enables gradient checkpointing + casts to correct dtypes

# ═══════════════════════════════════════
# STEP 3: ADD LORA ADAPTERS
# ═══════════════════════════════════════
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13,631,488 || all params: 7,255,363,584 || trainable%: 0.19%

# ═══════════════════════════════════════
# STEP 4: LOAD INSTRUCTION DATASET
# ═══════════════════════════════════════
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
# 15k instruction-response pairs (open-source!)

def format_instruction(example):
    if example.get("context") and example["context"].strip():
        text = f"""### Instruction:\n{example['instruction']}\n\n### Context:\n{example['context']}\n\n### Response:\n{example['response']}{tokenizer.eos_token}"""
    else:
        text = f"""### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}{tokenizer.eos_token}"""
    return {"text": text}

dataset = dataset.map(format_instruction)
print(f"Training examples: {len(dataset)}")
print(f"Sample:\n{dataset[0]['text'][:300]}...")

# ═══════════════════════════════════════
# STEP 5: TRAINING ARGUMENTS
# ═══════════════════════════════════════
args = TrainingArguments(
    output_dir="./qlora-mistral",
    num_train_epochs=1,                  # 1 epoch for 15k examples → enough!
    per_device_train_batch_size=2,       # small batch (4-bit model is large)
    gradient_accumulation_steps=8,       # effective batch = 2 Ɨ 8 = 16
    learning_rate=2e-4,                  # higher LR for LoRA (2e-4, not 2e-5!)
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    fp16=True,
    logging_steps=25,
    save_strategy="steps",
    save_steps=200,
    save_total_limit=2,
    optim="paged_adamw_8bit",           # 8-bit optimizer → less VRAM!
    gradient_checkpointing=True,        # saves ~30% VRAM (slower training)
    max_grad_norm=0.3,                  # gradient clipping
    report_to="none",
)

# ═══════════════════════════════════════
# STEP 6: TRAIN WITH SFTTrainer (from TRL library)
# ═══════════════════════════════════════
trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    dataset_text_field="text",          # column with formatted text
    max_seq_length=512,                 # max token length
    packing=True,                       # pack short examples into one sequence!
    # packing = True → 2-3Ɨ faster training (less padding waste)
)

print("šŸ‹ļø Training QLoRA...")
print(f"VRAM used: {torch.cuda.memory_allocated()/1e9:.1f} GB")
trainer.train()

# ═══════════════════════════════════════
# STEP 7: SAVE & TEST
# ═══════════════════════════════════════
trainer.save_model("./qlora-mistral-final")

# Test!
prompt = "### Instruction:\nWhat is machine learning?\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=200,
        temperature=0.7, top_p=0.9, repetition_penalty=1.2)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print("\nšŸ† QLoRA fine-tuning complete!")


šŸ”„ You Just Fine-Tuned a 7 BILLION Parameter Model on Free Colab!
Compare what was previously impossible:
• Full fine-tune Mistral-7B: needs ~120GB VRAM (8Ɨ A100 = ~$256/hr) āŒ
• QLoRA Mistral-7B: needs ~12GB VRAM (1Ɨ Colab T4 = $0) āœ…
• Performance: ~96% of full fine-tuning!
• Training time: ~30-60 minutes for 15k examples
• Saved adapter: ~55MB (not 14.5GB!)
This is the revolution that makes LLM fine-tuning accessible to everyone.
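A quick sanity check on the training length, using the hyperparameters above (a sketch; with packing=True the real step count is lower, since short examples get merged into one sequence):

```python
# Rough optimizer-step count for the run above (ignoring packing)
num_examples = 15_000               # databricks-dolly-15k
per_device_batch = 2
grad_accum = 8
effective_batch = per_device_batch * grad_accum   # examples per optimizer step
epochs = 1
steps = (num_examples * epochs) // effective_batch
print(effective_batch, steps)   # 16 937
```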

šŸ”—

9. Merge LoRA into Base Model — For Deployment

Merge adapter into base model → one complete model without LoRA overhead
64_merge_lora.py — Merge & Deploy
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# ===========================
# 1. Load base model (FP16, NOT 4-bit!) + LoRA adapter
# ===========================
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    torch_dtype=torch.float16,
    device_map="auto",
)
# Load LoRA adapter on top
model = PeftModel.from_pretrained(base_model, "./qlora-mistral-final")

# ===========================
# 2. Merge LoRA into base weights
# ===========================
model = model.merge_and_unload()
# → W_new = W_base + B @ A (permanently merged!)
# → No more LoRA overhead at inference
# → Same speed as original model

# ===========================
# 3. Save merged model
# ===========================
model.save_pretrained("./mistral-7b-finetuned-merged")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")
tokenizer.save_pretrained("./mistral-7b-finetuned-merged")
# → Full 14.5GB model with LoRA baked in
# → Deploy like any normal model (vLLM, TGI, Transformers pipeline, etc.)

# ===========================
# 4. Push to Hub!
# ===========================
# model.push_to_hub("username/mistral-7b-instruction-tuned")
# tokenizer.push_to_hub("username/mistral-7b-instruction-tuned")

# ===========================
# When to merge vs keep separate?
# ===========================
# MERGE: for production deployment (no PEFT dependency needed)
# KEEP SEPARATE: when you have multiple LoRA adapters for different tasks
#   e.g., base_model + lora_sentiment, base_model + lora_translation
#   → swap adapters at runtime without loading model twice!
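Numerically, merge_and_unload just folds the scaled adapter product into the base weight. A toy check in plain torch (small hypothetical shapes, not the real 7B weights) shows that merging is exact, not an approximation:

```python
import torch

d, r = 256, 8
alpha = 16
scaling = alpha / r            # same alpha/r scaling peft applies

W = torch.randn(d, d)          # frozen base weight
A = torch.randn(d, r) * 0.02   # LoRA down projection
B = torch.randn(r, d) * 0.02   # LoRA up projection
x = torch.randn(4, d)

# Adapter kept separate: extra matmuls per layer at inference time
y_separate = x @ W + (x @ A @ B) * scaling

# Adapter merged: one matmul, identical output
W_merged = W + (A @ B) * scaling
y_merged = x @ W_merged

print(torch.allclose(y_separate, y_merged, atol=1e-4))  # True — merging is exact
```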
āš–ļø

10. Method Comparison — Full vs LoRA vs QLoRA

Aspect | Full Fine-Tune | LoRA (FP16) | QLoRA (4-bit)
Params trained | 100% | 0.1-1% | 0.1-1%
VRAM (7B) | ~120 GB | ~20 GB | ~12 GB
Performance | 100% (baseline) | ~97% | ~96%
Training speed | Slow | ~2Ɨ faster | ~3Ɨ faster
Saved size | 14.5 GB | ~55 MB adapter | ~55 MB adapter
Minimum GPU | 2-4Ɨ A100 80GB | 1Ɨ A100 40GB | 1Ɨ T4 16GB
Cost (Colab) | Impossible | Colab Pro | Colab FREE!
Multi-task | 1 model per task | Swap adapters! | Swap adapters!
When to use | Unlimited budget | Good GPU available | Default choice ⭐
šŸ“

11. Page 8 Summary

Concept | What It Is | Key Code
PEFT | Train 0.1% of params | pip install peft
LoRA | W + BƗA (low-rank adapters) | LoraConfig(r=16, lora_alpha=32)
QLoRA | 4-bit base + LoRA | BitsAndBytesConfig(load_in_4bit=True)
4-bit Quant | 14.5GB → 3.6GB | bnb_4bit_quant_type="nf4"
get_peft_model | Add LoRA to any model | get_peft_model(model, config)
SFTTrainer | Instruction fine-tuning | SFTTrainer(model, args, dataset)
Merge | Bake LoRA into base | model.merge_and_unload()
Paged AdamW 8-bit | Memory-efficient optimizer | optim="paged_adamw_8bit"
← Previous Page

Page 7 — Spaces, Gradio & Demo Apps

šŸ“˜

Coming Next: Page 9 — RLHF & Alignment

From fine-tuned model to helpful & safe assistant! Page 9 covers: Reinforcement Learning from Human Feedback (RLHF), DPO (Direct Preference Optimization) — easier than PPO, reward modeling, TRL library for RLHF training, and alignment techniques that turned GPT-3 into ChatGPT.