šŸ“ Artikel ini ditulis dalam Bahasa Indonesia & English
šŸ“ This article is available in English & Bahasa Indonesia

⚔ Learn Hugging Face — Page 8

LoRA, QLoRA &
Efficient Fine-Tuning

Fine-tune LLaMA 7B on free Colab — previously impossible! Page 8 covers in super detail: why full fine-tuning 7B+ models is impossible on consumer GPUs (VRAM calculations), PEFT (Parameter-Efficient Fine-Tuning) — train only 0.1-1% of parameters, LoRA (Low-Rank Adaptation) — the math behind it and why it works, LoRA implementation with PEFT library, QLoRA — 4-bit quantization + LoRA (game-changer!), bitsandbytes library for quantization, complete project: fine-tune Mistral-7B on custom instruction dataset on free Colab T4, merging LoRA adapters into base model for deployment, comparison: full fine-tune vs LoRA vs QLoRA vs prefix tuning, and production LoRA tips.

šŸ“… March 2026 · ā± 42 min read
šŸ· LoRAQLoRAPEFT4-bitbitsandbytesLLaMAMistralEfficient
šŸ“š Learn Hugging Face Series

šŸ“‘ Table of Contents — Page 8

  1. The VRAM Problem — Why full fine-tuning 7B is impossible on Colab
  2. PEFT — Train 0.1% parameters, get 95%+ of full performance
  3. LoRA Explained — Math & intuition behind it
  4. LoRA with PEFT Library — Practical implementation
  5. Quantization — 4-bit & 8-bit model loading
  6. QLoRA — 4-bit base + LoRA adapters = game-changer
  7. Where to Run? — VRAM for QLoRA vs LoRA vs Full
  8. Project: Fine-Tune LLM with QLoRA — Complete pipeline
  9. Merge LoRA into Base Model — For deployment
  10. Method Comparison — Full vs LoRA vs QLoRA vs Prefix
  11. Summary & Page 9 Preview
šŸ’„

1. The VRAM Problem — Why Full Fine-Tuning 7B Is Impossible on Colab

The math: a 7B model needs ~120GB VRAM. Colab T4 = 16GB. A 7.5Ɨ shortfall!

VRAM Breakdown — Why a 7B Model Needs ~120GB for Full Fine-Tuning

Model: Mistral-7B (7.24 BILLION parameters)

1. Model weights: 7.24B params Ɨ 2 bytes (FP16) = ~14.5 GB
2. Gradients (same size as the weights): 7.24B Ɨ 2 bytes = ~14.5 GB
3. Optimizer states (AdamW: 2 states per parameter!): 7.24B Ɨ 2 Ɨ 4 bytes (FP32) = ~58 GB (momentum + variance, kept in FP32 for numerical stability)
4. Activations (depends on batch size & sequence length): ~20-40 GB (batch=1, seq=512)

TOTAL: ~107-127 GB VRAM for full fine-tuning!

Available GPUs:
• Colab T4: 16 GB → āŒ nowhere near enough (needs ~8Ɨ more!)
• RTX 4090: 24 GB → āŒ still not enough
• A100 40GB: 40 GB → āŒ still short
• A100 80GB: 80 GB → āš ļø tight (batch=1, short sequences)
• 2Ɨ A100 80GB: 160 GB → āœ… finally comfortable

The solution: LoRA + quantization → fine-tune 7B on a 16GB Colab T4! šŸŽ‰

With QLoRA:
• Model: 7.24B Ɨ 0.5 bytes (4-bit) = ~3.6 GB (not 14.5!)
• LoRA adapters: ~10M params Ɨ 2 bytes = ~0.02 GB
• Gradients (LoRA only): ~10M Ɨ 2 bytes = ~0.02 GB
• Optimizer (LoRA only): ~10M Ɨ 8 bytes = ~0.08 GB
• Activations: ~8 GB (with gradient checkpointing)

QLoRA TOTAL: ~12 GB → fits the Colab T4 (16GB) āœ…


šŸŽ“ Why Are Optimizer States So Large?
AdamW (the default optimizer) stores 2 states per parameter: momentum (running mean of gradients) and variance (running mean of squared gradients). Both are stored in FP32 (not FP16) for numerical stability. So the optimizer needs 2 Ɨ 4 = 8 bytes per parameter — 4Ɨ the size of the FP16 model!
This is the "silent VRAM killer". A 14.5GB FP16 model needs ~58GB just for the optimizer.
LoRA's solution: optimizer only for LoRA parameters (~10M, not 7B) → optimizer is just ~80MB.
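The arithmetic above can be sketched in a few lines of plain Python (a rough back-of-the-envelope estimate; activations and framework overhead are deliberately ignored):

```python
def full_finetune_vram_gb(params_b: float) -> dict:
    """Rough VRAM estimate for full fine-tuning with AdamW in FP16.

    params_b: parameter count in billions.
    """
    p = params_b * 1e9
    weights   = p * 2 / 1e9          # FP16 weights: 2 bytes/param
    gradients = p * 2 / 1e9          # FP16 gradients: 2 bytes/param
    optimizer = p * 2 * 4 / 1e9      # AdamW: 2 FP32 states = 8 bytes/param
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total_no_activations": weights + gradients + optimizer,
    }

est = full_finetune_vram_gb(7.24)  # Mistral-7B
print({k: round(v, 1) for k, v in est.items()})
# → {'weights': 14.5, 'gradients': 14.5, 'optimizer': 57.9, 'total_no_activations': 86.9}
# Add ~20-40 GB of activations on top and you land in the ~107-127 GB range.
```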

šŸŽÆ

2. PEFT — Train 0.1% Parameters, Get 95%+ Performance

Parameter-Efficient Fine-Tuning: freeze base model, train only small adapters

PEFT = a family of techniques that enable fine-tuning large models by training only a small fraction of parameters. The base model is frozen (unchanged), then small adapters are added and trained. Result: 95-99% of full fine-tuning performance, but with 10-100Ɨ less VRAM and training time.

PEFT Method | Params Trained | How It Works | Perf vs Full
LoRA | 0.1-1% | Low-rank matrices in attention layers | ~97% ⭐
QLoRA | 0.1-1% | LoRA + 4-bit quantized base | ~96% ⭐⭐
Prefix Tuning | 0.1% | Learnable prefix tokens | ~90%
Prompt Tuning | 0.01% | Learnable soft prompts | ~85%
IA3 | 0.01% | Learned rescaling vectors | ~93%
Adapters | 1-5% | MLP layers in each block | ~95%
Full Fine-Tuning | 100% | Update all weights | 100% (baseline)
🧮

3. LoRA Explained — Math & Intuition

Low-Rank Adaptation: decompose weight updates into 2 small matrices
LoRA — The Core Idea (from the paper "LoRA" by Hu et al., 2021)

Full fine-tuning:
  W_new = W_original + Ī”W
  W_original: (4096 Ɨ 4096) = 16.7 MILLION parameters ← FROZEN
  Ī”W: (4096 Ɨ 4096) = 16.7 MILLION parameters ← must be trained!
  Total trainable: 16.7M per layer

LoRA fine-tuning:
  W_new = W_original + A Ɨ B
  W_original: (4096 Ɨ 4096) ← FROZEN (untouched!)
  A: (4096 Ɨ r), where r = rank (typically 8-64)
  B: (r Ɨ 4096)
  A Ɨ B: (4096 Ɨ 4096) ← same shape as Ī”W, but built from 2 SMALL matrices!
  (The paper writes the update as B Ɨ A with B ∈ R^{dƗr}, A ∈ R^{rƗd} — the same idea, just the opposite naming of the two factors.)

If r = 16:
  A: 4096 Ɨ 16 = 65,536 params
  B: 16 Ɨ 4096 = 65,536 params
  Total trainable per layer: 131,072 (not 16.7M!)
  Reduction: 16.7M → 131K = 128Ɨ fewer parameters!

Why does this work? Research shows that the weight updates (Ī”W) during fine-tuning have a LOW "intrinsic rank" — meaning Ī”W can be well approximated by the product of 2 small matrices. The analogy: fine-tuning doesn't change "everything" in the model — it only makes small adjustments that can be represented compactly.
60_lora_math.py — LoRA Mathematics Visualized
import torch

# ===========================
# LoRA: W_new = W_frozen + B @ A
# ===========================
d = 4096  # hidden dimension (e.g., LLaMA 7B)
r = 16    # LoRA rank (hyperparameter)

# Original weight (FROZEN — no gradient!)
W_frozen = torch.randn(d, d, requires_grad=False)

# LoRA adapters (TRAINABLE — small!)
A = torch.randn(d, r, requires_grad=True)   # "down projection"
B = torch.zeros(r, d, requires_grad=True)   # "up projection" (init zeros!)
# B initialized to ZEROS → at start, A@B = 0 → model unchanged!
# This is important: training STARTS from original model behavior.

# Forward pass
x = torch.randn(1, d)  # input
output = x @ W_frozen + x @ A @ B  # original + LoRA delta
#         ↑ frozen       ↑ trainable (tiny!)

# Parameter count comparison
full_params = d * d                  # 16,777,216
lora_params = d * r + r * d          # 131,072
reduction = full_params / lora_params
print(f"Full:  {full_params:>12,} params per weight matrix")
print(f"LoRA:  {lora_params:>12,} params (r={r})")
print(f"Reduction: {reduction:.0f}Ɨ fewer parameters!")
# Full:    16,777,216 params
# LoRA:       131,072 params (r=16)
# Reduction: 128Ɨ fewer parameters!

# For entire Mistral-7B (applied to Q, K, V, O projections):
# Full trainable: ~7.24B parameters
# LoRA trainable: ~10M parameters (0.14% of total!)
# Saved model size: full=14.5GB vs LoRA adapter=~40MB


šŸŽ“ Important LoRA Hyperparameters:
r (rank): LoRA adapter dimension. Default=8-16. Larger = more expressive but more params. r=16 is enough for most tasks. r=64 for complex tasks.
alpha: Scaling factor. Default = r or 2Ɨr. LoRA output is scaled by alpha/r. Alpha=16 with r=16 → scale=1.0. Alpha=32 with r=16 → scale=2.0 (more aggressive).
target_modules: Which layers get LoRA. Default: q_proj, v_proj (attention). Can add: k_proj, o_proj, gate_proj, up_proj, down_proj for better performance.
dropout: LoRA dropout, default=0.05-0.1. Prevents overfitting on small datasets.
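The alpha/r scaling can be seen in a toy forward pass (a simplified sketch of the idea, not peft's internal code; the shapes and init values here are illustrative):

```python
import torch

d, r = 64, 16
lora_alpha = 32
scaling = lora_alpha / r          # peft scales the LoRA delta by alpha/r

A = torch.randn(d, r) * 0.01      # down projection (small random init)
B = torch.zeros(r, d)             # up projection (zeros → delta starts at 0)
x = torch.randn(1, d)

delta = (x @ A @ B) * scaling     # the adapter's contribution to the output
print(scaling)                    # 2.0 → alpha=32 with r=16 doubles the adapter's effect
print(delta.abs().max().item())   # 0.0 at initialization — model starts unchanged
```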

šŸ”§

4. LoRA with PEFT Library — Practical Implementation

pip install peft → add LoRA to any model in 5 lines
61_lora_peft.py — LoRA with PEFT Library
# pip install peft
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer

# ===========================
# 1. Load base model
# ===========================
model_name = "mistralai/Mistral-7B-v0.3"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(f"Base model params: {model.num_parameters():,}")
# Base model params: 7,241,732,096 (7.24B!)

# ===========================
# 2. Configure LoRA
# ===========================
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # or SEQ_CLS, TOKEN_CLS, SEQ_2_SEQ_LM
    r=16,                         # rank (8-64, sweet spot: 16)
    lora_alpha=32,                # scaling (usually 2Ɨr)
    lora_dropout=0.05,            # dropout
    target_modules=[              # which layers get LoRA
        "q_proj", "k_proj",        # attention query & key
        "v_proj", "o_proj",        # attention value & output
        # "gate_proj", "up_proj", "down_proj",  # MLP layers (optional, more params)
    ],
    bias="none",                  # don't train biases
)

# ===========================
# 3. Apply LoRA to model
# ===========================
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13,631,488 || all params: 7,255,363,584 || trainable%: 0.188%
# → Only 13.6M trainable out of 7.24B! (0.188%)
# → Saved adapter size: ~55MB (not 14.5GB!)

# ===========================
# 4. Train with standard Trainer (IDENTICAL to Page 2!)
# ===========================
# trainer = Trainer(model=model, args=args, ...)
# trainer.train()  # ← same API! PEFT is transparent to Trainer.

# ===========================
# 5. Save LoRA adapter (tiny!)
# ===========================
model.save_pretrained("./lora-adapter")
# Saves ONLY the LoRA weights (~55MB)
# NOT the full 14.5GB base model!

# ===========================
# 6. Load LoRA adapter later
# ===========================
from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained(model_name)
model_with_lora = PeftModel.from_pretrained(base, "./lora-adapter")
# → base model + LoRA adapter = fine-tuned model!
šŸ“¦

5. Quantization — 4-bit & 8-bit Model Loading

Compress a 14.5GB model → 3.6GB (4-bit) with minimal quality loss
62_quantization.py — BitsAndBytes Quantization
# pip install bitsandbytes
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# ===========================
# 1. 4-bit quantization config (QLoRA standard!)
# ===========================
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # load model in 4-bit!
    bnb_4bit_quant_type="nf4",            # NormalFloat4 (best for LLMs)
    bnb_4bit_compute_dtype=torch.float16, # compute in FP16 (speed)
    bnb_4bit_use_double_quant=True,       # quantize the quantization constants!
    # → extra 0.4GB savings for 7B model
)

# ===========================
# 2. Load 7B model in 4-bit (~3.6GB instead of 14.5GB!)
# ===========================
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    quantization_config=bnb_config,
    device_map="auto",  # auto-place layers across GPU/CPU
)

print(f"Model loaded! Memory: {model.get_memory_footprint()/1e9:.1f} GB")
# Model loaded! Memory: 3.6 GB (not 14.5GB!)
# → Fits on Colab T4 (16GB)! ~12GB left over for training.

# ===========================
# 3. Comparison: precision vs VRAM
# ===========================
# FP32 (32-bit): 7B Ɨ 4 bytes  = ~29 GB   ← does NOT fit on an A100 40GB
# FP16 (16-bit): 7B Ɨ 2 bytes  = ~14.5 GB ← fits on an A100, not a T4
# INT8 (8-bit):  7B Ɨ 1 byte   = ~7.2 GB  ← fits on a T4, tight
# NF4 (4-bit):   7B Ɨ 0.5 byte = ~3.6 GB  ← fits COMFORTABLY on a T4! ⭐
Precision | Bytes/Param | 7B Model Size | Quality Loss | Colab T4?
FP32 | 4 | ~29 GB | 0% (baseline) | āŒ
FP16/BF16 | 2 | ~14.5 GB | ~0% | āš ļø Inference only
INT8 | 1 | ~7.2 GB | ~0.5% | āš ļø Tight + LoRA
NF4 (4-bit) | 0.5 | ~3.6 GB | ~1-2% | āœ… Comfortable + QLoRA ⭐
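The core trick behind blockwise quantization can be illustrated in plain torch. This is a toy absmax scheme, NOT bitsandbytes' actual NF4 (which uses a fixed 16-value codebook fitted to a normal weight distribution), but it shows why 4 bits per weight plus one scale per block preserves most of the information:

```python
import torch

def quantize_4bit_absmax(w: torch.Tensor, block: int = 64):
    """Toy blockwise 4-bit absmax quantization (illustration only)."""
    flat = w.flatten().reshape(-1, block)
    scale = flat.abs().max(dim=1, keepdim=True).values   # one FP scale per block
    q = torch.clamp((flat / scale * 7).round(), -8, 7)   # 16 levels → 4 bits
    return q.to(torch.int8), scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (q.float() / 7 * scale).flatten()

w = torch.randn(4096)                 # pretend this is one weight row
q, scale = quantize_4bit_absmax(w)
w_hat = dequantize(q, scale)
err = (w - w_hat).abs().mean().item()
print(f"mean abs error: {err:.4f}")   # small — most of the signal survives 4 bits
```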
⚔

6. QLoRA — 4-bit Base + LoRA = Game-Changer

Combine quantization (Section 5) + LoRA (Section 3-4) = fine-tune 7B on Colab!
QLoRA = Quantization + LoRA — Best of Both Worlds

Base model: 4-bit quantized (NF4)
  7.24B params Ɨ 0.5 bytes = 3.6 GB (FROZEN!)

LoRA adapters: FP16 (trainable)
  ~13M params Ɨ 2 bytes = ~0.05 GB (TRAINED!)
  Applied to: q_proj, k_proj, v_proj, o_proj

Compute: FP16
  4-bit weights → dequantized to FP16 → forward pass
  LoRA: FP16 forward + backward (gradients)
  Optimizer: only for LoRA params → tiny!

VRAM breakdown (Mistral-7B, QLoRA, batch=1, seq=512):
• 4-bit model: ~3.6 GB
• LoRA params: ~0.05 GB
• LoRA optimizer: ~0.08 GB
• Gradients: ~0.05 GB
• Activations: ~6-8 GB (with gradient checkpointing)
Total: ~10-12 GB → fits the Colab T4 (16GB) āœ…
šŸ’»

7. Where to Run? — VRAM Comparison

Model | Full Fine-Tune | LoRA (FP16) | QLoRA (4-bit) | Colab T4 (16GB)?
LLaMA-3.2 1B | ~16 GB | ~6 GB | ~4 GB | āœ… QLoRA comfortable
Gemma-2 2B | ~24 GB | ~8 GB | ~5 GB | āœ… QLoRA comfortable
Mistral 7B | ~120 GB | ~20 GB | ~12 GB | āœ… QLoRA ⭐
LLaMA-3.1 8B | ~130 GB | ~22 GB | ~13 GB | āš ļø QLoRA tight
LLaMA-3.1 70B | ~1 TB+ | ~160 GB | ~40 GB | āŒ Needs A100 80GB

šŸŽ‰ This Page 8 uses Mistral-7B + QLoRA on free Colab T4! Full fine-tuning Mistral needs ~120GB (8Ɨ A100). With QLoRA: only ~12GB. Colab T4 = 16GB → ENOUGH! From "impossible" to "free".

šŸ”„

8. Project: Fine-Tune LLM with QLoRA — Complete Pipeline

Fine-tune Mistral-7B on an instruction dataset — on a free Colab T4!
63_qlora_finetune.py — Complete QLoRA Fine-Tuning šŸ”„šŸ”„šŸ”„
#!/usr/bin/env python3
"""
šŸ”„ Fine-Tune Mistral-7B with QLoRA on Google Colab T4 (free!)
From "impossible" (120GB VRAM) to "free" (12GB VRAM) with QLoRA.
"""

# ═══════════════════════════════════════
# STEP 0: INSTALL (run in Colab!)
# ═══════════════════════════════════════
# !pip install -q transformers datasets accelerate peft bitsandbytes trl

import torch
from transformers import (
    AutoModelForCausalLM, AutoTokenizer,
    BitsAndBytesConfig, TrainingArguments
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
from trl import SFTTrainer  # Supervised Fine-Tuning Trainer

# ═══════════════════════════════════════
# STEP 1: LOAD MODEL IN 4-BIT
# ═══════════════════════════════════════
model_name = "mistralai/Mistral-7B-v0.3"
# Alternatives: "meta-llama/Llama-3.2-1B" (smaller, easier)
#               "Qwen/Qwen2.5-7B" (multilingual)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config,
    device_map="auto", trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

print(f"Model VRAM: {model.get_memory_footprint()/1e9:.1f} GB")
# Model VRAM: 3.6 GB ← 4-bit! (was 14.5 GB in FP16)

# ═══════════════════════════════════════
# STEP 2: PREPARE MODEL FOR TRAINING
# ═══════════════════════════════════════
model = prepare_model_for_kbit_training(model)
# Enables gradient checkpointing + casts to correct dtypes

# ═══════════════════════════════════════
# STEP 3: ADD LORA ADAPTERS
# ═══════════════════════════════════════
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13,631,488 || all params: 7,255,363,584 || trainable%: 0.19%

# ═══════════════════════════════════════
# STEP 4: LOAD INSTRUCTION DATASET
# ═══════════════════════════════════════
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
# 15k instruction-response pairs (open-source!)

def format_instruction(example):
    if example.get("context") and example["context"].strip():
        text = f"""### Instruction:\n{example['instruction']}\n\n### Context:\n{example['context']}\n\n### Response:\n{example['response']}{tokenizer.eos_token}"""
    else:
        text = f"""### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}{tokenizer.eos_token}"""
    return {"text": text}

dataset = dataset.map(format_instruction)
print(f"Training examples: {len(dataset)}")
print(f"Sample:\n{dataset[0]['text'][:300]}...")

# ═══════════════════════════════════════
# STEP 5: TRAINING ARGUMENTS
# ═══════════════════════════════════════
args = TrainingArguments(
    output_dir="./qlora-mistral",
    num_train_epochs=1,                  # 1 epoch for 15k examples → enough!
    per_device_train_batch_size=2,       # small batch (4-bit model is large)
    gradient_accumulation_steps=8,       # effective batch = 2 Ɨ 8 = 16
    learning_rate=2e-4,                  # higher LR for LoRA (2e-4, not 2e-5!)
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    fp16=True,
    logging_steps=25,
    save_strategy="steps",
    save_steps=200,
    save_total_limit=2,
    optim="paged_adamw_8bit",           # 8-bit optimizer → less VRAM!
    gradient_checkpointing=True,        # saves ~30% VRAM (slower training)
    max_grad_norm=0.3,                  # gradient clipping
    report_to="none",
)

# ═══════════════════════════════════════
# STEP 6: TRAIN WITH SFTTrainer (from TRL library)
# ═══════════════════════════════════════
trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    dataset_text_field="text",          # column with formatted text
    max_seq_length=512,                 # max token length
    packing=True,                       # pack short examples into one sequence!
    # packing = True → 2-3Ɨ faster training (less padding waste)
)

print("šŸ‹ļø Training QLoRA...")
print(f"VRAM used: {torch.cuda.memory_allocated()/1e9:.1f} GB")
trainer.train()

# ═══════════════════════════════════════
# STEP 7: SAVE & TEST
# ═══════════════════════════════════════
trainer.save_model("./qlora-mistral-final")

# Test!
prompt = "### Instruction:\nWhat is machine learning?\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=200,
        temperature=0.7, top_p=0.9, repetition_penalty=1.2)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print("\nšŸ† QLoRA fine-tuning complete!")


šŸ”„ You Just Fine-Tuned a 7 BILLION Parameter Model on Free Colab!
Compare what was previously impossible:
• Full fine-tune Mistral-7B: needs ~120GB VRAM (8Ɨ A100 = ~$256/hr) āŒ
• QLoRA Mistral-7B: needs ~12GB VRAM (1Ɨ Colab T4 = $0) āœ…
• Performance: ~96% of full fine-tuning!
• Training time: ~30-60 minutes for 15k examples
• Saved adapter: ~55MB (not 14.5GB!)
This is the revolution that makes LLM fine-tuning accessible to everyone.
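A quick sanity check on the training length, using the hyperparameters above (a sketch; with packing=True the real step count is lower, since short examples get merged into one sequence):

```python
# Rough optimizer-step count for the run above (ignoring packing)
num_examples = 15_000               # databricks-dolly-15k
per_device_batch = 2
grad_accum = 8
effective_batch = per_device_batch * grad_accum   # examples per optimizer step
epochs = 1
steps = (num_examples * epochs) // effective_batch
print(effective_batch, steps)   # 16 937
```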

šŸ”—

9. Merge LoRA into Base Model — For Deployment

Merge adapter into base model → one complete model without LoRA overhead
64_merge_lora.py — Merge & Deploy
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# ===========================
# 1. Load base model (FP16, NOT 4-bit!) + LoRA adapter
# ===========================
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    torch_dtype=torch.float16,
    device_map="auto",
)
# Load LoRA adapter on top
model = PeftModel.from_pretrained(base_model, "./qlora-mistral-final")

# ===========================
# 2. Merge LoRA into base weights
# ===========================
model = model.merge_and_unload()
# → W_new = W_base + B @ A (permanently merged!)
# → No more LoRA overhead at inference
# → Same speed as original model

# ===========================
# 3. Save merged model
# ===========================
model.save_pretrained("./mistral-7b-finetuned-merged")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")
tokenizer.save_pretrained("./mistral-7b-finetuned-merged")
# → Full 14.5GB model with LoRA baked in
# → Deploy like any normal model (vLLM, TGI, Transformers pipeline, etc.)

# ===========================
# 4. Push to Hub!
# ===========================
# model.push_to_hub("username/mistral-7b-instruction-tuned")
# tokenizer.push_to_hub("username/mistral-7b-instruction-tuned")

# ===========================
# When to merge vs keep separate?
# ===========================
# MERGE: for production deployment (no PEFT dependency needed)
# KEEP SEPARATE: when you have multiple LoRA adapters for different tasks
#   e.g., base_model + lora_sentiment, base_model + lora_translation
#   → swap adapters at runtime without loading model twice!
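Numerically, merge_and_unload just folds the scaled adapter product into the base weight. A toy check in plain torch (small hypothetical shapes, not the real 7B weights) shows that merging is exact, not an approximation:

```python
import torch

d, r = 256, 8
alpha = 16
scaling = alpha / r            # same alpha/r scaling peft applies

W = torch.randn(d, d)          # frozen base weight
A = torch.randn(d, r) * 0.02   # LoRA down projection
B = torch.randn(r, d) * 0.02   # LoRA up projection
x = torch.randn(4, d)

# Adapter kept separate: extra matmuls per layer at inference time
y_separate = x @ W + (x @ A @ B) * scaling

# Adapter merged: one matmul, identical output
W_merged = W + (A @ B) * scaling
y_merged = x @ W_merged

print(torch.allclose(y_separate, y_merged, atol=1e-4))  # True — merging is exact
```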
āš–ļø

10. Method Comparison — Full vs LoRA vs QLoRA

Aspect | Full Fine-Tune | LoRA (FP16) | QLoRA (4-bit)
Params trained | 100% | 0.1-1% | 0.1-1%
VRAM (7B) | ~120 GB | ~20 GB | ~12 GB
Performance | 100% (baseline) | ~97% | ~96%
Training speed | Slow | ~2Ɨ faster | ~3Ɨ faster
Saved size | 14.5 GB | ~55 MB adapter | ~55 MB adapter
Minimum GPU | 2-4Ɨ A100 80GB | 1Ɨ A100 40GB | 1Ɨ T4 16GB
Cost (Colab) | Impossible | Colab Pro | Colab FREE!
Multi-task | 1 model per task | Swap adapters! | Swap adapters!
When to use | Unlimited budget | Good GPU available | Default choice ⭐
šŸ“

11. Page 8 Summary

Concept | What It Is | Key Code
PEFT | Train 0.1% of params | pip install peft
LoRA | W + BƗA (low-rank adapters) | LoraConfig(r=16, lora_alpha=32)
QLoRA | 4-bit base + LoRA | BitsAndBytesConfig(load_in_4bit=True)
4-bit Quant | 14.5GB → 3.6GB | bnb_4bit_quant_type="nf4"
get_peft_model | Add LoRA to any model | get_peft_model(model, config)
SFTTrainer | Instruction fine-tuning | SFTTrainer(model, args, dataset)
Merge | Bake LoRA into base | model.merge_and_unload()
Paged AdamW 8-bit | Memory-efficient optimizer | optim="paged_adamw_8bit"
← Previous Page

Page 7 — Spaces, Gradio & Demo Apps

šŸ“˜

Coming Next: Page 9 — RLHF & Alignment

From fine-tuned model to helpful & safe assistant! Page 9 covers: Reinforcement Learning from Human Feedback (RLHF), DPO (Direct Preference Optimization) — easier than PPO, reward modeling, TRL library for RLHF training, and alignment techniques that turned GPT-3 into ChatGPT.