Table of Contents – Page 8
- The VRAM Problem – Why full fine-tuning 7B is impossible on Colab
- PEFT – Train 0.1% of parameters, get 95%+ of full performance
- LoRA Explained – Math & intuition behind it
- LoRA with the PEFT Library – Practical implementation
- Quantization – 4-bit & 8-bit model loading
- QLoRA – 4-bit base + LoRA adapters = game-changer
- Where to Run? – VRAM for QLoRA vs LoRA vs Full
- Project: Fine-Tune an LLM with QLoRA – Complete pipeline
- Merge LoRA into the Base Model – For deployment
- Method Comparison – Full vs LoRA vs QLoRA vs Prefix
- Summary & Page 9 Preview
1. The VRAM Problem – Why Full Fine-Tuning 7B Is Impossible on Colab
Why Are Optimizer States So Large?
AdamW (the default optimizer) stores 2 states per parameter: momentum (a running mean of gradients) and variance (a running mean of squared gradients). Both are kept in FP32 (not FP16) for numerical stability, so the optimizer alone costs 2 × params × 4 bytes = 8 bytes per parameter, i.e. 4× the FP16 model size!
This is the "silent VRAM killer": a 14.5GB FP16 model needs ~58GB just for optimizer states.
LoRA's solution: the optimizer only tracks the LoRA parameters (~10M, not 7B), so the optimizer shrinks to ~80MB.
2. PEFT – Train 0.1% of Parameters, Get 95%+ of the Performance
PEFT (Parameter-Efficient Fine-Tuning) is a family of techniques that fine-tune large models by training only a small fraction of their parameters. The base model is frozen (unchanged), and small adapters are added and trained. The result: 95-99% of full fine-tuning performance with 10-100× less VRAM and training time.
| PEFT Method | Params Trained | How It Works | Perf vs Full |
|---|---|---|---|
| LoRA | 0.1-1% | Low-rank matrices in attention layers | ~97% ⭐ |
| QLoRA | 0.1-1% | LoRA + 4-bit quantized base | ~96% ⭐⭐ |
| Prefix Tuning | 0.1% | Learnable prefix tokens | ~90% |
| Prompt Tuning | 0.01% | Learnable soft prompts | ~85% |
| IA3 | 0.01% | Learned rescaling vectors | ~93% |
| Adapters | 1-5% | MLP layers in each block | ~95% |
| Full Fine-Tuning | 100% | Update all weights | 100% (baseline) |
3. LoRA Explained – Math & Intuition
```python
import torch

# ===========================
# LoRA: W_new = W_frozen + A @ B (low-rank update)
# ===========================
d = 4096   # hidden dimension (e.g., LLaMA 7B)
r = 16     # LoRA rank (hyperparameter)

# Original weight (FROZEN – no gradient!)
W_frozen = torch.randn(d, d, requires_grad=False)

# LoRA adapters (TRAINABLE – small!)
A = torch.randn(d, r, requires_grad=True)   # "down projection"
B = torch.zeros(r, d, requires_grad=True)   # "up projection" (init zeros!)
# B initialized to ZEROS → at the start, A @ B = 0 → model unchanged!
# This is important: training STARTS from the original model's behavior.

# Forward pass
x = torch.randn(1, d)                  # input
output = x @ W_frozen + x @ A @ B      # original + LoRA delta
#              ↑ frozen       ↑ trainable (tiny!)

# Parameter count comparison
full_params = d * d            # 16,777,216
lora_params = d * r + r * d    # 131,072
reduction = full_params / lora_params
print(f"Full: {full_params:>12,} params per weight matrix")
print(f"LoRA: {lora_params:>12,} params (r={r})")
print(f"Reduction: {reduction:.0f}× fewer parameters!")
# Full:   16,777,216 params per weight matrix
# LoRA:      131,072 params (r=16)
# Reduction: 128× fewer parameters!

# For all of Mistral-7B (applied to Q, K, V, O projections):
# Full trainable: ~7.24B parameters
# LoRA trainable: ~13.6M parameters (~0.19% of total!)
# Saved model size: full = 14.5GB vs LoRA adapter ≈ 55MB
```
Important LoRA Hyperparameters:
- r (rank): LoRA adapter dimension. Default 8-16. Larger = more expressive but more params. r=16 is enough for most tasks; use r=64 for complex tasks.
- alpha: Scaling factor. Default = r or 2×r. The LoRA output is scaled by alpha/r. alpha=16 with r=16 → scale 1.0; alpha=32 with r=16 → scale 2.0 (more aggressive).
- target_modules: Which layers get LoRA. Default: q_proj, v_proj (attention). Add k_proj, o_proj, gate_proj, up_proj, down_proj for better performance.
- dropout: LoRA dropout, default 0.05-0.1. Prevents overfitting on small datasets.
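To see how r and target_modules set the trainable budget, the 13.6M figure quoted below for Mistral-7B can be reproduced with plain arithmetic. A sketch using Mistral-7B's published shapes (hidden size 4096, 32 layers, 8 KV heads, so k_proj/v_proj have a smaller output dim of 1024 due to grouped-query attention):

```python
# LoRA trainable params = r * (d_in + d_out) per adapted weight matrix.
r = 16
layers = 32
shapes = {                    # (d_in, d_out) per attention projection
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),   # grouped-query attention → smaller out dim
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}

per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
total = per_layer * layers
print(f"Per layer: {per_layer:,}")            # 425,984
print(f"Trainable LoRA params: {total:,}")    # 13,631,488
```

The total matches the `print_trainable_parameters()` output shown in the next section exactly.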
4. LoRA with the PEFT Library – Practical Implementation
```python
# pip install peft
from peft import LoraConfig, PeftModel, TaskType, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

# ===========================
# 1. Load base model
# ===========================
model_name = "mistralai/Mistral-7B-v0.3"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(f"Base model params: {model.num_parameters():,}")
# Base model params: 7,241,732,096 (7.24B!)

# ===========================
# 2. Configure LoRA
# ===========================
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # or SEQ_CLS, TOKEN_CLS, SEQ_2_SEQ_LM
    r=16,               # rank (8-64, sweet spot: 16)
    lora_alpha=32,      # scaling (usually 2×r)
    lora_dropout=0.05,  # dropout
    target_modules=[    # which layers get LoRA
        "q_proj", "k_proj",  # attention query & key
        "v_proj", "o_proj",  # attention value & output
        # "gate_proj", "up_proj", "down_proj",  # MLP layers (optional, more params)
    ],
    bias="none",        # don't train biases
)

# ===========================
# 3. Apply LoRA to the model
# ===========================
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13,631,488 || all params: 7,255,363,584 || trainable%: 0.188
# → Only 13.6M trainable out of 7.24B! (0.188%)
# → Saved adapter size: ~55MB (not 14.5GB!)

# ===========================
# 4. Train with the standard Trainer (IDENTICAL to Page 2!)
# ===========================
# trainer = Trainer(model=model, args=args, ...)
# trainer.train()
# → same API! PEFT is transparent to the Trainer.

# ===========================
# 5. Save the LoRA adapter (tiny!)
# ===========================
model.save_pretrained("./lora-adapter")
# Saves ONLY the LoRA weights (~55MB),
# NOT the full 14.5GB base model!

# ===========================
# 6. Load the LoRA adapter later
# ===========================
base = AutoModelForCausalLM.from_pretrained(model_name)
model_with_lora = PeftModel.from_pretrained(base, "./lora-adapter")
# → base model + LoRA adapter = fine-tuned model!
```
5. Quantization – 4-bit & 8-bit Model Loading
```python
# pip install bitsandbytes
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# ===========================
# 1. 4-bit quantization config (the QLoRA standard!)
# ===========================
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # load model in 4-bit!
    bnb_4bit_quant_type="nf4",             # NormalFloat4 (best for LLMs)
    bnb_4bit_compute_dtype=torch.float16,  # compute in FP16 (speed)
    bnb_4bit_use_double_quant=True,        # quantize the quantization constants!
    # → extra ~0.4GB savings for a 7B model
)

# ===========================
# 2. Load a 7B model in 4-bit (~3.6GB instead of 14.5GB!)
# ===========================
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    quantization_config=bnb_config,
    device_map="auto",   # auto-place layers across GPU/CPU
)

print(f"Model loaded! Memory: {model.get_memory_footprint()/1e9:.1f} GB")
# Model loaded! Memory: 3.6 GB (not 14.5GB!)
# → Fits on a Colab T4 (16GB), leaving ~12GB for training.

# ===========================
# 3. Comparison: precision vs VRAM
# ===========================
# FP32 (32-bit): 7B × 4 bytes  = ~29 GB   → doesn't fit for training, even on an A100 40GB
# FP16 (16-bit): 7B × 2 bytes  = ~14.5 GB → fits on an A100, not a T4
# INT8 (8-bit):  7B × 1 byte   = ~7.2 GB  → fits on a T4, tight
# NF4 (4-bit):   7B × 0.5 byte = ~3.6 GB  → fits COMFORTABLY on a T4! ✅
```
| Precision | Bytes/Param | 7B Model Size | Quality Loss | Colab T4? |
|---|---|---|---|---|
| FP32 | 4 | ~29 GB | 0% (baseline) | ❌ |
| FP16/BF16 | 2 | ~14.5 GB | ~0% | ⚠️ Inference only |
| INT8 | 1 | ~7.2 GB | ~0.5% | ⚠️ Tight + LoRA |
| NF4 (4-bit) | 0.5 | ~3.6 GB | ~1-2% | ✅ Comfortable + QLoRA ⭐ |
6. QLoRA – 4-bit Base + LoRA = Game-Changer
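QLoRA combines the two previous ideas: the base model is loaded frozen in 4-bit NF4, while the LoRA adapters, their gradients, and their optimizer states stay small and high-precision. Because only the tiny adapters are trained, the memory math collapses. A rough sketch (activation memory, which gradient checkpointing keeps to a few GB, is left out; the adapter count assumes r=16 on the Q/K/V/O projections):

```python
# Rough QLoRA memory budget for Mistral-7B on a 16GB Colab T4.
base_gb    = 7.24e9 * 0.5 / 1e9    # 4-bit NF4 base: 0.5 bytes/param ≈ 3.6 GB
adapters   = 13_631_488            # trainable LoRA params (r=16, QKVO)
adapter_gb = adapters * 2 / 1e9    # FP16 adapter weights  ≈ 0.03 GB
grad_gb    = adapters * 2 / 1e9    # gradients, adapters only ≈ 0.03 GB
optim_gb   = adapters * 2 / 1e9    # paged 8-bit AdamW: 2 states × 1 byte ≈ 0.03 GB

fixed = base_gb + adapter_gb + grad_gb + optim_gb
print(f"Model + adapters + grads + optimizer: {fixed:.1f} GB")  # ~3.7 GB
# The rest of the ~12GB training figure is activations, which scale with
# batch size and sequence length — gradient checkpointing keeps them in check.
```

Contrast this with the ~87GB fixed cost of full fine-tuning: everything that scaled with the 7B parameter count now scales with 13.6M instead.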
7. Where to Run? – VRAM Comparison
| Model | Full Fine-Tune | LoRA (FP16) | QLoRA (4-bit) | Colab T4 (16GB)? |
|---|---|---|---|---|
| LLaMA-3.2 1B | ~16 GB | ~6 GB | ~4 GB | ✅ QLoRA comfortable |
| Gemma-2 2B | ~24 GB | ~8 GB | ~5 GB | ✅ QLoRA comfortable |
| Mistral 7B | ~120 GB | ~20 GB | ~12 GB | ✅ QLoRA ⭐ |
| LLaMA-3.1 8B | ~130 GB | ~22 GB | ~13 GB | ⚠️ QLoRA tight |
| LLaMA-3.1 70B | ~1 TB+ | ~160 GB | ~40 GB | ❌ Needs an A100 80GB |
This page uses Mistral-7B + QLoRA on the free Colab T4! Full fine-tuning Mistral needs ~120GB (8× A100). With QLoRA: only ~12GB. Colab T4 = 16GB → ENOUGH! From "impossible" to "free".
8. Project: Fine-Tune an LLM with QLoRA – Complete Pipeline
```python
#!/usr/bin/env python3
"""
Fine-Tune Mistral-7B with QLoRA on a Google Colab T4 (free!)
From "impossible" (120GB VRAM) to "free" (12GB VRAM) with QLoRA.
"""

# ───────────────────────────────────────
# STEP 0: INSTALL (run in Colab!)
# ───────────────────────────────────────
# !pip install -q transformers datasets accelerate peft bitsandbytes trl

import torch
from transformers import (
    AutoModelForCausalLM, AutoTokenizer,
    BitsAndBytesConfig, TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
from trl import SFTTrainer  # Supervised Fine-Tuning Trainer

# ───────────────────────────────────────
# STEP 1: LOAD MODEL IN 4-BIT
# ───────────────────────────────────────
model_name = "mistralai/Mistral-7B-v0.3"
# Alternatives: "meta-llama/Llama-3.2-1B" (smaller, easier)
#               "Qwen/Qwen2.5-7B" (multilingual)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config,
    device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

print(f"Model VRAM: {model.get_memory_footprint()/1e9:.1f} GB")
# Model VRAM: 3.6 GB → 4-bit! (was 14.5 GB in FP16)

# ───────────────────────────────────────
# STEP 2: PREPARE MODEL FOR TRAINING
# ───────────────────────────────────────
model = prepare_model_for_kbit_training(model)
# Enables gradient checkpointing + casts layers to the correct dtypes

# ───────────────────────────────────────
# STEP 3: ADD LORA ADAPTERS
# ───────────────────────────────────────
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none", task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13,631,488 || all params: 7,255,363,584 || trainable%: 0.19

# ───────────────────────────────────────
# STEP 4: LOAD INSTRUCTION DATASET
# ───────────────────────────────────────
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
# 15k instruction-response pairs (open-source!)

def format_instruction(example):
    if example.get("context") and example["context"].strip():
        text = (f"### Instruction:\n{example['instruction']}\n\n"
                f"### Context:\n{example['context']}\n\n"
                f"### Response:\n{example['response']}{tokenizer.eos_token}")
    else:
        text = (f"### Instruction:\n{example['instruction']}\n\n"
                f"### Response:\n{example['response']}{tokenizer.eos_token}")
    return {"text": text}

dataset = dataset.map(format_instruction)
print(f"Training examples: {len(dataset)}")
print(f"Sample:\n{dataset[0]['text'][:300]}...")

# ───────────────────────────────────────
# STEP 5: TRAINING ARGUMENTS
# ───────────────────────────────────────
args = TrainingArguments(
    output_dir="./qlora-mistral",
    num_train_epochs=1,               # 1 epoch over 15k examples → enough!
    per_device_train_batch_size=2,    # small batch (activations are the bottleneck)
    gradient_accumulation_steps=8,    # effective batch = 2 × 8 = 16
    learning_rate=2e-4,               # higher LR for LoRA (2e-4, not 2e-5!)
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    fp16=True,
    logging_steps=25,
    save_strategy="steps", save_steps=200, save_total_limit=2,
    optim="paged_adamw_8bit",         # 8-bit paged optimizer → less VRAM!
    gradient_checkpointing=True,      # saves ~30% VRAM (slower training)
    max_grad_norm=0.3,                # gradient clipping
    report_to="none",
)

# ───────────────────────────────────────
# STEP 6: TRAIN WITH SFTTrainer (from the TRL library)
# ───────────────────────────────────────
trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    dataset_text_field="text",  # column with the formatted text
    max_seq_length=512,         # max token length
    packing=True,               # pack short examples into one sequence!
    # packing=True → 2-3× faster training (less padding waste)
)

print("Training QLoRA...")
print(f"VRAM used: {torch.cuda.memory_allocated()/1e9:.1f} GB")
trainer.train()

# ───────────────────────────────────────
# STEP 7: SAVE & TEST
# ───────────────────────────────────────
trainer.save_model("./qlora-mistral-final")

# Test!
prompt = "### Instruction:\nWhat is machine learning?\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.7,
                             top_p=0.9, repetition_penalty=1.2)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

print("\nQLoRA fine-tuning complete!")
```
You Just Fine-Tuned a 7-BILLION-Parameter Model on Free Colab!
Compare with what was previously impossible:
• Full fine-tuning Mistral-7B: needs ~120GB VRAM (8× A100 = ~$256/hr) ❌
• QLoRA Mistral-7B: needs ~12GB VRAM (1× Colab T4 = $0) ✅
• Performance: ~96% of full fine-tuning!
• Training time: ~30-60 minutes for 15k examples
• Saved adapter: ~55MB (not 14.5GB!)
This is the revolution that makes LLM fine-tuning accessible to everyone.
9. Merge LoRA into the Base Model – For Deployment
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# ===========================
# 1. Load base model (FP16, NOT 4-bit!) + LoRA adapter
# ===========================
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    torch_dtype=torch.float16,
    device_map="auto",
)
# Load the LoRA adapter on top
model = PeftModel.from_pretrained(base_model, "./qlora-mistral-final")

# ===========================
# 2. Merge LoRA into the base weights
# ===========================
model = model.merge_and_unload()
# → W_new = W_base + B @ A (permanently merged!)
# → No more LoRA overhead at inference
# → Same speed as the original model

# ===========================
# 3. Save the merged model
# ===========================
model.save_pretrained("./mistral-7b-finetuned-merged")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")
tokenizer.save_pretrained("./mistral-7b-finetuned-merged")
# → Full 14.5GB model with LoRA baked in
# → Deploy like any normal model (vLLM, TGI, etc.)

# ===========================
# 4. Push to the Hub!
# ===========================
# model.push_to_hub("username/mistral-7b-instruction-tuned")
# tokenizer.push_to_hub("username/mistral-7b-instruction-tuned")

# ===========================
# When to merge vs keep separate?
# ===========================
# MERGE: for production deployment (no PEFT dependency needed)
# KEEP SEPARATE: when you have multiple LoRA adapters for different tasks,
#   e.g., base_model + lora_sentiment, base_model + lora_translation
#   → swap adapters at runtime without loading the model twice!
```
10. Method Comparison – Full vs LoRA vs QLoRA
| Aspect | Full Fine-Tune | LoRA (FP16) | QLoRA (4-bit) |
|---|---|---|---|
| Params trained | 100% | 0.1-1% | 0.1-1% |
| VRAM (7B) | ~120 GB | ~20 GB | ~12 GB |
| Performance | 100% (baseline) | ~97% | ~96% |
| Training speed | Slow | ~2× faster | ~3× faster |
| Saved size | 14.5 GB | ~55 MB adapter | ~55 MB adapter |
| Minimum GPU | 2-4× A100 80GB | 1× A100 40GB | 1× T4 16GB |
| Cost (Colab) | Impossible | Colab Pro | Colab FREE! |
| Multi-task | 1 model per task | Swap adapters! | Swap adapters! |
| When to use | Unlimited budget | Good GPU available | Default choice ⭐ |
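The decision rule in the table above can be captured in a small helper. A hypothetical sketch (the function name and thresholds are illustrative, mirroring the ~7B rows of the table):

```python
def pick_method(vram_gb: float) -> str:
    """Rule of thumb for fine-tuning a ~7B model, per the table above."""
    if vram_gb >= 120:
        return "full fine-tuning"     # multi-A100 budget
    if vram_gb >= 20:
        return "LoRA (FP16)"          # single A100 40GB class
    if vram_gb >= 12:
        return "QLoRA (4-bit)"        # Colab T4 16GB — the default choice
    return "QLoRA with a smaller model (1-3B)"

print(pick_method(16))   # Colab T4
print(pick_method(40))   # A100 40GB
```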
11. Page 8 Summary
| Concept | What It Is | Key Code |
|---|---|---|
| PEFT | Train 0.1% of params | pip install peft |
| LoRA | W + B×A (low-rank adapters) | LoraConfig(r=16, lora_alpha=32) |
| QLoRA | 4-bit base + LoRA | BitsAndBytesConfig(load_in_4bit=True) |
| 4-bit Quant | 14.5GB → 3.6GB | bnb_4bit_quant_type="nf4" |
| get_peft_model | Add LoRA to any model | get_peft_model(model, config) |
| SFTTrainer | Instruction fine-tuning | SFTTrainer(model, args, dataset) |
| Merge | Bake LoRA into the base | model.merge_and_unload() |
| Paged AdamW 8-bit | Memory-efficient optimizer | optim="paged_adamw_8bit" |
Previously: Page 7 – Spaces, Gradio & Demo Apps
Coming Next: Page 9 – RLHF & Alignment
From fine-tuned model to a helpful & safe assistant! Page 9 covers: Reinforcement Learning from Human Feedback (RLHF), DPO (Direct Preference Optimization) – easier than PPO, reward modeling, the TRL library for RLHF training, and the alignment techniques that turned GPT-3 into ChatGPT.