📑 Table of Contents — Page 2
- What Is Fine-Tuning? — Why not train from scratch
- Where to Run? — GPU, VRAM, Colab setup, OOM troubleshooting
- Datasets Library — Load, inspect, filter, and preprocess
- Dataset Tokenization — Batch tokenize with .map()
- DataCollator — Dynamic padding for efficiency
- TrainingArguments — EVERY parameter explained
- Compute Metrics — Custom F1, Precision, Recall
- Project: Fine-Tune DistilBERT on IMDB — 93%+ accuracy
- Fine-Tune on Custom CSV Dataset — Your own data
- Push Model to Hugging Face Hub — Share with the world
- Hyperparameter Search — Optuna integration
- Trainer vs Native PyTorch — When to use which
- Summary & Page 3 Preview
1. What Is Fine-Tuning? — And Why It Works
Fine-tuning = taking a model already pre-trained on massive data (BERT on Wikipedia + BookCorpus = 3.3 billion words), then retraining it on your small dataset for a specific task (e.g., review sentiment classification). The model already understands language — you just need to teach it the task.
Why does fine-tuning work so much better than training from scratch?
| Aspect | Train from Scratch | Fine-Tuning |
|---|---|---|
| Data needed | Millions-billions of samples | 1,000-50,000 samples enough! |
| Training time | Days-weeks (multi-GPU) | 15-60 minutes (1 GPU) |
| Cost | $10k-$10M+ | $0 (free Colab) |
| Accuracy (IMDB) | ~80-85% (BiLSTM) | 93%+ (fine-tuned BERT) |
| Language understood? | Learns from scratch | Already understands grammar, semantics, context |
| Minimum GPU | Multi-GPU, V100/A100 | Free T4 on Colab |
🎓 Fine-Tuning ≠ Prompt Engineering
Fine-tuning: Modifies model weights (retrains) → model becomes a specialist for your task. Needs data + training. Result: a new model you save and deploy.
Prompt engineering: Gives instructions to a model without changing weights. No training needed. Result: response from the same model (GPT-4, Claude, etc.).
When is fine-tuning better than prompting?
• You have a specific labeled dataset → fine-tune
• You need 100% consistency (production) → fine-tune
• Small model must be as accurate as large → fine-tune
• You need speed (latency <50ms) → fine-tune small model
• Limited budget, no per-request API cost → fine-tune
1b. Where Do You Run Fine-Tuning? — GPU, VRAM, and Complete Setup
The most common question: "Where do I run this? Can my computer handle it? How much GPU do I need?" Short answer: fine-tuning BERT/DistilBERT works on free Google Colab. But for large models (LLaMA, Mistral), you need bigger GPUs. Here's the complete guide:
🧮 How Much VRAM Do You Need?
VRAM (Video RAM) = memory on the GPU. Bigger models need more VRAM. Rule of thumb: VRAM ≈ 4× model size for fine-tuning (stores model + gradients + optimizer states + activations).
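The 4× rule can be sketched as a back-of-envelope calculation. This is a hypothetical helper (not part of any library), assuming FP32 Adam training: weights (4 B/param) + gradients (4 B/param) + two Adam moment buffers (8 B/param) ≈ 16 bytes per parameter, i.e. about 4× the FP32 model size. Activations come on top and grow with batch size and sequence length, which is why the table below shows larger totals for small models.

```python
# Hypothetical back-of-envelope VRAM estimator for full fine-tuning
# with Adam in FP32. Activations are NOT included — they depend on
# batch size and sequence length and must be added on top.
def finetune_vram_gb(num_params: int) -> float:
    bytes_per_param = 4 + 4 + 8  # weights + gradients + Adam moments
    return num_params * bytes_per_param / 1e9

print(f"DistilBERT (66M):  ~{finetune_vram_gb(66_000_000):.1f} GB + activations")
print(f"BERT base (110M):  ~{finetune_vram_gb(110_000_000):.1f} GB + activations")
```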
| Model | Parameters | Model Size | VRAM Fine-Tune (FP16) | VRAM Inference | Fits on Colab T4? |
|---|---|---|---|---|---|
| DistilBERT | 66M | ~250 MB | ~4 GB | ~1 GB | ✅ Yes, very comfortable |
| BERT base | 110M | ~420 MB | ~6 GB | ~2 GB | ✅ Yes, comfortable |
| BERT large | 340M | ~1.3 GB | ~12 GB | ~3 GB | ⚠️ Tight (small batch) |
| RoBERTa base | 125M | ~475 MB | ~7 GB | ~2 GB | ✅ Yes |
| GPT-2 small | 117M | ~500 MB | ~6 GB | ~2 GB | ✅ Yes |
| GPT-2 medium | 345M | ~1.4 GB | ~13 GB | ~3 GB | ⚠️ Tight |
| LLaMA 3.2 1B | 1B | ~4 GB | ~16 GB (LoRA) | ~5 GB | ⚠️ Only with LoRA |
| LLaMA 3.2 3B | 3B | ~12 GB | ~24 GB (LoRA) | ~7 GB | ❌ Need A100/L4 |
| Mistral 7B | 7B | ~14 GB | ~40 GB (LoRA) | ~15 GB | ❌ Need A100 40GB+ |
| LLaMA 3.1 70B | 70B | ~140 GB | ~160 GB (QLoRA) | ~40 GB | ❌ Multi-GPU A100 |
🥇 Option 1: Google Colab — RECOMMENDED for This Page
All the code on this page is designed to run on free Google Colab with a T4 GPU (16GB VRAM). DistilBERT + IMDB fine-tuning needs ~4GB VRAM and finishes in about 15 minutes. Here's the complete step-by-step:
# ═══════════════════════════════════════════════════
# STEP 1: Open Google Colab
# → colab.research.google.com → New Notebook
# ═══════════════════════════════════════════════════
# STEP 2: Enable the GPU
# → Menu: Runtime → Change runtime type
# → Hardware accelerator: T4 GPU
# → Save

# STEP 3: Verify the GPU (first cell!)
!nvidia-smi
# Should show: Tesla T4, 15360MiB (16GB VRAM)
# If "No GPU": repeat Step 2

import torch
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
# CUDA available: True
# GPU: Tesla T4
# VRAM: 15.8 GB

# STEP 4: Install/update libraries
!pip install -q transformers datasets accelerate evaluate

# STEP 5: Check available VRAM before training
print(f"VRAM used: {torch.cuda.memory_allocated()/1e9:.1f} GB")
print(f"VRAM free: {(torch.cuda.get_device_properties(0).total_memory - torch.cuda.memory_allocated())/1e9:.1f} GB")
# VRAM used: 0.0 GB
# VRAM free: 15.8 GB → PLENTY for DistilBERT!

# STEP 6: Run the fine-tuning code from Section 7!
# Copy-paste the entire 14_imdb_finetune.py code into a new cell → Run
# Training: ~15 minutes for 3 epochs
# Result: 93%+ accuracy ✅

# ═══════════════════════════════════════════════════
# IMPORTANT: Colab Session Management
# ═══════════════════════════════════════════════════
# • Session timeout: ~90 minutes idle, ~12 hours max
# • ALL files are lost when the session dies!
# • SOLUTION: save the model to Google Drive:
from google.colab import drive
drive.mount('/content/drive')
trainer.save_model('/content/drive/MyDrive/my-model')
# → Model is safe in Google Drive, never lost!
# • Or push to the Hugging Face Hub (Section 9):
#   trainer.push_to_hub() → stored permanently on the Hub
🖥️ Option 2: Local Computer with NVIDIA GPU
If you have an NVIDIA GPU on your desktop/laptop (RTX 3060+, RTX 4060+), you can fine-tune locally. Advantages: no timeout, persistent files, faster than Colab for long training.
# 1. Check your GPU
nvidia-smi
# Should show the GPU name + VRAM
# RTX 3060: 12GB ✅ (enough for BERT base)
# RTX 4060: 8GB  ⚠️ (enough for DistilBERT, tight for BERT)
# RTX 4090: 24GB ✅ (luxurious, even for GPT-2 medium)

# 2. Set up an environment
python -m venv hf-finetune
source hf-finetune/bin/activate

# 3. Install PyTorch with CUDA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# 4. Install the HF stack
pip install transformers datasets accelerate evaluate

# 5. Run the script
python 14_imdb_finetune.py
# Training runs on your local GPU!
☁️ Option 3: Cloud GPU (for Large Models)
For models > 1B parameters (LLaMA, Mistral), Colab T4 isn't enough. You need bigger GPUs:
| Platform | GPU | VRAM | Price | Best For |
|---|---|---|---|---|
| Google Colab Free | T4 | 16 GB | Free | BERT, DistilBERT, GPT-2 ⭐ |
| Google Colab Pro | V100 / A100 | 16-40 GB | $10/month | BERT large, GPT-2 medium |
| Kaggle Notebooks | P100 / T4 | 16 GB | Free (30 hrs/week) | Same as free Colab |
| Lambda Cloud | A100 / H100 | 40-80 GB | $1.10-3.29/hr | LLaMA 7B, Mistral 7B |
| RunPod | A100 / H100 | 40-80 GB | $0.74-2.49/hr | LLaMA, heavy training |
| Vast.ai | Various | Various | $0.15-2.00/hr | Budget cloud GPU |
| AWS (p4d) | A100 ×8 | 320 GB | $32/hr | Enterprise, 70B+ models |
🆘 Troubleshooting: CUDA Out of Memory (OOM)
The most common error during fine-tuning: CUDA out of memory. It means your GPU VRAM isn't enough. Here are solutions, from easiest:
# ═══════════════════════════════════════
# OOM SOLUTIONS — from easy to advanced
# ═══════════════════════════════════════

# ① Reduce batch size (EASIEST!)
# OOM with batch_size=16? Try 8, then 4, then 2
args = TrainingArguments(
    per_device_train_batch_size=8,   # reduced from 16
    gradient_accumulation_steps=4,   # effective batch = 8×4 = 32
    # ↑ IMPORTANT: add gradient_accumulation so the effective batch stays large!
    # Without it, a small batch = unstable training
)

# ② Enable FP16 mixed precision
args = TrainingArguments(
    fp16=True,  # halves memory usage! (~50% less VRAM)
)

# ③ Reduce the tokenizer max_length
def tokenize(batch):
    return tokenizer(
        batch["text"],
        truncation=True,
        max_length=128,  # reduced from 256/512
        # Shorter sequence = less memory per sample
        # Trade-off: loses information from long texts
    )

# ④ Use a smaller model
# BERT base (110M) OOM? → Use DistilBERT (66M)
# DistilBERT = 40% smaller, 60% faster, 97% of BERT's accuracy
model_name = "distilbert-base-uncased"  # instead of "bert-base-uncased"

# ⑤ Gradient checkpointing (advanced — saves ~30% VRAM)
model.gradient_checkpointing_enable()
# Trade-off: 20-30% slower training, but uses less memory
# How: recompute activations during the backward pass instead of storing them

# ⑥ Clear the GPU cache (if errors keep cascading)
import torch
torch.cuda.empty_cache()
import gc
gc.collect()
# In Colab: Runtime → Restart Runtime → rerun

# ⑦ Monitor VRAM during training
print(f"VRAM used:  {torch.cuda.memory_allocated()/1e9:.2f} GB")
print(f"VRAM peak:  {torch.cuda.max_memory_allocated()/1e9:.2f} GB")
print(f"VRAM total: {torch.cuda.get_device_properties(0).total_memory/1e9:.2f} GB")

# ═══════════════════════════════════════
# CHEAT SHEET: How much VRAM do I need?
# ═══════════════════════════════════════
# DistilBERT + batch 16 + max_len 256 + FP16   → ~4 GB  ✅ free Colab
# BERT base  + batch 16 + max_len 256 + FP16   → ~6 GB  ✅ free Colab
# BERT base  + batch 32 + max_len 512 + FP16   → ~13 GB ⚠️ tight
# RoBERTa large + batch 8 + max_len 256        → ~14 GB ⚠️ tight
# GPT-2 medium + batch 4 + max_len 512 + FP16  → ~14 GB ⚠️ tight
🎉 TL;DR for This Page 2:
1. Open colab.research.google.com
2. Runtime → Change runtime type → T4 GPU
3. Cell 1: !pip install -q transformers datasets accelerate evaluate
4. Cell 2: !nvidia-smi (verify Tesla T4)
5. Cell 3+: Copy-paste code from Section 7 (IMDB project) → Run All
6. Wait ~15 min → 93%+ accuracy ✅
7. Save to Google Drive: trainer.save_model('/content/drive/MyDrive/...')
No need to install anything on your laptop. No GPU needed. No payment. Everything runs on Google's free cloud.
💡 No GPU at All?
Fine-tuning on CPU only also WORKS — but very slow. DistilBERT on IMDB: GPU T4 = 15 minutes, CPU = 6-8 hours. For learning and small experiments (subset of 1000 samples), CPU is still usable. For full training, use Colab GPU.
2. Datasets Library — Load, Inspect, Filter, Preprocess
The datasets library from Hugging Face provides access to 100,000+ datasets from the Hub, plus tools for efficiently processing large datasets. Data is stored in Apache Arrow format (memory-mapped) — meaning you can work with datasets larger than RAM without issues.
from datasets import load_dataset, Dataset, DatasetDict

# ===========================
# 1. Load a dataset from the Hub (100k+ datasets!)
# ===========================
dataset = load_dataset("imdb")
print(dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['text', 'label'],
#         num_rows: 25000
#     })
#     test: Dataset({
#         features: ['text', 'label'],
#         num_rows: 25000
#     })
# })

# ===========================
# 2. Inspect data
# ===========================
print(dataset["train"][0])
# {'text': 'I rented this movie because...', 'label': 0}
# label: 0 = negative, 1 = positive

print(dataset["train"].features)
# {'text': Value(dtype='string'), 'label': ClassLabel(names=['neg', 'pos'])}
print(dataset["train"].column_names)  # ['text', 'label']
print(dataset["train"].num_rows)      # 25000

# Slicing (like pandas!)
first_5 = dataset["train"][:5]        # first 5 rows (dict of lists)
texts = dataset["train"]["text"][:5]  # first 5 texts
labels = dataset["train"]["label"]    # ALL labels (efficient!)

# ===========================
# 3. Filter data
# ===========================
# Keep only positive reviews
positive_only = dataset["train"].filter(lambda x: x["label"] == 1)
print(f"Positive reviews: {len(positive_only)}")  # 12500

# Keep only short reviews
short = dataset["train"].filter(lambda x: len(x["text"]) < 500)
print(f"Short reviews: {len(short)}")

# ===========================
# 4. Map — apply a function to every row
# ===========================
def add_length(example):
    example["text_length"] = len(example["text"])
    return example

dataset_with_length = dataset["train"].map(add_length)
print(dataset_with_length[0]["text_length"])  # 843

# Batched map (MUCH faster for tokenization!)
def batch_add_length(batch):
    batch["text_length"] = [len(t) for t in batch["text"]]
    return batch

dataset_batched = dataset["train"].map(batch_add_length, batched=True)

# ===========================
# 5. Load from various sources
# ===========================
# From the Hub (most common)
glue = load_dataset("glue", "mrpc")  # GLUE benchmark, MRPC task
squad = load_dataset("squad")        # Question Answering
wnut = load_dataset("wnut_17")       # NER dataset
ag_news = load_dataset("ag_news")    # News classification

# From your own CSV (most important for custom data!)
my_data = load_dataset("csv", data_files={
    "train": "data/train.csv",
    "test": "data/test.csv"
})

# From a pandas DataFrame
import pandas as pd
df = pd.DataFrame({"text": ["Great!", "Bad!"], "label": [1, 0]})
my_dataset = Dataset.from_pandas(df)

# From JSON
json_ds = load_dataset("json", data_files="data.jsonl")

# ===========================
# 6. Train/val split
# ===========================
split = dataset["train"].train_test_split(test_size=0.2, seed=42)
print(split)
# DatasetDict({
#     train: Dataset(20000 rows)
#     test: Dataset(5000 rows)
# })

# ===========================
# 7. Shuffle & select
# ===========================
shuffled = dataset["train"].shuffle(seed=42)
small_train = dataset["train"].shuffle(seed=42).select(range(1000))  # first 1000
print(f"Small subset: {len(small_train)} rows")  # 1000
# Great for quick experiments before full training!
🎓 Why datasets Library, Not Pandas?
Memory: datasets uses Apache Arrow (memory-mapped). 50GB dataset? Doesn't need 50GB RAM — reads only what's needed.
Speed: .map() with batched=True + multiprocessing → 10-100× faster than pandas apply.
Integration: Output directly compatible with Trainer API — no conversion needed.
Hub: 100k+ datasets ready to download in one line. Pandas doesn't have this.
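The `batched=True` contract is a common stumbling block: the mapped function receives a dict of column lists, not one row. A minimal sketch in plain Python (no `datasets` dependency, hypothetical toy data) of what `.map(batched=True)` passes to your function:

```python
# Sketch of the .map(batched=True) calling convention: the function
# gets a dict of column lists ("batch"), processes whole columns at
# once, and returns new/updated columns.
def batch_add_length(batch):
    batch["text_length"] = [len(t) for t in batch["text"]]
    return batch

batch = {"text": ["Great!", "Awful movie", "Meh"], "label": [1, 0, 0]}
out = batch_add_length(batch)
print(out["text_length"])  # → [6, 11, 3]
```

One Python-level function call handles the whole batch, which is exactly why batched mapping beats per-row `apply`-style loops.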
3. Dataset Tokenization — Batch Tokenize with .map()
from transformers import AutoTokenizer
from datasets import load_dataset

# ===========================
# 1. Load tokenizer & dataset
# ===========================
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
dataset = load_dataset("imdb")

# ===========================
# 2. Define the tokenization function
# ===========================
def tokenize_function(examples):
    """Tokenize a batch of texts.

    IMPORTANT: this function receives a BATCH (dict of lists),
    not a single example! Because we use batched=True in .map()
    → much faster.
    """
    return tokenizer(
        examples["text"],
        truncation=True,  # cut off if > max_length
        max_length=256,   # max token length
        # padding=True,   # ❌ do NOT pad here!
        # Padding is done by the DataCollator (Section 4)
        # → more efficient: pads per batch, not to a global max
    )

# ===========================
# 3. Apply tokenization to the entire dataset
# ===========================
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,             # process batches (MUCH faster!)
    num_proc=4,               # use 4 CPU cores
    remove_columns=["text"],  # drop the original text (no longer needed)
)

print(tokenized_dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['label', 'input_ids', 'attention_mask'],
#         num_rows: 25000
#     })
#     test: Dataset({
#         features: ['label', 'input_ids', 'attention_mask'],
#         num_rows: 25000
#     })
# })

# Check one example
print(tokenized_dataset["train"][0].keys())
# dict_keys(['label', 'input_ids', 'attention_mask'])
print(f"Token length: {len(tokenized_dataset['train'][0]['input_ids'])}")
# Varies per example! (because we did NOT pad here)
# Some: 87 tokens, some: 256 tokens (truncated)

# ===========================
# 4. Set format for PyTorch
# ===========================
tokenized_dataset.set_format("torch")  # convert to PyTorch tensors
print(type(tokenized_dataset["train"][0]["input_ids"]))
# <class 'torch.Tensor'>
🎓 Why NOT Pad at the Tokenization Step?
If you pad all sequences to max_length=256 during tokenization, a short review (20 words) will have 236 padding tokens. Wastes memory and compute!
Better: leave sequences with different lengths, then use DataCollator (section 4) that pads per batch to the longest in that batch. Batch 1 might be max 87 tokens, batch 2 might be 156. Far more efficient!
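The savings are easy to quantify. A toy calculation with hypothetical token lengths for one batch of 8 reviews — transformer compute scales with the number of (real + padding) tokens processed:

```python
# Static vs dynamic padding cost for one batch (hypothetical lengths).
lengths = [87, 156, 43, 20, 101, 64, 33, 52]  # token counts per review

static_tokens = 256 * len(lengths)            # pad everything to max_length
dynamic_tokens = max(lengths) * len(lengths)  # pad to the longest in the batch

print(static_tokens, dynamic_tokens)          # 2048 1248
print(f"wasted by static padding: {1 - sum(lengths) / static_tokens:.0%}")
```

Here dynamic padding already cuts the processed tokens by ~40%, and the gap widens on batches of mostly short texts.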
4. DataCollator — Dynamic Padding per Batch
from transformers import DataCollatorWithPadding

# ===========================
# DataCollatorWithPadding — the smart way to pad
# ===========================
data_collator = DataCollatorWithPadding(
    tokenizer=tokenizer,
    padding=True,            # pad to the longest in the batch
    # padding="max_length",  # pad to max_length (fixed)
    # max_length=256,        # only if padding="max_length"
    return_tensors="pt",     # return PyTorch tensors
)

# Simulation: 3 examples with different lengths
examples = [
    tokenized_dataset["train"][0],  # maybe 87 tokens
    tokenized_dataset["train"][1],  # maybe 156 tokens
    tokenized_dataset["train"][2],  # maybe 43 tokens
]

batch = data_collator(examples)
print(f"Padded batch shape: {batch['input_ids'].shape}")
# torch.Size([3, 156]) ← padded to the longest (156), NOT 256!
print(f"Attention mask: {batch['attention_mask'][2][:10]}")
# tensor([1, 1, 1, ..., 1, 0, 0, 0, ...]) ← 1=real, 0=padding

# ===========================
# Why does this matter for performance?
# ===========================
# Static padding (max_length=256):
#   Batch of short reviews (avg 50 tokens):
#   → compute on 256 tokens × batch_size = 80% WASTED compute!
#
# Dynamic padding (DataCollator):
#   Batch of short reviews (avg 50 tokens):
#   → compute on ~55 tokens × batch_size = OPTIMAL!
#   → 4-5× faster training on datasets with variable-length text!
5. TrainingArguments — EVERY Parameter Explained
from transformers import TrainingArguments

args = TrainingArguments(
    # ═══════════════════════════
    # OUTPUT & ORGANIZATION
    # ═══════════════════════════
    output_dir="./results",          # checkpoints & logs saved here
    overwrite_output_dir=True,       # overwrite if it exists
    run_name="bert-imdb-v1",         # name for W&B / TensorBoard

    # ═══════════════════════════
    # TRAINING HYPERPARAMETERS
    # ═══════════════════════════
    num_train_epochs=3,              # total training epochs
    # max_steps=1000,                # alternative: stop after N steps
    per_device_train_batch_size=16,  # batch per GPU (T4: 16-32 for BERT)
    per_device_eval_batch_size=64,   # eval batch (larger OK — no gradients)
    gradient_accumulation_steps=2,   # effective batch = 16 × 2 = 32
    # ↑ Simulates a larger batch on a small GPU!

    learning_rate=2e-5,              # THE MOST IMPORTANT HYPERPARAMETER!
    # BERT/RoBERTa: 2e-5 to 5e-5 (sweet spot)
    # DistilBERT: 2e-5 to 5e-5
    # Large models (LLaMA): 1e-5 to 2e-5
    # NEVER > 1e-4! It will wreck the pre-trained weights.

    weight_decay=0.01,               # L2 regularization (prevents overfitting)
    warmup_ratio=0.1,                # warm up over 10% of training steps
    # warmup_steps=500,              # alternative: fixed warmup steps
    lr_scheduler_type="linear",      # linear decay after warmup
    # "cosine", "cosine_with_restarts", "polynomial", "constant"

    # ═══════════════════════════
    # EVALUATION
    # ═══════════════════════════
    eval_strategy="epoch",           # evaluate every epoch
    # eval_strategy="steps",         # evaluate every N steps
    # eval_steps=500,                # (if strategy="steps")

    # ═══════════════════════════
    # SAVING
    # ═══════════════════════════
    save_strategy="epoch",           # save a checkpoint every epoch
    save_total_limit=2,              # keep only the last 2 checkpoints
    load_best_model_at_end=True,     # load the best model after training!
    metric_for_best_model="f1",      # "best" = highest F1
    # metric_for_best_model="eval_loss",  # or lowest loss

    # ═══════════════════════════
    # PERFORMANCE
    # ═══════════════════════════
    fp16=True,                       # mixed precision (2× faster on T4!)
    # bf16=True,                     # bfloat16 (A100, H100)
    dataloader_num_workers=4,        # parallel data loading
    dataloader_pin_memory=True,      # faster CPU→GPU transfer

    # ═══════════════════════════
    # LOGGING
    # ═══════════════════════════
    logging_dir="./logs",            # TensorBoard logs
    logging_steps=100,               # log every 100 steps
    report_to="tensorboard",         # or "wandb", "none"

    # ═══════════════════════════
    # HUB INTEGRATION
    # ═══════════════════════════
    push_to_hub=False,               # auto-push to the HF Hub after training
    # hub_model_id="username/my-model",  # Hub repo name
    # hub_strategy="every_save",     # push every checkpoint
)
🎓 5 Most Important Parameters for Fine-Tuning:
1. learning_rate = 2e-5 → Start here. Too high (>1e-4) = destroys pre-trained weights. Too low (<1e-6) = doesn't learn.
2. num_train_epochs = 3 → BERT usually converges in 2-4 epochs. Longer = overfitting.
3. per_device_train_batch_size = 16 → Limited by GPU memory. T4 16GB: batch 16-32 for BERT. OOM? Reduce batch, increase gradient_accumulation_steps.
4. fp16 = True → Always enable on GPUs with Tensor Cores (T4+). 2× faster, 50% less memory.
5. weight_decay = 0.01 → Standard regularization. No need to change unless severe overfitting.
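How these numbers interact can be checked with simple arithmetic. A sketch using the values from the IMDB project on this page (25,000 training rows; note that with gradient accumulation, `Trainer` counts optimizer steps, not forward passes):

```python
# Effective batch size, optimizer steps per epoch, and warmup steps
# for the settings used in the IMDB project below.
train_size = 25_000
per_device_batch = 16
grad_accum = 2
epochs = 3
warmup_ratio = 0.1

effective_batch = per_device_batch * grad_accum  # what the optimizer "sees"
steps_per_epoch = train_size // effective_batch  # optimizer steps per epoch
total_steps = steps_per_epoch * epochs
warmup_steps = int(total_steps * warmup_ratio)   # LR ramps up over these

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
# → 32 781 2343 234
```

So `warmup_ratio=0.1` here means the learning rate ramps from 0 to 2e-5 over the first ~234 optimizer steps, then decays linearly.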
6. Compute Metrics — Custom F1, Precision, Recall
import numpy as np
import evaluate

# ===========================
# 1. Load metrics from the evaluate library
# ===========================
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")
precision_metric = evaluate.load("precision")
recall_metric = evaluate.load("recall")

# ===========================
# 2. Define the compute_metrics function
#    Trainer calls this after every evaluation
# ===========================
def compute_metrics(eval_pred):
    """Compute accuracy, F1, precision, recall.

    Args:
        eval_pred: EvalPrediction object with:
            .predictions: model logits (batch, num_labels)
            .label_ids: true labels (batch,)

    Returns:
        dict of metric_name: value
    """
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # logits → class index

    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)
    f1 = f1_metric.compute(predictions=predictions, references=labels, average="weighted")
    precision = precision_metric.compute(predictions=predictions, references=labels, average="weighted")
    recall = recall_metric.compute(predictions=predictions, references=labels, average="weighted")

    return {
        "accuracy": accuracy["accuracy"],
        "f1": f1["f1"],
        "precision": precision["precision"],
        "recall": recall["recall"],
    }

# This function will be passed to Trainer:
# trainer = Trainer(..., compute_metrics=compute_metrics)

# Output during training:
# Epoch 1: {'accuracy': 0.921, 'f1': 0.921, 'precision': 0.923, 'recall': 0.921}
# Epoch 2: {'accuracy': 0.934, 'f1': 0.934, 'precision': 0.935, 'recall': 0.934}
7. Project: Fine-Tune DistilBERT on IMDB — 93%+
#!/usr/bin/env python3
"""
🔥 Fine-Tune DistilBERT on IMDB Sentiment Analysis
Expected: 93%+ accuracy in ~15 minutes on Google Colab T4
Combines: Sections 2-6 of this page
"""
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
)

# ═══════════════════════════════════════
# STEP 1: LOAD DATASET
# ═══════════════════════════════════════
print("📊 Loading IMDB dataset...")
dataset = load_dataset("imdb")
print(f"   Train: {len(dataset['train'])} | Test: {len(dataset['test'])}")
#    Train: 25000 | Test: 25000

# Optional: use a subset for quick testing
# dataset["train"] = dataset["train"].shuffle(42).select(range(5000))
# dataset["test"] = dataset["test"].shuffle(42).select(range(1000))

# ═══════════════════════════════════════
# STEP 2: LOAD TOKENIZER & MODEL
# ═══════════════════════════════════════
print("🤖 Loading model & tokenizer...")
model_name = "distilbert-base-uncased"
# Alternatives:
#   "bert-base-uncased"              → more accurate, slower
#   "roberta-base"                   → best accuracy for English
#   "xlm-roberta-base"               → multilingual (100+ languages)
#   "indobenchmark/indobert-base-p1" → Indonesian!

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,  # binary: positive/negative
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1},
)
print(f"   Model params: {model.num_parameters():,}")
# 66,955,010 for DistilBERT (half of BERT!)

# ═══════════════════════════════════════
# STEP 3: TOKENIZE DATASET
# ═══════════════════════════════════════
print("✂️ Tokenizing...")
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
print(f"   Tokenized features: {tokenized['train'].column_names}")

# ═══════════════════════════════════════
# STEP 4: DATA COLLATOR
# ═══════════════════════════════════════
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# ═══════════════════════════════════════
# STEP 5: METRICS
# ═══════════════════════════════════════
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "f1": f1.compute(predictions=preds, references=labels, average="weighted")["f1"],
    }

# ═══════════════════════════════════════
# STEP 6: TRAINING ARGUMENTS
# ═══════════════════════════════════════
args = TrainingArguments(
    output_dir="./imdb-distilbert",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    gradient_accumulation_steps=2,  # effective batch = 32
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    fp16=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    save_total_limit=2,
    logging_steps=100,
    report_to="none",  # or "tensorboard" / "wandb"
)

# ═══════════════════════════════════════
# STEP 7: CREATE TRAINER & TRAIN!
# ═══════════════════════════════════════
print("🏋️ Starting training...")
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()

# ═══════════════════════════════════════
# STEP 8: EVALUATE
# ═══════════════════════════════════════
results = trainer.evaluate()
print(f"\n🎬 Final Results:")
print(f"   Accuracy: {results['eval_accuracy']:.1%}")
print(f"   F1 Score: {results['eval_f1']:.1%}")
# 🎬 Final Results:
#    Accuracy: 93.2%
#    F1 Score: 93.2%

# ═══════════════════════════════════════
# STEP 9: SAVE & TEST
# ═══════════════════════════════════════
trainer.save_model("./imdb-distilbert-final")
tokenizer.save_pretrained("./imdb-distilbert-final")

# Test with a pipeline!
from transformers import pipeline
pipe = pipeline("sentiment-analysis", model="./imdb-distilbert-final", device=0)

tests = [
    "This movie was absolutely incredible, best I've seen all year!",
    "Terrible film, waste of two hours of my life.",
    "It was okay. Not great, not terrible. Average at best.",
]
for text in tests:
    result = pipe(text)[0]
    print(f"   {result['label']:8s} ({result['score']:.1%}): {text[:60]}...")
# POSITIVE (99.7%): This movie was absolutely incredible, best I've see...
# NEGATIVE (99.8%): Terrible film, waste of two hours of my life...
# NEGATIVE (68.3%): It was okay. Not great, not terrible. Average at...

print("\n🏆 Fine-tuning complete!")
🎬 93.2% Accuracy in 15 Minutes!
Comparison of our evolution across three series:
• NN Series Page 5 (manual NumPy LSTM): ~80% (hours)
• TF Series Page 5 (BiLSTM Keras): ~87% (30 min)
• TF Series Page 6 (BERT TF Hub): ~95% (complex setup)
• HF Series Page 2 (Trainer API): 93.2% (15 min, 40 lines!) 🏆
With RoBERTa-base or DeBERTa you can reach 95%+ — just change model_name!
8. Fine-Tune on Custom CSV Dataset — Your Own Data
```python
from datasets import load_dataset, ClassLabel
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    Trainer, TrainingArguments, DataCollatorWithPadding,
)
import numpy as np, evaluate

# ═══════════════════════════════════════
# STEP 1: Prepare your CSV
# ═══════════════════════════════════════
# Required CSV format:
# ┌──────────────────────────────────────┬───────┐
# │ text                                 │ label │
# ├──────────────────────────────────────┼───────┤
# │ Produk bagus, pengiriman cepat       │ 1     │
# │ Barang rusak, sangat kecewa          │ 0     │
# │ Lumayan lah untuk harganya           │ 1     │
# └──────────────────────────────────────┴───────┘
# Minimum: 500-1000 samples per class
# Ideal:   5000+ per class

# Load from CSV
dataset = load_dataset("csv", data_files={
    "train": "data/train.csv",
    "test": "data/test.csv",
})

# If you only have 1 file: split manually
# dataset = load_dataset("csv", data_files="data/all_data.csv")
# dataset = dataset["train"].train_test_split(test_size=0.2, seed=42)

# ═══════════════════════════════════════
# STEP 2: Inspect & clean
# ═══════════════════════════════════════
print(dataset["train"][:3])
print(f"Columns: {dataset['train'].column_names}")

# Rename columns if needed
# dataset = dataset.rename_column("review_text", "text")
# dataset = dataset.rename_column("sentiment", "label")

# Remove rows with None
dataset = dataset.filter(lambda x: x["text"] is not None and x["label"] is not None)

# Map string labels to integers (if needed)
# label_map = {"positive": 1, "negative": 0, "neutral": 2}
# dataset = dataset.map(lambda x: {"label": label_map[x["label"]]})

NUM_LABELS = len(set(dataset["train"]["label"]))
print(f"Number of classes: {NUM_LABELS}")

# ═══════════════════════════════════════
# STEPS 3-7: Same as the IMDB project above!
# ═══════════════════════════════════════
model_name = "distilbert-base-uncased"
# or "indobenchmark/indobert-base-p1" for Indonesian!

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=NUM_LABELS)
tokenized = dataset.map(
    lambda b: tokenizer(b["text"], truncation=True, max_length=256),
    batched=True, remove_columns=["text"],
)
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        "./my-custom-model",
        num_train_epochs=3,
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        fp16=True,
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="f1",
    ),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,  # same function as in the IMDB project
)
trainer.train()
trainer.save_model("./my-custom-model-final")
print("🏆 Custom model ready!")
```
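Before pointing `load_dataset("csv", ...)` at your data, it can help to generate and sanity-check a toy file in the exact two-column layout shown above using only the standard library. A minimal sketch — the `data/train.csv` path matches the snippet, and the three rows are the example reviews from the format table:

```python
import csv
import os

# Write a tiny CSV in the (text, label) format that
# load_dataset("csv", ...) expects: header row + one sample per line.
os.makedirs("data", exist_ok=True)

rows = [
    ("Produk bagus, pengiriman cepat", 1),
    ("Barang rusak, sangat kecewa", 0),
    ("Lumayan lah untuk harganya", 1),
]

with open("data/train.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "label"])  # header row is required
    writer.writerows(rows)

# Sanity check: read it back with the stdlib before handing it to datasets
with open("data/train.csv", newline="", encoding="utf-8") as f:
    records = list(csv.DictReader(f))

print(len(records))        # 3
print(records[0]["text"])  # Produk bagus, pengiriman cepat
print(records[0]["label"]) # '1' — note: DictReader returns strings,
                           # which is why the label_map step above may be needed
```

The `csv.writer` round trip also catches quoting problems early: review text often contains commas, and a hand-concatenated CSV would silently split such rows into extra columns.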
9. Push Model to Hugging Face Hub — Share with the World
```python
from huggingface_hub import login

# ===========================
# 1. Login to Hugging Face
# ===========================
login()  # prompts for token (huggingface.co/settings/tokens)
# or: login(token="hf_YOUR_TOKEN")
# or: set the HF_TOKEN environment variable

# ===========================
# 2. Push after training (Method A: Trainer)
# ===========================
trainer.push_to_hub(
    commit_message="Fine-tuned DistilBERT on IMDB sentiment",
    # model_name="my-imdb-classifier",  # optional: custom name
)
# → Uploads to: huggingface.co/YOUR_USERNAME/imdb-distilbert
#   Includes: model weights, tokenizer, config, training args

# ===========================
# 3. Push manually (Method B)
# ===========================
model.push_to_hub("my-imdb-classifier")
tokenizer.push_to_hub("my-imdb-classifier")

# ===========================
# 4. Anyone can now use your model!
# ===========================
from transformers import pipeline

# Anyone in the world:
pipe = pipeline("sentiment-analysis", model="YOUR_USERNAME/my-imdb-classifier")
print(pipe("Great product!"))
# [{'label': 'POSITIVE', 'score': 0.997}]

# ===========================
# 5. Add a model card (README.md)
# ===========================
# Auto-generated by Trainer! Includes:
# - Model description
# - Training hyperparameters
# - Evaluation results
# - Framework versions
# Edit at: huggingface.co/YOUR_USERNAME/my-imdb-classifier
```
10. Hyperparameter Search — Optuna Integration
```python
# pip install optuna
from transformers import Trainer, TrainingArguments

# Define the search space
def hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 5),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]),
        "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.1),
        "warmup_ratio": trial.suggest_float("warmup_ratio", 0.0, 0.2),
    }

# Important: for hyperparameter_search, the Trainer must be constructed
# with model_init=<function that returns a fresh model> instead of
# model=..., so that every trial starts from the same initialization.

# Run the search
best_trial = trainer.hyperparameter_search(
    hp_space=hp_space,
    direction="maximize",  # maximize the eval metric
    backend="optuna",      # or "ray"
    n_trials=20,           # try 20 combinations
    compute_objective=lambda m: m["eval_f1"],  # optimize F1
)
print(f"Best trial: {best_trial}")
# BestRun(run_id='7', objective=0.9412,
#         hyperparameters={'learning_rate': 3.2e-05, 'num_train_epochs': 3,
#                          'per_device_train_batch_size': 16, ...})
```
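To see what the search backend is conceptually doing per trial — sample hyperparameters from the declared space, evaluate an objective, keep the best run — here is a dependency-free sketch using stdlib random search over the same space. The objective `fake_eval_f1` is a synthetic stand-in (real trials would fine-tune the model and return `eval_f1`), and note that Optuna's default sampler (TPE) is smarter than this uniform sampling:

```python
import math
import random

random.seed(0)

# Synthetic objective peaked near lr=3e-5 so the loop finishes instantly.
# In the real search, each call would be a full fine-tuning run.
def fake_eval_f1(hp):
    return (0.94
            - 50.0 * abs(hp["learning_rate"] - 3e-5)
            - 0.005 * abs(hp["num_train_epochs"] - 3))

def sample_hp():
    # Mirrors hp_space above: log-uniform lr, int epochs, categorical batch size
    return {
        "learning_rate": math.exp(random.uniform(math.log(1e-5), math.log(5e-5))),
        "num_train_epochs": random.randint(2, 5),
        "per_device_train_batch_size": random.choice([8, 16, 32]),
        "weight_decay": random.uniform(0.0, 0.1),
        "warmup_ratio": random.uniform(0.0, 0.2),
    }

best_hp, best_score = None, -math.inf
for trial in range(20):                # n_trials=20
    hp = sample_hp()
    score = fake_eval_f1(hp)
    if score > best_score:             # direction="maximize"
        best_hp, best_score = hp, score

print(f"best objective={best_score:.4f}, lr={best_hp['learning_rate']:.2e}")
```

The log-uniform sampling for `learning_rate` matters: learning rates are compared on a multiplicative scale, so `log=True` in `suggest_float` gives 1e-5 to 2e-5 the same sampling weight as 2.5e-5 to 5e-5.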
11. Trainer vs Native PyTorch Loop — When to Use Which?
| Aspect | Trainer API | Native PyTorch Loop |
|---|---|---|
| Code | ~40 lines | ~150+ lines |
| FP16/BF16 | 1 flag: fp16=True | Manual AMP context manager |
| Multi-GPU | Automatic! | Manual DistributedDataParallel |
| Logging | Built-in (TB, W&B) | Manual logging |
| Checkpointing | Automatic | Manual save/load |
| HP Search | Built-in Optuna/Ray | Manual integration |
| Hub Push | 1 method call | Manual upload |
| Flexibility | High (callbacks) | Maximum (full control) |
| When to Use | Classification, NER, QA, Summarization — 90% of tasks | GAN, RL, custom loss, research |
💡 Rule of Thumb: Always start with Trainer. Only switch to a native PyTorch loop if you need something Trainer truly can't do (very rare). Page 3 will cover the native loop for advanced cases.
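For contrast, here is roughly what the "~150+ lines" column hides: the skeleton of a native loop, on toy data so it runs anywhere (a linear layer standing in for DistilBERT, random separable features standing in for tokenized IMDB). This is a sketch only — the comments mark where each Trainer convenience would have to be written by hand:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-ins: nn.Linear instead of a Transformer, random data whose
# label depends on feature 0, so the loss has something to minimize.
model = nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

X = torch.randn(256, 16)
y = (X[:, 0] > 0).long()
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X, y), batch_size=32, shuffle=True)

epoch_losses = []
for epoch in range(3):                    # num_train_epochs=3
    running = 0.0
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)     # forward pass
        loss.backward()                   # backward pass
        optimizer.step()                  # parameter update
        running += loss.item() * len(xb)
    epoch_losses.append(running / len(X))
    # Everything Trainer gives you for free goes here by hand:
    # eval loop + compute_metrics, LR warmup/schedule, checkpointing,
    # load_best_model_at_end, autocast/GradScaler for fp16, DDP for
    # multi-GPU, logging to TensorBoard/W&B, push_to_hub...

torch.save(model.state_dict(), "toy_checkpoint.pt")  # manual checkpointing
print(epoch_losses)  # decreasing per-epoch average loss
```

Even this stripped-down version is longer than the entire `Trainer(...)` call above, which is the point of the rule of thumb: write this only when you need control the Trainer callbacks cannot give you.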
12. Page 2 Summary
| Concept | What It Is | Key Code |
|---|---|---|
| Fine-tuning | Adapt pre-trained to your task | from_pretrained(name, num_labels=N) |
| Datasets | Efficient data loading & processing | load_dataset("imdb") |
| .map() | Apply function to dataset | dataset.map(fn, batched=True) |
| DataCollator | Dynamic padding per batch | DataCollatorWithPadding(tokenizer) |
| TrainingArguments | All hyperparameters | TrainingArguments(lr=2e-5, fp16=True) |
| compute_metrics | Custom evaluation | def compute_metrics(eval_pred) |
| Trainer | All-in-one training loop | Trainer(model, args, ...).train() |
| Custom CSV | Fine-tune on your data | load_dataset("csv", data_files=...) |
| Push to Hub | Share model with the world | trainer.push_to_hub() |
| HP Search | Auto-tune hyperparameters | trainer.hyperparameter_search() |
Page 1 — Introduction to Hugging Face & Pipelines
Coming Next: Page 3 — Fine-Tuning GPT & Text Generation
From BERT (encoder, classification) to GPT (decoder, generation)! Page 3 covers: encoder vs decoder differences, causal language modeling, fine-tuning GPT-2 for text generation, instruction tuning, prompt templates, generation parameters (temperature, top-k, top-p, beam search), and building a simple chatbot.