📝 This article is available in English & Bahasa Indonesia

🔥 Learn Hugging Face — Page 2

Fine-Tuning BERT &
Trainer API



Deep dive into fine-tuning — from pre-trained model to your custom classifier. Page 2 covers in depth: what fine-tuning is and why it works, Datasets library (load, inspect, filter, preprocess, tokenize, format), DataCollator and dynamic padding, complete Trainer API (TrainingArguments — every parameter explained), compute_metrics for custom evaluation (F1, precision, recall), fine-tuning DistilBERT on IMDB (93%+ accuracy in 15 minutes), fine-tuning on your own custom CSV dataset, pushing models to Hugging Face Hub, hyperparameter search with Optuna, and comparison: when to use Trainer vs native PyTorch loop.

📅 March 2026 · 40 min read
🏷 Fine-TuningBERTTrainerDatasetsIMDBDataCollatorHub PushHyperparameter
📚 Learn Hugging Face Series:

📑 Table of Contents — Page 2

  1. What Is Fine-Tuning? — Why not train from scratch
  1b. Where to Run? — GPU, VRAM, Colab setup, OOM troubleshooting
  2. Datasets Library — Load, inspect, filter, and preprocess
  3. Dataset Tokenization — Batch tokenize with .map()
  4. DataCollator — Dynamic padding for efficiency
  5. TrainingArguments — EVERY parameter explained
  6. Compute Metrics — Custom F1, Precision, Recall
  7. Project: Fine-Tune DistilBERT on IMDB — 93%+ accuracy
  8. Fine-Tune on Custom CSV Dataset — Your own data
  9. Push Model to Hugging Face Hub — Share with the world
  10. Hyperparameter Search — Optuna integration
  11. Trainer vs Native PyTorch — When to use which
  12. Summary & Page 3 Preview
🎓

1. What Is Fine-Tuning? — And Why It Works

BERT already "read" all of Wikipedia. You just need to teach it your specific task.

Fine-tuning = taking a model already pre-trained on massive data (BERT on Wikipedia + BookCorpus = 3.3 billion words), then retraining it on your small dataset for a specific task (e.g., review sentiment classification). The model already understands language — you just need to teach it the task.

Pre-Training vs Fine-Tuning — Two Training Stages

STAGE 1: Pre-Training (done by Google/Meta/OpenAI)
┌──────────────────────────────────────────────────────┐
│ Data: Wikipedia + BookCorpus (3.3 BILLION words)     │
│ Task: Masked Language Modeling (guess the [MASK])    │
│ Cost: $100k-$10M+ (hundreds of GPUs, for weeks)      │
│ Result: A model that UNDERSTANDS language (grammar,  │
│         semantics, context, nuance)                  │
│ You do NOT need to do this!                          │
└──────────────────────────────────────────────────────┘
              │ download from the Hub (1 line of code!)
              ▼
STAGE 2: Fine-Tuning (done by YOU)
┌──────────────────────────────────────────────────────┐
│ Data: Your SMALL dataset (1k - 100k samples)         │
│ Task: A specific task (sentiment, NER, QA, etc.)     │
│ Cost: $0 (free Google Colab, 15 minutes!)            │
│ Result: An accurate model for your task (93%+)       │
│ VERY SMALL learning rate (2e-5) → gentle tweaks      │
└──────────────────────────────────────────────────────┘

Analogy:
Pre-training = a master's graduate in Literature (understands language in general)
Fine-tuning  = job-specific training (film reviewer, customer support)
→ No need to teach the ABCs again — just focus on the task!
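The "very small learning rate → gentle tweaks" line deserves a number. A back-of-the-envelope sketch — the weight and gradient magnitudes below are illustrative assumptions, not measured BERT statistics:

```python
# Why lr=2e-5 only "tweaks" pre-trained weights instead of destroying them.
weight = 0.05   # assumed typical magnitude of a pre-trained weight
grad = 1.0      # assumed (unusually large) gradient for this weight

for lr in (2e-5, 1e-3, 1e-1):
    rel_change = lr * grad / abs(weight)
    print(f"lr={lr:g} → weight moves by {rel_change:.2%} of its magnitude")

# 2e-5 → 0.04% (a gentle tweak); 1e-1 → 200% (the weight is re-randomized)
```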

Why does fine-tuning work so much better than training from scratch?

| Aspect | Train from Scratch | Fine-Tuning |
|---|---|---|
| Data needed | Millions to billions of samples | 1,000-50,000 samples is enough |
| Training time | Days to weeks (multi-GPU) | 15-60 minutes (1 GPU) |
| Cost | $10k-$10M+ | $0 (free Colab) |
| Accuracy (IMDB) | ~80-85% (BiLSTM) | 93%+ (fine-tuned BERT) |
| Language understood? | Learns from scratch | Already understands grammar, semantics, context |
| Minimum GPU | Multi-GPU, V100/A100 | Free T4 on Colab |

🎓 Fine-Tuning ≠ Prompt Engineering
Fine-tuning: Modifies model weights (retrains) → model becomes a specialist for your task. Needs data + training. Result: a new model you save and deploy.
Prompt engineering: Gives instructions to a model without changing weights. No training needed. Result: response from the same model (GPT-4, Claude, etc.).

When is fine-tuning better than prompting?
• You have a specific labeled dataset → fine-tune
• You need 100% consistency (production) → fine-tune
• Small model must be as accurate as large → fine-tune
• You need speed (latency <50ms) → fine-tune small model
• Limited budget, no per-request API cost → fine-tune

💻

1b. Where Do You Run Fine-Tuning? — GPU, VRAM, and Complete Setup

Fine-tuning NEEDS a GPU — but the free Colab GPU is enough for BERT!

The most common question: "Where do I run this? Can my computer handle it? How much GPU do I need?" Short answer: fine-tuning BERT/DistilBERT works on free Google Colab. But for large models (LLaMA, Mistral), you need bigger GPUs. Here's the complete guide:

🧮 How Much VRAM Do You Need?

VRAM (Video RAM) = memory on the GPU. Bigger models need more VRAM. Rule of thumb: VRAM ≈ 4× model size for fine-tuning (stores model + gradients + optimizer states + activations).
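The 4× rule can be wrapped in a two-line estimator. This sketch encodes only the rule of thumb from the paragraph above; the table below shows higher figures because activation memory for a realistic batch and sequence length comes on top:

```python
def finetune_vram_gb(model_size_gb: float) -> float:
    """Rule-of-thumb fine-tuning VRAM: model weights + gradients
    + optimizer states ≈ 4× the model's size.
    Activations for the batch are extra, on top of this."""
    return 4 * model_size_gb

print(finetune_vram_gb(0.25))  # DistilBERT (~250 MB) → 1.0 GB + activations
print(finetune_vram_gb(1.3))   # BERT large (~1.3 GB) → 5.2 GB + activations
```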

| Model | Parameters | Model Size | VRAM Fine-Tune (FP16) | VRAM Inference | Fits on Colab T4? |
|---|---|---|---|---|---|
| DistilBERT | 66M | ~250 MB | ~4 GB | ~1 GB | ✅ Yes, very comfortable |
| BERT base | 110M | ~420 MB | ~6 GB | ~2 GB | ✅ Yes, comfortable |
| BERT large | 340M | ~1.3 GB | ~12 GB | ~3 GB | ⚠️ Tight (small batch) |
| RoBERTa base | 125M | ~475 MB | ~7 GB | ~2 GB | ✅ Yes |
| GPT-2 small | 117M | ~500 MB | ~6 GB | ~2 GB | ✅ Yes |
| GPT-2 medium | 345M | ~1.4 GB | ~13 GB | ~3 GB | ⚠️ Tight |
| LLaMA 3.2 1B | 1B | ~4 GB | ~16 GB (LoRA) | ~5 GB | ⚠️ Only with LoRA |
| LLaMA 3.2 3B | 3B | ~12 GB | ~24 GB (LoRA) | ~7 GB | ❌ Needs A100/L4 |
| Mistral 7B | 7B | ~14 GB | ~40 GB (LoRA) | ~15 GB | ❌ Needs A100 40GB+ |
| LLaMA 3.1 70B | 70B | ~140 GB | ~160 GB (QLoRA) | ~40 GB | ❌ Multi-GPU A100 |

🥇 Option 1: Google Colab — RECOMMENDED for This Page

All code in this Page 2 is designed to run on free Google Colab with T4 GPU (16GB VRAM). DistilBERT + IMDB fine-tuning = ~4GB VRAM, done in 15 minutes. Here's the complete step-by-step:

Google Colab — Step-by-Step Fine-Tuning Setup (python)
# ═══════════════════════════════════════════════════
# STEP 1: Open Google Colab
# → colab.research.google.com → New Notebook
# ═══════════════════════════════════════════════════

# STEP 2: Enable the GPU
# → Menu: Runtime → Change runtime type
# → Hardware accelerator: T4 GPU
# → Save

# STEP 3: Verify the GPU (first cell!)
!nvidia-smi
# Should show: Tesla T4, 15360MiB (16GB VRAM)
# If "No GPU": repeat Step 2

import torch
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
# CUDA available: True
# GPU: Tesla T4
# VRAM: 15.8 GB

# STEP 4: Install/update libraries
!pip install -q transformers datasets accelerate evaluate

# STEP 5: Check available VRAM before training
print(f"VRAM used: {torch.cuda.memory_allocated()/1e9:.1f} GB")
print(f"VRAM free: {(torch.cuda.get_device_properties(0).total_memory - torch.cuda.memory_allocated())/1e9:.1f} GB")
# VRAM used: 0.0 GB
# VRAM free: 15.8 GB → PLENTY for DistilBERT!

# STEP 6: Run the fine-tuning code from Section 7!
# Copy-paste the entire 14_imdb_finetune.py code into a new cell → Run
# Training: ~15 minutes for 3 epochs
# Result: 93%+ accuracy ✅

# ═══════════════════════════════════════════════════
# IMPORTANT: Colab Session Management
# ═══════════════════════════════════════════════════
# • Session timeout: ~90 minutes idle, ~12 hours max
# • ALL files are lost when the session dies!
# • SOLUTION: save the model to Google Drive:
from google.colab import drive
drive.mount('/content/drive')
trainer.save_model('/content/drive/MyDrive/my-model')
# → The model is safe in Google Drive, nothing lost!

# • Or push to the Hugging Face Hub (Section 9):
# trainer.push_to_hub()  → stored permanently on the Hub

🖥️ Option 2: Local Computer with NVIDIA GPU

If you have an NVIDIA GPU on your desktop/laptop (RTX 3060+, RTX 4060+), you can fine-tune locally. Advantages: no timeout, persistent files, faster than Colab for long training.

Terminal — Local GPU Setup (bash)
# 1. Check your GPU
nvidia-smi
# Should show the GPU name + VRAM
# RTX 3060: 12GB ✅ (enough for BERT base)
# RTX 4060: 8GB  ⚠️ (enough for DistilBERT, tight for BERT)
# RTX 4090: 24GB ✅ (luxurious, even for GPT-2 medium)

# 2. Set up the environment
python -m venv hf-finetune
source hf-finetune/bin/activate

# 3. Install PyTorch with CUDA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# 4. Install the HF stack
pip install transformers datasets accelerate evaluate

# 5. Run the script
python 14_imdb_finetune.py
# Training runs on your local GPU!

☁️ Option 3: Cloud GPU (for Large Models)

For models > 1B parameters (LLaMA, Mistral), Colab T4 isn't enough. You need bigger GPUs:

| Platform | GPU | VRAM | Price | Best For |
|---|---|---|---|---|
| Google Colab Free | T4 | 16 GB | Free | BERT, DistilBERT, GPT-2 ⭐ |
| Google Colab Pro | V100 / A100 | 16-40 GB | $10/month | BERT large, GPT-2 medium |
| Kaggle Notebooks | P100 / T4 | 16 GB | Free (30 hrs/week) | Same as free Colab |
| Lambda Cloud | A100 / H100 | 40-80 GB | $1.10-3.29/hr | LLaMA 7B, Mistral 7B |
| RunPod | A100 / H100 | 40-80 GB | $0.74-2.49/hr | LLaMA, heavy training |
| Vast.ai | Various | Various | $0.15-2.00/hr | Budget cloud GPU |
| AWS (p4d) | A100 ×8 | 320 GB | $32/hr | Enterprise, 70B+ models |

🆘 Troubleshooting: CUDA Out of Memory (OOM)

The most common error during fine-tuning: CUDA out of memory. It means your GPU VRAM isn't enough. Here are solutions, from easiest:

oom_troubleshoot.py — Fix CUDA Out of Memory (python)
# ═══════════════════════════════════════
# OOM FIXES — from easy to advanced
# ═══════════════════════════════════════
from transformers import TrainingArguments

# ① Reduce the batch size (EASIEST!)
# OOM with batch_size=16? Try 8, then 4, then 2
args = TrainingArguments(
    per_device_train_batch_size=8,     # down from 16
    gradient_accumulation_steps=4,     # effective batch = 8×4 = 32
    # ↑ IMPORTANT: add gradient accumulation so the effective batch stays large!
    # Without it, a small batch = unstable training
)

# ② Enable FP16 mixed precision
args = TrainingArguments(
    fp16=True,    # halves memory usage! (~50% less VRAM)
)

# ③ Reduce tokenizer max_length
def tokenize(batch):
    return tokenizer(batch["text"],
        truncation=True,
        max_length=128,     # down from 256/512
        # Shorter sequences = less memory per sample
        # Trade-off: loses information from long texts
    )

# ④ Use a smaller model
# BERT base (110M) OOM? → Use DistilBERT (66M)
# DistilBERT = 40% smaller, 60% faster, 97% of BERT's accuracy
model_name = "distilbert-base-uncased"  # instead of "bert-base-uncased"

# ⑤ Gradient checkpointing (advanced — saves ~30% VRAM)
model.gradient_checkpointing_enable()
# Trade-off: 20-30% slower training, but uses less memory
# How: recompute activations during the backward pass instead of storing them

# ⑥ Clear the GPU cache (if errors keep cascading)
import torch
torch.cuda.empty_cache()
import gc
gc.collect()
# In Colab: Runtime → Restart Runtime → run again

# ⑦ Monitor VRAM during training
print(f"VRAM used: {torch.cuda.memory_allocated()/1e9:.2f} GB")
print(f"VRAM peak: {torch.cuda.max_memory_allocated()/1e9:.2f} GB")
print(f"VRAM total: {torch.cuda.get_device_properties(0).total_memory/1e9:.2f} GB")

# ═══════════════════════════════════════
# CHEAT SHEET: How much VRAM do I need?
# ═══════════════════════════════════════
# DistilBERT + batch 16 + max_len 256 + FP16 → ~4 GB ✅ free Colab
# BERT base + batch 16 + max_len 256 + FP16  → ~6 GB ✅ free Colab
# BERT base + batch 32 + max_len 512 + FP16  → ~13 GB ⚠️ tight
# RoBERTa large + batch 8 + max_len 256      → ~14 GB ⚠️ tight
# GPT-2 medium + batch 4 + max_len 512 + FP16 → ~14 GB ⚠️ tight
🛤️ Decision Tree: Where to Run Fine-Tuning?

Which model do you want to fine-tune?
│
├── DistilBERT / BERT base / RoBERTa base / GPT-2 small
│   (< 200M parameters)
│   │
│   └── ✅ Google Colab FREE (T4, 16GB)
│       → Done in 15-30 minutes
│       → ALL code on this Page 2 runs here!
│
├── BERT large / GPT-2 medium / DeBERTa large
│   (200M - 500M parameters)
│   │
│   ├── ⚠️ Free Colab (small batch + gradient accumulation)
│   └── ✅ Colab Pro ($10/month, V100/A100) or Kaggle (free)
│
├── LLaMA 1B-3B / Mistral 7B / Falcon 7B
│   (1B - 7B parameters)
│   │
│   ├── Must use LoRA/QLoRA (Page 8)
│   └── ✅ RunPod / Lambda (A100, $1-3/hr)
│       or Colab Pro+ with an A100
│
└── LLaMA 70B / Mixtral 8x7B (> 10B parameters)
    │
    └── ✅ Multi-GPU cloud (AWS p4d, $32/hr)
        Must use QLoRA + DeepSpeed

This Page 2 focuses on DistilBERT/BERT → free Colab is enough! ✅
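The decision tree above can be sketched as a small helper. The thresholds mirror the tree; the function and its name are illustrative, not an official tool:

```python
def where_to_finetune(n_params: float) -> str:
    """Map a parameter count to a platform suggestion,
    following the decision tree above."""
    if n_params < 200e6:
        return "Google Colab FREE (T4, 16GB)"
    if n_params < 500e6:
        return "Free Colab (small batch) / Colab Pro / Kaggle"
    if n_params <= 10e9:
        return "LoRA/QLoRA on RunPod/Lambda A100 (or Colab Pro+)"
    return "Multi-GPU cloud (AWS p4d) with QLoRA + DeepSpeed"

print(where_to_finetune(66e6))   # DistilBERT → free Colab
print(where_to_finetune(7e9))    # Mistral 7B → A100 + LoRA/QLoRA
print(where_to_finetune(70e9))   # LLaMA 70B → multi-GPU cloud
```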

🎉 TL;DR for This Page 2:
1. Open colab.research.google.com
2. Runtime → Change runtime type → T4 GPU
3. Cell 1: !pip install -q transformers datasets accelerate evaluate
4. Cell 2: !nvidia-smi (verify Tesla T4)
5. Cell 3+: Copy-paste code from Section 7 (IMDB project) → Run All
6. Wait ~15 min → 93%+ accuracy
7. Save to Google Drive: trainer.save_model('/content/drive/MyDrive/...')

No need to install anything on your laptop. No GPU needed. No payment. Everything runs on Google's free cloud.

💡 No GPU at All?
Fine-tuning on CPU only also WORKS — but very slow. DistilBERT on IMDB: GPU T4 = 15 minutes, CPU = 6-8 hours. For learning and small experiments (subset of 1000 samples), CPU is still usable. For full training, use Colab GPU.

📊

2. Datasets Library — Load, Inspect, Filter, Preprocess

Library for efficient dataset management — streaming, memory-mapped, Arrow format

The datasets library from Hugging Face provides access to 100,000+ datasets from the Hub, plus tools for efficiently processing large datasets. Data is stored in Apache Arrow format (memory-mapped) — meaning you can work with datasets larger than RAM without issues.

09_datasets_library.py — Datasets Deep Dive 🔬 (python)
from datasets import load_dataset, Dataset, DatasetDict

# ===========================
# 1. Load a dataset from the Hub (100k+ datasets!)
# ===========================
dataset = load_dataset("imdb")
print(dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['text', 'label'],
#         num_rows: 25000
#     })
#     test: Dataset({
#         features: ['text', 'label'],
#         num_rows: 25000
#     })
# })

# ===========================
# 2. Inspect data
# ===========================
print(dataset["train"][0])
# {'text': 'I rented this movie because...', 'label': 0}
# label: 0 = negative, 1 = positive

print(dataset["train"].features)
# {'text': Value(dtype='string'), 'label': ClassLabel(names=['neg', 'pos'])}

print(dataset["train"].column_names)  # ['text', 'label']
print(dataset["train"].num_rows)      # 25000

# Slicing (like pandas!)
first_5 = dataset["train"][:5]         # first 5 rows (dict of lists)
texts = dataset["train"]["text"][:5]  # first 5 texts
labels = dataset["train"]["label"]     # ALL labels (efficient!)

# ===========================
# 3. Filter data
# ===========================
# Keep only positive reviews
positive_only = dataset["train"].filter(lambda x: x["label"] == 1)
print(f"Positive reviews: {len(positive_only)}")  # 12500

# Keep only short reviews
short = dataset["train"].filter(lambda x: len(x["text"]) < 500)
print(f"Short reviews: {len(short)}")

# ===========================
# 4. Map — apply function to every row
# ===========================
def add_length(example):
    example["text_length"] = len(example["text"])
    return example

dataset_with_length = dataset["train"].map(add_length)
print(dataset_with_length[0]["text_length"])  # 843

# Batched map (MUCH faster for tokenization!)
def batch_add_length(batch):
    batch["text_length"] = [len(t) for t in batch["text"]]
    return batch

dataset_batched = dataset["train"].map(batch_add_length, batched=True)

# ===========================
# 5. Load from various sources
# ===========================
# From the Hub (most common)
glue = load_dataset("glue", "mrpc")         # GLUE benchmark, MRPC task
squad = load_dataset("squad")                # Question Answering
wnut = load_dataset("wnut_17")               # NER dataset
ag_news = load_dataset("ag_news")             # News classification

# From your own CSV (most important for custom data!)
my_data = load_dataset("csv", data_files={
    "train": "data/train.csv",
    "test": "data/test.csv"
})

# From a pandas DataFrame
import pandas as pd
df = pd.DataFrame({"text": ["Great!", "Bad!"], "label": [1, 0]})
my_dataset = Dataset.from_pandas(df)

# From JSON
json_ds = load_dataset("json", data_files="data.jsonl")

# ===========================
# 6. Train/val split
# ===========================
split = dataset["train"].train_test_split(test_size=0.2, seed=42)
print(split)
# DatasetDict({
#     train: Dataset(20000 rows)
#     test:  Dataset(5000 rows)
# })

# ===========================
# 7. Shuffle & select
# ===========================
shuffled = dataset["train"].shuffle(seed=42)
small_train = dataset["train"].shuffle(seed=42).select(range(1000))  # first 1000
print(f"Small subset: {len(small_train)} rows")  # 1000
# Great for quick experiments before full training!

🎓 Why datasets Library, Not Pandas?
Memory: datasets uses Apache Arrow (memory-mapped). 50GB dataset? Doesn't need 50GB RAM — reads only what's needed.
Speed: .map() with batched=True + multiprocessing → 10-100× faster than pandas apply.
Integration: Output directly compatible with Trainer API — no conversion needed.
Hub: 100k+ datasets ready to download in one line. Pandas doesn't have this.
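A quick way to see why batched=True helps: the mapped function runs once per batch instead of once per row. A pure-Python sketch of the call counts (datasets' default batch_size for .map() is 1000):

```python
import math

def map_function_calls(n_rows: int, batched: bool, batch_size: int = 1000) -> int:
    """How many times .map() invokes your Python function:
    once per row without batching, once per batch with batched=True."""
    return math.ceil(n_rows / batch_size) if batched else n_rows

print(map_function_calls(25_000, batched=False))  # 25000 calls
print(map_function_calls(25_000, batched=True))   # 25 calls
```

Fewer Python-level calls means the tokenizer's fast Rust backend receives whole batches at once, which is where most of the speedup comes from.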

✂️

3. Dataset Tokenization — Batch Tokenize with .map()

Critical step: text → token IDs ready for the model
10_tokenize_dataset.py — Efficient Dataset Tokenization (python)
from transformers import AutoTokenizer
from datasets import load_dataset

# ===========================
# 1. Load tokenizer & dataset
# ===========================
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
dataset = load_dataset("imdb")

# ===========================
# 2. Define tokenization function
# ===========================
def tokenize_function(examples):
    """Tokenize a batch of texts.

    IMPORTANT: this function receives a BATCH (dict of lists), not a single example!
    Because we use batched=True in .map() → much faster.
    """
    return tokenizer(
        examples["text"],
        truncation=True,           # truncate if > max_length
        max_length=256,             # max token length
        # padding=True,             # ❌ do NOT pad here!
        # padding is handled by the DataCollator (section 4)
        # → more efficient: it pads per batch, not to a global max
    )

# ===========================
# 3. Apply tokenization to entire dataset
# ===========================
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,               # process batches (MUCH faster!)
    num_proc=4,                 # use 4 CPU cores
    remove_columns=["text"],    # remove original text (not needed anymore)
)

print(tokenized_dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['label', 'input_ids', 'attention_mask'],
#         num_rows: 25000
#     })
#     test: Dataset({
#         features: ['label', 'input_ids', 'attention_mask'],
#         num_rows: 25000
#     })
# })

# Check one example
print(tokenized_dataset["train"][0].keys())  # dict_keys(['label', 'input_ids', 'attention_mask'])
print(f"Token length: {len(tokenized_dataset['train'][0]['input_ids'])}")
# Varies per example! (because we did NOT pad here)
# Some: 87 tokens, some: 256 tokens (truncated)

# ===========================
# 4. Set format for PyTorch
# ===========================
tokenized_dataset.set_format("torch")  # convert to PyTorch tensors
print(type(tokenized_dataset["train"][0]["input_ids"]))
# <class 'torch.Tensor'>

🎓 Why NOT Pad at the Tokenization Step?
If you pad all sequences to max_length=256 during tokenization, a short review (20 words) will have 236 padding tokens. Wastes memory and compute!

Better: leave sequences with different lengths, then use DataCollator (section 4) that pads per batch to the longest in that batch. Batch 1 might be max 87 tokens, batch 2 might be 156. Far more efficient!
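The savings are easy to quantify. A sketch with assumed token lengths (the numbers are illustrative, chosen to match the examples above):

```python
# Tokens processed: static padding to max_length vs dynamic padding
# to each batch's own longest sequence.
batches = [[87, 50, 43, 60], [156, 120, 90, 100]]  # token lengths (assumed)
MAX_LEN = 256

static = sum(MAX_LEN * len(b) for b in batches)    # every row padded to 256
dynamic = sum(max(b) * len(b) for b in batches)    # padded to batch max

print(f"static : {static} tokens")           # 2048
print(f"dynamic: {dynamic} tokens")          # 87*4 + 156*4 = 972
print(f"saved  : {1 - dynamic/static:.0%}")  # ~53% less compute
```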

📦

4. DataCollator — Dynamic Padding per Batch

Pad each batch to the longest in that batch — not the global max
11_data_collator.py — DataCollatorWithPadding (python)
from transformers import DataCollatorWithPadding

# ===========================
# DataCollatorWithPadding — the smart way to pad
# ===========================
data_collator = DataCollatorWithPadding(
    tokenizer=tokenizer,
    padding=True,               # pad to longest in batch
    # padding="max_length",     # pad to max_length (fixed)
    # max_length=256,           # only if padding="max_length"
    return_tensors="pt",        # return PyTorch tensors
)

# Simulation: 3 examples with different lengths
examples = [
    tokenized_dataset["train"][0],  # maybe 87 tokens
    tokenized_dataset["train"][1],  # maybe 156 tokens
    tokenized_dataset["train"][2],  # maybe 43 tokens
]

batch = data_collator(examples)
print(f"Padded batch shape: {batch['input_ids'].shape}")
# torch.Size([3, 156])  ← padded to longest (156), NOT 256!
print(f"Attention mask: {batch['attention_mask'][2][:10]}")
# tensor([1, 1, 1, ..., 1, 0, 0, 0, ...])  ← 1=real, 0=padding

# ===========================
# Why does this MATTER for performance?
# ===========================
# Static padding (max_length=256):
#   Batch of short reviews (avg 50 tokens):
#   → compute on 256 tokens × batch_size = WASTED 80% compute!
#
# Dynamic padding (DataCollator):
#   Batch of short reviews (avg 50 tokens):
#   → compute on ~55 tokens × batch_size = OPTIMAL!
#   → 4-5× faster training on datasets with variable length text!
⚙️

5. TrainingArguments — EVERY Parameter Explained

Training control center: LR, batch size, epochs, evaluation, logging, saving, FP16
12_training_arguments.py — TrainingArguments Encyclopedia 📖 (python)
from transformers import TrainingArguments

args = TrainingArguments(
    # ═══════════════════════════
    # OUTPUT & ORGANIZATION
    # ═══════════════════════════
    output_dir="./results",              # checkpoints & logs saved here
    overwrite_output_dir=True,           # overwrite if exists
    run_name="bert-imdb-v1",             # name for W&B / TensorBoard

    # ═══════════════════════════
    # TRAINING HYPERPARAMETERS
    # ═══════════════════════════
    num_train_epochs=3,                  # total training epochs
    # max_steps=1000,                    # alternative: stop after N steps
    per_device_train_batch_size=16,      # batch per GPU (T4: 16-32 for BERT)
    per_device_eval_batch_size=64,       # eval batch (larger OK — no gradients)
    gradient_accumulation_steps=2,       # effective batch = 16 × 2 = 32
    # ↑ Simulates larger batch on small GPU!

    learning_rate=2e-5,                  # THE MOST IMPORTANT HYPERPARAMETER!
    # BERT/RoBERTa: 2e-5 to 5e-5 (sweet spot)
    # DistilBERT: 2e-5 to 5e-5
    # Large models (LLaMA): 1e-5 to 2e-5
    # Do NOT go above 1e-4! It will wreck the pre-trained weights.

    weight_decay=0.01,                  # L2 regularization (prevents overfitting)
    warmup_ratio=0.1,                   # warmup 10% of training steps
    # warmup_steps=500,                  # alternative: fixed warmup steps
    lr_scheduler_type="linear",          # linear decay after warmup
    # "cosine", "cosine_with_restarts", "polynomial", "constant"

    # ═══════════════════════════
    # EVALUATION
    # ═══════════════════════════
    eval_strategy="epoch",              # evaluate every epoch
    # eval_strategy="steps",             # evaluate every N steps
    # eval_steps=500,                    # (if strategy="steps")

    # ═══════════════════════════
    # SAVING
    # ═══════════════════════════
    save_strategy="epoch",              # save checkpoint every epoch
    save_total_limit=2,                 # keep only last 2 checkpoints
    load_best_model_at_end=True,        # load best model after training!
    metric_for_best_model="f1",         # "best" = highest F1
    # metric_for_best_model="eval_loss", # or lowest loss

    # ═══════════════════════════
    # PERFORMANCE
    # ═══════════════════════════
    fp16=True,                          # mixed precision (2× faster on T4!)
    # bf16=True,                         # bfloat16 (A100, H100)
    dataloader_num_workers=4,           # parallel data loading
    dataloader_pin_memory=True,         # faster CPU→GPU transfer

    # ═══════════════════════════
    # LOGGING
    # ═══════════════════════════
    logging_dir="./logs",               # TensorBoard logs
    logging_steps=100,                  # log every 100 steps
    report_to="tensorboard",            # or "wandb", "none"

    # ═══════════════════════════
    # HUB INTEGRATION
    # ═══════════════════════════
    push_to_hub=False,                  # auto-push to HF Hub after training
    # hub_model_id="username/my-model", # Hub repo name
    # hub_strategy="every_save",        # push every checkpoint
)

🎓 5 Parameter Terpenting untuk Fine-Tuning:
1. learning_rate = 2e-5 → Mulai dari sini. Terlalu besar (>1e-4) = rusak pre-trained weights. Terlalu kecil (<1e-6) = tidak belajar.
2. num_train_epochs = 3 → BERT biasanya konvergen dalam 2-4 epochs. Lebih lama = overfitting.
3. per_device_train_batch_size = 16 → Terbatas oleh GPU memory. T4 16GB: batch 16-32 untuk BERT. OOM? Kurangi batch, tambah gradient_accumulation_steps.
4. fp16 = True → Selalu aktifkan di GPU dengan Tensor Cores (T4+). 2× faster, 50% less memory.
5. weight_decay = 0.01 → Regularization standar. Tidak perlu diubah kecuali overfitting parah.

🎓 5 Most Important Parameters for Fine-Tuning:
1. learning_rate = 2e-5 → Start here. Too high (>1e-4) = destroys pre-trained weights. Too low (<1e-6) = doesn't learn.
2. num_train_epochs = 3 → BERT usually converges in 2-4 epochs. Longer = overfitting.
3. per_device_train_batch_size = 16 → Limited by GPU memory. T4 16GB: batch 16-32 for BERT. OOM? Reduce batch, increase gradient_accumulation_steps.
4. fp16 = True → Always enable on GPUs with Tensor Cores (T4+). 2× faster, 50% less memory.
5. weight_decay = 0.01 → Standard regularization. No need to change unless severe overfitting.
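The batch-size, gradient-accumulation, and warmup settings above interact, and a few lines of arithmetic make the relationships concrete. The numbers below assume the IMDB setup used later in this page (25,000 train samples, batch 16, accumulation 2, 3 epochs, warmup_ratio 0.1); the step counts mirror how Trainer derives them internally:

```python
import math

# Hyperparameters matching the TrainingArguments example above
train_samples = 25_000
per_device_batch = 16
grad_accum = 2
epochs = 3
warmup_ratio = 0.1

effective_batch = per_device_batch * grad_accum               # what the optimizer "sees"
steps_per_epoch = math.ceil(train_samples / effective_batch)  # optimizer steps per epoch
total_steps = steps_per_epoch * epochs
warmup_steps = int(total_steps * warmup_ratio)                # LR ramps up over these steps

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
# 32 782 2346 234
```

So with warmup_ratio=0.1 the scheduler warms up for 234 optimizer steps, then decays linearly over the remaining ~2,100 steps.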

📊

6. Compute Metrics — F1, Precision, Recall Custom

6. Compute Metrics — Custom F1, Precision, Recall

Accuracy saja tidak cukup — tambahkan F1 dan metric domain-spesifik
Accuracy alone isn't enough — add F1 and domain-specific metrics
13_compute_metrics.py — Custom Evaluation Metrics
import numpy as np
import evaluate

# ===========================
# 1. Load metrics from evaluate library
# ===========================
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")
precision_metric = evaluate.load("precision")
recall_metric = evaluate.load("recall")

# ===========================
# 2. Define compute_metrics function
# Trainer calls this after every evaluation
# ===========================
def compute_metrics(eval_pred):
    """Compute accuracy, F1, precision, recall.
    
    Args:
        eval_pred: EvalPrediction object with:
            .predictions: model logits (batch, num_labels)
            .label_ids: true labels (batch,)
    
    Returns:
        dict of metric_name: value
    """
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # logits → class index

    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)
    f1 = f1_metric.compute(predictions=predictions, references=labels, average="weighted")
    precision = precision_metric.compute(predictions=predictions, references=labels, average="weighted")
    recall = recall_metric.compute(predictions=predictions, references=labels, average="weighted")

    return {
        "accuracy": accuracy["accuracy"],
        "f1": f1["f1"],
        "precision": precision["precision"],
        "recall": recall["recall"],
    }

# This function will be passed to Trainer:
# trainer = Trainer(..., compute_metrics=compute_metrics)
# Output during training:
# Epoch 1: {'accuracy': 0.921, 'f1': 0.921, 'precision': 0.923, 'recall': 0.921}
# Epoch 2: {'accuracy': 0.934, 'f1': 0.934, 'precision': 0.935, 'recall': 0.934}
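The average="weighted" argument used above can be demystified with a tiny NumPy-only re-implementation. This is not the evaluate library's code, just the same formula: per-class F1, averaged with class-support weights.

```python
import numpy as np

def weighted_f1(preds, labels):
    """Per-class F1, averaged weighted by each class's support (true count)."""
    classes = np.unique(labels)
    f1s, weights = [], []
    for c in classes:
        tp = np.sum((preds == c) & (labels == c))
        fp = np.sum((preds == c) & (labels != c))
        fn = np.sum((preds != c) & (labels == c))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        f1s.append(f1)
        weights.append(np.sum(labels == c))   # support of class c
    return float(np.average(f1s, weights=weights))

preds  = np.array([1, 0, 1, 1, 0, 1])
labels = np.array([1, 0, 0, 1, 0, 1])
print(round(weighted_f1(preds, labels), 3))
# 0.829
```

Weighting by support means frequent classes dominate the average, which is why macro F1 (equal class weights) is often preferred for imbalanced datasets.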
🎬

7. Proyek: Fine-Tune DistilBERT pada IMDB — 93%+

7. Project: Fine-Tune DistilBERT on IMDB — 93%+

Gabungkan semua: dataset → tokenize → collator → args → metrics → Trainer → train!
Combine everything: dataset → tokenize → collator → args → metrics → Trainer → train!
14_imdb_finetune.py — Complete IMDB Fine-Tuning 🔥🔥🔥
#!/usr/bin/env python3
"""
🔥 Fine-Tune DistilBERT on IMDB Sentiment Analysis
Expected: 93%+ accuracy in ~15 minutes on Google Colab T4
Combines: Sections 2-6 of this page
"""

import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
)

# ═══════════════════════════════════════
# STEP 1: LOAD DATASET
# ═══════════════════════════════════════
print("📊 Loading IMDB dataset...")
dataset = load_dataset("imdb")
print(f"  Train: {len(dataset['train'])} | Test: {len(dataset['test'])}")
# Train: 25000 | Test: 25000

# Optional: use subset for quick testing
# dataset["train"] = dataset["train"].shuffle(42).select(range(5000))
# dataset["test"] = dataset["test"].shuffle(42).select(range(1000))

# ═══════════════════════════════════════
# STEP 2: LOAD TOKENIZER & MODEL
# ═══════════════════════════════════════
print("🤖 Loading model & tokenizer...")
model_name = "distilbert-base-uncased"
# Alternatives:
# "bert-base-uncased"          → more accurate, slower
# "roberta-base"               → best accuracy for English
# "xlm-roberta-base"           → multilingual (100+ languages)
# "indobenchmark/indobert-base-p1"  → Indonesian!

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,                  # binary: positive/negative
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1},
)
print(f"  Model params: {model.num_parameters():,}")
# 66,955,010 for DistilBERT (half of BERT!)

# ═══════════════════════════════════════
# STEP 3: TOKENIZE DATASET
# ═══════════════════════════════════════
print("✂️ Tokenizing...")
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
print(f"  Tokenized features: {tokenized['train'].column_names}")

# ═══════════════════════════════════════
# STEP 4: DATA COLLATOR
# ═══════════════════════════════════════
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# ═══════════════════════════════════════
# STEP 5: METRICS
# ═══════════════════════════════════════
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "f1": f1.compute(predictions=preds, references=labels, average="weighted")["f1"],
    }

# ═══════════════════════════════════════
# STEP 6: TRAINING ARGUMENTS
# ═══════════════════════════════════════
args = TrainingArguments(
    output_dir="./imdb-distilbert",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    gradient_accumulation_steps=2,       # effective batch = 32
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    fp16=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    save_total_limit=2,
    logging_steps=100,
    report_to="none",                   # or "tensorboard" / "wandb"
)

# ═══════════════════════════════════════
# STEP 7: CREATE TRAINER & TRAIN!
# ═══════════════════════════════════════
print("🏋️ Starting training...")
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

# ═══════════════════════════════════════
# STEP 8: EVALUATE
# ═══════════════════════════════════════
results = trainer.evaluate()
print(f"\n🎬 Final Results:")
print(f"  Accuracy: {results['eval_accuracy']:.1%}")
print(f"  F1 Score: {results['eval_f1']:.1%}")
# 🎬 Final Results:
#   Accuracy: 93.2%
#   F1 Score: 93.2%

# ═══════════════════════════════════════
# STEP 9: SAVE & TEST
# ═══════════════════════════════════════
trainer.save_model("./imdb-distilbert-final")
tokenizer.save_pretrained("./imdb-distilbert-final")

# Test with pipeline!
from transformers import pipeline
pipe = pipeline("sentiment-analysis", model="./imdb-distilbert-final", device=0)

tests = [
    "This movie was absolutely incredible, best I've seen all year!",
    "Terrible film, waste of two hours of my life.",
    "It was okay. Not great, not terrible. Average at best.",
]
for text in tests:
    result = pipe(text)[0]
    print(f"  {result['label']:8s} ({result['score']:.1%}): {text[:60]}...")
# POSITIVE (99.7%): This movie was absolutely incredible, best I've see...
# NEGATIVE (99.8%): Terrible film, waste of two hours of my life...
# NEGATIVE (68.3%): It was okay. Not great, not terrible. Average at...

print("\n🏆 Fine-tuning complete!")

🎬 93.2% Akurasi dalam 15 Menit!
Perbandingan evolusi kita di tiga seri:
Seri NN Page 5 (manual NumPy LSTM): ~80% (berjam-jam)
Seri TF Page 5 (BiLSTM Keras): ~87% (30 menit)
Seri TF Page 6 (BERT TF Hub): ~95% (setup kompleks)
Seri HF Page 2 (Trainer API): 93.2% (15 menit, 40 baris!) 🏆
Dengan RoBERTa-base atau DeBERTa, bisa 95%+ — tinggal ganti model_name!

🎬 93.2% Accuracy in 15 Minutes!
Comparison of our evolution across three series:
NN Series Page 5 (manual NumPy LSTM): ~80% (hours)
TF Series Page 5 (BiLSTM Keras): ~87% (30 min)
TF Series Page 6 (BERT TF Hub): ~95% (complex setup)
HF Series Page 2 (Trainer API): 93.2% (15 min, 40 lines!) 🏆
With RoBERTa-base or DeBERTa, can reach 95%+ — just change model_name!

📁

8. Fine-Tune pada Custom CSV Dataset — Data Anda Sendiri

8. Fine-Tune on Custom CSV Dataset — Your Own Data

Dari CSV/Excel/JSON Anda sendiri ke model fine-tuned — template production
From your own CSV/Excel/JSON to a fine-tuned model — production template
15_custom_csv.py — Fine-Tune on YOUR Data 🔥
from datasets import load_dataset, ClassLabel
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, DataCollatorWithPadding
import numpy as np, evaluate

# ═══════════════════════════════════════
# STEP 1: Prepare your CSV
# ═══════════════════════════════════════
# Required CSV format:
# ┌──────────────────────────────────────┬───────┐
# │ text                                 │ label │
# ├──────────────────────────────────────┼───────┤
# │ Produk bagus, pengiriman cepat       │ 1     │
# │ Barang rusak, sangat kecewa          │ 0     │
# │ Lumayan lah untuk harganya           │ 1     │
# └──────────────────────────────────────┴───────┘
# Minimal: 500-1000 samples per class
# Ideal: 5000+ per class

# Load from CSV
dataset = load_dataset("csv", data_files={
    "train": "data/train.csv",
    "test": "data/test.csv",
})
# If you only have one file: split manually
# dataset = load_dataset("csv", data_files="data/all_data.csv")
# dataset = dataset["train"].train_test_split(test_size=0.2, seed=42)

# ═══════════════════════════════════════
# STEP 2: Inspect & clean
# ═══════════════════════════════════════
print(dataset["train"][:3])
print(f"Columns: {dataset['train'].column_names}")

# Rename columns if needed
# dataset = dataset.rename_column("review_text", "text")
# dataset = dataset.rename_column("sentiment", "label")

# Remove rows with None
dataset = dataset.filter(lambda x: x["text"] is not None and x["label"] is not None)

# Map string labels to integers (if needed)
# label_map = {"positive": 1, "negative": 0, "neutral": 2}
# dataset = dataset.map(lambda x: {"label": label_map[x["label"]]})

NUM_LABELS = len(set(dataset["train"]["label"]))
print(f"Number of classes: {NUM_LABELS}")

# ═══════════════════════════════════════
# STEP 3-7: Same as IMDB project above!
# ═══════════════════════════════════════
model_name = "distilbert-base-uncased"  # or "indobenchmark/indobert-base-p1" for Indonesian!
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=NUM_LABELS)

tokenized = dataset.map(lambda b: tokenizer(b["text"], truncation=True, max_length=256),
                        batched=True, remove_columns=["text"])

# Metrics: the Trainer below picks the best checkpoint by "f1",
# so compute_metrics must return that key
accuracy, f1 = evaluate.load("accuracy"), evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "f1": f1.compute(predictions=preds, references=labels, average="weighted")["f1"],
    }

trainer = Trainer(
    model=model,
    args=TrainingArguments("./my-custom-model", num_train_epochs=3, learning_rate=2e-5,
        per_device_train_batch_size=16, fp16=True, eval_strategy="epoch",
        save_strategy="epoch", load_best_model_at_end=True, metric_for_best_model="f1"),
    train_dataset=tokenized["train"], eval_dataset=tokenized["test"],
    tokenizer=tokenizer, data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics)

trainer.train()
trainer.save_model("./my-custom-model-final")
print("🏆 Custom model ready!")
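For the string-label case commented in STEP 2, one robust pattern is to sort the unique labels before assigning ids, so the mapping stays deterministic across runs. A pure-Python sketch (the datasets library can do the equivalent with dataset.class_encode_column("label"); the label names here are illustrative):

```python
# Deterministic string→int label mapping (pure Python sketch)
raw_labels = ["positive", "negative", "neutral", "positive", "negative"]

label_names = sorted(set(raw_labels))               # sort → stable ids across runs
label2id = {name: i for i, name in enumerate(label_names)}
id2label = {i: name for name, i in label2id.items()}

int_labels = [label2id[name] for name in raw_labels]
print(label2id)    # {'negative': 0, 'neutral': 1, 'positive': 2}
print(int_labels)  # [2, 0, 1, 2, 0]
```

The resulting id2label/label2id dicts can also be passed to from_pretrained(), as in the IMDB project above, so the pipeline prints readable label names instead of LABEL_0/LABEL_1.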
🌐

9. Push Model ke Hugging Face Hub — Share ke Dunia

9. Push Model to Hugging Face Hub — Share with the World

Upload model Anda ke Hub — siapa pun bisa download dan pakai
Upload your model to the Hub — anyone can download and use it
16_push_to_hub.py — Share Your Model
from huggingface_hub import login

# ===========================
# 1. Login to Hugging Face
# ===========================
login()  # prompts for token (huggingface.co/settings/tokens)
# or: login(token="hf_YOUR_TOKEN")
# or: set HF_TOKEN environment variable

# ===========================
# 2. Push after training (Method A: Trainer)
# ===========================
trainer.push_to_hub(
    commit_message="Fine-tuned DistilBERT on IMDB sentiment",
    # model_name="my-imdb-classifier",  # optional: custom name
)
# → Uploads to: huggingface.co/YOUR_USERNAME/imdb-distilbert
# Includes: model weights, tokenizer, config, training args

# ===========================
# 3. Push manually (Method B)
# ===========================
model.push_to_hub("my-imdb-classifier")
tokenizer.push_to_hub("my-imdb-classifier")

# ===========================
# 4. Anyone can now use your model!
# ===========================
from transformers import pipeline
# Anyone in the world can now run:
pipe = pipeline("sentiment-analysis", model="YOUR_USERNAME/my-imdb-classifier")
print(pipe("Great product!"))
# [{'label': 'POSITIVE', 'score': 0.997}]

# ===========================
# 5. Add model card (README.md)
# ===========================
# Auto-generated by Trainer! Includes:
# - Model description
# - Training hyperparameters
# - Evaluation results
# - Framework versions
# Edit at: huggingface.co/YOUR_USERNAME/my-imdb-classifier
🎛️

10. Hyperparameter Search — Optuna Integration

10. Hyperparameter Search — Optuna Integration

Otomatis cari LR, batch size, dan epochs terbaik — Trainer + Optuna built-in!
Automatically find best LR, batch size, and epochs — Trainer + Optuna built-in!
17_hyperparam_search.py — Automatic HP Tuning
# pip install optuna
from transformers import Trainer, TrainingArguments

# Define search space
def hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 5),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]),
        "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.1),
        "warmup_ratio": trial.suggest_float("warmup_ratio", 0.0, 0.2),
    }

# Run search
# NOTE: for hyperparameter_search, build the Trainer with model_init=...
# (a function returning a fresh model) instead of model=..., so each trial
# starts from newly initialized weights:
#   trainer = Trainer(model_init=lambda: AutoModelForSequenceClassification
#                     .from_pretrained(model_name, num_labels=2), args=args, ...)
best_trial = trainer.hyperparameter_search(
    hp_space=hp_space,
    direction="maximize",       # maximize eval metric
    backend="optuna",           # or "ray"
    n_trials=20,                # try 20 combinations
    compute_objective=lambda m: m["eval_f1"],  # optimize F1
)

print(f"Best trial: {best_trial}")
# BestRun(run_id='7', objective=0.9412,
#   hyperparameters={'learning_rate': 3.2e-05, 'num_train_epochs': 3,
#                    'per_device_train_batch_size': 16, ...})
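A note on log=True for learning_rate in the search space above: log-uniform sampling gives each multiplicative factor equal probability, which suits parameters that vary over orders of magnitude. A quick pure-Python simulation of the idea (illustrative only; Optuna does this internally):

```python
import math
import random

random.seed(0)
lo, hi = 1e-5, 5e-5

def log_uniform(lo, hi):
    """Sample uniformly in log space, like trial.suggest_float(..., log=True)."""
    return math.exp(random.uniform(math.log(lo), math.log(hi)))

samples = [log_uniform(lo, hi) for _ in range(10_000)]

# Under log-uniform, P(s < 2e-5) = ln(2e-5/1e-5) / ln(5e-5/1e-5) = ln 2 / ln 5 ≈ 0.43,
# while a plain uniform sampler over [1e-5, 5e-5] would give only 0.25.
below_2e5 = sum(s < 2e-5 for s in samples) / len(samples)
print(f"fraction below 2e-5: {below_2e5:.2f}")
```

In other words, log sampling spends more trials on the small-LR end, where fine-tuning outcomes are most sensitive.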
⚖️

11. Trainer vs Native PyTorch Loop — Kapan Pakai Mana?

11. Trainer vs Native PyTorch Loop — When to Use Which?

Trainer = 90% kasus. Native loop = GAN, RL, research custom
Trainer = 90% of cases. Native loop = GAN, RL, custom research
Aspek         | Trainer API                                         | Native PyTorch Loop
Kode          | ~40 baris                                           | ~150+ baris
FP16/BF16     | 1 flag: fp16=True                                   | Manual AMP context manager
Multi-GPU     | Otomatis!                                           | Manual DistributedDataParallel
Logging       | Built-in (TB, W&B)                                  | Manual logging
Checkpointing | Otomatis                                            | Manual save/load
HP Search     | Built-in Optuna/Ray                                 | Manual integration
Hub Push      | 1 method call                                       | Manual upload
Flexibility   | High (callbacks)                                    | Maximum (full control)
Kapan Pakai   | Classification, NER, QA, Summarization — 90% tasks  | GAN, RL, custom loss, research
Aspect        | Trainer API                                         | Native PyTorch Loop
Code          | ~40 lines                                           | ~150+ lines
FP16/BF16     | 1 flag: fp16=True                                   | Manual AMP context manager
Multi-GPU     | Automatic!                                          | Manual DistributedDataParallel
Logging       | Built-in (TB, W&B)                                  | Manual logging
Checkpointing | Automatic                                           | Manual save/load
HP Search     | Built-in Optuna/Ray                                 | Manual integration
Hub Push      | 1 method call                                       | Manual upload
Flexibility   | High (callbacks)                                    | Maximum (full control)
When to Use   | Classification, NER, QA, Summarization — 90% tasks  | GAN, RL, custom loss, research

💡 Rule of Thumb: Selalu mulai dengan Trainer. Hanya pindah ke native PyTorch loop jika Anda butuh sesuatu yang Trainer benar-benar tidak bisa lakukan (sangat jarang). Page 3 akan membahas native loop untuk kasus-kasus advanced.

💡 Rule of Thumb: Always start with Trainer. Only switch to a native PyTorch loop if you need something Trainer truly can't do (very rare). Page 3 will cover the native loop for advanced cases.
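To make the "~40 lines vs ~150+ lines" row concrete, here is a toy skeleton of a native loop on synthetic data (a tiny linear classifier standing in for BERT; sizes and names are illustrative), with comments marking what Trainer would otherwise handle for you:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
X = torch.randn(256, 4)
y = (X.sum(dim=1) > 0).long()            # toy 2-class problem
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = torch.nn.Linear(4, 2)            # stand-in for the transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
loss_fn = torch.nn.CrossEntropyLoss()

epoch_losses = []
for epoch in range(5):                   # Trainer: num_train_epochs
    total = 0.0
    for xb, yb in loader:                # Trainer: DataLoader + DataCollator
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()                  # Trainer: plus fp16 scaling, grad clipping
        optimizer.step()                 # Trainer: plus LR scheduler, logging, checkpoints
        total += loss.item()
    epoch_losses.append(total / len(loader))

print(f"loss: {epoch_losses[0]:.3f} → {epoch_losses[-1]:.3f}")
```

Everything in the comments (mixed precision, scheduling, evaluation, checkpointing, multi-GPU) is what inflates a real native loop past 150 lines, and what Trainer collapses into flags.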

📝

12. Ringkasan Page 2

12. Page 2 Summary

Semua yang sudah kita pelajari
Everything we learned
Konsep            | Apa Itu                             | Kode Kunci
Fine-tuning       | Adapt pre-trained ke tugas Anda     | from_pretrained(name, num_labels=N)
Datasets          | Load & proses data efisien          | load_dataset("imdb")
.map()            | Apply function ke dataset           | dataset.map(fn, batched=True)
DataCollator      | Dynamic padding per batch           | DataCollatorWithPadding(tokenizer)
TrainingArguments | Semua hyperparameters               | TrainingArguments(learning_rate=2e-5, fp16=True)
compute_metrics   | Custom evaluation                   | def compute_metrics(eval_pred)
Trainer           | All-in-one training loop            | Trainer(model, args, ...).train()
Custom CSV        | Fine-tune pada data Anda            | load_dataset("csv", data_files=...)
Push to Hub       | Share model ke dunia                | trainer.push_to_hub()
HP Search         | Auto-tune hyperparameters           | trainer.hyperparameter_search()
Concept           | What It Is                          | Key Code
Fine-tuning       | Adapt pre-trained to your task      | from_pretrained(name, num_labels=N)
Datasets          | Efficient data loading & processing | load_dataset("imdb")
.map()            | Apply function to dataset           | dataset.map(fn, batched=True)
DataCollator      | Dynamic padding per batch           | DataCollatorWithPadding(tokenizer)
TrainingArguments | All hyperparameters                 | TrainingArguments(learning_rate=2e-5, fp16=True)
compute_metrics   | Custom evaluation                   | def compute_metrics(eval_pred)
Trainer           | All-in-one training loop            | Trainer(model, args, ...).train()
Custom CSV        | Fine-tune on your data              | load_dataset("csv", data_files=...)
Push to Hub       | Share model with the world          | trainer.push_to_hub()
HP Search         | Auto-tune hyperparameters           | trainer.hyperparameter_search()
← Page Sebelumnya← Previous Page

Page 1 — Pengenalan Hugging Face & Pipeline

📘

Coming Next: Page 3 — Fine-Tuning GPT & Text Generation

Dari BERT (encoder, classification) ke GPT (decoder, generation)! Page 3 membahas: perbedaan encoder vs decoder, causal language modeling, fine-tuning GPT-2 untuk text generation, instruction tuning, prompt templates, generation parameters (temperature, top-k, top-p, beam search), dan membangun chatbot sederhana.

📘

Coming Next: Page 3 — Fine-Tuning GPT & Text Generation

From BERT (encoder, classification) to GPT (decoder, generation)! Page 3 covers: encoder vs decoder differences, causal language modeling, fine-tuning GPT-2 for text generation, instruction tuning, prompt templates, generation parameters (temperature, top-k, top-p, beam search), and building a simple chatbot.