Table of Contents — Page 3
- Encoder vs Decoder — BERT vs GPT: two different worlds
- Causal Language Modeling — How GPT learns
- Where to Run GPT? — Colab setup, VRAM, model sizes
- Text Generation with Pipeline — 1 line → generated text
- Manual generate() — Full control: token by token
- EVERY Generation Parameter — temperature, top_k, top_p, beam, etc.
- Sampling Strategies Visual — Why temperature 0.7 vs 1.5 differ drastically
- Fine-Tuning GPT-2 on a Custom Corpus — Poetry, code, dialogue
- Instruction Tuning — GPT → instruction-following assistant
- Project: Simple CLI Chatbot — Interactive fine-tuned GPT-2
- Other Generative Models — Bloom, LLaMA, Mistral, Gemma
- Summary & Page 4 Preview
1. Encoder vs Decoder — BERT vs GPT: Two Different Worlds
On Page 2, we fine-tuned BERT (an encoder) to understand text: sentiment classification, NER, QA. Now we switch to GPT (a decoder) to generate text: writing stories, answering questions, coding. The difference isn't just the task — the architectures are fundamentally different.
| Aspek | BERT (Encoder) | GPT (Decoder) | T5 (Encoder-Decoder) |
|---|---|---|---|
| Attention | Bidirectional (lihat semua) | Causal (lihat kiri saja) | Encoder: bi, Decoder: causal |
| Pre-training | Masked LM: tebak [MASK] | Next token: prediksi berikutnya | Span corruption |
| Output | Representation (embedding) | Next token probability | Sequence output |
| Best For | Classification, NER, QA | Generation, Chat, Code | Translation, Summarization |
| Contoh Model | BERT, RoBERTa, DeBERTa | GPT-2, LLaMA, Mistral | T5, BART, mBART |
| HF Auto Class | AutoModelForSequenceClassification | AutoModelForCausalLM | AutoModelForSeq2SeqLM |
| Page di Seri Ini | Page 2 (fine-tune BERT) | Page 3 (ini!) | Page 4 (T5, translation) |
| Aspect | BERT (Encoder) | GPT (Decoder) | T5 (Encoder-Decoder) |
|---|---|---|---|
| Attention | Bidirectional (sees all) | Causal (sees left only) | Encoder: bi, Decoder: causal |
| Pre-training | Masked LM: guess [MASK] | Next token: predict next | Span corruption |
| Output | Representation (embedding) | Next token probability | Sequence output |
| Best For | Classification, NER, QA | Generation, Chat, Code | Translation, Summarization |
| Model Examples | BERT, RoBERTa, DeBERTa | GPT-2, LLaMA, Mistral | T5, BART, mBART |
| HF Auto Class | AutoModelForSequenceClassification | AutoModelForCausalLM | AutoModelForSeq2SeqLM |
| Page in This Series | Page 2 (fine-tune BERT) | Page 3 (this!) | Page 4 (T5, translation) |
Why Can't GPT "See the Future"?
Imagine you're writing a sentence — you write one word at a time, left to right. When you're writing word 5, word 6 doesn't exist yet! GPT works exactly like this: it predicts the next word based on the previous words only.
If GPT could see ahead (like BERT), it would "cheat" — no need to learn prediction, just copy from the future. This is why the causal attention mask is so important: it blocks information from future positions during both training and inference.
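The causal mask described above can be sketched in a few lines of PyTorch (a minimal illustration of the idea, not GPT-2's actual implementation):

```python
import torch

seq_len = 5
# Lower-triangular matrix: row i has ones at columns 0..i (positions it may attend to)
mask = torch.tril(torch.ones(seq_len, seq_len))
print(mask)

# Attention scores at disallowed (future) positions are set to -inf,
# so softmax assigns them exactly zero probability:
scores = torch.randn(seq_len, seq_len)
masked_scores = scores.masked_fill(mask == 0, float("-inf"))
attn = torch.softmax(masked_scores, dim=-1)

print(attn[0])  # position 0 can only attend to itself -> probability 1.0 there
```

Every row of `attn` is a valid probability distribution over past positions only, which is exactly what prevents the "copying from the future" shortcut.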
2. Causal Language Modeling — How GPT Learns
Causal Language Modeling (CLM) is GPT's training task: given a sequence of words, predict the next word. This is repeated for every position in the sequence. Example: from the sentence "I love eating fried rice", GPT learns:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# ===========================
# 1. Load GPT-2
# ===========================
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# GPT-2 sizes:
# "gpt2"        → 117M params, ~500MB (small, fits Colab easily)
# "gpt2-medium" → 345M params, ~1.4GB (medium)
# "gpt2-large"  → 774M params, ~3.1GB (large, tight on T4)
# "gpt2-xl"     → 1.5B params, ~6.2GB (XL, needs >16GB VRAM)

# ===========================
# 2. Tokenize & compute loss
# ===========================
text = "The capital of France is Paris"
inputs = tokenizer(text, return_tensors="pt")

# For CLM: labels = input_ids (shifted internally by the model!)
# The model predicts token[i+1] from tokens[0:i]
outputs = model(**inputs, labels=inputs["input_ids"])

print(f"Loss: {outputs.loss.item():.4f}")
# Loss: 3.2145 (lower = better at predicting next tokens)
print(f"Perplexity: {torch.exp(outputs.loss).item():.2f}")
# Perplexity: 24.89 (lower = better, 1.0 = perfect prediction)

# ===========================
# 3. What the model sees internally:
# ===========================
tokens = tokenizer.tokenize(text)
print(f"Tokens: {tokens}")
# ['The', 'Ġcapital', 'Ġof', 'ĠFrance', 'Ġis', 'ĠParis']
# Ġ = space prefix (GPT-2 BPE tokenizer)

# The model internally shifts labels:
# Position 0: sees "The" → should predict "capital"
# Position 1: sees "The capital" → should predict "of"
# Position 2: sees "The capital of" → should predict "France"
# Position 3: sees "The capital of France" → should predict "is"
# Position 4: sees "The capital of France is" → should predict "Paris"
# Loss = average cross-entropy over all positions

# ===========================
# 4. Check what GPT-2 predicts at each position
# ===========================
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits  # (1, seq_len, vocab_size=50257)

for i in range(len(tokens) - 1):
    predicted_id = logits[0, i].argmax().item()
    predicted_token = tokenizer.decode(predicted_id)
    actual_token = tokens[i + 1]
    match = "✓" if predicted_token.strip() == actual_token.replace("Ġ", "") else "✗"
    context = " ".join(tokens[:i + 1]).replace("Ġ", "")
    print(f"  '{context}' → predicted: '{predicted_token}' | actual: '{actual_token}' {match}")
# 'The' → predicted: ' first' | actual: 'Ġcapital' ✗
# 'The capital' → predicted: ' of' | actual: 'Ġof' ✓
# 'The capital of' → predicted: ' the' | actual: 'ĠFrance' ✗
# 'The capital of France' → predicted: ' is' | actual: 'Ġis' ✓
# 'The capital of France is' → predicted: ' Paris' | actual: 'ĠParis' ✓
# GPT-2 knows Paris is the capital of France!
3. Where to Run GPT? — Setup, VRAM, and Limitations
| Model | Params | VRAM Inference | VRAM Fine-Tune FP16 | Colab T4 (16GB)? |
|---|---|---|---|---|
| GPT-2 small | 117M | ~1 GB | ~5 GB | ✅ Very comfortable |
| GPT-2 medium | 345M | ~2 GB | ~13 GB | ⚠️ Small batch + grad accum |
| GPT-2 large | 774M | ~4 GB | >16 GB | ❌ Needs gradient checkpointing |
| Bloom-560M | 560M | ~2 GB | ~14 GB | ⚠️ Tight |
| LLaMA 3.2 1B | 1B | ~4 GB | ~16 GB (LoRA) | ⚠️ LoRA only (Page 8) |
| Mistral 7B | 7B | ~15 GB | ~40 GB | ❌ Needs A100 |
# Cell 1: Verify GPU
!nvidia-smi
import torch
print(f"GPU: {torch.cuda.get_device_name(0)}, "
      f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

# Cell 2: Install
!pip install -q transformers datasets accelerate

# Cell 3: Test GPT-2 inference (< 1 minute download)
from transformers import pipeline
generator = pipeline("text-generation", model="gpt2", device=0)
print(generator("Artificial intelligence will", max_new_tokens=30)[0]["generated_text"])
# ✅ Ready! GPT-2 small = ~500MB, fits easily on a T4

# IMPORTANT: the GPT-2 tokenizer has NO pad token!
# You must set one manually before fine-tuning:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # ← REQUIRED for GPT-2!
# Without this: "Cannot handle padding" errors during fine-tuning
Tip: All code on this page uses GPT-2 small (117M), which runs comfortably on free Google Colab. Inference: <1 second per generation. Fine-tuning: ~10-20 minutes. Want a smarter model? Switch to "gpt2-medium" (same syntax, needs a smaller batch) or wait for Page 8 (LoRA for LLaMA/Mistral).
4. Text Generation with Pipeline — 1-Line Magic
from transformers import pipeline

# ===========================
# 1. Basic generation
# ===========================
generator = pipeline("text-generation", model="gpt2", device=0)

result = generator("The future of artificial intelligence is", max_new_tokens=50)
print(result[0]["generated_text"])
# "The future of artificial intelligence is not just about the technology,
#  but about how we use it. The question is whether we can build systems..."

# ===========================
# 2. Multiple completions
# ===========================
results = generator(
    "Once upon a time in Jakarta,",
    max_new_tokens=80,
    num_return_sequences=3,  # generate 3 different completions!
    do_sample=True,          # enable random sampling
    temperature=0.8,         # creativity level
)
for i, r in enumerate(results):
    print(f"\n--- Completion {i+1} ---")
    print(r["generated_text"])
# Each completion is different! (because of random sampling)

# ===========================
# 3. Different generation strategies
# ===========================
# Deterministic (greedy — always the same)
result_greedy = generator("AI is", max_new_tokens=20, do_sample=False)

# Creative (high-temperature sampling)
result_creative = generator("AI is", max_new_tokens=20, do_sample=True,
                            temperature=1.2, top_p=0.9)

# Focused (low temperature)
result_focused = generator("AI is", max_new_tokens=20, do_sample=True,
                           temperature=0.3)

print(f"Greedy:   {result_greedy[0]['generated_text']}")
print(f"Creative: {result_creative[0]['generated_text']}")
print(f"Focused:  {result_focused[0]['generated_text']}")
# Greedy:   "AI is a very important part of the future of the world."
# Creative: "AI is an existential rollercoaster of digital consciousness..."
# Focused:  "AI is a field of computer science that focuses on..."
5. Manual generate() — Full Control: Token by Token
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")

# ===========================
# 1. Basic generate()
# ===========================
prompt = "Indonesia is a beautiful country with"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=50,       # generate 50 NEW tokens
        do_sample=True,          # enable sampling
        temperature=0.7,         # creativity
        top_k=50,                # consider top 50 tokens
        top_p=0.9,               # nucleus sampling
        repetition_penalty=1.2,  # penalize repetition
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode the full sequence (prompt + new tokens)
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_text)

# ===========================
# 2. Stream generation (token by token) — like ChatGPT!
# ===========================
from transformers import TextIteratorStreamer
from threading import Thread

streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True)
inputs = tokenizer("The meaning of life is", return_tensors="pt").to("cuda")
gen_kwargs = {**inputs, "max_new_tokens": 100, "streamer": streamer,
              "do_sample": True, "temperature": 0.7}

# Run generate() in a separate thread (non-blocking)
thread = Thread(target=model.generate, kwargs=gen_kwargs)
thread.start()

# Print tokens as they arrive!
for text in streamer:
    print(text, end="", flush=True)
# "The meaning of life is" → the continuation appears token by token, like ChatGPT!

# ===========================
# 3. Access generation logits (for custom post-processing)
# ===========================
outputs = model.generate(
    **inputs,
    max_new_tokens=5,
    output_scores=True,  # return logits for each step!
    return_dict_in_generate=True,
)

# outputs.scores = tuple of (batch, vocab_size) tensors, one per generated token
for i, scores in enumerate(outputs.scores):
    probs = torch.softmax(scores[0], dim=-1)
    top5 = torch.topk(probs, 5)
    print(f"\nStep {i + 1} top-5 candidates:")
    for prob, idx in zip(top5.values, top5.indices):
        token = tokenizer.decode(idx)
        print(f"  '{token}': {prob:.1%}")
6. EVERY Generation Parameter — Explained in Detail
# ───────────────────────────────────────────────────
# EVERY GENERATION PARAMETER — EXPLAINED
# ───────────────────────────────────────────────────
output = model.generate(
    **inputs,

    # ── LENGTH CONTROL ──
    max_new_tokens=100,  # generate MAX 100 new tokens
    min_new_tokens=10,   # generate MIN 10 tokens (prevent empty output)
    # max_length=150,    # alternative: total length (prompt + generated)

    # ── SAMPLING vs GREEDY ──
    do_sample=True,  # True=random sampling, False=greedy (deterministic)
    # Greedy: always pick the highest-probability token → boring, repetitive
    # Sampling: randomly pick from the probability distribution → creative, varied

    # ── TEMPERATURE ── (only applies if do_sample=True)
    temperature=0.7,  # controls randomness of sampling
    # temperature=0.1 → almost greedy (very focused, repetitive)
    # temperature=0.7 → balanced (creative but coherent) ← RECOMMENDED
    # temperature=1.0 → standard (model's natural distribution)
    # temperature=1.5 → very random (wild, often incoherent)
    # temperature=2.0 → chaos (mostly nonsense)
    #
    # HOW IT WORKS:
    #   logits_adjusted = logits / temperature
    #   probs = softmax(logits_adjusted)
    # Low temp → sharper distribution → top token dominates
    # High temp → flatter distribution → more variety

    # ── TOP-K SAMPLING ── (only if do_sample=True)
    top_k=50,  # only consider the top K highest-probability tokens
    # top_k=1  → greedy (only the top 1 token)
    # top_k=10 → conservative (limited vocabulary)
    # top_k=50 → balanced ← DEFAULT
    # top_k=0  → disabled (consider ALL tokens)
    #
    # PROBLEM: top_k=50 treats all distributions equally.
    # If the model is very confident, the top 5 tokens may hold 95% probability
    # → tokens 6-50 are almost random noise!
    # SOLUTION: use top_p instead (or together)

    # ── TOP-P (NUCLEUS) SAMPLING ── (only if do_sample=True)
    top_p=0.9,  # keep tokens until cumulative probability reaches P
    # top_p=0.9 → keep tokens that sum to 90% probability
    #   If the model is confident: might keep only 3 tokens (they sum to 90%)
    #   If the model is uncertain: might keep 50 tokens (all needed for 90%)
    #   ADAPTS to the model's confidence! Better than a fixed top_k.
    #
    # top_p=1.0  → disabled (keep all tokens)
    # top_p=0.95 → slightly conservative
    # top_p=0.9  → balanced ← RECOMMENDED
    # top_p=0.5  → very focused

    # ── REPETITION CONTROL ──
    repetition_penalty=1.2,  # penalize tokens that already appeared
    # 1.0  = no penalty (can repeat freely)
    # 1.1  = mild (some repetition OK)
    # 1.2  = moderate ← RECOMMENDED for most use cases
    # 1.5  = strong (almost never repeats)
    # 2.0+ = too strong (forced to use rare words)
    no_repeat_ngram_size=3,  # never repeat the same 3-token phrase
    # 0 = disabled, 2 = no repeated bigrams, 3 = no repeated trigrams

    # ── BEAM SEARCH ── (alternative to sampling)
    # num_beams=5,          # explore 5 paths simultaneously
    # early_stopping=True,  # stop when all beams finish
    # length_penalty=1.0,   # >1 = prefer longer, <1 = prefer shorter
    # Beam search: more coherent but LESS creative than sampling
    # Good for: translation, summarization
    # Bad for: creative writing, chat (too boring)
    # NOTE: num_beams>1 combined with do_sample=True switches to beam-sample
    # decoding, not plain beam search — usually keep do_sample=False with beams.

    # ── STOP CONDITIONS ──
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    # stop_strings=["Human:", "\n\n"],  # stop at these strings (also pass tokenizer=tokenizer)
)
| Parameter | Default | Recommended | Effect |
|---|---|---|---|
| temperature | 1.0 | 0.7 | ↓ = focused, ↑ = creative |
| top_k | 50 | 50 | Fixed number of candidate tokens |
| top_p | 1.0 | 0.9 | Cumulative probability threshold (adaptive) |
| repetition_penalty | 1.0 | 1.1-1.3 | Prevent word repetition |
| no_repeat_ngram | 0 | 3 | Prevent phrase repetition |
| num_beams | 1 | 1 (sampling) / 5 (translation) | Wider search but deterministic |
| max_new_tokens | 20 | 50-500 | Maximum output length |
Quick Recipes for Various Use Cases:
Chatbot: temperature=0.7, top_p=0.9, repetition_penalty=1.2
Creative writing: temperature=0.9, top_p=0.95, top_k=100
Code generation: temperature=0.2, top_p=0.9 (needs accuracy!)
Factual text: do_sample=False (greedy, deterministic)
Translation: num_beams=5, do_sample=False, length_penalty=1.0
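These recipes can be collected into a small preset table and reused across calls. A minimal sketch — the `GEN_PRESETS` name and the helper function are illustrative, not part of transformers:

```python
# Generation presets matching the recipes above.
GEN_PRESETS = {
    "chatbot":     dict(do_sample=True, temperature=0.7, top_p=0.9, repetition_penalty=1.2),
    "creative":    dict(do_sample=True, temperature=0.9, top_p=0.95, top_k=100),
    "code":        dict(do_sample=True, temperature=0.2, top_p=0.9),
    "factual":     dict(do_sample=False),  # greedy, deterministic
    "translation": dict(do_sample=False, num_beams=5, length_penalty=1.0),
}

def generate_with_preset(generator, prompt, preset, max_new_tokens=50):
    """Run a text-generation pipeline with one of the named presets."""
    return generator(prompt, max_new_tokens=max_new_tokens, **GEN_PRESETS[preset])

# Usage (with the pipeline from earlier sections):
# result = generate_with_preset(generator, "AI is", "chatbot")
```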
7. Sampling Strategies Visual — Temperature & Top-P Visualized
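The effect is easy to verify numerically with a toy five-token vocabulary (a minimal pure-Python sketch, no model needed): low temperature sharpens the softmax so the top token dominates, high temperature flattens it, and top-p keeps only the smallest set of tokens covering the requested probability mass.

```python
import math

# Toy next-token logits (hypothetical values for illustration)
logits = {"the": 4.0, "a": 3.0, "Paris": 2.0, "banana": 0.5, "qzx": -1.0}

def softmax_with_temperature(logits, temperature):
    """softmax(logits / T): low T sharpens, high T flattens the distribution."""
    scaled = {t: l / temperature for t, l in logits.items()}
    z = sum(math.exp(v) for v in scaled.values())
    return {t: math.exp(v) / z for t, v in scaled.items()}

for temp in (0.3, 1.0, 1.5):
    probs = softmax_with_temperature(logits, temp)
    top = max(probs, key=probs.get)
    print(f"T={temp}: top token '{top}' gets {probs[top]:.0%}")
# Low T -> the top token takes nearly all the mass; high T -> mass spreads out.

def nucleus(probs, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative probability >= top_p."""
    kept, cum = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept.append(tok)
        cum += p
        if cum >= top_p:
            break
    return kept

probs = softmax_with_temperature(logits, 1.0)
print(nucleus(probs, 0.9))  # a confident distribution -> only a few tokens survive
```

This is exactly why top-p adapts to the model's confidence while top_k cannot: the size of the kept set changes with the shape of the distribution.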
8. Fine-Tuning GPT-2 on a Custom Corpus — Build Your Own GPT
Fine-tuning GPT-2 = giving it example text, on which it learns to predict the next words in your domain. After fine-tuning, GPT can generate text similar to your training data. Example: fine-tune on poetry → GPT becomes a "poet". Fine-tune on Python code → GPT becomes a "programmer".
from transformers import (
    AutoTokenizer, AutoModelForCausalLM,
    TrainingArguments, Trainer,
    DataCollatorForLanguageModeling,
)
from datasets import load_dataset

# ───────────────────────────────────────
# STEP 1: LOAD MODEL & TOKENIZER
# ───────────────────────────────────────
model_name = "gpt2"  # 117M params, fits Colab T4
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# CRITICAL: GPT-2 has NO pad token! Must set it!
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id

print(f"Model params: {model.num_parameters():,}")  # 124,439,808

# ───────────────────────────────────────
# STEP 2: LOAD & PREPARE DATASET
# ───────────────────────────────────────
# Option A: From the Hugging Face Hub
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

# Option B: From your own text file
# dataset = load_dataset("text", data_files="my_corpus.txt")

# Option C: From a CSV with a "text" column
# dataset = load_dataset("csv", data_files="poems.csv")

print(dataset)
print(f"Sample: {dataset['train'][0]['text'][:100]}...")

# ───────────────────────────────────────
# STEP 3: TOKENIZE
# ───────────────────────────────────────
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,                  # GPT-2 max context = 1024
        return_overflowing_tokens=True,  # split long texts into chunks!
        return_length=True,
    )

tokenized = dataset.map(tokenize_function, batched=True,
                        remove_columns=dataset["train"].column_names)

# Filter out very short sequences
tokenized = tokenized.filter(lambda x: len(x["input_ids"]) > 10)
print(f"Training examples: {len(tokenized['train'])}")

# ───────────────────────────────────────
# STEP 4: DATA COLLATOR (special for CLM!)
# ───────────────────────────────────────
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # False = Causal LM (GPT)
    # mlm=True → Masked LM (BERT) → NOT for GPT!
)
# DataCollatorForLanguageModeling automatically:
# 1. Pads sequences in each batch
# 2. Creates labels = input_ids (shifted by 1 internally)
# 3. Sets label=-100 for padding tokens (ignored in loss)

# ───────────────────────────────────────
# STEP 5: TRAINING
# ───────────────────────────────────────
args = TrainingArguments(
    output_dir="./gpt2-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,  # smaller batch for generation models
    gradient_accumulation_steps=8,  # effective batch = 4 × 8 = 32
    learning_rate=5e-5,             # slightly higher than BERT (5e-5 vs 2e-5)
    weight_decay=0.01,
    warmup_ratio=0.1,
    fp16=True,
    logging_steps=50,
    save_strategy="epoch",
    save_total_limit=2,
    prediction_loss_only=True,  # don't compute metrics (CLM only needs loss)
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

print("Training GPT-2...")
trainer.train()

# ───────────────────────────────────────
# STEP 6: EVALUATE (Perplexity)
# ───────────────────────────────────────
import math
eval_results = trainer.evaluate()
perplexity = math.exp(eval_results["eval_loss"])
print(f"Perplexity: {perplexity:.2f}")
# Lower = better. GPT-2 base on WikiText: ~30. Fine-tuned: ~20-25.
# Human text: ~20-50 depending on domain.

# ───────────────────────────────────────
# STEP 7: SAVE & GENERATE!
# ───────────────────────────────────────
trainer.save_model("./gpt2-finetuned-final")
tokenizer.save_pretrained("./gpt2-finetuned-final")

# Test generation!
from transformers import pipeline
gen = pipeline("text-generation", model="./gpt2-finetuned-final", device=0)

prompts = ["In the field of machine learning,", "The history of Indonesia"]
for p in prompts:
    result = gen(p, max_new_tokens=80, do_sample=True, temperature=0.7,
                 top_p=0.9, repetition_penalty=1.2)
    print(f"\nPrompt: {p}")
    print(f"Output: {result[0]['generated_text']}")

print("\nGPT-2 fine-tuning complete!")
Key Differences from BERT Fine-Tuning (Page 2):
1. Auto class: AutoModelForCausalLM, not AutoModelForSequenceClassification.
2. Data collator: DataCollatorForLanguageModeling(mlm=False), not DataCollatorWithPadding.
3. Labels: Automatic (labels = shifted input_ids). No "label" column needed in the dataset.
4. Pad token: tokenizer.pad_token = tokenizer.eos_token — GPT-2 has no default pad token!
5. Batch size: Smaller (4-8 vs 16-32), because long sequences use more VRAM.
6. LR: Slightly higher (5e-5 vs 2e-5) — GPT fine-tuning generally needs a larger LR.
7. Metric: Perplexity (not accuracy/F1), because there is no "right/wrong label" in text generation.
9. Instruction Tuning — GPT → Instruction-Following Assistant
Plain GPT-2 only continues text — it doesn't "answer questions" or "follow instructions". Instruction tuning teaches GPT to understand instruction formats and give appropriate responses. This is the technique that turned GPT-3 into ChatGPT.
# ===========================
# 1. Instruction-tuning data format
# ===========================
# Each training example = instruction + response in a single string
# Alpaca-style format (the most popular):

training_examples = [
    """### Instruction:
Summarize the following text in one sentence.

### Input:
Hugging Face is a company that provides tools and platforms for machine learning. They are best known for their Transformers library, which provides thousands of pre-trained models for natural language processing, computer vision, and audio tasks.

### Response:
Hugging Face is an ML company known for their Transformers library offering thousands of pre-trained models for NLP, vision, and audio.""",

    """### Instruction:
Translate the following English text to Indonesian.

### Input:
I love learning about artificial intelligence.

### Response:
Saya suka belajar tentang kecerdasan buatan.""",

    """### Instruction:
What is the capital of Japan?

### Response:
The capital of Japan is Tokyo.""",
]

# ChatML format (used by many chat models):
chat_examples = [
    """<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is Python?<|im_end|>
<|im_start|>assistant
Python is a high-level programming language known for its readability and versatility.<|im_end|>""",
]

# ===========================
# 2. Prepare dataset
# ===========================
from datasets import Dataset

# From a list of strings:
dataset = Dataset.from_dict({"text": training_examples})

# Or from a JSONL file:
# {"instruction": "...", "input": "...", "output": "..."}
# dataset = load_dataset("json", data_files="instructions.jsonl")

# Format each row into a single string
def format_instruction(example):
    if example.get("input"):
        text = (f"### Instruction:\n{example['instruction']}\n\n"
                f"### Input:\n{example['input']}\n\n"
                f"### Response:\n{example['output']}{tokenizer.eos_token}")
    else:
        text = (f"### Instruction:\n{example['instruction']}\n\n"
                f"### Response:\n{example['output']}{tokenizer.eos_token}")
    return {"text": text}

# dataset = dataset.map(format_instruction)
# Then tokenize & train exactly like Section 8!

# ===========================
# 3. Inference with the instruction format
# ===========================
def ask(instruction, input_text=""):
    if input_text:
        prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"
    else:
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    result = gen(prompt, max_new_tokens=200, do_sample=True,
                 temperature=0.7, top_p=0.9, repetition_penalty=1.2)
    response = result[0]["generated_text"][len(prompt):]
    # Stop at the next "###" (prevents generating another instruction)
    if "###" in response:
        response = response[:response.index("###")]
    return response.strip()

print(ask("What is the largest planet in our solar system?"))
# "Jupiter is the largest planet in our solar system."
10. Project: Simple CLI Chatbot — Interactive GPT-2
from transformers import pipeline

# Load the fine-tuned model (or plain "gpt2" for a demo)
gen = pipeline("text-generation",
               model="./gpt2-finetuned-final",  # or "gpt2" for a demo
               device=0)

def chat(user_input, history=""):
    """Generate a chatbot response."""
    prompt = history + f"### Human: {user_input}\n### Assistant:"
    result = gen(
        prompt,
        max_new_tokens=150,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.3,
        pad_token_id=gen.tokenizer.eos_token_id,
    )
    full_text = result[0]["generated_text"]
    response = full_text[len(prompt):].strip()
    # Stop at the next "### Human:" or blank line
    for stop in ["### Human:", "###", "\n\n"]:
        if stop in response:
            response = response[:response.index(stop)]
    # Update history for context
    new_history = prompt + " " + response + "\n"
    return response.strip(), new_history

# ───────────────────────────────────────
# Interactive loop
# ───────────────────────────────────────
print("GPT-2 Chatbot (type 'quit' to exit)")
print("=" * 50)
history = ""
while True:
    user_input = input("\nYou: ")
    if user_input.lower() in ["quit", "exit", "q"]:
        print("Bye!")
        break
    response, history = chat(user_input, history)
    print(f"Bot: {response}")

# ───────────────────────────────────────
# Sample conversation:
#   You: What is machine learning?
#   Bot: Machine learning is a subset of AI that enables systems
#        to learn from data without being explicitly programmed.
#   You: Give me an example.
#   Bot: A spam filter that learns to identify spam emails by
#        analyzing thousands of examples is a common ML application.
# ───────────────────────────────────────
Realistic note: GPT-2 (117M) is a small model — its answers are often less accurate and coherent than ChatGPT's (175B+ params). This project is for learning the concepts. For production chatbots, use LLaMA/Mistral 7B+ with LoRA fine-tuning (Page 8) or a large-model API (GPT-4, Claude).
11. Other Generative Models — Bloom, LLaMA, Mistral, Gemma
| Model | Params | Languages | License | Best For |
|---|---|---|---|---|
| GPT-2 | 117M-1.5B | English | MIT (free) | Learning, experiments ⭐ |
| Bloom | 560M-176B | 46 languages | Open RAIL-M | Multilingual generation |
| LLaMA 3.2 | 1B-90B | Multi + ID | Meta License | State-of-the-art open ⭐ |
| Mistral | 7B-8x22B | Multi | Apache 2.0 | Best size/quality ratio ⭐ |
| Gemma 2 | 2B-27B | Multi | Gemma License | Google's open model |
| Qwen 2.5 | 0.5B-72B | Multi + ID | Apache 2.0 | Strong multilingual + code |
| Phi-3 | 3.8B-14B | English | MIT | Small but powerful |
Generative Model Roadmap in This Series:
Page 3 (this one): GPT-2 (117M) — learn CLM concepts, generation params, fine-tuning
Page 8: LoRA & QLoRA — fine-tune LLaMA/Mistral 7B on Colab!
Page 9: RLHF — align models with human preferences (the ChatGPT method)
You're building the foundation for fine-tuning large models in the upcoming pages.
12. Page 3 Summary
| Concept | What It Is | Key Code |
|---|---|---|
| Encoder vs Decoder | BERT (bidirectional) vs GPT (causal) | AutoModelForCausalLM |
| Causal LM | Next token prediction | labels=input_ids (auto-shifted) |
| Pipeline Generation | 1-line text generation | pipeline("text-generation") |
| model.generate() | Full control generation | model.generate(**inputs, ...) |
| Temperature | Creativity control (0.1-2.0) | temperature=0.7 |
| Top-P (Nucleus) | Adaptive probability cutoff | top_p=0.9 |
| Top-K | Fixed candidate count | top_k=50 |
| Repetition Penalty | Prevent repetition | repetition_penalty=1.2 |
| Fine-Tune GPT-2 | Custom corpus → custom GPT | DataCollatorForLanguageModeling(mlm=False) |
| Instruction Tuning | GPT → instruction follower | "### Instruction:\n...\n### Response:\n" |
| Streaming | Token-by-token output | TextIteratorStreamer |
| Perplexity | Generation quality metric | exp(eval_loss) |
Page 2 — Fine-Tuning BERT & Trainer API
Coming Next: Page 4 — Token Classification & NER
From sentence classification (Page 2) to per-token classification! Page 4 covers: Named Entity Recognition (NER) — identifying people, places, and organizations; POS tagging; the BIO/IOB2 labeling scheme; tokenization alignment (subword → word labels); fine-tuning BERT for NER on custom datasets; per-entity evaluation (seqeval); and building a production NER pipeline.