πŸ“ Artikel ini ditulis dalam Bahasa Indonesia & English
πŸ“ This article is available in English & Bahasa Indonesia

❓ Learn Hugging Face β€” Page 5

Question Answering
& Seq2Seq (T5/BART)


Two of the most powerful NLP tasks combining understanding and generation. Page 5 covers in super detail: Extractive QA β€” finding answers within context (SQuAD dataset), how QA models work (start/end logits β†’ span extraction), long context exceeding max_length (stride/overflow handling), fine-tuning BERT for QA on SQuAD, QA evaluation (Exact Match & F1 score), Encoder-Decoder (Seq2Seq) architecture β€” T5 and BART, T5 text-to-text framework ("translate English to French: ..."), fine-tuning T5 for translation and summarization, BLEU and ROUGE metrics, where to run (Colab T4 enough for T5-small), and abstractive vs extractive QA.

πŸ“… March 2026 ⏱ 42 min read
🏷 QA Β· SQuAD Β· T5 Β· BART Β· Seq2Seq Β· Translation Β· Summarization Β· BLEU Β· ROUGE
πŸ“š Learn Hugging Face Series:

πŸ“‘ Table of Contents β€” Page 5

  1. Extractive QA β€” Finding answers within context
  2. How QA Models Work β€” Start/end logits β†’ span
  3. SQuAD Dataset β€” Classic QA benchmark
  4. Long Context Problem β€” Stride & overflow handling
  5. Fine-Tune BERT for QA β€” Complete SQuAD pipeline
  6. QA Evaluation β€” Exact Match & F1
  7. Encoder-Decoder Architecture β€” T5 & BART
  8. T5 Text-to-Text Framework β€” All tasks = textβ†’text
  9. Fine-Tune T5 for Summarization β€” CNN/DailyMail
  10. Fine-Tune T5 for Translation β€” English↔Indonesian
  11. BLEU & ROUGE Metrics β€” Evaluating text generation
  12. Where to Run? β€” VRAM for T5/BART
  13. Summary & Page 6 Preview
❓

1. Extractive QA β€” Finding Answers Within Context

Given a question + context paragraph β†’ find the text span that answers the question

Extractive QA = the model is given a question and context (paragraph), then must find the text span within the context that answers the question. The model doesn't generate new text β€” it only points to the start and end positions of the answer in the context. This differs from generative QA (ChatGPT-style) which writes new answers.

Extractive QA β€” The Model "Points" to the Answer in the Context

Question: "What is the capital of Indonesia?"
Context:  "Indonesia is a country in Southeast Asia. Jakarta is the capital
           and largest city. The country has over 17,000 islands and a
           population of 270 million people."

Model output:
  Start position: 42  (character 42 = the "J" of "Jakarta")
  End position:   49  (exclusive end β€” character 48 is the final "a")
  β†’ Answer: "Jakarta"  ← EXTRACTED from the context, not generated!

  Extractive:  the answer MUST exist in the context (copy a span)
  Generative:  the answer can be written from scratch (ChatGPT-style)
  Abstractive: the answer is condensed/paraphrased from the context
34_qa_pipeline.py β€” Instant QA Pipeline (python)
from transformers import pipeline

# ===========================
# QA pipeline β€” zero training!
# ===========================
qa = pipeline("question-answering", device=0)

context = """
Indonesia is a country in Southeast Asia and Oceania between the Indian
and Pacific oceans. Jakarta is the capital and most populous city. The
country has over 17,000 islands with a population of 270 million,
making it the world's fourth most populous country. Indonesia became
independent from the Netherlands on August 17, 1945.
"""

# Ask multiple questions on the SAME context
questions = [
    "What is the capital of Indonesia?",
    "How many islands does Indonesia have?",
    "When did Indonesia become independent?",
    "What is the population of Indonesia?",
]

for q in questions:
    result = qa(question=q, context=context)
    print(f"Q: {q}")
    print(f"A: {result['answer']} (score: {result['score']:.1%}, pos: {result['start']}-{result['end']})")
    print()
# Q: What is the capital of Indonesia?
# A: Jakarta (score: 97.2%, pos: 92-99)
#
# Q: How many islands does Indonesia have?
# A: over 17,000 (score: 85.3%, pos: 131-143)
#
# Q: When did Indonesia become independent?
# A: August 17, 1945 (score: 95.1%, pos: 233-249)
#
# Q: What is the population of Indonesia?
# A: 270 million (score: 88.7%, pos: 161-172)
πŸ”¬

2. How QA Models Work β€” Start/End Logits

Model predicts TWO things: answer START position and answer END position
35_qa_internals.py β€” QA Model Internals πŸ”¬ (python)
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name).to("cuda")

question = "What is the capital?"
context = "Indonesia is a country in Southeast Asia. Jakarta is the capital city."

# ===========================
# 1. Tokenize question + context as a PAIR
# ===========================
inputs = tokenizer(question, context, return_tensors="pt").to("cuda")

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print("Tokens:", tokens)
# ['[CLS]', 'What', 'is', 'the', 'capital', '?', '[SEP]',
#  'Indonesia', 'is', 'a', 'country', 'in', 'Southeast', 'Asia', '.',
#  'Jakarta', 'is', 'the', 'capital', 'city', '.', '[SEP]']
#  ← question β†’              ← context β†’
# token_type_ids: [0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
# 0=question, 1=context

# ===========================
# 2. Forward pass β†’ start_logits & end_logits
# ===========================
with torch.no_grad():
    outputs = model(**inputs)

print(f"start_logits shape: {outputs.start_logits.shape}")  # (1, 22)
print(f"end_logits shape:   {outputs.end_logits.shape}")    # (1, 22)
# One score per token! The model rates every token:
# "How likely is this token the START of the answer?"
# "How likely is this token the END of the answer?"

# ===========================
# 3. Find answer span
# ===========================
start_idx = outputs.start_logits.argmax().item()
end_idx = outputs.end_logits.argmax().item()

print(f"Start token [{start_idx}]: '{tokens[start_idx]}'")  # 'Jakarta'
print(f"End token   [{end_idx}]: '{tokens[end_idx]}'")      # 'Jakarta'

# Decode answer
answer_ids = inputs["input_ids"][0][start_idx:end_idx + 1]
answer = tokenizer.decode(answer_ids)
print(f"Answer: '{answer}'")  # 'Jakarta'

# ===========================
# 4. Confidence score
# ===========================
start_probs = torch.softmax(outputs.start_logits, dim=-1)
end_probs = torch.softmax(outputs.end_logits, dim=-1)
score = (start_probs[0, start_idx] * end_probs[0, end_idx]).item()
print(f"Confidence: {score:.1%}")  # 94.2%
QA Model Internals β€” Start & End Logits Visualized

Tokens: [CLS] What is the capital ? [SEP] Indonesia is a country ... Jakarta is the capital city . [SEP]
Index:    0    1   2   3    4     5   6       7     8  9    10    ...   15    16  17    18     19  20  21

start_logits (how likely is this token the START of the answer?)
  [CLS]     : -5.2  β–ˆβ–ˆ
  What      : -4.8  β–ˆβ–ˆ
  ...
  Indonesia : -2.1  β–ˆβ–ˆβ–ˆβ–ˆ
  ...
  Jakarta   :  8.7  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ  ← HIGHEST!
  is        : -3.1  β–ˆβ–ˆβ–ˆ
  the       : -4.2  β–ˆβ–ˆ
  capital   :  2.3  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
  ...

end_logits (how likely is this token the END of the answer?)
  ...
  Jakarta   :  9.1  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ  ← HIGHEST!
  is        : -2.8  β–ˆβ–ˆβ–ˆ
  the       : -3.5  β–ˆβ–ˆ
  capital   :  1.8  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
  city      :  0.5  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
  ...

β†’ start=15 (Jakarta), end=15 (Jakarta) β†’ Answer = tokens[15:16] = "Jakarta" βœ…
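One caveat about the `argmax()` shortcut used in the code above: taking each head's argmax independently can, in principle, yield an end position before the start position. Production QA pipelines instead score every valid (start, end) pair jointly and keep the best one. A minimal sketch of that idea, using made-up logits (not from a real model) where the two argmaxes would produce an invalid span:

```python
import math

# Hypothetical start/end logits for a 6-token sequence (hand-picked numbers,
# NOT real model output) -- note end_logits peaks BEFORE start_logits peaks.
start_logits = [0.1, 0.2, 5.0, 0.3, 1.0, 0.1]
end_logits   = [0.2, 6.0, 0.5, 0.4, 4.0, 0.1]

def best_span(start_logits, end_logits, max_answer_len=15):
    """Score every VALID pair (start <= end, not too long) jointly
    and return the highest-scoring one."""
    best = (0, 0, -math.inf)
    for s, s_logit in enumerate(start_logits):
        for e, e_logit in enumerate(end_logits):
            if s <= e < s + max_answer_len:
                score = s_logit + e_logit
                if score > best[2]:
                    best = (s, e, score)
    return best

s, e, score = best_span(start_logits, end_logits)
print(s, e)  # 2 4  -- naive argmax would give start=2, end=1: an invalid span!
```

Real pipelines also restrict candidates to context tokens and typically take a softmax over the pair scores, but the valid-pair search is the core trick.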
πŸ“Š

3. SQuAD Dataset β€” Classic QA Benchmark

100k+ question-answer pairs from Wikipedia β€” global QA evaluation standard
36_squad_dataset.py β€” Explore SQuAD (python)
from datasets import load_dataset

dataset = load_dataset("squad")
print(dataset)
# DatasetDict({
#     train: Dataset({features: ['id','title','context','question','answers'], num_rows: 87599})
#     validation: Dataset({num_rows: 10570})
# })

example = dataset["train"][0]
print(f"Title:    {example['title']}")
print(f"Context:  {example['context'][:150]}...")
print(f"Question: {example['question']}")
print(f"Answers:  {example['answers']}")
# Title:    University_of_Notre_Dame
# Context:  "Architecturally, the most striking of the univer..."
# Question: "To whom did the Virgin Mary allegedly appear in 1858..."
# Answers:  {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}

# KEY FORMAT:
# answers = {'text': ['answer text'], 'answer_start': [char_position]}
# answer_start = CHARACTER position (not token position!)
# There can be multiple valid answers (SQuAD v2 also has unanswerable questions)

# SQuAD v2 (with unanswerable questions):
# squad_v2 = load_dataset("squad_v2")
# Some questions have NO answer in the context!
# The model must learn to answer "I don't know" (empty answer)
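Because `answer_start` is a character offset into `context`, a handy sanity check on any SQuAD-style dataset is to slice the context and compare it with the answer text. A small sketch using a hand-made example dict (not loaded from the Hub):

```python
# Hypothetical SQuAD-style example (hand-made for illustration)
example = {
    "context": "Jakarta is the capital and largest city of Indonesia.",
    "question": "What is the capital of Indonesia?",
    "answers": {"text": ["Jakarta"], "answer_start": [0]},
}

def check_answer_alignment(example):
    """Verify each gold answer really sits at its claimed character offset."""
    ctx = example["context"]
    for text, start in zip(example["answers"]["text"],
                           example["answers"]["answer_start"]):
        if ctx[start:start + len(text)] != text:
            return False
    return True

print(check_answer_alignment(example))  # True
```

Running a check like this over a custom dataset before fine-tuning catches off-by-one annotation errors that would otherwise silently corrupt the start/end labels.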
πŸ“

4. Long Context Problem β€” Stride & Overflow Handling

BERT max = 512 tokens. But many contexts > 512 tokens. How?

Problem: BERT can only accept 512 tokens. But Wikipedia contexts can be 1000+ tokens. If we truncate, the answer might get cut off! Solution: sliding window (stride) β€” split context into overlapping chunks.

Sliding Window β€” Splitting a Long Context

Long context:  [==========================================]  (800 tokens)
BERT max: 512 tokens (question takes ~30 β†’ context gets ~480)

Without stride: truncate β†’ an answer near the end is LOST! ❌
  [============================CUT OFF]  (512 tokens, rest discarded)

With stride=128: overlapping chunks β†’ the answer is ALWAYS covered βœ…
  Chunk 1: [===================]            (tokens 0-480)
  Chunk 2:           [===================]  (tokens 352-832)  overlap!
                      ↑ 128-token overlap region

Answer at token 400? β†’ It appears in Chunk 1 AND Chunk 2!
The model predicts on BOTH chunks; keep the higher-confidence answer.
37_long_context.py β€” Stride Tokenization for QA (python)
# Tokenize with stride (overlap) for long contexts
def tokenize_qa_with_stride(examples):
    tokenized = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",     # truncate CONTEXT only, not question!
        max_length=384,
        stride=128,                   # overlap between chunks
        return_overflowing_tokens=True, # return ALL chunks!
        return_offsets_mapping=True,    # char↔token mapping (for answer position)
        padding="max_length",
    )
    # 1 example can become 2-3 chunks if the context is long!
    # overflow_to_sample_mapping: chunk β†’ original example index
    return tokenized

# One example with a long context can produce:
# - 1 chunk if the context is short (<384 tokens)
# - 2-3 chunks if the context is long (800-1200 tokens)
# - Chunks overlap by 128 tokens β†’ the answer is never cut in half!
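The chunking that `return_overflowing_tokens` performs can be sketched in plain Python over a list of token ids. This is illustrative only: the real tokenizer also prepends the question and special tokens to each chunk, but the window arithmetic (each window overlaps the previous one by `stride` tokens) is the same:

```python
def sliding_chunks(token_ids, max_len=384, stride=128):
    """Split token_ids into overlapping windows: each new window starts
    (max_len - stride) tokens after the previous one, so consecutive
    windows share exactly `stride` tokens."""
    step = max_len - stride
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end
        start += step
    return chunks

tokens = list(range(800))  # pretend context of 800 tokens
chunks = sliding_chunks(tokens, max_len=384, stride=128)
print([(c[0], c[-1]) for c in chunks])  # [(0, 383), (256, 639), (512, 799)]
```

An answer sitting near token 383 lands in both chunk 1 and chunk 2, which is exactly why truncation-only preprocessing loses answers and stride does not.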
πŸ”₯

5. Fine-Tune BERT for QA on SQuAD

Complete pipeline β€” most complex in this series due to answer position handling
38_qa_finetune.py β€” SQuAD Fine-Tuning πŸ”₯πŸ”₯ (python)
from transformers import (
    AutoTokenizer, AutoModelForQuestionAnswering,
    TrainingArguments, Trainer, DefaultDataCollator
)
from datasets import load_dataset

# ═══════════════════════════════════════
# STEP 1: LOAD
# ═══════════════════════════════════════
model_name = "distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
dataset = load_dataset("squad")

# ═══════════════════════════════════════
# STEP 2: TOKENIZE + FIND ANSWER POSITION IN TOKENS
# This is the HARDEST part of QA fine-tuning!
# ═══════════════════════════════════════
def preprocess_training(examples):
    tokenized = tokenizer(
        examples["question"], examples["context"],
        truncation="only_second", max_length=384,
        stride=128, return_overflowing_tokens=True,
        return_offsets_mapping=True, padding="max_length",
    )

    sample_map = tokenized.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized.pop("offset_mapping")

    start_positions, end_positions = [], []

    for i, offsets in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answers = examples["answers"][sample_idx]
        start_char = answers["answer_start"][0]
        end_char = start_char + len(answers["text"][0])

        # Find context token range (skip question tokens)
        sequence_ids = tokenized.sequence_ids(i)
        ctx_start = 0
        while sequence_ids[ctx_start] != 1: ctx_start += 1
        ctx_end = len(sequence_ids) - 1
        while sequence_ids[ctx_end] != 1: ctx_end -= 1

        # Check if the answer is FULLY inside this chunk
        if offsets[ctx_start][0] > start_char or offsets[ctx_end][1] < end_char:
            # Answer not fully in this chunk β†’ label the CLS token (no answer)
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Find token positions of answer
            s = ctx_start
            while s <= ctx_end and offsets[s][0] <= start_char: s += 1
            start_positions.append(s - 1)

            e = ctx_end
            while e >= ctx_start and offsets[e][1] >= end_char: e -= 1
            end_positions.append(e + 1)

    tokenized["start_positions"] = start_positions
    tokenized["end_positions"] = end_positions
    return tokenized

tokenized = dataset["train"].map(preprocess_training, batched=True,
    remove_columns=dataset["train"].column_names)

# ═══════════════════════════════════════
# STEP 3: TRAIN
# ═══════════════════════════════════════
args = TrainingArguments(
    output_dir="./qa-distilbert",
    num_train_epochs=3, per_device_train_batch_size=16,
    learning_rate=2e-5, weight_decay=0.01, fp16=True,
    eval_strategy="epoch", save_strategy="epoch",
    load_best_model_at_end=True, report_to="none",
)

tokenized_val = dataset["validation"].map(preprocess_training, batched=True,
    remove_columns=dataset["validation"].column_names)

trainer = Trainer(
    model=model, args=args, train_dataset=tokenized,
    eval_dataset=tokenized_val,  # required: eval_strategy="epoch" + load_best_model_at_end
    data_collator=DefaultDataCollator(), tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("./qa-distilbert-final")

# Test with pipeline!
from transformers import pipeline
qa_pipe = pipeline("question-answering", model="./qa-distilbert-final", device=0)
result = qa_pipe(question="What is the capital?", context="Jakarta is the capital of Indonesia.")
print(f"Answer: {result['answer']} ({result['score']:.1%})")
# Answer: Jakarta (95.3%)


πŸŽ“ Why Is QA Fine-Tuning More Complex Than Text Classification?
1. Input = pair (question + context), not single text
2. Label = positions (start_char β†’ start_token, end_char β†’ end_token), not class index
3. Charβ†’token mapping: dataset answer = character position, model needs token position
4. Long context: stride/overflow β†’ one example becomes multiple chunks
5. Answer in chunk?: must check if answer is in this chunk or not
This is the most complex fine-tuning task in the HF series β€” but the result is powerful!

πŸ“Š

6. QA Evaluation β€” Exact Match & F1 Score

Two metrics: EM (answer EXACTLY matches) and F1 (word overlap)
39_qa_eval.py β€” QA Evaluation Metrics (python)
# ===========================
# QA uses 2 metrics:
# ===========================
#
# 1. Exact Match (EM): predicted answer == gold answer?
#    Predicted: "Jakarta"  Gold: "Jakarta"  β†’ EM = 1.0 βœ…
#    Predicted: "Jakarta"  Gold: "the city of Jakarta" β†’ EM = 0.0 ❌
#    (strict! must match EXACTLY after normalization)
#
# 2. F1 Score: word overlap between predicted and gold
#    Predicted: "Jakarta"  Gold: "the city of Jakarta"
#    Overlap: {"Jakarta"} = 1 word
#    Precision: 1/1 = 100%  (every predicted word appears in gold)
#    Recall: 1/4 = 25%      (only 1 of the 4 gold words predicted)
#    F1 = 2 Γ— (1.0 Γ— 0.25) / (1.0 + 0.25) = 0.40
#
# Typical results for BERT on SQuAD v1.1:
# EM: ~80%  F1: ~88%
# (F1 is always >= EM thanks to partial credit)

import evaluate
squad_metric = evaluate.load("squad")

predictions = [{"id": "1", "prediction_text": "Jakarta"}]
references = [{"id": "1", "answers": {"text": ["Jakarta"], "answer_start": [0]}}]

result = squad_metric.compute(predictions=predictions, references=references)
print(result)  # {'exact_match': 100.0, 'f1': 100.0}
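The F1 arithmetic in the comments above can be reproduced from scratch. This sketch computes plain word-overlap F1 on whitespace tokens; note that the official SQuAD metric first normalizes both strings (lowercasing, stripping punctuation and articles), so real scores can differ slightly from this simplified version:

```python
from collections import Counter

def word_f1(prediction, gold):
    """Token-overlap F1 as in the worked example above (no SQuAD
    normalization -- the official metric also lowercases and strips
    punctuation/articles before comparing)."""
    pred_toks = prediction.split()
    gold_toks = gold.split()
    # Clipped overlap: each word counts at most as often as in the other side
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)   # 1/1 in the example
    recall = overlap / len(gold_toks)      # 1/4 in the example
    return 2 * precision * recall / (precision + recall)

print(word_f1("Jakarta", "the city of Jakarta"))  # 0.4
```

Exact Match is then just `prediction == gold` after the same normalization, which is why F1 can only be greater than or equal to EM.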
πŸ”„

7. Encoder-Decoder (Seq2Seq) Architecture β€” T5 & BART

The third architecture: combining encoder (understanding) + decoder (generating)
3 Transformer Architectures in HF β€” Full Recap

1. Encoder Only (BERT)  ← Page 2 & 4
   Input β†’ [Encoder] β†’ Representation β†’ Classification Head
   Best for: classification, NER, QA
   Models: BERT, RoBERTa, DeBERTa, DistilBERT
   HF: AutoModelForSequenceClassification / TokenClassification / QA

2. Decoder Only (GPT)  ← Page 3
   Prompt β†’ [Decoder] β†’ Next Token β†’ Next Token β†’ ...
   Best for: text generation, chat, code
   Models: GPT-2, LLaMA, Mistral, Gemma
   HF: AutoModelForCausalLM

3. Encoder-Decoder (T5/BART)  ← Page 5 (THIS!)
   Input β†’ [Encoder] β†’ Representation β†’ [Decoder] β†’ Output Sequence
   Best for: translation, summarization, text-to-text
   Models: T5, BART, mBART, mT5, FLAN-T5
   HF: AutoModelForSeq2SeqLM

Encoder-Decoder combines BOTH strengths:
  Encoder: understands the input bidirectionally (like BERT)
  Decoder: generates the output autoregressively (like GPT)
  β†’ Perfect for tasks that must UNDERSTAND an input, then PRODUCE a different output
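The division of labor in the recap above (the encoder reads the whole input once; the decoder then emits one token at a time, conditioned on that encoding plus everything it has generated so far) can be illustrated with a toy "reverse the sequence" model. No real Transformer here, just the control flow that `model.generate()` runs for any Seq2Seq model:

```python
def encode(src_tokens):
    """Toy 'encoder': reads the FULL input once (bidirectional access)."""
    return list(src_tokens)  # stand-in for a stack of hidden states

def decode_step(memory, generated):
    """Toy 'decoder' step: the next token depends on the encoder memory
    AND on everything generated so far (autoregressive, like GPT)."""
    idx = len(generated)                    # position we are about to fill
    if idx >= len(memory):
        return "</s>"                       # end-of-sequence
    return memory[len(memory) - 1 - idx]    # toy task: emit input reversed

def generate(src_tokens, max_len=20):
    memory = encode(src_tokens)             # encoder runs ONCE
    out = []
    for _ in range(max_len):                # decoder loop, token by token
        nxt = decode_step(memory, out)
        if nxt == "</s>":
            break
        out.append(nxt)
    return out

print(generate(["saya", "belajar", "NLP"]))  # ['NLP', 'belajar', 'saya']
```

The key contrast with a decoder-only model: the input is consumed once by a separate component with full bidirectional visibility, instead of being crammed into the same left-to-right stream as the output.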
πŸ”€

8. T5 Text-to-Text Framework β€” All Tasks = Textβ†’Text

T5 treats ALL NLP tasks as "input text, output text" β€” elegant!
40_t5_framework.py β€” T5 Unified Text-to-Text (python)
from transformers import pipeline

# ===========================
# T5 treats EVERY task as text β†’ text!
# Just add a PREFIX to tell T5 what task to do.
# ===========================

# Translation
translator = pipeline("text2text-generation", model="google-t5/t5-small")
result = translator("translate English to French: I love machine learning.")
print(result[0]["generated_text"])
# "J'aime l'apprentissage automatique."

# Summarization
result = translator("summarize: Hugging Face is a company that provides tools...", max_length=50)
print(result[0]["generated_text"])

# Sentiment (as text-to-text!)
result = translator("sst2 sentence: I love this movie.")
print(result[0]["generated_text"])
# "positive"

# T5 sizes (all use same text-to-text format):
# t5-small:  60M params,  ~240MB   (fits Colab easily!)
# t5-base:   220M params, ~890MB
# t5-large:  770M params, ~3GB
# t5-3b:     3B params,   ~12GB
# t5-11b:    11B params,  ~42GB
# flan-t5-*: instruction-tuned versions (MUCH better!)

# FLAN-T5 = T5 + instruction tuning (1.8k tasks!)
flan = pipeline("text2text-generation", model="google/flan-t5-small")
result = flan("What is the capital of Indonesia?")
print(result[0]["generated_text"])
# "Jakarta" ← no prefix needed! FLAN-T5 understands instructions.
πŸ“

9. Fine-Tune T5 for Summarization β€” CNN/DailyMail

Teach T5 to summarize news articles into a few sentences
41_t5_summarization.py β€” T5 Summarization Fine-Tuning πŸ”₯ (python)
from transformers import (
    AutoTokenizer, AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments, Seq2SeqTrainer,
    DataCollatorForSeq2Seq
)
from datasets import load_dataset

# ═══════════════════════════════════════
# STEP 1: LOAD
# ═══════════════════════════════════════
model_name = "google-t5/t5-small"  # 60M params, fits Colab!
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

dataset = load_dataset("cnn_dailymail", "3.0.0")
# train: 287k articles, val: 13k, test: 11k

# Use a subset for quick training
train_data = dataset["train"].shuffle(seed=42).select(range(10000))
val_data = dataset["validation"].shuffle(seed=42).select(range(1000))

# ═══════════════════════════════════════
# STEP 2: TOKENIZE (input + target!)
# ═══════════════════════════════════════
def preprocess(examples):
    # T5 needs prefix!
    inputs = ["summarize: " + doc for doc in examples["article"]]
    targets = examples["highlights"]

    # Tokenize inputs
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)

    # Tokenize targets (labels!)
    labels = tokenizer(text_target=targets, max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]

    return model_inputs

tokenized_train = train_data.map(preprocess, batched=True,
    remove_columns=train_data.column_names)
tokenized_val = val_data.map(preprocess, batched=True,
    remove_columns=val_data.column_names)

# ═══════════════════════════════════════
# STEP 3: DATA COLLATOR (Seq2Seq specific!)
# ═══════════════════════════════════════
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
# Pads BOTH input_ids AND labels!
# Labels padding β†’ -100 (ignored in loss)

# ═══════════════════════════════════════
# STEP 4: TRAIN
# ═══════════════════════════════════════
args = Seq2SeqTrainingArguments(
    output_dir="./t5-summarization",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=4,
    learning_rate=3e-5,
    weight_decay=0.01,
    fp16=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    predict_with_generate=True,  # ← CRITICAL for Seq2Seq eval!
    generation_max_length=128,
    report_to="none",
)

trainer = Seq2SeqTrainer(
    model=model, args=args,
    train_dataset=tokenized_train, eval_dataset=tokenized_val,
    tokenizer=tokenizer, data_collator=data_collator,
)
trainer.train()

# Test!
from transformers import pipeline
summarizer = pipeline("summarization", model="./t5-summarization", device=0)
article = "Hugging Face has raised $235 million in a Series D funding round..."
summary = summarizer(article, max_length=60, min_length=20)
print(summary[0]["summary_text"])


πŸŽ“ Key Differences: Seq2Seq vs Classification Fine-Tuning:
1. Model: AutoModelForSeq2SeqLM (not ForSequenceClassification)
2. Labels: = tokenized TARGET text (not integer class)
3. Collator: DataCollatorForSeq2Seq (pads input + labels)
4. Trainer: Seq2SeqTrainer + Seq2SeqTrainingArguments
5. Eval: predict_with_generate=True β†’ model generates text during eval
6. Prefix: T5 needs prefix ("summarize:", "translate:"). BART doesn't.
7. Metric: ROUGE (not accuracy/F1)
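Point 3 in the list above deserves a closer look: `DataCollatorForSeq2Seq` pads the labels with -100, the index that PyTorch's cross-entropy loss ignores, so padded positions contribute nothing to training. A minimal sketch of just that padding step (simplified; the real collator also pads `input_ids` and can prepare `decoder_input_ids`):

```python
def pad_labels(batch_labels, pad_to=None):
    """Pad variable-length label sequences with -100 (the ignore_index of
    PyTorch cross-entropy) so padded positions are skipped by the loss."""
    length = pad_to or max(len(labels) for labels in batch_labels)
    return [labels + [-100] * (length - len(labels)) for labels in batch_labels]

labels = [[200, 7, 1], [45, 9, 12, 88, 1]]
print(pad_labels(labels))
# [[200, 7, 1, -100, -100], [45, 9, 12, 88, 1]]
```

Padding labels with the tokenizer's regular pad token instead would make the model waste capacity learning to predict padding, which is exactly what -100 avoids.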

🌐

10. Fine-Tune T5 for Translation β€” English↔Indonesian

Translation template β€” change dataset & prefix, rest is identical to summarization
42_t5_translation.py β€” Translation Fine-Tuning (python)
# Translation = SAME pipeline as summarization!
# Just change: dataset, prefix, and metrics.

# ===========================
# 1. Use pre-trained translation model (no fine-tuning!)
# ===========================
from transformers import pipeline

# Helsinki-NLP models: dedicated translation models
en_to_id = pipeline("translation", model="Helsinki-NLP/opus-mt-en-id")
result = en_to_id("I am learning artificial intelligence with Hugging Face.")
print(result[0]["translation_text"])
# "Saya belajar kecerdasan buatan dengan Hugging Face."

id_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-id-en")
result = id_to_en("Jakarta adalah ibukota Indonesia.")
print(result[0]["translation_text"])
# "Jakarta is the capital of Indonesia."

# ===========================
# 2. Fine-tune T5 for custom translation
# ===========================
def preprocess_translation(examples):
    # Add T5 prefix for translation
    inputs = ["translate English to Indonesian: " + s for s in examples["en"]]
    targets = examples["id"]

    model_inputs = tokenizer(inputs, max_length=128, truncation=True)
    labels = tokenizer(text_target=targets, max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Use parallel corpus (English-Indonesian pairs)
# dataset = load_dataset("opus100", "en-id")
# Or your own CSV: columns "en" and "id"
# dataset = load_dataset("csv", data_files="parallel_corpus.csv")

# Rest is IDENTICAL to summarization pipeline!
# tokenize β†’ DataCollatorForSeq2Seq β†’ Seq2SeqTrainer β†’ train
πŸ“Š

11. BLEU & ROUGE Metrics β€” Evaluating Text Generation

BLEU for translation, ROUGE for summarization β€” two industry standards
43_bleu_rouge.py β€” Generation Evaluation Metrics (python)
import evaluate

# ===========================
# 1. BLEU β€” for Translation
# Measures n-gram PRECISION (how many predicted n-grams are in reference?)
# ===========================
bleu = evaluate.load("bleu")

predictions = ["Jakarta is the capital of Indonesia"]
references = [["Jakarta is the capital city of Indonesia"]]
# references = list of LIST (multiple valid translations per example!)

result = bleu.compute(predictions=predictions, references=references)
print(f"BLEU: {result['bleu']:.2%}")  # ~71%
# BLEU range: 0-100%. >40% = good translation, >60% = very good

# ===========================
# 2. ROUGE β€” for Summarization
# Measures n-gram RECALL (how many reference n-grams are in prediction?)
# ===========================
rouge = evaluate.load("rouge")

predictions = ["HF raised $235M at $4.5B valuation"]
references = ["Hugging Face raised $235 million in Series D funding"]

result = rouge.compute(predictions=predictions, references=references)
print(f"ROUGE-1: {result['rouge1']:.2%}")   # unigram overlap
print(f"ROUGE-2: {result['rouge2']:.2%}")   # bigram overlap
print(f"ROUGE-L: {result['rougeL']:.2%}")   # longest common subsequence
# ROUGE-1: ~55%, ROUGE-2: ~25%, ROUGE-L: ~45% (typical for summarization)

# ===========================
# BLEU vs ROUGE β€” when to use which?
# ===========================
# BLEU: translation (precision-focused β€” are predicted words correct?)
# ROUGE: summarization (recall-focused β€” are reference ideas covered?)
# Both: imperfect! High BLEU/ROUGE β‰  good text. Low β‰  bad text.
# Always combine with human evaluation for production quality.
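BLEU's two ingredients from the comments above (clipped n-gram precision and a brevity penalty for translations shorter than the reference) can be shown from scratch. This unigram-only sketch is illustrative; real BLEU geometrically averages 1- to 4-gram precisions, so its numbers will not match `evaluate`'s output:

```python
import math
from collections import Counter

def unigram_bleu(prediction, reference):
    """BLEU restricted to unigrams: clipped precision x brevity penalty."""
    pred = prediction.split()
    ref = reference.split()
    # Clipped precision: each predicted word counts at most as often
    # as it appears in the reference
    overlap = sum((Counter(pred) & Counter(ref)).values())
    precision = overlap / len(pred)
    # Brevity penalty: punish predictions shorter than the reference
    bp = 1.0 if len(pred) > len(ref) else math.exp(1 - len(ref) / len(pred))
    return bp * precision

score = unigram_bleu("Jakarta is the capital of Indonesia",
                     "Jakarta is the capital city of Indonesia")
print(round(score, 3))  # precision 6/6 = 1.0, BP = exp(1 - 7/6), so ~0.846
```

The brevity penalty exists because precision alone rewards very short outputs: predicting just "Jakarta" would score 100% unigram precision.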
Metric      | Focus                              | Task          | Range  | Good Score
BLEU        | Precision (predicted vs reference) | Translation   | 0-100% | >40%
ROUGE-1     | Unigram recall                     | Summarization | 0-100% | >40%
ROUGE-2     | Bigram recall                      | Summarization | 0-100% | >20%
ROUGE-L     | Longest common subsequence         | Summarization | 0-100% | >35%
Exact Match | Exact string match                 | QA            | 0-100% | >75%
F1 (QA)     | Word overlap                       | QA            | 0-100% | >85%
πŸ’»

12. Where to Run? β€” VRAM for T5/BART

Model          | Params | VRAM (FP16 fine-tune) | Colab T4?
T5-small       | 60M    | ~4 GB                 | βœ… Very comfortable
FLAN-T5-small  | 80M    | ~5 GB                 | βœ… Comfortable
T5-base        | 220M   | ~10 GB                | βœ… OK
BART-base      | 140M   | ~7 GB                 | βœ… OK
BART-large-CNN | 400M   | ~14 GB                | ⚠️ Small batch
T5-large       | 770M   | >16 GB                | ❌ Needs A100
FLAN-T5-large  | 780M   | >16 GB                | ❌ Needs A100

πŸŽ‰ All code in Page 5 uses T5-small/DistilBERT which runs comfortably on free Google Colab (T4 16GB). QA fine-tuning: ~10 min. T5 summarization: ~20 min. Setup: same as Page 2.
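The VRAM figures in the table can be sanity-checked with a back-of-envelope rule (an approximation, not a measurement): mixed-precision Adam keeps roughly 16 bytes per parameter (fp16 weights + fp16 gradients + fp32 master weights + two fp32 optimizer moments), and activation memory, which depends on batch size and sequence length, comes on top of that:

```python
def estimate_finetune_gb(n_params, bytes_per_param=16):
    """Rough model + optimizer state for mixed-precision Adam fine-tuning.
    Activations come on top and often dominate for long sequences."""
    return n_params * bytes_per_param / 1024**3

for name, n in [("t5-small", 60e6), ("t5-base", 220e6), ("t5-large", 770e6)]:
    print(f"{name}: ~{estimate_finetune_gb(n):.1f} GB states (+ activations)")
```

For t5-small this gives under 1 GB of states; the gap to the ~4 GB in the table is activations and framework overhead, which is also why t5-large's ~11 GB of states alone already crowds a 16 GB T4 once activations are added.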

πŸ“

13. Page 5 Summary

Concept                | What It Is                         | Key Code
Extractive QA          | Find the answer in the context     | AutoModelForQuestionAnswering
Start/End Logits       | Model points to start+end position | outputs.start_logits.argmax()
SQuAD                  | 87k QA-pair benchmark              | load_dataset("squad")
Stride                 | Sliding window for long context    | stride=128, return_overflowing_tokens
Encoder-Decoder        | Understand (enc) + Generate (dec)  | AutoModelForSeq2SeqLM
T5                     | Text-to-text for every task        | "summarize: ..." / "translate: ..."
Seq2SeqTrainer         | Trainer for Seq2Seq models         | predict_with_generate=True
DataCollatorForSeq2Seq | Pads inputs + labels               | DataCollatorForSeq2Seq(tokenizer, model)
BLEU                   | Translation quality (precision)    | evaluate.load("bleu")
ROUGE                  | Summarization quality (recall)     | evaluate.load("rouge")
← Previous Page

Page 4 β€” Token Classification & NER


πŸ“˜

Coming Next: Page 6 β€” Sentence Embeddings & Semantic Search

Turn sentences into meaningful vectors! Page 6 covers: Sentence Transformers library, semantic similarity (cosine similarity), fine-tuning embedding models, FAISS vector search for millions of documents, building a semantic search engine, retrieval-augmented generation (RAG) foundations, and cross-encoder vs bi-encoder.