πŸ“ Artikel ini ditulis dalam Bahasa Indonesia & English
πŸ“ This article is available in English & Bahasa Indonesia

❓ Learn Hugging Face β€” Page 5

Question Answering
& Seq2Seq (T5/BART)


Two of the most powerful NLP tasks combining understanding and generation. Page 5 covers in super detail: Extractive QA β€” finding answers within context (SQuAD dataset), how QA models work (start/end logits β†’ span extraction), long context exceeding max_length (stride/overflow handling), fine-tuning BERT for QA on SQuAD, QA evaluation (Exact Match & F1 score), Encoder-Decoder (Seq2Seq) architecture β€” T5 and BART, T5 text-to-text framework ("translate English to French: ..."), fine-tuning T5 for translation and summarization, BLEU and ROUGE metrics, where to run (Colab T4 enough for T5-small), and abstractive vs extractive QA.

πŸ“… March 2026 ⏱ 42 min read
🏷 QA Β· SQuAD Β· T5 Β· BART Β· Seq2Seq Β· Translation Β· Summarization Β· BLEU Β· ROUGE
πŸ“š Learn Hugging Face Series:

πŸ“‘ Table of Contents β€” Page 5

  1. Extractive QA β€” Finding answers within context
  2. How QA Models Work β€” Start/end logits β†’ span
  3. SQuAD Dataset β€” Classic QA benchmark
  4. Long Context Problem β€” Stride & overflow handling
  5. Fine-Tune BERT for QA β€” Complete SQuAD pipeline
  6. QA Evaluation β€” Exact Match & F1
  7. Encoder-Decoder Architecture β€” T5 & BART
  8. T5 Text-to-Text Framework β€” All tasks = textβ†’text
  9. Fine-Tune T5 for Summarization β€” CNN/DailyMail
  10. Fine-Tune T5 for Translation β€” English↔Indonesian
  11. BLEU & ROUGE Metrics β€” Evaluating text generation
  12. Where to Run? β€” VRAM for T5/BART
  13. Summary & Page 6 Preview
❓

1. Extractive QA β€” Finding Answers Within Context

Given a question + context paragraph β†’ find the text span that answers the question

Extractive QA = the model is given a question and context (paragraph), then must find the text span within the context that answers the question. The model doesn't generate new text β€” it only points to the start and end positions of the answer in the context. This differs from generative QA (ChatGPT-style) which writes new answers.

Extractive QA β€” The Model "Points" to the Answer in the Context

Question: "What is the capital of Indonesia?"
Context:  "Indonesia is a country in Southeast Asia. Jakarta is the capital
           and largest city. The country has over 17,000 islands and a
           population of 270 million people."

Model output:
  Start position: 42  (character 42 = the "J" of "Jakarta")
  End position:   49  (exclusive end β€” character 48 is the final "a")
  β†’ Answer: "Jakarta"  ← EXTRACTED from the context, not generated!

  Extractive:  the answer MUST exist in the context (copy a span)
  Generative:  the answer can be written from scratch (ChatGPT-style)
  Abstractive: the answer is condensed/paraphrased from the context
34_qa_pipeline.py β€” Instant QA Pipeline (python)
from transformers import pipeline

# ===========================
# QA pipeline β€” zero training!
# ===========================
qa = pipeline("question-answering", device=0)

context = """
Indonesia is a country in Southeast Asia and Oceania between the Indian
and Pacific oceans. Jakarta is the capital and most populous city. The
country has over 17,000 islands with a population of 270 million,
making it the world's fourth most populous country. Indonesia became
independent from the Netherlands on August 17, 1945.
"""

# Ask multiple questions on the SAME context
questions = [
    "What is the capital of Indonesia?",
    "How many islands does Indonesia have?",
    "When did Indonesia become independent?",
    "What is the population of Indonesia?",
]

for q in questions:
    result = qa(question=q, context=context)
    print(f"Q: {q}")
    print(f"A: {result['answer']} (score: {result['score']:.1%}, pos: {result['start']}-{result['end']})")
    print()
# Q: What is the capital of Indonesia?
# A: Jakarta (score: 97.2%, pos: 92-99)
#
# Q: How many islands does Indonesia have?
# A: over 17,000 (score: 85.3%, pos: 131-143)
#
# Q: When did Indonesia become independent?
# A: August 17, 1945 (score: 95.1%, pos: 233-249)
#
# Q: What is the population of Indonesia?
# A: 270 million (score: 88.7%, pos: 161-172)
πŸ”¬

2. How QA Models Work β€” Start/End Logits

Model predicts TWO things: answer START position and answer END position
35_qa_internals.py β€” QA Model Internals πŸ”¬ (python)
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name).to("cuda")

question = "What is the capital?"
context = "Indonesia is a country in Southeast Asia. Jakarta is the capital city."

# ===========================
# 1. Tokenize question + context as a PAIR
# ===========================
inputs = tokenizer(question, context, return_tensors="pt").to("cuda")

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print("Tokens:", tokens)
# ['[CLS]', 'What', 'is', 'the', 'capital', '?', '[SEP]',
#  'Indonesia', 'is', 'a', 'country', 'in', 'Southeast', 'Asia', '.',
#  'Jakarta', 'is', 'the', 'capital', 'city', '.', '[SEP]']
#  ← question β†’              ← context β†’
# token_type_ids: [0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
# 0=question, 1=context

# ===========================
# 2. Forward pass β†’ start_logits & end_logits
# ===========================
with torch.no_grad():
    outputs = model(**inputs)

print(f"start_logits shape: {outputs.start_logits.shape}")  # (1, 22)
print(f"end_logits shape:   {outputs.end_logits.shape}")    # (1, 22)
# One score per token! The model rates every token:
# "How likely is this token the START of the answer?"
# "How likely is this token the END of the answer?"

# ===========================
# 3. Find answer span
# ===========================
start_idx = outputs.start_logits.argmax().item()
end_idx = outputs.end_logits.argmax().item()

print(f"Start token [{start_idx}]: '{tokens[start_idx]}'")  # 'Jakarta'
print(f"End token   [{end_idx}]: '{tokens[end_idx]}'")      # 'Jakarta'

# Decode answer
answer_ids = inputs["input_ids"][0][start_idx:end_idx + 1]
answer = tokenizer.decode(answer_ids)
print(f"Answer: '{answer}'")  # 'Jakarta'

# ===========================
# 4. Confidence score
# ===========================
start_probs = torch.softmax(outputs.start_logits, dim=-1)
end_probs = torch.softmax(outputs.end_logits, dim=-1)
score = (start_probs[0, start_idx] * end_probs[0, end_idx]).item()
print(f"Confidence: {score:.1%}")  # 94.2%
QA Model Internals β€” Start & End Logits Visualized

Tokens: [CLS] What is the capital ? [SEP] Indonesia is a country ... Jakarta is the capital city . [SEP]
Index:    0    1   2   3    4     5   6       7     8  9    10    ...   15    16  17    18     19  20  21

start_logits (how likely is this token the START of the answer?)
  [CLS]     : -5.2  β–ˆβ–ˆ
  What      : -4.8  β–ˆβ–ˆ
  ...
  Indonesia : -2.1  β–ˆβ–ˆβ–ˆβ–ˆ
  ...
  Jakarta   :  8.7  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ  ← HIGHEST!
  is        : -3.1  β–ˆβ–ˆβ–ˆ
  the       : -4.2  β–ˆβ–ˆ
  capital   :  2.3  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
  ...

end_logits (how likely is this token the END of the answer?)
  ...
  Jakarta   :  9.1  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ  ← HIGHEST!
  is        : -2.8  β–ˆβ–ˆβ–ˆ
  the       : -3.5  β–ˆβ–ˆ
  capital   :  1.8  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
  city      :  0.5  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
  ...

β†’ start=15 (Jakarta), end=15 (Jakarta) β†’ Answer = tokens[15:16] = "Jakarta" βœ…
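One caveat about the `argmax()` shortcut used in the code above: taking each head's argmax independently can, in principle, yield an end position before the start position. Production QA pipelines instead score every valid (start, end) pair jointly and keep the best one. A minimal sketch of that idea, using made-up logits (not from a real model) where the two argmaxes would produce an invalid span:

```python
import math

# Hypothetical start/end logits for a 6-token sequence (hand-picked numbers,
# NOT real model output) -- note end_logits peaks BEFORE start_logits peaks.
start_logits = [0.1, 0.2, 5.0, 0.3, 1.0, 0.1]
end_logits   = [0.2, 6.0, 0.5, 0.4, 4.0, 0.1]

def best_span(start_logits, end_logits, max_answer_len=15):
    """Score every VALID pair (start <= end, not too long) jointly
    and return the highest-scoring one."""
    best = (0, 0, -math.inf)
    for s, s_logit in enumerate(start_logits):
        for e, e_logit in enumerate(end_logits):
            if s <= e < s + max_answer_len:
                score = s_logit + e_logit
                if score > best[2]:
                    best = (s, e, score)
    return best

s, e, score = best_span(start_logits, end_logits)
print(s, e)  # 2 4  -- naive argmax would give start=2, end=1: an invalid span!
```

Real pipelines also restrict candidates to context tokens and typically take a softmax over the pair scores, but the valid-pair search is the core trick.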
πŸ“Š

3. SQuAD Dataset β€” Classic QA Benchmark

100k+ question-answer pairs from Wikipedia β€” global QA evaluation standard
36_squad_dataset.py β€” Explore SQuAD (python)
from datasets import load_dataset

dataset = load_dataset("squad")
print(dataset)
# DatasetDict({
#     train: Dataset({features: ['id','title','context','question','answers'], num_rows: 87599})
#     validation: Dataset({num_rows: 10570})
# })

example = dataset["train"][0]
print(f"Title:    {example['title']}")
print(f"Context:  {example['context'][:150]}...")
print(f"Question: {example['question']}")
print(f"Answers:  {example['answers']}")
# Title:    University_of_Notre_Dame
# Context:  "Architecturally, the most striking of the univer..."
# Question: "To whom did the Virgin Mary allegedly appear in 1858..."
# Answers:  {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}

# KEY FORMAT:
# answers = {'text': ['answer text'], 'answer_start': [char_position]}
# answer_start = CHARACTER position (not token position!)
# There can be multiple valid answers (SQuAD v2 also has unanswerable questions)

# SQuAD v2 (with unanswerable questions):
# squad_v2 = load_dataset("squad_v2")
# Some questions have NO answer in the context!
# The model must learn to answer "I don't know" (empty answer)
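Because `answer_start` is a character offset into `context`, a handy sanity check on any SQuAD-style dataset is to slice the context and compare it with the answer text. A small sketch using a hand-made example dict (not loaded from the Hub):

```python
# Hypothetical SQuAD-style example (hand-made for illustration)
example = {
    "context": "Jakarta is the capital and largest city of Indonesia.",
    "question": "What is the capital of Indonesia?",
    "answers": {"text": ["Jakarta"], "answer_start": [0]},
}

def check_answer_alignment(example):
    """Verify each gold answer really sits at its claimed character offset."""
    ctx = example["context"]
    for text, start in zip(example["answers"]["text"],
                           example["answers"]["answer_start"]):
        if ctx[start:start + len(text)] != text:
            return False
    return True

print(check_answer_alignment(example))  # True
```

Running a check like this over a custom dataset before fine-tuning catches off-by-one annotation errors that would otherwise silently corrupt the start/end labels.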
πŸ“

4. Long Context Problem β€” Stride & Overflow Handling

BERT max = 512 tokens. But many contexts > 512 tokens. How?

Problem: BERT can only accept 512 tokens. But Wikipedia contexts can be 1000+ tokens. If we truncate, the answer might get cut off! Solution: sliding window (stride) β€” split context into overlapping chunks.

Sliding Window β€” Splitting a Long Context

Long context:  [==========================================]  (800 tokens)
BERT max: 512 tokens (question takes ~30 β†’ context gets ~480)

Without stride: truncate β†’ an answer near the end is LOST! ❌
  [============================CUT OFF]  (512 tokens, rest discarded)

With stride=128: overlapping chunks β†’ the answer is ALWAYS covered βœ…
  Chunk 1: [===================]            (tokens 0-480)
  Chunk 2:           [===================]  (tokens 352-832)  overlap!
                      ↑ 128-token overlap region

Answer at token 400? β†’ It appears in Chunk 1 AND Chunk 2!
The model predicts on BOTH chunks; keep the higher-confidence answer.
37_long_context.py β€” Stride Tokenization for QA (python)
# Tokenize with stride (overlap) for long contexts
def tokenize_qa_with_stride(examples):
    tokenized = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",     # truncate CONTEXT only, not question!
        max_length=384,
        stride=128,                   # overlap between chunks
        return_overflowing_tokens=True, # return ALL chunks!
        return_offsets_mapping=True,    # char↔token mapping (for answer position)
        padding="max_length",
    )
    # 1 example can become 2-3 chunks if the context is long!
    # overflow_to_sample_mapping: chunk β†’ original example index
    return tokenized

# One example with a long context can produce:
# - 1 chunk if the context is short (<384 tokens)
# - 2-3 chunks if the context is long (800-1200 tokens)
# - Chunks overlap by 128 tokens β†’ the answer is never cut in half!
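The chunking that `return_overflowing_tokens` performs can be sketched in plain Python over a list of token ids. This is illustrative only: the real tokenizer also prepends the question and special tokens to each chunk, but the window arithmetic (each window overlaps the previous one by `stride` tokens) is the same:

```python
def sliding_chunks(token_ids, max_len=384, stride=128):
    """Split token_ids into overlapping windows: each new window starts
    (max_len - stride) tokens after the previous one, so consecutive
    windows share exactly `stride` tokens."""
    step = max_len - stride
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end
        start += step
    return chunks

tokens = list(range(800))  # pretend context of 800 tokens
chunks = sliding_chunks(tokens, max_len=384, stride=128)
print([(c[0], c[-1]) for c in chunks])  # [(0, 383), (256, 639), (512, 799)]
```

An answer sitting near token 383 lands in both chunk 1 and chunk 2, which is exactly why truncation-only preprocessing loses answers and stride does not.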
πŸ”₯

5. Fine-Tune BERT for QA on SQuAD

Complete pipeline β€” most complex in this series due to answer position handling
38_qa_finetune.py β€” SQuAD Fine-Tuning πŸ”₯πŸ”₯ (python)
from transformers import (
    AutoTokenizer, AutoModelForQuestionAnswering,
    TrainingArguments, Trainer, DefaultDataCollator
)
from datasets import load_dataset

# ═══════════════════════════════════════
# STEP 1: LOAD
# ═══════════════════════════════════════
model_name = "distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
dataset = load_dataset("squad")

# ═══════════════════════════════════════
# STEP 2: TOKENIZE + FIND ANSWER POSITION IN TOKENS
# This is the HARDEST part of QA fine-tuning!
# ═══════════════════════════════════════
def preprocess_training(examples):
    tokenized = tokenizer(
        examples["question"], examples["context"],
        truncation="only_second", max_length=384,
        stride=128, return_overflowing_tokens=True,
        return_offsets_mapping=True, padding="max_length",
    )

    sample_map = tokenized.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized.pop("offset_mapping")

    start_positions, end_positions = [], []

    for i, offsets in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answers = examples["answers"][sample_idx]
        start_char = answers["answer_start"][0]
        end_char = start_char + len(answers["text"][0])

        # Find context token range (skip question tokens)
        sequence_ids = tokenized.sequence_ids(i)
        ctx_start = 0
        while sequence_ids[ctx_start] != 1: ctx_start += 1
        ctx_end = len(sequence_ids) - 1
        while sequence_ids[ctx_end] != 1: ctx_end -= 1

        # Check if the answer is FULLY inside this chunk
        if offsets[ctx_start][0] > start_char or offsets[ctx_end][1] < end_char:
            # Answer not fully in this chunk β†’ label the CLS token (no answer)
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Find token positions of answer
            s = ctx_start
            while s <= ctx_end and offsets[s][0] <= start_char: s += 1
            start_positions.append(s - 1)

            e = ctx_end
            while e >= ctx_start and offsets[e][1] >= end_char: e -= 1
            end_positions.append(e + 1)

    tokenized["start_positions"] = start_positions
    tokenized["end_positions"] = end_positions
    return tokenized

tokenized = dataset["train"].map(preprocess_training, batched=True,
    remove_columns=dataset["train"].column_names)

# ═══════════════════════════════════════
# STEP 3: TRAIN
# ═══════════════════════════════════════
args = TrainingArguments(
    output_dir="./qa-distilbert",
    num_train_epochs=3, per_device_train_batch_size=16,
    learning_rate=2e-5, weight_decay=0.01, fp16=True,
    eval_strategy="epoch", save_strategy="epoch",
    load_best_model_at_end=True, report_to="none",
)

tokenized_val = dataset["validation"].map(preprocess_training, batched=True,
    remove_columns=dataset["validation"].column_names)

trainer = Trainer(
    model=model, args=args, train_dataset=tokenized,
    eval_dataset=tokenized_val,  # required: eval_strategy="epoch" + load_best_model_at_end
    data_collator=DefaultDataCollator(), tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("./qa-distilbert-final")

# Test with pipeline!
from transformers import pipeline
qa_pipe = pipeline("question-answering", model="./qa-distilbert-final", device=0)
result = qa_pipe(question="What is the capital?", context="Jakarta is the capital of Indonesia.")
print(f"Answer: {result['answer']} ({result['score']:.1%})")
# Answer: Jakarta (95.3%)


πŸŽ“ Why Is QA Fine-Tuning More Complex Than Text Classification?
1. Input = pair (question + context), not single text
2. Label = positions (start_char β†’ start_token, end_char β†’ end_token), not class index
3. Charβ†’token mapping: dataset answer = character position, model needs token position
4. Long context: stride/overflow β†’ one example becomes multiple chunks
5. Answer in chunk?: must check if answer is in this chunk or not
This is the most complex fine-tuning task in the HF series β€” but the result is powerful!

πŸ“Š

6. QA Evaluation β€” Exact Match & F1 Score

Two metrics: EM (answer EXACTLY matches) and F1 (word overlap)
39_qa_eval.py β€” QA Evaluation Metrics (python)
# ===========================
# QA uses 2 metrics:
# ===========================
#
# 1. Exact Match (EM): predicted answer == gold answer?
#    Predicted: "Jakarta"  Gold: "Jakarta"  β†’ EM = 1.0 βœ…
#    Predicted: "Jakarta"  Gold: "the city of Jakarta" β†’ EM = 0.0 ❌
#    (strict! must match EXACTLY after normalization)
#
# 2. F1 Score: word overlap between predicted and gold
#    Predicted: "Jakarta"  Gold: "the city of Jakarta"
#    Overlap: {"Jakarta"} = 1 word
#    Precision: 1/1 = 100%  (every predicted word appears in gold)
#    Recall: 1/4 = 25%      (only 1 of the 4 gold words predicted)
#    F1 = 2 Γ— (1.0 Γ— 0.25) / (1.0 + 0.25) = 0.40
#
# Typical results for BERT on SQuAD v1.1:
# EM: ~80%  F1: ~88%
# (F1 is always >= EM thanks to partial credit)

import evaluate
squad_metric = evaluate.load("squad")

predictions = [{"id": "1", "prediction_text": "Jakarta"}]
references = [{"id": "1", "answers": {"text": ["Jakarta"], "answer_start": [0]}}]

result = squad_metric.compute(predictions=predictions, references=references)
print(result)  # {'exact_match': 100.0, 'f1': 100.0}
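The F1 arithmetic in the comments above can be reproduced from scratch. This sketch computes plain word-overlap F1 on whitespace tokens; note that the official SQuAD metric first normalizes both strings (lowercasing, stripping punctuation and articles), so real scores can differ slightly from this simplified version:

```python
from collections import Counter

def word_f1(prediction, gold):
    """Token-overlap F1 as in the worked example above (no SQuAD
    normalization -- the official metric also lowercases and strips
    punctuation/articles before comparing)."""
    pred_toks = prediction.split()
    gold_toks = gold.split()
    # Clipped overlap: each word counts at most as often as in the other side
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)   # 1/1 in the example
    recall = overlap / len(gold_toks)      # 1/4 in the example
    return 2 * precision * recall / (precision + recall)

print(word_f1("Jakarta", "the city of Jakarta"))  # 0.4
```

Exact Match is then just `prediction == gold` after the same normalization, which is why F1 can only be greater than or equal to EM.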
πŸ”„

7. Encoder-Decoder (Seq2Seq) Architecture β€” T5 & BART

The third architecture: combining encoder (understanding) + decoder (generating)
3 Transformer Architectures in HF β€” Full Recap

1. Encoder Only (BERT)  ← Page 2 & 4
   Input β†’ [Encoder] β†’ Representation β†’ Classification Head
   Best for: classification, NER, QA
   Models: BERT, RoBERTa, DeBERTa, DistilBERT
   HF: AutoModelForSequenceClassification / TokenClassification / QA

2. Decoder Only (GPT)  ← Page 3
   Prompt β†’ [Decoder] β†’ Next Token β†’ Next Token β†’ ...
   Best for: text generation, chat, code
   Models: GPT-2, LLaMA, Mistral, Gemma
   HF: AutoModelForCausalLM

3. Encoder-Decoder (T5/BART)  ← Page 5 (THIS!)
   Input β†’ [Encoder] β†’ Representation β†’ [Decoder] β†’ Output Sequence
   Best for: translation, summarization, text-to-text
   Models: T5, BART, mBART, mT5, FLAN-T5
   HF: AutoModelForSeq2SeqLM

Encoder-Decoder combines BOTH strengths:
  Encoder: understands the input bidirectionally (like BERT)
  Decoder: generates the output autoregressively (like GPT)
  β†’ Perfect for tasks that must UNDERSTAND an input, then PRODUCE a different output
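The division of labor in the recap above (the encoder reads the whole input once; the decoder then emits one token at a time, conditioned on that encoding plus everything it has generated so far) can be illustrated with a toy "reverse the sequence" model. No real Transformer here, just the control flow that `model.generate()` runs for any Seq2Seq model:

```python
def encode(src_tokens):
    """Toy 'encoder': reads the FULL input once (bidirectional access)."""
    return list(src_tokens)  # stand-in for a stack of hidden states

def decode_step(memory, generated):
    """Toy 'decoder' step: the next token depends on the encoder memory
    AND on everything generated so far (autoregressive, like GPT)."""
    idx = len(generated)                    # position we are about to fill
    if idx >= len(memory):
        return "</s>"                       # end-of-sequence
    return memory[len(memory) - 1 - idx]    # toy task: emit input reversed

def generate(src_tokens, max_len=20):
    memory = encode(src_tokens)             # encoder runs ONCE
    out = []
    for _ in range(max_len):                # decoder loop, token by token
        nxt = decode_step(memory, out)
        if nxt == "</s>":
            break
        out.append(nxt)
    return out

print(generate(["saya", "belajar", "NLP"]))  # ['NLP', 'belajar', 'saya']
```

The key contrast with a decoder-only model: the input is consumed once by a separate component with full bidirectional visibility, instead of being crammed into the same left-to-right stream as the output.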
πŸ”€

8. T5 Text-to-Text Framework β€” All Tasks = Textβ†’Text

T5 treats ALL NLP tasks as "input text, output text" β€” elegant!
40_t5_framework.py β€” T5 Unified Text-to-Text (python)
from transformers import pipeline

# ===========================
# T5 treats EVERY task as text β†’ text!
# Just add a PREFIX to tell T5 what task to do.
# ===========================

# Translation
translator = pipeline("text2text-generation", model="google-t5/t5-small")
result = translator("translate English to French: I love machine learning.")
print(result[0]["generated_text"])
# "J'aime l'apprentissage automatique."

# Summarization
result = translator("summarize: Hugging Face is a company that provides tools...", max_length=50)
print(result[0]["generated_text"])

# Sentiment (as text-to-text!)
result = translator("sst2 sentence: I love this movie.")
print(result[0]["generated_text"])
# "positive"

# T5 sizes (all use same text-to-text format):
# t5-small:  60M params,  ~240MB   (fits Colab easily!)
# t5-base:   220M params, ~890MB
# t5-large:  770M params, ~3GB
# t5-3b:     3B params,   ~12GB
# t5-11b:    11B params,  ~42GB
# flan-t5-*: instruction-tuned versions (MUCH better!)

# FLAN-T5 = T5 + instruction tuning (1.8k tasks!)
flan = pipeline("text2text-generation", model="google/flan-t5-small")
result = flan("What is the capital of Indonesia?")
print(result[0]["generated_text"])
# "Jakarta" ← no prefix needed! FLAN-T5 understands instructions.
πŸ“

9. Fine-Tune T5 for Summarization β€” CNN/DailyMail

Teach T5 to summarize news articles into a few sentences
41_t5_summarization.py β€” T5 Summarization Fine-Tuning πŸ”₯ (python)
from transformers import (
    AutoTokenizer, AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments, Seq2SeqTrainer,
    DataCollatorForSeq2Seq
)
from datasets import load_dataset

# ═══════════════════════════════════════
# STEP 1: LOAD
# ═══════════════════════════════════════
model_name = "google-t5/t5-small"  # 60M params, fits Colab!
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

dataset = load_dataset("cnn_dailymail", "3.0.0")
# train: 287k articles, val: 13k, test: 11k

# Use a subset for quick training
train_data = dataset["train"].shuffle(seed=42).select(range(10000))
val_data = dataset["validation"].shuffle(seed=42).select(range(1000))

# ═══════════════════════════════════════
# STEP 2: TOKENIZE (input + target!)
# ═══════════════════════════════════════
def preprocess(examples):
    # T5 needs prefix!
    inputs = ["summarize: " + doc for doc in examples["article"]]
    targets = examples["highlights"]

    # Tokenize inputs
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)

    # Tokenize targets (labels!)
    labels = tokenizer(text_target=targets, max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]

    return model_inputs

tokenized_train = train_data.map(preprocess, batched=True,
    remove_columns=train_data.column_names)
tokenized_val = val_data.map(preprocess, batched=True,
    remove_columns=val_data.column_names)

# ═══════════════════════════════════════
# STEP 3: DATA COLLATOR (Seq2Seq specific!)
# ═══════════════════════════════════════
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
# Pads BOTH input_ids AND labels!
# Labels padding β†’ -100 (ignored in loss)

# ═══════════════════════════════════════
# STEP 4: TRAIN
# ═══════════════════════════════════════
args = Seq2SeqTrainingArguments(
    output_dir="./t5-summarization",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=4,
    learning_rate=3e-5,
    weight_decay=0.01,
    fp16=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    predict_with_generate=True,  # ← CRITICAL for Seq2Seq eval!
    generation_max_length=128,
    report_to="none",
)

trainer = Seq2SeqTrainer(
    model=model, args=args,
    train_dataset=tokenized_train, eval_dataset=tokenized_val,
    tokenizer=tokenizer, data_collator=data_collator,
)
trainer.train()

# Test!
from transformers import pipeline
summarizer = pipeline("summarization", model="./t5-summarization", device=0)
article = "Hugging Face has raised $235 million in a Series D funding round..."
summary = summarizer(article, max_length=60, min_length=20)
print(summary[0]["summary_text"])


πŸŽ“ Key Differences: Seq2Seq vs Classification Fine-Tuning:
1. Model: AutoModelForSeq2SeqLM (not ForSequenceClassification)
2. Labels: = tokenized TARGET text (not integer class)
3. Collator: DataCollatorForSeq2Seq (pads input + labels)
4. Trainer: Seq2SeqTrainer + Seq2SeqTrainingArguments
5. Eval: predict_with_generate=True β†’ model generates text during eval
6. Prefix: T5 needs prefix ("summarize:", "translate:"). BART doesn't.
7. Metric: ROUGE (not accuracy/F1)
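Point 3 in the list above deserves a closer look: `DataCollatorForSeq2Seq` pads the labels with -100, the index that PyTorch's cross-entropy loss ignores, so padded positions contribute nothing to training. A minimal sketch of just that padding step (simplified; the real collator also pads `input_ids` and can prepare `decoder_input_ids`):

```python
def pad_labels(batch_labels, pad_to=None):
    """Pad variable-length label sequences with -100 (the ignore_index of
    PyTorch cross-entropy) so padded positions are skipped by the loss."""
    length = pad_to or max(len(labels) for labels in batch_labels)
    return [labels + [-100] * (length - len(labels)) for labels in batch_labels]

labels = [[200, 7, 1], [45, 9, 12, 88, 1]]
print(pad_labels(labels))
# [[200, 7, 1, -100, -100], [45, 9, 12, 88, 1]]
```

Padding labels with the tokenizer's regular pad token instead would make the model waste capacity learning to predict padding, which is exactly what -100 avoids.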

🌐

10. Fine-Tune T5 for Translation β€” English↔Indonesian

Translation template β€” change dataset & prefix, rest is identical to summarization
42_t5_translation.py β€” Translation Fine-Tuning (python)
# Translation = SAME pipeline as summarization!
# Just change: dataset, prefix, and metrics.

# ===========================
# 1. Use pre-trained translation model (no fine-tuning!)
# ===========================
from transformers import pipeline

# Helsinki-NLP models: dedicated translation models
en_to_id = pipeline("translation", model="Helsinki-NLP/opus-mt-en-id")
result = en_to_id("I am learning artificial intelligence with Hugging Face.")
print(result[0]["translation_text"])
# "Saya belajar kecerdasan buatan dengan Hugging Face."

id_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-id-en")
result = id_to_en("Jakarta adalah ibukota Indonesia.")
print(result[0]["translation_text"])
# "Jakarta is the capital of Indonesia."

# ===========================
# 2. Fine-tune T5 for custom translation
# ===========================
def preprocess_translation(examples):
    # Add T5 prefix for translation
    inputs = ["translate English to Indonesian: " + s for s in examples["en"]]
    targets = examples["id"]

    model_inputs = tokenizer(inputs, max_length=128, truncation=True)
    labels = tokenizer(text_target=targets, max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Use parallel corpus (English-Indonesian pairs)
# dataset = load_dataset("opus100", "en-id")
# Or your own CSV: columns "en" and "id"
# dataset = load_dataset("csv", data_files="parallel_corpus.csv")

# Rest is IDENTICAL to summarization pipeline!
# tokenize β†’ DataCollatorForSeq2Seq β†’ Seq2SeqTrainer β†’ train
πŸ“Š

11. BLEU & ROUGE Metrics β€” Evaluating Text Generation

BLEU for translation, ROUGE for summarization β€” two industry standards
43_bleu_rouge.py β€” Generation Evaluation Metrics (python)
import evaluate

# ===========================
# 1. BLEU β€” for Translation
# Measures n-gram PRECISION (how many predicted n-grams are in reference?)
# ===========================
bleu = evaluate.load("bleu")

predictions = ["Jakarta is the capital of Indonesia"]
references = [["Jakarta is the capital city of Indonesia"]]
# references = list of LIST (multiple valid translations per example!)

result = bleu.compute(predictions=predictions, references=references)
print(f"BLEU: {result['bleu']:.2%}")  # ~71%
# BLEU range: 0-100%. >40% = good translation, >60% = very good

# ===========================
# 2. ROUGE β€” for Summarization
# Measures n-gram RECALL (how many reference n-grams are in prediction?)
# ===========================
rouge = evaluate.load("rouge")

predictions = ["HF raised $235M at $4.5B valuation"]
references = ["Hugging Face raised $235 million in Series D funding"]

result = rouge.compute(predictions=predictions, references=references)
print(f"ROUGE-1: {result['rouge1']:.2%}")   # unigram overlap
print(f"ROUGE-2: {result['rouge2']:.2%}")   # bigram overlap
print(f"ROUGE-L: {result['rougeL']:.2%}")   # longest common subsequence
# ROUGE-1: ~55%, ROUGE-2: ~25%, ROUGE-L: ~45% (typical for summarization)

# ===========================
# BLEU vs ROUGE β€” when to use which?
# ===========================
# BLEU: translation (precision-focused β€” are predicted words correct?)
# ROUGE: summarization (recall-focused β€” are reference ideas covered?)
# Both: imperfect! High BLEU/ROUGE β‰  good text. Low β‰  bad text.
# Always combine with human evaluation for production quality.
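BLEU's two ingredients from the comments above (clipped n-gram precision and a brevity penalty for translations shorter than the reference) can be shown from scratch. This unigram-only sketch is illustrative; real BLEU geometrically averages 1- to 4-gram precisions, so its numbers will not match `evaluate`'s output:

```python
import math
from collections import Counter

def unigram_bleu(prediction, reference):
    """BLEU restricted to unigrams: clipped precision x brevity penalty."""
    pred = prediction.split()
    ref = reference.split()
    # Clipped precision: each predicted word counts at most as often
    # as it appears in the reference
    overlap = sum((Counter(pred) & Counter(ref)).values())
    precision = overlap / len(pred)
    # Brevity penalty: punish predictions shorter than the reference
    bp = 1.0 if len(pred) > len(ref) else math.exp(1 - len(ref) / len(pred))
    return bp * precision

score = unigram_bleu("Jakarta is the capital of Indonesia",
                     "Jakarta is the capital city of Indonesia")
print(round(score, 3))  # precision 6/6 = 1.0, BP = exp(1 - 7/6), so ~0.846
```

The brevity penalty exists because precision alone rewards very short outputs: predicting just "Jakarta" would score 100% unigram precision.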
Metric      | Focus                              | Task          | Range  | Good Score
BLEU        | Precision (predicted vs reference) | Translation   | 0-100% | >40%
ROUGE-1     | Unigram recall                     | Summarization | 0-100% | >40%
ROUGE-2     | Bigram recall                      | Summarization | 0-100% | >20%
ROUGE-L     | Longest common subsequence         | Summarization | 0-100% | >35%
Exact Match | Exact string match                 | QA            | 0-100% | >75%
F1 (QA)     | Word overlap                       | QA            | 0-100% | >85%
πŸ’»

12. Where to Run? β€” VRAM for T5/BART

Model          | Params | VRAM (FP16 fine-tune) | Colab T4?
T5-small       | 60M    | ~4 GB                 | βœ… Very comfortable
FLAN-T5-small  | 80M    | ~5 GB                 | βœ… Comfortable
T5-base        | 220M   | ~10 GB                | βœ… OK
BART-base      | 140M   | ~7 GB                 | βœ… OK
BART-large-CNN | 400M   | ~14 GB                | ⚠️ Small batch
T5-large       | 770M   | >16 GB                | ❌ Needs A100
FLAN-T5-large  | 780M   | >16 GB                | ❌ Needs A100

πŸŽ‰ All code in Page 5 uses T5-small/DistilBERT which runs comfortably on free Google Colab (T4 16GB). QA fine-tuning: ~10 min. T5 summarization: ~20 min. Setup: same as Page 2.
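The VRAM figures in the table can be sanity-checked with a back-of-envelope rule (an approximation, not a measurement): mixed-precision Adam keeps roughly 16 bytes per parameter (fp16 weights + fp16 gradients + fp32 master weights + two fp32 optimizer moments), and activation memory, which depends on batch size and sequence length, comes on top of that:

```python
def estimate_finetune_gb(n_params, bytes_per_param=16):
    """Rough model + optimizer state for mixed-precision Adam fine-tuning.
    Activations come on top and often dominate for long sequences."""
    return n_params * bytes_per_param / 1024**3

for name, n in [("t5-small", 60e6), ("t5-base", 220e6), ("t5-large", 770e6)]:
    print(f"{name}: ~{estimate_finetune_gb(n):.1f} GB states (+ activations)")
```

For t5-small this gives under 1 GB of states; the gap to the ~4 GB in the table is activations and framework overhead, which is also why t5-large's ~11 GB of states alone already crowds a 16 GB T4 once activations are added.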

πŸ“

13. Page 5 Summary

Concept                | What It Is                         | Key Code
Extractive QA          | Find the answer in the context     | AutoModelForQuestionAnswering
Start/End Logits       | Model points to start+end position | outputs.start_logits.argmax()
SQuAD                  | 87k QA-pair benchmark              | load_dataset("squad")
Stride                 | Sliding window for long context    | stride=128, return_overflowing_tokens
Encoder-Decoder        | Understand (enc) + Generate (dec)  | AutoModelForSeq2SeqLM
T5                     | Text-to-text for every task        | "summarize: ..." / "translate: ..."
Seq2SeqTrainer         | Trainer for Seq2Seq models         | predict_with_generate=True
DataCollatorForSeq2Seq | Pads inputs + labels               | DataCollatorForSeq2Seq(tokenizer, model)
BLEU                   | Translation quality (precision)    | evaluate.load("bleu")
ROUGE                  | Summarization quality (recall)     | evaluate.load("rouge")
← Previous Page

Page 4 β€” Token Classification & NER


πŸ“˜

Coming Next: Page 6 β€” Sentence Embeddings & Semantic Search

Turn sentences into meaningful vectors! Page 6 covers: Sentence Transformers library, semantic similarity (cosine similarity), fine-tuning embedding models, FAISS vector search for millions of documents, building a semantic search engine, retrieval-augmented generation (RAG) foundations, and cross-encoder vs bi-encoder.