Table of Contents: Page 5
- Extractive QA: Finding answers within context
- How QA Models Work: Start/end logits → span
- SQuAD Dataset: Classic QA benchmark
- Long Context Problem: Stride & overflow handling
- Fine-Tune BERT for QA: Complete SQuAD pipeline
- QA Evaluation: Exact Match & F1
- Encoder-Decoder Architecture: T5 & BART
- T5 Text-to-Text Framework: All tasks = text→text
- Fine-Tune T5 for Summarization: CNN/DailyMail
- Fine-Tune T5 for Translation: English→Indonesian
- BLEU & ROUGE Metrics: Evaluating text generation
- Where to Run? VRAM for T5/BART
- Summary & Page 6 Preview
1. Extractive QA: Finding Answers Within Context
Extractive QA: the model is given a question and a context (paragraph), then must find the text span within the context that answers the question. The model does not generate new text; it only points to the start and end positions of the answer in the context. This differs from generative QA (ChatGPT-style), which writes a new answer.
```python
from transformers import pipeline

# ===========================
# QA pipeline - zero training!
# ===========================
qa = pipeline("question-answering", device=0)

context = """
Indonesia is a country in Southeast Asia and Oceania between the Indian and
Pacific oceans. Jakarta is the capital and most populous city. The country
has over 17,000 islands with a population of 270 million, making it the
world's fourth most populous country. Indonesia became independent from the
Netherlands on August 17, 1945.
"""

# Ask multiple questions on the SAME context
questions = [
    "What is the capital of Indonesia?",
    "How many islands does Indonesia have?",
    "When did Indonesia become independent?",
    "What is the population of Indonesia?",
]

for q in questions:
    result = qa(question=q, context=context)
    print(f"Q: {q}")
    print(f"A: {result['answer']} (score: {result['score']:.1%}, pos: {result['start']}-{result['end']})")
    print()

# Q: What is the capital of Indonesia?
# A: Jakarta (score: 97.2%, pos: 92-99)
#
# Q: How many islands does Indonesia have?
# A: over 17,000 (score: 85.3%, pos: 131-143)
#
# Q: When did Indonesia become independent?
# A: August 17, 1945 (score: 95.1%, pos: 233-249)
#
# Q: What is the population of Indonesia?
# A: 270 million (score: 88.7%, pos: 161-172)
```
2. How QA Models Work: Start/End Logits
```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name).to("cuda")

question = "What is the capital?"
context = "Indonesia is a country in Southeast Asia. Jakarta is the capital city."

# ===========================
# 1. Tokenize question + context as a PAIR
# ===========================
inputs = tokenizer(question, context, return_tensors="pt").to("cuda")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print("Tokens:", tokens)
# ['[CLS]', 'What', 'is', 'the', 'capital', '?', '[SEP]',
#  'Indonesia', 'is', 'a', 'country', 'in', 'Southeast', 'Asia', '.',
#  'Jakarta', 'is', 'the', 'capital', 'city', '.', '[SEP]']
#  |------- question -------|  |----------- context -----------|
# (BERT-style models also get token_type_ids: 0 = question, 1 = context;
#  DistilBERT drops them and relies on the [SEP] separator instead)

# ===========================
# 2. Forward pass -> start_logits & end_logits
# ===========================
with torch.no_grad():
    outputs = model(**inputs)

print(f"start_logits shape: {outputs.start_logits.shape}")  # (1, 22)
print(f"end_logits shape:   {outputs.end_logits.shape}")    # (1, 22)
# One score per token! The model rates every token:
# "How likely is this token the START of the answer?"
# "How likely is this token the END of the answer?"

# ===========================
# 3. Find answer span
# ===========================
start_idx = outputs.start_logits.argmax().item()
end_idx = outputs.end_logits.argmax().item()
print(f"Start token [{start_idx}]: '{tokens[start_idx]}'")  # 'Jakarta'
print(f"End token [{end_idx}]: '{tokens[end_idx]}'")        # 'Jakarta'

# Decode answer
answer_ids = inputs["input_ids"][0][start_idx:end_idx + 1]
answer = tokenizer.decode(answer_ids)
print(f"Answer: '{answer}'")  # 'Jakarta'

# ===========================
# 4. Confidence score
# ===========================
start_probs = torch.softmax(outputs.start_logits, dim=-1)
end_probs = torch.softmax(outputs.end_logits, dim=-1)
score = (start_probs[0, start_idx] * end_probs[0, end_idx]).item()
print(f"Confidence: {score:.1%}")  # 94.2%
```
3. SQuAD Dataset: Classic QA Benchmark
```python
from datasets import load_dataset

dataset = load_dataset("squad")
print(dataset)
# DatasetDict({
#     train: Dataset({features: ['id','title','context','question','answers'], num_rows: 87599})
#     validation: Dataset({num_rows: 10570})
# })

example = dataset["train"][0]
print(f"Title: {example['title']}")
print(f"Context: {example['context'][:150]}...")
print(f"Question: {example['question']}")
print(f"Answers: {example['answers']}")
# Title: University_of_Notre_Dame
# Context: "Architecturally, the most striking of the univer..."
# Question: "To whom did the Virgin Mary allegedly appear in 1858..."
# Answers: {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}

# KEY FORMAT:
# answers = {'text': ['answer'], 'answer_start': [character_position]}
# answer_start = CHARACTER position (not token position!)
# There can be several valid answers (SQuAD v2 also has unanswerable questions)

# SQuAD v2 (with unanswerable questions):
# squad_v2 = load_dataset("squad_v2")
# Some questions have NO answer in the context!
# The model must learn to answer "I don't know" (empty answer)
```
4. Long Context Problem: Stride & Overflow Handling
Problem: BERT can only accept 512 tokens, but Wikipedia contexts can run to 1,000+ tokens. If we simply truncate, the answer might get cut off! Solution: a sliding window (stride) that splits the context into overlapping chunks.
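The arithmetic behind the sliding window can be sketched in plain Python. This is a simplified model of the chunking, not the tokenizer's actual implementation; `chunk_starts` is a hypothetical helper for illustration. In the transformers API, `stride` is the number of tokens shared between consecutive chunks, so each new chunk starts `max_length - stride` tokens after the previous one.

```python
def chunk_starts(n_tokens, max_length=384, stride=128):
    """Start offsets of the overlapping chunks covering n_tokens tokens."""
    step = max_length - stride  # each chunk advances by this many tokens
    starts = []
    pos = 0
    while True:
        starts.append(pos)
        if pos + max_length >= n_tokens:  # this chunk reaches the end
            break
        pos += step
    return starts

print(chunk_starts(300))  # [0] -> short context, a single chunk
print(chunk_starts(800))  # [0, 256, 512] -> 3 chunks, each overlapping by 128
```

Chunks `[0, 384)` and `[256, 640)` share tokens `[256, 384)`, i.e. exactly 128 tokens of overlap, which is why an answer near a chunk boundary still appears whole in at least one chunk.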
```python
# Tokenize with stride (overlap) for long contexts
def tokenize_qa_with_stride(examples):
    tokenized = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",        # truncate the CONTEXT only, not the question!
        max_length=384,
        stride=128,                      # overlap between chunks
        return_overflowing_tokens=True,  # return ALL chunks!
        return_offsets_mapping=True,     # char->token mapping (for answer position)
        padding="max_length",
    )
    # 1 example can become 2-3 chunks if the context is long!
    # overflow_to_sample_mapping: chunk -> original example index
    return tokenized

# One example with a long context can produce:
# - 1 chunk if the context is short (<384 tokens)
# - 2-3 chunks if the context is long (800-1200 tokens)
# - Chunks overlap by 128 tokens -> the answer isn't cut off!
```
5. Fine-Tune BERT for QA on SQuAD
```python
from transformers import (
    AutoTokenizer, AutoModelForQuestionAnswering,
    TrainingArguments, Trainer, DefaultDataCollator,
)
from datasets import load_dataset

# ───────────────────────────────────────
# STEP 1: LOAD
# ───────────────────────────────────────
model_name = "distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
dataset = load_dataset("squad")

# ───────────────────────────────────────
# STEP 2: TOKENIZE + FIND ANSWER POSITION IN TOKENS
# This is the HARDEST part of QA fine-tuning!
# ───────────────────────────────────────
def preprocess_training(examples):
    tokenized = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=384,
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )
    sample_map = tokenized.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized.pop("offset_mapping")

    start_positions, end_positions = [], []
    for i, offsets in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answers = examples["answers"][sample_idx]
        start_char = answers["answer_start"][0]
        end_char = start_char + len(answers["text"][0])

        # Find context token range (skip question tokens)
        sequence_ids = tokenized.sequence_ids(i)
        ctx_start = 0
        while sequence_ids[ctx_start] != 1:
            ctx_start += 1
        ctx_end = len(sequence_ids) - 1
        while sequence_ids[ctx_end] != 1:
            ctx_end -= 1

        # Check if the answer is within this chunk
        if offsets[ctx_start][0] > end_char or offsets[ctx_end][1] < start_char:
            # Answer not in this chunk -> point both positions at [CLS] (no answer)
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Find token positions of the answer
            s = ctx_start
            while s <= ctx_end and offsets[s][0] <= start_char:
                s += 1
            start_positions.append(s - 1)
            e = ctx_end
            while e >= ctx_start and offsets[e][1] >= end_char:
                e -= 1
            end_positions.append(e + 1)

    tokenized["start_positions"] = start_positions
    tokenized["end_positions"] = end_positions
    return tokenized

tokenized = dataset["train"].map(
    preprocess_training, batched=True,
    remove_columns=dataset["train"].column_names,
)
# Also preprocess validation: eval_strategy/load_best_model_at_end need an eval set
tokenized_val = dataset["validation"].map(
    preprocess_training, batched=True,
    remove_columns=dataset["validation"].column_names,
)

# ───────────────────────────────────────
# STEP 3: TRAIN
# ───────────────────────────────────────
args = TrainingArguments(
    output_dir="./qa-distilbert",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    fp16=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="none",
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    eval_dataset=tokenized_val,
    data_collator=DefaultDataCollator(),
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("./qa-distilbert-final")

# Test with a pipeline!
from transformers import pipeline
qa_pipe = pipeline("question-answering", model="./qa-distilbert-final", device=0)
result = qa_pipe(question="What is the capital?",
                 context="Jakarta is the capital of Indonesia.")
print(f"Answer: {result['answer']} ({result['score']:.1%})")
# Answer: Jakarta (95.3%)
```
Why Is QA Fine-Tuning More Complex Than Text Classification?
1. Input = a pair (question + context), not a single text
2. Labels = positions (start_char → start_token, end_char → end_token), not a class index
3. Char→token mapping: the dataset stores answers as character positions, but the model needs token positions
4. Long context: stride/overflow → one example becomes multiple chunks
5. Answer in chunk?: must check whether the answer falls inside each chunk
This is the most complex fine-tuning task in the HF series, but the result is powerful!
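Point 3, the char→token conversion, can be shown in isolation with a toy `offset_mapping`. `char_span_to_token_span` is a hypothetical helper (not a transformers API) and the offsets are hand-written for illustration; in real preprocessing they come from `return_offsets_mapping=True`.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to the (start, end) token indices covering it.

    offsets: list of (char_start, char_end) per token, as produced by
    a fast tokenizer's offset mapping.
    """
    start_tok = next(i for i, (s, e) in enumerate(offsets) if s <= start_char < e)
    end_tok = next(i for i, (s, e) in enumerate(offsets) if s < end_char <= e)
    return start_tok, end_tok

# Toy offsets for the context "Jakarta is the capital city"
# tokens:   Jakarta   is       the      capital   city
offsets = [(0, 7), (8, 10), (11, 14), (15, 22), (23, 27)]

# The dataset says the answer "capital" spans characters 15-22:
print(char_span_to_token_span(offsets, 15, 22))  # (3, 3) -> token 'capital'
```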
6. QA Evaluation: Exact Match & F1 Score
```python
# ===========================
# QA uses 2 metrics:
# ===========================
#
# 1. Exact Match (EM): predicted answer == gold answer?
#    Predicted: "Jakarta"  Gold: "Jakarta"              -> EM = 1.0 ✅
#    Predicted: "Jakarta"  Gold: "the city of Jakarta"  -> EM = 0.0 ❌
#    (strict! must match EXACTLY after normalization)
#
# 2. F1 Score: word overlap between predicted and gold
#    Predicted: "Jakarta"  Gold: "the city of Jakarta"
#    Overlap: {"Jakarta"} = 1 word
#    Precision: 1/1 = 100% (every predicted word is in the gold answer)
#    Recall:    1/4 = 25%  (only 1 of 4 gold words was predicted)
#    F1 = 2 x (1.0 x 0.25) / (1.0 + 0.25) = 0.40
#
# Typical results for BERT on SQuAD v1.1:
#   EM: ~80%   F1: ~88%
#   (F1 is always >= EM because of partial credit)

import evaluate

squad_metric = evaluate.load("squad")
predictions = [{"id": "1", "prediction_text": "Jakarta"}]
references = [{"id": "1", "answers": {"text": ["Jakarta"], "answer_start": [0]}}]
result = squad_metric.compute(predictions=predictions, references=references)
print(result)  # {'exact_match': 100.0, 'f1': 100.0}
```
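The F1 arithmetic in the comments above can be reproduced with a few lines of plain Python. This is a simplified sketch: the official SQuAD metric additionally normalizes answers (lowercasing, stripping articles and punctuation) before comparing; `qa_f1` is a name chosen here for illustration.

```python
from collections import Counter

def qa_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a gold answer (simplified)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Count words that appear in both, respecting multiplicity
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(round(qa_f1("Jakarta", "the city of Jakarta"), 2))  # 0.4
print(qa_f1("Jakarta", "Jakarta"))                        # 1.0
```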
7. Encoder-Decoder (Seq2Seq) Architecture: T5 & BART
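The control flow of an encoder-decoder model can be caricatured in plain Python before diving into T5. This is a purely illustrative toy (no neural network; `encode`, `decode_step`, and `generate` are made-up names): the encoder reads the whole input once, then the decoder emits one token at a time, conditioned on the encoder output plus everything generated so far.

```python
def encode(src_tokens):
    # Stand-in for the encoder: one pass over the input,
    # producing a fixed "memory" the decoder can attend to.
    return {"length": len(src_tokens), "tokens": tuple(src_tokens)}

def decode_step(memory, generated):
    # Stand-in for ONE decoder step. A real decoder predicts the next
    # token from memory + prefix; this toy simply copies the source.
    pos = len(generated)
    if pos >= memory["length"]:
        return "</s>"  # end-of-sequence
    return memory["tokens"][pos]

def generate(src_tokens, max_new_tokens=10):
    memory = encode(src_tokens)        # encoder runs ONCE
    out = []
    for _ in range(max_new_tokens):    # decoder runs token by token
        tok = decode_step(memory, out)
        if tok == "</s>":
            break
        out.append(tok)
    return out

print(generate(["I", "love", "NLP"]))  # ['I', 'love', 'NLP']
```

The key structural point carried over to T5/BART: encoding is a single pass, while decoding is an autoregressive loop, which is why generation is the slow half of seq2seq inference.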
8. T5 Text-to-Text Framework: All Tasks = Text→Text
```python
from transformers import pipeline

# ===========================
# T5 treats EVERY task as text -> text!
# Just add a PREFIX to tell T5 what task to do.
# ===========================

# Translation
translator = pipeline("text2text-generation", model="google-t5/t5-small")
result = translator("translate English to French: I love machine learning.")
print(result[0]["generated_text"])  # "J'aime l'apprentissage automatique."

# Summarization
result = translator("summarize: Hugging Face is a company that provides tools...",
                    max_length=50)
print(result[0]["generated_text"])

# Sentiment (as text-to-text!)
result = translator("sst2 sentence: I love this movie.")
print(result[0]["generated_text"])  # "positive"

# T5 sizes (all use the same text-to-text format):
#   t5-small:  60M params,  ~240MB (fits Colab easily!)
#   t5-base:   220M params, ~890MB
#   t5-large:  770M params, ~3GB
#   t5-3b:     3B params,   ~12GB
#   t5-11b:    11B params,  ~42GB
#   flan-t5-*: instruction-tuned versions (MUCH better!)

# FLAN-T5 = T5 + instruction tuning (1.8k tasks!)
flan = pipeline("text2text-generation", model="google/flan-t5-small")
result = flan("What is the capital of Indonesia?")
print(result[0]["generated_text"])
# "Jakarta" -> no prefix needed! FLAN-T5 understands instructions.
```
9. Fine-Tune T5 for Summarization: CNN/DailyMail
```python
from transformers import (
    AutoTokenizer, AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq,
)
from datasets import load_dataset

# ───────────────────────────────────────
# STEP 1: LOAD
# ───────────────────────────────────────
model_name = "google-t5/t5-small"  # 60M params, fits Colab!
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

dataset = load_dataset("cnn_dailymail", "3.0.0")
# train: 287k articles, val: 13k, test: 11k

# Use a subset for quick training
train_data = dataset["train"].shuffle(42).select(range(10000))
val_data = dataset["validation"].shuffle(42).select(range(1000))

# ───────────────────────────────────────
# STEP 2: TOKENIZE (input + target!)
# ───────────────────────────────────────
def preprocess(examples):
    # T5 needs a prefix!
    inputs = ["summarize: " + doc for doc in examples["article"]]
    targets = examples["highlights"]

    # Tokenize inputs
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)
    # Tokenize targets (labels!)
    labels = tokenizer(text_target=targets, max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_train = train_data.map(preprocess, batched=True,
                                 remove_columns=train_data.column_names)
tokenized_val = val_data.map(preprocess, batched=True,
                             remove_columns=val_data.column_names)

# ───────────────────────────────────────
# STEP 3: DATA COLLATOR (Seq2Seq specific!)
# ───────────────────────────────────────
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
# Pads BOTH input_ids AND labels!
# Label padding -> -100 (ignored in the loss)

# ───────────────────────────────────────
# STEP 4: TRAIN
# ───────────────────────────────────────
args = Seq2SeqTrainingArguments(
    output_dir="./t5-summarization",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=4,
    learning_rate=3e-5,
    weight_decay=0.01,
    fp16=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    predict_with_generate=True,   # <- CRITICAL for Seq2Seq eval!
    generation_max_length=128,
    report_to="none",
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    data_collator=data_collator,
)
trainer.train()
trainer.save_model("./t5-summarization")  # save to output_dir root for the pipeline

# Test!
from transformers import pipeline
summarizer = pipeline("summarization", model="./t5-summarization", device=0)
article = "Hugging Face has raised $235 million in a Series D funding round..."
summary = summarizer(article, max_length=60, min_length=20)
print(summary[0]["summary_text"])
```
Key Differences: Seq2Seq vs Classification Fine-Tuning
1. Model: AutoModelForSeq2SeqLM (not ForSequenceClassification)
2. Labels: tokenized TARGET text (not an integer class)
3. Collator: DataCollatorForSeq2Seq (pads inputs + labels)
4. Trainer: Seq2SeqTrainer + Seq2SeqTrainingArguments
5. Eval: predict_with_generate=True → the model generates text during eval
6. Prefix: T5 needs a prefix ("summarize:", "translate:"). BART doesn't.
7. Metric: ROUGE (not accuracy/F1)
10. Fine-Tune T5 for Translation: English→Indonesian
```python
# Translation = SAME pipeline as summarization!
# Just change: dataset, prefix, and metrics.

# ===========================
# 1. Use a pre-trained translation model (no fine-tuning!)
# ===========================
from transformers import pipeline

# Helsinki-NLP models: dedicated translation models
en_to_id = pipeline("translation", model="Helsinki-NLP/opus-mt-en-id")
result = en_to_id("I am learning artificial intelligence with Hugging Face.")
print(result[0]["translation_text"])
# "Saya belajar kecerdasan buatan dengan Hugging Face."

id_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-id-en")
result = id_to_en("Jakarta adalah ibukota Indonesia.")
print(result[0]["translation_text"])
# "Jakarta is the capital of Indonesia."

# ===========================
# 2. Fine-tune T5 for custom translation
# ===========================
def preprocess_translation(examples):
    # Add the T5 prefix for translation
    inputs = ["translate English to Indonesian: " + s for s in examples["en"]]
    targets = examples["id"]
    model_inputs = tokenizer(inputs, max_length=128, truncation=True)
    labels = tokenizer(text_target=targets, max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Use a parallel corpus (English-Indonesian pairs)
# dataset = load_dataset("opus100", "en-id")
# Or your own CSV with columns "en" and "id":
# dataset = load_dataset("csv", data_files="parallel_corpus.csv")

# The rest is IDENTICAL to the summarization pipeline!
# tokenize -> DataCollatorForSeq2Seq -> Seq2SeqTrainer -> train
```
11. BLEU & ROUGE Metrics: Evaluating Text Generation
```python
import evaluate

# ===========================
# 1. BLEU - for Translation
#    Measures n-gram PRECISION (how many predicted n-grams are in the reference?)
# ===========================
bleu = evaluate.load("bleu")
predictions = ["Jakarta is the capital of Indonesia"]
references = [["Jakarta is the capital city of Indonesia"]]
# references = list of LISTS (multiple valid translations per example!)
result = bleu.compute(predictions=predictions, references=references)
print(f"BLEU: {result['bleu']:.2%}")  # ~71%
# BLEU range: 0-100%. >40% = good translation, >60% = very good

# ===========================
# 2. ROUGE - for Summarization
#    Measures n-gram RECALL (how many reference n-grams are in the prediction?)
# ===========================
rouge = evaluate.load("rouge")
predictions = ["HF raised $235M at $4.5B valuation"]
references = ["Hugging Face raised $235 million in Series D funding"]
result = rouge.compute(predictions=predictions, references=references)
print(f"ROUGE-1: {result['rouge1']:.2%}")  # unigram overlap
print(f"ROUGE-2: {result['rouge2']:.2%}")  # bigram overlap
print(f"ROUGE-L: {result['rougeL']:.2%}")  # longest common subsequence
# ROUGE-1: ~55%, ROUGE-2: ~25%, ROUGE-L: ~45% (typical for summarization)

# ===========================
# BLEU vs ROUGE - when to use which?
# ===========================
# BLEU:  translation (precision-focused -> are the predicted words correct?)
# ROUGE: summarization (recall-focused -> are the reference ideas covered?)
# Both are imperfect! High BLEU/ROUGE != good text, and low != bad text.
# Always combine with human evaluation for production quality.
```
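The precision-vs-recall distinction can be made concrete with a unigram-only sketch in plain Python. This is deliberately simplified (real BLEU uses clipped 1-4-gram precision plus a brevity penalty; ROUGE-1 is full n-gram recall with its own tokenization); `unigram_scores` is a name chosen here for illustration.

```python
from collections import Counter

def unigram_scores(prediction, reference):
    """Unigram precision (BLEU-style view) and recall (ROUGE-1-style view)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    precision = overlap / len(pred)  # of the predicted words, how many are correct?
    recall = overlap / len(ref)      # of the reference words, how many were covered?
    return precision, recall

p, r = unigram_scores("Jakarta is the capital of Indonesia",
                      "Jakarta is the capital city of Indonesia")
print(round(p, 3), round(r, 3))  # 1.0 0.857
```

Every predicted word appears in the reference (precision 1.0), but the reference word "city" is missing from the prediction (recall 6/7), which is exactly the asymmetry BLEU and ROUGE each emphasize.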
| Metric | Focus | Task | Range | Good Score |
|---|---|---|---|---|
| BLEU | Precision (predicted vs reference) | Translation | 0-100% | >40% |
| ROUGE-1 | Unigram recall | Summarization | 0-100% | >40% |
| ROUGE-2 | Bigram recall | Summarization | 0-100% | >20% |
| ROUGE-L | Longest common subsequence | Summarization | 0-100% | >35% |
| Exact Match | Exact string match | QA | 0-100% | >75% |
| F1 (QA) | Word overlap | QA | 0-100% | >85% |
12. Where to Run? VRAM for T5/BART
| Model | Params | VRAM Fine-Tune FP16 | Colab T4? |
|---|---|---|---|
| T5-small | 60M | ~4 GB | ✅ Very comfortable |
| FLAN-T5-small | 80M | ~5 GB | ✅ Comfortable |
| T5-base | 220M | ~10 GB | ✅ OK |
| BART-base | 140M | ~7 GB | ✅ OK |
| BART-large-CNN | 400M | ~14 GB | ⚠️ Small batch |
| T5-large | 770M | >16 GB | ❌ Needs A100 |
| FLAN-T5-large | 780M | >16 GB | ❌ Needs A100 |
All code in Page 5 uses T5-small/DistilBERT, which runs comfortably on free Google Colab (T4 16GB). QA fine-tuning: ~10 min. T5 summarization: ~20 min. Setup: same as Page 2.
13. Page 5 Summary
| Concept | What It Is | Key Code |
|---|---|---|
| Extractive QA | Find answer in context | AutoModelForQuestionAnswering |
| Start/End Logits | Model points to start+end position | outputs.start_logits.argmax() |
| SQuAD | 87k QA pairs benchmark | load_dataset("squad") |
| Stride | Sliding window for long context | stride=128, return_overflowing_tokens |
| Encoder-Decoder | Understand (enc) + Generate (dec) | AutoModelForSeq2SeqLM |
| T5 | Text-to-text for all tasks | "summarize: ..." / "translate: ..." |
| Seq2SeqTrainer | Trainer for Seq2Seq models | predict_with_generate=True |
| DataCollatorForSeq2Seq | Pad input + labels | DataCollatorForSeq2Seq(tokenizer, model) |
| BLEU | Translation quality (precision) | evaluate.load("bleu") |
| ROUGE | Summarization quality (recall) | evaluate.load("rouge") |
← Page 4: Token Classification & NER
Coming Next: Page 6 - Sentence Embeddings & Semantic Search
Turn sentences into meaningful vectors! Page 6 covers: Sentence Transformers library, semantic similarity (cosine similarity), fine-tuning embedding models, FAISS vector search for millions of documents, building a semantic search engine, retrieval-augmented generation (RAG) foundations, and cross-encoder vs bi-encoder.