Table of Contents — Page 3
- Encoder vs Decoder — BERT vs GPT: two different worlds
- Causal Language Modeling — How GPT learns
- Where to Run GPT? — Colab setup, VRAM, model sizes
- Text Generation with Pipeline — 1 line → generated text
- Manual generate() — Full control: token by token
- EVERY Generation Parameter — temperature, top_k, top_p, beam, etc.
- Sampling Strategies Visual — Why temperature 0.7 vs 1.5 differ drastically
- Fine-Tuning GPT-2 on a Custom Corpus — Poetry, code, dialogue
- Instruction Tuning — GPT → instruction-following assistant
- Project: Simple CLI Chatbot — Interactive fine-tuned GPT-2
- Other Generative Models — Bloom, LLaMA, Mistral, Gemma
- Summary & Page 4 Preview
1. Encoder vs Decoder — BERT vs GPT: Two Different Worlds
On Page 2, we fine-tuned BERT (an encoder) to understand text: sentiment classification, NER, QA. Now we switch to GPT (a decoder) to generate text: writing stories, answering questions, coding. The difference isn't just the task — the architectures are fundamentally different.
| Aspek | BERT (Encoder) | GPT (Decoder) | T5 (Encoder-Decoder) |
|---|---|---|---|
| Attention | Bidirectional (lihat semua) | Causal (lihat kiri saja) | Encoder: bi, Decoder: causal |
| Pre-training | Masked LM: tebak [MASK] | Next token: prediksi berikutnya | Span corruption |
| Output | Representation (embedding) | Next token probability | Sequence output |
| Best For | Classification, NER, QA | Generation, Chat, Code | Translation, Summarization |
| Contoh Model | BERT, RoBERTa, DeBERTa | GPT-2, LLaMA, Mistral | T5, BART, mBART |
| HF Auto Class | AutoModelForSequenceClassification | AutoModelForCausalLM | AutoModelForSeq2SeqLM |
| Page di Seri Ini | Page 2 (fine-tune BERT) | Page 3 (ini!) | Page 4 (T5, translation) |
| Aspect | BERT (Encoder) | GPT (Decoder) | T5 (Encoder-Decoder) |
|---|---|---|---|
| Attention | Bidirectional (sees all) | Causal (sees left only) | Encoder: bi, Decoder: causal |
| Pre-training | Masked LM: guess [MASK] | Next token: predict next | Span corruption |
| Output | Representation (embedding) | Next token probability | Sequence output |
| Best For | Classification, NER, QA | Generation, Chat, Code | Translation, Summarization |
| Model Examples | BERT, RoBERTa, DeBERTa | GPT-2, LLaMA, Mistral | T5, BART, mBART |
| HF Auto Class | AutoModelForSequenceClassification | AutoModelForCausalLM | AutoModelForSeq2SeqLM |
| Page in This Series | Page 2 (fine-tune BERT) | Page 3 (this!) | Page 4 (T5, translation) |
Why Can't GPT "See the Future"?
Imagine you're writing a sentence — you write one word at a time, left to right. When you're writing word 5, word 6 doesn't exist yet! GPT works exactly like this: it predicts the next word based on the previous words only.
If GPT could see ahead (like BERT), it would "cheat" — no need to learn prediction, just copy from the future. This is why the causal attention mask is so important: it blocks information from future positions during both training and inference.
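The causal mask described above can be sketched in a few lines of PyTorch (a minimal illustration of the idea, not GPT-2's actual implementation):

```python
import torch

seq_len = 5
# Lower-triangular matrix: row i has ones at columns 0..i (positions it may attend to)
mask = torch.tril(torch.ones(seq_len, seq_len))
print(mask)

# Attention scores at disallowed (future) positions are set to -inf,
# so softmax assigns them exactly zero probability:
scores = torch.randn(seq_len, seq_len)
masked_scores = scores.masked_fill(mask == 0, float("-inf"))
attn = torch.softmax(masked_scores, dim=-1)

print(attn[0])  # position 0 can only attend to itself -> probability 1.0 there
```

Every row of `attn` is a valid probability distribution over past positions only, which is exactly what prevents the "copying from the future" shortcut.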
2. Causal Language Modeling — How GPT Learns
Causal Language Modeling (CLM) is GPT's training task: given a sequence of words, predict the next word. This is repeated for every position in the sequence. Example: from the sentence "I love eating fried rice", GPT learns:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# ===========================
# 1. Load GPT-2
# ===========================
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# GPT-2 sizes:
# "gpt2"        → 117M params, ~500MB (small, fits Colab easily)
# "gpt2-medium" → 345M params, ~1.4GB (medium)
# "gpt2-large"  → 774M params, ~3.1GB (large, tight on T4)
# "gpt2-xl"     → 1.5B params, ~6.2GB (XL, needs >16GB VRAM)

# ===========================
# 2. Tokenize & compute loss
# ===========================
text = "The capital of France is Paris"
inputs = tokenizer(text, return_tensors="pt")

# For CLM: labels = input_ids (shifted internally by the model!)
# The model predicts token[i+1] from tokens[0:i]
outputs = model(**inputs, labels=inputs["input_ids"])

print(f"Loss: {outputs.loss.item():.4f}")
# Loss: 3.2145 (lower = better at predicting next tokens)
print(f"Perplexity: {torch.exp(outputs.loss).item():.2f}")
# Perplexity: 24.89 (lower = better, 1.0 = perfect prediction)

# ===========================
# 3. What the model sees internally:
# ===========================
tokens = tokenizer.tokenize(text)
print(f"Tokens: {tokens}")
# ['The', 'Ġcapital', 'Ġof', 'ĠFrance', 'Ġis', 'ĠParis']
# Ġ = space prefix (GPT-2 BPE tokenizer)

# The model internally shifts labels:
# Position 0: sees "The" → should predict "capital"
# Position 1: sees "The capital" → should predict "of"
# Position 2: sees "The capital of" → should predict "France"
# Position 3: sees "The capital of France" → should predict "is"
# Position 4: sees "The capital of France is" → should predict "Paris"
# Loss = average cross-entropy over all positions

# ===========================
# 4. Check what GPT-2 predicts at each position
# ===========================
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits  # (1, seq_len, vocab_size=50257)

for i in range(len(tokens) - 1):
    predicted_id = logits[0, i].argmax().item()
    predicted_token = tokenizer.decode(predicted_id)
    actual_token = tokens[i + 1]
    match = "✓" if predicted_token.strip() == actual_token.replace("Ġ", "") else "✗"
    context = " ".join(tokens[:i + 1]).replace("Ġ", "")
    print(f"  '{context}' → predicted: '{predicted_token}' | actual: '{actual_token}' {match}")
# 'The' → predicted: ' first' | actual: 'Ġcapital' ✗
# 'The capital' → predicted: ' of' | actual: 'Ġof' ✓
# 'The capital of' → predicted: ' the' | actual: 'ĠFrance' ✗
# 'The capital of France' → predicted: ' is' | actual: 'Ġis' ✓
# 'The capital of France is' → predicted: ' Paris' | actual: 'ĠParis' ✓
# GPT-2 knows Paris is the capital of France!
3. Where to Run GPT? — Setup, VRAM, and Limitations
| Model | Params | VRAM Inference | VRAM Fine-Tune FP16 | Colab T4 (16GB)? |
|---|---|---|---|---|
| GPT-2 small | 117M | ~1 GB | ~5 GB | ✅ Very comfortable |
| GPT-2 medium | 345M | ~2 GB | ~13 GB | ⚠️ Small batch + grad accum |
| GPT-2 large | 774M | ~4 GB | >16 GB | ❌ Needs gradient checkpointing |
| Bloom-560M | 560M | ~2 GB | ~14 GB | ⚠️ Tight |
| LLaMA 3.2 1B | 1B | ~4 GB | ~16 GB (LoRA) | ⚠️ LoRA only (Page 8) |
| Mistral 7B | 7B | ~15 GB | ~40 GB | ❌ Needs A100 |
# Cell 1: Verify GPU
!nvidia-smi
import torch
print(f"GPU: {torch.cuda.get_device_name(0)}, "
      f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

# Cell 2: Install
!pip install -q transformers datasets accelerate

# Cell 3: Test GPT-2 inference (< 1 minute download)
from transformers import pipeline
generator = pipeline("text-generation", model="gpt2", device=0)
print(generator("Artificial intelligence will", max_new_tokens=30)[0]["generated_text"])
# ✅ Ready! GPT-2 small = ~500MB, fits easily on a T4

# IMPORTANT: the GPT-2 tokenizer has NO pad token!
# You must set one manually before fine-tuning:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # ← REQUIRED for GPT-2!
# Without this: "Cannot handle padding" errors during fine-tuning
Tip: All code on this page uses GPT-2 small (117M), which runs comfortably on free Google Colab. Inference: <1 second per generation. Fine-tuning: ~10-20 minutes. Want a smarter model? Switch to "gpt2-medium" (same syntax, needs a smaller batch) or wait for Page 8 (LoRA for LLaMA/Mistral).
4. Text Generation with Pipeline — 1-Line Magic
from transformers import pipeline

# ===========================
# 1. Basic generation
# ===========================
generator = pipeline("text-generation", model="gpt2", device=0)

result = generator("The future of artificial intelligence is", max_new_tokens=50)
print(result[0]["generated_text"])
# "The future of artificial intelligence is not just about the technology,
#  but about how we use it. The question is whether we can build systems..."

# ===========================
# 2. Multiple completions
# ===========================
results = generator(
    "Once upon a time in Jakarta,",
    max_new_tokens=80,
    num_return_sequences=3,  # generate 3 different completions!
    do_sample=True,          # enable random sampling
    temperature=0.8,         # creativity level
)
for i, r in enumerate(results):
    print(f"\n--- Completion {i+1} ---")
    print(r["generated_text"])
# Each completion is different! (because of random sampling)

# ===========================
# 3. Different generation strategies
# ===========================
# Deterministic (greedy — always the same)
result_greedy = generator("AI is", max_new_tokens=20, do_sample=False)

# Creative (high-temperature sampling)
result_creative = generator("AI is", max_new_tokens=20, do_sample=True,
                            temperature=1.2, top_p=0.9)

# Focused (low temperature)
result_focused = generator("AI is", max_new_tokens=20, do_sample=True,
                           temperature=0.3)

print(f"Greedy:   {result_greedy[0]['generated_text']}")
print(f"Creative: {result_creative[0]['generated_text']}")
print(f"Focused:  {result_focused[0]['generated_text']}")
# Greedy:   "AI is a very important part of the future of the world."
# Creative: "AI is an existential rollercoaster of digital consciousness..."
# Focused:  "AI is a field of computer science that focuses on..."
5. Manual generate() — Full Control: Token by Token
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")

# ===========================
# 1. Basic generate()
# ===========================
prompt = "Indonesia is a beautiful country with"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=50,       # generate 50 NEW tokens
        do_sample=True,          # enable sampling
        temperature=0.7,         # creativity
        top_k=50,                # consider top 50 tokens
        top_p=0.9,               # nucleus sampling
        repetition_penalty=1.2,  # penalize repetition
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode the full sequence (prompt + new tokens)
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_text)

# ===========================
# 2. Stream generation (token by token) — like ChatGPT!
# ===========================
from transformers import TextIteratorStreamer
from threading import Thread

streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True)
inputs = tokenizer("The meaning of life is", return_tensors="pt").to("cuda")
gen_kwargs = {**inputs, "max_new_tokens": 100, "streamer": streamer,
              "do_sample": True, "temperature": 0.7}

# Run generate() in a separate thread (non-blocking)
thread = Thread(target=model.generate, kwargs=gen_kwargs)
thread.start()

# Print tokens as they arrive!
for text in streamer:
    print(text, end="", flush=True)
# "The meaning of life is" → the continuation appears token by token, like ChatGPT!

# ===========================
# 3. Access generation logits (for custom post-processing)
# ===========================
outputs = model.generate(
    **inputs,
    max_new_tokens=5,
    output_scores=True,  # return logits for each step!
    return_dict_in_generate=True,
)

# outputs.scores = tuple of (batch, vocab_size) tensors, one per generated token
for i, scores in enumerate(outputs.scores):
    probs = torch.softmax(scores[0], dim=-1)
    top5 = torch.topk(probs, 5)
    print(f"\nStep {i + 1} top-5 candidates:")
    for prob, idx in zip(top5.values, top5.indices):
        token = tokenizer.decode(idx)
        print(f"  '{token}': {prob:.1%}")
6. EVERY Generation Parameter — Explained in Detail
# ───────────────────────────────────────────────────
# EVERY GENERATION PARAMETER — EXPLAINED
# ───────────────────────────────────────────────────
output = model.generate(
    **inputs,

    # ── LENGTH CONTROL ──
    max_new_tokens=100,  # generate MAX 100 new tokens
    min_new_tokens=10,   # generate MIN 10 tokens (prevent empty output)
    # max_length=150,    # alternative: total length (prompt + generated)

    # ── SAMPLING vs GREEDY ──
    do_sample=True,  # True=random sampling, False=greedy (deterministic)
    # Greedy: always pick the highest-probability token → boring, repetitive
    # Sampling: randomly pick from the probability distribution → creative, varied

    # ── TEMPERATURE ── (only applies if do_sample=True)
    temperature=0.7,  # controls randomness of sampling
    # temperature=0.1 → almost greedy (very focused, repetitive)
    # temperature=0.7 → balanced (creative but coherent) ← RECOMMENDED
    # temperature=1.0 → standard (model's natural distribution)
    # temperature=1.5 → very random (wild, often incoherent)
    # temperature=2.0 → chaos (mostly nonsense)
    #
    # HOW IT WORKS:
    #   logits_adjusted = logits / temperature
    #   probs = softmax(logits_adjusted)
    # Low temp → sharper distribution → top token dominates
    # High temp → flatter distribution → more variety

    # ── TOP-K SAMPLING ── (only if do_sample=True)
    top_k=50,  # only consider the top K highest-probability tokens
    # top_k=1  → greedy (only the top 1 token)
    # top_k=10 → conservative (limited vocabulary)
    # top_k=50 → balanced ← DEFAULT
    # top_k=0  → disabled (consider ALL tokens)
    #
    # PROBLEM: top_k=50 treats all distributions equally.
    # If the model is very confident, the top 5 tokens may hold 95% probability
    # → tokens 6-50 are almost random noise!
    # SOLUTION: use top_p instead (or together)

    # ── TOP-P (NUCLEUS) SAMPLING ── (only if do_sample=True)
    top_p=0.9,  # keep tokens until cumulative probability reaches P
    # top_p=0.9 → keep tokens that sum to 90% probability
    #   If the model is confident: might keep only 3 tokens (they sum to 90%)
    #   If the model is uncertain: might keep 50 tokens (all needed for 90%)
    #   ADAPTS to the model's confidence! Better than a fixed top_k.
    #
    # top_p=1.0  → disabled (keep all tokens)
    # top_p=0.95 → slightly conservative
    # top_p=0.9  → balanced ← RECOMMENDED
    # top_p=0.5  → very focused

    # ── REPETITION CONTROL ──
    repetition_penalty=1.2,  # penalize tokens that already appeared
    # 1.0  = no penalty (can repeat freely)
    # 1.1  = mild (some repetition OK)
    # 1.2  = moderate ← RECOMMENDED for most use cases
    # 1.5  = strong (almost never repeats)
    # 2.0+ = too strong (forced to use rare words)
    no_repeat_ngram_size=3,  # never repeat the same 3-token phrase
    # 0 = disabled, 2 = no repeated bigrams, 3 = no repeated trigrams

    # ── BEAM SEARCH ── (alternative to sampling)
    # num_beams=5,          # explore 5 paths simultaneously
    # early_stopping=True,  # stop when all beams finish
    # length_penalty=1.0,   # >1 = prefer longer, <1 = prefer shorter
    # Beam search: more coherent but LESS creative than sampling
    # Good for: translation, summarization
    # Bad for: creative writing, chat (too boring)
    # NOTE: num_beams>1 combined with do_sample=True switches to beam-sample
    # decoding, not plain beam search — usually keep do_sample=False with beams.

    # ── STOP CONDITIONS ──
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    # stop_strings=["Human:", "\n\n"],  # stop at these strings (also pass tokenizer=tokenizer)
)
| Parameter | Default | Recommended | Effect |
|---|---|---|---|
| temperature | 1.0 | 0.7 | ↓ = focused, ↑ = creative |
| top_k | 50 | 50 | Fixed number of candidate tokens |
| top_p | 1.0 | 0.9 | Cumulative probability threshold (adaptive) |
| repetition_penalty | 1.0 | 1.1-1.3 | Prevent word repetition |
| no_repeat_ngram | 0 | 3 | Prevent phrase repetition |
| num_beams | 1 | 1 (sampling) / 5 (translation) | Wider search but deterministic |
| max_new_tokens | 20 | 50-500 | Maximum output length |
Quick Recipes for Various Use Cases:
Chatbot: temperature=0.7, top_p=0.9, repetition_penalty=1.2
Creative writing: temperature=0.9, top_p=0.95, top_k=100
Code generation: temperature=0.2, top_p=0.9 (needs accuracy!)
Factual text: do_sample=False (greedy, deterministic)
Translation: num_beams=5, do_sample=False, length_penalty=1.0
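These recipes can be collected into a small preset table and reused across calls. A minimal sketch — the `GEN_PRESETS` name and the helper function are illustrative, not part of transformers:

```python
# Generation presets matching the recipes above.
GEN_PRESETS = {
    "chatbot":     dict(do_sample=True, temperature=0.7, top_p=0.9, repetition_penalty=1.2),
    "creative":    dict(do_sample=True, temperature=0.9, top_p=0.95, top_k=100),
    "code":        dict(do_sample=True, temperature=0.2, top_p=0.9),
    "factual":     dict(do_sample=False),  # greedy, deterministic
    "translation": dict(do_sample=False, num_beams=5, length_penalty=1.0),
}

def generate_with_preset(generator, prompt, preset, max_new_tokens=50):
    """Run a text-generation pipeline with one of the named presets."""
    return generator(prompt, max_new_tokens=max_new_tokens, **GEN_PRESETS[preset])

# Usage (with the pipeline from earlier sections):
# result = generate_with_preset(generator, "AI is", "chatbot")
```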
7. Sampling Strategies Visual — Temperature & Top-P Visualized
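The effect is easy to verify numerically with a toy five-token vocabulary (a minimal pure-Python sketch, no model needed): low temperature sharpens the softmax so the top token dominates, high temperature flattens it, and top-p keeps only the smallest set of tokens covering the requested probability mass.

```python
import math

# Toy next-token logits (hypothetical values for illustration)
logits = {"the": 4.0, "a": 3.0, "Paris": 2.0, "banana": 0.5, "qzx": -1.0}

def softmax_with_temperature(logits, temperature):
    """softmax(logits / T): low T sharpens, high T flattens the distribution."""
    scaled = {t: l / temperature for t, l in logits.items()}
    z = sum(math.exp(v) for v in scaled.values())
    return {t: math.exp(v) / z for t, v in scaled.items()}

for temp in (0.3, 1.0, 1.5):
    probs = softmax_with_temperature(logits, temp)
    top = max(probs, key=probs.get)
    print(f"T={temp}: top token '{top}' gets {probs[top]:.0%}")
# Low T -> the top token takes nearly all the mass; high T -> mass spreads out.

def nucleus(probs, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative probability >= top_p."""
    kept, cum = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept.append(tok)
        cum += p
        if cum >= top_p:
            break
    return kept

probs = softmax_with_temperature(logits, 1.0)
print(nucleus(probs, 0.9))  # a confident distribution -> only a few tokens survive
```

This is exactly why top-p adapts to the model's confidence while top_k cannot: the size of the kept set changes with the shape of the distribution.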
8. Fine-Tuning GPT-2 on a Custom Corpus — Build Your Own GPT
Fine-tuning GPT-2 = giving it example text, on which it learns to predict the next words in your domain. After fine-tuning, GPT can generate text similar to your training data. Example: fine-tune on poetry → GPT becomes a "poet". Fine-tune on Python code → GPT becomes a "programmer".
from transformers import (
    AutoTokenizer, AutoModelForCausalLM,
    TrainingArguments, Trainer,
    DataCollatorForLanguageModeling,
)
from datasets import load_dataset

# ───────────────────────────────────────
# STEP 1: LOAD MODEL & TOKENIZER
# ───────────────────────────────────────
model_name = "gpt2"  # 117M params, fits Colab T4
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# CRITICAL: GPT-2 has NO pad token! Must set it!
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id

print(f"Model params: {model.num_parameters():,}")  # 124,439,808

# ───────────────────────────────────────
# STEP 2: LOAD & PREPARE DATASET
# ───────────────────────────────────────
# Option A: From the Hugging Face Hub
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

# Option B: From your own text file
# dataset = load_dataset("text", data_files="my_corpus.txt")

# Option C: From a CSV with a "text" column
# dataset = load_dataset("csv", data_files="poems.csv")

print(dataset)
print(f"Sample: {dataset['train'][0]['text'][:100]}...")

# ───────────────────────────────────────
# STEP 3: TOKENIZE
# ───────────────────────────────────────
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,                  # GPT-2 max context = 1024
        return_overflowing_tokens=True,  # split long texts into chunks!
        return_length=True,
    )

tokenized = dataset.map(tokenize_function, batched=True,
                        remove_columns=dataset["train"].column_names)

# Filter out very short sequences
tokenized = tokenized.filter(lambda x: len(x["input_ids"]) > 10)
print(f"Training examples: {len(tokenized['train'])}")

# ───────────────────────────────────────
# STEP 4: DATA COLLATOR (special for CLM!)
# ───────────────────────────────────────
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # False = Causal LM (GPT)
    # mlm=True → Masked LM (BERT) → NOT for GPT!
)
# DataCollatorForLanguageModeling automatically:
# 1. Pads sequences in each batch
# 2. Creates labels = input_ids (shifted by 1 internally)
# 3. Sets label=-100 for padding tokens (ignored in loss)

# ───────────────────────────────────────
# STEP 5: TRAINING
# ───────────────────────────────────────
args = TrainingArguments(
    output_dir="./gpt2-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,  # smaller batch for generation models
    gradient_accumulation_steps=8,  # effective batch = 4 × 8 = 32
    learning_rate=5e-5,             # slightly higher than BERT (5e-5 vs 2e-5)
    weight_decay=0.01,
    warmup_ratio=0.1,
    fp16=True,
    logging_steps=50,
    save_strategy="epoch",
    save_total_limit=2,
    prediction_loss_only=True,  # don't compute metrics (CLM only needs loss)
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

print("Training GPT-2...")
trainer.train()

# ───────────────────────────────────────
# STEP 6: EVALUATE (Perplexity)
# ───────────────────────────────────────
import math
eval_results = trainer.evaluate()
perplexity = math.exp(eval_results["eval_loss"])
print(f"Perplexity: {perplexity:.2f}")
# Lower = better. GPT-2 base on WikiText: ~30. Fine-tuned: ~20-25.
# Human text: ~20-50 depending on domain.

# ───────────────────────────────────────
# STEP 7: SAVE & GENERATE!
# ───────────────────────────────────────
trainer.save_model("./gpt2-finetuned-final")
tokenizer.save_pretrained("./gpt2-finetuned-final")

# Test generation!
from transformers import pipeline
gen = pipeline("text-generation", model="./gpt2-finetuned-final", device=0)

prompts = ["In the field of machine learning,", "The history of Indonesia"]
for p in prompts:
    result = gen(p, max_new_tokens=80, do_sample=True, temperature=0.7,
                 top_p=0.9, repetition_penalty=1.2)
    print(f"\nPrompt: {p}")
    print(f"Output: {result[0]['generated_text']}")

print("\nGPT-2 fine-tuning complete!")
Key Differences from BERT Fine-Tuning (Page 2):
1. Auto class: AutoModelForCausalLM, not AutoModelForSequenceClassification.
2. Data collator: DataCollatorForLanguageModeling(mlm=False), not DataCollatorWithPadding.
3. Labels: Automatic (labels = shifted input_ids). No "label" column needed in the dataset.
4. Pad token: tokenizer.pad_token = tokenizer.eos_token — GPT-2 has no default pad token!
5. Batch size: Smaller (4-8 vs 16-32), because long sequences use more VRAM.
6. LR: Slightly higher (5e-5 vs 2e-5) — GPT fine-tuning generally needs a larger LR.
7. Metric: Perplexity (not accuracy/F1), because there is no "right/wrong label" in text generation.
9. Instruction Tuning — GPT → Instruction-Following Assistant
Plain GPT-2 only continues text — it doesn't "answer questions" or "follow instructions". Instruction tuning teaches GPT to understand instruction formats and give appropriate responses. This is the technique that turned GPT-3 into ChatGPT.
# ===========================
# 1. Instruction-tuning data format
# ===========================
# Each training example = instruction + response in a single string
# Alpaca-style format (the most popular):

training_examples = [
    """### Instruction:
Summarize the following text in one sentence.

### Input:
Hugging Face is a company that provides tools and platforms for machine learning. They are best known for their Transformers library, which provides thousands of pre-trained models for natural language processing, computer vision, and audio tasks.

### Response:
Hugging Face is an ML company known for their Transformers library offering thousands of pre-trained models for NLP, vision, and audio.""",

    """### Instruction:
Translate the following English text to Indonesian.

### Input:
I love learning about artificial intelligence.

### Response:
Saya suka belajar tentang kecerdasan buatan.""",

    """### Instruction:
What is the capital of Japan?

### Response:
The capital of Japan is Tokyo.""",
]

# ChatML format (used by many chat models):
chat_examples = [
    """<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is Python?<|im_end|>
<|im_start|>assistant
Python is a high-level programming language known for its readability and versatility.<|im_end|>""",
]

# ===========================
# 2. Prepare dataset
# ===========================
from datasets import Dataset

# From a list of strings:
dataset = Dataset.from_dict({"text": training_examples})

# Or from a JSONL file:
# {"instruction": "...", "input": "...", "output": "..."}
# dataset = load_dataset("json", data_files="instructions.jsonl")

# Format each row into a single string
def format_instruction(example):
    if example.get("input"):
        text = (f"### Instruction:\n{example['instruction']}\n\n"
                f"### Input:\n{example['input']}\n\n"
                f"### Response:\n{example['output']}{tokenizer.eos_token}")
    else:
        text = (f"### Instruction:\n{example['instruction']}\n\n"
                f"### Response:\n{example['output']}{tokenizer.eos_token}")
    return {"text": text}

# dataset = dataset.map(format_instruction)
# Then tokenize & train exactly like Section 8!

# ===========================
# 3. Inference with the instruction format
# ===========================
def ask(instruction, input_text=""):
    if input_text:
        prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"
    else:
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    result = gen(prompt, max_new_tokens=200, do_sample=True,
                 temperature=0.7, top_p=0.9, repetition_penalty=1.2)
    response = result[0]["generated_text"][len(prompt):]
    # Stop at the next "###" (prevents generating another instruction)
    if "###" in response:
        response = response[:response.index("###")]
    return response.strip()

print(ask("What is the largest planet in our solar system?"))
# "Jupiter is the largest planet in our solar system."
10. Project: Simple CLI Chatbot — Interactive GPT-2
from transformers import pipeline

# Load the fine-tuned model (or plain "gpt2" for a demo)
gen = pipeline("text-generation",
               model="./gpt2-finetuned-final",  # or "gpt2" for a demo
               device=0)

def chat(user_input, history=""):
    """Generate a chatbot response."""
    prompt = history + f"### Human: {user_input}\n### Assistant:"
    result = gen(
        prompt,
        max_new_tokens=150,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.3,
        pad_token_id=gen.tokenizer.eos_token_id,
    )
    full_text = result[0]["generated_text"]
    response = full_text[len(prompt):].strip()
    # Stop at the next "### Human:" or blank line
    for stop in ["### Human:", "###", "\n\n"]:
        if stop in response:
            response = response[:response.index(stop)]
    # Update history for context
    new_history = prompt + " " + response + "\n"
    return response.strip(), new_history

# ───────────────────────────────────────
# Interactive loop
# ───────────────────────────────────────
print("GPT-2 Chatbot (type 'quit' to exit)")
print("=" * 50)
history = ""
while True:
    user_input = input("\nYou: ")
    if user_input.lower() in ["quit", "exit", "q"]:
        print("Bye!")
        break
    response, history = chat(user_input, history)
    print(f"Bot: {response}")

# ───────────────────────────────────────
# Sample conversation:
#   You: What is machine learning?
#   Bot: Machine learning is a subset of AI that enables systems
#        to learn from data without being explicitly programmed.
#   You: Give me an example.
#   Bot: A spam filter that learns to identify spam emails by
#        analyzing thousands of examples is a common ML application.
# ───────────────────────────────────────
Realistic note: GPT-2 (117M) is a small model — its answers are often less accurate and coherent than ChatGPT's (175B+ params). This project is for learning the concepts. For production chatbots, use LLaMA/Mistral 7B+ with LoRA fine-tuning (Page 8) or a large-model API (GPT-4, Claude).
11. Other Generative Models — Bloom, LLaMA, Mistral, Gemma
| Model | Params | Languages | License | Best For |
|---|---|---|---|---|
| GPT-2 | 117M-1.5B | English | MIT (free) | Learning, experiments ⭐ |
| Bloom | 560M-176B | 46 languages | Open RAIL-M | Multilingual generation |
| LLaMA 3.2 | 1B-90B | Multi + ID | Meta License | State-of-the-art open ⭐ |
| Mistral | 7B-8x22B | Multi | Apache 2.0 | Best size/quality ratio ⭐ |
| Gemma 2 | 2B-27B | Multi | Gemma License | Google's open model |
| Qwen 2.5 | 0.5B-72B | Multi + ID | Apache 2.0 | Strong multilingual + code |
| Phi-3 | 3.8B-14B | English | MIT | Small but powerful |
Generative Model Roadmap in This Series:
Page 3 (this one): GPT-2 (117M) — learn CLM concepts, generation params, fine-tuning
Page 8: LoRA & QLoRA — fine-tune LLaMA/Mistral 7B on Colab!
Page 9: RLHF — align models with human preferences (the ChatGPT method)
You're building the foundation for fine-tuning large models in the upcoming pages.
12. Page 3 Summary
| Concept | What It Is | Key Code |
|---|---|---|
| Encoder vs Decoder | BERT (bidirectional) vs GPT (causal) | AutoModelForCausalLM |
| Causal LM | Next token prediction | labels=input_ids (auto-shifted) |
| Pipeline Generation | 1-line text generation | pipeline("text-generation") |
| model.generate() | Full control generation | model.generate(**inputs, ...) |
| Temperature | Creativity control (0.1-2.0) | temperature=0.7 |
| Top-P (Nucleus) | Adaptive probability cutoff | top_p=0.9 |
| Top-K | Fixed candidate count | top_k=50 |
| Repetition Penalty | Prevent repetition | repetition_penalty=1.2 |
| Fine-Tune GPT-2 | Custom corpus → custom GPT | DataCollatorForLanguageModeling(mlm=False) |
| Instruction Tuning | GPT → instruction follower | "### Instruction:\n...\n### Response:\n" |
| Streaming | Token-by-token output | TextIteratorStreamer |
| Perplexity | Generation quality metric | exp(eval_loss) |
Page 2 — Fine-Tuning BERT & Trainer API
Coming Next: Page 4 — Token Classification & NER
From sentence classification (Page 2) to per-token classification! Page 4 covers: Named Entity Recognition (NER) — identifying people, places, and organizations; POS tagging; the BIO/IOB2 labeling scheme; tokenization alignment (subword → word labels); fine-tuning BERT for NER on custom datasets; per-entity evaluation (seqeval); and building a production NER pipeline.