Table of Contents – Page 4
- Token vs Sequence Classification – Fundamental difference
- What Is NER? – Identifying entities in text
- BIO/IOB2 Labeling – How to mark entity boundaries
- CoNLL-2003 Dataset – Classic NER benchmark
- Subword–Word Alignment Problem – Why this is CRITICAL
- Tokenization + Label Alignment – Complete solution with code
- Where to Run? – Same as BERT Page 2 (Colab T4)
- Project: Fine-Tune BERT NER on CoNLL-2003 – Complete pipeline
- NER Evaluation with seqeval – Per-entity F1
- Post-Processing – Subword → word-level entities
- NER on Custom Dataset – Your own CSV template
- POS Tagging – Another token classification task
- Summary & Page 5 Preview
1. Token vs Sequence Classification – Fundamental Difference
Why Is Token Classification Harder?
1. More predictions: a 20-word sentence = 20 predictions (not 1)
2. Subword problem: "Widodo" might tokenize to ["Wi", "##do", "##do"] → 3 subwords but only 1 label! How do we align them? (Section 5)
3. Entity boundaries: "New York City" = 1 entity but 3 words. How do we mark the start and end? (Section 3: BIO scheme)
4. Class imbalance: most tokens = O (not part of any entity); only ~5-10% of tokens are entities.
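The "more predictions" point can be made concrete by comparing the output shapes the two heads produce. This is a toy sketch with no model involved; the label counts are just examples:

```python
# Toy illustration of the two output shapes (no model involved):
# sequence classification -> ONE prediction per sentence,
# token classification    -> one prediction PER TOKEN.
sentence = ["Joko", "Widodo", "visited", "Jakarta"]
num_labels_seq = 2   # e.g. positive/negative
num_labels_tok = 9   # e.g. the 9 CoNLL-2003 BIO labels

# Logits shapes a model head would produce for this one sentence:
seq_logits_shape = (1, num_labels_seq)                  # (batch, num_labels)
tok_logits_shape = (1, len(sentence), num_labels_tok)   # (batch, seq_len, num_labels)

print(seq_logits_shape)  # (1, 2)
print(tok_logits_shape)  # (1, 4, 9)
```

A 20-word sentence therefore gives the token head 20 chances to be wrong, one per position.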
2. What Is NER? – Identifying Entities in Text
Named Entity Recognition (NER) is the NLP task of identifying and categorizing named entities in text. The common entity types:
```python
from transformers import pipeline

# ===========================
# 1. NER pipeline – instant, zero training!
# ===========================
ner = pipeline("ner", grouped_entities=True, device=0)

text = "Joko Widodo met Tim Cook at Apple Park in Cupertino on January 15, 2024."
entities = ner(text)

for e in entities:
    print(f"  {e['word']:20s} → {e['entity_group']:5s} ({e['score']:.1%}) [{e['start']}:{e['end']}]")
# Joko Widodo          → PER   (99.8%) [0:12]
# Tim Cook             → PER   (99.6%) [17:25]
# Apple Park           → LOC   (97.3%) [29:39]
# Cupertino            → LOC   (99.9%) [43:52]
# January 15, 2024     → MISC  (85.2%) [56:73]

# ===========================
# 2. Entity types (standard CoNLL-2003)
# ===========================
# PER  = Person        (Joko Widodo, Tim Cook, Albert Einstein)
# LOC  = Location      (Jakarta, California, Mount Everest)
# ORG  = Organization  (Google, BRI, United Nations)
# MISC = Miscellaneous (Indonesian, COVID-19, iPhone 15)
#
# Custom NER can add any types you need:
# PRODUCT, DATE, MONEY, EVENT, DISEASE, DRUG, etc.

# ===========================
# 3. NER use cases in production
# ===========================
# • Search engines:   extract entities for knowledge graphs
# • Customer support: detect product names, order IDs
# • Medical NLP:      extract drug names, diseases, symptoms
# • Finance:          extract company names, monetary amounts
# • Legal:            extract person names, dates, legal references
# • Social media:     extract locations, organizations, events
```
| Entity Type | Examples | Description |
|---|---|---|
| PER (Person) | Joko Widodo, Elon Musk | Person names (including fictional) |
| LOC (Location) | Jakarta, Mount Bromo | Places: cities, countries, mountains, rivers |
| ORG (Organization) | Google, BRI, UN | Companies, institutions, organizations |
| MISC (Miscellaneous) | Indonesian, iPhone, COVID-19 | Languages, products, events, nationalities |
3. BIO/IOB2 Labeling Scheme – How to Mark Entity Boundaries
Problem: "New York City" = 1 LOC entity but 3 words. How does the model know "New", "York", and "City" are ONE entity, not three separate ones? Answer: BIO tagging scheme.
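A tiny hand-labeled example makes the scheme concrete before the full label list (pure Python, no library needed):

```python
# BIO tagging for the sentence "New York City is in New York":
words = ["New",   "York",  "City",  "is", "in", "New",   "York"]
tags  = ["B-LOC", "I-LOC", "I-LOC", "O",  "O",  "B-LOC", "I-LOC"]

# B- marks where an entity STARTS, I- continues it, O is outside.
# Counting B- tags = counting entities:
num_entities = sum(tag.startswith("B-") for tag in tags)
print(num_entities)  # 2  ("New York City" and "New York")
```

Because every entity begins with exactly one B- tag, the model's tags unambiguously encode both the boundaries and the count of entities, even when entities of the same type sit next to each other.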
```python
# ===========================
# Standard CoNLL-2003 label list (9 labels)
# ===========================
label_list = [
    "O",       # 0 – Outside (not an entity)
    "B-PER",   # 1 – Beginning of Person
    "I-PER",   # 2 – Inside Person
    "B-ORG",   # 3 – Beginning of Organization
    "I-ORG",   # 4 – Inside Organization
    "B-LOC",   # 5 – Beginning of Location
    "I-LOC",   # 6 – Inside Location
    "B-MISC",  # 7 – Beginning of Miscellaneous
    "I-MISC",  # 8 – Inside Miscellaneous
]

# Mappings
label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for i, label in enumerate(label_list)}

print(label2id)
# {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4,
#  'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}

# num_labels for the model = 9
# Formula: num_labels = 1 (O) + 2 × num_entity_types (B + I per type)
# 4 entity types → 1 + 2×4 = 9 labels
```
4. CoNLL-2003 Dataset – Classic NER Benchmark
```python
from datasets import load_dataset

# ===========================
# Load CoNLL-2003
# ===========================
dataset = load_dataset("conll2003")
print(dataset)
# DatasetDict({
#     train:      Dataset({features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'], num_rows: 14041})
#     validation: Dataset({num_rows: 3250})
#     test:       Dataset({num_rows: 3453})
# })

# ===========================
# Inspect one example
# ===========================
example = dataset["train"][0]
print(f"Tokens:   {example['tokens']}")
print(f"NER tags: {example['ner_tags']}")
# Tokens:   ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
# NER tags: [3, 0, 7, 0, 0, 0, 7, 0, 0]
#           B-ORG O B-MISC O O O B-MISC O O

# Decode NER tags to human-readable labels
ner_feature = dataset["train"].features["ner_tags"].feature
print(f"Label names: {ner_feature.names}")
# ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

for token, tag_id in zip(example["tokens"], example["ner_tags"]):
    tag = ner_feature.names[tag_id]
    marker = " ←" if tag != "O" else ""
    print(f"  {token:15s} {tag:8s}{marker}")
#   EU              B-ORG    ←  (European Union)
#   rejects         O
#   German          B-MISC   ←  (nationality)
#   call            O
#   to              O
#   boycott         O
#   British         B-MISC   ←  (nationality)
#   lamb            O
#   .               O

# IMPORTANT: the data is PRE-TOKENIZED (list of words + list of tags).
# This is DIFFERENT from text classification, where the input is a string!
# Input:  ["EU", "rejects", "German", ...]  (a list, not a string!)
# Labels: [3, 0, 7, ...]                    (one label per word)
```
5. Subword–Word Alignment Problem – Why This Is CRITICAL
This is the biggest challenge in NER with Hugging Face, and the one that confuses people most. The problem: NER datasets give one label per word, but BERT works at the subword token level, and one word can become multiple subwords!
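The mismatch and its standard fix can be sketched in pure Python. The subword split and `word_ids` list below are written by hand to mimic WordPiece output; a real tokenizer may split differently:

```python
# 4 words -> 4 labels, but tokenization produces 9 subword tokens:
words       = ["Joko", "Widodo", "visited", "Jakarta"]
word_labels = ["B-PER", "I-PER", "O", "B-LOC"]

subwords = ["[CLS]", "Jo", "##ko", "Wi", "##do", "##do",
            "visited", "Jakarta", "[SEP]"]
# word_ids: which word each subword came from (None = special token);
# this is what tokenizer.word_ids() returns for the real tokenizer.
word_ids = [None, 0, 0, 1, 1, 1, 2, 3, None]

# Alignment rule: first subword keeps the word's label, everything else -> -100
aligned = []
prev = None
for wid in word_ids:
    if wid is None:
        aligned.append(-100)           # special token
    elif wid != prev:
        aligned.append(word_labels[wid])  # first subword of a new word
    else:
        aligned.append(-100)           # continuation subword
    prev = wid

print(aligned)
# [-100, 'B-PER', -100, 'I-PER', -100, -100, 'O', 'B-LOC', -100]
```

The same loop, operating on label IDs instead of label strings, is exactly what the alignment function in Section 6 does per batch element.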
Why -100?
PyTorch's CrossEntropyLoss has an ignore_index parameter that defaults to -100. This means: if a label is -100, that position does not contribute to the loss – the model is neither penalized nor rewarded for its prediction there.
This is perfect for NER, because we only want the model to learn to predict labels at the first subword of each word (which carries the real label), not at continuation subwords or special tokens.
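To see concretely that -100 positions drop out of the loss, here is a pure-Python stand-in for the masking CrossEntropyLoss performs. The `masked_nll` helper and the toy probabilities are made up for illustration, not a real model's output:

```python
import math

def masked_nll(probs_of_true, labels, ignore_index=-100):
    """Mean negative log-likelihood over positions whose label != ignore_index.
    probs_of_true[i] = probability the model assigned to the TRUE label at i."""
    losses = [-math.log(p)
              for p, l in zip(probs_of_true, labels)
              if l != ignore_index]
    return sum(losses) / len(losses)

labels = [-100, 1, -100, 2, 0, -100]        # special/continuation tokens = -100
probs  = [0.01, 0.9, 0.5, 0.8, 0.95, 0.2]   # prob assigned to the true class

loss = masked_nll(probs, labels)
# Only positions 1, 3, 4 contribute; the terrible 0.01 and 0.2 at the
# ignored positions change nothing.
print(f"{loss:.4f}")  # 0.1266
```

Averaging over only the kept positions also mirrors PyTorch's default `reduction="mean"` behavior with `ignore_index` set.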
6. Tokenization + Label Alignment – Complete Solution
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# IMPORTANT: use a CASED model for NER! Names are capitalized!

# ===========================
# THE alignment function – THE MOST IMPORTANT CODE IN NER!
# ===========================
def tokenize_and_align_labels(examples):
    """Tokenize pre-tokenized words and align NER labels.

    Key concept: word_ids() tells us which WORD each subword came from.
    - word_id=None            → special token ([CLS], [SEP], [PAD]) → label=-100
    - First subword of a word → keep original label
    - Continuation subword    → label=-100

    Args:
        examples: batch with 'tokens' (list of words) and 'ner_tags' (list of ints)
    Returns:
        dict with input_ids, attention_mask, labels (aligned!)
    """
    tokenized = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,  # ← CRITICAL! Input is already split into words!
        max_length=128,
    )

    all_labels = []
    for i, labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        # word_ids: [None, 0, 0, 1, 1, 1, 2, 3, 3, 3, None]
        #           [CLS] Jo ##ko Wi ##do ##do visited Cup ##ert ##ino [SEP]
        label_ids = []
        previous_word_id = None
        for word_id in word_ids:
            if word_id is None:
                # Special token → ignore
                label_ids.append(-100)
            elif word_id != previous_word_id:
                # First subword of a new word → use the word's label
                label_ids.append(labels[word_id])
            else:
                # Continuation subword → ignore
                label_ids.append(-100)
            previous_word_id = word_id
        all_labels.append(label_ids)

    tokenized["labels"] = all_labels
    return tokenized

# ===========================
# Apply to dataset
# ===========================
tokenized_dataset = dataset.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=dataset["train"].column_names,
)

# Verify alignment
example = tokenized_dataset["train"][0]
tokens = tokenizer.convert_ids_to_tokens(example["input_ids"])
labels = example["labels"]
for token, label in zip(tokens, labels):
    label_name = label_list[label] if label != -100 else "IGNORE"
    print(f"  {token:15s} → {label_name}")
# [CLS]    → IGNORE
# EU       → B-ORG    ← first subword, keep label
# re       → IGNORE   ← continuation, ignore
# ##ject   → IGNORE
# ##s      → IGNORE
# German   → B-MISC   ← first subword of "German"
# ...
```
3 Critical Things to Remember:
1. is_split_into_words=True – MANDATORY! Without it, the tokenizer treats the input as one string (not a list of words) and will error or give wrong results.
2. word_ids() – the magic function that returns the subword→word mapping. None = special token, a number = the index of the source word.
3. BERT Cased – use bert-base-cased (NOT uncased!) for NER. Capitalization matters for names: "Apple" (company) vs "apple" (fruit).
7. Where to Run? – Same as BERT Page 2
NER fine-tuning uses the exact same model as text classification in Page 2 (BERT/DistilBERT). The only difference is the head: AutoModelForTokenClassification instead of AutoModelForSequenceClassification. VRAM, Colab setup, and OOM troubleshooting are identical – refer to Page 2, Section 1b.
TL;DR: open Google Colab → T4 GPU → !pip install -q transformers datasets accelerate evaluate seqeval → copy-paste the Section 8 code → run. ~10 minutes of training, F1 ~92%. Exact same setup as Page 2.
8. Project: Fine-Tune BERT NER on CoNLL-2003 – Complete Pipeline
```python
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
    DataCollatorForTokenClassification,
)

# ───────────────────────────────────────
# STEP 1: LOAD DATASET
# ───────────────────────────────────────
dataset = load_dataset("conll2003")
label_list = dataset["train"].features["ner_tags"].feature.names
num_labels = len(label_list)  # 9
print(f"Labels ({num_labels}): {label_list}")

# ───────────────────────────────────────
# STEP 2: LOAD TOKENIZER & MODEL
# ───────────────────────────────────────
model_name = "bert-base-cased"  # CASED for NER!
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    id2label={i: l for i, l in enumerate(label_list)},
    label2id={l: i for i, l in enumerate(label_list)},
)

# ───────────────────────────────────────
# STEP 3: TOKENIZE + ALIGN (from Section 6!)
# ───────────────────────────────────────
def tokenize_and_align(examples):
    tokenized = tokenizer(
        examples["tokens"], truncation=True,
        is_split_into_words=True, max_length=128,
    )
    all_labels = []
    for i, labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        label_ids, prev = [], None
        for wid in word_ids:
            if wid is None:
                label_ids.append(-100)
            elif wid != prev:
                label_ids.append(labels[wid])
            else:
                label_ids.append(-100)
            prev = wid
        all_labels.append(label_ids)
    tokenized["labels"] = all_labels
    return tokenized

tokenized = dataset.map(
    tokenize_and_align, batched=True,
    remove_columns=dataset["train"].column_names,
)

# ───────────────────────────────────────
# STEP 4: DATA COLLATOR (special for token classification!)
# ───────────────────────────────────────
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
# Pads BOTH input_ids AND labels to the same length!
# Padding labels get the value -100 (ignored in the loss)

# ───────────────────────────────────────
# STEP 5: METRICS – seqeval (NER-specific!)
# ───────────────────────────────────────
seqeval = evaluate.load("seqeval")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # Convert IDs back to label strings (skip -100!)
    true_labels, true_preds = [], []
    for pred_seq, label_seq in zip(predictions, labels):
        t_labels, t_preds = [], []
        for p, l in zip(pred_seq, label_seq):
            if l != -100:  # skip ignored positions!
                t_labels.append(label_list[l])
                t_preds.append(label_list[p])
        true_labels.append(t_labels)
        true_preds.append(t_preds)
    results = seqeval.compute(predictions=true_preds, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

# ───────────────────────────────────────
# STEP 6: TRAIN!
# ───────────────────────────────────────
args = TrainingArguments(
    output_dir="./ner-bert",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    fp16=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    save_total_limit=2,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()

# ───────────────────────────────────────
# STEP 7: EVALUATE
# ───────────────────────────────────────
results = trainer.evaluate(tokenized["test"])
print("\nNER Results:")
print(f"  F1:        {results['eval_f1']:.1%}")
print(f"  Precision: {results['eval_precision']:.1%}")
print(f"  Recall:    {results['eval_recall']:.1%}")
# F1:        91.8%
# Precision: 92.1%
# Recall:    91.5%

# Save
trainer.save_model("./ner-bert-final")
tokenizer.save_pretrained("./ner-bert-final")
print("NER model saved!")
```
91.8% F1 on CoNLL-2003!
State-of-the-art on CoNLL-2003 is ~94% F1 (DeBERTa-xlarge). BERT-base-cased achieves ~92%, which is already very good. Compare with traditional models: CRF = ~80%, BiLSTM-CRF = ~88%, BERT = ~92%.
9. NER Evaluation with seqeval – Per-Entity F1
```python
import evaluate

seqeval = evaluate.load("seqeval")

# Example predictions and references
true_labels = [["B-PER", "I-PER", "O", "B-LOC"]]
pred_labels = [["B-PER", "I-PER", "O", "B-ORG"]]  # LOC→ORG mistake!

results = seqeval.compute(predictions=pred_labels, references=true_labels)
print(results)
# {
#   'PER': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
#   'LOC': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1},  ← missed!
#   'ORG': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 0},  ← false positive
#   'overall_precision': 0.5,
#   'overall_recall': 0.5,
#   'overall_f1': 0.5,
#   'overall_accuracy': 0.75,
# }

# IMPORTANT: seqeval evaluates at the ENTITY level, not the token level!
# "Joko Widodo" = 1 PER entity (2 tokens).
# If the model predicts "Joko"=B-PER but "Widodo"=O (not I-PER),
# the entity is WRONG (a partial match counts as wrong!).
# seqeval uses "exact match" – all tokens must be correct.

# Typical results after fine-tuning BERT on CoNLL-2003:
# PER:  F1 = 96.2%  (names are relatively easy)
# LOC:  F1 = 93.1%  (locations are clear)
# ORG:  F1 = 89.4%  (organizations can be ambiguous)
# MISC: F1 = 82.6%  (misc is hardest – very diverse)
```
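The exact-match rule is easy to reimplement by hand, which makes it clear why a partial match counts as wrong. This is a simplified sketch of what seqeval does (the `bio_to_spans` helper is ours, and it silently drops stray I- tags that real schemes handle more carefully):

```python
def bio_to_spans(tags):
    """Collect (type, start, end) entity spans from a BIO tag sequence."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # "O" sentinel flushes the last entity
        boundary = (tag.startswith("B-") or tag == "O"
                    or (tag.startswith("I-") and tag[2:] != etype))
        if boundary:
            if etype is not None:
                spans.append((etype, start, i))   # close the running entity
            etype, start = (tag[2:], i) if tag.startswith("B-") else (None, None)
        # a plain I- continuing the current entity needs no action
    return spans

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "O",     "O", "B-LOC"]  # second PER token missed → partial match

gold_spans = set(bio_to_spans(gold))     # {('PER', 0, 2), ('LOC', 3, 4)}
pred_spans = set(bio_to_spans(pred))     # {('PER', 0, 1), ('LOC', 3, 4)}
print(gold_spans & pred_spans)           # only ('LOC', 3, 4) matches
```

The truncated PER span does not equal the gold span, so it counts as both a false positive and a false negative: precision = recall = 1/2 here, exactly the behavior seqeval reports.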
10. Post-Processing – Subword Predictions → Word-Level Entities
```python
from transformers import pipeline

# ===========================
# 1. Use the fine-tuned model as a pipeline (EASIEST!)
# ===========================
ner_pipe = pipeline(
    "ner",
    model="./ner-bert-final",
    aggregation_strategy="simple",  # merge subwords automatically!
    device=0,
)
# aggregation_strategy options:
# "none"    → raw subword predictions (no merging)
# "simple"  → merge subwords, take the first subword's label ← RECOMMENDED
# "first"   → same as simple
# "average" → average scores across subwords
# "max"     → take the max score across subwords

text = "Barack Obama visited the United Nations headquarters in New York City."
entities = ner_pipe(text)
for e in entities:
    print(f"  {e['word']:25s} {e['entity_group']:5s} ({e['score']:.1%})")
# Barack Obama              PER   (99.7%)
# United Nations            ORG   (99.2%)
# New York City             LOC   (99.5%)

# ===========================
# 2. Manual post-processing (for custom needs)
# ===========================
import torch

def extract_entities(text, model, tokenizer):
    """Extract entities manually with full control."""
    inputs = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
    offsets = inputs.pop("offset_mapping")[0]
    with torch.no_grad():
        outputs = model(**inputs.to(model.device))
    predictions = outputs.logits.argmax(dim=-1)[0].cpu().tolist()

    entities = []
    current_entity = None
    for idx, (pred, offset) in enumerate(zip(predictions, offsets)):
        label = model.config.id2label[pred]
        start, end = offset.tolist()
        if start == 0 and end == 0:  # special token
            continue
        if label.startswith("B-"):
            if current_entity:
                entities.append(current_entity)
            current_entity = {
                "entity": label[2:], "word": text[start:end],
                "start": start, "end": end,
            }
        elif label.startswith("I-") and current_entity:
            current_entity["word"] = text[current_entity["start"]:end]
            current_entity["end"] = end
        else:
            if current_entity:
                entities.append(current_entity)
                current_entity = None
    if current_entity:
        entities.append(current_entity)
    return entities

result = extract_entities(text, model, tokenizer)
for e in result:
    print(f"  {e['word']:25s} → {e['entity']}")
```
11. NER on Custom Dataset – Your Own CSV Template
```python
from datasets import Dataset, Features, Sequence, ClassLabel, Value

# ===========================
# Format 1: CoNLL-style (recommended)
# ===========================
# File: data.txt (tab-separated, empty line = sentence boundary)
#
# Joko      B-PER
# Widodo    I-PER
# visited   O
# BRI       B-ORG
# .         O
#                     ← empty line = new sentence
# Jakarta   B-LOC
# is        O
# big       O
# .         O

# ===========================
# Format 2: From a Python dict (easiest for small datasets)
# ===========================
custom_labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

data = {
    "tokens": [
        ["Joko", "Widodo", "visited", "BRI", "in", "Jakarta"],
        ["Google", "opened", "office", "in", "Surabaya"],
    ],
    "ner_tags": [
        [1, 2, 0, 3, 0, 5],  # B-PER I-PER O B-ORG O B-LOC
        [3, 0, 0, 0, 5],     # B-ORG O O O B-LOC
    ],
}

features = Features({
    "tokens": Sequence(Value("string")),
    "ner_tags": Sequence(ClassLabel(names=custom_labels)),
})
custom_dataset = Dataset.from_dict(data, features=features)
split = custom_dataset.train_test_split(test_size=0.2, seed=42)

# Now use EXACTLY the same pipeline as Section 8!
# Just replace: dataset = split (instead of load_dataset("conll2003"))
# And update:   label_list = custom_labels
# Everything else is IDENTICAL.

# Minimum data recommendation:
# • 200-500 annotated sentences for basic NER
# • 1000-5000 for production quality
# • 10000+ for state-of-the-art
```
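Format 1 above needs a small parser to become the `tokens`/`ner_tags` columns that Format 2 builds directly. Here is a minimal sketch; `read_conll` is a hypothetical helper, and the inline StringIO sample stands in for opening a real data.txt file:

```python
import io

def read_conll(lines, label2id):
    """Parse tab-separated 'token<TAB>tag' lines; a blank line ends a sentence.
    Returns dicts shaped like the Hugging Face conll2003 columns."""
    tokens, tags, all_tokens, all_tags = [], [], [], []
    for line in list(lines) + [""]:  # "" sentinel flushes the last sentence
        line = line.strip()
        if not line:
            if tokens:
                all_tokens.append(tokens)
                all_tags.append(tags)
                tokens, tags = [], []
            continue
        token, tag = line.split("\t")
        tokens.append(token)
        tags.append(label2id[tag])
    return {"tokens": all_tokens, "ner_tags": all_tags}

label2id = {l: i for i, l in enumerate(
    ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"])}

sample = io.StringIO(
    "Joko\tB-PER\nWidodo\tI-PER\nvisited\tO\nBRI\tB-ORG\n\nJakarta\tB-LOC\n")
data = read_conll(sample, label2id)
print(data["tokens"])    # [['Joko', 'Widodo', 'visited', 'BRI'], ['Jakarta']]
print(data["ner_tags"])  # [[1, 2, 0, 3], [5]]
```

The resulting dict plugs straight into `Dataset.from_dict(data, features=features)` with the same `Features` definition as Format 2.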
12. POS Tagging – Another Token Classification Task
POS tagging = labeling each word with its grammatical class (noun, verb, adjective, etc.). This is also token classification; the process is exactly the same as NER – only the labels differ. CoNLL-2003 already provides pos_tags and chunk_tags in the same dataset.
```python
# POS tags in CoNLL-2003:
# "EU  rejects German call to boycott British lamb ."
#  NNP VBZ     JJ     NN   TO VB      JJ      NN   .
# (Proper Noun, Verb, Adjective, Noun, To, Verb, Adjective, Noun, Punctuation)

# Fine-tuning is IDENTICAL to NER!
# Just change:
# 1. label_list            → POS tag list (NN, VB, JJ, DT, ...)
# 2. examples["ner_tags"]  → examples["pos_tags"]
# 3. num_labels            → len(pos_tag_list)
# Everything else (alignment, collator, trainer) is the SAME!

# Other token classification tasks:
# • Chunking (NP, VP, PP phrases)
# • Aspect-Based Sentiment (target + sentiment per word)
# • Keyphrase extraction
# • Slot filling (for chatbot intents)
```
13. Page 4 Summary
| Concept | What It Is | Key Code |
|---|---|---|
| Token Classification | 1 label per token (not per sentence) | AutoModelForTokenClassification |
| NER | Identify PER/LOC/ORG/MISC | pipeline("ner", grouped_entities=True) |
| BIO Scheme | B=beginning, I=inside, O=outside | ["O","B-PER","I-PER","B-LOC",...] |
| Alignment | Subword→word label matching | word_ids() + label=-100 |
| is_split_into_words | Input already split into words | tokenizer(..., is_split_into_words=True) |
| DataCollator | Pad input + labels together | DataCollatorForTokenClassification |
| seqeval | Entity-level evaluation | evaluate.load("seqeval") |
| Post-processing | Merge subword → word entities | aggregation_strategy="simple" |
| BERT Cased | Capitalization matters for NER | "bert-base-cased" |
Page 3 – Fine-Tuning GPT & Text Generation
Coming Next: Page 5 β Question Answering & Seq2Seq (T5)
Two of the most powerful NLP tasks: extractive QA (finding answers in a context – SQuAD) and Seq2Seq (text-to-text: translation, summarization, table-to-text). Fine-tune T5 and BART for translation and summarization, understand the encoder-decoder architecture, and build a production QA system.