Table of Contents – Page 4
- Token vs Sequence Classification – Fundamental difference
- What Is NER? – Identifying entities in text
- BIO/IOB2 Labeling – How to mark entity boundaries
- CoNLL-2003 Dataset – Classic NER benchmark
- Subword–Word Alignment Problem – Why this is CRITICAL
- Tokenization + Label Alignment – Complete solution with code
- Where to Run? – Same as BERT Page 2 (Colab T4)
- Project: Fine-Tune BERT NER on CoNLL-2003 – Complete pipeline
- NER Evaluation with seqeval – Per-entity F1
- Post-Processing – Subword → word-level entities
- NER on Custom Dataset – Your own CSV template
- POS Tagging – Another token classification task
- Summary & Page 5 Preview
1. Token vs Sequence Classification – Fundamental Difference
Why Is Token Classification Harder?
1. More predictions: a 20-word sentence = 20 predictions (not 1)
2. Subword problem: "Widodo" might tokenize to ["Wi", "##do", "##do"] → 3 subwords but only 1 label! How do we align them? (Section 5)
3. Entity boundaries: "New York City" = 1 entity but 3 words. How do we mark the start and end? (Section 3: BIO scheme)
4. Class imbalance: most tokens = O (not part of any entity); only ~5-10% of tokens are entities.
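The "more predictions" point can be made concrete by comparing the output shapes the two heads produce. This is a toy sketch with no model involved; the label counts are just examples:

```python
# Toy illustration of the two output shapes (no model involved):
# sequence classification -> ONE prediction per sentence,
# token classification    -> one prediction PER TOKEN.
sentence = ["Joko", "Widodo", "visited", "Jakarta"]
num_labels_seq = 2   # e.g. positive/negative
num_labels_tok = 9   # e.g. the 9 CoNLL-2003 BIO labels

# Logits shapes a model head would produce for this one sentence:
seq_logits_shape = (1, num_labels_seq)                  # (batch, num_labels)
tok_logits_shape = (1, len(sentence), num_labels_tok)   # (batch, seq_len, num_labels)

print(seq_logits_shape)  # (1, 2)
print(tok_logits_shape)  # (1, 4, 9)
```

A 20-word sentence therefore gives the token head 20 chances to be wrong, one per position.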
2. What Is NER? – Identifying Entities in Text
Named Entity Recognition (NER) is the NLP task of identifying and categorizing named entities in text. The common entity types:
```python
from transformers import pipeline

# ===========================
# 1. NER pipeline – instant, zero training!
# ===========================
ner = pipeline("ner", grouped_entities=True, device=0)

text = "Joko Widodo met Tim Cook at Apple Park in Cupertino on January 15, 2024."
entities = ner(text)

for e in entities:
    print(f"  {e['word']:20s} → {e['entity_group']:5s} ({e['score']:.1%}) [{e['start']}:{e['end']}]")
# Joko Widodo          → PER   (99.8%) [0:12]
# Tim Cook             → PER   (99.6%) [17:25]
# Apple Park           → LOC   (97.3%) [29:39]
# Cupertino            → LOC   (99.9%) [43:52]
# January 15, 2024     → MISC  (85.2%) [56:73]

# ===========================
# 2. Entity types (standard CoNLL-2003)
# ===========================
# PER  = Person        (Joko Widodo, Tim Cook, Albert Einstein)
# LOC  = Location      (Jakarta, California, Mount Everest)
# ORG  = Organization  (Google, BRI, United Nations)
# MISC = Miscellaneous (Indonesian, COVID-19, iPhone 15)
#
# Custom NER can add any types you need:
# PRODUCT, DATE, MONEY, EVENT, DISEASE, DRUG, etc.

# ===========================
# 3. NER use cases in production
# ===========================
# • Search engines:   extract entities for knowledge graphs
# • Customer support: detect product names, order IDs
# • Medical NLP:      extract drug names, diseases, symptoms
# • Finance:          extract company names, monetary amounts
# • Legal:            extract person names, dates, legal references
# • Social media:     extract locations, organizations, events
```
| Entity Type | Examples | Description |
|---|---|---|
| PER (Person) | Joko Widodo, Elon Musk | Person names (including fictional) |
| LOC (Location) | Jakarta, Mount Bromo | Places: cities, countries, mountains, rivers |
| ORG (Organization) | Google, BRI, UN | Companies, institutions, organizations |
| MISC (Miscellaneous) | Indonesian, iPhone, COVID-19 | Languages, products, events, nationalities |
3. BIO/IOB2 Labeling Scheme – How to Mark Entity Boundaries
Problem: "New York City" = 1 LOC entity but 3 words. How does the model know "New", "York", and "City" are ONE entity, not three separate ones? Answer: BIO tagging scheme.
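A tiny hand-labeled example makes the scheme concrete before the full label list (pure Python, no library needed):

```python
# BIO tagging for the sentence "New York City is in New York":
words = ["New",   "York",  "City",  "is", "in", "New",   "York"]
tags  = ["B-LOC", "I-LOC", "I-LOC", "O",  "O",  "B-LOC", "I-LOC"]

# B- marks where an entity STARTS, I- continues it, O is outside.
# Counting B- tags = counting entities:
num_entities = sum(tag.startswith("B-") for tag in tags)
print(num_entities)  # 2  ("New York City" and "New York")
```

Because every entity begins with exactly one B- tag, the model's tags unambiguously encode both the boundaries and the count of entities, even when entities of the same type sit next to each other.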
```python
# ===========================
# Standard CoNLL-2003 label list (9 labels)
# ===========================
label_list = [
    "O",       # 0 – Outside (not an entity)
    "B-PER",   # 1 – Beginning of Person
    "I-PER",   # 2 – Inside Person
    "B-ORG",   # 3 – Beginning of Organization
    "I-ORG",   # 4 – Inside Organization
    "B-LOC",   # 5 – Beginning of Location
    "I-LOC",   # 6 – Inside Location
    "B-MISC",  # 7 – Beginning of Miscellaneous
    "I-MISC",  # 8 – Inside Miscellaneous
]

# Mappings
label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for i, label in enumerate(label_list)}

print(label2id)
# {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4,
#  'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}

# num_labels for the model = 9
# Formula: num_labels = 1 (O) + 2 × num_entity_types (B + I per type)
# 4 entity types → 1 + 2×4 = 9 labels
```
4. CoNLL-2003 Dataset – Classic NER Benchmark
```python
from datasets import load_dataset

# ===========================
# Load CoNLL-2003
# ===========================
dataset = load_dataset("conll2003")
print(dataset)
# DatasetDict({
#     train:      Dataset({features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'], num_rows: 14041})
#     validation: Dataset({num_rows: 3250})
#     test:       Dataset({num_rows: 3453})
# })

# ===========================
# Inspect one example
# ===========================
example = dataset["train"][0]
print(f"Tokens:   {example['tokens']}")
print(f"NER tags: {example['ner_tags']}")
# Tokens:   ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
# NER tags: [3, 0, 7, 0, 0, 0, 7, 0, 0]
#           B-ORG O B-MISC O O O B-MISC O O

# Decode NER tags to human-readable labels
ner_feature = dataset["train"].features["ner_tags"].feature
print(f"Label names: {ner_feature.names}")
# ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

for token, tag_id in zip(example["tokens"], example["ner_tags"]):
    tag = ner_feature.names[tag_id]
    marker = " ←" if tag != "O" else ""
    print(f"  {token:15s} {tag:8s}{marker}")
#   EU              B-ORG    ←  (European Union)
#   rejects         O
#   German          B-MISC   ←  (nationality)
#   call            O
#   to              O
#   boycott         O
#   British         B-MISC   ←  (nationality)
#   lamb            O
#   .               O

# IMPORTANT: the data is PRE-TOKENIZED (list of words + list of tags).
# This is DIFFERENT from text classification, where the input is a string!
# Input:  ["EU", "rejects", "German", ...]  (a list, not a string!)
# Labels: [3, 0, 7, ...]                    (one label per word)
```
5. Subword–Word Alignment Problem – Why This Is CRITICAL
This is the biggest challenge in NER with Hugging Face, and the one that confuses people most. The problem: NER datasets give one label per word, but BERT works at the subword token level, and one word can become multiple subwords!
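The mismatch and its standard fix can be sketched in pure Python. The subword split and `word_ids` list below are written by hand to mimic WordPiece output; a real tokenizer may split differently:

```python
# 4 words -> 4 labels, but tokenization produces 9 subword tokens:
words       = ["Joko", "Widodo", "visited", "Jakarta"]
word_labels = ["B-PER", "I-PER", "O", "B-LOC"]

subwords = ["[CLS]", "Jo", "##ko", "Wi", "##do", "##do",
            "visited", "Jakarta", "[SEP]"]
# word_ids: which word each subword came from (None = special token);
# this is what tokenizer.word_ids() returns for the real tokenizer.
word_ids = [None, 0, 0, 1, 1, 1, 2, 3, None]

# Alignment rule: first subword keeps the word's label, everything else -> -100
aligned = []
prev = None
for wid in word_ids:
    if wid is None:
        aligned.append(-100)           # special token
    elif wid != prev:
        aligned.append(word_labels[wid])  # first subword of a new word
    else:
        aligned.append(-100)           # continuation subword
    prev = wid

print(aligned)
# [-100, 'B-PER', -100, 'I-PER', -100, -100, 'O', 'B-LOC', -100]
```

The same loop, operating on label IDs instead of label strings, is exactly what the alignment function in Section 6 does per batch element.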
Why -100?
PyTorch's CrossEntropyLoss has an ignore_index parameter that defaults to -100. This means: if a label is -100, that position does not contribute to the loss – the model is neither penalized nor rewarded for its prediction there.
This is perfect for NER, because we only want the model to learn to predict labels at the first subword of each word (which carries the real label), not at continuation subwords or special tokens.
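To see concretely that -100 positions drop out of the loss, here is a pure-Python stand-in for the masking CrossEntropyLoss performs. The `masked_nll` helper and the toy probabilities are made up for illustration, not a real model's output:

```python
import math

def masked_nll(probs_of_true, labels, ignore_index=-100):
    """Mean negative log-likelihood over positions whose label != ignore_index.
    probs_of_true[i] = probability the model assigned to the TRUE label at i."""
    losses = [-math.log(p)
              for p, l in zip(probs_of_true, labels)
              if l != ignore_index]
    return sum(losses) / len(losses)

labels = [-100, 1, -100, 2, 0, -100]        # special/continuation tokens = -100
probs  = [0.01, 0.9, 0.5, 0.8, 0.95, 0.2]   # prob assigned to the true class

loss = masked_nll(probs, labels)
# Only positions 1, 3, 4 contribute; the terrible 0.01 and 0.2 at the
# ignored positions change nothing.
print(f"{loss:.4f}")  # 0.1266
```

Averaging over only the kept positions also mirrors PyTorch's default `reduction="mean"` behavior with `ignore_index` set.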
6. Tokenization + Label Alignment – Complete Solution
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# IMPORTANT: use a CASED model for NER! Names are capitalized!

# ===========================
# THE alignment function – THE MOST IMPORTANT CODE IN NER!
# ===========================
def tokenize_and_align_labels(examples):
    """Tokenize pre-tokenized words and align NER labels.

    Key concept: word_ids() tells us which WORD each subword came from.
    - word_id=None            → special token ([CLS], [SEP], [PAD]) → label=-100
    - First subword of a word → keep original label
    - Continuation subword    → label=-100

    Args:
        examples: batch with 'tokens' (list of words) and 'ner_tags' (list of ints)
    Returns:
        dict with input_ids, attention_mask, labels (aligned!)
    """
    tokenized = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,  # ← CRITICAL! Input is already split into words!
        max_length=128,
    )

    all_labels = []
    for i, labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        # word_ids: [None, 0, 0, 1, 1, 1, 2, 3, 3, 3, None]
        #           [CLS] Jo ##ko Wi ##do ##do visited Cup ##ert ##ino [SEP]
        label_ids = []
        previous_word_id = None
        for word_id in word_ids:
            if word_id is None:
                # Special token → ignore
                label_ids.append(-100)
            elif word_id != previous_word_id:
                # First subword of a new word → use the word's label
                label_ids.append(labels[word_id])
            else:
                # Continuation subword → ignore
                label_ids.append(-100)
            previous_word_id = word_id
        all_labels.append(label_ids)

    tokenized["labels"] = all_labels
    return tokenized

# ===========================
# Apply to dataset
# ===========================
tokenized_dataset = dataset.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=dataset["train"].column_names,
)

# Verify alignment
example = tokenized_dataset["train"][0]
tokens = tokenizer.convert_ids_to_tokens(example["input_ids"])
labels = example["labels"]
for token, label in zip(tokens, labels):
    label_name = label_list[label] if label != -100 else "IGNORE"
    print(f"  {token:15s} → {label_name}")
# [CLS]    → IGNORE
# EU       → B-ORG    ← first subword, keep label
# re       → IGNORE   ← continuation, ignore
# ##ject   → IGNORE
# ##s      → IGNORE
# German   → B-MISC   ← first subword of "German"
# ...
```
3 Critical Things to Remember:
1. is_split_into_words=True – MANDATORY! Without it, the tokenizer treats the input as one string (not a list of words) and will error or give wrong results.
2. word_ids() – the magic function that returns the subword→word mapping. None = special token, a number = the index of the source word.
3. BERT Cased – use bert-base-cased (NOT uncased!) for NER. Capitalization matters for names: "Apple" (company) vs "apple" (fruit).
7. Where to Run? – Same as BERT Page 2
NER fine-tuning uses the exact same model as text classification in Page 2 (BERT/DistilBERT). The only difference is the head: AutoModelForTokenClassification instead of AutoModelForSequenceClassification. VRAM, Colab setup, and OOM troubleshooting are identical – refer to Page 2, Section 1b.
TL;DR: open Google Colab → T4 GPU → !pip install -q transformers datasets accelerate evaluate seqeval → copy-paste the Section 8 code → run. ~10 minutes of training, F1 ~92%. Exact same setup as Page 2.
8. Project: Fine-Tune BERT NER on CoNLL-2003 – Complete Pipeline
```python
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
    DataCollatorForTokenClassification,
)

# ───────────────────────────────────────
# STEP 1: LOAD DATASET
# ───────────────────────────────────────
dataset = load_dataset("conll2003")
label_list = dataset["train"].features["ner_tags"].feature.names
num_labels = len(label_list)  # 9
print(f"Labels ({num_labels}): {label_list}")

# ───────────────────────────────────────
# STEP 2: LOAD TOKENIZER & MODEL
# ───────────────────────────────────────
model_name = "bert-base-cased"  # CASED for NER!
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    id2label={i: l for i, l in enumerate(label_list)},
    label2id={l: i for i, l in enumerate(label_list)},
)

# ───────────────────────────────────────
# STEP 3: TOKENIZE + ALIGN (from Section 6!)
# ───────────────────────────────────────
def tokenize_and_align(examples):
    tokenized = tokenizer(
        examples["tokens"], truncation=True,
        is_split_into_words=True, max_length=128,
    )
    all_labels = []
    for i, labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        label_ids, prev = [], None
        for wid in word_ids:
            if wid is None:
                label_ids.append(-100)
            elif wid != prev:
                label_ids.append(labels[wid])
            else:
                label_ids.append(-100)
            prev = wid
        all_labels.append(label_ids)
    tokenized["labels"] = all_labels
    return tokenized

tokenized = dataset.map(
    tokenize_and_align, batched=True,
    remove_columns=dataset["train"].column_names,
)

# ───────────────────────────────────────
# STEP 4: DATA COLLATOR (special for token classification!)
# ───────────────────────────────────────
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
# Pads BOTH input_ids AND labels to the same length!
# Padding labels get the value -100 (ignored in the loss)

# ───────────────────────────────────────
# STEP 5: METRICS – seqeval (NER-specific!)
# ───────────────────────────────────────
seqeval = evaluate.load("seqeval")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # Convert IDs back to label strings (skip -100!)
    true_labels, true_preds = [], []
    for pred_seq, label_seq in zip(predictions, labels):
        t_labels, t_preds = [], []
        for p, l in zip(pred_seq, label_seq):
            if l != -100:  # skip ignored positions!
                t_labels.append(label_list[l])
                t_preds.append(label_list[p])
        true_labels.append(t_labels)
        true_preds.append(t_preds)
    results = seqeval.compute(predictions=true_preds, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

# ───────────────────────────────────────
# STEP 6: TRAIN!
# ───────────────────────────────────────
args = TrainingArguments(
    output_dir="./ner-bert",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    fp16=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    save_total_limit=2,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()

# ───────────────────────────────────────
# STEP 7: EVALUATE
# ───────────────────────────────────────
results = trainer.evaluate(tokenized["test"])
print("\nNER Results:")
print(f"  F1:        {results['eval_f1']:.1%}")
print(f"  Precision: {results['eval_precision']:.1%}")
print(f"  Recall:    {results['eval_recall']:.1%}")
# F1:        91.8%
# Precision: 92.1%
# Recall:    91.5%

# Save
trainer.save_model("./ner-bert-final")
tokenizer.save_pretrained("./ner-bert-final")
print("NER model saved!")
```
91.8% F1 on CoNLL-2003!
State-of-the-art on CoNLL-2003 is ~94% F1 (DeBERTa-xlarge). BERT-base-cased achieves ~92%, which is already very good. Compare with traditional models: CRF = ~80%, BiLSTM-CRF = ~88%, BERT = ~92%.
9. NER Evaluation with seqeval – Per-Entity F1
```python
import evaluate

seqeval = evaluate.load("seqeval")

# Example predictions and references
true_labels = [["B-PER", "I-PER", "O", "B-LOC"]]
pred_labels = [["B-PER", "I-PER", "O", "B-ORG"]]  # LOC→ORG mistake!

results = seqeval.compute(predictions=pred_labels, references=true_labels)
print(results)
# {
#   'PER': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
#   'LOC': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1},  ← missed!
#   'ORG': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 0},  ← false positive
#   'overall_precision': 0.5,
#   'overall_recall': 0.5,
#   'overall_f1': 0.5,
#   'overall_accuracy': 0.75,
# }

# IMPORTANT: seqeval evaluates at the ENTITY level, not the token level!
# "Joko Widodo" = 1 PER entity (2 tokens).
# If the model predicts "Joko"=B-PER but "Widodo"=O (not I-PER),
# the entity is WRONG (a partial match counts as wrong!).
# seqeval uses "exact match" – all tokens must be correct.

# Typical results after fine-tuning BERT on CoNLL-2003:
# PER:  F1 = 96.2%  (names are relatively easy)
# LOC:  F1 = 93.1%  (locations are clear)
# ORG:  F1 = 89.4%  (organizations can be ambiguous)
# MISC: F1 = 82.6%  (misc is hardest – very diverse)
```
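The exact-match rule is easy to reimplement by hand, which makes it clear why a partial match counts as wrong. This is a simplified sketch of what seqeval does (the `bio_to_spans` helper is ours, and it silently drops stray I- tags that real schemes handle more carefully):

```python
def bio_to_spans(tags):
    """Collect (type, start, end) entity spans from a BIO tag sequence."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # "O" sentinel flushes the last entity
        boundary = (tag.startswith("B-") or tag == "O"
                    or (tag.startswith("I-") and tag[2:] != etype))
        if boundary:
            if etype is not None:
                spans.append((etype, start, i))   # close the running entity
            etype, start = (tag[2:], i) if tag.startswith("B-") else (None, None)
        # a plain I- continuing the current entity needs no action
    return spans

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "O",     "O", "B-LOC"]  # second PER token missed → partial match

gold_spans = set(bio_to_spans(gold))     # {('PER', 0, 2), ('LOC', 3, 4)}
pred_spans = set(bio_to_spans(pred))     # {('PER', 0, 1), ('LOC', 3, 4)}
print(gold_spans & pred_spans)           # only ('LOC', 3, 4) matches
```

The truncated PER span does not equal the gold span, so it counts as both a false positive and a false negative: precision = recall = 1/2 here, exactly the behavior seqeval reports.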
10. Post-Processing – Subword Predictions → Word-Level Entities
```python
from transformers import pipeline

# ===========================
# 1. Use the fine-tuned model as a pipeline (EASIEST!)
# ===========================
ner_pipe = pipeline(
    "ner",
    model="./ner-bert-final",
    aggregation_strategy="simple",  # merge subwords automatically!
    device=0,
)
# aggregation_strategy options:
# "none"    → raw subword predictions (no merging)
# "simple"  → merge subwords, take the first subword's label ← RECOMMENDED
# "first"   → same as simple
# "average" → average scores across subwords
# "max"     → take the max score across subwords

text = "Barack Obama visited the United Nations headquarters in New York City."
entities = ner_pipe(text)
for e in entities:
    print(f"  {e['word']:25s} {e['entity_group']:5s} ({e['score']:.1%})")
# Barack Obama              PER   (99.7%)
# United Nations            ORG   (99.2%)
# New York City             LOC   (99.5%)

# ===========================
# 2. Manual post-processing (for custom needs)
# ===========================
import torch

def extract_entities(text, model, tokenizer):
    """Extract entities manually with full control."""
    inputs = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
    offsets = inputs.pop("offset_mapping")[0]
    with torch.no_grad():
        outputs = model(**inputs.to(model.device))
    predictions = outputs.logits.argmax(dim=-1)[0].cpu().tolist()

    entities = []
    current_entity = None
    for idx, (pred, offset) in enumerate(zip(predictions, offsets)):
        label = model.config.id2label[pred]
        start, end = offset.tolist()
        if start == 0 and end == 0:  # special token
            continue
        if label.startswith("B-"):
            if current_entity:
                entities.append(current_entity)
            current_entity = {
                "entity": label[2:], "word": text[start:end],
                "start": start, "end": end,
            }
        elif label.startswith("I-") and current_entity:
            current_entity["word"] = text[current_entity["start"]:end]
            current_entity["end"] = end
        else:
            if current_entity:
                entities.append(current_entity)
                current_entity = None
    if current_entity:
        entities.append(current_entity)
    return entities

result = extract_entities(text, model, tokenizer)
for e in result:
    print(f"  {e['word']:25s} → {e['entity']}")
```
11. NER on Custom Dataset – Your Own CSV Template
```python
from datasets import Dataset, Features, Sequence, ClassLabel, Value

# ===========================
# Format 1: CoNLL-style (recommended)
# ===========================
# File: data.txt (tab-separated, empty line = sentence boundary)
#
# Joko      B-PER
# Widodo    I-PER
# visited   O
# BRI       B-ORG
# .         O
#                     ← empty line = new sentence
# Jakarta   B-LOC
# is        O
# big       O
# .         O

# ===========================
# Format 2: From a Python dict (easiest for small datasets)
# ===========================
custom_labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

data = {
    "tokens": [
        ["Joko", "Widodo", "visited", "BRI", "in", "Jakarta"],
        ["Google", "opened", "office", "in", "Surabaya"],
    ],
    "ner_tags": [
        [1, 2, 0, 3, 0, 5],  # B-PER I-PER O B-ORG O B-LOC
        [3, 0, 0, 0, 5],     # B-ORG O O O B-LOC
    ],
}

features = Features({
    "tokens": Sequence(Value("string")),
    "ner_tags": Sequence(ClassLabel(names=custom_labels)),
})
custom_dataset = Dataset.from_dict(data, features=features)
split = custom_dataset.train_test_split(test_size=0.2, seed=42)

# Now use EXACTLY the same pipeline as Section 8!
# Just replace: dataset = split (instead of load_dataset("conll2003"))
# And update:   label_list = custom_labels
# Everything else is IDENTICAL.

# Minimum data recommendation:
# • 200-500 annotated sentences for basic NER
# • 1000-5000 for production quality
# • 10000+ for state-of-the-art
```
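Format 1 above needs a small parser to become the `tokens`/`ner_tags` columns that Format 2 builds directly. Here is a minimal sketch; `read_conll` is a hypothetical helper, and the inline StringIO sample stands in for opening a real data.txt file:

```python
import io

def read_conll(lines, label2id):
    """Parse tab-separated 'token<TAB>tag' lines; a blank line ends a sentence.
    Returns dicts shaped like the Hugging Face conll2003 columns."""
    tokens, tags, all_tokens, all_tags = [], [], [], []
    for line in list(lines) + [""]:  # "" sentinel flushes the last sentence
        line = line.strip()
        if not line:
            if tokens:
                all_tokens.append(tokens)
                all_tags.append(tags)
                tokens, tags = [], []
            continue
        token, tag = line.split("\t")
        tokens.append(token)
        tags.append(label2id[tag])
    return {"tokens": all_tokens, "ner_tags": all_tags}

label2id = {l: i for i, l in enumerate(
    ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"])}

sample = io.StringIO(
    "Joko\tB-PER\nWidodo\tI-PER\nvisited\tO\nBRI\tB-ORG\n\nJakarta\tB-LOC\n")
data = read_conll(sample, label2id)
print(data["tokens"])    # [['Joko', 'Widodo', 'visited', 'BRI'], ['Jakarta']]
print(data["ner_tags"])  # [[1, 2, 0, 3], [5]]
```

The resulting dict plugs straight into `Dataset.from_dict(data, features=features)` with the same `Features` definition as Format 2.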
12. POS Tagging – Another Token Classification Task
POS tagging = labeling each word with its grammatical class (noun, verb, adjective, etc.). This is also token classification; the process is exactly the same as NER – only the labels differ. CoNLL-2003 already provides pos_tags and chunk_tags in the same dataset.
```python
# POS tags in CoNLL-2003:
# "EU  rejects German call to boycott British lamb ."
#  NNP VBZ     JJ     NN   TO VB      JJ      NN   .
# (Proper Noun, Verb, Adjective, Noun, To, Verb, Adjective, Noun, Punctuation)

# Fine-tuning is IDENTICAL to NER!
# Just change:
# 1. label_list            → POS tag list (NN, VB, JJ, DT, ...)
# 2. examples["ner_tags"]  → examples["pos_tags"]
# 3. num_labels            → len(pos_tag_list)
# Everything else (alignment, collator, trainer) is the SAME!

# Other token classification tasks:
# • Chunking (NP, VP, PP phrases)
# • Aspect-Based Sentiment (target + sentiment per word)
# • Keyphrase extraction
# • Slot filling (for chatbot intents)
```
13. Page 4 Summary
| Concept | What It Is | Key Code |
|---|---|---|
| Token Classification | 1 label per token (not per sentence) | AutoModelForTokenClassification |
| NER | Identify PER/LOC/ORG/MISC | pipeline("ner", grouped_entities=True) |
| BIO Scheme | B=beginning, I=inside, O=outside | ["O","B-PER","I-PER","B-LOC",...] |
| Alignment | Subword→word label matching | word_ids() + label=-100 |
| is_split_into_words | Input already split into words | tokenizer(..., is_split_into_words=True) |
| DataCollator | Pad input + labels together | DataCollatorForTokenClassification |
| seqeval | Entity-level evaluation | evaluate.load("seqeval") |
| Post-processing | Merge subword → word entities | aggregation_strategy="simple" |
| BERT Cased | Capitalization matters for NER | "bert-base-cased" |
Page 3 – Fine-Tuning GPT & Text Generation
Coming Next: Page 5 β Question Answering & Seq2Seq (T5)
Two of the most powerful NLP tasks: extractive QA (finding answers in a context – SQuAD) and Seq2Seq (text-to-text: translation, summarization, table-to-text). Fine-tune T5 and BART for translation and summarization, understand the encoder-decoder architecture, and build a production QA system.