πŸ“ Artikel ini ditulis dalam Bahasa Indonesia & English
πŸ“ This article is available in English & Bahasa Indonesia

🏷️ Learn Hugging Face β€” Page 4

Token Classification
& Named Entity Recognition


From sentence classification (Page 2) to per-token classification. Page 4 covers in super detail: what token classification is and how it differs from sequence classification, Named Entity Recognition (NER) β€” identifying people/places/organizations/dates in text, BIO/IOB2 labeling scheme β€” how to mark entity boundaries, subword↔word label alignment problem (CRITICAL β€” this is what makes NER in HF tricky!), fine-tuning BERT for NER on CoNLL-2003 and custom datasets, per-entity evaluation with seqeval (precision/recall/F1 per entity type), POS Tagging as another token classification example, post-processing: merging subword predictions into word-level entities, building a production NER pipeline from fine-tuned model, and where to run it (Colab setup, same VRAM as BERT Page 2).

πŸ“… March 2026 Β· ⏱ 40 min read
🏷 NER Β· Token Classification Β· BIO Tagging Β· CoNLL-2003 Β· Subword Alignment Β· seqeval Β· POS Tagging
πŸ“š Learn Hugging Face Series:

πŸ“‘ Table of Contents β€” Page 4

  1. Token vs Sequence Classification β€” Fundamental difference
  2. What Is NER? β€” Identifying entities in text
  3. BIO/IOB2 Labeling β€” How to mark entity boundaries
  4. CoNLL-2003 Dataset β€” Classic NER benchmark
  5. Subword↔Word Alignment Problem β€” Why this is CRITICAL
  6. Tokenization + Label Alignment β€” Complete solution with code
  7. Where to Run? β€” Same as BERT Page 2 (Colab T4 βœ…)
  8. Project: Fine-Tune BERT NER on CoNLL-2003 β€” Complete pipeline
  9. NER Evaluation with seqeval β€” Per-entity F1
  10. Post-Processing β€” Subword β†’ word-level entities
  11. NER on Custom Dataset β€” Your own CSV template
  12. POS Tagging β€” Another token classification task
  13. Summary & Page 5 Preview
βš–οΈ

1. Token vs Sequence Classification β€” Fundamental Difference

Page 2 = one label per sentence. Page 4 = one label per WORD.
Sequence Classification (Page 2) vs Token Classification (Page 4)

Sequence Classification (Page 2: Sentiment Analysis)
────────────────────────────────────────────────────
Input:  "I love this movie so much!"
Output: POSITIVE   (one label for the WHOLE sentence)
Model:  BERT β†’ [CLS] embedding β†’ Dense(2) β†’ softmax β†’ label
AutoModel: AutoModelForSequenceClassification

Token Classification (Page 4: NER)
────────────────────────────────────────────────────
Input:  "Joko Widodo visited Google in California"
Output:  B-PER  I-PER  O  B-ORG  O  B-LOC
           ↑      ↑    ↑    ↑    ↑    ↑
         one label for EVERY token!
Model:  BERT β†’ EVERY token embedding β†’ Dense(num_labels) β†’ softmax β†’ labels
AutoModel: AutoModelForTokenClassification

Key Difference:
  Sequence: 1 sentence β†’ 1 label   (sentiment, topic)
  Token:    1 sentence β†’ N labels  (NER, POS tagging)
            (N = number of tokens in the sentence)

πŸŽ“ Why Is Token Classification Harder?
1. More predictions: A 20-word sentence = 20 predictions (not 1)
2. Subword problem: "Widodo" might tokenize to ["Wi", "##do", "##do"] β€” 3 subwords but only 1 label! How to align? (Section 5)
3. Entity boundaries: "New York City" = 1 entity but 3 words. How to mark start and end? (Section 3: BIO scheme)
4. Class imbalance: Most tokens = O (not entity). Only ~5-10% of tokens are entities.
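The output-shape difference behind point 1 can be reduced to a toy sketch (plain Python, made-up logits — not real BERT outputs):

```python
# Toy logits, NOT real model outputs β€” just to show the output shapes.

# Sequence classification: ONE logit vector for the whole sentence.
seq_logits = [0.2, 2.1]                      # shape: (num_labels,)
seq_pred = max(range(len(seq_logits)), key=lambda i: seq_logits[i])

# Token classification: one logit vector PER token.
tok_logits = [
    [3.0, 0.1, 0.2],                         # token 1
    [0.1, 2.5, 0.3],                         # token 2
    [2.8, 0.2, 0.1],                         # token 3
]                                            # shape: (num_tokens, num_labels)
tok_preds = [max(range(len(row)), key=lambda i: row[i]) for row in tok_logits]

print(seq_pred)    # 1 β€” a single label for the sentence
print(tok_preds)   # [0, 1, 0] β€” one label per token
```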

🏷️

2. What Is NER? β€” Identifying Entities in Text

Finding and classifying person names, locations, organizations, dates, etc. in text

Named Entity Recognition (NER) is an NLP task to identify and categorize named entities in text. Common entities:

25_ner_intro.py β€” NER with a Pipeline
from transformers import pipeline

# ===========================
# 1. NER pipeline β€” instant, zero training!
# ===========================
ner = pipeline("ner", aggregation_strategy="simple", device=0)
# (aggregation_strategy="simple" replaces the deprecated grouped_entities=True)

text = "Joko Widodo met Tim Cook at Apple Park in Cupertino on January 15, 2024."
entities = ner(text)

for e in entities:
    print(f"  {e['word']:20s} β†’ {e['entity_group']:5s} ({e['score']:.1%})  [{e['start']}:{e['end']}]")
# Joko Widodo          β†’ PER   (99.8%)  [0:12]
# Tim Cook             β†’ PER   (99.6%)  [17:25]
# Apple Park           β†’ LOC   (97.3%)  [29:39]
# Cupertino            β†’ LOC   (99.9%)  [43:52]
# January 15, 2024     β†’ MISC  (85.2%)  [56:73]

# ===========================
# 2. Entity types (standard CoNLL-2003)
# ===========================
# PER  = Person         (Joko Widodo, Tim Cook, Albert Einstein)
# LOC  = Location       (Jakarta, California, Mount Everest)
# ORG  = Organization   (Google, BRI, United Nations)
# MISC = Miscellaneous  (Indonesian, COVID-19, iPhone 15)
#
# Custom NER can add any entity type:
# PRODUCT, DATE, MONEY, EVENT, DISEASE, DRUG, etc.

# ===========================
# 3. NER use cases in production
# ===========================
# β€’ Search engines: extract entities for knowledge graphs
# β€’ Customer support: detect product names, order IDs
# β€’ Medical NLP: extract drug names, diseases, symptoms
# β€’ Finance: extract company names, monetary amounts
# β€’ Legal: extract person names, dates, legal references
# β€’ Social media: extract locations, organizations, events
Entity Type          | Examples                     | Description
PER (Person)         | Joko Widodo, Elon Musk       | Person names (including fictional)
LOC (Location)       | Jakarta, Mount Bromo         | Places: cities, countries, mountains, rivers
ORG (Organization)   | Google, BRI, UN              | Companies, institutions, organizations
MISC (Miscellaneous) | Indonesian, iPhone, COVID-19 | Languages, products, events, nationalities
🏷️

3. BIO/IOB2 Labeling Scheme β€” How to Mark Entity Boundaries

B = Beginning, I = Inside, O = Outside β€” the system that distinguishes "New" and "York" in "New York"

Problem: "New York City" = 1 LOC entity but 3 words. How does the model know "New", "York", and "City" are ONE entity, not three separate ones? Answer: BIO tagging scheme.

BIO Labeling β€” Every Label Has 3 Possibilities

B-XXX = Beginning of entity XXX (FIRST word of the entity)
I-XXX = Inside entity XXX      (CONTINUATION of the same entity)
O     = Outside any entity     (not an entity)

Example:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Joko   β”‚ Widodo β”‚ visited β”‚ New   β”‚ York  β”‚ on β”‚ Monday β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ B-PER  β”‚ I-PER  β”‚ O       β”‚ B-LOC β”‚ I-LOC β”‚ O  β”‚ B-MISC β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜
   ↑ start  ↑ continue         ↑ start ↑ continue
     PER      PER                LOC     LOC

From BIO tags, we can reconstruct the entities:
β€’ B-PER + I-PER = "Joko Widodo" (1 PER entity)
β€’ B-LOC + I-LOC = "New York"    (1 LOC entity)
β€’ B-MISC        = "Monday"      (1 MISC entity, single word)

Why do B and I matter?
Without B/I: "Joko PER  Widodo PER  visited O  Tim PER  Cook PER"
  β†’ Is "Joko Widodo Tim Cook" 1 entity or 2?? Ambiguous!
With B/I:    "Joko B-PER  Widodo I-PER  Tim B-PER  Cook I-PER"
  β†’ Clear: 2 separate entities! B = start of a new entity.
26_bio_labeling.py β€” BIO Label Mapping
# ===========================
# Standard CoNLL-2003 label list (9 labels)
# ===========================
label_list = [
    "O",       # 0 β€” Outside (not an entity)
    "B-PER",   # 1 β€” Beginning of Person
    "I-PER",   # 2 β€” Inside Person
    "B-ORG",   # 3 β€” Beginning of Organization
    "I-ORG",   # 4 β€” Inside Organization
    "B-LOC",   # 5 β€” Beginning of Location
    "I-LOC",   # 6 β€” Inside Location
    "B-MISC",  # 7 β€” Beginning of Miscellaneous
    "I-MISC",  # 8 β€” Inside Miscellaneous
]

# Mappings
label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for i, label in enumerate(label_list)}

print(label2id)
# {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4,
#  'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}

# num_labels for the model = 9
# Formula: num_labels = 1 (O) + 2 Γ— num_entity_types (B + I per type)
# 4 entity types β†’ 1 + 2Γ—4 = 9 labels
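To make the B/I/O roles concrete, here is a minimal sketch (my own helper, not one of the article's numbered files) that reconstructs entity spans from word-level BIO tags:

```python
def bio_to_entities(words, tags):
    """Group word-level BIO tags into (entity_type, text) spans.

    B-XXX starts a new entity, I-XXX extends the current one, O closes it.
    """
    entities, current = [], None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = (tag[2:], [word])          # new entity starts here
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(word)              # continuation of same entity
        else:                                    # "O" or a non-matching I- tag
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(etype, " ".join(ws)) for etype, ws in entities]

words = ["Joko", "Widodo", "visited", "New", "York", "on", "Monday"]
tags  = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O", "B-MISC"]
print(bio_to_entities(words, tags))
# [('PER', 'Joko Widodo'), ('LOC', 'New York'), ('MISC', 'Monday')]
```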
πŸ“Š

4. CoNLL-2003 Dataset β€” Classic NER Benchmark

The most popular NER dataset: 20k English sentences from Reuters news
27_conll_dataset.py β€” Explore CoNLL-2003
from datasets import load_dataset

# ===========================
# Load CoNLL-2003
# ===========================
dataset = load_dataset("conll2003")
print(dataset)
# DatasetDict({
#     train: Dataset({features: ['id','tokens','pos_tags','chunk_tags','ner_tags'], num_rows: 14041})
#     validation: Dataset({num_rows: 3250})
#     test: Dataset({num_rows: 3453})
# })

# ===========================
# Inspect one example
# ===========================
example = dataset["train"][0]
print(f"Tokens:   {example['tokens']}")
print(f"NER tags: {example['ner_tags']}")
# Tokens:   ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
# NER tags: [3, 0, 7, 0, 0, 0, 7, 0, 0]
#           B-ORG O  B-MISC O O O  B-MISC O O

# Decode NER tags to human-readable
ner_feature = dataset["train"].features["ner_tags"].feature
print(f"Label names: {ner_feature.names}")
# ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# Human-readable version:
for token, tag_id in zip(example["tokens"], example["ner_tags"]):
    tag = ner_feature.names[tag_id]
    marker = "  ←" if tag != "O" else ""
    print(f"  {token:15s} {tag:8s}{marker}")
# EU              B-ORG   ←   (European Union)
# rejects         O
# German          B-MISC  ←   (nationality)
# call            O
# to              O
# boycott         O
# British         B-MISC  ←   (nationality)
# lamb            O
# .               O

# IMPORTANT: the data is already PRE-TOKENIZED (list of words + list of tags)
# This is DIFFERENT from text classification, where the input = a string!
# Input:  ["EU", "rejects", "German", ...] (a list, not a string!)
# Labels: [3, 0, 7, ...] (one label per word)
πŸ”—

5. Subword↔Word Alignment Problem β€” Why This Is CRITICAL

Tokenizer splits words into subwords β€” but NER labels are at WORD level, not subword!

This is the biggest challenge in NER with Hugging Face and what confuses people most. The problem: NER datasets give 1 label per word, but the BERT model works at the subword token level. One word can become multiple subwords!

The Alignment Problem β€” Word Labels vs Subword Tokens

Dataset (word-level):
  Words:  ["Joko", "Widodo", "visited", "Cupertino"]
  Labels: [ B-PER,  I-PER,    O,         B-LOC    ]
  Count:  4 words β†’ 4 labels βœ“

After BERT tokenizer (subword-level):
  Tokens: ["[CLS]", "Jo", "##ko", "Wi", "##do", "##do",
           "visited", "Cup", "##ert", "##ino", "[SEP]"]
  Count:  11 tokens β†’ need 11 labels! But we only have 4!

PROBLEM: How to assign labels to 11 tokens from 4 word labels?

SOLUTION: Alignment strategy
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Token     β”‚ Word # β”‚ Is First? β”‚ Assigned Label         β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ [CLS]     β”‚ None   β”‚ special   β”‚ -100 (IGNORE in loss)  β”‚
β”‚ Jo        β”‚ 0      β”‚ βœ… YES    β”‚ B-PER (from word 0)    β”‚
β”‚ ##ko      β”‚ 0      β”‚ ❌ no     β”‚ -100 (IGNORE)          β”‚
β”‚ Wi        β”‚ 1      β”‚ βœ… YES    β”‚ I-PER (from word 1)    β”‚
β”‚ ##do      β”‚ 1      β”‚ ❌ no     β”‚ -100 (IGNORE)          β”‚
β”‚ ##do      β”‚ 1      β”‚ ❌ no     β”‚ -100 (IGNORE)          β”‚
β”‚ visited   β”‚ 2      β”‚ βœ… YES    β”‚ O     (from word 2)    β”‚
β”‚ Cup       β”‚ 3      β”‚ βœ… YES    β”‚ B-LOC (from word 3)    β”‚
β”‚ ##ert     β”‚ 3      β”‚ ❌ no     β”‚ -100 (IGNORE)          β”‚
β”‚ ##ino     β”‚ 3      β”‚ ❌ no     β”‚ -100 (IGNORE)          β”‚
β”‚ [SEP]     β”‚ None   β”‚ special   β”‚ -100 (IGNORE)          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Rules:
β€’ Special tokens ([CLS], [SEP], [PAD]) β†’ label = -100
β€’ First subword of a word β†’ label = that word's label
β€’ Continuation subword (##xxx) β†’ label = -100
β€’ -100 = "IGNORE" β†’ PyTorch CrossEntropyLoss skips these positions!
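This table can be turned into a tiny stand-alone sketch. The word_ids list below is hardcoded to mimic what the tokenizer would return for this sentence (Section 6 shows the real tokenizer call):

```python
def align_labels(word_labels, word_ids):
    """Apply the alignment rules: special tokens and continuation subwords get -100."""
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:                # special token ([CLS]/[SEP])
            aligned.append(-100)
        elif wid != prev:              # first subword of a new word
            aligned.append(word_labels[wid])
        else:                          # continuation subword
            aligned.append(-100)
        prev = wid
    return aligned

# Word-level labels: B-PER=1, I-PER=2, O=0, B-LOC=5
word_labels = [1, 2, 0, 5]
# Hardcoded to mimic tokenizer word_ids() for:
# [CLS] Jo ##ko Wi ##do ##do visited Cup ##ert ##ino [SEP]
word_ids = [None, 0, 0, 1, 1, 1, 2, 3, 3, 3, None]

print(align_labels(word_labels, word_ids))
# [-100, 1, -100, 2, -100, -100, 0, 5, -100, -100, -100]
```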

πŸŽ“ Why -100?
PyTorch CrossEntropyLoss has parameter ignore_index=-100 (default). This means: if label = -100, that position doesn't contribute to the loss β€” the model isn't penalized or rewarded for predictions at that position.

This is perfect for NER because we only want the model to learn predictions at the first subword of each word (which has the real label), not at continuation subwords or special tokens.
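A pure-Python sketch of what ignore_index does (no PyTorch needed; the logits are made up): positions labeled -100 simply don't enter the averaged loss.

```python
import math

def softmax(logits):
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits_per_token, labels, ignore_index=-100):
    """Mean negative log-likelihood, skipping positions whose label == ignore_index."""
    losses = []
    for logits, label in zip(logits_per_token, labels):
        if label == ignore_index:
            continue  # this position contributes nothing to the loss
        losses.append(-math.log(softmax(logits)[label]))
    return sum(losses) / len(losses)

logits = [[5.0, 0.1], [0.2, 4.0], [3.0, 3.0]]
full = cross_entropy(logits, [0, 1, 0])      # all 3 positions counted
part = cross_entropy(logits, [0, 1, -100])   # 3rd position ignored
print(full > part)  # True β€” the uncertain 3rd position no longer hurts the loss
```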

βœ‚οΈ

6. Tokenization + Label Alignment β€” Complete Solution

Tokenization function that handles subword↔word alignment automatically
28_tokenize_align.py β€” Alignment Solution πŸ”‘
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# IMPORTANT: use CASED model for NER! Names are capitalized!

# ===========================
# THE alignment function β€” THE MOST IMPORTANT CODE IN NER!
# ===========================
def tokenize_and_align_labels(examples):
    """Tokenize pre-tokenized words and align NER labels.

    Key concept: word_ids() tells us which WORD each subword came from.
    - word_id=None β†’ special token ([CLS], [SEP], [PAD]) β†’ label=-100
    - First subword of a word β†’ keep original label
    - Continuation subword β†’ label=-100

    Args:
        examples: batch with 'tokens' (list of words) and 'ner_tags' (list of ints)
    Returns:
        dict with input_ids, attention_mask, labels (aligned!)
    """
    tokenized = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,  # ← CRITICAL! Input is already split into words!
        max_length=128,
    )

    all_labels = []
    for i, labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        # word_ids: [None, 0, 0, 1, 1, 1, 2, 3, 3, 3, None]
        #           [CLS] Jo ##ko Wi ##do ##do visited Cup ##ert ##ino [SEP]

        label_ids = []
        previous_word_id = None

        for word_id in word_ids:
            if word_id is None:
                # Special token β†’ ignore
                label_ids.append(-100)
            elif word_id != previous_word_id:
                # First subword of a new word β†’ use word's label
                label_ids.append(labels[word_id])
            else:
                # Continuation subword β†’ ignore
                label_ids.append(-100)

            previous_word_id = word_id

        all_labels.append(label_ids)

    tokenized["labels"] = all_labels
    return tokenized

# ===========================
# Apply to the dataset (assumes `dataset` loaded in 27_conll_dataset.py)
# ===========================
tokenized_dataset = dataset.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=dataset["train"].column_names,
)

# Verify alignment
# (label_list here = dataset["train"].features["ner_tags"].feature.names)
example = tokenized_dataset["train"][0]
tokens = tokenizer.convert_ids_to_tokens(example["input_ids"])
labels = example["labels"]
for token, label in zip(tokens, labels):
    label_name = label_list[label] if label != -100 else "IGNORE"
    print(f"  {token:15s} β†’ {label_name}")
# [CLS]           β†’ IGNORE
# EU              β†’ B-ORG   ← first subword, keep label
# re              β†’ O       ← first subword of "rejects", keep label
# ##ject          β†’ IGNORE  ← continuation, ignore
# ##s             β†’ IGNORE
# German          β†’ B-MISC  ← first subword of "German"
# ...

πŸŽ“ 3 Critical Things to Remember:
1. is_split_into_words=True β€” MANDATORY! Without this, the tokenizer treats input as 1 string (not a list of words). Will error or give wrong results.
2. word_ids() — magic function that returns subword→word mapping. None = special token, number = source word index.
3. BERT Cased β€” Use bert-base-cased (NOT uncased!) for NER. Capitalization matters for names: "Apple" (company) vs "apple" (fruit).

πŸ’»

7. Where to Run? β€” Same as BERT Page 2

NER = fine-tuning BERT β†’ VRAM and setup identical to Page 2

NER fine-tuning uses the exact same model as text classification in Page 2 (BERT/DistilBERT). The only difference is the head: AutoModelForTokenClassification instead of AutoModelForSequenceClassification. VRAM, Colab setup, and OOM troubleshooting are identical β€” refer to Page 2 Section 1b.

πŸŽ‰ TL;DR: Open Google Colab β†’ T4 GPU β†’ !pip install -q transformers datasets accelerate evaluate seqeval β†’ Copy-paste Section 8 code β†’ Run. ~10 min training, F1 ~92%. Exact same setup as Page 2.

πŸ”₯

8. Project: Fine-Tune BERT NER on CoNLL-2003 β€” Complete Pipeline

Combine everything: dataset β†’ alignment β†’ collator β†’ metrics β†’ Trainer β†’ evaluate
29_ner_finetune.py β€” Complete NER Fine-Tuning πŸ”₯πŸ”₯πŸ”₯
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (
    AutoTokenizer, AutoModelForTokenClassification,
    TrainingArguments, Trainer, DataCollatorForTokenClassification
)

# ═══════════════════════════════════════
# STEP 1: LOAD DATASET
# ═══════════════════════════════════════
dataset = load_dataset("conll2003")
label_list = dataset["train"].features["ner_tags"].feature.names
num_labels = len(label_list)  # 9
print(f"Labels ({num_labels}): {label_list}")

# ═══════════════════════════════════════
# STEP 2: LOAD TOKENIZER & MODEL
# ═══════════════════════════════════════
model_name = "bert-base-cased"  # CASED for NER!
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    id2label={i: l for i, l in enumerate(label_list)},
    label2id={l: i for i, l in enumerate(label_list)},
)

# ═══════════════════════════════════════
# STEP 3: TOKENIZE + ALIGN (from Section 6!)
# ═══════════════════════════════════════
def tokenize_and_align(examples):
    tokenized = tokenizer(examples["tokens"], truncation=True,
                          is_split_into_words=True, max_length=128)
    all_labels = []
    for i, labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        label_ids, prev = [], None
        for wid in word_ids:
            if wid is None:       label_ids.append(-100)
            elif wid != prev:     label_ids.append(labels[wid])
            else:                 label_ids.append(-100)
            prev = wid
        all_labels.append(label_ids)
    tokenized["labels"] = all_labels
    return tokenized

tokenized = dataset.map(tokenize_and_align, batched=True,
                        remove_columns=dataset["train"].column_names)

# ═══════════════════════════════════════
# STEP 4: DATA COLLATOR (special for token classification!)
# ═══════════════════════════════════════
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
# Pads BOTH input_ids AND labels to same length!
# Padding labels get value -100 (ignored in loss)

# ═══════════════════════════════════════
# STEP 5: METRICS β€” seqeval (NER-specific!)
# ═══════════════════════════════════════
seqeval = evaluate.load("seqeval")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # Convert IDs back to label strings (skip -100!)
    true_labels, true_preds = [], []
    for pred_seq, label_seq in zip(predictions, labels):
        t_labels, t_preds = [], []
        for p, l in zip(pred_seq, label_seq):
            if l != -100:  # skip ignored positions!
                t_labels.append(label_list[l])
                t_preds.append(label_list[p])
        true_labels.append(t_labels)
        true_preds.append(t_preds)

    results = seqeval.compute(predictions=true_preds, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

# ═══════════════════════════════════════
# STEP 6: TRAIN!
# ═══════════════════════════════════════
args = TrainingArguments(
    output_dir="./ner-bert",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    fp16=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    save_total_limit=2,
    report_to="none",
)

trainer = Trainer(
    model=model, args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

# ═══════════════════════════════════════
# STEP 7: EVALUATE
# ═══════════════════════════════════════
results = trainer.evaluate(tokenized["test"])
print(f"\n🏷️ NER Results:")
print(f"  F1:        {results['eval_f1']:.1%}")
print(f"  Precision: {results['eval_precision']:.1%}")
print(f"  Recall:    {results['eval_recall']:.1%}")
# 🏷️ NER Results:
#   F1:        91.8%
#   Precision: 92.1%
#   Recall:    91.5%

# Save
trainer.save_model("./ner-bert-final")
tokenizer.save_pretrained("./ner-bert-final")
print("πŸ† NER model saved!")

🏷️ 91.8% F1 on CoNLL-2003!
State-of-the-art on CoNLL-2003 is ~94% F1 (DeBERTa-xlarge). BERT-base-cased achieves ~92%, which is already very good. Compare with traditional models: CRF = ~80%, BiLSTM-CRF = ~88%, BERT = ~92%.

πŸ“Š

9. NER Evaluation with seqeval β€” Per-Entity F1

Overall F1 isn't enough β€” you need per-entity-type F1 (PER, LOC, ORG, MISC)
30_seqeval_detail.py β€” Detailed NER Evaluation
import evaluate

seqeval = evaluate.load("seqeval")

# Example predictions and references
true_labels = [["B-PER", "I-PER", "O", "B-LOC"]]
pred_labels = [["B-PER", "I-PER", "O", "B-ORG"]]  # LOC→ORG mistake!

results = seqeval.compute(predictions=pred_labels, references=true_labels)
print(results)
# {
#   'PER':  {'precision': 1.0,  'recall': 1.0,  'f1': 1.0,  'number': 1},
#   'LOC':  {'precision': 0.0,  'recall': 0.0,  'f1': 0.0,  'number': 1},  ← missed!
#   'ORG':  {'precision': 0.0,  'recall': 0.0,  'f1': 0.0,  'number': 0},  ← false positive
#   'overall_precision': 0.5,
#   'overall_recall': 0.5,
#   'overall_f1': 0.5,
#   'overall_accuracy': 0.75,
# }

# IMPORTANT: seqeval evaluates at the ENTITY level, not the token level!
# "Joko Widodo" = 1 PER entity (2 tokens)
# If the model predicts "Joko"=B-PER but "Widodo"=O (not I-PER)
# β†’ the entity is WRONG (a partial match counts as wrong!)
# seqeval uses "exact match" β€” all tokens of the entity must be correct.

# Typical results after fine-tuning BERT on CoNLL-2003:
# PER:  F1 = 96.2% (names are relatively easy)
# LOC:  F1 = 93.1% (locations are clear)
# ORG:  F1 = 89.4% (organizations can be ambiguous)
# MISC: F1 = 82.6% (misc is hardest β€” very diverse)
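To see why entity-level scoring is stricter than token-level, here is a small sketch (my own helpers, not seqeval's internals) that extracts (type, start, end) spans and scores exact matches:

```python
def spans(tags):
    """Extract (entity_type, start, end) spans from one BIO tag sequence."""
    out, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):   # sentinel "O" flushes the last span
        boundary = tag.startswith("B-") or tag == "O" or \
                   (tag.startswith("I-") and tag[2:] != etype)
        if boundary:
            if etype is not None:
                out.append((etype, start, i))
            etype, start = (tag[2:], i) if tag.startswith("B-") else (None, None)
    return out

def entity_f1(pred, gold):
    """Exact-match F1: a predicted span counts only if type AND boundaries match."""
    p, g = set(spans(pred)), set(spans(gold))
    correct = len(p & g)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    return 2 * precision * recall / (precision + recall) if correct else 0.0

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "O",     "O", "B-LOC"]   # "Widodo" missed β†’ whole PER span is wrong
print(entity_f1(pred, gold))
# 0.5 β€” only 1 of 2 entities matches exactly, even though 3 of 4 tokens are right
```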
πŸ”§

10. Post-Processing β€” Subword Predictions β†’ Word-Level Entities

After model predicts, merge subword predictions into complete entities
31_postprocess.py β€” NER Post-Processing & Pipeline
from transformers import pipeline

# ===========================
# 1. Use fine-tuned model as pipeline (EASIEST!)
# ===========================
ner_pipe = pipeline("ner",
    model="./ner-bert-final",
    aggregation_strategy="simple",  # merge subwords automatically!
    device=0,
)
# aggregation_strategy options:
# "none"   β†’ raw subword predictions (no merging)
# "simple" β†’ merge subwords, take first subword's label ← RECOMMENDED
# "first"  β†’ same as simple
# "average" β†’ average scores across subwords
# "max"    β†’ take max score across subwords

text = "Barack Obama visited the United Nations headquarters in New York City."
entities = ner_pipe(text)
for e in entities:
    print(f"  {e['word']:25s} {e['entity_group']:5s} ({e['score']:.1%})")
# Barack Obama              PER   (99.7%)
# United Nations            ORG   (99.2%)
# New York City             LOC   (99.5%)

# ===========================
# 2. Manual post-processing (for custom needs)
# ===========================
import torch

def extract_entities(text, model, tokenizer):
    """Extract entities manually with full control."""
    inputs = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
    offsets = inputs.pop("offset_mapping")[0]

    with torch.no_grad():
        outputs = model(**inputs.to(model.device))
    predictions = outputs.logits.argmax(dim=-1)[0].cpu().tolist()

    entities = []
    current_entity = None

    for idx, (pred, offset) in enumerate(zip(predictions, offsets)):
        label = model.config.id2label[pred]
        start, end = offset.tolist()

        if start == 0 and end == 0:  # special token
            continue

        if label.startswith("B-"):
            if current_entity:
                entities.append(current_entity)
            current_entity = {
                "entity": label[2:],
                "word": text[start:end],
                "start": start, "end": end
            }
        elif label.startswith("I-") and current_entity:
            current_entity["word"] = text[current_entity["start"]:end]
            current_entity["end"] = end
        else:
            if current_entity:
                entities.append(current_entity)
                current_entity = None

    if current_entity:
        entities.append(current_entity)
    return entities

result = extract_entities(text, model, tokenizer)
for e in result:
    print(f"  {e['word']:25s} β†’ {e['entity']}")
πŸ“

11. NER on Custom Dataset β€” Your Own CSV Template

Format your data to CoNLL format β†’ fine-tune β†’ custom NER model
32_custom_ner.py β€” Custom NER Dataset
from datasets import Dataset, Features, Sequence, ClassLabel, Value

# ===========================
# Format 1: CoNLL-style (recommended)
# ===========================
# File: data.txt (tab-separated, empty line = sentence boundary)
# Joko    B-PER
# Widodo  I-PER
# visited O
# BRI     B-ORG
# .       O
#                  ← empty line = new sentence
# Jakarta B-LOC
# is      O
# big     O
# .       O

# ===========================
# Format 2: From Python dict (easiest for small datasets)
# ===========================
custom_labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

data = {
    "tokens": [
        ["Joko", "Widodo", "visited", "BRI", "in", "Jakarta"],
        ["Google", "opened", "office", "in", "Surabaya"],
    ],
    "ner_tags": [
        [1, 2, 0, 3, 0, 5],   # B-PER I-PER O B-ORG O B-LOC
        [3, 0, 0, 0, 5],       # B-ORG O O O B-LOC
    ],
}

features = Features({
    "tokens": Sequence(Value("string")),
    "ner_tags": Sequence(ClassLabel(names=custom_labels)),
})

custom_dataset = Dataset.from_dict(data, features=features)
split = custom_dataset.train_test_split(test_size=0.2, seed=42)

# Now use EXACTLY the same pipeline as Section 8!
# Just replace: dataset = split  (instead of load_dataset("conll2003"))
# And update: label_list = custom_labels
# Everything else is IDENTICAL. πŸŽ‰

# Minimum data recommendation:
# β€’ 200-500 annotated sentences for basic NER
# β€’ 1000-5000 for production-quality
# β€’ 10000+ for state-of-the-art
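Format 1 above is only sketched in comments. A small parser can turn such a tab-separated file into the dict that Dataset.from_dict expects β€” a sketch under the stated assumptions (tab-separated word/label pairs, blank line as sentence boundary; read_conll is a hypothetical helper):

```python
from pathlib import Path

def read_conll(path, label_list):
    """Parse a tab-separated CoNLL-style file into a dict suitable for
    datasets.Dataset.from_dict. Blank lines separate sentences."""
    label2id = {label: i for i, label in enumerate(label_list)}
    data = {"tokens": [], "ner_tags": []}
    tokens, tags = [], []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            # blank line = end of sentence
            if tokens:
                data["tokens"].append(tokens)
                data["ner_tags"].append(tags)
                tokens, tags = [], []
            continue
        word, label = line.split("\t")
        tokens.append(word)
        tags.append(label2id[label])
    if tokens:  # file may not end with a blank line
        data["tokens"].append(tokens)
        data["ner_tags"].append(tags)
    return data
```

Feed the result straight into Dataset.from_dict(read_conll("data.txt", custom_labels), features=features) and the rest of the pipeline is unchanged.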

12. POS Tagging β€” Another Token Classification Task

Part-of-Speech tagging: noun, verb, adjective β€” syntax analysis per word

POS Tagging = labeling each word with its grammatical class (noun, verb, adjective, etc.). This is also token classification, the process is exactly the same as NER β€” only the labels differ. CoNLL-2003 already provides pos_tags and chunk_tags in the same dataset.

33_pos_tagging.py β€” POS Tagging (Same Process!) (python)
# POS tags in CoNLL-2003:
# "EU rejects German call to boycott British lamb ."
# NNP VBZ    JJ     NN   TO VB      JJ      NN   .
# (Proper Noun, Verb, Adjective, Noun, To, Verb, Adj, Noun, Punct)

# Fine-tuning is IDENTICAL to NER!
# Just change:
# 1. label_list β†’ POS tag list (NN, VB, JJ, DT, ...)
# 2. examples["ner_tags"] β†’ examples["pos_tags"]
# 3. num_labels β†’ len(pos_tag_list)
# Everything else (alignment, collator, trainer) is the SAME!

# Other token classification tasks:
# β€’ Chunking (NP, VP, PP phrases)
# β€’ Aspect-Based Sentiment (target + sentiment per word)
# β€’ Keyphrase extraction
# β€’ Slot filling (for chatbot intents)

13. Page 4 Summary

Concept | What It Is | Key Code
Token Classification | 1 label per token (not per sentence) | AutoModelForTokenClassification
NER | Identify PER/LOC/ORG/MISC | pipeline("ner", grouped_entities=True)
BIO Scheme | B=beginning, I=inside, O=outside | ["O","B-PER","I-PER","B-LOC",...]
Alignment | Subword↔word label matching | word_ids() + label=-100
is_split_into_words | Input already split into words | tokenizer(..., is_split_into_words=True)
DataCollator | Pad inputs + labels together | DataCollatorForTokenClassification
seqeval | Entity-level evaluation | evaluate.load("seqeval")
Post-processing | Merge subwords β†’ word entities | aggregation_strategy="simple"
BERT Cased | Capitalization matters for NER | "bert-base-cased"
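The Alignment row (word_ids() + label=-100) boils down to one rule: only the first subword of each word keeps its label; special tokens and continuation subwords are masked with -100 so the loss ignores them. A pure-Python sketch of that rule, given the word_ids() list a fast tokenizer returns (align_labels is a hypothetical helper, not a transformers API):

```python
def align_labels(word_ids, word_labels):
    """Spread word-level labels onto subword tokens.

    word_ids: per-token word indices from a fast tokenizer's .word_ids(),
              e.g. [None, 0, 0, 1, None] (None = special token)
    word_labels: one label id per original word.
    """
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)                  # [CLS] / [SEP] / padding
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first subword of a word
        else:
            aligned.append(-100)                  # continuation subword
        previous = word_id
    return aligned

# word 0 split into two subwords: only the first keeps its label (1 = B-PER)
print(align_labels([None, 0, 0, 1, None], [1, 0]))  # β†’ [-100, 1, -100, 0, -100]
```

This is the piece that makes NER in HF tricky: get it wrong and the loss is computed on subword fragments that have no real label.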
← Previous Page

Page 3 β€” Fine-Tuning GPT & Text Generation

πŸ“˜

Coming Next: Page 5 β€” Question Answering & Seq2Seq (T5)

Two of the most powerful NLP tasks: extractive QA (finding answers in context β€” SQuAD) and Seq2Seq (text-to-text: translation, summarization, table-to-text). Fine-tune T5 and BART for translation and summarization, understand the encoder-decoder architecture, and build production QA systems.