📖 Table of Contents – Page 1
- What Is Hugging Face? – The ecosystem that revolutionized AI
- Installation – transformers, datasets, tokenizers, accelerate
- How to Use HF – Colab, local, Inference API, Spaces, self-hosting
- Pipeline API – 1-line inference for 20+ tasks
- Pipeline: NLP Tasks – Sentiment, NER, QA, Translation, Summarization
- Pipeline: Beyond NLP – Image Classification, Object Detection, Zero-Shot
- Model Hub – 500k+ models, choosing the right one
- Auto Classes – AutoModel, AutoTokenizer, AutoConfig
- Deep Dive: Tokenization – WordPiece, BPE, SentencePiece, encoding
- From Tokenizer to Model – Full manual forward pass
- First Look: Fine-Tuning BERT – Text classification preview
- Summary & Page 2 Preview
1. What Is Hugging Face? – The Open-Source AI Revolution
Hugging Face (🤗) is a company and open-source platform that provides a complete ecosystem for modern machine learning. Imagine GitHub, but specifically for AI models: you can find, use, and share models from BERT to LLaMA, from Stable Diffusion to Whisper – all for free. Over 500,000 models and 100,000 datasets are available on their Hub.
Why is Hugging Face so important? Because it democratizes AI. Before HF, using BERT required hundreds of lines of boilerplate code and deep knowledge of model architecture. Now: pipeline("sentiment-analysis")("I love this!") – done, one line.
💡 Analogy: Hugging Face = App Store for AI
Model Hub = App Store → download ready-to-use models in 1 line of code
Datasets Hub = Data marketplace → quality datasets for training
Spaces = Demo gallery → try models directly in the browser
transformers library = SDK → unified API for 200+ model architectures
You don't need to implement BERT from scratch – just from transformers import and start working.
2. Installation – 4 Core Hugging Face Libraries
```bash
# ===========================
# Core libraries
# ===========================
pip install transformers   # models, pipelines, Auto classes
pip install datasets       # dataset loading & processing
pip install tokenizers     # fast Rust-based tokenizers
pip install accelerate     # multi-GPU, mixed precision

# Or install everything at once:
pip install "transformers[torch]" datasets accelerate

# ===========================
# Backend: PyTorch or TensorFlow
# ===========================
pip install torch          # PyTorch (RECOMMENDED – community default)
# pip install tensorflow   # TensorFlow (also supported)
# HF Transformers supports BOTH backends!
# This series uses PyTorch (90% of the HF community uses PyTorch)

# ===========================
# Optional but useful
# ===========================
pip install evaluate       # evaluation metrics
pip install peft           # LoRA, QLoRA (efficient fine-tuning)
pip install trl            # RLHF training (ChatGPT-style)
pip install bitsandbytes   # 4-bit/8-bit quantization
pip install sentencepiece  # for T5, LLaMA tokenizers

# ===========================
# Verify installation
# ===========================
python -c "import transformers; print(f'transformers {transformers.__version__}')"
python -c "import datasets; print(f'datasets {datasets.__version__}')"
python -c "import torch; print(f'PyTorch {torch.__version__}, CUDA: {torch.cuda.is_available()}')"
# transformers 4.47.x
# datasets 3.2.x
# PyTorch 2.5.x, CUDA: True
```
| Library | Purpose | Size | Required? |
|---|---|---|---|
| transformers | Models, tokenizer, pipeline, training | ~30 MB | ✅ Yes |
| datasets | Load & process datasets | ~5 MB | ✅ Yes (training) |
| tokenizers | Fast Rust tokenizer (auto-installed) | ~5 MB | Auto |
| accelerate | Multi-GPU, mixed precision | ~3 MB | ✅ Yes (training) |
| evaluate | Metrics (accuracy, F1, BLEU) | ~2 MB | Recommended |
| peft | LoRA, QLoRA efficient fine-tuning | ~3 MB | Optional |
| torch | PyTorch backend | ~2 GB | ✅ Yes (1 backend) |
💡 Google Colab: All HF libraries come pre-installed on Colab! Just run !pip install -q transformers datasets accelerate to update to the latest versions. The free T4 GPU is sufficient for fine-tuning BERT and other medium-sized models.
2b. How Do You Actually Use Hugging Face? – 6 Ways from Free to Production
Many people get confused when first encountering Hugging Face: "Where does this run? On HF's website? On my computer? In the cloud?" The answer: all of the above! Hugging Face isn't a single platform – it's an ecosystem that can be used in various ways. Here are 6 ways to use HF, from easiest to production-grade:
① Google Colab – #1 RECOMMENDATION for Learning
Google Colab is the easiest and fastest way to start using Hugging Face. You don't need to install anything on your computer – just open a browser, write Python code, and run it on a free GPU. This entire series can be followed 100% on Colab.
```python
# ===========================
# 1. Open colab.research.google.com
# 2. Runtime → Change runtime type → GPU (T4)
# 3. Run the following cell:
# ===========================

# Install/update HF libraries (pre-installed on Colab, but update them)
!pip install -q transformers datasets accelerate evaluate

# Verify GPU
import torch
print(f"GPU: {torch.cuda.get_device_name(0)}")
# GPU: Tesla T4

# Test pipeline
from transformers import pipeline
classifier = pipeline("sentiment-analysis", device=0)  # GPU!
print(classifier("Hugging Face is amazing!"))
# [{'label': 'POSITIVE', 'score': 0.9998}]

# ✅ Done! Ready to fine-tune BERT on a free GPU!
# Colab T4 = 16GB VRAM → enough for BERT, DistilBERT, RoBERTa
# Not enough for: LLaMA 7B+, Stable Diffusion (needs A100)
```
② Local Computer – For Daily Development
```bash
# ===========================
# Setup on your laptop/desktop
# ===========================

# 1. Create a virtual environment (recommended!)
python -m venv hf-env
source hf-env/bin/activate   # Linux/Mac
# hf-env\Scripts\activate    # Windows

# 2. Install PyTorch (choose based on your GPU)
# CPU only:
pip install torch torchvision torchaudio
# NVIDIA GPU (CUDA 12.x):
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# 3. Install the Hugging Face stack
pip install transformers datasets accelerate evaluate

# 4. Test
python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('Hello!'))"

# ===========================
# Where are models stored?
# ===========================
# Models are downloaded to a cache folder:
#   Linux/Mac: ~/.cache/huggingface/hub/
#   Windows:   C:\Users\<username>\.cache\huggingface\hub\
#
# BERT base:     ~420 MB
# DistilBERT:    ~250 MB
# GPT-2 small:   ~550 MB
# LLaMA 3.2 1B:  ~2.5 GB
# LLaMA 3.1 8B:  ~16 GB
#
# First download → slow
# Second time    → instant (cached!)
#
# Delete the cache: rm -rf ~/.cache/huggingface/hub/
```
③ HF Inference API – Use Models Without Downloading
Don't want to download large models to your computer? Use the Inference API – send HTTP requests to HF's servers and they run the model for you. Free for prototyping (rate-limited).
```python
import requests

# ===========================
# Method 1: Direct HTTP request (no library needed!)
# ===========================
API_URL = "https://api-inference.huggingface.co/models/distilbert-base-uncased-finetuned-sst-2-english"
headers = {"Authorization": "Bearer hf_YOUR_TOKEN_HERE"}
# Get a free token: huggingface.co/settings/tokens

response = requests.post(API_URL, headers=headers, json={"inputs": "I love this product!"})
print(response.json())
# [[{'label': 'POSITIVE', 'score': 0.9998}]]

# ===========================
# Method 2: huggingface_hub library (easier)
# ===========================
from huggingface_hub import InferenceClient

client = InferenceClient(token="hf_YOUR_TOKEN")

# Text classification
result = client.text_classification("I love this!")
print(result)
# [TextClassificationOutput(label='POSITIVE', score=0.9998)]

# Text generation
result = client.text_generation(
    "The meaning of life is",
    model="gpt2",
    max_new_tokens=50
)
print(result)

# Translation
result = client.translation("I am learning AI", model="Helsinki-NLP/opus-mt-en-id")
print(result)
# "Saya sedang belajar AI"

# ===========================
# When should you use the Inference API?
# ===========================
# ✅ Quick prototyping (no local GPU needed)
# ✅ Small demos (< 1000 requests/day)
# ✅ Testing a new model before downloading it
# ❌ Training / fine-tuning (inference only!)
# ❌ Production (rate limited, cold starts)
# ❌ Sensitive data (data is sent to HF servers)
```
④ HF Spaces – Build Free Demo Apps
Spaces = free hosting for ML demo apps. You build an app with Gradio or Streamlit, push it to HF, and get a public URL. Perfect for portfolios and sharing.
```python
import gradio as gr
from transformers import pipeline

# Load model
classifier = pipeline("sentiment-analysis")

# Define interface
def analyze(text):
    result = classifier(text)[0]
    return f"{result['label']} ({result['score']:.1%})"

# Create Gradio app
demo = gr.Interface(
    fn=analyze,
    inputs=gr.Textbox(placeholder="Type your text here..."),
    outputs="text",
    title="🤗 Sentiment Analyzer",
    description="Analyze sentiment of any English text",
)
demo.launch()

# ===========================
# Deploy to HF Spaces:
# 1. Create a repo at huggingface.co/new-space
# 2. Choose "Gradio" as the SDK
# 3. Upload app.py + requirements.txt
# 4. Automatic deploy → you get a URL: username.hf.space/sentiment
# 5. FREE for CPU! GPUs start at $0.60/hour
# ===========================
```
⑤ & ⑥ Production Deployment – Inference Endpoints & Self-Hosting
```python
# ===========================
# Option 5: HF Inference Endpoints (managed hosting)
# → huggingface.co/inference-endpoints
# ===========================
# 1. Pick a model from the Hub
# 2. Pick hardware (CPU/GPU/A100)
# 3. Pick a region (US, EU, Asia)
# 4. Deploy → get a production API URL
# 5. Auto-scaling, monitoring, HTTPS included
#
# Pricing:
#   CPU (2 vCPU):    ~$0.06/hr (~$43/month)
#   GPU T4 (16GB):   ~$0.60/hr (~$432/month)
#   GPU A10G (24GB): ~$1.30/hr (~$936/month)
#   GPU A100 (80GB): ~$4.50/hr (~$3,240/month)
#
# Best for: production APIs without managing infrastructure

# ===========================
# Option 6: Self-Hosting (Docker on your server)
# ===========================

# A. Simple: FastAPI + model
from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()
classifier = pipeline("sentiment-analysis", device=0)

@app.post("/predict")
async def predict(text: str):
    result = classifier(text)
    return result

# uvicorn app:app --host 0.0.0.0 --port 8000
# Deploy with Docker → AWS EC2, GCP VM, DigitalOcean, etc.

# B. Optimized: Text Generation Inference (TGI)
#    Docker container from HF for LLM serving
# docker run --gpus all -p 8080:80 \
#   ghcr.io/huggingface/text-generation-inference \
#   --model-id meta-llama/Llama-3.2-1B
#
# Optimized: continuous batching, flash attention, quantization
# Best for: high-throughput LLM serving

# C. vLLM (alternative to TGI)
# pip install vllm
# python -m vllm.entrypoints.openai.api_server \
#   --model meta-llama/Llama-3.2-1B --port 8000
# → OpenAI-compatible API for any HF model!
```
📌 Recommendation Based on Your Situation:
Learning / following this series → ① Google Colab (free, T4 GPU, zero setup) ✅
Daily development → ② Local + Colab for heavy training
Demo / portfolio → ④ HF Spaces (Gradio app, free public URL)
Quick prototyping → ③ Inference API (HTTP request, no download)
Production API (startup) → ⑤ Inference Endpoints (managed, auto-scale)
Production API (enterprise) → ⑥ Self-hosting Docker/Kubernetes (full control)
Important: Hugging Face Hub = "where models are stored" (like GitHub). Models are downloaded FROM the Hub TO wherever you run them (Colab, laptop, server). The Hub is NOT where code runs – code runs on YOUR device!
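The Hub-to-device flow above implies a local model cache. A minimal sketch that locates it, assuming the default path (~/.cache/huggingface) and the standard HF_HOME environment-variable override; the models--org--name folder naming is the Hub cache convention:

```python
import os
from pathlib import Path

# Default location where downloaded models are cached
# (overridable via the HF_HOME environment variable).
cache_dir = Path(os.environ.get("HF_HOME", Path.home() / ".cache" / "huggingface")) / "hub"

print(f"Model cache: {cache_dir}")

# List any models already downloaded (folders are named models--org--name)
if cache_dir.exists():
    for entry in sorted(cache_dir.glob("models--*")):
        print(f"  {entry.name}")
else:
    print("  (empty – nothing downloaded yet)")
```

Deleting this directory frees disk space; the next from_pretrained call simply re-downloads.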
| Method | Cost | GPU | Setup | Best For |
|---|---|---|---|---|
| ① Colab | Free | T4 (16GB) free | 0 min | Learning, BERT fine-tuning ✅ |
| ② Local | Electricity | Your GPU (if any) | 10 min | Daily development |
| ③ Inference API | Free (rate-limited) | HF servers | 0 min | Prototyping, small demos |
| ④ Spaces | Free (CPU) | Optional ($0.60/hr) | 5 min | Demo apps, portfolio |
| ⑤ Endpoints | $0.06-4.50/hr | T4/A10/A100 | 5 min | Production API |
| ⑥ Self-host | $5-1000+/mo | Your choice | 30-60 min | Enterprise, privacy |
📌 TL;DR for Beginners:
1. Open colab.research.google.com
2. Enable GPU: Runtime → Change runtime type → T4 GPU
3. Type: !pip install -q transformers datasets accelerate
4. Type: from transformers import pipeline
5. Done! You can now run BERT, GPT-2, Whisper, etc. on a free cloud GPU.
Models are downloaded from the Hub to the Colab server → run on Colab's T4 GPU → you get the results in your notebook. No need to install anything on your laptop.
3. Pipeline API – 1-Line Inference for 20+ Tasks
Pipeline is the highest-level API in Hugging Face. One function call does everything: download the model from the Hub, tokenize the input, run inference, and format the output. You don't even need to know which model architecture is being used.
```python
from transformers import pipeline

# ===========================
# 1. Sentiment Analysis – one line!
# ===========================
classifier = pipeline("sentiment-analysis")
# First run: downloads model (~270MB) → cached for future use

result = classifier("I absolutely love this product! Best purchase ever.")
print(result)
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Multiple texts at once (batched!)
results = classifier([
    "This movie was fantastic!",
    "Terrible experience, waste of money.",
    "It was okay, nothing special."
])
for r in results:
    print(f"  {r['label']:8s} ({r['score']:.1%})")
# POSITIVE (99.9%)
# NEGATIVE (99.8%)
# POSITIVE (63.1%)  ← uncertain → neutral-ish

# ===========================
# 2. Specify a different model
# ===========================
classifier_multi = pipeline(
    "sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment"
)
# Now supports 6 languages! (EN, DE, NL, ES, FR, IT)
result = classifier_multi("Film ini sangat bagus!")  # Indonesian!
print(result)
# [{'label': '5 stars', 'score': 0.73}]

# ===========================
# 3. GPU acceleration
# ===========================
classifier_gpu = pipeline("sentiment-analysis", device=0)  # GPU:0
# device=0     → first GPU
# device=-1    → CPU (default)
# device="mps" → Apple Silicon

# ===========================
# 4. How pipeline() works internally
# ===========================
# pipeline("sentiment-analysis") is equivalent to:
# 1. tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
# 2. model = AutoModelForSequenceClassification.from_pretrained("...")
# 3. inputs = tokenizer(text, return_tensors="pt")
# 4. outputs = model(**inputs)
# 5. predictions = softmax(outputs.logits)
# 6. label = model.config.id2label[predicted_class]
# Pipeline wraps ALL of this in one call!
```
🔍 Pipeline: What Happens Behind the Scenes?
One call to pipeline("sentiment-analysis")("text") performs 6 steps:
1. Download model from the Hugging Face Hub (first time only, then cached)
2. Tokenize input → text → subword tokens → integer IDs + attention mask
3. Forward pass → run the Transformer model (BERT/DistilBERT/etc.)
4. Post-process → logits → softmax → probabilities
5. Map to labels → index → "POSITIVE"/"NEGATIVE"
6. Format output → return a list of dicts with label and score
You'll learn ALL of these steps manually in sections 7-9!
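Steps 4 and 5 above (logits → softmax → label) can be sketched in plain Python. The logits below are made up for illustration; the id2label map mirrors the SST-2 convention:

```python
import math

def softmax(logits):
    """Convert raw model scores (logits) into probabilities that sum to 1."""
    m = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one input, shape (num_labels,) – not real model output
logits = [-3.1, 4.2]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # SST-2 style label map

probs = softmax(logits)                    # step 4: logits → probabilities
pred = probs.index(max(probs))             # step 5: argmax → label index
print(f"{id2label[pred]} ({probs[pred]:.1%})")
# POSITIVE (99.9%)
```

In the real pipeline the same math runs on a torch tensor (torch.softmax over the logits dimension) and id2label comes from model.config.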
4. Pipeline NLP Tasks – Sentiment, NER, QA, Translation, Summarization
```python
from transformers import pipeline

# ===========================
# 1. Named Entity Recognition (NER)
#    Identify entities: people, places, organizations
# ===========================
ner = pipeline("ner", grouped_entities=True)
result = ner("Joko Widodo visited Google headquarters in Mountain View, California.")
for entity in result:
    print(f"  {entity['word']:20s} → {entity['entity_group']:5s} ({entity['score']:.1%})")
# Joko Widodo    → PER (99.8%)
# Google         → ORG (99.6%)
# Mountain View  → LOC (99.9%)
# California     → LOC (99.9%)

# ===========================
# 2. Question Answering (extractive)
#    Answer a question from a given context
# ===========================
qa = pipeline("question-answering")
result = qa(
    question="What is the capital of France?",
    context="France is a country in Europe. Its capital is Paris, a city known for the Eiffel Tower."
)
print(f"Answer: {result['answer']} (score: {result['score']:.1%})")
# Answer: Paris (score: 98.7%)

# ===========================
# 3. Text Summarization
# ===========================
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
article = """
Hugging Face has raised $235 million in a Series D funding round, bringing
the company's valuation to $4.5 billion. The round was led by Salesforce
Ventures, with participation from Google, Amazon, NVIDIA, Intel, AMD, and
Qualcomm. The company plans to use the funding to expand its open-source
AI platform and hire more researchers.
"""
summary = summarizer(article, max_length=50, min_length=20)
print(summary[0]['summary_text'])
# "Hugging Face raised $235M at $4.5B valuation, led by Salesforce..."

# ===========================
# 4. Translation
# ===========================
translator = pipeline("translation_en_to_fr")
result = translator("Hugging Face is the best AI platform.")
print(result[0]['translation_text'])
# "Hugging Face est la meilleure plateforme d'IA."

# Multi-language: Helsinki-NLP models
id_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-id-en")
result = id_to_en("Saya sedang belajar kecerdasan buatan.")
print(result[0]['translation_text'])
# "I'm learning artificial intelligence."

# ===========================
# 5. Text Generation (GPT-style)
# ===========================
generator = pipeline("text-generation", model="gpt2")
result = generator(
    "Artificial intelligence will",
    max_length=50,
    num_return_sequences=2,   # generate 2 variations
    temperature=0.7,          # creativity (0=deterministic, 1=random)
    do_sample=True
)
for i, r in enumerate(result):
    print(f"  Variation {i+1}: {r['generated_text'][:80]}...")

# ===========================
# 6. Fill-Mask (BERT-style)
# ===========================
fill = pipeline("fill-mask")
results = fill("The capital of Indonesia is [MASK].")
for r in results[:3]:
    print(f"  {r['token_str']:10s} ({r['score']:.1%})")
# Jakarta    (92.3%)
# Bandung    (2.1%)
# Surabaya   (1.4%)
```
5. Pipeline Beyond NLP – Image, Audio, Zero-Shot
```python
from transformers import pipeline

# ===========================
# 1. Image Classification
# ===========================
img_classifier = pipeline("image-classification")
result = img_classifier("https://upload.wikimedia.org/wikipedia/commons/4/4d/Cat_November_2010-1a.jpg")
for r in result[:3]:
    print(f"  {r['label']:30s} ({r['score']:.1%})")
# tabby, tabby cat  (43.2%)
# Egyptian cat      (22.1%)
# tiger cat         (13.8%)

# ===========================
# 2. Object Detection
# ===========================
detector = pipeline("object-detection")
results = detector("https://example.com/street_scene.jpg")
for r in results:
    print(f"  {r['label']:10s} ({r['score']:.1%}) at {r['box']}")
# car    (97.2%) at {'xmin': 12, 'ymin': 50, ...}
# person (95.1%) at {'xmin': 200, 'ymin': 30, ...}

# ===========================
# 3. Zero-Shot Classification (NO TRAINING NEEDED!)
#    Classify text into ANY categories – even ones the model never saw!
# ===========================
zero_shot = pipeline("zero-shot-classification")
result = zero_shot(
    "Harga saham Tesla naik 15% setelah pengumuman earnings Q4.",
    candidate_labels=["finance", "sports", "technology", "politics", "health"]
)
for label, score in zip(result['labels'], result['scores']):
    print(f"  {label:12s}: {score:.1%}")
# finance    : 78.3%
# technology : 15.2%
# politics   :  3.8%
# sports     :  1.5%
# health     :  1.2%

# ===========================
# 4. Automatic Speech Recognition
# ===========================
# asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")
# result = asr("audio_file.mp3")
# print(result["text"])  # "Hello, how are you today?"
# Whisper supports 99 languages including Indonesian!

# ===========================
# 5. Text-to-Speech
# ===========================
# tts = pipeline("text-to-speech", model="microsoft/speecht5_tts")
# audio = tts("Hello, welcome to the Hugging Face tutorial!")
# # Returns an audio array that can be saved as .wav
```
🎯 Zero-Shot Classification – Superpower!
Zero-shot = classification without any training. You just provide the desired categories as text, and the model matches the input to those categories using natural language understanding. Great for: rapid prototyping, label discovery, and classification with frequently changing categories.
| Pipeline Task | Description | Default Model | Input → Output |
|---|---|---|---|
| sentiment-analysis | Positive/negative sentiment | DistilBERT SST-2 | text → label + score |
| ner | Named Entity Recognition | BERT NER | text → entities + types |
| question-answering | Answer from context | DistilBERT SQuAD | question + context → answer |
| summarization | Summarize long text | BART CNN | long text → summary |
| translation_xx_to_yy | Translation | Helsinki-NLP | language A text → language B |
| text-generation | Generate text (GPT-style) | GPT-2 | prompt → continuation |
| fill-mask | Predict missing word | BERT base | text + [MASK] → word |
| zero-shot-classification | Classify without training | BART MNLI | text + labels → scores |
| image-classification | Classify images | ViT ImageNet | image → label + score |
| object-detection | Detect objects | DETR | image → boxes + labels |
| automatic-speech-recognition | Speech to text | Whisper | audio → text |
6. Model Hub – 500k+ Models, Choosing the Right One
With 500k+ models on the Hub, how do you choose the right one? Use the filters: task (sentiment, NER, etc.), language (Indonesian, English), library (PyTorch, TensorFlow), dataset (what data the model was trained on), and license (open vs. restricted). Sort by downloads or likes to find the most popular models.
```python
from huggingface_hub import HfApi

# ===========================
# 1. Search models programmatically
# ===========================
api = HfApi()
models = api.list_models(
    filter="text-classification",
    sort="downloads",
    direction=-1,
    limit=5
)
for m in models:
    print(f"  {m.id:50s} → {m.downloads:>10,}")
# distilbert-base-uncased-finetuned-sst-2-english    → 85,432,100
# nlptown/bert-base-multilingual-uncased-sentiment   → 12,345,000
# cardiffnlp/twitter-roberta-base-sentiment-latest   →  8,765,000

# ===========================
# 2. Indonesian NLP models
# ===========================
id_models = api.list_models(
    filter="text-classification",
    search="indonesian",
    sort="downloads",
    direction=-1,
    limit=5
)
for m in id_models:
    print(f"  {m.id}")
# indobenchmark/indobert-base-p1
# indolem/indobert-base-uncased
# cahya/bert-base-indonesian-522M

# ===========================
# 3. Model naming convention
# ===========================
# Format: organization/model-name
# Examples:
#   google-bert/bert-base-uncased          → Google's BERT
#   meta-llama/Llama-3.2-1B                → Meta's LLaMA
#   openai-community/gpt2                  → OpenAI's GPT-2
#   facebook/bart-large-cnn                → Meta's BART
#   sentence-transformers/all-MiniLM-L6-v2 → sentence embeddings

# ===========================
# 4. Download a model manually (for offline use)
# ===========================
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Downloads to ~/.cache/huggingface/ (~420MB for BERT base)

# Save locally
model.save_pretrained("./my_bert")
tokenizer.save_pretrained("./my_bert")

# Load from local
model = AutoModel.from_pretrained("./my_bert")
tokenizer = AutoTokenizer.from_pretrained("./my_bert")
```
π Tips Memilih Model:
Prototyping: Mulai dengan default pipeline (biasanya DistilBERT β cepat dan bagus).
Production English: roberta-base atau deberta-v3-base (lebih akurat dari BERT).
Production Indonesian: indobert-base atau cahya/bert-base-indonesian.
Multilingual: xlm-roberta-base (100+ bahasa termasuk Indonesia).
Speed priority: DistilBERT (40% lebih cepat, 97% akurasi BERT).
LLM/Chat: meta-llama/Llama-3.2, Qwen/Qwen2.5, mistralai/Mistral.
💡 Tips for Choosing Models:
Prototyping: Start with default pipeline (usually DistilBERT β fast and good).
Production English: roberta-base or deberta-v3-base (more accurate than BERT).
Production Indonesian: indobert-base or cahya/bert-base-indonesian.
Multilingual: xlm-roberta-base (100+ languages including Indonesian).
Speed priority: DistilBERT (40% faster, 97% of BERT accuracy).
LLM/Chat: meta-llama/Llama-3.2, Qwen/Qwen2.5, mistralai/Mistral.
7. Auto Classes – AutoModel, AutoTokenizer, AutoConfig
Auto Classes are a brilliant abstraction from Hugging Face: you don't need to know whether the model is BERT, RoBERTa, GPT-2, or T5. Just use AutoModel and it automatically selects the right class. This lets you swap models without changing code.
```python
from transformers import (
    AutoModel,
    AutoTokenizer,
    AutoConfig,
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
    AutoModelForQuestionAnswering,
    AutoModelForCausalLM,
    AutoModelForSeq2SeqLM,
)

# ===========================
# 1. AutoTokenizer – universal tokenizer loader
# ===========================
# Doesn't matter if the model uses WordPiece, BPE, or SentencePiece!
tokenizer_bert = AutoTokenizer.from_pretrained("bert-base-uncased")         # WordPiece
tokenizer_gpt = AutoTokenizer.from_pretrained("gpt2")                       # BPE
tokenizer_t5 = AutoTokenizer.from_pretrained("google-t5/t5-small")          # SentencePiece
tokenizer_llama = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")  # BPE

# All have the SAME interface!
for name, tok in [("BERT", tokenizer_bert), ("GPT-2", tokenizer_gpt), ("T5", tokenizer_t5)]:
    encoded = tok("Hello world", return_tensors="pt")
    print(f"  {name:6s}: {encoded['input_ids'][0].tolist()}")
# BERT  : [101, 7592, 2088, 102] → [CLS] hello world [SEP]
# GPT-2 : [15496, 995]           → hello world (no special tokens)
# T5    : [8774, 296, 1]         → ▁Hello ▁world </s>

# ===========================
# 2. AutoModel – base model (no head)
# ===========================
model = AutoModel.from_pretrained("bert-base-uncased")
print(f"Type: {type(model).__name__}")        # BertModel
print(f"Params: {model.num_parameters():,}")  # 109,482,240
# Output: last_hidden_state (batch, seq_len, hidden_size)
# → Raw embeddings, NO classification head

# ===========================
# 3. AutoModelForSequenceClassification – with classifier head
# ===========================
model_cls = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=3  # positive, negative, neutral
)
print(f"Type: {type(model_cls).__name__}")  # BertForSequenceClassification
# Output: logits (batch, num_labels) → ready for classification!

# ===========================
# 4. Task-specific Auto Classes
# ===========================
# AutoModelForSequenceClassification → sentiment, topic classification
# AutoModelForTokenClassification    → NER, POS tagging
# AutoModelForQuestionAnswering      → extractive QA
# AutoModelForCausalLM               → text generation (GPT-style)
# AutoModelForSeq2SeqLM              → translation, summarization (T5-style)
# AutoModelForMaskedLM               → fill-mask (BERT-style)
# AutoModelForImageClassification    → image classification (ViT)
# AutoModelForObjectDetection       → object detection (DETR)

# ===========================
# 5. AutoConfig – model configuration
# ===========================
config = AutoConfig.from_pretrained("bert-base-uncased")
print(f"Hidden size: {config.hidden_size}")        # 768
print(f"Num layers: {config.num_hidden_layers}")   # 12
print(f"Num heads: {config.num_attention_heads}")  # 12
print(f"Vocab size: {config.vocab_size}")          # 30522
```
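Under the hood, each Auto class keeps a mapping from config.model_type to a concrete class and instantiates the right one at load time. The toy registry below sketches that dispatch idea in plain Python; every name here (AutoModelStub, MODEL_REGISTRY, and the stub classes) is made up for illustration and is not the real transformers internals.

```python
# Hypothetical mini registry mimicking how Auto classes dispatch on
# config.model_type → concrete class (the real mapping lives in transformers).
MODEL_REGISTRY = {}

def register(model_type):
    """Class decorator: record a model class under its model_type key."""
    def wrap(cls):
        MODEL_REGISTRY[model_type] = cls
        return cls
    return wrap

@register("bert")
class BertModelStub:
    pass

@register("gpt2")
class GPT2ModelStub:
    pass

class AutoModelStub:
    @staticmethod
    def from_config(config):
        # Look up the concrete class by the config's model_type
        return MODEL_REGISTRY[config["model_type"]]()

model = AutoModelStub.from_config({"model_type": "bert"})
print(type(model).__name__)  # BertModelStub
```

Swapping models then means changing only the config value, never the calling code, which is exactly the convenience the real Auto classes provide.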
8. Deep Dive: Tokenization – WordPiece, BPE, SentencePiece
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# ===========================
# 1. Step-by-step tokenization
# ===========================
text = "Hugging Face's tokenizers are incredibly fast!"

# Step 1: Tokenize (split into subwords)
tokens = tokenizer.tokenize(text)
print(f"Tokens: {tokens}")
# ['hugging', 'face', "'", 's', 'token', '##ize', '##rs', 'are', 'incredibly', 'fast', '!']
# Note: "tokenizers" → ["token", "##ize", "##rs"] (WordPiece subwords!)
#       "##" prefix means "continuation of previous word"

# Step 2: Convert to IDs
ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"IDs: {ids}")
# [17662, 2227, 1005, 1055, 19204, 4697, 2869, 2024, 12978, 3435, 999]

# Step 3: Add special tokens + create attention mask
encoded = tokenizer(text, return_tensors="pt")
print(f"input_ids:      {encoded['input_ids'][0].tolist()}")
print(f"attention_mask: {encoded['attention_mask'][0].tolist()}")
print(f"token_type_ids: {encoded['token_type_ids'][0].tolist()}")
# input_ids:      [101, 17662, 2227, ..., 999, 102] → [CLS] ... [SEP]
# attention_mask: [1, 1, 1, ..., 1, 1]              → all real tokens
# token_type_ids: [0, 0, 0, ..., 0, 0]              → single sentence

# ===========================
# 2. Decode back to text
# ===========================
decoded = tokenizer.decode(encoded['input_ids'][0])
print(f"Decoded: {decoded}")
# "[CLS] hugging face's tokenizers are incredibly fast! [SEP]"

decoded_skip = tokenizer.decode(encoded['input_ids'][0], skip_special_tokens=True)
print(f"Clean: {decoded_skip}")
# "hugging face's tokenizers are incredibly fast!"

# ===========================
# 3. Padding & Truncation
# ===========================
texts = ["Short text.", "This is a much longer sentence that has more words in it."]

# Without padding: different lengths → can't batch!
for t in texts:
    enc = tokenizer(t)
    print(f"  Length: {len(enc['input_ids'])}")
# Length: 4
# Length: 14 → different! Can't make a tensor

# With padding + truncation: same length → can batch!
batch = tokenizer(
    texts,
    padding=True,        # pad shorter sequences
    truncation=True,     # truncate if too long
    max_length=128,      # max sequence length
    return_tensors="pt"  # return PyTorch tensors
)
print(f"Batch shape: {batch['input_ids'].shape}")
# Batch shape: torch.Size([2, 14]) → padded to longest!
print(f"Attention mask: {batch['attention_mask'][0].tolist()}")
# [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
# 1 = real token, 0 = padding → the model IGNORES padding!

# ===========================
# 4. Special tokens per model
# ===========================
print("BERT special tokens:")
print(f"  CLS: {tokenizer.cls_token} (ID: {tokenizer.cls_token_id})")  # [CLS] = 101
print(f"  SEP: {tokenizer.sep_token} (ID: {tokenizer.sep_token_id})")  # [SEP] = 102
print(f"  PAD: {tokenizer.pad_token} (ID: {tokenizer.pad_token_id})")  # [PAD] = 0
print(f"  UNK: {tokenizer.unk_token} (ID: {tokenizer.unk_token_id})")  # [UNK] = 100
print(f"  Vocab size: {tokenizer.vocab_size}")  # 30522

# ===========================
# 5. Sentence pairs (for NLI, QA, etc.)
# ===========================
encoded_pair = tokenizer(
    "What is the capital?",             # sentence A
    "The capital of France is Paris.",  # sentence B
    return_tensors="pt"
)
print(encoded_pair['token_type_ids'][0].tolist())
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
# 0 = sentence A, 1 = sentence B
# [CLS] What is the capital ? [SEP] The capital of France is Paris . [SEP]
```
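The padding-and-attention-mask behavior shown above is simple enough to sketch without any library. This hypothetical pad_batch helper (not part of transformers) shows how the mask is derived purely from sequence lengths:

```python
def pad_batch(sequences, pad_id=0):
    """Pad variable-length ID lists to the longest one and build attention masks.

    Mirrors what tokenizer(..., padding=True) does: 1 marks a real token,
    0 marks padding the model should ignore.
    """
    max_len = max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for s in sequences:
        n_pad = max_len - len(s)
        input_ids.append(s + [pad_id] * n_pad)
        attention_mask.append([1] * len(s) + [0] * n_pad)
    return input_ids, attention_mask

ids, mask = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(ids)   # [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
print(mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The real tokenizer additionally handles truncation, tensor conversion, and per-model pad token IDs, but the core bookkeeping is exactly this.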
💡 WordPiece vs BPE vs SentencePiece:
WordPiece (BERT): splits unknown words into subwords. "tokenizers" → ["token", "##ize", "##rs"]. The ## prefix marks a continuation of the previous piece.
BPE (GPT-2, RoBERTa): Byte Pair Encoding merges the most frequent byte pairs. "lower" → ["low", "er"]. The Ġ prefix marks the start of a new word.
SentencePiece (T5, LLaMA): language-agnostic; treats all input as a raw byte sequence. ▁ marks a space/word boundary. Works for ALL languages without preprocessing.
You don't need to choose: AutoTokenizer automatically loads the right tokenizer for each model!
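The WordPiece split described above boils down to greedy longest-match-first lookup against the vocabulary. Here is a minimal pure-Python sketch of that algorithm, using a tiny toy vocab rather than BERT's real 30,522-entry one:

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first subword split, WordPiece-style.

    Non-initial pieces carry the "##" continuation prefix; if no piece
    matches at some position, the whole word becomes [UNK].
    """
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces are prefixed
            if sub in vocab:
                piece = sub
                break
            end -= 1  # shrink the candidate and try again
        if piece is None:
            return ["[UNK]"]  # nothing matched → unknown token
        tokens.append(piece)
        start = end
    return tokens

vocab = {"token", "##ize", "##rs"}
print(wordpiece("tokenizers", vocab))  # ['token', '##ize', '##rs']
```

The production tokenizer adds lowercasing, punctuation splitting, and a max-characters-per-word cutoff, but this greedy loop is the core of the subword split.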
9. From Tokenizer to Model – Full Manual Forward Pass
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# ===========================
# Step 1: Load tokenizer & model
# ===========================
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# ===========================
# Step 2: Tokenize input
# ===========================
text = "I absolutely love learning about Hugging Face!"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
print(f"Input IDs shape: {inputs['input_ids'].shape}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])}")
# ['[CLS]', 'i', 'absolutely', 'love', 'learning', 'about', 'hugging', 'face', '!', '[SEP]']

# ===========================
# Step 3: Forward pass (no gradients needed for inference!)
# ===========================
with torch.no_grad():  # disable gradient computation → faster + less memory
    outputs = model(**inputs)
print(f"Output type: {type(outputs)}")  # SequenceClassifierOutput
print(f"Logits: {outputs.logits}")
# tensor([[-4.2532, 4.5687]]) → raw scores (NOT probabilities!)

# ===========================
# Step 4: Post-process → logits to probabilities
# ===========================
probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(f"Probabilities: {probabilities}")
# tensor([[0.0001, 0.9999]]) → [NEGATIVE, POSITIVE]

# ===========================
# Step 5: Map to label
# ===========================
predicted_class = torch.argmax(probabilities, dim=-1).item()
label = model.config.id2label[predicted_class]
confidence = probabilities[0][predicted_class].item()
print(f"\n🎯 Prediction: {label} ({confidence:.1%})")
# 🎯 Prediction: POSITIVE (99.99%)

# ===========================
# Compare with pipeline (should be identical!)
# ===========================
from transformers import pipeline

pipe = pipeline("sentiment-analysis", model=model_name)
print(f"Pipeline: {pipe(text)}")
# [{'label': 'POSITIVE', 'score': 0.9999}] → identical! ✅
```
🎓 Now You Understand the Entire Flow!
Pipeline = Steps 1-5 above combined into one line. But understanding each step matters because: (1) you can customize preprocessing, (2) you can customize postprocessing, (3) you can debug issues, and (4) Fine-tuning (Pages 2-3) requires understanding tokenizer + model separately.
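Steps 4-5 (logits → softmax → label) are plain math, so they can be reproduced without torch at all. A small sketch using the example logits from the forward pass above; the postprocess helper name is ours, not a transformers API:

```python
import math

def postprocess(logits, id2label):
    """Logits → softmax probabilities → (label, confidence), as in Steps 4-5."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = probs.index(max(probs))  # argmax
    return id2label[idx], probs[idx]

label, conf = postprocess([-4.2532, 4.5687], {0: "NEGATIVE", 1: "POSITIVE"})
print(label, round(conf, 4))  # POSITIVE with ~0.9999 confidence
```

This is exactly what torch.nn.functional.softmax plus torch.argmax plus model.config.id2label do, just spelled out in scalar Python.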
10. First Look: Fine-Tuning BERT – Page 2 Preview
```python
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# ===========================
# Fine-tune BERT on IMDB – PREVIEW (Page 2 = full version)
# ===========================

# 1. Load dataset
dataset = load_dataset("imdb")
print(dataset)
# DatasetDict({'train': Dataset(25000 rows), 'test': Dataset(25000 rows)})

# 2. Load tokenizer & model
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# 3. Tokenize dataset
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

tokenized = dataset.map(tokenize, batched=True)

# 4. Training arguments
args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    eval_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    fp16=True,  # mixed precision!
)

# 5. Create Trainer & train!
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,
)
trainer.train()
# → 93%+ accuracy on IMDB in ~15 minutes on a Google Colab T4!
# Compare: BiLSTM from the TF series = 87%. BERT = 93%+. That's the power!

# Page 2 will cover: the full Trainer API, custom metrics, hyperparameter
# tuning, data collators, and pushing models to the Hub.
```
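The dataset.map(tokenize, batched=True) call above hands the function a dict of column names mapped to lists of values, and merges whatever columns it returns back into each row. A rough pure-Python sketch of that batched-map contract (a deliberate simplification; the real datasets library works on Arrow tables, not lists of dicts):

```python
def batched_map(rows, fn, batch_size=2):
    """Minimal sketch of datasets' map(batched=True): fn receives a dict of
    column → list-of-values for each batch and returns new/updated columns."""
    out = []
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        # Transpose list-of-rows into dict-of-columns, like datasets does
        columns = {k: [r[k] for r in batch] for k in batch[0]}
        new_cols = fn(columns)
        # Merge the returned columns back into each row of the batch
        for j, r in enumerate(batch):
            merged = dict(r)
            merged.update({k: v[j] for k, v in new_cols.items()})
            out.append(merged)
    return out

rows = [{"text": "good"}, {"text": "bad"}, {"text": "ok"}]
# Stand-in for the real tokenize(batch) function above
fake_tokenize = lambda b: {"input_ids": [[len(t)] for t in b["text"]]}
print(batched_map(rows, fake_tokenize))
```

Batched mapping is why the tokenize function above takes batch["text"] (a list) rather than a single string: the tokenizer processes the whole batch in one fast call.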
🎯 Preview: 93%+ IMDB Accuracy in 15 Minutes!
Compare with previous series:
• NN Series (manual NumPy): ~80% (hundreds of lines, hours of training)
• TF Series Page 5 (BiLSTM): ~87% (25 lines, 30 min training)
• TF Series Page 6 (BERT TF Hub): ~95% (more complex setup)
• Hugging Face (Trainer API): 93%+ (20 lines, 15 minutes!) 🚀
Page 2 will cover this in depth; stay tuned!
11. Page 1 Summary
| Concept | What It Is | Key Code |
|---|---|---|
| Pipeline | 1-line inference for 20+ tasks | pipeline("sentiment-analysis")(text) |
| Model Hub | 500k+ ready-to-download models | huggingface.co/models |
| AutoTokenizer | Universal tokenizer loader | AutoTokenizer.from_pretrained(name) |
| AutoModel | Universal model loader | AutoModelForXxx.from_pretrained(name) |
| Tokenization | Text → tokens → IDs → tensors | tokenizer(text, return_tensors="pt") |
| Padding/Truncation | Fixed-length batching | padding=True, truncation=True |
| Forward Pass | model(**inputs) → logits | outputs = model(**inputs) |
| Post-process | logits → softmax → label | softmax(logits) → argmax → id2label |
| Zero-Shot | Classify without training | pipeline("zero-shot-classification") |
| Trainer (preview) | Fine-tuning API | Trainer(model, args, train_dataset) |
Coming Next: Page 2 – Fine-Tuning BERT & Trainer API
Deep dive into fine-tuning! Page 2 covers: Datasets library (load, preprocess, tokenize), complete Trainer API (TrainingArguments, callbacks, logging), fine-tuning BERT/DistilBERT/RoBERTa for text classification, custom metrics (F1, precision, recall), data collator and dynamic padding, pushing models to Hugging Face Hub, and hyperparameter tuning. From IMDB sentiment to your own custom datasets!