📑 Table of Contents — Part 9
- Mixed Precision Training — FP16 + FP32: up to 2× faster, roughly half the memory
- Gradient Accumulation — large effective batches on a small GPU
- torch.compile — JIT compilation: a 1.5-2× speedup for free
- Distributed Training (DDP) — multi-GPU parallelism
- Learning Rate Scheduling — warmup, cosine, OneCycle
- Experiment Tracking — Weights & Biases (W&B)
- Summary & Preview of Part 10
⚡
1. Mixed Precision Training
FP32 → FP16: up to 2× faster with roughly half the memory, and accuracy stays virtually the same.

```python
from torch.amp import autocast, GradScaler  # torch.cuda.amp is deprecated in PyTorch 2.x

scaler = GradScaler("cuda")  # Handles FP16 gradient scaling

for images, labels in dataloader:
    optimizer.zero_grad()

    # Forward pass in FP16 (up to 2× faster!)
    with autocast(device_type="cuda"):
        output = model(images)
        loss = loss_fn(output, labels)

    # Backward pass with gradient scaling
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

# Result: training runs 1.5-2× faster
# Memory: 40-50% savings → room for larger batches
# Accuracy: virtually identical to FP32 ✅
```
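The same `autocast` mechanics can be tested without a GPU: on CPU, autocast uses bfloat16 instead of FP16. A minimal sketch (the tiny `Linear` model and shapes are illustrative assumptions, not from the tutorial):

```python
import torch

# Minimal sketch: autocast on CPU uses bfloat16, which is enough
# to see the key mixed-precision behavior without CUDA.
model = torch.nn.Linear(16, 4)
x = torch.randn(8, 16)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)              # The matmul ran in reduced precision (bfloat16)
print(model.weight.dtype)   # Master weights stay in FP32
```

Note the key property: activations are low precision inside the context, while the parameters themselves remain FP32.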
📦
2. Gradient Accumulation
Simulate batch_size=256 on a GPU that only fits batch_size=32.

```python
accum_steps = 8  # Effective batch = 32 × 8 = 256

for i, (images, labels) in enumerate(dataloader):
    with autocast(device_type="cuda"):
        loss = loss_fn(model(images), labels)
    loss = loss / accum_steps          # Normalize the loss
    scaler.scale(loss).backward()      # Accumulate gradients

    if (i + 1) % accum_steps == 0:
        scaler.step(optimizer)         # Update weights
        scaler.update()
        optimizer.zero_grad()          # Reset gradients

# The GPU only fits a batch of 32, but the effect is a batch of 256!
```
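Why does dividing by `accum_steps` give the same gradient as one big batch? Because mean losses over micro-batches, each scaled by `1/accum_steps`, sum to the mean loss over the full batch. A CPU sketch (no AMP; the toy linear model and sizes are illustrative assumptions):

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss_fn = torch.nn.MSELoss()

# One big batch of 32
model.zero_grad()
loss_fn(model(x), y).backward()
big_grad = model.weight.grad.clone()

# Four accumulated micro-batches of 8
model.zero_grad()
accum_steps = 4
for xb, yb in zip(x.chunk(accum_steps), y.chunk(accum_steps)):
    (loss_fn(model(xb), yb) / accum_steps).backward()  # Gradients add up

print(torch.allclose(big_grad, model.weight.grad, atol=1e-6))  # True
```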
🚀
3. torch.compile — Speedup Gratis
PyTorch 2.0+: one line of code, 1.5-2× faster.

```python
# PyTorch 2.0+ magic: ONE LINE!
model = torch.compile(model)

# Or pick a mode:
model = torch.compile(model, mode="reduce-overhead")  # Cuts kernel-launch overhead (good for small batches)
model = torch.compile(model, mode="max-autotune")     # Longest compile time, best throughput

# First batch: slow (compilation). After that: 1.5-2× faster!
# How it works: PyTorch fuses operators, optimizes the graph,
# and generates optimized kernels via TorchInductor/Triton
```
🖥️
4. Distributed Data Parallel (DDP)
Multi-GPU: split the data, train in parallel, sync gradients.

```python
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    model = MyModel().to(rank)
    model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, sampler=sampler, batch_size=32)

    for epoch in range(10):
        sampler.set_epoch(epoch)  # Reshuffle shards across ranks each epoch
        for images, labels in loader:
            # Regular training loop — DDP syncs gradients automatically!
            optimizer.zero_grad()
            loss = loss_fn(model(images.to(rank)), labels.to(rank))
            loss.backward()
            optimizer.step()

# Run: torchrun --nproc_per_node=4 33_ddp_training.py
# 4 GPUs → ~3.6× speedup (near-linear scaling!)
```
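The DDP wiring can also be exercised without any GPU, using the `gloo` backend in a single process (world_size=1). A sketch for testing the mechanics locally; the address/port values and toy model are arbitrary assumptions:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process "cluster" on CPU: enough to verify the DDP setup.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 2)
ddp_model = DDP(model)  # No device_ids on CPU

x = torch.randn(8, 4)
loss = ddp_model(x).sum()
loss.backward()  # This is where DDP all-reduces gradients across ranks

print(model.weight.grad.shape)  # torch.Size([2, 4])
dist.destroy_process_group()
```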
📈
5. Learning Rate Scheduling
The right LR at the right time = better convergence.

```python
import math

# 1. Cosine Annealing (the most popular)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-6
)

# 2. OneCycle (super-convergence)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.01,
    steps_per_epoch=len(loader), epochs=30
)

# 3. Warmup + Cosine (the transformer standard)
def warmup_cosine(step, warmup=1000, total=50000):
    if step < warmup:
        return step / warmup  # Linear warmup
    progress = (step - warmup) / (total - warmup)
    return 0.5 * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_cosine)
```
| Scheduler | Pattern | Best For |
|---|---|---|
| StepLR | Drop LR every N epochs | Simple, CNN training |
| CosineAnnealing | Smooth cosine decay | General purpose, popular |
| OneCycleLR | Up then down (one cycle) | Super-convergence, fast training |
| Warmup + Cosine | Warmup first, then cosine | Transformer, LLM training |
| ReduceOnPlateau | Drops when loss plateaus | Adaptive, safe choice |
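One thing the snippets above leave out is where `scheduler.step()` goes: once per optimizer step (for step-based schedules). A runnable sketch tracing the warmup + cosine curve with a dummy optimizer; the warmup/total values are shortened here just for illustration:

```python
import math
import torch

def warmup_cosine(step, warmup=100, total=1000):
    if step < warmup:
        return step / warmup  # Linear warmup
    progress = (step - warmup) / (total - warmup)
    return 0.5 * (1 + math.cos(math.pi * progress))

opt = torch.optim.SGD([torch.zeros(1, requires_grad=True)], lr=0.01)
sched = torch.optim.lr_scheduler.LambdaLR(opt, warmup_cosine)

lrs = []
for step in range(1000):
    opt.step()    # (backward() would come before this in real training)
    sched.step()  # Advance the schedule once per optimizer step
    lrs.append(opt.param_groups[0]["lr"])

print(max(lrs))       # Peaks at the base LR (0.01) right after warmup
print(lrs[-1] < 1e-6) # True: decayed to ~0 at the end
```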
📊
6. Experiment Tracking — Weights & Biases
Log, visualize, and compare every experiment.

```python
# pip install wandb
import wandb

wandb.init(project="pytorch-tutorial", config={
    "lr": 0.001, "batch_size": 64, "epochs": 30,
    "model": "ResNet-18", "optimizer": "AdamW"
})

for epoch in range(30):
    train_loss, val_acc = train_one_epoch(model)
    wandb.log({
        "train/loss": train_loss,
        "val/accuracy": val_acc,
        "lr": optimizer.param_groups[0]["lr"]
    })

wandb.finish()

# Dashboard: wandb.ai → real-time charts, compare runs
```
📝
7. Part 9 Summary
The advanced-training arsenal:

| Technique | Speedup | Effort | Key Code |
|---|---|---|---|
| Mixed Precision | 1.5-2× | 🟢 Easy | autocast() + GradScaler() |
| torch.compile | 1.5-2× | 🟢 1 line | model = torch.compile(model) |
| Gradient Accum | Larger batches | 🟢 Easy | loss /= accum_steps |
| DDP | ~N× (N GPUs) | 🟡 Medium | DDP(model, device_ids) |
| LR Scheduling | Better convergence | 🟢 Easy | CosineAnnealingLR |
| W&B Tracking | Better decisions | 🟢 Easy | wandb.log({...}) |
🏆 Best Combo (Production Recipe)
torch.compile + mixed precision + a cosine LR schedule + gradient accumulation = training 3-5× faster with no sacrifice in accuracy. This is the standard recipe at top AI labs worldwide. Add DDP if you have multiple GPUs.
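How the pieces compose in one loop can be sketched on a toy model. Assumptions: CPU + bfloat16 autocast stands in for CUDA + FP16 so the sketch runs anywhere; on a GPU you would use `autocast("cuda")` with a `GradScaler` and wrap the model with `torch.compile(model)` before training:

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 1))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)
loss_fn = torch.nn.MSELoss()
accum_steps = 4

data = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(20)]
for i, (xb, yb) in enumerate(data):
    # Mixed precision forward (bfloat16 on CPU for this sketch)
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        loss = loss_fn(model(xb), yb) / accum_steps  # Gradient accumulation
    loss.backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        scheduler.step()  # One cosine tick per *effective* batch

print(optimizer.param_groups[0]["lr"] < 1e-3)  # True: LR has decayed
```

Note that the scheduler steps with the effective batch, not with every micro-batch, so the schedule stays aligned with actual weight updates.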
FINALE: Part 10 — Capstone Project: End-to-End ML Pipeline
Combine EVERYTHING you learned in Parts 1-9 in one real project: data loading → preprocessing → model selection → training → evaluation → deployment. A complete project you can put in your portfolio!