📑 Table of Contents — Part 9
- Mixed Precision Training — FP16 + FP32: up to 2× faster, roughly half the memory
- Gradient Accumulation — large effective batches on a small GPU
- torch.compile — JIT compilation: a 1.5-2× speedup for free
- Distributed Training (DDP) — multi-GPU parallelism
- Learning Rate Scheduling — warmup, cosine, OneCycle
- Experiment Tracking — Weights & Biases (W&B)
- Summary & Preview of Part 10
⚡
1. Mixed Precision Training
FP32 → FP16: up to 2× faster with roughly half the memory, and accuracy stays virtually the same.

```python
from torch.amp import autocast, GradScaler  # torch.cuda.amp is deprecated in PyTorch 2.x

scaler = GradScaler("cuda")  # Handles FP16 gradient scaling

for images, labels in dataloader:
    optimizer.zero_grad()

    # Forward pass in FP16 (up to 2× faster!)
    with autocast(device_type="cuda"):
        output = model(images)
        loss = loss_fn(output, labels)

    # Backward pass with gradient scaling
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

# Result: training runs 1.5-2× faster
# Memory: 40-50% savings → room for larger batches
# Accuracy: virtually identical to FP32 ✅
```
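The same `autocast` mechanics can be tested without a GPU: on CPU, autocast uses bfloat16 instead of FP16. A minimal sketch (the tiny `Linear` model and shapes are illustrative assumptions, not from the tutorial):

```python
import torch

# Minimal sketch: autocast on CPU uses bfloat16, which is enough
# to see the key mixed-precision behavior without CUDA.
model = torch.nn.Linear(16, 4)
x = torch.randn(8, 16)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)              # The matmul ran in reduced precision (bfloat16)
print(model.weight.dtype)   # Master weights stay in FP32
```

Note the key property: activations are low precision inside the context, while the parameters themselves remain FP32.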
📦
2. Gradient Accumulation
Simulate batch_size=256 on a GPU that only fits batch_size=32.

```python
accum_steps = 8  # Effective batch = 32 × 8 = 256

for i, (images, labels) in enumerate(dataloader):
    with autocast(device_type="cuda"):
        loss = loss_fn(model(images), labels)
    loss = loss / accum_steps          # Normalize the loss
    scaler.scale(loss).backward()      # Accumulate gradients

    if (i + 1) % accum_steps == 0:
        scaler.step(optimizer)         # Update weights
        scaler.update()
        optimizer.zero_grad()          # Reset gradients

# The GPU only fits a batch of 32, but the effect is a batch of 256!
```
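Why does dividing by `accum_steps` give the same gradient as one big batch? Because mean losses over micro-batches, each scaled by `1/accum_steps`, sum to the mean loss over the full batch. A CPU sketch (no AMP; the toy linear model and sizes are illustrative assumptions):

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss_fn = torch.nn.MSELoss()

# One big batch of 32
model.zero_grad()
loss_fn(model(x), y).backward()
big_grad = model.weight.grad.clone()

# Four accumulated micro-batches of 8
model.zero_grad()
accum_steps = 4
for xb, yb in zip(x.chunk(accum_steps), y.chunk(accum_steps)):
    (loss_fn(model(xb), yb) / accum_steps).backward()  # Gradients add up

print(torch.allclose(big_grad, model.weight.grad, atol=1e-6))  # True
```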
🚀
3. torch.compile — Speedup Gratis
PyTorch 2.0+: one line of code, 1.5-2× faster.

```python
# PyTorch 2.0+ magic: ONE LINE!
model = torch.compile(model)

# Or pick a mode:
model = torch.compile(model, mode="reduce-overhead")  # Cuts kernel-launch overhead (good for small batches)
model = torch.compile(model, mode="max-autotune")     # Longest compile time, best throughput

# First batch: slow (compilation). After that: 1.5-2× faster!
# How it works: PyTorch fuses operators, optimizes the graph,
# and generates optimized kernels via TorchInductor/Triton
```
🖥️
4. Distributed Data Parallel (DDP)
Multi-GPU: split the data, train in parallel, sync gradients.

```python
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    model = MyModel().to(rank)
    model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, sampler=sampler, batch_size=32)

    for epoch in range(10):
        sampler.set_epoch(epoch)  # Reshuffle shards across ranks each epoch
        for images, labels in loader:
            # Regular training loop — DDP syncs gradients automatically!
            optimizer.zero_grad()
            loss = loss_fn(model(images.to(rank)), labels.to(rank))
            loss.backward()
            optimizer.step()

# Run: torchrun --nproc_per_node=4 33_ddp_training.py
# 4 GPUs → ~3.6× speedup (near-linear scaling!)
```
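The DDP wiring can also be exercised without any GPU, using the `gloo` backend in a single process (world_size=1). A sketch for testing the mechanics locally; the address/port values and toy model are arbitrary assumptions:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process "cluster" on CPU: enough to verify the DDP setup.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 2)
ddp_model = DDP(model)  # No device_ids on CPU

x = torch.randn(8, 4)
loss = ddp_model(x).sum()
loss.backward()  # This is where DDP all-reduces gradients across ranks

print(model.weight.grad.shape)  # torch.Size([2, 4])
dist.destroy_process_group()
```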
📈
5. Learning Rate Scheduling
The right LR at the right time = better convergence.

```python
import math

# 1. Cosine Annealing (the most popular)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-6
)

# 2. OneCycle (super-convergence)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.01,
    steps_per_epoch=len(loader), epochs=30
)

# 3. Warmup + Cosine (the transformer standard)
def warmup_cosine(step, warmup=1000, total=50000):
    if step < warmup:
        return step / warmup  # Linear warmup
    progress = (step - warmup) / (total - warmup)
    return 0.5 * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_cosine)
```
| Scheduler | Pattern | Best For |
|---|---|---|
| StepLR | Drop LR every N epochs | Simple, CNN training |
| CosineAnnealing | Smooth cosine decay | General purpose, popular |
| OneCycleLR | Up then down (one cycle) | Super-convergence, fast training |
| Warmup + Cosine | Warmup first, then cosine | Transformer, LLM training |
| ReduceOnPlateau | Drops when loss plateaus | Adaptive, safe choice |
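One thing the snippets above leave out is where `scheduler.step()` goes: once per optimizer step (for step-based schedules). A runnable sketch tracing the warmup + cosine curve with a dummy optimizer; the warmup/total values are shortened here just for illustration:

```python
import math
import torch

def warmup_cosine(step, warmup=100, total=1000):
    if step < warmup:
        return step / warmup  # Linear warmup
    progress = (step - warmup) / (total - warmup)
    return 0.5 * (1 + math.cos(math.pi * progress))

opt = torch.optim.SGD([torch.zeros(1, requires_grad=True)], lr=0.01)
sched = torch.optim.lr_scheduler.LambdaLR(opt, warmup_cosine)

lrs = []
for step in range(1000):
    opt.step()    # (backward() would come before this in real training)
    sched.step()  # Advance the schedule once per optimizer step
    lrs.append(opt.param_groups[0]["lr"])

print(max(lrs))       # Peaks at the base LR (0.01) right after warmup
print(lrs[-1] < 1e-6) # True: decayed to ~0 at the end
```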
📊
6. Experiment Tracking — Weights & Biases
Log, visualize, and compare every experiment.

```python
# pip install wandb
import wandb

wandb.init(project="pytorch-tutorial", config={
    "lr": 0.001, "batch_size": 64, "epochs": 30,
    "model": "ResNet-18", "optimizer": "AdamW"
})

for epoch in range(30):
    train_loss, val_acc = train_one_epoch(model)
    wandb.log({
        "train/loss": train_loss,
        "val/accuracy": val_acc,
        "lr": optimizer.param_groups[0]["lr"]
    })

wandb.finish()

# Dashboard: wandb.ai → real-time charts, compare runs
```
📝
7. Part 9 Summary
The advanced-training arsenal:

| Technique | Speedup | Effort | Key Code |
|---|---|---|---|
| Mixed Precision | 1.5-2× | 🟢 Easy | autocast() + GradScaler() |
| torch.compile | 1.5-2× | 🟢 1 line | model = torch.compile(model) |
| Gradient Accum | Larger batches | 🟢 Easy | loss /= accum_steps |
| DDP | ~N× (N GPUs) | 🟡 Medium | DDP(model, device_ids) |
| LR Scheduling | Better convergence | 🟢 Easy | CosineAnnealingLR |
| W&B Tracking | Better decisions | 🟢 Easy | wandb.log({...}) |
🏆 Best Combo (Production Recipe)
torch.compile + mixed precision + a cosine LR schedule + gradient accumulation = training 3-5× faster with no sacrifice in accuracy. This is the standard recipe at top AI labs worldwide. Add DDP if you have multiple GPUs.
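How the pieces compose in one loop can be sketched on a toy model. Assumptions: CPU + bfloat16 autocast stands in for CUDA + FP16 so the sketch runs anywhere; on a GPU you would use `autocast("cuda")` with a `GradScaler` and wrap the model with `torch.compile(model)` before training:

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 1))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)
loss_fn = torch.nn.MSELoss()
accum_steps = 4

data = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(20)]
for i, (xb, yb) in enumerate(data):
    # Mixed precision forward (bfloat16 on CPU for this sketch)
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        loss = loss_fn(model(xb), yb) / accum_steps  # Gradient accumulation
    loss.backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        scheduler.step()  # One cosine tick per *effective* batch

print(optimizer.param_groups[0]["lr"] < 1e-3)  # True: LR has decayed
```

Note that the scheduler steps with the effective batch, not with every micro-batch, so the schedule stays aligned with actual weight updates.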
FINALE: Part 10 — Capstone Project: End-to-End ML Pipeline
Combine EVERYTHING you learned in Parts 1-9 in one real project: data loading → preprocessing → model selection → training → evaluation → deployment. A complete project you can put in your portfolio!