📝 This article was originally written in Bahasa Indonesia
🔥 PyTorch Learning Series, Part 9

Advanced Training: Faster, Bigger, Better

Model working well? Now make training 2-10× faster. Part 9 teaches the techniques used by top AI labs: Mixed Precision Training, Distributed Data Parallel, Gradient Accumulation, Learning Rate Scheduling, torch.compile, and Weights & Biases experiment tracking.

📅 March 2026 ⏱ 30 min read 🏷 Mixed Precision • DDP • torch.compile • LR Schedule • W&B
📚 PyTorch Learning Series:
1 2 3 4 5 6 7 8 9 10

📑 Table of Contents — Part 9

  1. Mixed Precision Training — FP16 + FP32: 2× faster, half the memory
  2. Gradient Accumulation — Large batches on small GPUs
  3. torch.compile — JIT compilation: a free 1.5-2× speedup
  4. Distributed Training (DDP) — Multi-GPU parallelism
  5. Learning Rate Scheduling — Warmup, Cosine, OneCycle
  6. Experiment Tracking — Weights & Biases (W&B)
  7. Recap & Part 10 Preview

1. Mixed Precision Training

FP32 → FP16: 2× faster, half the memory. Accuracy stays the same.
30_mixed_precision.py
from torch.amp import autocast, GradScaler

scaler = GradScaler("cuda")  # handles FP16 gradient scaling

for images, labels in dataloader:
    optimizer.zero_grad()

    # Forward pass in FP16 (up to 2× faster!)
    with autocast(device_type="cuda"):
        output = model(images)
        loss = loss_fn(output, labels)

    # Backward pass with gradient scaling
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

# Result: training 1.5-2× faster
# Memory: 40-50% savings → larger batches fit
# Accuracy: identical to FP32! ✅
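On Ampere-class and newer GPUs, bfloat16 is often preferred over FP16: it shares FP32's exponent range, so no GradScaler is needed. A minimal, device-agnostic sketch (the tiny model and tensor shapes here are illustrative; it falls back to CPU bfloat16 autocast when no GPU is present):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(16, 4).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(8, 16, device=device)
y = torch.randint(0, 4, (8,), device=device)

optimizer.zero_grad()
# bfloat16 autocast: no GradScaler required
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    out = model(x)           # matmul runs in bfloat16
    loss = loss_fn(out, y)
loss.backward()              # gradients accumulate in FP32 (parameter dtype)
optimizer.step()

print(out.dtype)  # torch.bfloat16
```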

2. Gradient Accumulation

Simulate batch_size=256 on a GPU that only fits batch_size=32
31_gradient_accumulation.py
accum_steps = 8  # effective batch = 32 × 8 = 256

for i, (images, labels) in enumerate(dataloader):
    with autocast(device_type="cuda"):
        loss = loss_fn(model(images), labels)
    loss = loss / accum_steps        # normalize the loss
    scaler.scale(loss).backward()    # accumulate gradients

    if (i + 1) % accum_steps == 0:
        scaler.step(optimizer)       # update weights
        scaler.update()
        optimizer.zero_grad()        # reset gradients

# The GPU only fits a batch of 32, but the effect is a batch of 256!
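To see why dividing the loss by accum_steps reproduces the large-batch gradient exactly, compare the two on CPU (a toy model with made-up sizes, no AMP; `fresh_model` is our helper for identical initializations):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 4)   # the "large" batch of 8 samples
y = torch.randn(8, 1)
loss_fn = nn.MSELoss()

def fresh_model():
    torch.manual_seed(1)        # identical init for both runs
    return nn.Linear(4, 1)

# (a) one full batch of 8
m1 = fresh_model()
loss_fn(m1(x), y).backward()

# (b) 4 micro-batches of 2, each loss divided by accum_steps
m2 = fresh_model()
accum_steps = 4
for xb, yb in zip(x.chunk(accum_steps), y.chunk(accum_steps)):
    (loss_fn(m2(xb), yb) / accum_steps).backward()  # grads sum into .grad

print(torch.allclose(m1.weight.grad, m2.weight.grad, atol=1e-6))  # True
```

The division works because MSELoss averages over its micro-batch: mean-over-2 divided by 4 equals the per-sample contribution of mean-over-8, so the accumulated gradients match.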

3. torch.compile — A Free Speedup

PyTorch 2.0+: one line of code, 1.5-2× faster
32_torch_compile.py
# PyTorch 2.0+ magic: ONE LINE!
model = torch.compile(model)

# Or pick a mode:
model = torch.compile(model, mode="reduce-overhead")  # low per-step overhead (small batches)
model = torch.compile(model, mode="max-autotune")     # best throughput

# First batch: slow (compilation). After that: 1.5-2× faster!
# How it works: PyTorch fuses operators, optimizes the graph,
# and generates optimized kernels via Triton/TorchInductor

4. Distributed Data Parallel (DDP)

Multi-GPU: split the data, train in parallel, sync gradients
33_ddp_training.py
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    model = MyModel().to(rank)
    model = DDP(model, device_ids=[rank])

    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, sampler=sampler, batch_size=32)

    for epoch in range(10):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for images, labels in loader:
            # Ordinary training loop — DDP syncs gradients automatically!
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()

# Run: torchrun --nproc_per_node=4 33_ddp_training.py
# 4 GPUs → ~3.6× speedup (near-linear scaling!)
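One wrinkle the snippet glosses over: when you launch with torchrun, each process gets its rank through environment variables rather than function arguments. A hedged sketch of the usual boilerplate (the helper name `setup_from_torchrun` is ours; it falls back to single-process defaults when the variables are absent):

```python
import os
import torch
import torch.distributed as dist

def setup_from_torchrun():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process
    rank = int(os.environ.get("RANK", 0))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    if world_size > 1:
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(local_rank)  # pin this process to its GPU
    return rank, local_rank, world_size
```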

5. Learning Rate Scheduling

The right LR at the right time = better convergence
34_lr_schedule.py
import math
import torch

# 1. Cosine annealing (the most popular)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-6
)

# 2. OneCycle (super-convergence)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.01, steps_per_epoch=len(loader), epochs=30
)

# 3. Warmup + cosine (the transformer standard)
def warmup_cosine(step, warmup=1000, total=50000):
    if step < warmup:
        return step / warmup  # linear warmup
    progress = (step - warmup) / (total - warmup)
    return 0.5 * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_cosine)
Scheduler        | Pattern                     | Best for
StepLR           | Drop LR every N epochs      | Simple, CNN training
CosineAnnealing  | Smooth (cosine) decay       | General purpose, popular
OneCycleLR       | Up then down (one cycle)    | Super-convergence, fast training
Warmup + Cosine  | Warmup first, then cosine   | Transformer, LLM training
ReduceOnPlateau  | Drop when loss stagnates    | Adaptive, safe choice
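One detail the snippet above leaves out is where `scheduler.step()` goes: epoch-based schedulers like CosineAnnealingLR step once per epoch, while OneCycleLR steps once per batch. A CPU-runnable sketch with a dummy model and made-up sizes:

```python
import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=10, eta_min=1e-6
)

lrs = []
for epoch in range(10):
    # ... run one epoch of batches, calling optimizer.step() per batch ...
    scheduler.step()   # once per EPOCH for CosineAnnealingLR
    lrs.append(optimizer.param_groups[0]["lr"])
    # (OneCycleLR would instead call scheduler.step() once per BATCH)

print(f"{lrs[0]:.4f} -> {lrs[-1]:.2e}")  # decays from ~0.1 down to eta_min
```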

6. Experiment Tracking — Weights & Biases

Log, visualize, and compare every experiment
35_wandb_tracking.py
# pip install wandb
import wandb

wandb.init(project="pytorch-tutorial", config={
    "lr": 0.001, "batch_size": 64, "epochs": 30,
    "model": "ResNet-18", "optimizer": "AdamW"
})

for epoch in range(30):
    train_loss, val_acc = train_one_epoch(model)
    wandb.log({
        "train/loss": train_loss,
        "val/accuracy": val_acc,
        "lr": optimizer.param_groups[0]["lr"]
    })

wandb.finish()
# Dashboard: wandb.ai → real-time charts, compare runs

7. Part 9 Recap

The advanced training arsenal
Technique        | Speedup            | Effort      | Key code
Mixed Precision  | 1.5-2×             | 🟢 Easy     | autocast() + GradScaler()
torch.compile    | 1.5-2×             | 🟢 One line | model = torch.compile(model)
Gradient Accum   | Larger batches     | 🟢 Easy     | loss /= accum_steps
DDP              | ~N× (N GPUs)       | 🟡 Medium   | DDP(model, device_ids)
LR Scheduling    | Better convergence | 🟢 Easy     | CosineAnnealingLR
W&B Tracking     | Better decisions   | 🟢 Easy     | wandb.log({...})

🏆 The Best Combo (Production Recipe)

torch.compile + mixed precision + cosine LR schedule + gradient accumulation = training 3-5× faster without sacrificing accuracy. These techniques are standard practice at top AI labs. Add DDP if you have multiple GPUs.
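Put together, the recipe looks roughly like this. A hedged, device-agnostic sketch: the model, data, and hyperparameters are placeholders, and AMP/compile are switched off automatically on CPU so the skeleton runs anywhere:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"

model = nn.Linear(16, 4).to(device)
if use_amp:                          # compile pays off most on GPU
    model = torch.compile(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
loss_fn = nn.CrossEntropyLoss()
accum_steps = 2

# dummy batches standing in for a real DataLoader
data = [(torch.randn(4, 16), torch.randint(0, 4, (4,))) for _ in range(4)]

for epoch in range(5):
    optimizer.zero_grad()
    for i, (x, y) in enumerate(data):
        x, y = x.to(device), y.to(device)
        with torch.autocast(device_type=device, enabled=use_amp):
            loss = loss_fn(model(x), y) / accum_steps   # grad accumulation
        scaler.scale(loss).backward()
        if (i + 1) % accum_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
    scheduler.step()                 # cosine decay, once per epoch
```

On a CPU-only machine the GradScaler and autocast contexts become no-ops, so the same loop serves for local debugging and GPU training.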

Tech Review Desk — PyTorch Learning Series
Sources: pytorch.org, NVIDIA AMP docs, Weights & Biases docs, PyTorch DDP tutorial.
📧 rominur@gmail.com  •  ✈️ t.me/Jekardah_AI