📝 This article was originally written in Bahasa Indonesia
🔥 PyTorch Learning Series, Part 6

Deployment: From Model to Production

A great model that cannot be deployed is a useless model. Part 6 shows how to take a PyTorch model from a Jupyter notebook to production: TorchScript, ONNX export (~3× faster inference), FastAPI serving, Docker containerization, and optimization with quantization.

📅 March 2026 ⏱ 30-minute read 🏷 Deployment • ONNX • FastAPI • Docker • Quantization
📚 PyTorch Learning Series: Parts 1 · 2 · 3 · 4 · 5 · 6 · 7 · 8 · 9 · 10

📑 Table of Contents — Part 6

  1. The Gap: Research → Production
  2. TorchScript — Export Without a Python Dependency
  3. ONNX Export — Universal Format, 3× Latency Reduction
  4. FastAPI Inference Server — a REST API for Your Model
  5. Docker Containerization — Package & Deploy Anywhere
  6. Quantization & Pruning — Smaller, Faster Models
  7. Deployment Checklist
  8. Summary & Part 7 Preview

1. The Gap: Research → Production

77% of ML models never make it to production. Here is how to close that gap.

🚀 Deployment Pipeline — From Notebook to Production

① Train — model.pt (PyTorch, Jupyter)
② Export — model.onnx (TorchScript / ONNX)
③ Serve — FastAPI (REST API + ONNX Runtime)
④ Containerize — Docker (Dockerfile + compose)
⑤ PRODUCTION — AWS / GCP / Azure (monitoring + auto-scaling)

2. TorchScript — Export Without Python

Serialize the model so it can run without a Python interpreter.
21_torchscript_export.py
import torch

# Load the trained model
model = MyModel()
model.load_state_dict(torch.load("model_weights.pt"))
model.eval()

# Method 1: Tracing (recommended for models without control flow)
example_input = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(model, example_input)
traced.save("model_traced.pt")

# Method 2: Scripting (for models with if/for control flow)
scripted = torch.jit.script(model)
scripted.save("model_scripted.pt")

# Load in production (no class definition needed!)
loaded = torch.jit.load("model_traced.pt")
output = loaded(example_input)
# ✅ Runs from C++ and on mobile, without Python!
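Why "recommended for models without control flow"? Tracing records only the operations executed for the example input, so a data-dependent `if` gets baked into one branch. A minimal sketch (the `Gate` module is a made-up example, not from the article):

```python
import torch
import torch.nn as nn

class Gate(nn.Module):
    def forward(self, x):
        # data-dependent control flow: tracing cannot capture this
        if x.sum() > 0:
            return x + 1
        return x - 1

model = Gate().eval()
pos = torch.ones(3)
neg = -torch.ones(3)

traced = torch.jit.trace(model, pos)  # records only the branch taken for `pos`
scripted = torch.jit.script(model)    # compiles both branches

print(traced(neg))    # tensor([0., 0., 0.])  ← wrong: the "+1" branch was baked in
print(scripted(neg))  # tensor([-2., -2., -2.])  ← correct
```

This is why the article suggests `torch.jit.script` whenever the forward pass branches on tensor values.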

3. ONNX Export — 3× Faster Inference

Open Neural Network Exchange: a universal model format for every platform.
22_onnx_export.py
import torch
import onnxruntime as ort

# ===== EXPORT to ONNX =====
model.eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}},  # dynamic batch size
    dynamo=True,  # recommended exporter on PyTorch 2.5+
)

# ===== INFERENCE with ONNX Runtime =====
session = ort.InferenceSession("model.onnx")
result = session.run(None, {"input": dummy.numpy()})

# ✅ Up to 3× faster than PyTorch eager mode!
# ✅ Deploys to: CPU, GPU, TensorRT, OpenVINO, mobile

📊 PyTorch Eager

Latency: ~45 ms. Size: 44.7 MB. Requires Python + PyTorch. Flexible, but slow for production.

⚡ ONNX Runtime

Latency: ~15 ms (3× faster). Size: 44.7 MB. No Python dependency. Automatic graph optimization.


4. FastAPI Inference Server

REST API: send an image, receive a prediction. Production-ready.
23_fastapi_server.py — Production API
# pip install fastapi uvicorn onnxruntime pillow
import io

import numpy as np
import onnxruntime as ort
from fastapi import FastAPI, UploadFile
from PIL import Image

app = FastAPI(title="Image Classifier API")

# Load the ONNX model ONCE at startup
session = ort.InferenceSession("model.onnx")
CLASSES = ["cat", "dog", "bird", "fish", "horse"]

# ImageNet normalization constants, reshaped for channel-wise broadcasting
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32).reshape(3, 1, 1)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(3, 1, 1)

def preprocess(image_bytes: bytes) -> np.ndarray:
    img = Image.open(io.BytesIO(image_bytes)).convert("RGB").resize((224, 224))
    arr = np.asarray(img).transpose(2, 0, 1).astype(np.float32) / 255.0
    arr = (arr - MEAN) / STD
    return arr[np.newaxis]  # add batch dimension → (1, 3, 224, 224)

@app.post("/predict")
async def predict(file: UploadFile):
    data = await file.read()
    tensor = preprocess(data)
    logits = session.run(None, {"input": tensor})[0]
    logits = logits - logits.max()  # numerically stable softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    idx = int(probs.argmax())
    return {"class": CLASSES[idx], "confidence": float(probs[0][idx])}

# Run:  uvicorn 23_fastapi_server:app --host 0.0.0.0 --port 8000
# Test: curl -X POST -F "file=@cat.jpg" http://localhost:8000/predict
#       → {"class": "cat", "confidence": 0.9847}
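The `/predict` handler converts raw logits to probabilities with a softmax. Subtracting the maximum logit before exponentiating matters: for large logits, a naive `np.exp` overflows to `inf` and yields `nan` probabilities. A standalone sketch of the stable version:

```python
import numpy as np

def softmax(logits):
    # subtract the row max before exponentiating so exp() cannot overflow
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# logits this large overflow a naive exp(); the stable version is fine
logits = np.array([[1000.0, 1001.0, 999.0]])
probs = softmax(logits)
print(probs.sum())          # ≈ 1.0
print(int(probs.argmax()))  # 1
```

Shifting by the max leaves the result unchanged mathematically (the shift cancels in the ratio) while keeping every exponent ≤ 0.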

5. Docker Containerization

Package everything → deploy anywhere
Dockerfile
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY model.onnx .
COPY 23_fastapi_server.py main.py

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

# Build & run:
#   docker build -t ml-api .
#   docker run -p 8000:8000 ml-api
# Image size: ~350 MB (CPU-only, no GPU)
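The Dockerfile copies a `requirements.txt` that the article never shows. A minimal sketch matching the server's imports (versions unpinned here; pin exact versions in a real deployment for reproducible builds):

```text
fastapi
uvicorn
onnxruntime
numpy
pillow
```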

6. Quantization — Smaller & Faster Models

Float32 → Int8: 4× smaller models, 2-4× faster inference.
24_quantization.py
import torch
import torch.quantization

# Dynamic quantization (the easiest approach, CPU only)
quantized = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear, torch.nn.LSTM},  # layer types to quantize
    dtype=torch.qint8,  # Float32 → Int8
)

# Compare sizes
torch.save(model.state_dict(), "original.pt")       # 44.7 MB
torch.save(quantized.state_dict(), "quantized.pt")  # 11.2 MB ← 4× smaller!

# Benchmark
# Original:  45 ms/inference, 44.7 MB
# Quantized: 18 ms/inference, 11.2 MB ← 2.5× faster, 4× smaller!
# Accuracy drop: ~0.5-1% (barely noticeable)
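To see what "Float32 → Int8" means numerically, here is a numpy sketch of affine (scale + zero-point) quantization. This is an illustration of the arithmetic, not PyTorch's actual kernel implementation:

```python
import numpy as np

def quantize_int8(x):
    # affine quantization: map [x.min(), x.max()] onto the int8 range [-128, 127]
    scale = (x.max() - x.min()) / 255.0
    zero_point = np.round(-128.0 - x.min() / scale)
    q = np.clip(np.round(x / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # recover an approximation of the original floats
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(64).astype(np.float32)  # pretend weight tensor

q, scale, zp = quantize_int8(w)
w_hat = dequantize(q, scale, zp)

print(w.nbytes, "→", q.nbytes, "bytes")  # 256 → 64 bytes: 4× smaller
print(float(np.abs(w - w_hat).max()))    # reconstruction error, bounded by ~scale
```

Each int8 value costs 1 byte instead of 4, which is exactly where the 4× size reduction in the table below comes from; the small rounding error is the source of the ~0.5-1% accuracy drop.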
| Technique | Size | Speed | Accuracy drop | Difficulty |
|---|---|---|---|---|
| Dynamic quantization | 4× smaller | 2-4× faster | ~0.5% | 🟢 Easy |
| Static quantization | 4× smaller | 3-5× faster | ~0.3% | 🟡 Medium |
| Pruning | 2-10× smaller | 1.5-3× faster | ~1% | 🟡 Medium |
| ONNX + quantization | 4× smaller | 5-8× faster | ~0.5% | 🟢 Easy |
| Knowledge distillation | 10-50× smaller | 10× faster | ~2% | 🔴 Hard |

7. Deployment Checklist

10 steps before going to production
| # | Step | Tool |
|---|---|---|
| 1 | model.eval() + torch.no_grad() | PyTorch |
| 2 | Export to ONNX or TorchScript | torch.onnx.export |
| 3 | Validate: ONNX output ≈ PyTorch output | np.allclose() |
| 4 | Quantize for speed/size | quantize_dynamic |
| 5 | Build a FastAPI inference server | FastAPI + Uvicorn |
| 6 | Add input validation & error handling | Pydantic |
| 7 | Containerize with Docker | Dockerfile |
| 8 | Load test (p50, p95, p99 latency) | Locust / wrk |
| 9 | Set up monitoring & logging | Prometheus + Grafana |
| 10 | Deploy to the cloud + auto-scaling | AWS / GCP / Azure |
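Step 8 asks for p50/p95/p99 latency. Tools like Locust report these for you, but the computation itself is simple; a sketch, where the dummy workload stands in for a real inference request:

```python
import time
import numpy as np

def benchmark(fn, n=200):
    """Call `fn` n times and return (p50, p95, p99) latency in milliseconds."""
    latencies = []
    for _ in range(n):
        t0 = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - t0) * 1000.0)
    return np.percentile(latencies, [50, 95, 99])

# dummy workload standing in for one inference call
p50, p95, p99 = benchmark(lambda: sum(range(10_000)))
print(f"p50={p50:.3f} ms  p95={p95:.3f} ms  p99={p99:.3f} ms")
```

Tail percentiles (p95/p99) matter more than the mean in production: a few slow requests dominate user experience and trigger timeouts.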

8. Part 6 Summary

Deployment essentials
| Concept | What it is | Key code |
|---|---|---|
| TorchScript | Serialize a model without Python | torch.jit.trace(model, input) |
| ONNX Export | Universal format, 3× faster | torch.onnx.export(model, ...) |
| ONNX Runtime | Optimized inference engine | ort.InferenceSession("model.onnx") |
| FastAPI | REST API server | @app.post("/predict") |
| Docker | Container → deploy anywhere | docker build -t ml-api . |
| Quantization | Float32 → Int8: 4× smaller | quantize_dynamic(model, ...) |
Tech Review Desk — PyTorch Learning Series
Sources: pytorch.org, onnxruntime.ai, fastapi.tiangolo.com, PyImageSearch, Markaicode, DailyDoseOfDS.
📧 rominur@gmail.com  •  ✈️ t.me/Jekardah_AI