📑 Table of Contents — Part 6
- The Gap: Research → Production
- TorchScript — Export without a Python dependency
- ONNX Export — Universal format, 3× latency reduction
- FastAPI Inference Server — A REST API for the model
- Docker Containerization — Package & deploy anywhere
- Quantization & Pruning — Smaller, faster models
- Deployment Checklist
- Summary & Part 7 Preview
🚀
1. The Gap: Research → Production
77% of ML models never make it to production. This part covers how to close that gap: the deployment pipeline, from notebook to production.
📦
2. TorchScript — Export Without Python
Serialize the model so it can run without a Python interpreter.

```python
import torch

# Load the trained model
model = MyModel()
model.load_state_dict(torch.load("model_weights.pt"))
model.eval()

# Method 1: Tracing (recommended for models without control flow)
example_input = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(model, example_input)
traced.save("model_traced.pt")

# Method 2: Scripting (for models with if/for)
scripted = torch.jit.script(model)
scripted.save("model_scripted.pt")

# Load in production (no class definition needed!)
loaded = torch.jit.load("model_traced.pt")
output = loaded(example_input)
# ✅ Runs in C++, on mobile, without Python!
```
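The tracing-vs-scripting distinction matters whenever the forward pass branches on the data. A minimal sketch (the `Gate` module here is a made-up example) showing how tracing freezes the branch taken at trace time, while scripting preserves it:

```python
import torch
import torch.nn as nn

class Gate(nn.Module):
    def forward(self, x):
        # data-dependent control flow
        if x.sum() > 0:
            return x * 2
        return x * -1

model = Gate()
pos, neg = torch.ones(3), -torch.ones(3)

traced = torch.jit.trace(model, pos)  # records only the branch taken for `pos`
scripted = torch.jit.script(model)    # compiles both branches

out_traced = traced(neg)      # tensor([-2., -2., -2.]): the `if` was baked in
out_scripted = scripted(neg)  # tensor([1., 1., 1.]): branch re-evaluated
```

This is why the rule of thumb above says tracing for straight-line models, scripting for models with `if`/`for`.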
⚡
3. ONNX Export — 3× Faster Inference
Open Neural Network Exchange: a universal format for every platform.

```python
import torch
import onnxruntime as ort

# ===== EXPORT to ONNX =====
model.eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}},  # dynamic batch size
    dynamo=True  # recommended exporter on PyTorch 2.5+
)

# ===== INFERENCE with ONNX Runtime =====
session = ort.InferenceSession("model.onnx")
result = session.run(
    None, {"input": dummy.numpy()}
)
# ✅ Up to 3× faster than PyTorch eager mode!
# ✅ Deploys to: CPU, GPU, TensorRT, OpenVINO, mobile
```
| Runtime | Latency | Size | Notes |
|---|---|---|---|
| 📊 PyTorch Eager | ~45 ms | 44.7 MB | Needs Python + PyTorch; flexible but slow for production |
| ⚡ ONNX Runtime | ~15 ms (3× faster) | 44.7 MB | No Python dependency; automatic graph optimization |
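Latency numbers like the ones above are hardware-dependent, so measure on your own target machine. A small timing helper (the name `bench_ms` is my own) that works for any callable, eager or ONNX:

```python
import time

def bench_ms(fn, warmup=5, iters=50):
    """Average wall-clock milliseconds per call, after a warmup."""
    for _ in range(warmup):
        fn()                      # let caches/JITs settle before timing
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) * 1000 / iters

# Usage (assuming `model`, `dummy`, and `session` from the ONNX example):
#   eager_ms = bench_ms(lambda: model(dummy))
#   onnx_ms  = bench_ms(lambda: session.run(None, {"input": dummy.numpy()}))
elapsed = bench_ms(lambda: sum(range(1000)))
```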
🌐
4. FastAPI Inference Server
REST API: send an image, get a prediction back. Production-ready.

```python
# pip install fastapi uvicorn onnxruntime pillow
from fastapi import FastAPI, UploadFile
import onnxruntime as ort
import numpy as np
from PIL import Image
import io

app = FastAPI(title="Image Classifier API")

# Load the ONNX model ONCE at startup
session = ort.InferenceSession("model.onnx")
CLASSES = ["cat", "dog", "bird", "fish", "horse"]

MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32).reshape(3, 1, 1)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(3, 1, 1)

def preprocess(image_bytes):
    # convert("RGB") guards against grayscale/RGBA uploads
    img = Image.open(io.BytesIO(image_bytes)).convert("RGB").resize((224, 224))
    arr = np.array(img).transpose(2, 0, 1).astype(np.float32) / 255.0
    arr = (arr - MEAN) / STD  # (3,1,1) shapes broadcast per channel
    return arr[np.newaxis]    # add batch dimension → (1, 3, 224, 224)

@app.post("/predict")
async def predict(file: UploadFile):
    data = await file.read()
    tensor = preprocess(data)
    logits = session.run(None, {"input": tensor})[0]
    probs = np.exp(logits) / np.exp(logits).sum()
    idx = int(probs[0].argmax())
    return {"class": CLASSES[idx], "confidence": float(probs[0][idx])}

# Run:  uvicorn 23_fastapi_server:app --host 0.0.0.0 --port 8000
# Test: curl -X POST -F "file=@cat.jpg" http://localhost:8000/predict
# → {"class": "cat", "confidence": 0.9847}
```
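The softmax in the handler exponentiates raw logits, which can overflow to `inf` for large values. A numerically stable drop-in variant (subtract the row max first; the function name is my own):

```python
import numpy as np

def stable_softmax(logits: np.ndarray) -> np.ndarray:
    """Softmax over the last axis; max-subtraction avoids exp() overflow."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

big = np.array([[1000.0, 0.0]])  # naive exp(1000) would overflow to inf
probs = stable_softmax(big)
# rows still sum to 1 and stay finite
```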
🐳
5. Docker Containerization
Package everything, deploy anywhere.

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY model.onnx .
COPY 23_fastapi_server.py main.py
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

# Build & run:
#   docker build -t ml-api .
#   docker run -p 8000:8000 ml-api
# Image size: ~350 MB (CPU-only, no GPU libraries)
```
📐
6. Quantization — Smaller, Faster Models
Float32 → Int8: 4× smaller on disk, 2-4× faster inference.

```python
import torch.quantization

# Dynamic quantization (the easiest method, CPU only)
quantized = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear, torch.nn.LSTM},  # layer types to quantize
    dtype=torch.qint8                  # Float32 → Int8
)

# Compare sizes
torch.save(model.state_dict(), "original.pt")       # 44.7 MB
torch.save(quantized.state_dict(), "quantized.pt")  # 11.2 MB ← 4× smaller!

# Benchmark
# Original:  45 ms/inference, 44.7 MB
# Quantized: 18 ms/inference, 11.2 MB ← 2.5× faster, 4× smaller!
# Accuracy drop: ~0.5-1% (barely noticeable)
```
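The 44.7 MB → 11.2 MB numbers above come from the author's model; the same comparison can be reproduced on any module with `Linear` layers. A sketch using a small throwaway MLP (layer sizes are arbitrary):

```python
import os
import torch

mlp = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
)
q = torch.quantization.quantize_dynamic(mlp, {torch.nn.Linear}, dtype=torch.qint8)

def size_mb(module, path):
    """Serialize the state_dict, report its size in MB, clean up."""
    torch.save(module.state_dict(), path)
    mb = os.path.getsize(path) / 1e6
    os.remove(path)
    return mb

fp32_mb = size_mb(mlp, "tmp_fp32.pt")
int8_mb = size_mb(q, "tmp_int8.pt")
print(f"fp32: {fp32_mb:.2f} MB, int8: {int8_mb:.2f} MB")
```

On this toy MLP the int8 checkpoint comes out roughly a quarter of the float32 one, matching the ~4× claim.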
| Technique | Size | Speed | Accuracy Drop | Difficulty |
|---|---|---|---|---|
| Dynamic Quantization | 4× smaller | 2-4× faster | ~0.5% | 🟢 Easy |
| Static Quantization | 4× smaller | 3-5× faster | ~0.3% | 🟡 Medium |
| Pruning | 2-10× smaller | 1.5-3× faster | ~1% | 🟡 Medium |
| ONNX + Quantization | 4× smaller | 5-8× faster | ~0.5% | 🟢 Easy |
| Knowledge Distillation | 10-50× smaller | 10× faster | ~2% | 🔴 Hard |
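Knowledge distillation, the last row, trains a small student to match a large teacher's softened outputs. A sketch of the standard distillation loss (temperature `T` and mixing weight `alpha` are the usual hyperparameters; the exact values vary by task):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # soft term: KL divergence between temperature-softened distributions
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients keep their original magnitude
    # hard term: ordinary cross-entropy against the true labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# toy batch: 8 samples, 5 classes
loss = distillation_loss(torch.randn(8, 5), torch.randn(8, 5),
                         torch.randint(0, 5, (8,)))
```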
✅
7. Deployment Checklist
10 steps before production:

| # | Step | Tool |
|---|---|---|
| 1 | model.eval() + torch.no_grad() | PyTorch |
| 2 | Export to ONNX or TorchScript | torch.onnx.export |
| 3 | Validate: ONNX output ≈ PyTorch output | np.allclose() |
| 4 | Quantize for speed/size | quantize_dynamic |
| 5 | Build a FastAPI inference server | FastAPI + Uvicorn |
| 6 | Add input validation & error handling | Pydantic |
| 7 | Containerize with Docker | Dockerfile |
| 8 | Load test (p50, p95, p99 latency) | Locust / wrk |
| 9 | Set up monitoring & logging | Prometheus + Grafana |
| 10 | Deploy to the cloud + auto-scaling | AWS / GCP / Azure |
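Step 6 can be as small as a Pydantic model for the response payload, which also documents the schema in FastAPI's auto-generated OpenAPI docs. A sketch (the model and field names here are my own, mirroring the `/predict` response):

```python
from pydantic import BaseModel, Field, ValidationError

class Prediction(BaseModel):
    # mirrors the /predict response; confidence must be a valid probability
    predicted_class: str
    confidence: float = Field(ge=0.0, le=1.0)

ok = Prediction(predicted_class="cat", confidence=0.98)

rejected = False
try:
    Prediction(predicted_class="cat", confidence=1.7)  # out of [0, 1] range
except ValidationError:
    rejected = True
```

Passing `response_model=Prediction` to the route decorator makes FastAPI enforce this schema on every response.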
📝
8. Part 6 Summary
Deployment essentials:

| Concept | What It Is | Key Code |
|---|---|---|
| TorchScript | Serialize a model without Python | torch.jit.trace(model, input) |
| ONNX Export | Universal format, 3× faster | torch.onnx.export(model, ...) |
| ONNX Runtime | Optimized inference engine | ort.InferenceSession("model.onnx") |
| FastAPI | REST API server | @app.post("/predict") |
| Docker | Container → deploy anywhere | docker build -t ml-api . |
| Quantization | Float32 → Int8: 4× smaller | quantize_dynamic(model, ...) |

Next: Part 7 — Generative AI: GANs & Autoencoders
From classifying to creating! Learn to build models that generate new images: Variational Autoencoders (VAE), DCGAN, and generating faces, digits, and art from noise.