📋 Table of Contents – Page 9
- Deployment Overview – From notebook to production
- SavedModel Format – Standard export for all platforms
- TF Serving – REST API & gRPC production server
- Docker Deployment – Containers for consistency
- TFLite – Mobile & edge: quantization, optimization
- TF.js – Models in the browser: convert & load
- Model Versioning – A/B testing & safe rollback
- Batch Prediction – High-throughput offline inference
- Production Monitoring – Data drift, latency, accuracy
- Deployment Checklist – Before going to production
- Summary & Page 10 Preview
1. Deployment Overview – From Notebook to the Real World
You can now train accurate models (Pages 1-8). But a model in a Jupyter notebook can't serve users. Deployment means making your model accept input and return predictions reliably, scalably, and in real time. TensorFlow has one of the most complete deployment ecosystems: a single model can be deployed to servers, phones, browsers, and edge devices.
2. SavedModel Format – Standard Export
```python
import tensorflow as tf
from tensorflow import keras

# ===========================
# 1. Save – SavedModel format (RECOMMENDED for deployment)
# ===========================
model.save("saved_model/my_classifier")
# Creates directory structure:
# saved_model/my_classifier/
# ├── saved_model.pb   ← computation graph
# ├── fingerprint.pb   ← integrity check
# └── variables/
#     ├── variables.data-00000-of-00001   ← weights
#     └── variables.index

# ===========================
# 2. Save – Keras native format (.keras)
# ===========================
model.save("my_model.keras")  # single file, includes architecture
# Good for: development, sharing models with Keras users
# Bad for: TF Serving (needs SavedModel format)

# ===========================
# 3. Save – Weights only
# ===========================
model.save_weights("weights/my_weights.weights.h5")
# Only weights, no architecture. Must rebuild the model first when loading.
# Good for: checkpointing during training, transfer learning

# ===========================
# 4. Load models
# ===========================
# SavedModel
loaded_sm = tf.keras.models.load_model("saved_model/my_classifier")
predictions = loaded_sm.predict(X_test[:5])

# Keras format
loaded_keras = tf.keras.models.load_model("my_model.keras")

# Weights only (must have an identical architecture!)
new_model = build_model()  # same architecture
new_model.load_weights("weights/my_weights.weights.h5")

# ===========================
# 5. Inspect SavedModel with the CLI
# ===========================
# saved_model_cli show --dir saved_model/my_classifier --all
# Shows: input/output signatures, shapes, dtypes
# This is what TF Serving uses to know the API!

# ===========================
# 6. Add custom serving signature
# ===========================
# For models with custom preprocessing:
class ServableModel(tf.Module):
    def __init__(self, model):
        self.model = model

    @tf.function(input_signature=[
        tf.TensorSpec(shape=[None, 224, 224, 3], dtype=tf.float32)])
    def serve(self, images):
        # Preprocessing included in serving!
        images = images / 255.0
        predictions = self.model(images, training=False)
        return {"predictions": predictions}

servable = ServableModel(model)
tf.saved_model.save(servable, "saved_model/servable",
                    signatures={"serving_default": servable.serve})
```
📌 SavedModel vs .keras vs .h5 – When to Use What?
SavedModel/ (directory): Deploy to TF Serving, TFLite, TF.js. Production standard.
.keras (single file): Development, sharing, prototyping. Keras standard.
.weights.h5 (weights only): Checkpointing, transfer learning. Needs an identical architecture.
Rule: For production – always SavedModel. For development – .keras.
3. TF Serving – Production REST & gRPC Server
TF Serving is a high-performance C++ server for serving TensorFlow models. It is designed for production: automatic request batching, GPU support, model hot-swapping (updating models without downtime), and built-in monitoring. Google uses it to serve billions of predictions per day.
```python
import tensorflow as tf
import requests
import json
import numpy as np

# ===========================
# 1. Save model with a version number
# ===========================
model.save("saved_model/my_classifier/1")     # version 1
# Later:
model_v2.save("saved_model/my_classifier/2")  # version 2
# TF Serving auto-detects and serves the LATEST version!

# Directory structure for TF Serving:
# saved_model/my_classifier/
# ├── 1/                 ← version 1
# │   ├── saved_model.pb
# │   └── variables/
# └── 2/                 ← version 2 (latest, auto-served)
#     ├── saved_model.pb
#     └── variables/

# ===========================
# 2. REST API client
# ===========================
# TF Serving runs on port 8501 (REST) and 8500 (gRPC)

# Prepare input data
test_images = X_test[:5].tolist()  # must be JSON-serializable

# Send REST request
url = "http://localhost:8501/v1/models/my_classifier:predict"
payload = json.dumps({"instances": test_images})
headers = {"Content-Type": "application/json"}

response = requests.post(url, data=payload, headers=headers)
result = response.json()

predictions = np.array(result["predictions"])
predicted_classes = np.argmax(predictions, axis=1)
print(f"Predicted: {predicted_classes}")  # Predicted: [7 2 1 0 4]

# Request a specific version:
# url = "http://localhost:8501/v1/models/my_classifier/versions/1:predict"

# ===========================
# 3. gRPC client (faster for production!)
# ===========================
# pip install tensorflow-serving-api
import grpc
# from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc
#
# channel = grpc.insecure_channel('localhost:8500')
# stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
#
# request = predict_pb2.PredictRequest()
# request.model_spec.name = 'my_classifier'
# request.inputs['input_1'].CopyFrom(
#     tf.make_tensor_proto(test_images, dtype=tf.float32))
#
# response = stub.Predict(request, timeout=10.0)
# predictions = tf.make_ndarray(response.outputs['dense_1'])

# gRPC vs REST:
#   gRPC: ~2-5× faster (binary protocol, no JSON serialization)
#   REST: easier to debug, works with curl, browser-friendly
# Production recommendation: gRPC for internal services, REST for external APIs
```
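The automatic request batching mentioned above is off by default; it is enabled with TF Serving's `--enable_batching` flag plus a batching parameters file. A hedged sketch of such a config (the numeric values here are illustrative placeholders to tune for your model, not recommendations):

```
# batching.conf – passed to TF Serving at startup, e.g.:
#   docker run ... tensorflow/serving \
#     --enable_batching \
#     --batching_parameters_file=/models/batching.conf
max_batch_size { value: 32 }          # merge up to 32 requests into one forward pass
batch_timeout_micros { value: 5000 }  # wait at most 5 ms to fill a batch
max_enqueued_batches { value: 100 }   # backpressure: queue limit before rejecting
num_batch_threads { value: 4 }        # threads processing batches
```

Batching trades a few milliseconds of queueing latency for much higher GPU throughput, so it mainly pays off under concurrent load.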
4. Docker Deployment – Containers for Consistency
```shell
# ===========================
# 1. Pull the TF Serving Docker image
# ===========================
docker pull tensorflow/serving             # CPU version
docker pull tensorflow/serving:latest-gpu  # GPU version (needs nvidia-docker)

# ===========================
# 2. Run the TF Serving container
# ===========================
docker run -d --name tf_serving \
  -p 8501:8501 -p 8500:8500 \
  --mount type=bind,source=$(pwd)/saved_model/my_classifier,target=/models/my_classifier \
  -e MODEL_NAME=my_classifier \
  tensorflow/serving

# ===========================
# 3. Test with curl
# ===========================
curl -X POST http://localhost:8501/v1/models/my_classifier:predict \
  -H "Content-Type: application/json" \
  -d '{"instances": [[0.1, 0.2, 0.3, 0.4]]}'

# Check model status
curl http://localhost:8501/v1/models/my_classifier

# ===========================
# 4. Custom Dockerfile (with the model baked in)
# ===========================
# FROM tensorflow/serving
# COPY saved_model/my_classifier /models/my_classifier
# ENV MODEL_NAME=my_classifier
# EXPOSE 8501 8500
#
# docker build -t my-ml-service .
# docker run -p 8501:8501 my-ml-service

# ===========================
# 5. Docker Compose (multiple services)
# ===========================
# version: '3'
# services:
#   tf-serving:
#     image: tensorflow/serving
#     ports: ["8501:8501", "8500:8500"]
#     volumes: ["./saved_model:/models"]
#     environment:
#       - MODEL_NAME=my_classifier
#   api:
#     build: ./api
#     ports: ["5000:5000"]
#     depends_on: [tf-serving]
```
5. TFLite – Mobile & Edge Deployment
```python
import tensorflow as tf
import numpy as np

# ===========================
# 1. Basic conversion (no optimization)
# ===========================
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model/my_classifier/1")
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)

original_size = 25.6  # MB (SavedModel)
tflite_size = len(tflite_model) / (1024 * 1024)
print(f"Original:  {original_size:.1f} MB")
print(f"TFLite:    {tflite_size:.1f} MB")
print(f"Reduction: {original_size / tflite_size:.1f}×")

# ===========================
# 2. Dynamic range quantization (RECOMMENDED default)
# ===========================
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model/my_classifier/1")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # dynamic range!
tflite_quant = converter.convert()
print(f"Quantized: {len(tflite_quant) / 1024 / 1024:.1f} MB")
# ~4× smaller than original! (float32 → int8 weights)
# Accuracy loss: typically < 1%

# ===========================
# 3. Full integer quantization (smallest, fastest)
# ===========================
def representative_dataset():
    """Provide sample data for calibration"""
    for i in range(100):
        sample = X_train[i:i + 1].astype(np.float32)
        yield [sample]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model/my_classifier/1")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8   # or tf.uint8
converter.inference_output_type = tf.int8
tflite_int8 = converter.convert()
print(f"Full INT8: {len(tflite_int8) / 1024 / 1024:.1f} MB")
# ~4× smaller AND ~2-4× faster inference on mobile!
# Runs on CPU integer units – no GPU needed!

# ===========================
# 4. Float16 quantization (GPU-friendly mobile)
# ===========================
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model/my_classifier/1")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_fp16 = converter.convert()
# ~2× smaller, runs on mobile GPU (faster than INT8 on GPU)

# ===========================
# 5. Test the TFLite model in Python
# ===========================
interpreter = tf.lite.Interpreter(model_content=tflite_quant)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Run inference
test_input = X_test[0:1].astype(np.float32)
interpreter.set_tensor(input_details[0]['index'], test_input)
interpreter.invoke()
output = interpreter.get_tensor(output_details[0]['index'])
print(f"TFLite prediction: {np.argmax(output)}")
# Should match the original model's prediction!
```
| Quantization | Size Reduction | Speed | Accuracy Impact | Best For |
|---|---|---|---|---|
| No quantization | 1× (baseline) | 1× (baseline) | 0% | Maximum accuracy |
| Dynamic range | ~4× smaller | ~2× faster | < 1% | Default choice ✅ |
| Float16 | ~2× smaller | ~1.5× faster | ~0% | Mobile GPU |
| Full INT8 | ~4× smaller | ~3× faster | 1-3% | Edge/IoT devices |
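The ~4× size figure for int8 quantization follows directly from the dtype widths (4-byte float32 weights become 1-byte int8 plus a scale factor). A minimal numpy sketch of symmetric per-tensor quantization, to illustrate the idea only; TFLite's actual scheme adds refinements such as per-channel scales:

```python
import numpy as np

# A stand-in float32 weight matrix
w = np.random.randn(256, 128).astype(np.float32)

# Symmetric quantization: map [-max|w|, +max|w|] onto int8 [-127, 127]
scale = np.abs(w).max() / 127.0
w_int8 = np.round(w / scale).astype(np.int8)

# Dequantize (what happens conceptually at inference time)
w_restored = w_int8.astype(np.float32) * scale

print(f"Size reduction: {w.nbytes / w_int8.nbytes:.0f}x")  # 4x (4 bytes -> 1 byte)
print(f"Max abs error:  {np.abs(w - w_restored).max():.4f}")
```

The rounding error is bounded by half a quantization step (`scale / 2`), which is why accuracy loss is usually small for well-behaved weight distributions.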
6. TF.js – Models in the Browser
```shell
# Install the converter
pip install tensorflowjs

# Convert SavedModel → TF.js format
tensorflowjs_converter \
  --input_format=tf_saved_model \
  --output_format=tfjs_graph_model \
  --quantize_uint8 \
  saved_model/my_classifier/1 \
  web_model/

# Output files:
# web_model/
# ├── model.json            ← architecture + weight manifest
# └── group1-shard1of1.bin  ← weights binary
```
```html
<!-- Load the TF.js library -->
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs"></script>
<script>
  async function loadAndPredict() {
    // Load the model from your web server
    const model = await tf.loadGraphModel('web_model/model.json');

    // Create an input tensor (e.g., from canvas/image)
    const input = tf.zeros([1, 224, 224, 3]);

    // Run inference
    const prediction = model.predict(input);
    const classIndex = prediction.argMax(1).dataSync()[0];
    console.log(`Predicted class: ${classIndex}`);

    // Clean up tensors (prevent memory leaks!)
    input.dispose();
    prediction.dispose();
  }
  loadAndPredict();
</script>

<!-- Use cases: -->
<!-- • Real-time webcam classification -->
<!-- • Image editing/filtering -->
<!-- • Text sentiment analysis -->
<!-- • Pose detection (PoseNet) -->
<!-- • No server costs! No data leaves the user's device! -->
```
7. Model Versioning – A/B Testing & Rollback
```python
import tensorflow as tf
import requests
import random

# ===========================
# 1. Save with version numbers
# ===========================
model_v1.save("models/classifier/1")  # January model
model_v2.save("models/classifier/2")  # February model (improved)
model_v3.save("models/classifier/3")  # March model (latest)
# TF Serving auto-serves the LATEST version (v3),
# but you can request ANY version:

# ===========================
# 2. Request a specific version
# ===========================
# Latest (default):
url_latest = "http://localhost:8501/v1/models/classifier:predict"
# Specific versions:
url_v1 = "http://localhost:8501/v1/models/classifier/versions/1:predict"
url_v2 = "http://localhost:8501/v1/models/classifier/versions/2:predict"
url_v3 = "http://localhost:8501/v1/models/classifier/versions/3:predict"

# ===========================
# 3. A/B testing – compare versions
# ===========================
def predict_with_ab_test(input_data, traffic_split=0.1):
    """Send 10% of traffic to the new model, 90% to the stable model"""
    if random.random() < traffic_split:
        url = url_v3          # new model (10%)
        version = "v3_new"
    else:
        url = url_v2          # stable model (90%)
        version = "v2_stable"

    resp = requests.post(url, json={"instances": input_data})
    result = resp.json()["predictions"]

    # Log for comparison (log_prediction = your logging function)
    log_prediction(version, input_data, result)
    return result

# ===========================
# 4. Rollback – if the new model performs poorly
# ===========================
# Option 1: Delete the version folder
#   rm -rf models/classifier/3   → TF Serving auto-falls back to v2

# Option 2: Model config file (model_config.txt):
# model_config_list {
#   config {
#     name: 'classifier'
#     base_path: '/models/classifier'
#     model_platform: 'tensorflow'
#     model_version_policy {
#       specific { versions: 2 }   ← force serving v2 only
#     }
#   }
# }
```
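One caveat with `random.random()` routing: the same user can bounce between model versions across requests, which makes per-user comparisons noisy. A common fix is deterministic bucketing on a stable key. A sketch (`assign_variant` is a hypothetical helper, not part of TF Serving):

```python
import hashlib

def assign_variant(user_id: str, new_traffic: float = 0.10) -> str:
    """Hash the user id into 1000 buckets; the same user always lands
    in the same bucket, so their model version is sticky."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 1000
    return "v3_new" if bucket < new_traffic * 1000 else "v2_stable"

# Sticky: repeated calls for the same user agree
assert assign_variant("user-42") == assign_variant("user-42")

# The split stays close to 10% over many users
share = sum(assign_variant(f"user-{i}") == "v3_new" for i in range(10_000)) / 10_000
print(f"v3 share: {share:.1%}")
```

Ramping up the rollout is then just a matter of raising `new_traffic`; users already in the new-model buckets stay there.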
8. Batch Prediction – High-Throughput Offline Inference
```python
import tensorflow as tf
import numpy as np

# ===========================
# 1. Simple batch prediction
# ===========================
model = tf.keras.models.load_model("saved_model/my_classifier")

# Predict on a large dataset
all_predictions = model.predict(X_large, batch_size=256, verbose=1)
# Progress bar: 100000/100000 [==============================] - 45s

# ===========================
# 2. tf.data pipeline for batch prediction (memory-efficient)
# ===========================
predict_ds = (tf.data.Dataset.from_tensor_slices(X_large)
              .batch(256)
              .prefetch(tf.data.AUTOTUNE))

all_preds = []
for batch in predict_ds:
    preds = model(batch, training=False)
    all_preds.append(preds.numpy())

all_predictions = np.concatenate(all_preds, axis=0)
print(f"Predicted {len(all_predictions)} samples")

# ===========================
# 3. Predict from files (no need to load everything into RAM)
# ===========================
file_ds = (tf.data.TFRecordDataset('data/large_dataset.tfrecord')
           .map(parse_fn, num_parallel_calls=tf.data.AUTOTUNE)  # parse_fn: your TFRecord parser
           .batch(256)
           .prefetch(tf.data.AUTOTUNE))

results = model.predict(file_ds)
# Processes terabytes of data without running out of memory!
```
9. Production Monitoring – Detect Problems Before It's Too Late
📌 5 Things to Monitor in Production:
1. Latency: How many milliseconds per prediction? Target: <100ms for real-time. Alert if >500ms.
2. Throughput: How many requests per second? Is the server overwhelmed?
3. Error Rate: What percentage of requests fail (500 errors, timeouts)?
4. Data Drift: Has the input distribution changed from the training data? If so, the model may be stale.
5. Prediction Drift: Has the output distribution changed? E.g., if 90% of predictions suddenly become class A, there may be a bug or a data shift.
```python
import time
import numpy as np
from collections import defaultdict

# Simple prediction logger
class PredictionMonitor:
    def __init__(self):
        self.latencies = []
        self.predictions = defaultdict(int)
        self.errors = 0
        self.total = 0

    def predict_and_log(self, model, input_data):
        self.total += 1
        start = time.time()
        try:
            pred = model.predict(input_data, verbose=0)
            latency = (time.time() - start) * 1000  # ms
            self.latencies.append(latency)

            pred_class = np.argmax(pred, axis=1)[0]
            self.predictions[pred_class] += 1

            # Alert on high latency
            if latency > 500:
                print(f"⚠️ HIGH LATENCY: {latency:.0f}ms")
            return pred
        except Exception as e:
            self.errors += 1
            print(f"❌ ERROR: {e}")
            return None

    def report(self):
        print("\n📊 Monitoring Report:")
        print(f"  Total requests: {self.total}")
        print(f"  Error rate: {self.errors / max(self.total, 1):.1%}")
        if self.latencies:
            print(f"  Avg latency: {np.mean(self.latencies):.1f}ms")
            print(f"  P99 latency: {np.percentile(self.latencies, 99):.1f}ms")
        print(f"  Class distribution: {dict(self.predictions)}")
```
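The monitor above covers latency, errors, and prediction distribution; detecting data drift (item 4) needs a statistic comparing live inputs against the training distribution. One common choice is the Population Stability Index. A minimal numpy sketch; the 0.1/0.25 thresholds are a widely used rule of thumb, not a hard standard:

```python
import numpy as np

def psi(reference, live, bins=10):
    """Population Stability Index between a training-time feature
    sample and a window of production inputs."""
    # Bin edges come from the reference (training) distribution
    edges = np.percentile(reference, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range live values
    ref_frac = np.histogram(reference, edges)[0] / len(reference)
    live_frac = np.histogram(live, edges)[0] / len(live)
    # Avoid log(0) on empty bins
    ref_frac = np.clip(ref_frac, 1e-6, None)
    live_frac = np.clip(live_frac, 1e-6, None)
    return float(np.sum((live_frac - ref_frac) * np.log(live_frac / ref_frac)))

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 10_000)      # a feature at training time
stable = rng.normal(0, 1, 10_000)     # production looks the same
shifted = rng.normal(0.8, 1, 10_000)  # production has drifted

print(f"stable:  PSI = {psi(train, stable):.3f}")   # near 0 → no drift
print(f"shifted: PSI = {psi(train, shifted):.3f}")  # > 0.25 → alert
```

Run this per feature on a sliding window of production inputs; PSI below ~0.1 is usually considered stable, above ~0.25 significant drift worth investigating.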
10. Deployment Checklist – Before Going to Production
| # | Step | Detail | Check |
|---|---|---|---|
| 1 | Test accuracy on a held-out set | Ensure performance meets expectations | ☐ |
| 2 | Test with edge-case data | Empty input, blank images, very long text | ☐ |
| 3 | Benchmark latency | Target: <100ms real-time, <1s batch | ☐ |
| 4 | Quantize if mobile | Dynamic range quantization → ~4× smaller | ☐ |
| 5 | Test TFLite accuracy | Ensure quantization doesn't hurt accuracy | ☐ |
| 6 | Set up versioning | SavedModel /1, /2, /3 → always rollback-ready | ☐ |
| 7 | Docker container | Reproducible environment, easy scaling | ☐ |
| 8 | Load testing | How many concurrent requests before a crash? | ☐ |
| 9 | Monitoring setup | Latency, error rate, prediction distribution | ☐ |
| 10 | Rollback plan | If the new model fails, revert to the previous version | ☐ |
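Checklist item 8 can start as a few lines of Python before reaching for dedicated tools (locust, k6, and the like). A sketch; `predict_fn` is a stand-in for whatever call hits your endpoint (e.g. a `requests.post` to the TF Serving URL):

```python
import time
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def load_test(predict_fn, n_requests=200, concurrency=8):
    """Fire n_requests at predict_fn from `concurrency` threads
    and report latency percentiles in milliseconds."""
    def timed_call(_):
        start = time.perf_counter()
        predict_fn()
        return (time.perf_counter() - start) * 1000

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(n_requests)))

    return {
        "avg_ms": float(np.mean(latencies)),
        "p50_ms": float(np.percentile(latencies, 50)),
        "p99_ms": float(np.percentile(latencies, 99)),
    }

# Demo with a stub that "serves" each request in ~5 ms
stats = load_test(lambda: time.sleep(0.005), n_requests=100, concurrency=8)
print(stats)
```

Ramp `concurrency` up until p99 latency or the error rate blows past your targets; that knee is your capacity per replica.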
11. Page 9 Summary
| Concept | What It Is | Key Code |
|---|---|---|
| SavedModel | Universal export format | model.save("saved_model/v1") |
| TF Serving | Production REST/gRPC server | docker run tensorflow/serving |
| TFLite | Mobile & edge deployment | TFLiteConverter + quantize |
| TF.js | Browser inference | tensorflowjs_converter |
| Quantization | 2-4× model compression | Optimize.DEFAULT |
| Docker | Container deployment | docker run -p 8501:8501 |
| Versioning | A/B testing & rollback | saved_model/name/1, /2, /3 |
| Monitoring | Latency, drift, errors | PredictionMonitor class |
← Page 8: GAN & Generative Models
Coming Next: Page 10 – Capstone: End-to-End ML Project 🚀
Grand finale! Combine EVERYTHING from Pages 1-9 in one complete project: tf.data pipeline → augmentation → EfficientNet transfer learning → custom training → TensorBoard monitoring → SavedModel export → TFLite → TF Serving → Docker deployment. Plus a roadmap for TFX, Vertex AI, and JAX. Series finale!