📋 Table of Contents – Page 9
- Deployment Overview – From notebook to production
- SavedModel Format – Standard export for all platforms
- TF Serving – REST API & gRPC production server
- Docker Deployment – Containers for consistency
- TFLite – Mobile & edge: quantization, optimization
- TF.js – Models in the browser: convert & load
- Model Versioning – A/B testing & safe rollback
- Batch Prediction – High-throughput offline inference
- Production Monitoring – Data drift, latency, accuracy
- Deployment Checklist – Before going to production
- Summary & Page 10 Preview
1. Deployment Overview – From Notebook to the Real World
You can now train accurate models (Pages 1-8). But a model in a Jupyter notebook can't serve users. Deployment means making your model accept input and return predictions reliably, scalably, and in real time. TensorFlow has one of the most complete deployment ecosystems: a single model can be deployed to servers, phones, browsers, and edge devices.
2. SavedModel Format – Standard Export
```python
import tensorflow as tf
from tensorflow import keras

# ===========================
# 1. Save – SavedModel format (RECOMMENDED for deployment)
# ===========================
model.save("saved_model/my_classifier")
# Creates directory structure:
# saved_model/my_classifier/
# ├── saved_model.pb   ← computation graph
# ├── fingerprint.pb   ← integrity check
# └── variables/
#     ├── variables.data-00000-of-00001   ← weights
#     └── variables.index

# ===========================
# 2. Save – Keras native format (.keras)
# ===========================
model.save("my_model.keras")  # single file, includes architecture
# Good for: development, sharing models with Keras users
# Bad for: TF Serving (needs SavedModel format)

# ===========================
# 3. Save – Weights only
# ===========================
model.save_weights("weights/my_weights.weights.h5")
# Only weights, no architecture. Must rebuild the model first when loading.
# Good for: checkpointing during training, transfer learning

# ===========================
# 4. Load models
# ===========================
# SavedModel
loaded_sm = tf.keras.models.load_model("saved_model/my_classifier")
predictions = loaded_sm.predict(X_test[:5])

# Keras format
loaded_keras = tf.keras.models.load_model("my_model.keras")

# Weights only (must have an identical architecture!)
new_model = build_model()  # same architecture
new_model.load_weights("weights/my_weights.weights.h5")

# ===========================
# 5. Inspect SavedModel with the CLI
# ===========================
# saved_model_cli show --dir saved_model/my_classifier --all
# Shows: input/output signatures, shapes, dtypes
# This is what TF Serving uses to know the API!

# ===========================
# 6. Add custom serving signature
# ===========================
# For models with custom preprocessing:
class ServableModel(tf.Module):
    def __init__(self, model):
        self.model = model

    @tf.function(input_signature=[
        tf.TensorSpec(shape=[None, 224, 224, 3], dtype=tf.float32)])
    def serve(self, images):
        # Preprocessing included in serving!
        images = images / 255.0
        predictions = self.model(images, training=False)
        return {"predictions": predictions}

servable = ServableModel(model)
tf.saved_model.save(servable, "saved_model/servable",
                    signatures={"serving_default": servable.serve})
```
📌 SavedModel vs .keras vs .h5 – When to Use What?
SavedModel/ (directory): Deploy to TF Serving, TFLite, TF.js. Production standard.
.keras (single file): Development, sharing, prototyping. Keras standard.
.weights.h5 (weights only): Checkpointing, transfer learning. Needs an identical architecture.
Rule: For production – always SavedModel. For development – .keras.
3. TF Serving – Production REST & gRPC Server
TF Serving is a high-performance C++ server for serving TensorFlow models. It is designed for production: automatic request batching, GPU support, model hot-swapping (updating models without downtime), and built-in monitoring. Google uses it to serve billions of predictions per day.
```python
import tensorflow as tf
import requests
import json
import numpy as np

# ===========================
# 1. Save model with a version number
# ===========================
model.save("saved_model/my_classifier/1")     # version 1
# Later:
model_v2.save("saved_model/my_classifier/2")  # version 2
# TF Serving auto-detects and serves the LATEST version!

# Directory structure for TF Serving:
# saved_model/my_classifier/
# ├── 1/                 ← version 1
# │   ├── saved_model.pb
# │   └── variables/
# └── 2/                 ← version 2 (latest, auto-served)
#     ├── saved_model.pb
#     └── variables/

# ===========================
# 2. REST API client
# ===========================
# TF Serving runs on port 8501 (REST) and 8500 (gRPC)

# Prepare input data
test_images = X_test[:5].tolist()  # must be JSON-serializable

# Send REST request
url = "http://localhost:8501/v1/models/my_classifier:predict"
payload = json.dumps({"instances": test_images})
headers = {"Content-Type": "application/json"}

response = requests.post(url, data=payload, headers=headers)
result = response.json()

predictions = np.array(result["predictions"])
predicted_classes = np.argmax(predictions, axis=1)
print(f"Predicted: {predicted_classes}")  # Predicted: [7 2 1 0 4]

# Request a specific version:
# url = "http://localhost:8501/v1/models/my_classifier/versions/1:predict"

# ===========================
# 3. gRPC client (faster for production!)
# ===========================
# pip install tensorflow-serving-api
import grpc
# from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc
#
# channel = grpc.insecure_channel('localhost:8500')
# stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
#
# request = predict_pb2.PredictRequest()
# request.model_spec.name = 'my_classifier'
# request.inputs['input_1'].CopyFrom(
#     tf.make_tensor_proto(test_images, dtype=tf.float32))
#
# response = stub.Predict(request, timeout=10.0)
# predictions = tf.make_ndarray(response.outputs['dense_1'])

# gRPC vs REST:
#   gRPC: ~2-5× faster (binary protocol, no JSON serialization)
#   REST: easier to debug, works with curl, browser-friendly
# Production recommendation: gRPC for internal services, REST for external APIs
```
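The automatic request batching mentioned above is off by default; it is enabled with TF Serving's `--enable_batching` flag plus a batching parameters file. A hedged sketch of such a config (the numeric values here are illustrative placeholders to tune for your model, not recommendations):

```
# batching.conf – passed to TF Serving at startup, e.g.:
#   docker run ... tensorflow/serving \
#     --enable_batching \
#     --batching_parameters_file=/models/batching.conf
max_batch_size { value: 32 }          # merge up to 32 requests into one forward pass
batch_timeout_micros { value: 5000 }  # wait at most 5 ms to fill a batch
max_enqueued_batches { value: 100 }   # backpressure: queue limit before rejecting
num_batch_threads { value: 4 }        # threads processing batches
```

Batching trades a few milliseconds of queueing latency for much higher GPU throughput, so it mainly pays off under concurrent load.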
4. Docker Deployment – Containers for Consistency
```shell
# ===========================
# 1. Pull the TF Serving Docker image
# ===========================
docker pull tensorflow/serving             # CPU version
docker pull tensorflow/serving:latest-gpu  # GPU version (needs nvidia-docker)

# ===========================
# 2. Run the TF Serving container
# ===========================
docker run -d --name tf_serving \
  -p 8501:8501 -p 8500:8500 \
  --mount type=bind,source=$(pwd)/saved_model/my_classifier,target=/models/my_classifier \
  -e MODEL_NAME=my_classifier \
  tensorflow/serving

# ===========================
# 3. Test with curl
# ===========================
curl -X POST http://localhost:8501/v1/models/my_classifier:predict \
  -H "Content-Type: application/json" \
  -d '{"instances": [[0.1, 0.2, 0.3, 0.4]]}'

# Check model status
curl http://localhost:8501/v1/models/my_classifier

# ===========================
# 4. Custom Dockerfile (with the model baked in)
# ===========================
# FROM tensorflow/serving
# COPY saved_model/my_classifier /models/my_classifier
# ENV MODEL_NAME=my_classifier
# EXPOSE 8501 8500
#
# docker build -t my-ml-service .
# docker run -p 8501:8501 my-ml-service

# ===========================
# 5. Docker Compose (multiple services)
# ===========================
# version: '3'
# services:
#   tf-serving:
#     image: tensorflow/serving
#     ports: ["8501:8501", "8500:8500"]
#     volumes: ["./saved_model:/models"]
#     environment:
#       - MODEL_NAME=my_classifier
#   api:
#     build: ./api
#     ports: ["5000:5000"]
#     depends_on: [tf-serving]
```
5. TFLite – Mobile & Edge Deployment
```python
import tensorflow as tf
import numpy as np

# ===========================
# 1. Basic conversion (no optimization)
# ===========================
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model/my_classifier/1")
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)

original_size = 25.6  # MB (SavedModel)
tflite_size = len(tflite_model) / (1024 * 1024)
print(f"Original:  {original_size:.1f} MB")
print(f"TFLite:    {tflite_size:.1f} MB")
print(f"Reduction: {original_size / tflite_size:.1f}×")

# ===========================
# 2. Dynamic range quantization (RECOMMENDED default)
# ===========================
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model/my_classifier/1")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # dynamic range!
tflite_quant = converter.convert()
print(f"Quantized: {len(tflite_quant) / 1024 / 1024:.1f} MB")
# ~4× smaller than original! (float32 → int8 weights)
# Accuracy loss: typically < 1%

# ===========================
# 3. Full integer quantization (smallest, fastest)
# ===========================
def representative_dataset():
    """Provide sample data for calibration"""
    for i in range(100):
        sample = X_train[i:i + 1].astype(np.float32)
        yield [sample]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model/my_classifier/1")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8   # or tf.uint8
converter.inference_output_type = tf.int8
tflite_int8 = converter.convert()
print(f"Full INT8: {len(tflite_int8) / 1024 / 1024:.1f} MB")
# ~4× smaller AND ~2-4× faster inference on mobile!
# Runs on CPU integer units – no GPU needed!

# ===========================
# 4. Float16 quantization (GPU-friendly mobile)
# ===========================
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model/my_classifier/1")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_fp16 = converter.convert()
# ~2× smaller, runs on mobile GPU (faster than INT8 on GPU)

# ===========================
# 5. Test the TFLite model in Python
# ===========================
interpreter = tf.lite.Interpreter(model_content=tflite_quant)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Run inference
test_input = X_test[0:1].astype(np.float32)
interpreter.set_tensor(input_details[0]['index'], test_input)
interpreter.invoke()
output = interpreter.get_tensor(output_details[0]['index'])
print(f"TFLite prediction: {np.argmax(output)}")
# Should match the original model's prediction!
```
| Quantization | Size Reduction | Speed | Accuracy Impact | Best For |
|---|---|---|---|---|
| No quantization | 1× (baseline) | 1× (baseline) | 0% | Maximum accuracy |
| Dynamic range | ~4× smaller | ~2× faster | < 1% | Default choice ✅ |
| Float16 | ~2× smaller | ~1.5× faster | ~0% | Mobile GPU |
| Full INT8 | ~4× smaller | ~3× faster | 1-3% | Edge/IoT devices |
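The ~4× size figure for int8 quantization follows directly from the dtype widths (4-byte float32 weights become 1-byte int8 plus a scale factor). A minimal numpy sketch of symmetric per-tensor quantization, to illustrate the idea only; TFLite's actual scheme adds refinements such as per-channel scales:

```python
import numpy as np

# A stand-in float32 weight matrix
w = np.random.randn(256, 128).astype(np.float32)

# Symmetric quantization: map [-max|w|, +max|w|] onto int8 [-127, 127]
scale = np.abs(w).max() / 127.0
w_int8 = np.round(w / scale).astype(np.int8)

# Dequantize (what happens conceptually at inference time)
w_restored = w_int8.astype(np.float32) * scale

print(f"Size reduction: {w.nbytes / w_int8.nbytes:.0f}x")  # 4x (4 bytes -> 1 byte)
print(f"Max abs error:  {np.abs(w - w_restored).max():.4f}")
```

The rounding error is bounded by half a quantization step (`scale / 2`), which is why accuracy loss is usually small for well-behaved weight distributions.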
6. TF.js – Models in the Browser
```shell
# Install the converter
pip install tensorflowjs

# Convert SavedModel → TF.js format
tensorflowjs_converter \
  --input_format=tf_saved_model \
  --output_format=tfjs_graph_model \
  --quantize_uint8 \
  saved_model/my_classifier/1 \
  web_model/

# Output files:
# web_model/
# ├── model.json            ← architecture + weight manifest
# └── group1-shard1of1.bin  ← weights binary
```
```html
<!-- Load the TF.js library -->
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs"></script>
<script>
  async function loadAndPredict() {
    // Load the model from your web server
    const model = await tf.loadGraphModel('web_model/model.json');

    // Create an input tensor (e.g., from canvas/image)
    const input = tf.zeros([1, 224, 224, 3]);

    // Run inference
    const prediction = model.predict(input);
    const classIndex = prediction.argMax(1).dataSync()[0];
    console.log(`Predicted class: ${classIndex}`);

    // Clean up tensors (prevent memory leaks!)
    input.dispose();
    prediction.dispose();
  }
  loadAndPredict();
</script>

<!-- Use cases: -->
<!-- • Real-time webcam classification -->
<!-- • Image editing/filtering -->
<!-- • Text sentiment analysis -->
<!-- • Pose detection (PoseNet) -->
<!-- • No server costs! No data leaves the user's device! -->
```
7. Model Versioning – A/B Testing & Rollback
```python
import tensorflow as tf
import requests
import random

# ===========================
# 1. Save with version numbers
# ===========================
model_v1.save("models/classifier/1")  # January model
model_v2.save("models/classifier/2")  # February model (improved)
model_v3.save("models/classifier/3")  # March model (latest)
# TF Serving auto-serves the LATEST version (v3),
# but you can request ANY version:

# ===========================
# 2. Request a specific version
# ===========================
# Latest (default):
url_latest = "http://localhost:8501/v1/models/classifier:predict"
# Specific versions:
url_v1 = "http://localhost:8501/v1/models/classifier/versions/1:predict"
url_v2 = "http://localhost:8501/v1/models/classifier/versions/2:predict"
url_v3 = "http://localhost:8501/v1/models/classifier/versions/3:predict"

# ===========================
# 3. A/B testing – compare versions
# ===========================
def predict_with_ab_test(input_data, traffic_split=0.1):
    """Send 10% of traffic to the new model, 90% to the stable model"""
    if random.random() < traffic_split:
        url = url_v3          # new model (10%)
        version = "v3_new"
    else:
        url = url_v2          # stable model (90%)
        version = "v2_stable"

    resp = requests.post(url, json={"instances": input_data})
    result = resp.json()["predictions"]

    # Log for comparison (log_prediction = your logging function)
    log_prediction(version, input_data, result)
    return result

# ===========================
# 4. Rollback – if the new model performs poorly
# ===========================
# Option 1: Delete the version folder
#   rm -rf models/classifier/3   → TF Serving auto-falls back to v2

# Option 2: Model config file (model_config.txt):
# model_config_list {
#   config {
#     name: 'classifier'
#     base_path: '/models/classifier'
#     model_platform: 'tensorflow'
#     model_version_policy {
#       specific { versions: 2 }   ← force serving v2 only
#     }
#   }
# }
```
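One caveat with `random.random()` routing: the same user can bounce between model versions across requests, which makes per-user comparisons noisy. A common fix is deterministic bucketing on a stable key. A sketch (`assign_variant` is a hypothetical helper, not part of TF Serving):

```python
import hashlib

def assign_variant(user_id: str, new_traffic: float = 0.10) -> str:
    """Hash the user id into 1000 buckets; the same user always lands
    in the same bucket, so their model version is sticky."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 1000
    return "v3_new" if bucket < new_traffic * 1000 else "v2_stable"

# Sticky: repeated calls for the same user agree
assert assign_variant("user-42") == assign_variant("user-42")

# The split stays close to 10% over many users
share = sum(assign_variant(f"user-{i}") == "v3_new" for i in range(10_000)) / 10_000
print(f"v3 share: {share:.1%}")
```

Ramping up the rollout is then just a matter of raising `new_traffic`; users already in the new-model buckets stay there.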
8. Batch Prediction – High-Throughput Offline Inference
```python
import tensorflow as tf
import numpy as np

# ===========================
# 1. Simple batch prediction
# ===========================
model = tf.keras.models.load_model("saved_model/my_classifier")

# Predict on a large dataset
all_predictions = model.predict(X_large, batch_size=256, verbose=1)
# Progress bar: 100000/100000 [==============================] - 45s

# ===========================
# 2. tf.data pipeline for batch prediction (memory-efficient)
# ===========================
predict_ds = (tf.data.Dataset.from_tensor_slices(X_large)
              .batch(256)
              .prefetch(tf.data.AUTOTUNE))

all_preds = []
for batch in predict_ds:
    preds = model(batch, training=False)
    all_preds.append(preds.numpy())

all_predictions = np.concatenate(all_preds, axis=0)
print(f"Predicted {len(all_predictions)} samples")

# ===========================
# 3. Predict from files (no need to load everything into RAM)
# ===========================
file_ds = (tf.data.TFRecordDataset('data/large_dataset.tfrecord')
           .map(parse_fn, num_parallel_calls=tf.data.AUTOTUNE)  # parse_fn: your TFRecord parser
           .batch(256)
           .prefetch(tf.data.AUTOTUNE))

results = model.predict(file_ds)
# Processes terabytes of data without running out of memory!
```
9. Production Monitoring – Detect Problems Before It's Too Late
📌 5 Things to Monitor in Production:
1. Latency: How many milliseconds per prediction? Target: <100ms for real-time. Alert if >500ms.
2. Throughput: How many requests per second? Is the server overwhelmed?
3. Error Rate: What percentage of requests fail (500 errors, timeouts)?
4. Data Drift: Has the input distribution changed from the training data? If so, the model may be stale.
5. Prediction Drift: Has the output distribution changed? E.g., if 90% of predictions suddenly become class A, there may be a bug or a data shift.
```python
import time
import numpy as np
from collections import defaultdict

# Simple prediction logger
class PredictionMonitor:
    def __init__(self):
        self.latencies = []
        self.predictions = defaultdict(int)
        self.errors = 0
        self.total = 0

    def predict_and_log(self, model, input_data):
        self.total += 1
        start = time.time()
        try:
            pred = model.predict(input_data, verbose=0)
            latency = (time.time() - start) * 1000  # ms
            self.latencies.append(latency)

            pred_class = np.argmax(pred, axis=1)[0]
            self.predictions[pred_class] += 1

            # Alert on high latency
            if latency > 500:
                print(f"⚠️ HIGH LATENCY: {latency:.0f}ms")
            return pred
        except Exception as e:
            self.errors += 1
            print(f"❌ ERROR: {e}")
            return None

    def report(self):
        print("\n📊 Monitoring Report:")
        print(f"  Total requests: {self.total}")
        print(f"  Error rate: {self.errors / max(self.total, 1):.1%}")
        if self.latencies:
            print(f"  Avg latency: {np.mean(self.latencies):.1f}ms")
            print(f"  P99 latency: {np.percentile(self.latencies, 99):.1f}ms")
        print(f"  Class distribution: {dict(self.predictions)}")
```
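The monitor above covers latency, errors, and prediction distribution; detecting data drift (item 4) needs a statistic comparing live inputs against the training distribution. One common choice is the Population Stability Index. A minimal numpy sketch; the 0.1/0.25 thresholds are a widely used rule of thumb, not a hard standard:

```python
import numpy as np

def psi(reference, live, bins=10):
    """Population Stability Index between a training-time feature
    sample and a window of production inputs."""
    # Bin edges come from the reference (training) distribution
    edges = np.percentile(reference, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range live values
    ref_frac = np.histogram(reference, edges)[0] / len(reference)
    live_frac = np.histogram(live, edges)[0] / len(live)
    # Avoid log(0) on empty bins
    ref_frac = np.clip(ref_frac, 1e-6, None)
    live_frac = np.clip(live_frac, 1e-6, None)
    return float(np.sum((live_frac - ref_frac) * np.log(live_frac / ref_frac)))

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 10_000)      # a feature at training time
stable = rng.normal(0, 1, 10_000)     # production looks the same
shifted = rng.normal(0.8, 1, 10_000)  # production has drifted

print(f"stable:  PSI = {psi(train, stable):.3f}")   # near 0 → no drift
print(f"shifted: PSI = {psi(train, shifted):.3f}")  # > 0.25 → alert
```

Run this per feature on a sliding window of production inputs; PSI below ~0.1 is usually considered stable, above ~0.25 significant drift worth investigating.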
10. Deployment Checklist – Before Going to Production
| # | Step | Detail | Check |
|---|---|---|---|
| 1 | Test accuracy on a held-out set | Ensure performance meets expectations | ☐ |
| 2 | Test with edge-case data | Empty input, blank images, very long text | ☐ |
| 3 | Benchmark latency | Target: <100ms real-time, <1s batch | ☐ |
| 4 | Quantize if mobile | Dynamic range quantization → ~4× smaller | ☐ |
| 5 | Test TFLite accuracy | Ensure quantization doesn't hurt accuracy | ☐ |
| 6 | Set up versioning | SavedModel /1, /2, /3 → always rollback-ready | ☐ |
| 7 | Docker container | Reproducible environment, easy scaling | ☐ |
| 8 | Load testing | How many concurrent requests before a crash? | ☐ |
| 9 | Monitoring setup | Latency, error rate, prediction distribution | ☐ |
| 10 | Rollback plan | If the new model fails, revert to the previous version | ☐ |
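Checklist item 8 can start as a few lines of Python before reaching for dedicated tools (locust, k6, and the like). A sketch; `predict_fn` is a stand-in for whatever call hits your endpoint (e.g. a `requests.post` to the TF Serving URL):

```python
import time
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def load_test(predict_fn, n_requests=200, concurrency=8):
    """Fire n_requests at predict_fn from `concurrency` threads
    and report latency percentiles in milliseconds."""
    def timed_call(_):
        start = time.perf_counter()
        predict_fn()
        return (time.perf_counter() - start) * 1000

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(n_requests)))

    return {
        "avg_ms": float(np.mean(latencies)),
        "p50_ms": float(np.percentile(latencies, 50)),
        "p99_ms": float(np.percentile(latencies, 99)),
    }

# Demo with a stub that "serves" each request in ~5 ms
stats = load_test(lambda: time.sleep(0.005), n_requests=100, concurrency=8)
print(stats)
```

Ramp `concurrency` up until p99 latency or the error rate blows past your targets; that knee is your capacity per replica.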
11. Page 9 Summary
| Concept | What It Is | Key Code |
|---|---|---|
| SavedModel | Universal export format | model.save("saved_model/v1") |
| TF Serving | Production REST/gRPC server | docker run tensorflow/serving |
| TFLite | Mobile & edge deployment | TFLiteConverter + quantize |
| TF.js | Browser inference | tensorflowjs_converter |
| Quantization | 2-4× model compression | Optimize.DEFAULT |
| Docker | Container deployment | docker run -p 8501:8501 |
| Versioning | A/B testing & rollback | saved_model/name/1, /2, /3 |
| Monitoring | Latency, drift, errors | PredictionMonitor class |
← Page 8: GAN & Generative Models
Coming Next: Page 10 – Capstone: End-to-End ML Project 🚀
Grand finale! Combine EVERYTHING from Pages 1-9 in one complete project: tf.data pipeline → augmentation → EfficientNet transfer learning → custom training → TensorBoard monitoring → SavedModel export → TFLite → TF Serving → Docker deployment. Plus a roadmap for TFX, Vertex AI, and JAX. Series finale!