📝 Artikel ini ditulis dalam Bahasa Indonesia & English
📝 This article is available in English & Bahasa Indonesia

Belajar TensorFlow — Page 4 / Learn TensorFlow — Page 4

tf.data Pipeline &
Performance Optimization


Data loading yang lambat = GPU menganggur 50-90% waktu Anda. Page 4 membahas secara mendalam: masalah I/O bottleneck, tf.data.Dataset API, the golden pattern (shuffle → map → batch → prefetch), cache untuk dataset kecil, TFRecord format untuk dataset besar, parallel preprocessing dengan num_parallel_calls, mixed precision training (float16 untuk 1.5-3× speedup), XLA compilation, dan profiling dengan TF Profiler untuk menemukan bottleneck.

Slow data loading = GPU sitting idle 50-90% of the time. Page 4 covers in depth: the I/O bottleneck problem, tf.data.Dataset API, the golden pattern (shuffle → map → batch → prefetch), cache for small datasets, TFRecord format for large datasets, parallel preprocessing with num_parallel_calls, mixed precision training (float16 for 1.5-3× speedup), XLA compilation, and profiling with TF Profiler to find bottlenecks.

📅 Maret / March 2026 · 30 menit baca / 30 min read
🏷 tf.data · Prefetch · Cache · TFRecord · Mixed Precision · XLA · Profiler
📚 Seri Belajar TensorFlow / Learn TensorFlow Series

📑 Daftar Isi — Page 4

📑 Table of Contents — Page 4

  1. Masalah: I/O Bottleneck — GPU menganggur karena CPU lambat
  2. tf.data.Dataset API — Membuat dataset dari berbagai sumber
  3. The Golden Pattern — shuffle → map → batch → prefetch
  4. Cache — Simpan di RAM setelah epoch pertama
  5. Parallel Preprocessing — num_parallel_calls=AUTOTUNE
  6. TFRecord — Format binary optimal untuk dataset besar
  7. Mixed Precision Training — Float16 untuk 2× speedup
  8. XLA Compilation — Just-in-time optimization
  9. TF Profiler — Temukan dan perbaiki bottleneck
  10. Benchmark: Sebelum vs Sesudah Optimasi
  11. Ringkasan & Preview Page 5
  1. The Problem: I/O Bottleneck — GPU idle because CPU is slow
  2. tf.data.Dataset API — Creating datasets from various sources
  3. The Golden Pattern — shuffle → map → batch → prefetch
  4. Cache — Store in RAM after first epoch
  5. Parallel Preprocessing — num_parallel_calls=AUTOTUNE
  6. TFRecord — Optimal binary format for large datasets
  7. Mixed Precision Training — Float16 for 2× speedup
  8. XLA Compilation — Just-in-time optimization
  9. TF Profiler — Find and fix bottlenecks
  10. Benchmark: Before vs After Optimization
  11. Summary & Page 5 Preview
🐌

1. Masalah: I/O Bottleneck — GPU Menganggur

1. The Problem: I/O Bottleneck — GPU Sitting Idle

Tanpa optimasi, GPU Anda menganggur 50-90% waktu training
Without optimization, your GPU is idle 50-90% of training time

Bayangkan Anda punya GPU seharga $10,000 (A100) tapi ia menganggur sebagian besar waktu karena menunggu CPU menyiapkan data. Ini adalah masalah paling umum dan paling mudah diperbaiki di deep learning training. Tanpa optimasi data pipeline, alur training Anda terlihat seperti ini:

Imagine you have a $10,000 GPU (an A100) but it sits idle most of the time, waiting for the CPU to prepare data. This is the most common, and most easily fixed, problem in deep learning training. Without data pipeline optimization, your training flow looks like this:

Naive vs Optimized Data Pipeline

❌ NAIVE (sequential) — default jika tidak optimize:
┌────────────────────────────────────────────────────────────────┐
│ CPU: [Load B1]           [Load B2]           [Load B3]         │
│ GPU:           [Train B1]          [Train B2]          [Train B3]
│                                                                │
│ Timeline: ████████████████████████████████████████████████████ │
│ GPU Utilization: ~40% (menganggur 60% waktu!)                  │
└────────────────────────────────────────────────────────────────┘

✅ OPTIMIZED (prefetch + parallel map + cache):
┌────────────────────────────────────────────────────────────────┐
│ CPU: [Load B1][Load B2][Load B3][Load B4][Load B5]             │
│ GPU:          [Train B1][Train B2][Train B3][Train B4]         │
│                                                                │
│ Timeline: ████████████████████████████████                    │
│ GPU Utilization: ~95% (hampir selalu aktif!)                   │
└────────────────────────────────────────────────────────────────┘

Key:
  prefetch()     = prepare next batch WHILE GPU trains current batch
  cache()        = don't reload from disk after first epoch
  parallel map() = use multiple CPU cores for preprocessing

Solusinya: tf.data — API yang dirancang khusus untuk membangun data pipeline yang efisien, paralel, dan overlap antara CPU preprocessing dan GPU training.

The solution: tf.data — an API specifically designed to build efficient, parallel data pipelines that overlap CPU preprocessing with GPU training.

📊

2. tf.data.Dataset API — Membuat Dataset

2. tf.data.Dataset API — Creating Datasets

Dari array, file, generator, atau TFRecord — semuanya bisa
From arrays, files, generators, or TFRecords — all possible
22_create_dataset.py — Berbagai Cara Membuat Dataset
import tensorflow as tf
import numpy as np

# ===========================
# 1. From NumPy arrays (most common for small datasets)
# ===========================
X = np.random.randn(1000, 32, 32, 3).astype('float32')
y = np.random.randint(0, 10, size=(1000,))

dataset = tf.data.Dataset.from_tensor_slices((X, y))
print(f"Dataset: {dataset}")
# <TensorSliceDataset element_spec=(TensorSpec(shape=(32, 32, 3), dtype=tf.float32, name=None),
#                                   TensorSpec(shape=(), dtype=tf.int64, name=None))>

# Iterate
for image, label in dataset.take(3):
    print(image.shape, label.numpy())
# (32, 32, 3) 7
# (32, 32, 3) 2
# (32, 32, 3) 5

# ===========================
# 2. From files on disk (large datasets)
# ===========================
file_ds = tf.data.Dataset.list_files("data/train/*/*.jpg", shuffle=True)
# Lists all .jpg files matching the glob pattern

def load_and_preprocess(file_path):
    # Read file
    raw = tf.io.read_file(file_path)
    # Decode image
    img = tf.image.decode_jpeg(raw, channels=3)
    # Resize
    img = tf.image.resize(img, [224, 224])
    # Normalize
    img = tf.cast(img, tf.float32) / 255.0
    # Extract label from path: "data/train/cats/img001.jpg" → "cats"
    parts = tf.strings.split(file_path, '/')
    label_str = parts[-2]  # "cats"
    return img, label_str

image_ds = file_ds.map(load_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE)

# ===========================
# 3. From generator (custom logic, unlimited)
# ===========================
def data_generator():
    for i in range(10000):
        image = np.random.randn(32, 32, 3).astype('float32')
        label = np.random.randint(0, 10)
        yield image, label

gen_ds = tf.data.Dataset.from_generator(
    data_generator,
    output_signature=(
        tf.TensorSpec(shape=(32,32,3), dtype=tf.float32),
        tf.TensorSpec(shape=(), dtype=tf.int32)
    )
)

# ===========================
# 4. From TFRecord (Section 6)
# ===========================
# tfrecord_ds = tf.data.TFRecordDataset("data.tfrecord")

3. The Golden Pattern — 5 Langkah Wajib

3. The Golden Pattern — 5 Mandatory Steps

shuffle → map → batch → prefetch — ini yang bikin training 3-10× lebih cepat
shuffle → map → batch → prefetch — this is what makes training 3-10× faster
23_golden_pattern.py — The Performance Trifecta 🔥
import tensorflow as tf
AUTOTUNE = tf.data.AUTOTUNE

# Load CIFAR-10 as example (test split doubles as validation data)
(X_train, y_train), (X_val, y_val) = tf.keras.datasets.cifar10.load_data()
dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
val_dataset = tf.data.Dataset.from_tensor_slices((X_val, y_val))

# Preprocessing function
def preprocess(image, label):
    image = tf.cast(image, tf.float32) / 255.0
    return image, label

# Augmentation function (only for training!)
def augment(image, label):
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.1)
    image = tf.image.random_contrast(image, 0.9, 1.1)
    return image, label

# ═══════════════════════════════════════
# THE GOLDEN PATTERN ⭐
# ═══════════════════════════════════════

train_ds = (dataset
    # Step 1: SHUFFLE — randomize order
    # buffer_size should be >= dataset size for perfect shuffle
    # For very large datasets, use smaller buffer (e.g. 10000)
    .shuffle(buffer_size=50000, reshuffle_each_iteration=True)

    # Step 2: MAP — preprocess (parallel!)
    # num_parallel_calls → use multiple CPU cores
    # AUTOTUNE → TF determines optimal thread count
    .map(preprocess, num_parallel_calls=AUTOTUNE)

    # Step 2b: MAP — augment (also parallel)
    .map(augment, num_parallel_calls=AUTOTUNE)

    # Step 3: BATCH — group into mini-batches
    .batch(64)

    # Step 4: PREFETCH — overlap CPU/GPU
    # While GPU trains batch N, CPU prepares batch N+1
    .prefetch(buffer_size=AUTOTUNE)
)

# Validation pipeline (NO augmentation, NO shuffle)
val_ds = (val_dataset
    .map(preprocess, num_parallel_calls=AUTOTUNE)
    .batch(64)
    .prefetch(AUTOTUNE)
)

# Use in training:
# model.fit(train_ds, validation_data=val_ds, epochs=50)

# ═══════════════════════════════════════
# WHY THIS ORDER MATTERS
# ═══════════════════════════════════════
# 1. shuffle BEFORE map → randomize which items get processed
# 2. map BEFORE batch → process individual items (not batches)
# 3. batch BEFORE prefetch → prefetch complete batches
# 4. prefetch LAST → always last in the chain!

# WRONG ORDER examples:
# ❌ batch → shuffle → map → prefetch  (shuffles batches, not items!)
# ❌ map → shuffle → batch → prefetch  (wastes compute on shuffled items)

🎓 Kenapa num_parallel_calls=AUTOTUNE Penting?
Tanpa parameter ini, map() memproses data satu per satu (sequential) — hanya satu CPU core bekerja. Dengan AUTOTUNE, TensorFlow otomatis menentukan jumlah thread optimal berdasarkan CPU Anda. Di mesin 8-core, ini bisa meningkatkan kecepatan preprocessing 4-8×!

🎓 Why Does num_parallel_calls=AUTOTUNE Matter?
Without this parameter, map() processes data one by one (sequential) — only one CPU core works. With AUTOTUNE, TensorFlow automatically determines the optimal number of threads based on your CPU. On an 8-core machine, this can speed up preprocessing by 4-8×!
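You can see this effect directly with a small timing sketch. Note this is an illustration, not code from the article: `slow_op` is a hypothetical stand-in that burns ~1 ms per item, the way heavy image decoding would.

```python
import time
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

def slow_op(x):
    # tf.py_function wrapping time.sleep simulates ~1 ms of CPU-bound
    # preprocessing per item (sleep releases the GIL, so parallel map
    # threads genuinely overlap)
    def _work(v):
        time.sleep(0.001)
        return v
    return tf.py_function(_work, [x], tf.int64)

ds = tf.data.Dataset.range(512)

for name, mapped in [
    ("sequential", ds.map(slow_op)),
    ("AUTOTUNE  ", ds.map(slow_op, num_parallel_calls=AUTOTUNE)),
]:
    start = time.time()
    for _ in mapped:
        pass
    print(f"{name}: {time.time() - start:.2f}s")
# On a multi-core machine the AUTOTUNE run finishes several times faster.
```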

💾

4. Cache — RAM adalah Teman Terbaik Anda

4. Cache — RAM is Your Best Friend

Simpan dataset yang sudah dipreprocess di memori → epoch 2+ langsung instant
Store preprocessed dataset in memory → epoch 2+ becomes instant
24_cache_strategies.py — Cache In-Memory & On-Disk
import tensorflow as tf
AUTOTUNE = tf.data.AUTOTUNE

# ===========================
# Strategy 1: In-memory cache (dataset fits in RAM)
# ===========================
train_ds = (dataset
    .map(preprocess, num_parallel_calls=AUTOTUNE)
    .cache()                     # ← cache in RAM!
    .shuffle(50000)              # shuffle AFTER cache
    .map(augment, num_parallel_calls=AUTOTUNE)  # augment AFTER cache
    .batch(64)
    .prefetch(AUTOTUNE)
)
# Epoch 1: normal speed (reads from disk, preprocesses, caches)
# Epoch 2+: MUCH faster (reads from RAM, skips preprocess!)
# BUT augmentation still runs fresh each epoch (good!)

# ===========================
# Strategy 2: On-disk cache (too large for RAM)
# ===========================
train_ds = (dataset
    .map(preprocess, num_parallel_calls=AUTOTUNE)
    .cache("/tmp/train_cache")   # ← cache to SSD!
    .shuffle(10000)
    .batch(64)
    .prefetch(AUTOTUNE)
)
# Slower than RAM cache, but faster than re-reading raw files
# Useful for datasets > 16GB (e.g., ImageNet)

# ===========================
# IMPORTANT: Order of cache() matters!
# ===========================
# ✅ preprocess → cache → shuffle → augment → batch → prefetch
#    (cache preprocessed data, augment fresh each epoch)
# ❌ cache → preprocess → augment → batch → prefetch
#    (caches raw data — still needs to preprocess every time!)
# ❌ preprocess → augment → cache → batch → prefetch
#    (caches augmented data — same augmentation every epoch!)

# CIFAR-10: 50k images × 32×32×3 × 4 bytes = ~600 MB
# → Fits easily in RAM. Use .cache()!
# ImageNet: 1.2M images × 224×224×3 × 4 bytes = ~680 GB
# → Use .cache("/ssd/path") or TFRecord + no cache
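The back-of-envelope arithmetic in the comments above generalizes to any dataset. A minimal sketch (`cache_footprint_gb` is a hypothetical helper, not a tf.data API) for deciding between in-memory and on-disk cache:

```python
import numpy as np

def cache_footprint_gb(num_images, height, width, channels, dtype=np.float32):
    """Approximate in-RAM size of a decoded, preprocessed image dataset."""
    bytes_per_image = height * width * channels * np.dtype(dtype).itemsize
    return num_images * bytes_per_image / 1024**3

cifar10 = cache_footprint_gb(50_000, 32, 32, 3)        # fits in RAM → .cache()
imagenet = cache_footprint_gb(1_200_000, 224, 224, 3)  # far too big → disk cache / TFRecord
print(f"CIFAR-10: {cifar10:.2f} GB, ImageNet: {imagenet:.0f} GB")
# → CIFAR-10: 0.57 GB, ImageNet: 673 GB
```

If the result is comfortably below your free RAM, use `.cache()`; otherwise cache to SSD or skip caching entirely.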
🔀

5. Parallel Preprocessing — Multi-Core CPU

5. Parallel Preprocessing — Multi-Core CPU

Gunakan semua core CPU untuk mempersiapkan data
Use all CPU cores to prepare data
25_parallel_loading.py — Interleave & Parallel Map
import tensorflow as tf
AUTOTUNE = tf.data.AUTOTUNE

# ===========================
# 1. Parallel map (most common)
# ===========================
ds = dataset.map(preprocess, num_parallel_calls=AUTOTUNE)
# AUTOTUNE → TF picks optimal thread count
# On 8-core CPU: uses ~6-7 cores for map, 1 for control

# You can also set manually:
ds = dataset.map(preprocess, num_parallel_calls=8)

# ===========================
# 2. Interleave — parallel file reading
# For multiple TFRecord files or large file datasets
# ===========================
files = tf.data.Dataset.list_files("data/*.tfrecord")
ds = files.interleave(
    lambda f: tf.data.TFRecordDataset(f),
    cycle_length=4,              # read 4 files simultaneously
    num_parallel_calls=AUTOTUNE,  # parallel I/O
    deterministic=False           # faster (order doesn't matter for training)
)

# ===========================
# 3. Fused map + batch
# NOTE: tf.data.experimental.map_and_batch is deprecated in TF 2.x.
# A plain .map() followed by .batch() is fused automatically by
# tf.data's internal optimizer, so just write:
# ===========================
ds = (dataset
    .map(preprocess, num_parallel_calls=AUTOTUNE)
    .batch(64))

# ===========================
# 4. Deterministic vs Non-deterministic
# ===========================
# deterministic=True  → exact same order every run (for debugging)
# deterministic=False → faster! (order varies between runs)
options = tf.data.Options()
options.deterministic = False  # faster for training
ds = ds.with_options(options)
📦

6. TFRecord — Format Binary untuk Dataset Besar

6. TFRecord — Binary Format for Large Datasets

Sequential read 3-5× lebih cepat dari random file access
Sequential read 3-5× faster than random file access

TFRecord menyimpan data sebagai serialized protocol buffers dalam file binary. Keuntungan: sequential read sangat cepat (SSD maupun HDD), mendukung kompresi (GZIP), dan optimal untuk cloud storage (GCS). Semua ML pipeline production di Google menggunakan TFRecord.

TFRecord stores data as serialized protocol buffers in binary files. Benefits: very fast sequential reads (SSD and HDD), supports compression (GZIP), and optimal for cloud storage (GCS). All production ML pipelines at Google use TFRecord.

26_tfrecord_complete.py — Write & Read TFRecords
import tensorflow as tf
import numpy as np

# ===========================
# 1. Helper functions
# ===========================
def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def _float_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

# ===========================
# 2. Write TFRecord
# ===========================
def serialize_example(image, label):
    """Serialize one image + label to TFRecord format"""
    feature = {
        'image': _bytes_feature(tf.io.serialize_tensor(image).numpy()),
        'label': _int64_feature(int(label)),
        'height': _int64_feature(image.shape[0]),
        'width': _int64_feature(image.shape[1]),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()

# Write dataset to multiple TFRecord shards
num_shards = 4
writers = [tf.io.TFRecordWriter(f'data/train_{i}.tfrecord') for i in range(num_shards)]

for idx, (img, lbl) in enumerate(zip(X_train, y_train)):
    shard = idx % num_shards
    writers[shard].write(serialize_example(img, lbl))

for w in writers:
    w.close()
print(f"Written {len(X_train)} examples to {num_shards} shards")

# ===========================
# 3. Read TFRecord
# ===========================
feature_description = {
    'image': tf.io.FixedLenFeature([], tf.string),
    'label': tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    example = tf.io.parse_single_example(serialized, feature_description)
    image = tf.io.parse_tensor(example['image'], out_type=tf.uint8)
    image = tf.reshape(image, [32, 32, 3])
    image = tf.cast(image, tf.float32) / 255.0
    label = example['label']
    return image, label

# Build optimized pipeline from TFRecords
files = tf.data.Dataset.list_files('data/train_*.tfrecord')
dataset = files.interleave(
    tf.data.TFRecordDataset,
    cycle_length=4,
    num_parallel_calls=tf.data.AUTOTUNE
)
dataset = (dataset
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(10000)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)
)
🔥

7. Mixed Precision Training — Float16 untuk 2× Speedup

7. Mixed Precision Training — Float16 for 2× Speedup

Satu baris kode = training 1.5-3× lebih cepat + memory 50% lebih hemat
One line of code = 1.5-3× faster training + 50% less memory

Mixed precision = menggunakan float16 untuk komputasi (cepat di GPU Tensor Cores) sambil mempertahankan float32 untuk weight updates (numerically stable). Hasilnya: training lebih cepat tanpa kehilangan akurasi.

Mixed precision = using float16 for computation (fast on GPU Tensor Cores) while keeping float32 for weight updates (numerically stable). Result: faster training without losing accuracy.

27_mixed_precision.py — One Line Speedup
import tensorflow as tf

# ===========================
# Enable mixed precision — ONE LINE!
# ===========================
tf.keras.mixed_precision.set_global_policy('mixed_float16')

# Every layer now automatically uses:
# - Compute: float16 (fast on Tensor Cores)
# - Weight updates: float32 (numerically stable)
# - Activations: float16 (less memory)

# IMPORTANT: the output layer must stay float32!
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),  # auto float16
    tf.keras.layers.Dense(128, activation='relu'),  # auto float16
    tf.keras.layers.Dense(10, activation='softmax',
                          dtype='float32'),        # ← EXPLICIT float32!
])
# Why float32 output? Softmax + cross-entropy in float16
# can overflow/underflow. Float32 at the output keeps it stable.

# ===========================
# Performance impact (GPUs with Tensor Cores)
# ===========================
# GPUs without Tensor Cores (GTX 1080, K80): NO speedup
# T4 (Google Colab free!):  1.5-2× speedup ✅
# V100:                     1.5-2× speedup ✅
# A100:                     2-3× speedup ✅✅
# RTX 3090/4090:            2-3× speedup ✅✅
# Memory: ~50% less → can double batch size!

# Check current policy:
print(tf.keras.mixed_precision.global_policy())
# <Policy "mixed_float16">

# Reset to default:
# tf.keras.mixed_precision.set_global_policy('float32')
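One caveat the block above doesn't show: model.fit handles float16 gradient underflow for you, but in a custom training loop you should wrap the optimizer in a LossScaleOptimizer yourself. A sketch of the standard pattern, assuming the TF 2.x `tf.keras.mixed_precision.LossScaleOptimizer` API with its scale/unscale method pair:

```python
import tensorflow as tf

tf.keras.mixed_precision.set_global_policy('mixed_float16')

# Wrap the optimizer so tiny float16 gradients don't underflow to zero
optimizer = tf.keras.mixed_precision.LossScaleOptimizer(
    tf.keras.optimizers.Adam())

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, dtype='float32'),  # output stays float32
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        loss = loss_fn(y, logits)
        # Scale the loss up before backprop...
        scaled_loss = optimizer.get_scaled_loss(loss)
    scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
    # ...and unscale the gradients before applying the update
    grads = optimizer.get_unscaled_gradients(scaled_grads)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```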

🎉 Pro Tip Google Colab: Colab gratis memberikan GPU T4 yang PUNYA Tensor Cores. Selalu aktifkan mixed precision di Colab! Satu baris kode = training 1.5× lebih cepat. Untuk production di A100, speedup bisa sampai 3×.

🎉 Pro Tip Google Colab: Free Colab gives you a T4 GPU that HAS Tensor Cores. Always enable mixed precision on Colab! One line of code = 1.5× faster training. For production on A100, speedup can reach 3×.
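Not sure whether your GPU has Tensor Cores? FP16 Tensor Cores arrived with compute capability 7.0 (Volta), so a quick check with `tf.config.experimental.get_device_details` (available in TF 2.4+) can tell you before you flip the policy. A small sketch:

```python
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if not gpus:
    print("No GPU found — mixed precision gives no speedup on CPU")
for gpu in gpus:
    details = tf.config.experimental.get_device_details(gpu)
    cc = details.get('compute_capability')   # e.g. (7, 5) on a T4
    name = details.get('device_name', gpu.name)
    if cc and cc >= (7, 0):
        print(f"{name}: compute capability {cc} — has Tensor Cores, "
              f"enable mixed_float16")
    else:
        print(f"{name}: compute capability {cc} — no FP16 Tensor Cores, "
              f"expect little speedup")
```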

⚙️

8. XLA Compilation — Fuse Operations

8. XLA Compilation — Fuse Operations

Just-in-time compilation yang menggabungkan operasi untuk eksekusi lebih cepat
Just-in-time compilation that fuses operations for faster execution
28_xla.py — XLA Compilation
import tensorflow as tf

# ===========================
# Method 1: jit_compile in model.compile()
# ===========================
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'],
    jit_compile=True   # ← enable XLA!
)
# XLA fuses multiple operations into single GPU kernels
# Dense + BiasAdd + ReLU → one fused kernel (fewer memory accesses)
# Typical speedup: 10-30% on top of other optimizations

# ===========================
# Method 2: @tf.function with jit_compile
# ===========================
@tf.function(jit_compile=True)
def train_step(x, y):
    with tf.GradientTape() as tape:
        pred = model(x, training=True)
        loss = loss_fn(y, pred)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# ⚠️ Caveats:
# - First call is slow (compilation time)
# - Not all ops support XLA (most Keras layers do)
# - Dynamic shapes may not work
# - Best for fixed-shape, compute-heavy models
📊

9. TF Profiler — Temukan Bottleneck

9. TF Profiler — Find Bottlenecks

Jangan menebak — ukur di mana waktu training habis
Don't guess — measure where training time goes
29_profiler.py — TF Profiler Setup
import tensorflow as tf

# ===========================
# 1. Profile with TensorBoard callback
# ===========================
tensorboard_cb = tf.keras.callbacks.TensorBoard(
    log_dir="logs/profile",
    profile_batch='10,20'  # profile batches 10 through 20
    # Skip first batches (warmup) for accurate measurement
)

model.fit(X_train, y_train,
          epochs=3, batch_size=64,
          callbacks=[tensorboard_cb])

# View: tensorboard --logdir logs/profile
# Navigate to "Profile" tab

# ===========================
# 2. What to look for
# ===========================
# "Input Bound" → data pipeline is the bottleneck
#   Fix: add prefetch(), cache(), increase num_parallel_calls
# 
# "Device Bound" → GPU computation is the bottleneck
#   Fix: enable mixed precision, use XLA, reduce model size
# 
# "Host Bound" → CPU is the bottleneck
#   Fix: move preprocessing to GPU (tf.image ops), use TFRecord

# ===========================
# 3. Quick timing benchmark
# ===========================
import time

def benchmark(dataset, name, num_batches=100):
    start = time.time()
    for i, (x, y) in enumerate(dataset):
        if i >= num_batches: break
        _ = x  # force evaluation
    elapsed = time.time() - start
    print(f"{name}: {elapsed:.2f}s for {num_batches} batches")
    print(f"  → {num_batches/elapsed:.0f} batches/sec")

# Compare naive vs optimized
naive_ds = dataset.batch(64)
optimized_ds = (dataset
    .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .cache().batch(64).prefetch(tf.data.AUTOTUNE))

benchmark(naive_ds, "Naive")
benchmark(optimized_ds, "Optimized")
# Naive:     3.42s (29 batches/sec)
# Optimized: 0.38s (263 batches/sec) → 9× faster! 🚀
📈

10. Benchmark: Sebelum vs Sesudah Optimasi

10. Benchmark: Before vs After Optimization

Dampak setiap optimasi — angka real dari CIFAR-10 training
Impact of each optimization — real numbers from CIFAR-10 training
Optimasi            | Tanpa         | Dengan        | Speedup | Effort
--------------------|---------------|---------------|---------|------------------
prefetch(AUTOTUNE)  | 100 sec/epoch | 55 sec/epoch  | 1.8×    | 1 baris kode
+ cache()           | 55 sec/epoch  | 18 sec/epoch  | 3.1×    | 1 baris kode
+ parallel map      | 18 sec/epoch  | 12 sec/epoch  | 1.5×    | 1 parameter
+ mixed precision   | 12 sec/epoch  | 7 sec/epoch   | 1.7×    | 1 baris kode
+ XLA compile       | 7 sec/epoch   | 5.5 sec/epoch | 1.3×    | 1 parameter
TOTAL               | 100 sec       | 5.5 sec       | 18×     | 5 baris kode!

Optimization        | Without       | With          | Speedup | Effort
--------------------|---------------|---------------|---------|------------------
prefetch(AUTOTUNE)  | 100 sec/epoch | 55 sec/epoch  | 1.8×    | 1 line of code
+ cache()           | 55 sec/epoch  | 18 sec/epoch  | 3.1×    | 1 line of code
+ parallel map      | 18 sec/epoch  | 12 sec/epoch  | 1.5×    | 1 parameter
+ mixed precision   | 12 sec/epoch  | 7 sec/epoch   | 1.7×    | 1 line of code
+ XLA compile       | 7 sec/epoch   | 5.5 sec/epoch | 1.3×    | 1 parameter
TOTAL               | 100 sec       | 5.5 sec       | 18×     | 5 lines of code!

🎉 18× Speedup dengan 5 Baris Kode!
Ini adalah ROI tertinggi yang bisa Anda dapatkan di deep learning. Sebelum mengubah arsitektur model, sebelum membeli GPU lebih mahal — optimalkan data pipeline dulu. Checklist minimum untuk setiap training: prefetch(AUTOTUNE) + cache() + num_parallel_calls=AUTOTUNE.

🎉 18× Speedup with 5 Lines of Code!
This is the highest ROI you can get in deep learning. Before changing model architecture, before buying more expensive GPUs — optimize your data pipeline first. Minimum checklist for every training: prefetch(AUTOTUNE) + cache() + num_parallel_calls=AUTOTUNE.
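The minimum checklist above, condensed into one runnable pipeline. A sketch with synthetic stand-in data: `X`, `y`, and `preprocess` here are placeholders for your real arrays and map function.

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

def preprocess(image, label):
    return tf.cast(image, tf.float32) / 255.0, label

# Synthetic stand-in data (swap in your real dataset)
X = tf.random.uniform([1_000, 32, 32, 3], maxval=256, dtype=tf.int32)
y = tf.random.uniform([1_000], maxval=10, dtype=tf.int32)

train_ds = (tf.data.Dataset.from_tensor_slices((X, y))
    .map(preprocess, num_parallel_calls=AUTOTUNE)  # use all CPU cores
    .cache()                                       # RAM after epoch 1
    .shuffle(1_000)                                # fresh order each epoch
    .batch(64)
    .prefetch(AUTOTUNE))                           # overlap CPU and GPU work

for images, labels in train_ds.take(1):
    print(images.shape, images.dtype)  # (64, 32, 32, 3) <dtype: 'float32'>
```

Note the order: cache sits after the deterministic preprocessing but before shuffle, so epoch 2+ reads from RAM while the shuffle order still changes every epoch.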

📝

11. Ringkasan Page 4

11. Page 4 Summary

Semua optimasi yang sudah kita pelajari
All optimizations we learned
Konsep          | Apa Itu                      | Kode Kunci
----------------|------------------------------|---------------------------------------
tf.data.Dataset | API data pipeline            | from_tensor_slices(), list_files()
shuffle()       | Randomize urutan data        | .shuffle(buffer_size=50000)
map()           | Apply preprocessing function | .map(fn, num_parallel_calls=AUTOTUNE)
batch()         | Group menjadi mini-batch     | .batch(64)
prefetch()      | Overlap CPU/GPU work         | .prefetch(AUTOTUNE)
cache()         | Simpan di RAM/disk           | .cache() atau .cache("/path")
TFRecord        | Format binary optimal        | TFRecordWriter, TFRecordDataset
Mixed Precision | Float16 computation          | set_global_policy('mixed_float16')
XLA             | JIT compilation              | jit_compile=True
TF Profiler     | Find bottlenecks             | TensorBoard(profile_batch='10,20')

Concept         | What It Is                   | Key Code
----------------|------------------------------|---------------------------------------
tf.data.Dataset | Data pipeline API            | from_tensor_slices(), list_files()
shuffle()       | Randomize data order         | .shuffle(buffer_size=50000)
map()           | Apply preprocessing function | .map(fn, num_parallel_calls=AUTOTUNE)
batch()         | Group into mini-batches      | .batch(64)
prefetch()      | Overlap CPU/GPU work         | .prefetch(AUTOTUNE)
cache()         | Store in RAM/disk            | .cache() or .cache("/path")
TFRecord        | Optimal binary format        | TFRecordWriter, TFRecordDataset
Mixed Precision | Float16 computation          | set_global_policy('mixed_float16')
XLA             | JIT compilation              | jit_compile=True
TF Profiler     | Find bottlenecks             | TensorBoard(profile_batch='10,20')
← Page Sebelumnya / Previous Page

Page 3 — CNN & Image Classification

📘

Coming Next: Page 5 — NLP dengan TensorFlow

Memproses teks dengan TensorFlow: TextVectorization layer, Embedding layer yang mengubah kata menjadi vektor bermakna, LSTM dan GRU di Keras, Bidirectional LSTM, klasifikasi sentimen IMDB reviews (87%+), dan integrasi TF Hub pre-trained text models. Dari preprocessing sampai model NLP production!

📘

Coming Next: Page 5 — NLP with TensorFlow

Processing text with TensorFlow: TextVectorization layer, Embedding layer that turns words into meaningful vectors, LSTM and GRU in Keras, Bidirectional LSTM, IMDB review sentiment classification (87%+), and TF Hub pre-trained text model integration. From preprocessing to production NLP models!