📑 Table of Contents — Page 4
- The Problem: I/O Bottleneck — GPU idle because CPU is slow
- tf.data.Dataset API — Creating datasets from various sources
- The Golden Pattern — shuffle → map → batch → prefetch
- Cache — Store in RAM after first epoch
- Parallel Preprocessing — num_parallel_calls=AUTOTUNE
- TFRecord — Optimal binary format for large datasets
- Mixed Precision Training — Float16 for 2× speedup
- XLA Compilation — Just-in-time optimization
- TF Profiler — Find and fix bottlenecks
- Benchmark: Before vs After Optimization
- Summary & Page 5 Preview
1. The Problem: I/O Bottleneck — GPU Sitting Idle
Imagine you have a $10,000 GPU (an A100) that sits idle most of the time because it is waiting for the CPU to prepare data. This is the most common — and the easiest to fix — problem in deep learning training. Without data pipeline optimization, the CPU and GPU simply take turns: the CPU loads and preprocesses a batch while the GPU waits, then the GPU trains on that batch while the CPU sits idle.
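A back-of-the-envelope model makes the waste concrete. The per-batch timings below are hypothetical, chosen only to illustrate the arithmetic:

```python
# Hypothetical per-batch timings (seconds) for a naive, sequential pipeline
cpu_prep = 0.08   # CPU reads + preprocesses one batch
gpu_step = 0.02   # GPU trains on one batch

# Sequential: the GPU waits for the CPU every single batch
step_time_naive = cpu_prep + gpu_step                # 0.10 s/batch
gpu_utilization_naive = gpu_step / step_time_naive   # GPU busy only 20% of the time

# Overlapped (what prefetch buys you): the CPU prepares batch N+1
# while the GPU trains on batch N, so only the slower side matters
step_time_overlap = max(cpu_prep, gpu_step)          # 0.08 s/batch

print(f"naive: {step_time_naive:.2f}s/batch, GPU busy {gpu_utilization_naive:.0%}")
print(f"overlapped: {step_time_overlap:.2f}s/batch")
```

Even with perfect overlap the CPU is still the bottleneck here, which is why the rest of this page is about making the CPU side (loading, decoding, preprocessing) faster too.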
The solution: tf.data — an API specifically designed to build efficient, parallel data pipelines that overlap CPU preprocessing with GPU training.
2. tf.data.Dataset API — Creating Datasets
```python
import tensorflow as tf
import numpy as np

# ===========================
# 1. From NumPy arrays (most common for small datasets)
# ===========================
X = np.random.randn(1000, 32, 32, 3).astype('float32')
y = np.random.randint(0, 10, size=(1000,))

dataset = tf.data.Dataset.from_tensor_slices((X, y))
print(f"Dataset: {dataset}")
# <TensorSliceDataset element_spec=(TensorSpec(shape=(32, 32, 3), ...), TensorSpec(shape=(), ...))>

# Iterate
for image, label in dataset.take(3):
    print(image.shape, label.numpy())
# (32, 32, 3) 7
# (32, 32, 3) 2
# (32, 32, 3) 5

# ===========================
# 2. From files on disk (large datasets)
# ===========================
file_ds = tf.data.Dataset.list_files("data/train/*/*.jpg", shuffle=True)
# Lists all .jpg files matching the glob pattern

def load_and_preprocess(file_path):
    raw = tf.io.read_file(file_path)              # read file
    img = tf.image.decode_jpeg(raw, channels=3)   # decode image
    img = tf.image.resize(img, [224, 224])        # resize
    img = tf.cast(img, tf.float32) / 255.0        # normalize
    # Extract label from path: "data/train/cats/img001.jpg" → "cats"
    parts = tf.strings.split(file_path, '/')
    label_str = parts[-2]  # "cats"
    return img, label_str

image_ds = file_ds.map(load_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE)

# ===========================
# 3. From a generator (custom logic, unlimited)
# ===========================
def data_generator():
    for i in range(10000):
        image = np.random.randn(32, 32, 3).astype('float32')
        label = np.random.randint(0, 10)
        yield image, label

gen_ds = tf.data.Dataset.from_generator(
    data_generator,
    output_signature=(
        tf.TensorSpec(shape=(32, 32, 3), dtype=tf.float32),
        tf.TensorSpec(shape=(), dtype=tf.int32),
    )
)

# ===========================
# 4. From TFRecord (see Section 6)
# ===========================
# tfrecord_ds = tf.data.TFRecordDataset("data.tfrecord")
```
3. The Golden Pattern — 5 Mandatory Steps
```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

# Load CIFAR-10 as example
(X_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))

# Preprocessing function
def preprocess(image, label):
    image = tf.cast(image, tf.float32) / 255.0
    return image, label

# Augmentation function (only for training!)
def augment(image, label):
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.1)
    image = tf.image.random_contrast(image, 0.9, 1.1)
    return image, label

# ═══════════════════════════════════════
# THE GOLDEN PATTERN ⭐
# ═══════════════════════════════════════
train_ds = (dataset
    # Step 1: SHUFFLE — randomize order
    # buffer_size should be >= dataset size for a perfect shuffle
    # For very large datasets, use a smaller buffer (e.g. 10000)
    .shuffle(buffer_size=50000, reshuffle_each_iteration=True)

    # Step 2: MAP — preprocess (parallel!)
    # num_parallel_calls → use multiple CPU cores
    # AUTOTUNE → TF determines the optimal thread count
    .map(preprocess, num_parallel_calls=AUTOTUNE)

    # Step 2b: MAP — augment (also parallel)
    .map(augment, num_parallel_calls=AUTOTUNE)

    # Step 3: BATCH — group into mini-batches
    .batch(64)

    # Step 4: PREFETCH — overlap CPU/GPU
    # While the GPU trains on batch N, the CPU prepares batch N+1
    .prefetch(buffer_size=AUTOTUNE)
)

# Validation pipeline (NO augmentation, NO shuffle)
# (assumes val_dataset was built from the validation split)
val_ds = (val_dataset
    .map(preprocess, num_parallel_calls=AUTOTUNE)
    .batch(64)
    .prefetch(AUTOTUNE)
)

# Use in training:
# model.fit(train_ds, validation_data=val_ds, epochs=50)

# ═══════════════════════════════════════
# WHY THIS ORDER MATTERS
# ═══════════════════════════════════════
# 1. shuffle BEFORE map    → randomize which items get processed
# 2. map BEFORE batch      → process individual items (not batches)
# 3. batch BEFORE prefetch → prefetch complete batches
# 4. prefetch LAST         → always last in the chain!

# WRONG ORDER examples:
# ❌ batch → shuffle → map → prefetch (shuffles batches, not items!)
# ❌ map → shuffle → batch → prefetch (fills the shuffle buffer with
#    large preprocessed tensors — wastes memory)
```
🎓 Why num_parallel_calls=AUTOTUNE Matters
Without this parameter, map() processes data one by one (sequential) — only one CPU core works. With AUTOTUNE, TensorFlow automatically determines the optimal number of threads based on your CPU. On an 8-core machine, this can speed up preprocessing by 4-8×!
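A quick sketch makes the effect visible. The map function below is artificially slow (a `time.sleep` stands in for real preprocessing), so the wall-clock difference between sequential and parallel `map()` is easy to see; exact timings will vary with your core count:

```python
import time
import tensorflow as tf

def slow_op(x):
    time.sleep(0.05)  # stand-in for expensive preprocessing
    return x

# Wrap the Python function so tf.data can call it
slow_map = lambda x: tf.py_function(slow_op, [x], tf.int64)

ds = tf.data.Dataset.range(16)

start = time.time()
seq = [int(v) for v in ds.map(slow_map)]  # sequential: one element at a time
t_seq = time.time() - start

start = time.time()
par = [int(v) for v in ds.map(slow_map, num_parallel_calls=tf.data.AUTOTUNE)]
t_par = time.time() - start

print(f"sequential: {t_seq:.2f}s, parallel: {t_par:.2f}s")
```

Both versions produce the same elements; only the wall-clock time changes.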
4. Cache — RAM is Your Best Friend
```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

# ===========================
# Strategy 1: In-memory cache (dataset fits in RAM)
# ===========================
train_ds = (dataset
    .map(preprocess, num_parallel_calls=AUTOTUNE)
    .cache()                                    # ← cache in RAM!
    .shuffle(50000)                             # shuffle AFTER cache
    .map(augment, num_parallel_calls=AUTOTUNE)  # augment AFTER cache
    .batch(64)
    .prefetch(AUTOTUNE)
)
# Epoch 1:  normal speed (reads from disk, preprocesses, caches)
# Epoch 2+: MUCH faster (reads from RAM, skips preprocessing!)
# BUT augmentation still runs fresh each epoch (good!)

# ===========================
# Strategy 2: On-disk cache (too large for RAM)
# ===========================
train_ds = (dataset
    .map(preprocess, num_parallel_calls=AUTOTUNE)
    .cache("/tmp/train_cache")  # ← cache to SSD!
    .shuffle(10000)
    .batch(64)
    .prefetch(AUTOTUNE)
)
# Slower than a RAM cache, but faster than re-reading raw files
# Useful for datasets > 16 GB (e.g., ImageNet)

# ===========================
# IMPORTANT: The position of cache() matters!
# ===========================
# ✅ preprocess → cache → shuffle → augment → batch → prefetch
#    (caches preprocessed data, augments fresh each epoch)
# ❌ cache → preprocess → augment → batch → prefetch
#    (caches raw data — still preprocesses every epoch!)
# ❌ preprocess → augment → cache → batch → prefetch
#    (caches augmented data — same augmentation every epoch!)

# CIFAR-10:  50k images × 32×32×3 × 4 bytes ≈ 600 MB
#            → Fits easily in RAM. Use .cache()!
# ImageNet:  1.2M images × 224×224×3 × 4 bytes ≈ 680 GB
#            → Use .cache("/ssd/path") or TFRecord + no cache
```
5. Parallel Preprocessing — Multi-Core CPU
```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

# ===========================
# 1. Parallel map (most common)
# ===========================
ds = dataset.map(preprocess, num_parallel_calls=AUTOTUNE)
# AUTOTUNE → TF picks the optimal thread count
# On an 8-core CPU: uses ~6-7 cores for map, 1 for control

# You can also set it manually:
ds = dataset.map(preprocess, num_parallel_calls=8)

# ===========================
# 2. Interleave — parallel file reading
# For multiple TFRecord files or large file datasets
# ===========================
files = tf.data.Dataset.list_files("data/*.tfrecord")
ds = files.interleave(
    lambda f: tf.data.TFRecordDataset(f),
    cycle_length=4,               # read 4 files simultaneously
    num_parallel_calls=AUTOTUNE,  # parallel I/O
    deterministic=False           # faster (order doesn't matter for training)
)

# ===========================
# 3. Batch-level parallelism (map_and_batch)
# Slightly more efficient than separate map + batch, but note
# this API is deprecated — recent TF fuses map + batch automatically
# ===========================
ds = dataset.apply(
    tf.data.experimental.map_and_batch(
        map_func=preprocess,
        batch_size=64,
        num_parallel_calls=AUTOTUNE
    )
)

# ===========================
# 4. Deterministic vs non-deterministic
# ===========================
# deterministic=True  → exact same order every run (for debugging)
# deterministic=False → faster! (order varies between runs)
options = tf.data.Options()
options.deterministic = False  # faster for training
ds = ds.with_options(options)
```
6. TFRecord — Binary Format for Large Datasets
TFRecord stores data as serialized protocol buffers in binary files. The benefits: very fast sequential reads (on SSD and HDD alike), built-in compression support (GZIP), and good behavior on cloud storage (GCS). TFRecord is the standard input format in Google's production ML pipelines.
```python
import tensorflow as tf
import numpy as np

# ===========================
# 1. Helper functions
# ===========================
def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def _float_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

# ===========================
# 2. Write TFRecord
# ===========================
def serialize_example(image, label):
    """Serialize one image + label to TFRecord format."""
    feature = {
        'image': _bytes_feature(tf.io.serialize_tensor(image).numpy()),
        'label': _int64_feature(int(label)),
        'height': _int64_feature(image.shape[0]),
        'width': _int64_feature(image.shape[1]),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()

# Write the dataset to multiple TFRecord shards
num_shards = 4
writers = [tf.io.TFRecordWriter(f'data/train_{i}.tfrecord')
           for i in range(num_shards)]

for idx, (img, lbl) in enumerate(zip(X_train, y_train)):
    shard = idx % num_shards
    writers[shard].write(serialize_example(img, lbl))

for w in writers:
    w.close()
print(f"Written {len(X_train)} examples to {num_shards} shards")

# ===========================
# 3. Read TFRecord
# ===========================
feature_description = {
    'image': tf.io.FixedLenFeature([], tf.string),
    'label': tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    example = tf.io.parse_single_example(serialized, feature_description)
    image = tf.io.parse_tensor(example['image'], out_type=tf.uint8)
    image = tf.reshape(image, [32, 32, 3])
    image = tf.cast(image, tf.float32) / 255.0
    label = example['label']
    return image, label

# Build an optimized pipeline from TFRecords
files = tf.data.Dataset.list_files('data/train_*.tfrecord')
dataset = files.interleave(
    tf.data.TFRecordDataset,
    cycle_length=4,
    num_parallel_calls=tf.data.AUTOTUNE
)
dataset = (dataset
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(10000)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)
)
```
7. Mixed Precision Training — Float16 for 2× Speedup
Mixed precision = using float16 for computation (fast on GPU Tensor Cores) while keeping float32 for weight updates (numerically stable). Result: faster training without losing accuracy.
```python
import tensorflow as tf

# ===========================
# Enable mixed precision — ONE LINE!
# ===========================
tf.keras.mixed_precision.set_global_policy('mixed_float16')

# All layers now automatically use:
# - Compute: float16 (fast on Tensor Cores)
# - Weight updates: float32 (numerically stable)
# - Activations: float16 (less memory)

# IMPORTANT: The output layer must stay float32!
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),  # auto float16
    tf.keras.layers.Dense(128, activation='relu'),  # auto float16
    tf.keras.layers.Dense(10, activation='softmax',
                          dtype='float32'),         # ← EXPLICIT float32!
])
# Why a float32 output? Softmax + cross-entropy in float16 can
# overflow/underflow. Float32 at the output keeps things stable.

# ===========================
# Performance impact (GPUs with Tensor Cores)
# ===========================
# GPU without Tensor Cores (GTX 1080, K80): NO speedup
# T4 (free Google Colab!): 1.5-2× speedup ✅
# V100:                    1.5-2× speedup ✅
# A100:                    2-3× speedup ✅✅
# RTX 3090/4090:           2-3× speedup ✅✅
# Memory: ~50% less → you can double the batch size!

# Check the current policy:
print(tf.keras.mixed_precision.global_policy())

# Reset to default:
# tf.keras.mixed_precision.set_global_policy('float32')
```
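The float16-compute / float32-variables split is easy to verify directly. A small sketch (the layer sizes here are arbitrary):

```python
import tensorflow as tf

tf.keras.mixed_precision.set_global_policy('mixed_float16')

hidden = tf.keras.layers.Dense(4)
output = tf.keras.layers.Dense(2, dtype='float32')  # explicit float32 output

x = tf.random.normal((3, 8))
h = hidden(x)   # activations come out in float16
y = output(h)   # cast back up: comes out in float32

print(h.dtype, y.dtype)     # activations: float16, output: float32
print(hidden.kernel.dtype)  # weights stay full-precision float32

tf.keras.mixed_precision.set_global_policy('float32')  # reset for later cells
```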
🎉 Pro Tip Google Colab: Free Colab gives you a T4 GPU that HAS Tensor Cores. Always enable mixed precision on Colab! One line of code = 1.5× faster training. For production on A100, speedup can reach 3×.
8. XLA Compilation — Fuse Operations
```python
import tensorflow as tf

# ===========================
# Method 1: jit_compile in model.compile()
# ===========================
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'],
    jit_compile=True  # ← enable XLA!
)
# XLA fuses multiple operations into single GPU kernels:
# Dense + BiasAdd + ReLU → one fused kernel (fewer memory accesses)
# Typical speedup: 10-30% on top of the other optimizations

# ===========================
# Method 2: @tf.function with jit_compile
# ===========================
@tf.function(jit_compile=True)
def train_step(x, y):
    with tf.GradientTape() as tape:
        pred = model(x, training=True)
        loss = loss_fn(y, pred)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# ⚠️ Caveats:
# - The first call is slow (compilation time)
# - Not all ops support XLA (most Keras layers do)
# - Dynamic shapes may not work
# - Best for fixed-shape, compute-heavy models
```
9. TF Profiler — Find Bottlenecks
```python
import tensorflow as tf
import time

# ===========================
# 1. Profile with the TensorBoard callback
# ===========================
tensorboard_cb = tf.keras.callbacks.TensorBoard(
    log_dir="logs/profile",
    profile_batch='10,20'  # profile batches 10 through 20
    # Skip the first batches (warmup) for an accurate measurement
)

model.fit(X_train, y_train, epochs=3, batch_size=64,
          callbacks=[tensorboard_cb])

# View: tensorboard --logdir logs/profile
# Navigate to the "Profile" tab

# ===========================
# 2. What to look for
# ===========================
# "Input Bound"  → the data pipeline is the bottleneck
#   Fix: add prefetch(), cache(), increase num_parallel_calls
#
# "Device Bound" → GPU computation is the bottleneck
#   Fix: enable mixed precision, use XLA, reduce model size
#
# "Host Bound"   → the CPU is the bottleneck
#   Fix: move preprocessing to the GPU (tf.image ops), use TFRecord

# ===========================
# 3. Quick timing benchmark
# ===========================
def benchmark(dataset, name, num_batches=100):
    start = time.time()
    for i, (x, y) in enumerate(dataset):
        if i >= num_batches:
            break
        _ = x  # force evaluation
    elapsed = time.time() - start
    print(f"{name}: {elapsed:.2f}s for {num_batches} batches")
    print(f"  → {num_batches/elapsed:.0f} batches/sec")

# Compare naive vs optimized
naive_ds = dataset.batch(64)
optimized_ds = (dataset
    .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .cache()
    .batch(64)
    .prefetch(tf.data.AUTOTUNE))

benchmark(naive_ds, "Naive")
benchmark(optimized_ds, "Optimized")
# Naive:     3.42s (29 batches/sec)
# Optimized: 0.38s (263 batches/sec) → 9× faster! 🚀
```
10. Benchmark: Before vs After Optimization
| Optimization | Without | With | Speedup | Effort |
|---|---|---|---|---|
| prefetch(AUTOTUNE) | 100 sec/epoch | 55 sec/epoch | 1.8× | 1 line of code |
| + cache() | 55 sec/epoch | 18 sec/epoch | 3.1× | 1 line of code |
| + parallel map | 18 sec/epoch | 12 sec/epoch | 1.5× | 1 parameter |
| + mixed precision | 12 sec/epoch | 7 sec/epoch | 1.7× | 1 line of code |
| + XLA compile | 7 sec/epoch | 5.5 sec/epoch | 1.3× | 1 parameter |
| TOTAL | 100 sec | 5.5 sec | 18× | 5 lines of code! |
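Note how the table compounds: each row's speedup is measured against the previous row, and the product of the per-step factors recovers the overall figure. A quick sanity check on the numbers:

```python
# Sec/epoch after each successive optimization (values from the table)
times = [100, 55, 18, 12, 7, 5.5]

# Per-step speedups: each relative to the previous configuration
speedups = [round(a / b, 1) for a, b in zip(times, times[1:])]
print(speedups)                        # [1.8, 3.1, 1.5, 1.7, 1.3]

# The total speedup is first time / last time (= product of the steps)
print(round(times[0] / times[-1], 1))  # 18.2 → the "18×" in the table
```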
🎉 18× Speedup with 5 Lines of Code!
This is the highest ROI you can get in deep learning. Before changing model architecture, before buying a more expensive GPU — optimize your data pipeline first. Minimum checklist for every training run: prefetch(AUTOTUNE) + cache() + num_parallel_calls=AUTOTUNE.
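As a closing sketch, here is that minimum checklist wired into one pipeline. The random arrays stand in for a real dataset; the shapes are arbitrary:

```python
import numpy as np
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

# Toy data standing in for a real dataset
X = np.random.randn(256, 32, 32, 3).astype('float32')
y = np.random.randint(0, 10, size=(256,)).astype('int64')

def preprocess(image, label):
    return image / 255.0, label

train_ds = (tf.data.Dataset.from_tensor_slices((X, y))
    .map(preprocess, num_parallel_calls=AUTOTUNE)  # parallel preprocessing
    .cache()                                       # RAM cache after epoch 1
    .shuffle(256)                                  # shuffle after cache
    .batch(64)
    .prefetch(AUTOTUNE))                           # overlap CPU and GPU work

batches = list(train_ds)
print(len(batches), batches[0][0].shape)  # 4 batches of shape (64, 32, 32, 3)
```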
11. Page 4 Summary
| Concept | What It Is | Key Code |
|---|---|---|
| tf.data.Dataset | Data pipeline API | from_tensor_slices(), list_files() |
| shuffle() | Randomize data order | .shuffle(buffer_size=50000) |
| map() | Apply preprocessing function | .map(fn, num_parallel_calls=AUTOTUNE) |
| batch() | Group into mini-batches | .batch(64) |
| prefetch() | Overlap CPU/GPU work | .prefetch(AUTOTUNE) |
| cache() | Store in RAM/disk | .cache() or .cache("/path") |
| TFRecord | Optimal binary format | TFRecordWriter, TFRecordDataset |
| Mixed Precision | Float16 computation | set_global_policy('mixed_float16') |
| XLA | JIT compilation | jit_compile=True |
| TF Profiler | Find bottlenecks | TensorBoard(profile_batch='10,20') |
Coming Next: Page 5 — NLP with TensorFlow
Processing text with TensorFlow: TextVectorization layer, Embedding layer that turns words into meaningful vectors, LSTM and GRU in Keras, Bidirectional LSTM, IMDB review sentiment classification (87%+), and TF Hub pre-trained text model integration. From preprocessing to production NLP models!