Table of Contents – Page 5
- Text → Numbers – Why and how
- TextVectorization – Preprocessing inside the model
- Embedding Layer – Words → meaningful vectors
- RNN & LSTM in Keras – Processing sequences
- GRU – Lighter LSTM alternative
- Bidirectional LSTM – Forward and backward context
- Project: IMDB Sentiment Classifier – 87%+ accuracy
- Padding & Masking – Handling variable-length sequences
- Pre-trained Embeddings – GloVe, TF Hub, Universal Sentence Encoder
- Complete NLP Pipeline – From raw text to prediction
- Summary & Page 6 Preview
1. Text → Numbers – NLP Foundation
Neural networks cannot process raw text strings; all text must first be converted to a numerical representation. In Page 6 of the Neural Network series, we covered Word2Vec and one-hot encoding. In TensorFlow, this conversion is automated by the TextVectorization and Embedding layers.
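As a minimal sketch of the idea in plain Python (a toy, hand-built vocabulary – not the Keras API, which the next section covers):

```python
# Toy text → integer encoding. Index 0 is reserved for padding and
# index 1 for out-of-vocabulary words, mirroring the convention the
# Keras TextVectorization layer uses.
vocab = {"": 0, "[UNK]": 1, "the": 2, "movie": 3, "was": 4, "great": 5}

def encode(text, vocab, seq_len=8):
    ids = [vocab.get(w, 1) for w in text.lower().split()]
    return (ids + [0] * seq_len)[:seq_len]  # pad/truncate to seq_len

print(encode("The movie was great", vocab))
# [2, 3, 4, 5, 0, 0, 0, 0]
```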
2. TextVectorization – Preprocessing Inside the Model
TextVectorization is a Keras layer that converts text strings to integer indices. Big advantage: preprocessing lives inside the model, so when you export the model, the preprocessing comes with it – no separate preprocessing step is needed at inference time.
```python
import tensorflow as tf
from tensorflow.keras import layers

# ===========================
# 1. Create TextVectorization layer
# ===========================
vectorizer = layers.TextVectorization(
    max_tokens=10000,            # vocabulary size (top 10k words)
    output_mode='int',           # output integer indices
    output_sequence_length=200,  # pad/truncate to 200 tokens
    standardize='lower_and_strip_punctuation',  # lowercase + remove punct
    split='whitespace',          # split on spaces (default)
)

# ===========================
# 2. Adapt – build vocabulary from training data
# ===========================
train_texts = [
    "I love this movie, it was amazing!",
    "Terrible film, waste of time.",
    "Great acting and wonderful story.",
    "The worst movie I have ever seen.",
    # ... thousands more
]
vectorizer.adapt(train_texts)  # builds vocabulary!

# Check vocabulary
vocab = vectorizer.get_vocabulary()
print(f"Vocab size: {len(vocab)}")
print(f"First 20: {vocab[:20]}")
# ['', '[UNK]', 'the', 'i', 'movie', 'was', 'this', ...]
# Index 0 = padding, Index 1 = unknown word

# ===========================
# 3. Vectorize text
# ===========================
sample = tf.constant(["I love this movie"])
encoded = vectorizer(sample)
print(encoded)
# tf.Tensor([[3, 42, 6, 4, 0, 0, 0, ... 0]], shape=(1, 200))
#             i  love this movie pad pad pad

# ===========================
# 4. Use INSIDE a model (BEST approach!)
# ===========================
model = tf.keras.Sequential([
    vectorizer,                   # text → integers (in-model!)
    layers.Embedding(10000, 64),  # integers → vectors
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])

# Train directly with raw strings!
# model.fit(train_texts, train_labels, epochs=10)

# At inference: model.predict(["Great movie!"]) → works directly!
# No separate preprocessing needed – it's ALL in the model.

# ===========================
# 5. output_mode options
# ===========================
# 'int'       → [42, 156, 8, 2041, 0, ...]  (for LSTM/Transformer)
# 'multi_hot' → [0, 1, 0, 0, 1, 1, 0, ...]  (bag of words)
# 'count'     → [0, 2, 0, 0, 1, 3, 0, ...]  (word counts)
# 'tf_idf'    → [0, 0.7, 0, 0, 1.2, ...]    (TF-IDF weights)
```
Why TextVectorization Inside the Model?
Without it: you need separate preprocessing during training AND inference. Risk: a preprocessing mismatch between training and production – hidden bugs.
With it: preprocessing is part of the model. Export the model (.keras or SavedModel) and the preprocessing goes with it; model.predict(["raw text"]) works directly. This is the best practice for production NLP.
3. Embedding Layer – Words → Meaningful Vectors
In Page 6 of the NN series, we built Word2Vec from scratch and saw how similar words end up with nearby vectors. The Embedding layer in Keras does the same thing: each word index is mapped to a dense vector (e.g., 64 dimensions). These vectors are learned during training, so words with similar meanings acquire similar vectors.
```python
import tensorflow as tf
from tensorflow.keras import layers
import numpy as np

# ===========================
# 1. How Embedding works
# ===========================
embedding = layers.Embedding(
    input_dim=10000,   # vocabulary size
    output_dim=64,     # embedding dimension
    input_length=200,  # sequence length (optional)
)

# Input: batch of word indices
x = tf.constant([[42, 156, 8, 2041, 0]])  # 1 sentence, 5 words
output = embedding(x)
print(output.shape)  # (1, 5, 64) → each word → 64-dim vector!

# Internally, this is just a LOOKUP TABLE:
# embedding.weights[0] has shape (10000, 64)
# Word 42  → row 42 of the table → [0.12, -0.34, ...]
# Word 156 → row 156             → [0.78, 0.23, ...]
# These rows are LEARNED during training!
print(f"Embedding matrix: {embedding.weights[0].shape}")
# (10000, 64) = 640,000 learnable parameters

# ===========================
# 2. Inspect learned embeddings
# ===========================
# After training, similar words have similar vectors:
# cosine_similarity("good", "great")    ≈  0.85
# cosine_similarity("good", "terrible") ≈ -0.72
# vector("king") - vector("man") + vector("woman") ≈ vector("queen")

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Get vectors after training:
# weights = embedding.get_weights()[0]
# vec_good = weights[vocab.index("good")]
# vec_great = weights[vocab.index("great")]
# print(cosine_similarity(vec_good, vec_great))  # ≈ 0.85

# ===========================
# 3. Embedding dimension guidelines
# ===========================
# Vocab 1k-10k:   embedding_dim = 32-64
# Vocab 10k-50k:  embedding_dim = 64-128
# Vocab 50k-100k: embedding_dim = 128-256
# Rule of thumb: dim ≈ vocab^(1/4)
# Too small: can't capture nuances
# Too large: overfitting, slow training

# ===========================
# 4. Pre-trained embeddings (GloVe, Word2Vec)
# ===========================
# Load GloVe: https://nlp.stanford.edu/projects/glove/
# glove_matrix = np.zeros((10000, 100))  # build from GloVe file
# for word, idx in word_index.items():
#     if word in glove_dict:
#         glove_matrix[idx] = glove_dict[word]

# Use pre-trained weights:
# embedding = layers.Embedding(10000, 100,
#     embeddings_initializer=tf.keras.initializers.Constant(glove_matrix),
#     trainable=False)  # freeze! (or True to fine-tune)
```
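Since the Embedding forward pass is just table indexing, it can be reproduced in a few lines of NumPy (toy sizes and random weights, purely illustrative):

```python
import numpy as np

# An Embedding layer is a trainable lookup table: row i of the
# matrix is the vector for word index i.
vocab_size, embed_dim = 10, 4
rng = np.random.default_rng(42)
table = rng.normal(size=(vocab_size, embed_dim))  # like embedding.weights[0]

word_ids = np.array([3, 7, 1])  # a 3-word "sentence"
vectors = table[word_ids]       # the entire forward pass is row indexing
print(vectors.shape)            # (3, 4)
```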
4. RNN & LSTM in Keras – Processing Sequences
In Page 5 of the Neural Network series, we implemented VanillaRNN and LSTM from scratch – forget gate, input gate, output gate, cell state – hundreds of lines of NumPy. In Keras it is one line: layers.LSTM(64). The concepts remain identical.
```python
import tensorflow as tf
from tensorflow.keras import layers

# ===========================
# 1. Simple LSTM
# ===========================
# Input: (batch, timesteps, features)
# For text: (batch, seq_len, embedding_dim)
lstm_layer = layers.LSTM(
    units=64,                 # output dimension (hidden size)
    return_sequences=False,   # only return LAST output
    # return_sequences=True,  # return ALL timestep outputs
    dropout=0.2,              # dropout on inputs
    recurrent_dropout=0.2,    # dropout on recurrent state
)
# Input:  (batch, 200, 64) → 200 timesteps, 64 features
# Output: (batch, 64)      → last hidden state (return_sequences=False)
# Output: (batch, 200, 64) → all hidden states (return_sequences=True)

# ===========================
# 2. Stacking LSTM layers
# ===========================
model = tf.keras.Sequential([
    layers.Embedding(10000, 64, input_length=200),
    # Layer 1: return ALL timestep outputs → feed to next LSTM
    layers.LSTM(128, return_sequences=True, dropout=0.2),
    # Layer 2: return only LAST output → feed to Dense
    layers.LSTM(64, return_sequences=False, dropout=0.2),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')
])

# CRITICAL: return_sequences
# When stacking LSTMs: all layers EXCEPT the last need return_sequences=True
# Last LSTM: return_sequences=False (or use GlobalAveragePooling1D)

# ===========================
# 3. Parameter count
# ===========================
# LSTM(64) with input_dim=64:
# Parameters = 4 × ((input_dim + units + 1) × units)
#            = 4 × ((64 + 64 + 1) × 64)
#            = 4 × 8,256 = 33,024
# The "4" = 4 gates: forget, input, cell candidate, output

model.summary()
# Embedding: 10000 × 64           = 640,000
# LSTM 128:  4 × (64+128+1) × 128 =  98,816
# LSTM 64:   4 × (128+64+1) × 64  =  49,408
# Dense:     64 × 64 + 64         =   4,160
# Total: ~792k parameters
```
return_sequences: When True vs False?
return_sequences=False (default): outputs only the hidden state of the last timestep. Shape: (batch, units). Use it for the last LSTM layer before a Dense head, or for classification.
return_sequences=True: outputs the hidden state at every timestep. Shape: (batch, timesteps, units). Use it when stacking LSTM layers, or for sequence-to-sequence tasks (translation, tagging).
Rule: Stacking LSTMs? Set return_sequences=True on all but the last LSTM.
5. GRU – Lighter LSTM Alternative
```python
import tensorflow as tf
from tensorflow.keras import layers

# GRU: 2 gates (reset, update) vs LSTM: 3 gates + cell state
gru_model = tf.keras.Sequential([
    layers.Embedding(10000, 64),
    layers.GRU(64, return_sequences=True, dropout=0.2),
    layers.GRU(32, dropout=0.2),
    layers.Dense(1, activation='sigmoid')
])

# GRU(64) params  = 3 × ((64 + 64 + 1) × 64) = 24,768
# LSTM(64) params = 4 × ((64 + 64 + 1) × 64) = 33,024
# (Keras' default reset_after=True adds a second recurrent bias, so
#  model.summary() reports slightly more: 3 × ((64 + 64 + 2) × 64) = 24,960)
# GRU has ~25% fewer parameters → faster training!

# ===========================
# When to use which?
# ===========================
# GRU:  faster, fewer params, good for shorter sequences
#       Try GRU first for speed, switch to LSTM if needed
# LSTM: more expressive (separate cell state), better for
#       very long sequences, slightly more accurate on some tasks
# In practice: difference is often < 1% accuracy
```
| Aspect | LSTM | GRU |
|---|---|---|
| Gates | 3 (forget, input, output) + cell | 2 (reset, update) |
| Parameters | 4 × (input + hidden + 1) × hidden | 3 × (input + hidden + 1) × hidden |
| Speed | Slower | ~25% faster |
| Memory | Separate cell state | Combined hidden/cell |
| Long Sequences | Slightly better | Good enough |
| Recommendation | Default for NLP | Try first for speed |
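The parameter formulas in the table can be checked with plain arithmetic (this ignores the extra recurrent bias Keras adds to GRU when reset_after=True):

```python
# Gate-count parameter formulas from the table above.
def lstm_params(input_dim, units):
    return 4 * ((input_dim + units + 1) * units)  # 4 gates

def gru_params(input_dim, units):
    return 3 * ((input_dim + units + 1) * units)  # 2 gates + candidate

print(lstm_params(64, 64))  # 33024
print(gru_params(64, 64))   # 24768
print(1 - gru_params(64, 64) / lstm_params(64, 64))  # 0.25 → 25% fewer
```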
6. Bidirectional LSTM – Two-Way Context
A regular LSTM only reads left to right. But in language, context often comes from both directions. Example: in "The movie was not good", the word "not" changes the meaning of "good" that comes after it. A Bidirectional LSTM runs two LSTMs in parallel, one forward (→) and one backward (←), then combines their outputs.
```python
import tensorflow as tf
from tensorflow.keras import layers

# ===========================
# 1. Bidirectional wrapper
# ===========================
bilstm = layers.Bidirectional(
    layers.LSTM(64, return_sequences=True),
    merge_mode='concat'  # default: concatenate forward + backward
    # merge_mode='sum'   # add forward + backward
    # merge_mode='mul'   # multiply
    # merge_mode='ave'   # average
)
# Input:  (batch, 200, 64)  → 200 timesteps, 64 features
# Output: (batch, 200, 128) → 64 forward + 64 backward = 128!

# ===========================
# 2. Full BiLSTM model for text classification
# ===========================
model = tf.keras.Sequential([
    layers.Embedding(10000, 64, input_length=200),

    # BiLSTM layer 1: captures bidirectional patterns
    layers.Bidirectional(layers.LSTM(64, return_sequences=True,
                                     dropout=0.2, recurrent_dropout=0.2)),
    # Output: (batch, 200, 128) → 64 forward + 64 backward

    # BiLSTM layer 2: further refine
    layers.Bidirectional(layers.LSTM(32, dropout=0.2)),
    # Output: (batch, 64) → 32 forward + 32 backward

    # Classifier
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()
# Embedding: 640k | BiLSTM1: ~66k | BiLSTM2: ~41k | Dense: ~4k
# Total: ~752k parameters

# ===========================
# 3. How Bidirectional helps
# ===========================
# "The movie was NOT good"
# Forward LSTM  (→): sees "not" BEFORE "good" → context captured
# Backward LSTM (←): sees "good" BEFORE "not" → also captured!
# Combined: model understands "not good" = negative from BOTH sides

# Compare accuracy:
# Unidirectional LSTM: ~84% on IMDB
# Bidirectional LSTM:  ~87% on IMDB → +3% from bidirectionality!
```
7. Project: IMDB Sentiment Classifier – 87%+ Accuracy
```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

# ===========================
# 1. LOAD IMDB DATASET
# ===========================
VOCAB_SIZE = 10000  # top 10k most common words
MAX_LEN = 200       # pad/truncate to 200 words

(X_train, y_train), (X_test, y_test) = keras.datasets.imdb.load_data(
    num_words=VOCAB_SIZE)

print(f"Train: {len(X_train)} reviews")  # 25,000
print(f"Test:  {len(X_test)} reviews")   # 25,000
print(f"Sample lengths: {[len(x) for x in X_train[:5]]}")
# [218, 189, 141, 550, 147] → variable length!

# Pad sequences to fixed length
X_train = keras.utils.pad_sequences(X_train, maxlen=MAX_LEN,
                                    padding='post', truncating='post')
X_test = keras.utils.pad_sequences(X_test, maxlen=MAX_LEN,
                                   padding='post', truncating='post')
print(f"After padding: {X_train.shape}")  # (25000, 200)

# ===========================
# 2. BUILD BiLSTM MODEL
# ===========================
model = keras.Sequential([
    # Embedding: word index → dense vector
    layers.Embedding(VOCAB_SIZE, 64, input_length=MAX_LEN,
                     mask_zero=True),  # mask padding (0s)!

    # BiLSTM layers
    layers.Bidirectional(layers.LSTM(64, return_sequences=True,
                                     dropout=0.2, recurrent_dropout=0.2)),
    layers.Bidirectional(layers.LSTM(32, dropout=0.2)),

    # Classifier
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')
])

# ===========================
# 3. COMPILE
# ===========================
model.compile(
    optimizer=keras.optimizers.Adam(1e-3),
    loss='binary_crossentropy',
    metrics=['accuracy']
)
model.summary()

# ===========================
# 4. TRAIN
# ===========================
history = model.fit(
    X_train, y_train,
    epochs=15,
    batch_size=64,
    validation_split=0.2,
    callbacks=[
        keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True),
        keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=2)
    ]
)

# ===========================
# 5. EVALUATE
# ===========================
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"\nTest Accuracy: {test_acc:.1%}")
# Test Accuracy: 87.2% with BiLSTM!
# Compare: simple LSTM = 84%, BiLSTM = 87%, BERT (Page 6) = 95%+

# ===========================
# 6. PREDICT on new reviews
# ===========================
word_index = keras.datasets.imdb.get_word_index()
reverse_index = {v + 3: k for k, v in word_index.items()}  # for decoding

def encode_review(text):
    # IMDB convention: indices are shifted by 3 (0=pad, 1=start, 2=unknown)
    words = text.lower().split()
    encoded = [word_index[w] + 3
               if w in word_index and word_index[w] + 3 < VOCAB_SIZE
               else 2  # out-of-vocabulary → oov index
               for w in words]
    # Use the SAME padding scheme as training (post)!
    return keras.utils.pad_sequences([encoded], maxlen=MAX_LEN,
                                     padding='post', truncating='post')

review = "This film was absolutely terrible waste of time"
pred = model.predict(encode_review(review))[0, 0]
print(f"'{review}'")
print(f"Sentiment: {'Positive' if pred > 0.5 else 'Negative'} ({pred:.1%})")
# Sentiment: Negative (8.3%)
```
87.2% Accuracy with BiLSTM!
Compare our accuracy evolution:
• NN series (manual NumPy): ~80% (hundreds of lines of code)
• Simple LSTM in Keras: ~84% (20 lines of code)
• BiLSTM in Keras: 87.2% (25 lines of code)
• BERT fine-tuned (Page 6): 95%+ (coming next!)
Each technique brings an improvement – but BERT will be a total game-changer.
8. Padding & Masking – Variable-Length Sequences
```python
import tensorflow as tf
from tensorflow.keras import layers

# ===========================
# 1. Padding – make all sequences the same length
# ===========================
sequences = [[4, 2, 8],        # 3 words
             [1, 5, 9, 3, 7],  # 5 words
             [6]]              # 1 word

padded = tf.keras.utils.pad_sequences(
    sequences, maxlen=5,
    padding='post',     # add zeros at END (recommended for RNN)
    truncating='post',  # if too long, cut from END
    value=0             # padding value (0 = default)
)
print(padded)
# [[4, 2, 8, 0, 0],
#  [1, 5, 9, 3, 7],
#  [6, 0, 0, 0, 0]]

# ===========================
# 2. Masking – tell the model to IGNORE padding
# ===========================
# Method 1: mask_zero=True in Embedding
embedding = layers.Embedding(10000, 64, mask_zero=True)
# This tells downstream layers: "index 0 = padding, ignore it!"
# LSTM and GRU automatically use this mask → skip padded timesteps

# Method 2: Masking layer (explicit)
model = tf.keras.Sequential([
    layers.Embedding(10000, 64),
    layers.Masking(mask_value=0.0),  # explicit mask on zero vectors
    layers.LSTM(64),
    layers.Dense(1, activation='sigmoid')
])

# ===========================
# 3. Why masking matters
# ===========================
# Without masking: LSTM processes padding tokens as real data
#   → "I love this movie 0 0 0 0 0 0"
#   → LSTM "reads" 6 zeros → pollutes hidden state!
# With masking: LSTM skips padding tokens completely
#   → Only processes "I love this movie" → cleaner output!
# Impact: +1-2% accuracy improvement, especially for short texts
```
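The "polluted state" effect can be illustrated without TensorFlow – here with simple average pooling over a padded sequence (random stand-in embeddings, not a real model):

```python
import numpy as np

# 6 timesteps of 4-dim "embeddings"; the last 3 are zero padding.
rng = np.random.default_rng(0)
seq = rng.normal(size=(6, 4))
seq[3:] = 0.0
mask = np.array([1, 1, 1, 0, 0, 0], dtype=bool)

naive = seq.mean(axis=0)         # padding drags the average toward zero
masked = seq[mask].mean(axis=0)  # padding ignored, as mask_zero=True does
print(naive)
print(masked)  # here exactly 2x naive: 3 real rows out of 6 total
```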
9. Pre-trained Embeddings – GloVe & TF Hub
```python
import tensorflow as tf
from tensorflow.keras import layers
import numpy as np

# ===========================
# Method 1: Load GloVe embeddings
# Download: https://nlp.stanford.edu/projects/glove/
# ===========================
def load_glove(filepath, embedding_dim=100):
    """Load GloVe vectors from file"""
    embeddings = {}
    with open(filepath, encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings[word] = vector
    print(f"Loaded {len(embeddings)} word vectors")
    return embeddings

# glove = load_glove('glove.6B.100d.txt')  # 400k words, 100-dim

# Build embedding matrix for your vocabulary
def build_embedding_matrix(word_index, glove, vocab_size, embed_dim):
    matrix = np.zeros((vocab_size, embed_dim))
    for word, idx in word_index.items():
        if idx < vocab_size and word in glove:
            matrix[idx] = glove[word]
    return matrix

# embedding_matrix = build_embedding_matrix(word_index, glove, 10000, 100)

# Use in model (freeze or fine-tune):
# embedding = layers.Embedding(10000, 100,
#     embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
#     trainable=False)  # freeze pre-trained weights

# ===========================
# Method 2: TF Hub – pre-trained sentence encoders
# ===========================
# Universal Sentence Encoder – maps ANY sentence to a 512-dim vector
# import tensorflow_hub as hub
#
# embed = hub.KerasLayer(
#     "https://tfhub.dev/google/universal-sentence-encoder/4",
#     trainable=False)  # 512-dim output per sentence
#
# # Super simple classifier:
# model = tf.keras.Sequential([
#     embed,  # sentence → 512-dim
#     layers.Dense(64, activation='relu'),
#     layers.Dropout(0.3),
#     layers.Dense(1, activation='sigmoid')
# ])
# → 90%+ accuracy with minimal code!
# → Works with raw string input – no tokenization needed!

# ===========================
# Method 3: NNLM (Neural Network Language Model)
# ===========================
# embed = hub.KerasLayer(
#     "https://tfhub.dev/google/nnlm-en-dim128/2",
#     trainable=True)  # 128-dim, fine-tunable
```
When to Use Pre-trained Embeddings?
Small dataset (<10k samples): Always use pre-trained embeddings! Your model doesn't have enough data to learn good embeddings from scratch.
Large dataset (>100k samples): You can train embeddings from scratch, but pre-trained + fine-tuning is often still better.
TF Hub USE vs GloVe: USE is easier (input = raw strings); GloVe is more flexible (per-word embeddings). For quick classification: USE. For research: GloVe/Word2Vec.
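A quick sanity check you might run after loading per-word vectors – cosine similarity between a few words. The 3-dim vectors below are made up for illustration; real GloVe vectors are 50-300 dimensional with different values:

```python
import numpy as np

# Hypothetical vectors standing in for real GloVe embeddings.
emb = {
    "good":     np.array([0.90, 0.80, 0.10]),
    "great":    np.array([0.85, 0.75, 0.20]),
    "terrible": np.array([-0.80, -0.70, 0.10]),
}

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(emb["good"], emb["great"]))     # close to +1: similar meaning
print(cos(emb["good"], emb["terrible"]))  # strongly negative: opposite
```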
10. Complete NLP Pipeline – Raw Text → Prediction
```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# ───────────────────────────────────────
# PRODUCTION NLP PIPELINE TEMPLATE
# Input:  raw text strings
# Output: sentiment/class prediction
# ───────────────────────────────────────

# Config
VOCAB_SIZE = 20000
MAX_LEN = 200
EMBED_DIM = 64

# train_texts / train_labels: your raw strings and labels go here

# 1. Vectorizer (adapt on training data)
vectorizer = layers.TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_sequence_length=MAX_LEN)
vectorizer.adapt(train_texts)  # build vocabulary

# 2. Model (preprocessing INSIDE!)
model = keras.Sequential([
    vectorizer,  # text → integers
    layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True,
                                     dropout=0.2)),
    layers.Bidirectional(layers.LSTM(32, dropout=0.2)),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')  # binary
    # layers.Dense(NUM_CLASSES, activation='softmax')  # multi-class
])

# 3. Train with RAW STRINGS!
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(
    train_texts, train_labels,  # raw strings + labels!
    epochs=15,
    batch_size=64,
    validation_split=0.2,
    callbacks=[keras.callbacks.EarlyStopping(patience=3,
                                             restore_best_weights=True)]
)

# 4. Predict on raw text – NO preprocessing needed!
reviews = [
    "This movie was absolutely fantastic! Best film of the year.",
    "Terrible waste of time. Worst movie ever made.",
    "It was okay, nothing special but not bad either.",
]
predictions = model.predict(reviews)
for review, pred in zip(reviews, predictions):
    sentiment = "Positive" if pred[0] > 0.5 else "Negative"
    print(f"{sentiment} ({pred[0]:.1%}): {review[:50]}...")
# Positive (94.2%): This movie was absolutely fantastic! Best fi...
# Negative (3.1%):  Terrible waste of time. Worst movie ever ma...
# Positive (61.3%): It was okay, nothing special but not bad ei...

# 5. Save – preprocessing included!
model.save("sentiment_analyzer.keras")
# loaded = keras.models.load_model("sentiment_analyzer.keras")
# loaded.predict(["Great movie!"])  # works with raw text!
```
11. Page 5 Summary
| Concept | What It Is | Key Code |
|---|---|---|
| TextVectorization | Text → integers inside the model | TextVectorization(max_tokens=10000) |
| Embedding | Integer → meaningful dense vector | Embedding(10000, 64, mask_zero=True) |
| LSTM | Sequence model: 3 gates + cell | LSTM(64, return_sequences=True) |
| GRU | Lighter LSTM: 2 gates | GRU(64, dropout=0.2) |
| Bidirectional | Read forward + backward | Bidirectional(LSTM(64)) |
| Padding | Make sequences the same length | pad_sequences(X, maxlen=200) |
| Masking | Ignore padding in the model | mask_zero=True |
| Pre-trained | Already-trained embeddings | GloVe, TF Hub USE, NNLM |
| Production Pipeline | Raw text → prediction | TextVectorization → Embedding → BiLSTM |
← Page 4 – tf.data Pipeline & Performance
Coming Next: Page 6 – Transformer & BERT in TensorFlow
The modern architecture that revolutionized NLP: the Multi-Head Attention layer (built into Keras!), building a Transformer Encoder from building blocks, Positional Encoding, fine-tuning BERT from TF Hub, Hugging Face + TensorFlow integration, 95%+ text classification with BERT, and an LSTM vs Transformer comparison. The game-changer that makes LSTM look outdated!