Table of Contents – Page 5
- Text → Numbers – Why and how
- TextVectorization – Preprocessing inside the model
- Embedding Layer – Words → meaningful vectors
- RNN & LSTM in Keras – Processing sequences
- GRU – Lighter LSTM alternative
- Bidirectional LSTM – Forward and backward context
- Project: IMDB Sentiment Classifier – 87%+ accuracy
- Padding & Masking – Handling variable-length sequences
- Pre-trained Embeddings – GloVe, TF Hub, Universal Sentence Encoder
- Complete NLP Pipeline – From raw text to prediction
- Summary & Page 6 Preview
1. Text → Numbers – NLP Foundation
Neural networks cannot process raw text strings; all text must first be converted to a numerical representation. In Page 6 of the Neural Network series, we covered Word2Vec and one-hot encoding. In TensorFlow, this conversion is automated by the TextVectorization and Embedding layers.
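As a minimal sketch of the idea in plain Python (a toy, hand-built vocabulary – not the Keras API, which the next section covers):

```python
# Toy text → integer encoding. Index 0 is reserved for padding and
# index 1 for out-of-vocabulary words, mirroring the convention the
# Keras TextVectorization layer uses.
vocab = {"": 0, "[UNK]": 1, "the": 2, "movie": 3, "was": 4, "great": 5}

def encode(text, vocab, seq_len=8):
    ids = [vocab.get(w, 1) for w in text.lower().split()]
    return (ids + [0] * seq_len)[:seq_len]  # pad/truncate to seq_len

print(encode("The movie was great", vocab))
# [2, 3, 4, 5, 0, 0, 0, 0]
```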
2. TextVectorization – Preprocessing Inside the Model
TextVectorization is a Keras layer that converts text strings to integer indices. Big advantage: preprocessing lives inside the model, so when you export the model, the preprocessing comes with it – no separate preprocessing step is needed at inference time.
```python
import tensorflow as tf
from tensorflow.keras import layers

# ===========================
# 1. Create TextVectorization layer
# ===========================
vectorizer = layers.TextVectorization(
    max_tokens=10000,            # vocabulary size (top 10k words)
    output_mode='int',           # output integer indices
    output_sequence_length=200,  # pad/truncate to 200 tokens
    standardize='lower_and_strip_punctuation',  # lowercase + remove punct
    split='whitespace',          # split on spaces (default)
)

# ===========================
# 2. Adapt – build vocabulary from training data
# ===========================
train_texts = [
    "I love this movie, it was amazing!",
    "Terrible film, waste of time.",
    "Great acting and wonderful story.",
    "The worst movie I have ever seen.",
    # ... thousands more
]
vectorizer.adapt(train_texts)  # builds vocabulary!

# Check vocabulary
vocab = vectorizer.get_vocabulary()
print(f"Vocab size: {len(vocab)}")
print(f"First 20: {vocab[:20]}")
# ['', '[UNK]', 'the', 'i', 'movie', 'was', 'this', ...]
# Index 0 = padding, Index 1 = unknown word

# ===========================
# 3. Vectorize text
# ===========================
sample = tf.constant(["I love this movie"])
encoded = vectorizer(sample)
print(encoded)
# tf.Tensor([[3, 42, 6, 4, 0, 0, 0, ... 0]], shape=(1, 200))
#             i  love this movie pad pad pad

# ===========================
# 4. Use INSIDE a model (BEST approach!)
# ===========================
model = tf.keras.Sequential([
    vectorizer,                   # text → integers (in-model!)
    layers.Embedding(10000, 64),  # integers → vectors
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])

# Train directly with raw strings!
# model.fit(train_texts, train_labels, epochs=10)

# At inference: model.predict(["Great movie!"]) → works directly!
# No separate preprocessing needed – it's ALL in the model.

# ===========================
# 5. output_mode options
# ===========================
# 'int'       → [42, 156, 8, 2041, 0, ...]  (for LSTM/Transformer)
# 'multi_hot' → [0, 1, 0, 0, 1, 1, 0, ...]  (bag of words)
# 'count'     → [0, 2, 0, 0, 1, 3, 0, ...]  (word counts)
# 'tf_idf'    → [0, 0.7, 0, 0, 1.2, ...]    (TF-IDF weights)
```
Why TextVectorization Inside the Model?
Without it: you need separate preprocessing during training AND inference. Risk: a preprocessing mismatch between training and production – hidden bugs.
With it: preprocessing is part of the model. Export the model (.keras or SavedModel) and the preprocessing goes with it; model.predict(["raw text"]) works directly. This is the best practice for production NLP.
3. Embedding Layer – Words → Meaningful Vectors
In Page 6 of the NN series, we built Word2Vec from scratch and saw how similar words end up with nearby vectors. The Embedding layer in Keras does the same thing: each word index is mapped to a dense vector (e.g., 64 dimensions). These vectors are learned during training, so words with similar meanings acquire similar vectors.
```python
import tensorflow as tf
from tensorflow.keras import layers
import numpy as np

# ===========================
# 1. How Embedding works
# ===========================
embedding = layers.Embedding(
    input_dim=10000,   # vocabulary size
    output_dim=64,     # embedding dimension
    input_length=200,  # sequence length (optional)
)

# Input: batch of word indices
x = tf.constant([[42, 156, 8, 2041, 0]])  # 1 sentence, 5 words
output = embedding(x)
print(output.shape)  # (1, 5, 64) → each word → 64-dim vector!

# Internally, this is just a LOOKUP TABLE:
# embedding.weights[0] has shape (10000, 64)
# Word 42  → row 42 of the table → [0.12, -0.34, ...]
# Word 156 → row 156             → [0.78, 0.23, ...]
# These rows are LEARNED during training!
print(f"Embedding matrix: {embedding.weights[0].shape}")
# (10000, 64) = 640,000 learnable parameters

# ===========================
# 2. Inspect learned embeddings
# ===========================
# After training, similar words have similar vectors:
# cosine_similarity("good", "great")    ≈  0.85
# cosine_similarity("good", "terrible") ≈ -0.72
# vector("king") - vector("man") + vector("woman") ≈ vector("queen")

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Get vectors after training:
# weights = embedding.get_weights()[0]
# vec_good = weights[vocab.index("good")]
# vec_great = weights[vocab.index("great")]
# print(cosine_similarity(vec_good, vec_great))  # ≈ 0.85

# ===========================
# 3. Embedding dimension guidelines
# ===========================
# Vocab 1k-10k:   embedding_dim = 32-64
# Vocab 10k-50k:  embedding_dim = 64-128
# Vocab 50k-100k: embedding_dim = 128-256
# Rule of thumb: dim ≈ vocab^(1/4)
# Too small: can't capture nuances
# Too large: overfitting, slow training

# ===========================
# 4. Pre-trained embeddings (GloVe, Word2Vec)
# ===========================
# Load GloVe: https://nlp.stanford.edu/projects/glove/
# glove_matrix = np.zeros((10000, 100))  # build from GloVe file
# for word, idx in word_index.items():
#     if word in glove_dict:
#         glove_matrix[idx] = glove_dict[word]

# Use pre-trained weights:
# embedding = layers.Embedding(10000, 100,
#     embeddings_initializer=tf.keras.initializers.Constant(glove_matrix),
#     trainable=False)  # freeze! (or True to fine-tune)
```
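Since the Embedding forward pass is just table indexing, it can be reproduced in a few lines of NumPy (toy sizes and random weights, purely illustrative):

```python
import numpy as np

# An Embedding layer is a trainable lookup table: row i of the
# matrix is the vector for word index i.
vocab_size, embed_dim = 10, 4
rng = np.random.default_rng(42)
table = rng.normal(size=(vocab_size, embed_dim))  # like embedding.weights[0]

word_ids = np.array([3, 7, 1])  # a 3-word "sentence"
vectors = table[word_ids]       # the entire forward pass is row indexing
print(vectors.shape)            # (3, 4)
```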
4. RNN & LSTM in Keras – Processing Sequences
In Page 5 of the Neural Network series, we implemented VanillaRNN and LSTM from scratch – forget gate, input gate, output gate, cell state – hundreds of lines of NumPy. In Keras it is one line: layers.LSTM(64). The concepts remain identical.
```python
import tensorflow as tf
from tensorflow.keras import layers

# ===========================
# 1. Simple LSTM
# ===========================
# Input: (batch, timesteps, features)
# For text: (batch, seq_len, embedding_dim)
lstm_layer = layers.LSTM(
    units=64,                 # output dimension (hidden size)
    return_sequences=False,   # only return LAST output
    # return_sequences=True,  # return ALL timestep outputs
    dropout=0.2,              # dropout on inputs
    recurrent_dropout=0.2,    # dropout on recurrent state
)
# Input:  (batch, 200, 64) → 200 timesteps, 64 features
# Output: (batch, 64)      → last hidden state (return_sequences=False)
# Output: (batch, 200, 64) → all hidden states (return_sequences=True)

# ===========================
# 2. Stacking LSTM layers
# ===========================
model = tf.keras.Sequential([
    layers.Embedding(10000, 64, input_length=200),
    # Layer 1: return ALL timestep outputs → feed to next LSTM
    layers.LSTM(128, return_sequences=True, dropout=0.2),
    # Layer 2: return only LAST output → feed to Dense
    layers.LSTM(64, return_sequences=False, dropout=0.2),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')
])

# CRITICAL: return_sequences
# When stacking LSTMs: all layers EXCEPT the last need return_sequences=True
# Last LSTM: return_sequences=False (or use GlobalAveragePooling1D)

# ===========================
# 3. Parameter count
# ===========================
# LSTM(64) with input_dim=64:
# Parameters = 4 × ((input_dim + units + 1) × units)
#            = 4 × ((64 + 64 + 1) × 64)
#            = 4 × 8,256 = 33,024
# The "4" = 4 gates: forget, input, cell candidate, output

model.summary()
# Embedding: 10000 × 64           = 640,000
# LSTM 128:  4 × (64+128+1) × 128 =  98,816
# LSTM 64:   4 × (128+64+1) × 64  =  49,408
# Dense:     64 × 64 + 64         =   4,160
# Total: ~792k parameters
```
return_sequences: When True vs False?
return_sequences=False (default): outputs only the hidden state of the last timestep. Shape: (batch, units). Use it for the last LSTM layer before a Dense head, or for classification.
return_sequences=True: outputs the hidden state at every timestep. Shape: (batch, timesteps, units). Use it when stacking LSTM layers, or for sequence-to-sequence tasks (translation, tagging).
Rule: Stacking LSTMs? Set return_sequences=True on all but the last LSTM.
5. GRU – Lighter LSTM Alternative
```python
import tensorflow as tf
from tensorflow.keras import layers

# GRU: 2 gates (reset, update) vs LSTM: 3 gates + cell state
gru_model = tf.keras.Sequential([
    layers.Embedding(10000, 64),
    layers.GRU(64, return_sequences=True, dropout=0.2),
    layers.GRU(32, dropout=0.2),
    layers.Dense(1, activation='sigmoid')
])

# GRU(64) params  = 3 × ((64 + 64 + 1) × 64) = 24,768
# LSTM(64) params = 4 × ((64 + 64 + 1) × 64) = 33,024
# (Keras' default reset_after=True adds a second recurrent bias, so
#  model.summary() reports slightly more: 3 × ((64 + 64 + 2) × 64) = 24,960)
# GRU has ~25% fewer parameters → faster training!

# ===========================
# When to use which?
# ===========================
# GRU:  faster, fewer params, good for shorter sequences
#       Try GRU first for speed, switch to LSTM if needed
# LSTM: more expressive (separate cell state), better for
#       very long sequences, slightly more accurate on some tasks
# In practice: difference is often < 1% accuracy
```
| Aspect | LSTM | GRU |
|---|---|---|
| Gates | 3 (forget, input, output) + cell | 2 (reset, update) |
| Parameters | 4 × (input + hidden + 1) × hidden | 3 × (input + hidden + 1) × hidden |
| Speed | Slower | ~25% faster |
| Memory | Separate cell state | Combined hidden/cell |
| Long Sequences | Slightly better | Good enough |
| Recommendation | Default for NLP | Try first for speed |
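The parameter formulas in the table can be checked with plain arithmetic (this ignores the extra recurrent bias Keras adds to GRU when reset_after=True):

```python
# Gate-count parameter formulas from the table above.
def lstm_params(input_dim, units):
    return 4 * ((input_dim + units + 1) * units)  # 4 gates

def gru_params(input_dim, units):
    return 3 * ((input_dim + units + 1) * units)  # 2 gates + candidate

print(lstm_params(64, 64))  # 33024
print(gru_params(64, 64))   # 24768
print(1 - gru_params(64, 64) / lstm_params(64, 64))  # 0.25 → 25% fewer
```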
6. Bidirectional LSTM – Two-Way Context
A regular LSTM only reads left to right. But in language, context often comes from both directions. Example: in "The movie was not good", the word "not" changes the meaning of "good" that comes after it. A Bidirectional LSTM runs two LSTMs in parallel, one forward (→) and one backward (←), then combines their outputs.
```python
import tensorflow as tf
from tensorflow.keras import layers

# ===========================
# 1. Bidirectional wrapper
# ===========================
bilstm = layers.Bidirectional(
    layers.LSTM(64, return_sequences=True),
    merge_mode='concat'  # default: concatenate forward + backward
    # merge_mode='sum'   # add forward + backward
    # merge_mode='mul'   # multiply
    # merge_mode='ave'   # average
)
# Input:  (batch, 200, 64)  → 200 timesteps, 64 features
# Output: (batch, 200, 128) → 64 forward + 64 backward = 128!

# ===========================
# 2. Full BiLSTM model for text classification
# ===========================
model = tf.keras.Sequential([
    layers.Embedding(10000, 64, input_length=200),

    # BiLSTM layer 1: captures bidirectional patterns
    layers.Bidirectional(layers.LSTM(64, return_sequences=True,
                                     dropout=0.2, recurrent_dropout=0.2)),
    # Output: (batch, 200, 128) → 64 forward + 64 backward

    # BiLSTM layer 2: further refine
    layers.Bidirectional(layers.LSTM(32, dropout=0.2)),
    # Output: (batch, 64) → 32 forward + 32 backward

    # Classifier
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()
# Embedding: 640k | BiLSTM1: ~66k | BiLSTM2: ~41k | Dense: ~4k
# Total: ~752k parameters

# ===========================
# 3. How Bidirectional helps
# ===========================
# "The movie was NOT good"
# Forward LSTM  (→): sees "not" BEFORE "good" → context captured
# Backward LSTM (←): sees "good" BEFORE "not" → also captured!
# Combined: model understands "not good" = negative from BOTH sides

# Compare accuracy:
# Unidirectional LSTM: ~84% on IMDB
# Bidirectional LSTM:  ~87% on IMDB → +3% from bidirectionality!
```
7. Project: IMDB Sentiment Classifier – 87%+ Accuracy
```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

# ===========================
# 1. LOAD IMDB DATASET
# ===========================
VOCAB_SIZE = 10000  # top 10k most common words
MAX_LEN = 200       # pad/truncate to 200 words

(X_train, y_train), (X_test, y_test) = keras.datasets.imdb.load_data(
    num_words=VOCAB_SIZE)

print(f"Train: {len(X_train)} reviews")  # 25,000
print(f"Test:  {len(X_test)} reviews")   # 25,000
print(f"Sample lengths: {[len(x) for x in X_train[:5]]}")
# [218, 189, 141, 550, 147] → variable length!

# Pad sequences to fixed length
X_train = keras.utils.pad_sequences(X_train, maxlen=MAX_LEN,
                                    padding='post', truncating='post')
X_test = keras.utils.pad_sequences(X_test, maxlen=MAX_LEN,
                                   padding='post', truncating='post')
print(f"After padding: {X_train.shape}")  # (25000, 200)

# ===========================
# 2. BUILD BiLSTM MODEL
# ===========================
model = keras.Sequential([
    # Embedding: word index → dense vector
    layers.Embedding(VOCAB_SIZE, 64, input_length=MAX_LEN,
                     mask_zero=True),  # mask padding (0s)!

    # BiLSTM layers
    layers.Bidirectional(layers.LSTM(64, return_sequences=True,
                                     dropout=0.2, recurrent_dropout=0.2)),
    layers.Bidirectional(layers.LSTM(32, dropout=0.2)),

    # Classifier
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')
])

# ===========================
# 3. COMPILE
# ===========================
model.compile(
    optimizer=keras.optimizers.Adam(1e-3),
    loss='binary_crossentropy',
    metrics=['accuracy']
)
model.summary()

# ===========================
# 4. TRAIN
# ===========================
history = model.fit(
    X_train, y_train,
    epochs=15,
    batch_size=64,
    validation_split=0.2,
    callbacks=[
        keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True),
        keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=2)
    ]
)

# ===========================
# 5. EVALUATE
# ===========================
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"\nTest Accuracy: {test_acc:.1%}")
# Test Accuracy: 87.2% with BiLSTM!
# Compare: simple LSTM = 84%, BiLSTM = 87%, BERT (Page 6) = 95%+

# ===========================
# 6. PREDICT on new reviews
# ===========================
word_index = keras.datasets.imdb.get_word_index()
reverse_index = {v + 3: k for k, v in word_index.items()}  # for decoding

def encode_review(text):
    # IMDB convention: indices are shifted by 3 (0=pad, 1=start, 2=unknown)
    words = text.lower().split()
    encoded = [word_index[w] + 3
               if w in word_index and word_index[w] + 3 < VOCAB_SIZE
               else 2  # out-of-vocabulary → oov index
               for w in words]
    # Use the SAME padding scheme as training (post)!
    return keras.utils.pad_sequences([encoded], maxlen=MAX_LEN,
                                     padding='post', truncating='post')

review = "This film was absolutely terrible waste of time"
pred = model.predict(encode_review(review))[0, 0]
print(f"'{review}'")
print(f"Sentiment: {'Positive' if pred > 0.5 else 'Negative'} ({pred:.1%})")
# Sentiment: Negative (8.3%)
```
87.2% Accuracy with BiLSTM!
Compare our accuracy evolution:
• NN series (manual NumPy): ~80% (hundreds of lines of code)
• Simple LSTM in Keras: ~84% (20 lines of code)
• BiLSTM in Keras: 87.2% (25 lines of code)
• BERT fine-tuned (Page 6): 95%+ (coming next!)
Each technique brings an improvement – but BERT will be a total game-changer.
8. Padding & Masking – Variable-Length Sequences
```python
import tensorflow as tf
from tensorflow.keras import layers

# ===========================
# 1. Padding – make all sequences the same length
# ===========================
sequences = [[4, 2, 8],        # 3 words
             [1, 5, 9, 3, 7],  # 5 words
             [6]]              # 1 word

padded = tf.keras.utils.pad_sequences(
    sequences, maxlen=5,
    padding='post',     # add zeros at END (recommended for RNN)
    truncating='post',  # if too long, cut from END
    value=0             # padding value (0 = default)
)
print(padded)
# [[4, 2, 8, 0, 0],
#  [1, 5, 9, 3, 7],
#  [6, 0, 0, 0, 0]]

# ===========================
# 2. Masking – tell the model to IGNORE padding
# ===========================
# Method 1: mask_zero=True in Embedding
embedding = layers.Embedding(10000, 64, mask_zero=True)
# This tells downstream layers: "index 0 = padding, ignore it!"
# LSTM and GRU automatically use this mask → skip padded timesteps

# Method 2: Masking layer (explicit)
model = tf.keras.Sequential([
    layers.Embedding(10000, 64),
    layers.Masking(mask_value=0.0),  # explicit mask on zero vectors
    layers.LSTM(64),
    layers.Dense(1, activation='sigmoid')
])

# ===========================
# 3. Why masking matters
# ===========================
# Without masking: LSTM processes padding tokens as real data
#   → "I love this movie 0 0 0 0 0 0"
#   → LSTM "reads" 6 zeros → pollutes hidden state!
# With masking: LSTM skips padding tokens completely
#   → Only processes "I love this movie" → cleaner output!
# Impact: +1-2% accuracy improvement, especially for short texts
```
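The "polluted state" effect can be illustrated without TensorFlow – here with simple average pooling over a padded sequence (random stand-in embeddings, not a real model):

```python
import numpy as np

# 6 timesteps of 4-dim "embeddings"; the last 3 are zero padding.
rng = np.random.default_rng(0)
seq = rng.normal(size=(6, 4))
seq[3:] = 0.0
mask = np.array([1, 1, 1, 0, 0, 0], dtype=bool)

naive = seq.mean(axis=0)         # padding drags the average toward zero
masked = seq[mask].mean(axis=0)  # padding ignored, as mask_zero=True does
print(naive)
print(masked)  # here exactly 2x naive: 3 real rows out of 6 total
```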
9. Pre-trained Embeddings – GloVe & TF Hub
```python
import tensorflow as tf
from tensorflow.keras import layers
import numpy as np

# ===========================
# Method 1: Load GloVe embeddings
# Download: https://nlp.stanford.edu/projects/glove/
# ===========================
def load_glove(filepath, embedding_dim=100):
    """Load GloVe vectors from file"""
    embeddings = {}
    with open(filepath, encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings[word] = vector
    print(f"Loaded {len(embeddings)} word vectors")
    return embeddings

# glove = load_glove('glove.6B.100d.txt')  # 400k words, 100-dim

# Build embedding matrix for your vocabulary
def build_embedding_matrix(word_index, glove, vocab_size, embed_dim):
    matrix = np.zeros((vocab_size, embed_dim))
    for word, idx in word_index.items():
        if idx < vocab_size and word in glove:
            matrix[idx] = glove[word]
    return matrix

# embedding_matrix = build_embedding_matrix(word_index, glove, 10000, 100)

# Use in model (freeze or fine-tune):
# embedding = layers.Embedding(10000, 100,
#     embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
#     trainable=False)  # freeze pre-trained weights

# ===========================
# Method 2: TF Hub – pre-trained sentence encoders
# ===========================
# Universal Sentence Encoder – maps ANY sentence to a 512-dim vector
# import tensorflow_hub as hub
#
# embed = hub.KerasLayer(
#     "https://tfhub.dev/google/universal-sentence-encoder/4",
#     trainable=False)  # 512-dim output per sentence
#
# # Super simple classifier:
# model = tf.keras.Sequential([
#     embed,  # sentence → 512-dim
#     layers.Dense(64, activation='relu'),
#     layers.Dropout(0.3),
#     layers.Dense(1, activation='sigmoid')
# ])
# → 90%+ accuracy with minimal code!
# → Works with raw string input – no tokenization needed!

# ===========================
# Method 3: NNLM (Neural Network Language Model)
# ===========================
# embed = hub.KerasLayer(
#     "https://tfhub.dev/google/nnlm-en-dim128/2",
#     trainable=True)  # 128-dim, fine-tunable
```
When to Use Pre-trained Embeddings?
Small dataset (<10k samples): Always use pre-trained embeddings! Your model doesn't have enough data to learn good embeddings from scratch.
Large dataset (>100k samples): You can train embeddings from scratch, but pre-trained + fine-tuning is often still better.
TF Hub USE vs GloVe: USE is easier (input = raw strings); GloVe is more flexible (per-word embeddings). For quick classification: USE. For research: GloVe/Word2Vec.
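A quick sanity check you might run after loading per-word vectors – cosine similarity between a few words. The 3-dim vectors below are made up for illustration; real GloVe vectors are 50-300 dimensional with different values:

```python
import numpy as np

# Hypothetical vectors standing in for real GloVe embeddings.
emb = {
    "good":     np.array([0.90, 0.80, 0.10]),
    "great":    np.array([0.85, 0.75, 0.20]),
    "terrible": np.array([-0.80, -0.70, 0.10]),
}

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(emb["good"], emb["great"]))     # close to +1: similar meaning
print(cos(emb["good"], emb["terrible"]))  # strongly negative: opposite
```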
10. Complete NLP Pipeline – Raw Text → Prediction
```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# ───────────────────────────────────────
# PRODUCTION NLP PIPELINE TEMPLATE
# Input:  raw text strings
# Output: sentiment/class prediction
# ───────────────────────────────────────

# Config
VOCAB_SIZE = 20000
MAX_LEN = 200
EMBED_DIM = 64

# train_texts / train_labels: your raw strings and labels go here

# 1. Vectorizer (adapt on training data)
vectorizer = layers.TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_sequence_length=MAX_LEN)
vectorizer.adapt(train_texts)  # build vocabulary

# 2. Model (preprocessing INSIDE!)
model = keras.Sequential([
    vectorizer,  # text → integers
    layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True,
                                     dropout=0.2)),
    layers.Bidirectional(layers.LSTM(32, dropout=0.2)),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')  # binary
    # layers.Dense(NUM_CLASSES, activation='softmax')  # multi-class
])

# 3. Train with RAW STRINGS!
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(
    train_texts, train_labels,  # raw strings + labels!
    epochs=15,
    batch_size=64,
    validation_split=0.2,
    callbacks=[keras.callbacks.EarlyStopping(patience=3,
                                             restore_best_weights=True)]
)

# 4. Predict on raw text – NO preprocessing needed!
reviews = [
    "This movie was absolutely fantastic! Best film of the year.",
    "Terrible waste of time. Worst movie ever made.",
    "It was okay, nothing special but not bad either.",
]
predictions = model.predict(reviews)
for review, pred in zip(reviews, predictions):
    sentiment = "Positive" if pred[0] > 0.5 else "Negative"
    print(f"{sentiment} ({pred[0]:.1%}): {review[:50]}...")
# Positive (94.2%): This movie was absolutely fantastic! Best fi...
# Negative (3.1%):  Terrible waste of time. Worst movie ever ma...
# Positive (61.3%): It was okay, nothing special but not bad ei...

# 5. Save – preprocessing included!
model.save("sentiment_analyzer.keras")
# loaded = keras.models.load_model("sentiment_analyzer.keras")
# loaded.predict(["Great movie!"])  # works with raw text!
```
11. Page 5 Summary
| Concept | What It Is | Key Code |
|---|---|---|
| TextVectorization | Text → integers inside the model | TextVectorization(max_tokens=10000) |
| Embedding | Integer → meaningful dense vector | Embedding(10000, 64, mask_zero=True) |
| LSTM | Sequence model: 3 gates + cell | LSTM(64, return_sequences=True) |
| GRU | Lighter LSTM: 2 gates | GRU(64, dropout=0.2) |
| Bidirectional | Read forward + backward | Bidirectional(LSTM(64)) |
| Padding | Make sequences the same length | pad_sequences(X, maxlen=200) |
| Masking | Ignore padding in the model | mask_zero=True |
| Pre-trained | Already-trained embeddings | GloVe, TF Hub USE, NNLM |
| Production Pipeline | Raw text → prediction | TextVectorization → Embedding → BiLSTM |
← Page 4 – tf.data Pipeline & Performance
Coming Next: Page 6 – Transformer & BERT in TensorFlow
The modern architecture that revolutionized NLP: the Multi-Head Attention layer (built into Keras!), building a Transformer Encoder from building blocks, Positional Encoding, fine-tuning BERT from TF Hub, Hugging Face + TensorFlow integration, 95%+ text classification with BERT, and an LSTM vs Transformer comparison. The game-changer that makes LSTM look outdated!