A sophisticated deep learning recommendation system built on a Two-Tower neural network architecture. It combines collaborative filtering (ALS) with transformer-based embeddings (BERT) to predict user ratings and generate personalized product recommendations on the Amazon Products dataset.
Read the full technical walkthrough on Medium: Building a Two-Tower Recommendation Engine: Combining BERT, ALS, and Popularity-Weighted Negative Sampling
Project Statistics:
- 600 unique products from Amazon product reviews
- 71,044 original reviews → 284,176 samples (with negative sampling)
- 322,817 trainable parameters (~1.23 MB model)
- 5 iterative model versions (v1 → v5_power)
- ~70 ms inference latency (after warmup)
This project implements a hybrid recommendation engine that solves the following business problem:
Given a user's past review behavior and product metadata, predict which products they are likely to rate highly and recommend personalized top-k products.
The system combines three powerful techniques:
- Collaborative Filtering (ALS): Captures user-product interaction patterns through matrix factorization
- Content-Based Filtering (BERT): Extracts semantic meaning from review text and product descriptions
- Temporal User Modeling: Uses running mean of previous reviews to capture evolving user preferences
This hybrid approach handles both cold-start users (via content features) and power users (via collaborative signals) while avoiding data leakage through careful temporal windowing.
- Two-Tower Architecture: separate embedding towers for users and products, with a fusion layer for rating prediction
- Dual Embedding Strategy: combines ALS (64-dim) + BERT (384-dim) embeddings for rich representations
- Sophisticated Negative Sampling: popularity-weighted sampling (3 negatives per positive) with label smoothing
- Anonymous User Handling: special treatment for users without account information, using global statistics
- Running Mean Embeddings: historical user preferences computed from past reviews (excludes the current review to prevent leakage)
- Efficient Batch Inference: pre-computed product embeddings enable fast top-k recommendations
- Production-Ready: models saved in both `.keras` and `.weights.h5` formats
- XLA Compilation: TensorFlow XLA for optimized training and inference
┌─────────────────────────────────────────────────────────────────┐
│                         TWO-TOWER MODEL                         │
├───────────────────────────┬─────────────────────────────────────┤
│        USER TOWER         │            PRODUCT TOWER            │
├───────────────────────────┼─────────────────────────────────────┤
│ Inputs:                   │ Inputs:                             │
│  • ALS Embedding (64)     │  • ALS Embedding (64)               │
│  • BERT Embedding (384)   │  • BERT Embedding (384)             │
│  • is_anonymous (1)       │                                     │
│                           │                                     │
│ Total: 449 dims           │ Total: 448 dims                     │
│         ↓                 │          ↓                          │
│ Dense(256, elu)           │ Dense(256, elu)                     │
│         ↓                 │          ↓                          │
│ Dense(128, elu)           │ Dense(128, elu)                     │
│         ↓                 │          ↓                          │
│ Dropout(0.3)              │ Dropout(0.3)                        │
│         ↓                 │          ↓                          │
│ Dense(64, elu)            │ Dense(64, elu)                      │
└───────────────────────────┴─────────────────────────────────────┘
              │                              │
              └───────────────┬──────────────┘
                              ↓
                   ┌─────────────────────┐
                   │    FUSION LAYER     │
                   │  Concatenate(128)   │
                   │          ↓          │
                   │   Dense(64, elu)    │
                   │          ↓          │
                   │    Dropout(0.2)     │
                   │          ↓          │
                   │   Dense(32, elu)    │
                   │          ↓          │
                   │  Dense(1, linear)   │
                   │          ↓          │
                   │  Rating Prediction  │
                   └─────────────────────┘
- Activation Function: ELU (Exponential Linear Unit) for all hidden layers
- Weight Initialization: He Normal (optimal for ELU activations)
- Regularization: Dropout (0.3 in towers, 0.2 in fusion layer)
- Output Layer: Linear activation for continuous rating prediction (0-5 scale)
- Total Parameters: 322,817 trainable parameters
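As a concrete reference, below is a minimal Keras sketch of this architecture. The builder function name is illustrative rather than the notebook's actual identifier, but with the layer sizes from the diagram it reproduces the stated 322,817 trainable parameters.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_two_tower_model() -> Model:
    # User tower input: ALS (64) + BERT (384) + is_anonymous (1) = 449 dims
    user_in = layers.Input(shape=(449,), name="user_features")
    # Product tower input: ALS (64) + BERT (384) = 448 dims
    prod_in = layers.Input(shape=(448,), name="product_features")

    def tower(x, prefix):
        # Dense(256) -> Dense(128) -> Dropout(0.3) -> Dense(64), ELU + He normal
        x = layers.Dense(256, activation="elu",
                         kernel_initializer="he_normal")(x)
        x = layers.Dense(128, activation="elu",
                         kernel_initializer="he_normal")(x)
        x = layers.Dropout(0.3)(x)
        return layers.Dense(64, activation="elu",
                            kernel_initializer="he_normal",
                            name=f"{prefix}_embedding")(x)

    # Fusion head: Concatenate(64+64) -> 64 -> Dropout(0.2) -> 32 -> linear
    h = layers.Concatenate()([tower(user_in, "user"),
                              tower(prod_in, "product")])
    h = layers.Dense(64, activation="elu", kernel_initializer="he_normal")(h)
    h = layers.Dropout(0.2)(h)
    h = layers.Dense(32, activation="elu", kernel_initializer="he_normal")(h)
    rating = layers.Dense(1, activation="linear", name="rating")(h)

    return Model(inputs=[user_in, prod_in], outputs=rating)

model = build_two_tower_model()
model.summary()  # 322,817 trainable parameters with these layer sizes
```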
The project uses the Amazon Product Reviews dataset (the Grammar and Product Reviews subset).
Input File: GrammarandProductReviews.csv
- 71,044 product reviews
- 600 unique products
- 25 columns, including:
  - Product metadata: `id`, `brand`, `categories`, `manufacturer`, `name`
  - Review data: `reviews.rating` (1-5), `reviews.text`, `reviews.title`, `reviews.username`
  - Timestamps: `dateAdded`, `dateUpdated`, `reviews.date`
  - User info: `reviews.userCity`, `reviews.userProvince`
Generated Files:
- negative_samples.csv - Synthetically generated negative training samples
- final_df.csv - Combined dataset with positive + negative samples (284,176 records)
- Original Reviews: 71,044
- After Negative Sampling: 284,176 (4x expansion)
- Negative Sample Ratio: 75% negative samples
- Mean Rating (after sampling): 1.85
- Standard Deviation: 1.64
- Rating Range: 0-5
- Fill missing review titles with `"No Title"`
- Fill missing review text with `"No Review"`
- Fill missing usernames with placeholder `"Akshat"`
- Fill missing manufacturers with the mode value
- Create `user_id` by mapping usernames to integers
- Create `prod_id` by mapping product IDs to integers (0-599)
- Convert ratings to float32
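A minimal pandas sketch of these preprocessing steps (column names follow the dataset; the exact notebook code may differ):

```python
import pandas as pd

df = pd.read_csv("GrammarandProductReviews.csv")

# Fill missing text fields with explicit placeholders
df["reviews.title"] = df["reviews.title"].fillna("No Title")
df["reviews.text"] = df["reviews.text"].fillna("No Review")
df["reviews.username"] = df["reviews.username"].fillna("Akshat")
df["manufacturer"] = df["manufacturer"].fillna(df["manufacturer"].mode()[0])

# Map usernames and product ids to contiguous integer ids
df["user_id"] = df["reviews.username"].astype("category").cat.codes
df["prod_id"] = df["id"].astype("category").cat.codes  # 0-599

df["reviews.rating"] = df["reviews.rating"].astype("float32")
```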
- Train an AlternatingLeastSquares model using the `implicit` library
- Factors: 64 dimensions
- Generates:
  - User embeddings (64-dim) from interaction patterns
  - Product embeddings (64-dim) from interaction patterns
- Anonymous users receive zero vectors
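A minimal sketch of the ALS step using the `implicit` library; the interaction-matrix construction below is an assumption (the notebook may weight or binarize ratings differently):

```python
import implicit
from scipy.sparse import csr_matrix

# Sparse user x item matrix; implicit treats values as confidence weights
interactions = csr_matrix(
    (df["reviews.rating"].to_numpy(),
     (df["user_id"].to_numpy(), df["prod_id"].to_numpy()))
)

als = implicit.als.AlternatingLeastSquares(factors=64)
als.fit(interactions)

user_als = als.user_factors   # (n_users, 64) collaborative user embeddings
item_als = als.item_factors   # (n_items, 64) collaborative item embeddings
```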
User Review Embeddings (384 dimensions):
- Model: `all-MiniLM-L6-v2` (SentenceTransformer)
- Input: concatenation of `review_title + review_text`
- Running Mean Strategy: each user receives the average of all their PREVIOUS review embeddings (see the sketch below)
  - Excludes the current review (prevents data leakage)
  - First review for a new user → zero vector
  - Anonymous users → global average embedding

Product Embeddings (384 dimensions):
- Input: concatenation of `name + brand + categories + manufacturer`
- Fixed per product (pre-computed)
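A sketch of the running-mean strategy, assuming one DataFrame row per review sorted chronologically by `reviews.date`; variable names are illustrative:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Sort chronologically so "previous" reviews are well-defined
df = df.sort_values("reviews.date").reset_index(drop=True)

encoder = SentenceTransformer("all-MiniLM-L6-v2")
texts = (df["reviews.title"] + " " + df["reviews.text"]).tolist()
review_emb = encoder.encode(texts)                     # (n_reviews, 384)

user_hist_emb = np.zeros((len(df), 384), dtype="float32")
sums, counts = {}, {}
for i, uid in enumerate(df["user_id"]):
    if counts.get(uid, 0) > 0:
        # Mean of this user's EARLIER reviews; the current one is excluded
        user_hist_emb[i] = sums[uid] / counts[uid]
    # else: first review for a new user stays a zero vector (the default)
    sums[uid] = sums.get(uid, 0.0) + review_emb[i]
    counts[uid] = counts.get(uid, 0) + 1

# Anonymous users instead receive the global average embedding (not shown)
```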
- Binary feature: `is_anonymous_user` (0 or 1)
- Flags users: "Anonymous", "An anonymous customer", "ByAmazon Customer"
- Allows the model to learn different behavior patterns for anonymous users
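The flag itself can be a one-liner in pandas (a sketch using set membership over the names listed above):

```python
ANON_NAMES = {"Anonymous", "An anonymous customer", "ByAmazon Customer"}
df["is_anonymous_user"] = df["reviews.username"].isin(ANON_NAMES).astype("float32")
```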
Innovation: Popularity-based weighted sampling with label smoothing
Method:
- For each positive review, generate k=3 negative samples
- Sample products using popularity distribution (alpha=1.0)
- Assign low negative ratings (uniform random 0-2)
- Ensures model learns from items the user hasn't interacted with
- Prevents popularity bias while maintaining realism
Result: Dataset expands from 71,044 → 284,176 records (see the sketch below)
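A sketch of the sampling procedure under the stated parameters (k=3, alpha=1.0, negative ratings drawn uniformly from [0, 2]); the exclusion of already-reviewed products is left as a comment since the exact notebook logic isn't shown here:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Popularity distribution over products: p_i proportional to count_i ** alpha
alpha = 1.0
counts = df["prod_id"].value_counts().sort_index().to_numpy(dtype="float64")
popularity = counts ** alpha
popularity /= popularity.sum()

k = 3  # negatives per positive review
neg_rows = []
for _, row in df.iterrows():
    # Sample k products by popularity; a production version would also
    # reject products this user has already reviewed
    for pid in rng.choice(len(popularity), size=k, p=popularity):
        neg_rows.append({
            "user_id": row["user_id"],
            "prod_id": int(pid),
            # Low "smoothed" rating instead of a hard 0
            "reviews.rating": float(rng.uniform(0.0, 2.0)),
        })

negative_samples = pd.DataFrame(neg_rows)  # -> negative_samples.csv
```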
# Core ML/DL
tensorflow>=2.15.0
keras>=2.15.0
numpy>=1.24.0

# Data Processing
pandas>=2.0.0
scikit-learn>=1.3.0

# Recommendation/Embeddings
implicit>=0.7.0
sentence-transformers>=2.2.0

# Visualization
matplotlib>=3.7.0
seaborn>=0.12.0
pydot>=1.4.0
graphviz>=0.20.0

# Utilities
scipy>=1.11.0
Pillow>=10.0.0
tqdm>=4.65.0
jupyterlab>=4.0.0

# GPU Acceleration (optional)
nvidia-cuda-nvcc-cu12

# Clone or navigate to the project directory
cd /home/akshat/recsys
# Install dependencies
pip install tensorflow keras numpy pandas scikit-learn implicit sentence-transformers matplotlib seaborn scipy tqdm pydot graphviz
# Launch Jupyter
jupyter lab

Open and run `recsys.ipynb` to:
- Load and preprocess data
- Train ALS model
- Generate BERT embeddings
- Create negative samples
- Build and train the Two-Tower model
- Save trained model
The notebook handles the complete pipeline from raw data to trained model.
import tensorflow as tf
# Load the latest model (v5_power)
model = tf.keras.models.load_model('two_tower_recsys_v5_power.keras')
# Or load weights only
# model.load_weights('my_two_tower_model_v5_power.weights.h5')

# Recommend top-5 products for specific users
recommendations = predict_for_users_batched(
user_ids=[0, 243, 599],
top_k=5,
product_ids=None # None = all products
)
# Output format: {user_id: [(product_id, predicted_rating), ...]}
# Example:
# {
# 0: [(542, 0.948), (120, 0.948), (17, 0.948)],
# 243: [(262, 4.35), (91, 4.34), (28, 4.34)],
# 599: [(91, 4.40), (381, 4.37), (90, 4.36)]
# }

The system uses pre-computed product embeddings for efficient inference:
# Product embeddings are pre-computed and stored
product_tower_forward_pass_embedding_dict # 600 products ready
# Batch inference with 1024 batch size
# First call: ~66-734ms (warmup)
# Subsequent calls: ~66-75ms (optimized)

The project includes 5 iterative model versions, each representing improvements and experiments:
| Version | Keras Model | Weights File | Notes |
|---|---|---|---|
| v1 | two_tower_recsys_v1.keras | my_two_tower_model.weights.h5 | Initial baseline |
| v2 | two_tower_recsys_v2.keras | my_two_tower_model_v2.weights.h5 | Iteration 2 |
| v3 | two_tower_recsys_v3.keras | my_two_tower_model_v3.weights.h5 | Iteration 3 |
| v4 | two_tower_recsys_v4.keras | my_two_tower_model_v4.weights.h5 | Iteration 4 |
| v5_power | two_tower_recsys_v5_power.keras | my_two_tower_model_v5_power.weights.h5 | Latest & most refined |
Recommended: Use v5_power for best performance.
# Optimizer
optimizer = 'adam'

# Loss & Metrics
loss = 'mean_squared_error'
metrics = ['mean_absolute_error']

# Training
epochs = 10              # 10-15 across versions
batch_size = 32
total_samples = 284_176
batches_per_epoch = 8_881

# Data Pipeline
shuffle_buffer = 1024
prefetch = tf.data.AUTOTUNE

# Optimization
xla_compilation = True   # Accelerated Linear Algebra

- Data Pipeline: TensorFlow `tf.data` with shuffling and prefetching
- XLA Compilation: compiled clusters for faster execution
- Batch Training: 32 samples per batch, ~8,881 batches per epoch
- Monitoring: MSE loss and MAE metrics tracked during training
- Checkpointing: models saved after achieving satisfactory performance
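A minimal `tf.data` pipeline matching these settings; the feature arrays are assumed to be pre-built from the earlier steps, and `jit_compile=True` is one way to enable XLA in Keras:

```python
import tensorflow as tf

# user_features: (N, 449), product_features: (N, 448), ratings: (N,)
dataset = (
    tf.data.Dataset.from_tensor_slices(
        ((user_features, product_features), ratings)
    )
    .shuffle(1024)                # shuffle_buffer
    .batch(32)                    # batch_size -> ~8,881 batches per epoch
    .prefetch(tf.data.AUTOTUNE)
)

# `model` built as in the architecture sketch above
model.compile(
    optimizer="adam",
    loss="mean_squared_error",
    metrics=["mean_absolute_error"],
    jit_compile=True,             # XLA compilation
)
model.fit(dataset, epochs=10)
```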
Initial Training Metrics:
- Starting Loss: ~2.64
- Starting MAE: ~1.24
- First Prediction (cold start): ~66-734ms
- Subsequent Predictions (warm): ~66-75ms
- Batch Size: 1024 samples
- Optimization: Pre-computed product embeddings stored in memory
- Pre-computed Embeddings: Product tower outputs computed once for all 600 products
- Batch Inference: Tile user features across products for vectorized predictions
- XLA Compilation: JIT compilation for optimized ops
- Memory Efficiency: ~1.23 MB model size
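For reference, a plausible shape for `predict_for_users_batched` built on these optimizations; `user_tower`, `fusion_head`, and `get_user_features` are assumed handles rather than confirmed notebook identifiers:

```python
import numpy as np
import tensorflow as tf

def predict_for_users_batched(user_ids, top_k=5, product_ids=None):
    """Illustrative top-k scoring against cached product-tower outputs."""
    if product_ids is None:
        product_ids = list(product_tower_forward_pass_embedding_dict.keys())
    # Stack the pre-computed 64-dim product embeddings once
    prod_emb = np.stack(
        [product_tower_forward_pass_embedding_dict[p] for p in product_ids]
    ).astype("float32")                                  # (n_products, 64)

    results = {}
    for uid in user_ids:
        # One user-tower pass, tiled across all candidate products
        u = user_tower(get_user_features(uid)[None, :])  # (1, 64)
        u_tiled = tf.tile(u, [len(product_ids), 1])      # (n_products, 64)
        scores = fusion_head(tf.concat([u_tiled, prod_emb], axis=1))
        scores = tf.squeeze(scores, axis=-1).numpy()
        top = np.argsort(-scores)[:top_k]
        results[uid] = [(product_ids[i], float(scores[i])) for i in top]
    return results
```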
- Temporal Data Handling: running mean of PREVIOUS reviews only (excludes the current review)
  - Prevents data leakage during training
  - Captures evolving user preferences naturally
- Dual Embedding Strategy: combines complementary signals
  - ALS: captures collaborative patterns (who likes similar items)
  - BERT: captures content semantics (what items are similar)
- Explicit Anonymous User Feature: binary flag allows the model to learn different patterns
  - Anonymous users may have different rating behaviors
  - The model can adapt predictions accordingly
- Production-Ready Design:
  - Saved in standard Keras formats (`.keras` and `.weights.h5`)
  - Efficient batch inference pipeline
  - Pre-computed embeddings for scalability
- Sophisticated Negative Sampling:
  - Popularity-weighted (not uniform random)
  - Label smoothing with low ratings (0-2)
  - Realistic training distribution
recsys/
├── README.md                                # This file
├── recsys.ipynb                             # Main training notebook
├── dummy.ipynb                              # Simple data exploration notebook
│
├── GrammarandProductReviews.csv             # Raw dataset (71K reviews)
├── negative_samples.csv                     # Generated negative samples
├── final_df.csv                             # Combined dataset (284K samples)
│
├── two_tower_recsys_v1.keras                # Model v1
├── two_tower_recsys_v2.keras                # Model v2
├── two_tower_recsys_v3.keras                # Model v3
├── two_tower_recsys_v4.keras                # Model v4
├── two_tower_recsys_v5_power.keras          # Model v5 (latest)
│
├── my_two_tower_model.weights.h5            # Weights v1
├── my_two_tower_model_v2.weights.h5         # Weights v2
├── my_two_tower_model_v3.weights.h5         # Weights v3
├── my_two_tower_model_v4.weights.h5         # Weights v4
└── my_two_tower_model_v5_power.weights.h5   # Weights v5 (latest)
Potential enhancements for the system:
- Multi-task Learning: Predict both ratings and purchase probability
- Attention Mechanisms: Add attention layers to weight embedding importance
- Contextual Features: Incorporate time-of-day, seasonality, device type
- Online Learning: Support incremental updates with new user interactions
- A/B Testing Framework: Compare model versions in production
- Explainability: Add SHAP or attention visualization for recommendation explanations
- Diversity Metrics: Optimize for recommendation diversity alongside accuracy
- Real-time Serving: Deploy with TensorFlow Serving or TorchServe
- Cross-validation: Implement temporal cross-validation for robust evaluation
- Hyperparameter Tuning: Automated search with Optuna or Ray Tune
This project is available for educational and research purposes.
Akshat
Data Scientist | Machine Learning Engineer
- Dataset: Amazon Product Reviews (Grammar and Product Reviews subset)
- Libraries: TensorFlow, Keras, Hugging Face Transformers, Implicit
- Inspiration: Two-Tower architecture from Google Research and YouTube recommendations
- Two-Tower Models for Recommendation
- Neural Collaborative Filtering
- Sentence-BERT: Sentence Embeddings using Siamese Networks
- Implicit Feedback for Collaborative Filtering