# Two-Tower Product Recommendation System

A deep learning recommendation system built on a Two-Tower neural network architecture. It combines collaborative filtering (ALS) with transformer-based embeddings (BERT) to predict user ratings and generate personalized product recommendations on the Amazon Product Reviews dataset.

๐Ÿ“ Read the full technical walkthrough on Medium: Building a Two-Tower Recommendation Engine: Combining BERT, ALS, and Popularity-Weighted Negative Sampling

**Project Statistics:**

- 600 unique products from Amazon product reviews
- 71,044 original reviews → 284,176 samples (with negative sampling)
- 322,817 trainable parameters (~1.23 MB model)
- 5 iterative model versions (v1 → v5_power)
- ~70 ms inference latency (after warmup)

## Overview

This project implements a hybrid recommendation engine that solves the following business problem:

Given a user's past review behavior and product metadata, predict which products they are likely to rate highly and recommend personalized top-k products.

### Key Innovation

The system combines three powerful techniques:

  1. Collaborative Filtering (ALS): Captures user-product interaction patterns through matrix factorization
  2. Content-Based Filtering (BERT): Extracts semantic meaning from review text and product descriptions
  3. Temporal User Modeling: Uses running mean of previous reviews to capture evolving user preferences

This hybrid approach handles both cold-start users (via content features) and power users (via collaborative signals) while avoiding data leakage through careful temporal windowing.


## Features

- **Two-Tower Architecture**: separate embedding towers for users and products, with a fusion layer for rating prediction
- **Dual Embedding Strategy**: combines ALS (64-dim) and BERT (384-dim) embeddings for rich representations
- **Sophisticated Negative Sampling**: popularity-weighted sampling (3 negatives per positive) with label smoothing
- **Anonymous User Handling**: special treatment for users without account information, using global statistics
- **Running Mean Embeddings**: historical user preferences computed from past reviews (excludes the current review to prevent leakage)
- **Efficient Batch Inference**: pre-computed product embeddings enable fast top-k recommendations
- **Production-Ready**: models saved in both `.keras` and `.weights.h5` formats
- **XLA Compilation**: TensorFlow XLA for optimized training and inference

๐Ÿ›๏ธ Architecture

Model Structure

```text
┌─────────────────────────────────────────────────────────────────┐
│                      TWO-TOWER MODEL                            │
├──────────────────────────┬──────────────────────────────────────┤
│       USER TOWER         │        PRODUCT TOWER                 │
├──────────────────────────┼──────────────────────────────────────┤
│ Inputs:                  │ Inputs:                              │
│  • ALS Embedding (64)    │  • ALS Embedding (64)                │
│  • BERT Embedding (384)  │  • BERT Embedding (384)              │
│  • is_anonymous (1)      │                                      │
│                          │                                      │
│ Total: 449 dims          │ Total: 448 dims                      │
│     ↓                    │     ↓                                │
│ Dense(256, elu)          │ Dense(256, elu)                      │
│     ↓                    │     ↓                                │
│ Dense(128, elu)          │ Dense(128, elu)                      │
│     ↓                    │     ↓                                │
│ Dropout(0.3)             │ Dropout(0.3)                         │
│     ↓                    │     ↓                                │
│ Dense(64, elu)           │ Dense(64, elu)                       │
│     ↓                    │     ↓                                │
└──────────────────────────┴──────────────────────────────────────┘
              ↓                           ↓
              └─────────────┬─────────────┘
                            ↓
                  ┌─────────────────────┐
                  │    FUSION LAYER     │
                  │  Concatenate (128)  │
                  │          ↓          │
                  │   Dense(64, elu)    │
                  │          ↓          │
                  │    Dropout(0.2)     │
                  │          ↓          │
                  │   Dense(32, elu)    │
                  │          ↓          │
                  │  Dense(1, linear)   │
                  │          ↓          │
                  │  Rating Prediction  │
                  └─────────────────────┘
```

### Key Design Choices

- **Activation function**: ELU (Exponential Linear Unit) for all hidden layers
- **Weight initialization**: He normal (well suited to ELU activations)
- **Regularization**: Dropout (0.3 in the towers, 0.2 in the fusion layer)
- **Output layer**: linear activation for continuous rating prediction (0-5 scale)
- **Total parameters**: 322,817 trainable parameters
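
The full model is defined in `recsys.ipynb`; below is a minimal Keras sketch of the architecture above. Layer sizes, activations, and initializers follow the diagram and design choices; everything else (names, wiring helpers) is illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_tower(input_dim: int, name: str) -> Model:
    # Shared tower pattern: Dense 256 -> 128 -> Dropout(0.3) -> 64,
    # ELU activations with He-normal initialization throughout
    inputs = layers.Input(shape=(input_dim,), name=f"{name}_features")
    x = layers.Dense(256, activation="elu", kernel_initializer="he_normal")(inputs)
    x = layers.Dense(128, activation="elu", kernel_initializer="he_normal")(x)
    x = layers.Dropout(0.3)(x)
    out = layers.Dense(64, activation="elu", kernel_initializer="he_normal")(x)
    return Model(inputs, out, name=f"{name}_tower")

# User tower: ALS (64) + BERT running mean (384) + is_anonymous (1) = 449 dims
# Product tower: ALS (64) + BERT metadata embedding (384)          = 448 dims
user_tower = build_tower(449, "user")
product_tower = build_tower(448, "product")

user_in = layers.Input(shape=(449,), name="user_input")
prod_in = layers.Input(shape=(448,), name="product_input")

# Fusion head: concatenate the two 64-dim tower outputs and regress the rating
x = layers.Concatenate()([user_tower(user_in), product_tower(prod_in)])
x = layers.Dense(64, activation="elu", kernel_initializer="he_normal")(x)
x = layers.Dropout(0.2)(x)
x = layers.Dense(32, activation="elu", kernel_initializer="he_normal")(x)
rating = layers.Dense(1, activation="linear", name="rating")(x)

model = Model([user_in, prod_in], rating, name="two_tower_recsys")
model.compile(optimizer="adam", loss="mean_squared_error",
              metrics=["mean_absolute_error"])
```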

## Data

### Dataset

The project uses the "Grammar and Product Reviews" subset of the Amazon Product Reviews dataset.

**Input file**: `GrammarandProductReviews.csv`

- 71,044 product reviews
- 600 unique products
- 25 columns, including:
  - Product metadata: `id`, `brand`, `categories`, `manufacturer`, `name`
  - Review data: `reviews.rating` (1-5), `reviews.text`, `reviews.title`, `reviews.username`
  - Timestamps: `dateAdded`, `dateUpdated`, `reviews.date`
  - User info: `reviews.userCity`, `reviews.userProvince`

**Generated files**: `negative_samples.csv` (generated negative samples) and `final_df.csv` (the combined 284K-sample training set)

### Data Statistics

- **Original reviews**: 71,044
- **After negative sampling**: 284,176 (4× expansion)
- **Negative sample ratio**: 75% negative samples
- **Mean rating (after sampling)**: 1.85
- **Standard deviation**: 1.64
- **Rating range**: 0-5

## Data Processing Pipeline

### 1. Data Cleaning

- Fill missing review titles with "No Title"
- Fill missing review text with "No Review"
- Fill missing usernames with the placeholder "Akshat"
- Fill missing manufacturers with the mode value

### 2. Encoding

- Create `user_id` by mapping usernames to integers
- Create `prod_id` by mapping product IDs to integers (0-599)
- Convert ratings to `float32`
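
A minimal pandas sketch of steps 1 and 2, assuming the `reviews.*` column names listed in the Data section (the notebook's exact integer mapping may differ):

```python
import pandas as pd

df = pd.read_csv("GrammarandProductReviews.csv")

# 1. Data cleaning: fill missing text fields and metadata
df["reviews.title"] = df["reviews.title"].fillna("No Title")
df["reviews.text"] = df["reviews.text"].fillna("No Review")
df["reviews.username"] = df["reviews.username"].fillna("Akshat")
df["manufacturer"] = df["manufacturer"].fillna(df["manufacturer"].mode()[0])

# 2. Encoding: map usernames and product ids to contiguous integer ids
df["user_id"] = df["reviews.username"].astype("category").cat.codes
df["prod_id"] = df["id"].astype("category").cat.codes  # 0-599
df["reviews.rating"] = df["reviews.rating"].astype("float32")
```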

### 3. ALS Collaborative Filtering

- Train an `AlternatingLeastSquares` model using the `implicit` library
- Factors: 64 dimensions
- Generates:
  - User embeddings (64-dim) from interaction patterns
  - Product embeddings (64-dim) from interaction patterns
- Anonymous users receive zero vectors
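
A sketch of this step with the `implicit` library. `factors=64` comes from the text; the other hyperparameters are assumptions, and `implicit>=0.5` expects a user-item matrix in `fit()`:

```python
import numpy as np
from scipy.sparse import csr_matrix
from implicit.als import AlternatingLeastSquares

# Sparse user-item interaction matrix built from the encoded ids
interactions = csr_matrix(
    (np.ones(len(df), dtype=np.float32), (df["user_id"], df["prod_id"]))
)

als = AlternatingLeastSquares(factors=64, iterations=15, random_state=42)
als.fit(interactions)

user_als = als.user_factors  # (n_users, 64) collaborative user embeddings
prod_als = als.item_factors  # (n_products, 64) collaborative product embeddings
# Anonymous users are assigned zero vectors instead of their ALS rows
```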

### 4. BERT Embeddings

**User review embeddings (384 dimensions):**

- Model: `all-MiniLM-L6-v2` (SentenceTransformer)
- Input: concatenated `review_title` + `review_text`
- **Running mean strategy**: each user receives the average of all their *previous* review embeddings
  - Excludes the current review (prevents data leakage)
  - First review for a new user → zero vector
  - Anonymous users → global average embedding

**Product embeddings (384 dimensions):**

- Input: concatenated `name` + `brand` + `categories` + `manufacturer`
- Fixed per product (pre-computed)
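
A sketch of the running-mean user embeddings and fixed product embeddings under these rules (the anonymous-user global-average case is noted but omitted for brevity; see the notebook for the full logic):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

bert = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim sentence embeddings

# Sort chronologically first, so "previous" means earlier in time
df = df.sort_values("reviews.date").reset_index(drop=True)
review_emb = bert.encode(
    (df["reviews.title"] + " " + df["reviews.text"]).tolist()
)

# Running mean of each user's PREVIOUS reviews only; first review -> zeros.
# (Anonymous users would instead get the global average embedding.)
user_hist = np.zeros((len(df), 384), dtype=np.float32)
sums, counts = {}, {}
for i, (uid, emb) in enumerate(zip(df["user_id"], review_emb)):
    if counts.get(uid, 0) > 0:
        user_hist[i] = sums[uid] / counts[uid]  # excludes the current review
    sums[uid] = sums.get(uid, 0) + emb
    counts[uid] = counts.get(uid, 0) + 1

# Product embeddings: one fixed vector per product from its metadata text
prod_meta = df.drop_duplicates("prod_id").sort_values("prod_id")
prod_text = (prod_meta["name"].fillna("") + " " + prod_meta["brand"].fillna("")
             + " " + prod_meta["categories"].fillna("")
             + " " + prod_meta["manufacturer"]).tolist()
prod_emb = bert.encode(prod_text)  # (600, 384)
```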

### 5. Anonymous User Handling

- Binary feature: `is_anonymous_user` (0 or 1)
- Flags the usernames "Anonymous", "An anonymous customer", and "ByAmazon Customer"
- Lets the model learn different behavior patterns for anonymous users
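
A one-line sketch of the flag, using the three usernames listed above:

```python
ANON_NAMES = {"Anonymous", "An anonymous customer", "ByAmazon Customer"}
df["is_anonymous_user"] = df["reviews.username"].isin(ANON_NAMES).astype("float32")
```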

### 6. Negative Sampling Strategy

**Innovation**: popularity-weighted sampling with label smoothing

**Method:**

- For each positive review, generate k=3 negative samples
- Sample products from the popularity distribution (alpha=1.0)
- Assign low negative ratings (uniform random 0-2)
- Ensures the model learns from items the user hasn't interacted with
- Prevents popularity bias while maintaining realism

**Result**: the dataset expands from 71,044 → 284,176 records
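
A sketch of the sampling loop under these rules. `K=3` and `alpha=1.0` come from the method description; the rejection loop and column names are illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
K, ALPHA = 3, 1.0  # negatives per positive, popularity exponent

# Popularity distribution over products (ALPHA=1.0 keeps raw popularity)
counts = df["prod_id"].value_counts().sort_index().to_numpy(dtype=np.float64)
probs = counts ** ALPHA
probs /= probs.sum()

seen_by_user = df.groupby("user_id")["prod_id"].agg(set).to_dict()

neg_rows = []
for _, row in df.iterrows():
    for _ in range(K):
        # Resample until we hit a product this user has never reviewed
        pid = rng.choice(len(probs), p=probs)
        while pid in seen_by_user[row["user_id"]]:
            pid = rng.choice(len(probs), p=probs)
        neg_rows.append({
            "user_id": row["user_id"],
            "prod_id": int(pid),
            # Label smoothing: a low random rating instead of a hard 0
            "reviews.rating": rng.uniform(0.0, 2.0),
        })

final_df = pd.concat([df, pd.DataFrame(neg_rows)], ignore_index=True)
# 71,044 positives + 3 * 71,044 negatives = 284,176 rows
```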


## Installation

### Requirements

```text
# Core ML/DL
tensorflow>=2.15.0
keras>=2.15.0
numpy>=1.24.0

# Data Processing
pandas>=2.0.0
scikit-learn>=1.3.0

# Recommendation/Embeddings
implicit>=0.7.0
sentence-transformers>=2.2.0

# Visualization
matplotlib>=3.7.0
seaborn>=0.12.0
pydot>=1.4.0
graphviz>=0.20.0

# Utilities
scipy>=1.11.0
Pillow>=10.0.0
tqdm>=4.65.0
jupyterlab>=4.0.0

# GPU Acceleration (optional)
nvidia-cuda-nvcc-cu12
```

### Setup

```bash
# Clone or navigate to the project directory
cd /home/akshat/recsys

# Install dependencies
pip install tensorflow keras numpy pandas scikit-learn implicit sentence-transformers matplotlib seaborn scipy tqdm pydot graphviz

# Launch Jupyter
jupyter lab
```

## Usage

### 1. Training the Model

Open and run `recsys.ipynb` to:

  1. Load and preprocess data
  2. Train ALS model
  3. Generate BERT embeddings
  4. Create negative samples
  5. Build and train the Two-Tower model
  6. Save trained model

The notebook handles the complete pipeline from raw data to trained model.

### 2. Loading a Trained Model

```python
import tensorflow as tf

# Load the latest model (v5_power)
model = tf.keras.models.load_model('two_tower_recsys_v5_power.keras')

# Or load weights only (requires building the architecture first)
# model.load_weights('my_two_tower_model_v5_power.weights.h5')
```

### 3. Getting Recommendations

```python
# Recommend top-5 products for specific users
recommendations = predict_for_users_batched(
    user_ids=[0, 243, 599],
    top_k=5,
    product_ids=None  # None = all products
)

# Output format: {user_id: [(product_id, predicted_rating), ...]}
# Example:
# {
#   0: [(542, 0.948), (120, 0.948), (17, 0.948)],
#   243: [(262, 4.35), (91, 4.34), (28, 4.34)],
#   599: [(91, 4.40), (381, 4.37), (90, 4.36)]
# }
```

### 4. Batch Inference

The system uses pre-computed product embeddings for efficient inference:

```python
# Product embeddings are pre-computed and stored
product_tower_forward_pass_embedding_dict  # 600 products ready

# Batch inference with a 1024 batch size
# First call: ~66-734 ms (warmup)
# Subsequent calls: ~66-75 ms (optimized)
```
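
For reference, a hypothetical sketch of what `predict_for_users_batched` does: tile one user's feature vector across all candidate products and score them in one batched forward pass. `user_features` and `prod_features` are assumed pre-built input arrays; unlike the notebook, this sketch re-runs the full model rather than reusing the cached product-tower outputs:

```python
import numpy as np

def predict_for_users_batched(user_ids, top_k=5, product_ids=None, batch_size=1024):
    if product_ids is None:
        product_ids = np.arange(len(prod_features))  # None = all 600 products
    prod_batch = prod_features[product_ids]          # (n_products, 448)

    results = {}
    for uid in user_ids:
        # Tile this user's 449-dim features across every candidate product
        user_batch = np.tile(user_features[uid], (len(product_ids), 1))
        scores = model.predict([user_batch, prod_batch],
                               batch_size=batch_size, verbose=0).ravel()
        top = np.argsort(scores)[::-1][:top_k]
        results[uid] = [(int(product_ids[i]), float(scores[i])) for i in top]
    return results
```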

## Model Versions

The project includes 5 iterative model versions, each representing improvements and experiments:

| Version | Keras Model | Weights File | Notes |
|---------|-------------|--------------|-------|
| v1 | `two_tower_recsys_v1.keras` | `my_two_tower_model.weights.h5` | Initial baseline |
| v2 | `two_tower_recsys_v2.keras` | `my_two_tower_model_v2.weights.h5` | Iteration 2 |
| v3 | `two_tower_recsys_v3.keras` | `my_two_tower_model_v3.weights.h5` | Iteration 3 |
| v4 | `two_tower_recsys_v4.keras` | `my_two_tower_model_v4.weights.h5` | Iteration 4 |
| v5_power | `two_tower_recsys_v5_power.keras` | `my_two_tower_model_v5_power.weights.h5` | Latest & most refined |

**Recommended**: use `v5_power` for the best performance.


## Training Configuration

### Hyperparameters

```python
# Optimizer
optimizer = 'adam'

# Loss & Metrics
loss = 'mean_squared_error'
metrics = ['mean_absolute_error']

# Training
epochs = 10              # 10-15 across versions
batch_size = 32
total_samples = 284_176
batches_per_epoch = 8_881

# Data Pipeline
shuffle_buffer = 1024
prefetch = tf.data.AUTOTUNE

# Optimization
xla_compilation = True   # Accelerated Linear Algebra (XLA)
```
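
A sketch of how these settings plug into a `tf.data` input pipeline with XLA enabled. `user_feats`, `prod_feats`, and `ratings` are assumed in-memory NumPy arrays matching the tower input dims; `jit_compile=True` in `compile()` requires a reasonably recent TensorFlow:

```python
import tensorflow as tf

# Build the input pipeline: ((user_features, product_features), rating) pairs
dataset = (
    tf.data.Dataset.from_tensor_slices(((user_feats, prod_feats), ratings))
    .shuffle(1024)                # shuffle_buffer
    .batch(32)                    # batch_size
    .prefetch(tf.data.AUTOTUNE)   # overlap preprocessing and training
)

# jit_compile=True turns on XLA compilation for the train/predict functions
model.compile(optimizer='adam', loss='mean_squared_error',
              metrics=['mean_absolute_error'], jit_compile=True)

model.fit(dataset, epochs=10)  # ~8,881 batches per epoch at batch_size=32
```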

### Training Process

  1. Data Pipeline: TensorFlow tf.data with shuffling and prefetching
  2. XLA Compilation: Compiled clusters for faster execution
  3. Batch Training: 32 samples per batch, ~8,881 batches per epoch
  4. Monitoring: MSE loss and MAE metrics tracked during training
  5. Checkpointing: Models saved after achieving satisfactory performance

**Initial training metrics:**

- Starting loss: ~2.64
- Starting MAE: ~1.24

## Performance

### Inference Speed

- **First prediction (cold start)**: ~66-734 ms
- **Subsequent predictions (warm)**: ~66-75 ms
- **Batch size**: 1024 samples
- **Optimization**: pre-computed product embeddings stored in memory

### Efficiency Features

1. **Pre-computed embeddings**: product tower outputs computed once for all 600 products
2. **Batch inference**: user features tiled across products for vectorized predictions
3. **XLA compilation**: JIT compilation for optimized ops
4. **Memory efficiency**: ~1.23 MB model size

## Technical Highlights

### Key Innovations

1. **Temporal data handling**: running mean of *previous* reviews only (excludes the current review)
   - Prevents data leakage during training
   - Captures evolving user preferences naturally
2. **Dual embedding strategy**: combines complementary signals
   - ALS captures collaborative patterns (who likes similar items)
   - BERT captures content semantics (what items are similar)
3. **Explicit anonymous-user feature**: a binary flag lets the model learn different patterns
   - Anonymous users may have different rating behaviors
   - The model can adapt predictions accordingly
4. **Production-ready design**:
   - Saved in standard Keras formats (`.keras` and `.weights.h5`)
   - Efficient batch inference pipeline
   - Pre-computed embeddings for scalability
5. **Sophisticated negative sampling**:
   - Popularity-weighted (not uniform random)
   - Label smoothing with low ratings (0-2)
   - Realistic training distribution

๐Ÿ“ Project Structure

recsys/
โ”œโ”€โ”€ README.md                                    # This file
โ”œโ”€โ”€ recsys.ipynb                                 # Main training notebook
โ”œโ”€โ”€ dummy.ipynb                                  # Simple data exploration notebook
โ”‚
โ”œโ”€โ”€ GrammarandProductReviews.csv                 # Raw dataset (71K reviews)
โ”œโ”€โ”€ negative_samples.csv                         # Generated negative samples
โ”œโ”€โ”€ final_df.csv                                 # Combined dataset (284K samples)
โ”‚
โ”œโ”€โ”€ two_tower_recsys_v1.keras                   # Model v1
โ”œโ”€โ”€ two_tower_recsys_v2.keras                   # Model v2
โ”œโ”€โ”€ two_tower_recsys_v3.keras                   # Model v3
โ”œโ”€โ”€ two_tower_recsys_v4.keras                   # Model v4
โ”œโ”€โ”€ two_tower_recsys_v5_power.keras             # Model v5 (latest)
โ”‚
โ”œโ”€โ”€ my_two_tower_model.weights.h5               # Weights v1
โ”œโ”€โ”€ my_two_tower_model_v2.weights.h5            # Weights v2
โ”œโ”€โ”€ my_two_tower_model_v3.weights.h5            # Weights v3
โ”œโ”€โ”€ my_two_tower_model_v4.weights.h5            # Weights v4
โ””โ”€โ”€ my_two_tower_model_v5_power.weights.h5      # Weights v5 (latest)

## Future Improvements

Potential enhancements for the system:

- **Multi-task learning**: predict both ratings and purchase probability
- **Attention mechanisms**: add attention layers to weight embedding importance
- **Contextual features**: incorporate time-of-day, seasonality, device type
- **Online learning**: support incremental updates with new user interactions
- **A/B testing framework**: compare model versions in production
- **Explainability**: add SHAP or attention visualization for recommendation explanations
- **Diversity metrics**: optimize for recommendation diversity alongside accuracy
- **Real-time serving**: deploy with TensorFlow Serving
- **Cross-validation**: implement temporal cross-validation for robust evaluation
- **Hyperparameter tuning**: automated search with Optuna or Ray Tune

## License

This project is available for educational and research purposes.


## Author

Akshat
Data Scientist | Machine Learning Engineer


## Acknowledgments

- **Dataset**: Amazon Product Reviews (Grammar and Product Reviews subset)
- **Libraries**: TensorFlow, Keras, Sentence-Transformers (Hugging Face), Implicit
- **Inspiration**: the Two-Tower architecture from Google Research and YouTube recommendations


*Built with ❤️ using TensorFlow, Keras, and BERT*
