A Stanford CS230 Deep Learning project that estimates nutritional information (calories, mass, fat, carbs, protein) from multi-view food images using the Nutrition5k dataset.
DeepDiet explores how different deep learning architectures handle the task of nutritional estimation from food imagery. We implement and compare three approaches:
- Multi-modal CNN-LSTM - A multi-branch fusion architecture combining EfficientNet-B0 encoders with BiLSTM temporal aggregation for side-view video frames
- Cross-Swin-CLS - Feature Pyramid Network with Swin Transformer backbone and cross-attention decoder
- ConvNeXt - Modernized ConvNet with differential learning rates for efficient transfer learning
The models process temporal sequences of 16 rotating side-view frames (from 4 cameras), combined with overhead RGB and depth images, for comprehensive food volume and nutrient estimation.
The primary Multi-modal CNN-LSTM model uses a three-branch encoder-fusion architecture:
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Side Frames │ │ Overhead RGB │ │ Overhead Depth │
│ [B, 16, 3, H, W]│ │ [B, 3, H, W] │ │ [B, 1, H, W] │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ EfficientNet-B0│ │ EfficientNet-B0│ │ EfficientNet-B0│
│ Encoder │ │ Encoder │ │ (1-ch input) │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
│ │ │
▼ │ │
┌─────────────────┐ │ │
│ BiLSTM/Attention│ │ │
│ Aggregation │ │ │
└────────┬────────┘ │ │
│ │ │
└───────────────┬───────┴───────────────────────┘
▼
┌─────────────────┐
│ Feature Fusion │
│ (3840 → 1024) │
└────────┬────────┘
│
┌──────────────┼──────────────┐
▼ ▼ ▼
┌─────────┐ ┌──────────┐ ┌──────────┐
│ Mass │ │ Calories │ │ Macros │
│ Head │ │ Head │ │ Head │
└─────────┘ └──────────┘ └──────────┘
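To make the diagram concrete, below is a minimal PyTorch sketch of the three-branch encoder-fusion idea. It is illustrative only: the class name, temporal pooling choice, and head sizes are assumptions, not the project's actual `src/model.py`.

```python
# Hedged sketch of the three-branch fusion architecture (illustrative,
# not the project's src/model.py).
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0


def make_encoder(in_channels: int = 3) -> nn.Module:
    """EfficientNet-B0 backbone returning a pooled 1280-d feature vector."""
    net = efficientnet_b0(weights=None)  # use pretrained weights in practice
    if in_channels != 3:
        # Swap the stem conv so the depth branch accepts a single channel.
        stem = net.features[0][0]
        net.features[0][0] = nn.Conv2d(
            in_channels, stem.out_channels, kernel_size=stem.kernel_size,
            stride=stem.stride, padding=stem.padding, bias=False)
    net.classifier = nn.Identity()  # keep the 1280-d pooled features
    return net


class FusionSketch(nn.Module):
    def __init__(self, feat_dim: int = 1280, fused_dim: int = 1024):
        super().__init__()
        self.side_enc = make_encoder(3)
        self.rgb_enc = make_encoder(3)
        self.depth_enc = make_encoder(1)
        # BiLSTM over the 16 per-frame embeddings; 2 * 640 = 1280 out.
        self.temporal = nn.LSTM(feat_dim, feat_dim // 2,
                                batch_first=True, bidirectional=True)
        self.fusion = nn.Sequential(
            nn.Linear(3 * feat_dim, fused_dim), nn.ReLU(), nn.Dropout(0.3))
        self.mass_head = nn.Linear(fused_dim, 1)
        self.cal_head = nn.Linear(fused_dim, 1)
        self.macro_head = nn.Linear(fused_dim, 3)  # fat, carbs, protein

    def forward(self, side, rgb, depth):
        b, t = side.shape[:2]                       # side: [B, 16, 3, H, W]
        frames = self.side_enc(side.flatten(0, 1))  # [B*16, 1280]
        seq, _ = self.temporal(frames.view(b, t, -1))
        side_feat = seq.mean(dim=1)                 # simple temporal pooling
        fused = self.fusion(torch.cat(
            [side_feat, self.rgb_enc(rgb), self.depth_enc(depth)], dim=1))
        return {"mass": self.mass_head(fused),
                "calories": self.cal_head(fused),
                "macros": self.macro_head(fused)}
```

Note the fused width matches the diagram: three 1280-d branch features concatenate to 3840 before the 1024-d fusion layer.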
MAE comparison against the Nutrition5k paper's baselines:

| Metric | Nutrition5k (Depth 4ch) | Nutrition5k (Volume) | Ours (Fusion) |
|---|---|---|---|
| Calorie MAE (kcal) | 47.6 | 41.3 | 49.9 |
| Mass MAE (g) | 40.7 | 29.4 | 33.0 |
| Fat MAE (g) | 2.27 | 3.0 | 4.6 |
| Carb MAE (g) | 4.6 | 4.5 | 10.1 |
| Protein MAE (g) | 3.7 | 5.2 | 7.2 |
Relative error per nutritional category:

| Category | SMAPE (%) | PMAE (%) |
|---|---|---|
| Calories | 22.50 | 16.61 |
| Mass | 15.93 | 10.87 |
| Fat | 50.54 | 27.06 |
| Carbohydrates | 37.44 | 23.81 |
| Protein | 37.37 | 23.95 |
| Average | 32.76 | 20.44 |
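For reference, a minimal sketch of how these metrics are typically computed; the exact SMAPE and PMAE conventions used for the table above are an assumption.

```python
# Hedged sketch of the error metrics reported above; the project's exact
# SMAPE/PMAE definitions may differ slightly.
import numpy as np

def mae(pred: np.ndarray, true: np.ndarray) -> float:
    return float(np.mean(np.abs(pred - true)))

def smape(pred: np.ndarray, true: np.ndarray) -> float:
    """Symmetric mean absolute percentage error, in percent."""
    denom = (np.abs(pred) + np.abs(true)) / 2.0
    return float(np.mean(np.abs(pred - true) / np.maximum(denom, 1e-8)) * 100)

def pmae(pred: np.ndarray, true: np.ndarray) -> float:
    """MAE expressed as a percentage of the mean ground-truth value."""
    return float(mae(pred, true) / max(np.mean(true), 1e-8) * 100)

# Example: per-dish calorie predictions vs. labels (kcal)
preds = np.array([250.0, 480.0, 120.0])
labels = np.array([300.0, 450.0, 100.0])
print(mae(preds, labels), smape(preds, labels), pmae(preds, labels))
```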
# Clone the repository
git clone https://github.com/gernim/deepdiet.git
cd deepdiet
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

- Download the Nutrition5k dataset
- Place data in `data/nutrition5k_dataset/`
- Use the official train/test splits in `indexes/`
# Basic training with overhead RGB + depth (baseline)
python src/train.py --use-overhead --use-depth --epochs 20 --batch-size 8
# Full multi-modal training with side frames
python src/train.py --use-side-frames --use-overhead --use-depth --epochs 20
# Training with attention aggregation instead of LSTM
python src/train.py --use-side-frames --side-aggregation attention --grad-clip 0.5
# Training with frozen encoders (transfer learning)
python src/train.py --use-side-frames --freeze-encoders --unfreeze-epoch 10
# Resume from checkpoint
python src/train.py --use-side-frames --resume checkpoints/best_model.pt
# Enable Weights & Biases logging
python src/train.py --use-side-frames --wandb --wandb-project deepdiet
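The `--side-aggregation attention` flag above replaces the BiLSTM with learned attention pooling over the 16 per-frame embeddings. A minimal sketch of that pooling idea (illustrative, not the project's implementation):

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Collapse [B, T, D] frame embeddings to [B, D] with learned weights."""
    def __init__(self, dim: int = 1280):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.score(x), dim=1)   # [B, T, 1]
        return (weights * x).sum(dim=1)                 # [B, D]

# e.g. pool = AttentionPool(); pooled = pool(torch.randn(2, 16, 1280))
```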
# Default configuration
python src/train_hydra.py
# Override parameters
python src/train_hydra.py model.side_aggregation=attention training.lr=5e-5

# Monitor training with TensorBoard
tensorboard --logdir runs/deepdiet/
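A minimal sketch of what a Hydra entry point like `src/train_hydra.py` might look like; the config group fields and the `config` name are assumptions based on the `configs/` layout below, not the project's actual schema.

```python
# Hedged sketch of a Hydra entry point; config_name and field names are
# assumptions inferred from the configs/ directory layout.
import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(version_base=None, config_path="../configs", config_name="config")
def main(cfg: DictConfig) -> None:
    print(OmegaConf.to_yaml(cfg))  # resolved config after CLI overrides
    # cfg.model.side_aggregation, cfg.training.lr, etc. would drive model and
    # optimizer setup here, mirroring the CLI flags of src/train.py.


if __name__ == "__main__":
    main()
```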
├── configs/ # Hydra configuration files
│ ├── model/ # Model configurations
│ ├── training/ # Training hyperparameters
│ ├── data/ # Data loading settings
│ └── logging/ # Logging configurations
├── data/
│ └── nutrition5k_dataset/ # Dataset location
├── docs/
│ └── DeepDiet___Project_Final-6.pdf # Project report
├── indexes/ # Train/test split CSV files
├── runs/ # TensorBoard logs
├── src/
│ ├── model.py # DeepDietModel architecture
│ ├── dataset.py # MultiViewDataset data loader
│ ├── train.py # Main training script
│ ├── train_hydra.py # Hydra-based training
│ ├── config.py # TrainingConfig dataclass
│ ├── transforms.py # Data augmentation
│ ├── metrics.py # Metric tracking utilities
│ └── training/
│ └── epoch.py # Training/validation loops
└── requirements.txt
- The overhead RGB + depth baseline achieves competitive performance with a simpler architecture
- Side-angle frames introduce an overfitting risk: the model memorizes frame patterns rather than learning generalizable food features
- Cross-Swin-CLS demonstrates the best generalization (29% train-test gap)
- Differential learning rates significantly reduce overfitting in ConvNeXt models
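A minimal sketch of the two regularization levers mentioned above: differential learning rates for the pretrained backbone versus the new regression head, and temporarily freezing the backbone (roughly what `--freeze-encoders` / `--unfreeze-epoch` control). The parameter-group boundaries and learning-rate values are illustrative assumptions, not the project's exact settings.

```python
# Hedged sketch: differential learning rates + delayed backbone unfreezing.
import torch
from torchvision.models import convnext_tiny

model = convnext_tiny(weights=None)  # use pretrained weights in practice
# Replace the classifier with a 5-way regression head (cal, mass, fat, carb, protein).
model.classifier[2] = torch.nn.Linear(model.classifier[2].in_features, 5)

backbone_params = [p for n, p in model.named_parameters()
                   if not n.startswith("classifier")]
head_params = [p for n, p in model.named_parameters()
               if n.startswith("classifier")]

# Pretrained backbone gets a much smaller learning rate than the new head.
optimizer = torch.optim.AdamW([
    {"params": backbone_params, "lr": 1e-5},
    {"params": head_params, "lr": 1e-4},
], weight_decay=1e-4)

# Optional freeze/unfreeze schedule for transfer learning.
for p in backbone_params:
    p.requires_grad = False        # frozen during the first epochs
# ...later, at the chosen unfreeze epoch:
# for p in backbone_params: p.requires_grad = True
```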
If you use this code in your research, please cite:
@misc{deepdiet2025,
title={DeepDiet: Multimodal Deep Learning for Nutritional Content Estimation},
author={Rawat, Mini and Gernitis, Mark and Sharma, Neetish},
year={2025},
institution={Stanford University, CS230}
}

- Nutrition5k Dataset - Thames et al., 2021
- EfficientNet - Tan & Le, 2019
- Swin Transformer - Liu et al., 2021
- ConvNeXt - Liu et al., 2022
This project is for educational purposes as part of Stanford CS230.
- Mini Rawat - minir07@stanford.edu
- Mark Gernitis - gernitis@stanford.edu
- Neetish Sharma - neetishs@stanford.edu
