feat(dataviewer): VLM-as-judge automated evaluation for VLA episode scoring #438

@akzaidi

Description

Component

data-management/viewer/ (backend + frontend) + evaluation/

Problem Statement

Evaluating VLA policy quality currently requires manual human review or running the full policy in simulation. There is no automated way to:

  1. Score task completion from episode video — did the robot achieve the instructed goal?
  2. Assess trajectory quality — was the execution smooth, efficient, and safe?
  3. Generate dense reward signals for VLA-RL fine-tuning (PPO + RPRM)
  4. Decompose failures into subtask-level scores for credit assignment
  5. Scale annotation beyond what human reviewers can handle

VLM-as-judge enables automated evaluation by using a vision-language model to score episodes from video observations, analogous to process reward models (PRMs) in LLM reasoning chains.

Proposed Solution

Implement a VLM-as-judge evaluation pipeline with two integration points:

1. Dataviewer Integration (Auto-Analysis Extension)

Extend the existing auto-analysis system (AutoQualityAnalysis model) to support VLM-based scoring:

  • VLM judge endpoint — new backend service that sends episode keyframes + language instruction to a VLM and receives structured scoring
  • Supported VLMs — Azure OpenAI (GPT-4.1), Azure AI Foundry (Phi-4-Multimodal), local models via Ollama
  • Scoring schema — task completion (0-1), subtask progress (ordered list with per-subtask scores), trajectory quality (smoothness, efficiency, safety), failure mode classification
  • UI integration — display VLM judge scores alongside human annotations for comparison / calibration
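A minimal sketch of what the scoring schema and structured-output parsing could look like, using stdlib dataclasses for illustration (field names mirror the schema above; the `parse_judge_response` helper and range validation are assumptions, not existing code):

```python
import json
from dataclasses import dataclass
from typing import Optional


def _unit(value: float, name: str) -> float:
    """Validate that a score lies in [0, 1]."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be in [0, 1], got {value}")
    return value


@dataclass
class SubtaskScore:
    name: str
    score: float


@dataclass
class TrajectoryQuality:
    smoothness: float
    efficiency: float
    safety: float


@dataclass
class VLMJudgeScore:
    task_completion: float
    subtask_progress: list  # ordered list of SubtaskScore
    trajectory_quality: TrajectoryQuality
    failure_mode: Optional[str] = None


def parse_judge_response(raw: str) -> VLMJudgeScore:
    """Parse and validate the judge's JSON reply into the scoring schema."""
    data = json.loads(raw)
    return VLMJudgeScore(
        task_completion=_unit(data["task_completion"], "task_completion"),
        subtask_progress=[
            SubtaskScore(s["name"], _unit(s["score"], "score"))
            for s in data["subtask_progress"]
        ],
        trajectory_quality=TrajectoryQuality(**data["trajectory_quality"]),
        failure_mode=data.get("failure_mode"),
    )
```

In practice the same schema could be handed to the VLM backends as a JSON-schema constraint (structured outputs), so all three providers return the same shape.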

2. Evaluation Pipeline Integration (Reward Generation)

Create a standalone evaluation module for VLA-RL training reward generation:

  • Batch scoring — score entire datasets of rollout episodes for RL training
  • RPRM pseudo-labels — generate dense per-subtask reward signals from VLM scores
  • Self-improving judges — iterative self-training protocol (majority voting, confidence filtering) following Lin et al. (2025) to improve judge quality without human labels
  • Output format — reward labels in LeRobot-compatible format for VLA-RL fine-tuning
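The majority-voting and confidence-filtering step for pseudo-labels could be as simple as the following sketch (the `aggregate_judge_samples` function, its thresholds, and the dict shapes are assumptions for illustration; the actual protocol should follow Lin et al. (2025)):

```python
from collections import Counter


def aggregate_judge_samples(samples, completion_threshold=0.5, min_agreement=0.8):
    """Majority-vote over repeated judge samples for one episode.

    samples: list of parsed judge scores, e.g. [{"task_completion": 0.9}, ...]
    Returns a pseudo-label dict, or None when the samples disagree too much
    (low-confidence episodes are filtered out of the self-training set).
    """
    # Binarize each sample's completion score, then take the majority vote.
    votes = [s["task_completion"] >= completion_threshold for s in samples]
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    if agreement < min_agreement:
        return None  # confidence filtering: drop ambiguous episodes
    return {"label": float(label), "agreement": agreement}
```

Running the judge k times at nonzero temperature and keeping only high-agreement episodes is the standard way to get pseudo-labels clean enough for iterative self-training without human annotation.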

Technical Design

Episode Video + Language Instruction
         │
         ▼
┌─────────────────────┐
│ Keyframe Extraction  │  Sample N frames at task-relevant intervals
│ (front + wrist cams) │
└────────┬────────────┘
         ▼
┌─────────────────────┐
│ VLM Judge Prompt     │  "Given the instruction '{instruction}' and these
│                      │   frames, score: task_completion (0-1),
│                      │   subtask_progress [{name, score}...],
│                      │   failure_mode (if any)"
└────────┬────────────┘
         ▼
┌─────────────────────┐
│ Structured Output    │  JSON schema validation
│ Parser               │
└────────┬────────────┘
         ▼
  ┌──────┴──────┐
  │             │
Dataviewer   Reward
Annotation   Labels
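The keyframe-extraction stage could look roughly like this (a sketch; the function name, signature, and where transition frames come from are assumptions — the design above only requires evenly spaced samples plus task-transition frames):

```python
def select_keyframes(num_frames, n_samples=8, transition_frames=()):
    """Pick n_samples evenly spaced frame indices across the episode,
    then merge in any task-transition frames (e.g. grasp/release events).

    Returns a sorted, de-duplicated list of frame indices.
    """
    if num_frames <= n_samples:
        return list(range(num_frames))
    # Evenly spaced indices from the first frame to the last, inclusive.
    step = (num_frames - 1) / (n_samples - 1)
    evenly = [round(i * step) for i in range(n_samples)]
    extra = {f for f in transition_frames if 0 <= f < num_frames}
    return sorted(set(evenly) | extra)
```

The selected indices would be pulled from both the front and wrist camera streams and sent to the judge together with the language instruction.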

Acceptance Criteria

  • VLM judge service supports Azure OpenAI and at least one local model backend
  • Scoring prompt produces structured JSON with task completion, subtask progress, and failure mode
  • Keyframe extraction selects N evenly-spaced frames plus task-transition frames (configurable)
  • Dataviewer displays VLM judge scores alongside human annotations
  • Batch scoring endpoint processes a full dataset and outputs reward labels
  • VLM judge scores correlate with human annotation consensus at >0.7 agreement
  • Output reward labels compatible with LeRobot dataset format for VLA-RL training
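The calibration criterion could be checked with a simple binarized agreement rate between judge scores and human consensus (a sketch; the acceptance criterion does not fix the exact metric, so the threshold-based binarization here is an assumption):

```python
def agreement_rate(judge_scores, human_scores, threshold=0.5):
    """Fraction of episodes where the binarized VLM judge score and the
    human consensus score agree (both above or both below the threshold)."""
    matches = sum(
        (j >= threshold) == (h >= threshold)
        for j, h in zip(judge_scores, human_scores)
    )
    return matches / len(judge_scores)
```

A held-out set of human-annotated episodes in the dataviewer would give the denominator; the >0.7 bar then gates whether the judge's labels are trusted for reward generation.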

References

  • Self-Improving VLM Judges: Lin et al., 2025 (arXiv:2512.05145)
  • Process Reward Models for VLAs: VLA presentation slides on RPRM and dense reward
  • ROVE evaluation module: feat(evaluation): add ROVE task-level domain evaluation module #102
  • Existing auto-analysis: data-management/viewer/backend/src/api/models/annotations.py (AutoQualityAnalysis)
