feat(dataviewer): VLM-as-judge automated evaluation for VLA episode scoring #438

@akzaidi

Description

Component

data-management/viewer/ (backend + frontend) + evaluation/

Problem Statement

Evaluating VLA policy quality currently requires manual human review or running the full policy in simulation. There is no automated way to:

  1. Score task completion from episode video — did the robot achieve the instructed goal?
  2. Assess trajectory quality — was the execution smooth, efficient, and safe?
  3. Generate dense reward signals for VLA-RL fine-tuning (PPO + RPRM)
  4. Decompose failures into subtask-level scores for credit assignment
  5. Scale annotation beyond what human reviewers can handle

VLM-as-judge enables automated evaluation by using a vision-language model to score episodes from video observations, analogous to process reward models (PRMs) in LLM reasoning chains.

Proposed Solution

Implement a VLM-as-judge evaluation pipeline with two integration points:

1. Dataviewer Integration (Auto-Analysis Extension)

Extend the existing auto-analysis system (AutoQualityAnalysis model) to support VLM-based scoring:

  • VLM judge endpoint — new backend service that sends episode keyframes + language instruction to a VLM and receives structured scoring
  • Supported VLMs — Azure OpenAI (GPT-4.1), Azure AI Foundry (Phi-4-Multimodal), local models via Ollama
  • Scoring schema — task completion (0-1), subtask progress (ordered list with per-subtask scores), trajectory quality (smoothness, efficiency, safety), failure mode classification
  • UI integration — display VLM judge scores alongside human annotations for comparison / calibration
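A minimal sketch of what the scoring schema and structured-output parsing could look like, using stdlib dataclasses for illustration (field names mirror the schema above; the `parse_judge_response` helper and range validation are assumptions, not existing code):

```python
import json
from dataclasses import dataclass
from typing import Optional


def _unit(value: float, name: str) -> float:
    """Validate that a score lies in [0, 1]."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be in [0, 1], got {value}")
    return value


@dataclass
class SubtaskScore:
    name: str
    score: float


@dataclass
class TrajectoryQuality:
    smoothness: float
    efficiency: float
    safety: float


@dataclass
class VLMJudgeScore:
    task_completion: float
    subtask_progress: list  # ordered list of SubtaskScore
    trajectory_quality: TrajectoryQuality
    failure_mode: Optional[str] = None


def parse_judge_response(raw: str) -> VLMJudgeScore:
    """Parse and validate the judge's JSON reply into the scoring schema."""
    data = json.loads(raw)
    return VLMJudgeScore(
        task_completion=_unit(data["task_completion"], "task_completion"),
        subtask_progress=[
            SubtaskScore(s["name"], _unit(s["score"], "score"))
            for s in data["subtask_progress"]
        ],
        trajectory_quality=TrajectoryQuality(**data["trajectory_quality"]),
        failure_mode=data.get("failure_mode"),
    )
```

In practice the same schema could be handed to the VLM backends as a JSON-schema constraint (structured outputs), so all three providers return the same shape.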

2. Evaluation Pipeline Integration (Reward Generation)

Create a standalone evaluation module for VLA-RL training reward generation:

  • Batch scoring — score entire datasets of rollout episodes for RL training
  • RPRM pseudo-labels — generate dense per-subtask reward signals from VLM scores
  • Self-improving judges — iterative self-training protocol (majority voting, confidence filtering) following Lin et al. (2025) to improve judge quality without human labels
  • Output format — reward labels in LeRobot-compatible format for VLA-RL fine-tuning
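The majority-voting and confidence-filtering step for pseudo-labels could be as simple as the following sketch (the `aggregate_judge_samples` function, its thresholds, and the dict shapes are assumptions for illustration; the actual protocol should follow Lin et al. (2025)):

```python
from collections import Counter


def aggregate_judge_samples(samples, completion_threshold=0.5, min_agreement=0.8):
    """Majority-vote over repeated judge samples for one episode.

    samples: list of parsed judge scores, e.g. [{"task_completion": 0.9}, ...]
    Returns a pseudo-label dict, or None when the samples disagree too much
    (low-confidence episodes are filtered out of the self-training set).
    """
    # Binarize each sample's completion score, then take the majority vote.
    votes = [s["task_completion"] >= completion_threshold for s in samples]
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    if agreement < min_agreement:
        return None  # confidence filtering: drop ambiguous episodes
    return {"label": float(label), "agreement": agreement}
```

Running the judge k times at nonzero temperature and keeping only high-agreement episodes is the standard way to get pseudo-labels clean enough for iterative self-training without human annotation.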

Technical Design

Episode Video + Language Instruction
         │
         ▼
┌─────────────────────┐
│ Keyframe Extraction  │  Sample N frames at task-relevant intervals
│ (front + wrist cams) │
└────────┬────────────┘
         ▼
┌─────────────────────┐
│ VLM Judge Prompt     │  "Given the instruction '{instruction}' and these
│                      │   frames, score: task_completion (0-1),
│                      │   subtask_progress [{name, score}...],
│                      │   failure_mode (if any)"
└────────┬────────────┘
         ▼
┌─────────────────────┐
│ Structured Output    │  JSON schema validation
│ Parser               │
└────────┬────────────┘
         ▼
  ┌──────┴──────┐
  │             │
Dataviewer   Reward
Annotation   Labels
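The keyframe-extraction stage could look roughly like this (a sketch; the function name, signature, and where transition frames come from are assumptions — the design above only requires evenly spaced samples plus task-transition frames):

```python
def select_keyframes(num_frames, n_samples=8, transition_frames=()):
    """Pick n_samples evenly spaced frame indices across the episode,
    then merge in any task-transition frames (e.g. grasp/release events).

    Returns a sorted, de-duplicated list of frame indices.
    """
    if num_frames <= n_samples:
        return list(range(num_frames))
    # Evenly spaced indices from the first frame to the last, inclusive.
    step = (num_frames - 1) / (n_samples - 1)
    evenly = [round(i * step) for i in range(n_samples)]
    extra = {f for f in transition_frames if 0 <= f < num_frames}
    return sorted(set(evenly) | extra)
```

The selected indices would be pulled from both the front and wrist camera streams and sent to the judge together with the language instruction.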

Acceptance Criteria

  • VLM judge service supports Azure OpenAI and at least one local model backend
  • Scoring prompt produces structured JSON with task completion, subtask progress, and failure mode
  • Keyframe extraction selects N evenly-spaced frames plus task-transition frames (configurable)
  • Dataviewer displays VLM judge scores alongside human annotations
  • Batch scoring endpoint processes a full dataset and outputs reward labels
  • VLM judge scores correlate with human annotation consensus at >0.7 agreement
  • Output reward labels compatible with LeRobot dataset format for VLA-RL training
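The calibration criterion could be checked with a simple binarized agreement rate between judge scores and human consensus (a sketch; the acceptance criterion does not fix the exact metric, so the threshold-based binarization here is an assumption):

```python
def agreement_rate(judge_scores, human_scores, threshold=0.5):
    """Fraction of episodes where the binarized VLM judge score and the
    human consensus score agree (both above or both below the threshold)."""
    matches = sum(
        (j >= threshold) == (h >= threshold)
        for j, h in zip(judge_scores, human_scores)
    )
    return matches / len(judge_scores)
```

A held-out set of human-annotated episodes in the dataviewer would give the denominator; the >0.7 bar then gates whether the judge's labels are trusted for reward generation.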

References

  • Self-Improving VLM Judges: Lin et al., 2025 (arXiv:2512.05145)
  • Process Reward Models for VLAs: VLA presentation slides on RPRM and dense reward
  • ROVE evaluation module: feat(evaluation): add ROVE task-level domain evaluation module #102
  • Existing auto-analysis: data-management/viewer/backend/src/api/models/annotations.py (AutoQualityAnalysis)
