Evaluating VLA policy quality currently requires manual human review or running the full policy in simulation. There is no automated way to:
Score task completion from episode video — did the robot achieve the instructed goal?
Assess trajectory quality — was the execution smooth, efficient, and safe?
Generate dense reward signals for VLA-RL fine-tuning (PPO + RPRM)
Decompose failures into subtask-level scores for credit assignment
Scale annotation beyond what human reviewers can handle
VLM-as-judge enables automated evaluation by using a vision-language model to score episodes from video observations, analogous to process reward models (PRMs) in LLM reasoning chains.
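As a minimal sketch of the judging step (the prompt wording, score fields, and the injected `vlm_query` client are all assumptions, not an existing API in this repo), an episode-level judge can be a single structured-output call:

```python
# Hypothetical VLM-as-judge scoring call: the VLM backend is injected as
# `vlm_query(prompt, frames) -> str` and is expected to return JSON.
import json
from dataclasses import dataclass

@dataclass
class EpisodeScore:
    task_success: float        # 0-1: did the robot achieve the instructed goal?
    trajectory_quality: float  # 0-1: smoothness / efficiency / safety
    rationale: str             # free-text justification from the judge

JUDGE_PROMPT = (
    "You are evaluating a robot episode against the instruction: {instruction}\n"
    "Given the sampled video frames, return JSON with keys "
    "'task_success' (0-1), 'trajectory_quality' (0-1), and 'rationale'."
)

def score_episode(frames, instruction, vlm_query):
    """Score one episode from its video frames using an injected VLM client."""
    raw = vlm_query(JUDGE_PROMPT.format(instruction=instruction), frames)
    parsed = json.loads(raw)
    return EpisodeScore(
        task_success=float(parsed["task_success"]),
        trajectory_quality=float(parsed["trajectory_quality"]),
        rationale=parsed["rationale"],
    )
```

Injecting the client keeps the judge logic independent of whichever VLM backend is ultimately chosen.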
Proposed Solution
Implement a VLM-as-judge evaluation pipeline with two integration points:
Self-improving judges — iterative self-training protocol (majority voting, confidence filtering) following Lin et al. (2025) to improve judge quality without human labels
Output format — reward labels in LeRobot-compatible format for VLA-RL fine-tuning
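The label-selection step of the self-training protocol above can be sketched as follows (the agreement threshold and sample counts are illustrative assumptions, not values from Lin et al.):

```python
# Majority voting with confidence filtering: sample the judge several times
# per episode, keep only episodes where a clear majority agrees, and use the
# majority label as a pseudo-label for the next judge fine-tuning round.
from collections import Counter

def select_pseudo_labels(judge_samples, min_agreement=0.8):
    """judge_samples: {episode_id: [label, label, ...]} from repeated judging.

    Returns {episode_id: majority_label} for episodes whose majority vote
    reaches `min_agreement`; low-confidence episodes are dropped.
    """
    pseudo_labels = {}
    for episode_id, labels in judge_samples.items():
        (label, count), = Counter(labels).most_common(1)
        if count / len(labels) >= min_agreement:
            pseudo_labels[episode_id] = label
    return pseudo_labels
```

Dropping low-agreement episodes trades label coverage for label precision, which is what lets the judge improve without human annotations.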
Component: data-management/viewer/ (backend + frontend) + evaluation/
1. Dataviewer Integration (Auto-Analysis Extension)
Extend the existing auto-analysis system (AutoQualityAnalysis model) to support VLM-based scoring.

2. Evaluation Pipeline Integration (Reward Generation)
Create a standalone evaluation module for VLA-RL training reward generation:
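One way the reward-generation output could look (the per-frame `next.reward` column name follows LeRobot's convention; the subtask segmentation input is a hypothetical format, not an existing schema in this repo):

```python
# Expand subtask-level judge scores into a dense per-frame reward signal,
# shaped like one LeRobot episode column for VLA-RL fine-tuning.
def subtask_scores_to_frame_rewards(num_frames, subtasks):
    """subtasks: list of (start_frame, end_frame_exclusive, score) segments.

    Frames inside a scored segment receive that segment's score; uncovered
    frames receive 0.0. Returns a dict keyed by the reward column name.
    """
    rewards = [0.0] * num_frames
    for start, end, score in subtasks:
        for t in range(start, min(end, num_frames)):
            rewards[t] = score
    return {"next.reward": rewards}
```

Keeping subtask boundaries in the input preserves the credit-assignment granularity the judge produced, rather than flattening each episode to a single scalar.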
Technical Design
Acceptance Criteria
References
arXiv:2512.05145
data-management/viewer/backend/src/api/models/annotations.py (AutoQualityAnalysis)

Context
data-management/viewer/backend/src/api/models/annotations.py (LanguageInstructionAnnotation, branch feat/vla-twinvla-robotwin)
training/vla/evaluation/sil/bimanual_robot_types.py