Component
data-management/viewer/ (frontend + backend)
Problem Statement
The dataviewer's episode viewer currently renders a single camera stream per episode. Bimanual VLA policies (TwinVLA, π₀, RDT-1B) require up to three camera views during training and inference:
- Front/ego-centric camera — workspace overview
- Left wrist camera — left arm end-effector view
- Right wrist camera — right arm end-effector view
RoboTwin 2.0 datasets include front_image, wrist_image_left, and wrist_image_right per episode step. LeRobot datasets may include additional camera keys. Without multi-camera visualization, operators cannot:
- Verify camera coverage and alignment across views
- Inspect occlusion or lighting issues in individual camera streams
- Correlate spatial relationships between arm-mounted and workspace cameras
- Quality-check the exact visual inputs the VLA model receives during training
Proposed Solution
Add multi-camera display support to the episode viewer:
- Auto-detect camera keys from LeRobot dataset metadata (scan observation.images.* keys)
- Grid layout — display all camera views simultaneously in a responsive grid (1×1 for single camera, 1×3 for three cameras, 2×2 for four, etc.)
- Camera selector — allow toggling individual cameras on/off via a camera panel
- Synchronized scrubbing — all camera views stay frame-synchronized when scrubbing the timeline
- Per-camera zoom — click a camera view to expand it to full width while keeping others visible as thumbnails
- Camera labels — display the LeRobot image key name as an overlay on each view
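The auto-detect and grid-layout steps above can be sketched as follows. This is a minimal illustration, not the viewer's actual code; the function names and the shape of the `info` dict passed in are assumptions based on the LeRobot info.json layout described in the Technical Notes.

```python
import math

def detect_camera_keys(info: dict) -> list[str]:
    """Return every observation.images.* feature key from a parsed info.json dict.

    Assumes camera features are listed under info["features"], as noted below
    for LeRobot datasets.
    """
    prefix = "observation.images."
    return sorted(k for k in info.get("features", {}) if k.startswith(prefix))

def grid_shape(n_cameras: int) -> tuple[int, int]:
    """Pick a (rows, cols) grid: 1x1 for one view, 1x3 for three, 2x2 for four."""
    if n_cameras <= 3:
        return (1, max(n_cameras, 1))
    cols = math.ceil(math.sqrt(n_cameras))
    return (math.ceil(n_cameras / cols), cols)

# Example: a RoboTwin 2.0-style feature set with three cameras plus state.
info = {
    "features": {
        "observation.images.front_image": {},
        "observation.images.wrist_image_left": {},
        "observation.images.wrist_image_right": {},
        "observation.state": {},
    }
}
keys = detect_camera_keys(info)  # three observation.images.* keys, state excluded
rows, cols = grid_shape(len(keys))  # (1, 3) for three cameras
```

The grid heuristic mirrors the layout rule in the bullet above (1×1, 1×3, 2×2); the frontend would map the resulting shape onto CSS grid tracks.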
Technical Notes
- LeRobot v3.0 stores images as MP4 videos under videos/{camera_key}_episode_{idx}.mp4
- Camera keys are defined in the dataset's info.json under features.observation.images
- The backend already streams video frames; the change is primarily frontend layout and state management
- Consider using CSS grid with auto-fit for responsive layout
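Given the storage layout in the first note, the backend can resolve one MP4 per camera for a given episode. A minimal sketch, assuming a dataset root directory and the videos/{camera_key}_episode_{idx}.mp4 template quoted above (the helper name and `root` parameter are hypothetical):

```python
from pathlib import Path

def episode_video_paths(
    root: Path, camera_keys: list[str], episode_idx: int
) -> dict[str, Path]:
    """Map each camera key to its per-episode MP4 under videos/.

    Follows the videos/{camera_key}_episode_{idx}.mp4 layout noted above
    for LeRobot v3.0 datasets.
    """
    return {
        key: root / "videos" / f"{key}_episode_{episode_idx}.mp4"
        for key in camera_keys
    }

paths = episode_video_paths(
    Path("data"), ["front_image", "wrist_image_left", "wrist_image_right"], 12
)
# e.g. paths["front_image"] is data/videos/front_image_episode_12.mp4
```

Since the backend already streams frames, the frontend only needs one stream handle per detected camera key and a shared frame index so scrubbing stays synchronized across views.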
Acceptance Criteria
Context
- training/vla/ (branch feat/vla-twinvla-robotwin)
- evaluation/sil/bimanual_robot_types.py