Issue Description
Background
The evaluation/ domain has 16+ Python source files across sil/, sil/scripts/, and metrics/ with zero tests — evaluation/tests/ contains only an empty __init__.py. There is no pytest configuration, no coverage configuration, and no test dependencies in evaluation/pyproject.toml.
This domain includes core evaluation logic (policy evaluation, checkpoint monitoring, robot types, plotting, MLflow bootstrapping) that directly affects training quality feedback loops.
Source File Inventory
evaluation/sil/ (7 files):
monitor_checkpoints.py — checkpoint monitoring
play.py — policy playback
play_policy.py — policy execution
policy_evaluation.py — evaluation orchestration
policy_runner.py — policy runner
robot_types.py — robot type definitions (most testable — pure data/logic)
evaluation/sil/scripts/ (10 files):
batch-lerobot-eval.py, download_aml_model.py, download_blob_dataset.py
run-local-lerobot-eval.py, run_evaluation.py, test-lerobot-eval.py
submit-azureml-lerobot-eval.sh, submit-azureml-validation.sh
submit-osmo-eval.sh, submit-osmo-lerobot-eval.sh
evaluation/metrics/ (4 files):
bootstrap_mlflow.py — MLflow initialization
plot-lerobot-trajectories.py — trajectory visualization
plotting.py — general plotting utilities
upload_artifacts.py — artifact upload
Current evaluation/pyproject.toml State
- No `[tool.pytest.ini_options]` section
- No `[tool.coverage.run]` section
- No test dependencies (pytest, pytest-cov, etc.)
- `requires-python = ">=3.11,<3.12"` (upper bound constraint)
- `[tool.uv] package = false`
Suggested Fix
Phase 1: Test Infrastructure
- Add test dependencies to `evaluation/pyproject.toml`:

```toml
[dependency-groups]
dev = [
    "pytest>=8.0",
    "pytest-cov>=6.0",
    "pytest-asyncio>=1.3.0",
]
```
- Add pytest configuration:

```toml
[tool.pytest.ini_options]
testpaths = ["tests"]
asyncio_mode = "auto"
python_files = ["test_*.py"]
addopts = ["-ra", "--strict-markers", "--strict-config"]
```
- Create an initial `conftest.py` in `evaluation/tests/`
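A minimal `conftest.py` could start from a sketch like the one below. The fixture and helper names (`make_stub_module`, `fake_mlflow`) are illustrative assumptions, not names taken from the repository; the only grounded requirements are that heavy SDKs must be mockable and that `monkeypatch` is preferred.

```python
# evaluation/tests/conftest.py -- a minimal sketch; fixture and helper
# names are illustrative, not taken from the repository.
from __future__ import annotations

import sys
import types

import pytest


def make_stub_module(name: str, **attrs) -> types.ModuleType:
    """Build an in-memory module exposing only the given attributes."""
    mod = types.ModuleType(name)
    for key, value in attrs.items():
        setattr(mod, key, value)
    return mod


@pytest.fixture
def fake_mlflow(monkeypatch) -> types.ModuleType:
    """Register a stub mlflow module so code under test can import it
    without the real SDK being installed in CI."""
    stub = make_stub_module("mlflow", set_tracking_uri=lambda uri: None)
    monkeypatch.setitem(sys.modules, "mlflow", stub)
    return stub
```

Because `monkeypatch.setitem` is used rather than assigning to `sys.modules` directly, the stub is removed automatically when each test finishes.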
Phase 2: Initial Test Suite (Priority by Testability)
High testability (pure logic, minimal external deps):
robot_types.py — data classes, type definitions
plotting.py — plotting utilities (mock matplotlib)
bootstrap_mlflow.py — MLflow config logic (mock mlflow SDK)
upload_artifacts.py — artifact management (mock Azure SDK)
Medium testability (requires GPU/sim mocking):
policy_evaluation.py — evaluation orchestration
policy_runner.py — policy execution
monitor_checkpoints.py — filesystem monitoring
Lower testability (submission scripts, mostly CLI wrappers):
- Scripts in `sil/scripts/` — shell command composition
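As an illustration of the class-based test grouping and `monkeypatch` conventions noted below, a first test for `bootstrap_mlflow.py` might look like the following. `resolve_tracking_uri` and the `MLFLOW_TRACKING_URI` fallback logic are hypothetical stand-ins, since the module's real API is not shown in this issue:

```python
from __future__ import annotations

import os


# Hypothetical stand-in for a function in bootstrap_mlflow.py;
# the real module's API may differ.
def resolve_tracking_uri(default: str = "file:./mlruns") -> str:
    """Return the tracking URI from the environment, else a default."""
    return os.environ.get("MLFLOW_TRACKING_URI", default)


class TestResolveTrackingUri:
    def test_env_var_wins(self, monkeypatch):
        monkeypatch.setenv("MLFLOW_TRACKING_URI", "http://mlflow.internal:5000")
        assert resolve_tracking_uri() == "http://mlflow.internal:5000"

    def test_falls_back_to_default(self, monkeypatch):
        monkeypatch.delenv("MLFLOW_TRACKING_URI", raising=False)
        assert resolve_tracking_uri() == "file:./mlruns"
```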
Phase 3: CI Integration
Either:
- Option A: Add `evaluation/tests` to the root `pyproject.toml` `testpaths` (simplest)
- Option B: Create a dedicated `.github/workflows/evaluation-pytests.yml` workflow (mirrors the `dataviewer-backend-pytests.yml` pattern)
If Option B, add a new Codecov flag:

```yaml
- name: pytest-evaluation
  paths: ["evaluation/"]
  statuses:
    - type: patch
      informational: true
```
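If Option B is chosen, the workflow could start from a sketch like this. The trigger paths, uv usage, and step layout are assumptions modeled loosely on the `dataviewer-backend-pytests.yml` pattern mentioned above, not a copy of that file:

```yaml
# .github/workflows/evaluation-pytests.yml (sketch; steps are assumptions)
name: evaluation-pytests
on:
  pull_request:
    paths: ["evaluation/**"]
jobs:
  pytest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"  # matches requires-python = ">=3.11,<3.12"
      - run: pip install uv && uv sync
        working-directory: evaluation
      - run: uv run pytest --cov=. --cov-report=xml
        working-directory: evaluation
```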
Acceptance Criteria
- `evaluation/pyproject.toml` includes test dependencies and pytest configuration
- `evaluation/tests/conftest.py` exists with shared fixtures
- Initial tests cover `robot_types.py`, `plotting.py`, and `bootstrap_mlflow.py`
- `evaluation/` appears in Codecov reports
Implementation Notes
- Heavy deps (torch, onnxruntime-gpu, lerobot) must be mocked — these are not available in CI
- `requires-python = ">=3.11,<3.12"` means the CI runner must use Python 3.11.x
- Follow project conventions: `from __future__ import annotations`, `_LOGGER` naming, class-based test grouping
- Use `monkeypatch` over `unittest.mock.patch` per project conventions
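One common way to satisfy the "heavy deps must be mocked" note is to register stub modules in `sys.modules` before the code under test is imported. This is an assumed approach, not an existing helper in the repository:

```python
# Sketch: register lightweight stand-ins for heavy packages before
# importing the modules under test (an assumed approach, not an
# existing project helper).
from __future__ import annotations

import sys
import types

for heavy in ("torch", "onnxruntime", "lerobot"):
    # setdefault keeps the real package if it happens to be installed
    sys.modules.setdefault(heavy, types.ModuleType(heavy))

# With the stubs registered, a top-level `import torch` inside, say,
# policy_runner.py no longer fails at import time; individual tests
# then attach just the attributes they need to the stub.
```

Placing this at the top of `conftest.py` ensures the stubs exist before pytest collects any test module that imports the code under test.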
Related Issues
OpenSSF IDs: regression_tests_added50