
test(evaluation): add unit test infrastructure and initial test suite #440

Issue Description

Background

The evaluation/ domain has 16+ Python source files across sil/, sil/scripts/, and metrics/ with zero tests; evaluation/tests/ contains only an empty __init__.py. There is no pytest configuration, no coverage configuration, and no test dependencies in evaluation/pyproject.toml.

This domain includes core evaluation logic (policy evaluation, checkpoint monitoring, robot types, plotting, MLflow bootstrapping) that directly affects training quality feedback loops.

Source File Inventory

evaluation/sil/ (7 files):

  • monitor_checkpoints.py — checkpoint monitoring
  • play.py — policy playback
  • play_policy.py — policy execution
  • policy_evaluation.py — evaluation orchestration
  • policy_runner.py — policy runner
  • robot_types.py — robot type definitions (most testable — pure data/logic)

evaluation/sil/scripts/ (10 files):

  • batch-lerobot-eval.py, download_aml_model.py, download_blob_dataset.py
  • run-local-lerobot-eval.py, run_evaluation.py, test-lerobot-eval.py
  • submit-azureml-lerobot-eval.sh, submit-azureml-validation.sh
  • submit-osmo-eval.sh, submit-osmo-lerobot-eval.sh

evaluation/metrics/ (4 files):

  • bootstrap_mlflow.py — MLflow initialization
  • plot-lerobot-trajectories.py — trajectory visualization
  • plotting.py — general plotting utilities
  • upload_artifacts.py — artifact upload

Current evaluation/pyproject.toml State

  • No [tool.pytest.ini_options] section
  • No [tool.coverage.run] section
  • No test dependencies (pytest, pytest-cov, etc.)
  • requires-python = ">=3.11,<3.12" (upper bound constraint)
  • [tool.uv] package = false

Suggested Fix

Phase 1: Test Infrastructure

  1. Add test dependencies to evaluation/pyproject.toml:

```toml
[dependency-groups]
dev = [
  "pytest>=8.0",
  "pytest-cov>=6.0",
  "pytest-asyncio>=1.3.0",
]
```
  2. Add pytest configuration:

```toml
[tool.pytest.ini_options]
testpaths = ["tests"]
asyncio_mode = "auto"
python_files = ["test_*.py"]
addopts = ["-ra", "--strict-markers", "--strict-config"]
```
  3. Create an initial conftest.py in evaluation/tests/
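As a rough illustration of step 3, a minimal conftest.py might look like the following. The fixture names and the list of stubbed modules are assumptions for the sketch, not existing project code:

```python
# evaluation/tests/conftest.py -- hypothetical sketch; fixture names and
# the stubbed module list are illustrative assumptions.
from __future__ import annotations

import sys
import types

import pytest


@pytest.fixture
def stub_heavy_modules(monkeypatch):
    """Install empty stand-ins for heavy dependencies so that modules
    importing them can still be loaded in CI without a GPU."""
    for name in ("torch", "onnxruntime", "lerobot"):
        monkeypatch.setitem(sys.modules, name, types.ModuleType(name))


@pytest.fixture
def checkpoint_dir(tmp_path):
    """An empty directory standing in for a checkpoint output folder."""
    d = tmp_path / "checkpoints"
    d.mkdir()
    return d
```

Using monkeypatch.setitem keeps the sys.modules mutation scoped to a single test, which matches the monkeypatch convention noted below under Implementation Notes.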

Phase 2: Initial Test Suite (Priority by Testability)

High testability (pure logic, minimal external deps):

  • robot_types.py — data classes, type definitions
  • plotting.py — plotting utilities (mock matplotlib)
  • bootstrap_mlflow.py — MLflow config logic (mock mlflow SDK)
  • upload_artifacts.py — artifact management (mock Azure SDK)
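To make the testing style concrete: the internals of bootstrap_mlflow.py are not shown here, so the function below (resolve_tracking_uri) is a stand-in illustrating how environment-driven config logic of that kind could be tested with monkeypatch, without touching a live MLflow server:

```python
# Hypothetical sketch of the monkeypatch-based style these tests could use.
# resolve_tracking_uri is a stand-in, NOT the real bootstrap_mlflow API.
from __future__ import annotations

import os


def resolve_tracking_uri(default: str = "file:./mlruns") -> str:
    """Stand-in for the kind of config logic bootstrap_mlflow might hold:
    prefer the environment variable, fall back to a local file store."""
    return os.environ.get("MLFLOW_TRACKING_URI", default)


class TestResolveTrackingUri:
    def test_env_var_wins(self, monkeypatch):
        monkeypatch.setenv("MLFLOW_TRACKING_URI", "http://mlflow.local:5000")
        assert resolve_tracking_uri() == "http://mlflow.local:5000"

    def test_falls_back_to_default(self, monkeypatch):
        monkeypatch.delenv("MLFLOW_TRACKING_URI", raising=False)
        assert resolve_tracking_uri() == "file:./mlruns"
```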

Medium testability (requires GPU/sim mocking):

  • policy_evaluation.py — evaluation orchestration
  • policy_runner.py — policy execution
  • monitor_checkpoints.py — filesystem monitoring
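For the filesystem-monitoring case, pytest's built-in tmp_path fixture usually removes the need for mocking at all. The function under test here (newest_checkpoint) is an illustrative stand-in, since monitor_checkpoints.py's actual internals are not shown in this issue:

```python
# Hypothetical unit under test: monitor_checkpoints.py internals are
# unknown here, so newest_checkpoint is an illustrative stand-in showing
# how filesystem logic can be exercised with pytest's tmp_path fixture.
from __future__ import annotations

import os
import pathlib


def newest_checkpoint(directory: pathlib.Path) -> pathlib.Path | None:
    """Return the most recently modified *.ckpt file in directory, or None."""
    ckpts = sorted(directory.glob("*.ckpt"), key=lambda p: p.stat().st_mtime)
    return ckpts[-1] if ckpts else None


class TestNewestCheckpoint:
    def test_empty_dir_returns_none(self, tmp_path):
        assert newest_checkpoint(tmp_path) is None

    def test_picks_most_recent(self, tmp_path):
        old = tmp_path / "old.ckpt"
        new = tmp_path / "new.ckpt"
        old.write_text("a")
        new.write_text("b")
        # Pin mtimes explicitly instead of sleeping between writes.
        os.utime(old, (1, 1))
        os.utime(new, (2, 2))
        assert newest_checkpoint(tmp_path) == new
```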

Lower testability (submission scripts, mostly CLI wrappers):

  • Scripts in sil/scripts/ — shell command composition

Phase 3: CI Integration

Either:

  • Option A: Add evaluation/tests to the root pyproject.toml testpaths (simplest)
  • Option B: Create a dedicated .github/workflows/evaluation-pytests.yml workflow (mirroring the dataviewer-backend-pytests.yml pattern)

If Option B, add a new Codecov flag:

```yaml
- name: pytest-evaluation
  paths: ["evaluation/"]
  statuses:
    - type: patch
      informational: true
```
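If Option B is chosen, the workflow could look roughly like this. The step names, the uv invocation, and the coverage file path are assumptions for the sketch, not the existing dataviewer-backend-pytests.yml contents:

```yaml
# .github/workflows/evaluation-pytests.yml -- illustrative sketch only
name: evaluation-pytests
on:
  pull_request:
    paths: ["evaluation/**"]
jobs:
  pytest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v5
      - name: Run tests with coverage
        working-directory: evaluation
        run: uv run --python 3.11 pytest --cov=. --cov-report=xml
      - uses: codecov/codecov-action@v5
        with:
          flags: pytest-evaluation
          files: evaluation/coverage.xml
```

Pinning `--python 3.11` in the uv invocation keeps the runner consistent with the `requires-python = ">=3.11,<3.12"` constraint noted below.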

Acceptance Criteria

  • evaluation/pyproject.toml includes test dependencies and pytest configuration
  • evaluation/tests/conftest.py exists with shared fixtures
  • Unit tests exist for at least robot_types.py, plotting.py, and bootstrap_mlflow.py
  • All tests pass without GPU hardware or live Azure services
  • Tests are integrated into CI (root pytest or dedicated workflow)
  • Coverage data for evaluation/ appears in Codecov reports

Implementation Notes

  • Heavy deps (torch, onnxruntime-gpu, lerobot) must be mocked — these are not available in CI
  • requires-python = ">=3.11,<3.12" means CI runner must use Python 3.11.x
  • Follow project conventions: from __future__ import annotations, _LOGGER naming, class-based test grouping
  • Use monkeypatch over unittest.mock.patch per project conventions
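One way to satisfy the first note, stubbing a heavy dependency so that `import torch` inside evaluation code succeeds in CI, is sketched below. The helper name and stubbed attributes are illustrative; only stub what the tests actually touch:

```python
# Sketch: register a minimal fake module so importing code that depends
# on a heavy library (e.g. torch) works without the library installed.
from __future__ import annotations

import sys
import types


def install_stub(name: str) -> types.ModuleType:
    """Register a minimal fake module under `name` in sys.modules and
    return it so tests can attach just the attributes they need."""
    stub = types.ModuleType(name)
    sys.modules[name] = stub
    return stub


torch_stub = install_stub("torch")
torch_stub.cuda = types.SimpleNamespace(is_available=lambda: False)

import torch  # resolves to the stub installed above

assert torch.cuda.is_available() is False
```

In the actual suite this registration would live in a fixture (so monkeypatch can undo it per test) rather than at module scope as shown here.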

Related Issues

OpenSSF IDs: regression_tests_added50

Metadata

Labels

  • ci/cd: CI/CD pipeline and automation
  • enhancement: New feature or improvement request
  • testing: Testing-related issues
