Issue Description
Background
The evaluation/ domain has 16+ Python source files across sil/, sil/scripts/, and metrics/ with zero tests — evaluation/tests/ contains only an empty __init__.py. There is no pytest configuration, no coverage configuration, and no test dependencies in evaluation/pyproject.toml.
This domain includes core evaluation logic (policy evaluation, checkpoint monitoring, robot types, plotting, MLflow bootstrapping) that directly affects training quality feedback loops.
Source File Inventory
evaluation/sil/ (7 files):
monitor_checkpoints.py — checkpoint monitoring
play.py — policy playback
play_policy.py — policy execution
policy_evaluation.py — evaluation orchestration
policy_runner.py — policy runner
robot_types.py — robot type definitions (most testable — pure data/logic)
evaluation/sil/scripts/ (10 files):
batch-lerobot-eval.py, download_aml_model.py, download_blob_dataset.py
run-local-lerobot-eval.py, run_evaluation.py, test-lerobot-eval.py
submit-azureml-lerobot-eval.sh, submit-azureml-validation.sh
submit-osmo-eval.sh, submit-osmo-lerobot-eval.sh
evaluation/metrics/ (4 files):
bootstrap_mlflow.py — MLflow initialization
plot-lerobot-trajectories.py — trajectory visualization
plotting.py — general plotting utilities
upload_artifacts.py — artifact upload
Current evaluation/pyproject.toml State
- No `[tool.pytest.ini_options]` section
- No `[tool.coverage.run]` section
- No test dependencies (pytest, pytest-cov, etc.)
- `requires-python = ">=3.11,<3.12"` (upper bound constraint)
- `[tool.uv] package = false`
Suggested Fix
Phase 1: Test Infrastructure
- Add test dependencies to `evaluation/pyproject.toml`:

```toml
[dependency-groups]
dev = [
    "pytest>=8.0",
    "pytest-cov>=6.0",
    "pytest-asyncio>=1.3.0",
]
```
- Add pytest configuration:

```toml
[tool.pytest.ini_options]
testpaths = ["tests"]
asyncio_mode = "auto"
python_files = ["test_*.py"]
addopts = ["-ra", "--strict-markers", "--strict-config"]
```
- Create an initial `conftest.py` in `evaluation/tests/`
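A minimal `conftest.py` could start from a sketch like the one below. The fixture and helper names (`make_stub_module`, `fake_mlflow`) are illustrative assumptions, not names taken from the repository; the only grounded requirements are that heavy SDKs must be mockable and that `monkeypatch` is preferred.

```python
# evaluation/tests/conftest.py -- a minimal sketch; fixture and helper
# names are illustrative, not taken from the repository.
from __future__ import annotations

import sys
import types

import pytest


def make_stub_module(name: str, **attrs) -> types.ModuleType:
    """Build an in-memory module exposing only the given attributes."""
    mod = types.ModuleType(name)
    for key, value in attrs.items():
        setattr(mod, key, value)
    return mod


@pytest.fixture
def fake_mlflow(monkeypatch) -> types.ModuleType:
    """Register a stub mlflow module so code under test can import it
    without the real SDK being installed in CI."""
    stub = make_stub_module("mlflow", set_tracking_uri=lambda uri: None)
    monkeypatch.setitem(sys.modules, "mlflow", stub)
    return stub
```

Because `monkeypatch.setitem` is used rather than assigning to `sys.modules` directly, the stub is removed automatically when each test finishes.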
Phase 2: Initial Test Suite (Priority by Testability)
High testability (pure logic, minimal external deps):
robot_types.py — data classes, type definitions
plotting.py — plotting utilities (mock matplotlib)
bootstrap_mlflow.py — MLflow config logic (mock mlflow SDK)
upload_artifacts.py — artifact management (mock Azure SDK)
Medium testability (requires GPU/sim mocking):
policy_evaluation.py — evaluation orchestration
policy_runner.py — policy execution
monitor_checkpoints.py — filesystem monitoring
Lower testability (submission scripts, mostly CLI wrappers):
- Scripts in `sil/scripts/` — shell command composition
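As an illustration of the class-based test grouping and `monkeypatch` conventions noted below, a first test for `bootstrap_mlflow.py` might look like the following. `resolve_tracking_uri` and the `MLFLOW_TRACKING_URI` fallback logic are hypothetical stand-ins, since the module's real API is not shown in this issue:

```python
from __future__ import annotations

import os


# Hypothetical stand-in for a function in bootstrap_mlflow.py;
# the real module's API may differ.
def resolve_tracking_uri(default: str = "file:./mlruns") -> str:
    """Return the tracking URI from the environment, else a default."""
    return os.environ.get("MLFLOW_TRACKING_URI", default)


class TestResolveTrackingUri:
    def test_env_var_wins(self, monkeypatch):
        monkeypatch.setenv("MLFLOW_TRACKING_URI", "http://mlflow.internal:5000")
        assert resolve_tracking_uri() == "http://mlflow.internal:5000"

    def test_falls_back_to_default(self, monkeypatch):
        monkeypatch.delenv("MLFLOW_TRACKING_URI", raising=False)
        assert resolve_tracking_uri() == "file:./mlruns"
```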
Phase 3: CI Integration
Either:
- Option A: Add `evaluation/tests` to the root `pyproject.toml` `testpaths` (simplest)
- Option B: Create a dedicated `.github/workflows/evaluation-pytests.yml` workflow (mirrors the `dataviewer-backend-pytests.yml` pattern)
If Option B, add a new Codecov flag:

```yaml
- name: pytest-evaluation
  paths: ["evaluation/"]
  statuses:
    - type: patch
      informational: true
```
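If Option B is chosen, the workflow could start from a sketch like this. The trigger paths, uv usage, and step layout are assumptions modeled loosely on the `dataviewer-backend-pytests.yml` pattern mentioned above, not a copy of that file:

```yaml
# .github/workflows/evaluation-pytests.yml (sketch; steps are assumptions)
name: evaluation-pytests
on:
  pull_request:
    paths: ["evaluation/**"]
jobs:
  pytest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"  # matches requires-python = ">=3.11,<3.12"
      - run: pip install uv && uv sync
        working-directory: evaluation
      - run: uv run pytest --cov=. --cov-report=xml
        working-directory: evaluation
```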
Acceptance Criteria
- `evaluation/pyproject.toml` includes test dependencies and pytest configuration
- `evaluation/tests/conftest.py` exists with shared fixtures
- Initial tests cover `robot_types.py`, `plotting.py`, and `bootstrap_mlflow.py`
- `evaluation/` appears in Codecov reports
Implementation Notes
- Heavy deps (torch, onnxruntime-gpu, lerobot) must be mocked — these are not available in CI
- `requires-python = ">=3.11,<3.12"` means the CI runner must use Python 3.11.x
- Follow project conventions: `from __future__ import annotations`, `_LOGGER` naming, class-based test grouping
- Use `monkeypatch` over `unittest.mock.patch` per project conventions
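One common way to satisfy the "heavy deps must be mocked" note is to register stub modules in `sys.modules` before the code under test is imported. This is an assumed approach, not an existing helper in the repository:

```python
# Sketch: register lightweight stand-ins for heavy packages before
# importing the modules under test (an assumed approach, not an
# existing project helper).
from __future__ import annotations

import sys
import types

for heavy in ("torch", "onnxruntime", "lerobot"):
    # setdefault keeps the real package if it happens to be installed
    sys.modules.setdefault(heavy, types.ModuleType(heavy))

# With the stubs registered, a top-level `import torch` inside, say,
# policy_runner.py no longer fails at import time; individual tests
# then attach just the attributes they need to the stub.
```

Placing this at the top of `conftest.py` ensures the stubs exist before pytest collects any test module that imports the code under test.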
Related Issues
OpenSSF IDs: regression_tests_added50