A lightweight proof-of-concept for oversight-centered metrology in small coding-agent experiments.
This repository focuses on workflow-aware evaluation rather than raw benchmark scores. It treats human-like review as one interrupt channel among many, not as a privileged oracle, and it uses claim-margin reporting to avoid overclaiming from small logged experiments.
It is a compact experimental artifact accompanying the paper; the original TeX source is kept in `paper/`.
Small coding-agent experiments are often reported as raw success scores. That hides workflow-level quantities that matter in practice:
- hazard detection
- retries
- oversight cost
- escalation load
- claim status after transport and audit-distortion budgets
This repo makes those quantities explicit. The point is not to show that one model is "best." The point is to show that workflow-level accounting can materially change how the same experiment should be interpreted.
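As a purely illustrative sketch of what making these quantities explicit could mean, the per-run record implied by this accounting might look like the following; every field name here is an assumption, not the repository's actual schema:

```python
from dataclasses import dataclass

@dataclass
class WorkflowRecord:
    """Illustrative per-run accounting record; all field names are hypothetical."""
    raw_success: bool       # did the final attempt pass the task checks
    hazards_detected: int   # hazards caught by any interrupt channel
    retries: int            # repair attempts before success or give-up
    oversight_cost: float   # summed cost of automated checks and review
    escalations: int        # attempts escalated to the costly review channel
    claim_status: str       # e.g. "fail-closed" after transport/audit budgets
```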
Concretely, the repo:

- runs the same small Python task protocol under three workflow conditions
- logs every attempt in machine-readable form
- models cheap automated checks and costly review through one common interrupt-channel abstraction (a sketch of this abstraction follows the list)
- computes workflow-level metrics and a simple fail-closed claim margin
- compares recorded actual runs across `gemma3:1b`, `gemma3:4b`, and `gemini-2.5-flash`
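To make the interrupt-channel framing concrete, here is a minimal sketch of what such an abstraction could look like; the class, field, and function names are illustrative assumptions, not the repository's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class InterruptChannel:
    """One oversight channel: a check plus the cost of invoking it.

    Cheap automated checks and costly human-like review share this shape;
    only their cost and trigger logic differ.
    """
    name: str
    cost: float                   # charged every time the channel is invoked
    check: Callable[[str], bool]  # returns True if the attempt is flagged

def run_oversight(attempt: str, channels: list[InterruptChannel]) -> tuple[bool, float]:
    """Apply channels in order; return (flagged, total oversight cost)."""
    total_cost = 0.0
    for channel in channels:
        total_cost += channel.cost
        if channel.check(attempt):
            return True, total_cost  # stop at the first flag
    return False, total_cost
```

Putting costly review behind the same interface as a cheap linter is what lets oversight cost and escalation load enter the same ledger as raw success.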
By design, it is:

- not a leaderboard
- not a broad coding benchmark
- not a deployment-readiness evaluation
- not a provider ranking claim
- not a stability claim from one run per actual model
- not evidence that the current oversight stack captures all hidden semantic failures
In the logged runs currently included here:
- `gemma3:1b` is metrology-positive but performance-negative in this setup
- `gemma3:4b` improves performance, but the net value of oversight remains limited after workflow costs
- `gemini-2.5-flash` is stronger on raw success and protocol compliance, and shows a small positive automated-oversight utility signal
- despite that, the claim-margin status remains conservative: the positive comparative oversight claims stay fail-closed
- costly selective escalation is not justified in this lightweight setup
That is the scientifically interesting result: raw success alone is insufficient.
Use the documents in this order:
- `report/workflow_oversight_report.md`: the main public-facing report. It frames the cross-model result conservatively.
- `report/comparative_results.md`: the compact cross-model comparison table over actual runs.
- `result_summary.md`: the fuller run log with failure-mode commentary and explicit honesty notes.
- Raw run directories in `results/`: the original JSON/JSONL/CSV artifacts.
The most important distinction is:
- raw success is only one measurement
- workflow utility can move differently once retries and oversight cost are counted
- claim status can remain conservative even when raw scores improve
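As a toy illustration of how a claim can stay fail-closed despite a raw-score improvement (thresholds, budget names, and the formula itself are invented for this sketch; the repository's actual computation may differ):

```python
def claim_margin(success_rate: float, baseline: float,
                 transport_budget: float, audit_budget: float) -> tuple[float, str]:
    """Toy fail-closed claim margin.

    A comparative claim ("this setup beats the baseline") is only allowed
    to stand if the observed advantage survives after the transport and
    audit-distortion budgets are subtracted.
    """
    margin = (success_rate - baseline) - transport_budget - audit_budget
    status = "open" if margin > 0 else "fail-closed"
    return margin, status

# Example: a ten-point raw advantage that does not survive the budgets.
margin, status = claim_margin(0.60, 0.50, transport_budget=0.06, audit_budget=0.05)
print(round(margin, 3), status)  # -0.01 fail-closed
```

Here the raw score clearly improves, yet the comparative claim stays fail-closed once the budgets are charged; that is the distinction the list above is drawing.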
Comparative summary artifacts: `report/workflow_oversight_report.md` and `report/comparative_results.md`.
Recorded actual runs:
- results/ollama_gemma3_1b_20260312_180932
- results/ollama_gemma3_4b_20260312_184007
- results/gemini_2_5_flash_20260312_190453
Recorded smoke run:
The raw results above are the authoritative artifacts. The reports summarize them; they do not replace them.
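For quick inspection, a run's JSONL log can be read with the standard library alone; the file name `attempts.jsonl` is a guess for illustration, so substitute whatever artifact names the run directory actually contains:

```python
import json
from pathlib import Path

# Hypothetical artifact name; check the run directory for the actual files.
run_dir = Path("results/ollama_gemma3_1b_20260312_180932")
records = [
    json.loads(line)
    for line in (run_dir / "attempts.jsonl").read_text().splitlines()
    if line.strip()
]
print(len(records), "logged attempts")
```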
- `paper/`: Original TeX source for the paper. It is kept untouched.
- `tasks/`: Small Python repair tasks.
- `src/`: Runner, model backends, oversight logic, metrics, and report generation.
- `results/`: Raw logged run artifacts.
- `report/`: Public-facing reports and comparative summaries.
The core pipeline uses only the Python standard library.
Smoke test:
```
python -m src.main run --backend scripted --model scripted-smoke
```

Local Ollama runs:

```
python -m src.main run --backend ollama --model gemma3:1b --results-dir results/ollama_gemma3_1b_manual
python -m src.main run --backend ollama --model gemma3:4b --results-dir results/ollama_gemma3_4b_manual
```

Cloud Gemini run:

```
python -m src.main run --backend gemini --model gemini-2.5-flash --results-dir results/gemini_2_5_flash_manual
```

Set `GEMINI_API_KEY` in the shell before using the Gemini backend. Do not hardcode the key.
To regenerate the public comparative docs from the recorded runs:
```
python -m src.main release-docs
```

To sanitize recorded result paths for public release and remove Python caches:

```
python -m src.main release-clean
```

Known limitations:

- small synthetic Python task set
- one lightweight protocol
- one logged run per actual model in the current release
- current oversight stack mainly captures visible/public hazards, not all hidden semantic failures
- no broad deployment claim