
Oversight-Centered Metrology PoC for Small Coding-Agent Workflows

A lightweight proof-of-concept for oversight-centered metrology in small coding-agent experiments.

This repository focuses on workflow-aware evaluation rather than raw benchmark scores. It treats human-like review as one interrupt channel among many, not as a privileged oracle, and it uses claim-margin reporting to avoid overclaiming from small logged experiments.

It is a compact experimental artifact accompanying the paper whose original TeX source is kept in paper/.

Why this matters

Small coding-agent experiments are often reported as raw success scores. That hides workflow-level quantities that matter in practice:

  • hazard detection
  • retries
  • oversight cost
  • escalation load
  • claim status after transport and audit-distortion budgets

This repo makes those quantities explicit. The point is not to show that one model is "best." The point is to show that workflow-level accounting can materially change how the same experiment should be interpreted.
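
As a purely illustrative sketch of what "explicit" means here, a per-run accounting record covering the quantities listed above could look like the following. The class and field names are hypothetical and do not mirror the repository's actual schema in src/ or results/.

from dataclasses import dataclass


@dataclass
class WorkflowRecord:
    """Illustrative per-run accounting record; field names are hypothetical."""
    task_id: str
    raw_success: bool        # did the final attempt pass the task checks
    hazards_detected: int    # hazards flagged by any oversight channel
    retries: int             # extra attempts triggered before acceptance
    oversight_cost: float    # total cost charged to checks and review
    escalations: int         # attempts escalated to the costly review channel
    claim_status: str        # e.g. "fail-closed" after margin accounting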

What this repo does

  • runs the same small Python task protocol under three workflow conditions
  • logs every attempt in machine-readable form
  • models cheap automated checks and costly review through one common interrupt-channel abstraction
  • computes workflow-level metrics and a simple fail-closed claim margin (both sketched just after this list)
  • compares recorded actual runs across gemma3:1b, gemma3:4b, and gemini-2.5-flash
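
The sketch below shows, under assumptions, how a cheap automated check and a costly review could share one interrupt-channel interface, and how a simple fail-closed claim margin could be computed from an observed effect and the transport and audit-distortion budgets. Names such as InterruptChannel, run_with_oversight, and claim_margin are illustrative, not the actual API in src/.

from dataclasses import dataclass
from typing import Callable


@dataclass
class InterruptChannel:
    """One oversight channel. A cheap automated check and a costly review
    share this shape and differ only in cost and detection behaviour."""
    name: str
    cost: float                      # charged every time the channel is consulted
    detect: Callable[[str], bool]    # True means the attempt is interrupted


def run_with_oversight(attempt: str, channels: list[InterruptChannel]) -> tuple[bool, float]:
    """Return (interrupted, oversight_cost_spent) for one logged attempt."""
    spent = 0.0
    for channel in channels:
        spent += channel.cost
        if channel.detect(attempt):
            return True, spent       # interrupted: the attempt goes back for a retry
    return False, spent


def claim_margin(observed_effect: float, transport_budget: float,
                 audit_distortion_budget: float) -> float:
    """Fail-closed margin: a claim is kept open only if the observed effect
    clears the combined transport and audit-distortion budgets."""
    return observed_effect - (transport_budget + audit_distortion_budget)

Treating review as just another channel with a higher cost is what lets the same accounting cover both cheap checks and human-like review; a comparative oversight claim stays fail-closed unless the margin is positive, regardless of raw success.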

What this repo does NOT claim

  • not a leaderboard
  • not a broad coding benchmark
  • not a deployment-readiness evaluation
  • not a provider ranking claim
  • not a stability claim from one run per actual model
  • not evidence that the current oversight stack captures all hidden semantic failures

Main empirical takeaway

In the logged runs currently included here:

  • gemma3:1b is metrology-positive but performance-negative in this setup
  • gemma3:4b improves performance, but the net value of oversight remains limited after workflow costs
  • gemini-2.5-flash is stronger on raw success and protocol compliance, and shows a small positive automated-oversight utility signal
  • despite that, the claim-margin status remains conservative: the positive comparative oversight claims stay fail-closed
  • costly selective escalation is not justified in this lightweight setup

That is the scientifically interesting result: raw success alone is insufficient.

How to read the results

Use the documents in this order:

  1. report/workflow_oversight_report.md: the main public-facing report. It frames the cross-model result conservatively.
  2. report/comparative_results.md: the compact cross-model comparison table over actual runs.
  3. result_summary.md: the fuller run log with failure-mode commentary and explicit honesty notes.
  4. Raw run directories in results/: the original JSON/JSONL/CSV artifacts.

The most important distinctions to keep in mind are:

  • raw success is only one measurement
  • workflow utility can move differently once retries and oversight cost are counted (see the illustrative arithmetic after this list)
  • claim status can remain conservative even when raw scores improve
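
To make the middle point concrete, here is a small illustrative calculation. All numbers are made up; none of them come from the logged runs.

# Purely illustrative numbers; none of these come from the logged runs.
raw_success_rate = 0.8        # fraction of tasks eventually solved
value_per_success = 1.0
retries_per_task = 2.5        # average extra attempts needed
cost_per_retry = 0.1
reviews_per_task = 0.6        # average costly-review events per task
cost_per_review = 0.5

workflow_utility = (raw_success_rate * value_per_success
                    - retries_per_task * cost_per_retry
                    - reviews_per_task * cost_per_review)
print(workflow_utility)       # 0.8 - 0.25 - 0.30 = 0.25, far below the raw 0.8

A raw success rate of 0.8 can shrink to a workflow utility of 0.25 once retries and oversight are charged, which is why the two numbers are reported separately.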

Recorded results

Comparative summary artifacts:

Recorded actual runs:

Recorded smoke run:

The raw results above are the authoritative artifacts. The reports summarize them; they do not replace them.
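
If you want to inspect those artifacts programmatically, a minimal loading sketch follows. It assumes one JSONL file of per-attempt records per run directory; the filename attempts.jsonl is a guess, so adjust it to whatever the run directories actually contain.

import json
from pathlib import Path


def load_attempts(run_dir: str, filename: str = "attempts.jsonl") -> list[dict]:
    """Read per-attempt records from a run directory; the filename is a guess."""
    records = []
    with (Path(run_dir) / filename).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Example: attempts = load_attempts("results/ollama_gemma3_1b_manual")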

Repository layout

  • paper/: Original TeX source for the paper. It is kept untouched.
  • tasks/: Small Python repair tasks.
  • src/: Runner, model backends, oversight logic, metrics, and report generation.
  • results/: Raw logged run artifacts.
  • report/: Public-facing reports and comparative summaries.

Running the code

The core pipeline uses only the Python standard library.

Smoke test:

python -m src.main run --backend scripted --model scripted-smoke

Local Ollama runs:

python -m src.main run --backend ollama --model gemma3:1b --results-dir results/ollama_gemma3_1b_manual
python -m src.main run --backend ollama --model gemma3:4b --results-dir results/ollama_gemma3_4b_manual

Cloud Gemini run:

python -m src.main run --backend gemini --model gemini-2.5-flash --results-dir results/gemini_2_5_flash_manual

Set GEMINI_API_KEY in the shell before using the Gemini backend. Do not hardcode the key.
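
For example, in a POSIX shell, before running the Gemini command above (the value is a placeholder for your own key):

export GEMINI_API_KEY="your-key-here"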

To regenerate the public comparative docs from the recorded runs:

python -m src.main release-docs

To sanitize recorded result paths for public release and remove Python caches:

python -m src.main release-clean

Limitations

  • small synthetic Python task set
  • one lightweight protocol
  • one logged run per actual model in the current release
  • current oversight stack mainly captures visible/public hazards, not all hidden semantic failures
  • no broad deployment claim

License and citation
