A lightweight proof-of-concept for oversight-centered metrology in small coding-agent experiments.
This repository focuses on workflow-aware evaluation rather than raw benchmark scores. It treats human-like review as one interrupt channel among many, not as a privileged oracle, and it uses claim-margin reporting to avoid overclaiming from small logged experiments.
It is a compact experimental artifact accompanying the paper; the original TeX source is kept in `paper/`.
Small coding-agent experiments are often reported as raw success scores. That hides workflow-level quantities that matter in practice:
- hazard detection
- retries
- oversight cost
- escalation load
- claim status after transport and audit-distortion budgets
This repo makes those quantities explicit. The point is not to show that one model is "best." The point is to show that workflow-level accounting can materially change how the same experiment should be interpreted.
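As a purely illustrative sketch of what making these quantities explicit could mean, the per-run record implied by this accounting might look like the following; every field name here is an assumption, not the repository's actual schema:

```python
from dataclasses import dataclass

@dataclass
class WorkflowRecord:
    """Illustrative per-run accounting record; all field names are hypothetical."""
    raw_success: bool       # did the final attempt pass the task checks
    hazards_detected: int   # hazards caught by any interrupt channel
    retries: int            # repair attempts before success or give-up
    oversight_cost: float   # summed cost of automated checks and review
    escalations: int        # attempts escalated to the costly review channel
    claim_status: str       # e.g. "fail-closed" after transport/audit budgets
```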
Concretely, the repo:

- runs the same small Python task protocol under three workflow conditions
- logs every attempt in machine-readable form
- models cheap automated checks and costly review through one common interrupt-channel abstraction (a sketch of this abstraction follows the list)
- computes workflow-level metrics and a simple fail-closed claim margin
- compares recorded actual runs across `gemma3:1b`, `gemma3:4b`, and `gemini-2.5-flash`
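To make the interrupt-channel framing concrete, here is a minimal sketch of what such an abstraction could look like; the class, field, and function names are illustrative assumptions, not the repository's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class InterruptChannel:
    """One oversight channel: a check plus the cost of invoking it.

    Cheap automated checks and costly human-like review share this shape;
    only their cost and trigger logic differ.
    """
    name: str
    cost: float                   # charged every time the channel is invoked
    check: Callable[[str], bool]  # returns True if the attempt is flagged

def run_oversight(attempt: str, channels: list[InterruptChannel]) -> tuple[bool, float]:
    """Apply channels in order; return (flagged, total oversight cost)."""
    total_cost = 0.0
    for channel in channels:
        total_cost += channel.cost
        if channel.check(attempt):
            return True, total_cost  # stop at the first flag
    return False, total_cost
```

Putting costly review behind the same interface as a cheap linter is what lets oversight cost and escalation load enter the same ledger as raw success.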
By design, it is:

- not a leaderboard
- not a broad coding benchmark
- not a deployment-readiness evaluation
- not a provider ranking claim
- not a stability claim from one run per actual model
- not evidence that the current oversight stack captures all hidden semantic failures
In the logged runs currently included here:
- `gemma3:1b` is metrology-positive but performance-negative in this setup
- `gemma3:4b` improves performance, but the net value of oversight remains limited after workflow costs
- `gemini-2.5-flash` is stronger on raw success and protocol compliance, and shows a small positive automated-oversight utility signal
- despite that, the claim-margin status remains conservative: the positive comparative oversight claims stay fail-closed
- costly selective escalation is not justified in this lightweight setup
That is the scientifically interesting result: raw success alone is insufficient.
Use the documents in this order:
- `report/workflow_oversight_report.md`: the main public-facing report. It frames the cross-model result conservatively.
- `report/comparative_results.md`: the compact cross-model comparison table over actual runs.
- `result_summary.md`: the fuller run log with failure-mode commentary and explicit honesty notes.
- Raw run directories in `results/`: the original JSON/JSONL/CSV artifacts.
The most important distinction is:
- raw success is only one measurement
- workflow utility can move differently once retries and oversight cost are counted
- claim status can remain conservative even when raw scores improve
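As a toy illustration of how a claim can stay fail-closed despite a raw-score improvement (thresholds, budget names, and the formula itself are invented for this sketch; the repository's actual computation may differ):

```python
def claim_margin(success_rate: float, baseline: float,
                 transport_budget: float, audit_budget: float) -> tuple[float, str]:
    """Toy fail-closed claim margin.

    A comparative claim ("this setup beats the baseline") is only allowed
    to stand if the observed advantage survives after the transport and
    audit-distortion budgets are subtracted.
    """
    margin = (success_rate - baseline) - transport_budget - audit_budget
    status = "open" if margin > 0 else "fail-closed"
    return margin, status

# Example: a ten-point raw advantage that does not survive the budgets.
margin, status = claim_margin(0.60, 0.50, transport_budget=0.06, audit_budget=0.05)
print(round(margin, 3), status)  # -0.01 fail-closed
```

Here the raw score clearly improves, yet the comparative claim stays fail-closed once the budgets are charged; that is the distinction the list above is drawing.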
Comparative summary artifacts: `report/workflow_oversight_report.md` and `report/comparative_results.md`.
Recorded actual runs:
- results/ollama_gemma3_1b_20260312_180932
- results/ollama_gemma3_4b_20260312_184007
- results/gemini_2_5_flash_20260312_190453
Recorded smoke run:
The raw results above are the authoritative artifacts. The reports summarize them; they do not replace them.
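For quick inspection, a run's JSONL log can be read with the standard library alone; the file name `attempts.jsonl` is a guess for illustration, so substitute whatever artifact names the run directory actually contains:

```python
import json
from pathlib import Path

# Hypothetical artifact name; check the run directory for the actual files.
run_dir = Path("results/ollama_gemma3_1b_20260312_180932")
records = [
    json.loads(line)
    for line in (run_dir / "attempts.jsonl").read_text().splitlines()
    if line.strip()
]
print(len(records), "logged attempts")
```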
- `paper/`: Original TeX source for the paper. It is kept untouched.
- `tasks/`: Small Python repair tasks.
- `src/`: Runner, model backends, oversight logic, metrics, and report generation.
- `results/`: Raw logged run artifacts.
- `report/`: Public-facing reports and comparative summaries.
The core pipeline uses only the Python standard library.
Smoke test:
```
python -m src.main run --backend scripted --model scripted-smoke
```

Local Ollama runs:

```
python -m src.main run --backend ollama --model gemma3:1b --results-dir results/ollama_gemma3_1b_manual
python -m src.main run --backend ollama --model gemma3:4b --results-dir results/ollama_gemma3_4b_manual
```

Cloud Gemini run:

```
python -m src.main run --backend gemini --model gemini-2.5-flash --results-dir results/gemini_2_5_flash_manual
```

Set `GEMINI_API_KEY` in the shell before using the Gemini backend. Do not hardcode the key.
To regenerate the public comparative docs from the recorded runs:
```
python -m src.main release-docs
```

To sanitize recorded result paths for public release and remove Python caches:

```
python -m src.main release-clean
```

Known limitations:

- small synthetic Python task set
- one lightweight protocol
- one logged run per actual model in the current release
- current oversight stack mainly captures visible/public hazards, not all hidden semantic failures
- no broad deployment claim