Commit 0bb8087

Harden workspace-context benchmark telemetry and matrix reporting
1 parent 929ed62 commit 0bb8087

7 files changed

Lines changed: 341 additions & 6 deletions

AGENTS.md

Lines changed: 2 additions & 0 deletions
@@ -298,3 +298,5 @@ Initial seed entries:
 - `2026-02-11`: LSP comparator fairness and reliability gaps were observed under loose-file tasks and unstable auth state -> Added project-backed paired-run task shape (`TargetHarness.csproj` + `Program.cs`) plus Claude auth preflight fail-fast in the harness -> Require project-context comparator runs and valid agent auth before interpreting Roslyn-vs-LSP outcomes.
 - `2026-02-11`: Preview distribution flow required too many manual steps across separate workflows -> Updated `Publish NuGet Preview` to build one artifact set, publish NuGet, and refresh GitHub Release assets in the same run -> Treat this unified workflow as the default regular release path for preview versions.
 - `2026-02-11`: File-scoped Roslyn commands could silently degrade to ad-hoc semantics and report misleading diagnostics -> Added workspace auto-resolution + explicit `workspace_path` override with surfaced `workspace_context` metadata in `nav.find_symbol`/`diag.get_file_diagnostics` and aligned pit-of-success guidance/harness prompts -> Require agents to verify `workspace_context.mode` and force workspace binding when mode is `ad_hoc` on project-backed files.
+- `2026-02-12`: Project-task benchmark runs were falsely failing due to harness artifact leakage (`Target.original.cs` compiled into generated project) -> Excluded `Target.original.cs` from `TargetHarness.csproj` and added regression test coverage -> Treat run-harness file layout as part of experiment validity gates.
+- `2026-02-12`: Workspace-mode evidence was hard to aggregate across transcripts -> Added paired-run metadata fields for Roslyn workspace mode counts and updated summary markdown/workflow docs -> Use `workspace/ad_hoc` counters as first-class comparability telemetry in scenario matrices.

RESEARCH_FINDINGS.md

Lines changed: 111 additions & 0 deletions
@@ -894,6 +894,117 @@ Decision:
 - Treat `workspace_context.mode=workspace` as the expected state for project-backed `nav.find_symbol` and `diag.get_file_diagnostics` calls.
 - Update pit-of-success and paired-run guidance to rerun with explicit workspace binding (`workspace_path`) when mode is `ad_hoc`.
 
+### F-2026-02-12-30: Project-shape paired runs were initially confounded by harness self-collision, now fixed and regression-tested
+
+Evidence:
+
+- harness fix:
+  - `benchmarks/scripts/Run-PairedAgentRuns.ps1` (`TargetHarness.csproj` now excludes `Target.original.cs`)
+- regression test:
+  - `tests/RoslynSkills.Benchmark.Tests/PairedRunHarnessScriptTests.cs`
+- post-fix clean project bundle:
+  - `artifacts/real-agent-runs/20260212-v0.1.6-preview.7-project-matrix-v2/paired-run-summary.json`
+
+Result:
+
+- before fix, project-shape runs produced duplicate-type/member errors unrelated to the edit task.
+- after fix, codex `control`, `treatment`, and `treatment-mcp` all passed constraint checks in project shape.
+
+Interpretation:
+
+- this was a harness validity bug, not a Roslyn capability issue.
+- separating harness defects from tool behavior materially changes interpretation quality.
+
+Decision:
+
+- treat generated fixture compile-surface as part of experiment correctness gates.
+- keep explicit test coverage for task-shape project generation.
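
Editorial note on F-2026-02-12-30: the compile-surface gate described above can be approximated with a small standalone check. This is a minimal sketch, not the contents of `PairedRunHarnessScriptTests.cs`. It assumes the exclusion is written as an MSBuild `<Compile Remove="..."/>` item, which the finding does not state; `excludes_from_compile` and `SAMPLE` are hypothetical names introduced only for illustration.

```python
import xml.etree.ElementTree as ET


def excludes_from_compile(csproj_xml: str, file_name: str) -> bool:
    """Return True if any <Compile Remove="..."> item names file_name exactly."""
    root = ET.fromstring(csproj_xml)
    for element in root.iter():
        # Drop an XML namespace prefix if the project file declares one.
        tag = element.tag.rsplit("}", 1)[-1]
        if tag != "Compile":
            continue
        # MSBuild item lists are semicolon-separated.
        removed = element.get("Remove", "")
        if file_name in [part.strip() for part in removed.split(";")]:
            return True
    return False


# Hypothetical project file contents (assumption, for illustration only).
SAMPLE = """
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
  </PropertyGroup>
  <ItemGroup>
    <Compile Remove="Target.original.cs" />
  </ItemGroup>
</Project>
"""

if __name__ == "__main__":
    print(excludes_from_compile(SAMPLE, "Target.original.cs"))
```

The check only matches literal file names; glob patterns in `Remove` would need MSBuild's own evaluation and are out of scope for this sketch.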
+
+### F-2026-02-12-31: Paired harness now emits workspace-context mode telemetry that distinguishes workspace-backed vs ad-hoc runs
+
+Evidence:
+
+- metadata/summary instrumentation:
+  - `benchmarks/scripts/Run-PairedAgentRuns.ps1` (`roslyn_workspace_mode_workspace_count`, `roslyn_workspace_mode_ad_hoc_count`, `roslyn_workspace_mode_last`)
+- refreshed project bundle:
+  - `artifacts/real-agent-runs/20260212-v0.1.6-preview.7-project-matrix-v5/paired-run-summary.json`
+- refreshed single-file bundle:
+  - `artifacts/real-agent-runs/20260212-v0.1.6-preview.7-singlefile-matrix-v4/paired-run-summary.json`
+
+Result:
+
+- codex `treatment-mcp` project shape reports `workspace/ad_hoc = 2/0`.
+- codex `treatment-mcp` single-file shape reports `workspace/ad_hoc = 0/2`.
+
+Interpretation:
+
+- workspace-context mode behavior now appears directly in run metadata, reducing transcript-only ambiguity.
+- scenario-level context differences (project vs loose file) are now measurable and auditable.
+
+Decision:
+
+- include workspace-mode counters in future promotion/readout tables.
+- use `TaskShape=project` as default for context-sensitive comparator claims.
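
Editorial note on F-2026-02-12-31: one way the counters could feed a readout table is sketched below. The two counter field names come from the finding; the surrounding record shape (`lane`, `task_shape`) is an assumption, since the layout of `paired-run-summary.json` is not shown here. The sample values mirror the two `treatment-mcp` results reported above.

```python
# Counter names are taken from the harness metadata listed in the finding.
WORKSPACE_KEY = "roslyn_workspace_mode_workspace_count"
AD_HOC_KEY = "roslyn_workspace_mode_ad_hoc_count"


def workspace_ratio(run: dict) -> str:
    """Format one run's counters as 'workspace/ad_hoc', for example '2/0'."""
    return f"{run.get(WORKSPACE_KEY, 0)}/{run.get(AD_HOC_KEY, 0)}"


def ad_hoc_project_runs(runs: list) -> list:
    """List lanes of project-shaped runs that recorded any ad_hoc call.

    'lane' and 'task_shape' are hypothetical keys, not confirmed field names.
    """
    return [
        run["lane"]
        for run in runs
        if run.get("task_shape") == "project" and run.get(AD_HOC_KEY, 0) > 0
    ]


# Values from the Result bullets above; the record layout is assumed.
RUNS = [
    {"lane": "treatment-mcp", "task_shape": "project", WORKSPACE_KEY: 2, AD_HOC_KEY: 0},
    {"lane": "treatment-mcp", "task_shape": "single-file", WORKSPACE_KEY: 0, AD_HOC_KEY: 2},
]

if __name__ == "__main__":
    for run in RUNS:
        print(run["task_shape"], workspace_ratio(run))
    print("project runs with ad_hoc calls:", ad_hoc_project_runs(RUNS))
```

A non-empty result from `ad_hoc_project_runs` would mark a project-shaped run as a candidate for rerunning with explicit workspace binding before its numbers are compared.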
+
+### F-2026-02-12-32: Current cross-scenario approach matrix favors roscli helper as default path, with MCP as explicit-context path
+
+Evidence:
+
+- matrix artifact:
+  - `benchmarks/experiments/20260212-approach-matrix-v0.1.6-preview.7.md`
+- codex bundles:
+  - `artifacts/real-agent-runs/20260212-v0.1.6-preview.7-project-matrix-v5/paired-run-summary.json`
+  - `artifacts/real-agent-runs/20260212-v0.1.6-preview.7-singlefile-matrix-v4/paired-run-summary.json`
+- latest valid Claude comparator with LSP lane:
+  - `artifacts/real-agent-runs/20260211-lsp-roslyn-v4/paired-run-summary.json`
+
+Result:
+
+- codex project shape:
+  - control: `22.626s`, `34,150` tokens
+  - treatment: `35.740s`, `27,246` tokens
+  - treatment-mcp: `27.003s`, `66,108` tokens
+- codex single-file:
+  - control: `19.980s`, `34,037` tokens
+  - treatment: `24.718s`, `26,991` tokens
+  - treatment-mcp: `35.227s`, `79,416` tokens
+- Claude prior LSP comparator (`v4`) kept Roslyn lanes passing, while `treatment-lsp` timed out (`180.066s`, `0/1` successful LSP calls).
+
+Interpretation:
+
+- for this task family, `treatment` (roscli helper) remains the most practical default:
+  - consistent pass behavior,
+  - lower token totals than control in current codex runs,
+  - materially lower token/round-trip overhead than MCP.
+- MCP is useful when explicit workspace-mode evidence is required, but still costs more tokens/round-trips.
+
+Decision:
+
+- keep roscli helper path as default treatment baseline.
+- use MCP selectively for context assurance/debugging and structured multi-step operations.
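
Editorial note on F-2026-02-12-32: the token comparison behind the interpretation can be made explicit as percent change relative to `control`. The sketch below uses only the codex token totals listed in the Result bullets; the function and variable names are illustrative and not part of the harness.

```python
def percent_change(value: float, baseline: float) -> float:
    """Percent change of value relative to baseline, rounded to one decimal."""
    return round((value - baseline) / baseline * 100.0, 1)


# Codex token totals copied from the Result bullets above.
TOKENS = {
    "project": {"control": 34150, "treatment": 27246, "treatment-mcp": 66108},
    "single-file": {"control": 34037, "treatment": 26991, "treatment-mcp": 79416},
}


def token_deltas(tokens: dict) -> dict:
    """Per scenario, percent change of each non-control lane versus control."""
    deltas = {}
    for scenario, lanes in tokens.items():
        baseline = lanes["control"]
        deltas[scenario] = {
            lane: percent_change(total, baseline)
            for lane, total in lanes.items()
            if lane != "control"
        }
    return deltas


if __name__ == "__main__":
    for scenario, lanes in token_deltas(TOKENS).items():
        for lane, delta in lanes.items():
            print(f"{scenario:12s} {lane:14s} {delta:+.1f}%")
```

On these single runs, `treatment` uses about 20% fewer tokens than `control` in both shapes (-20.2% project, -20.7% single-file), while `treatment-mcp` uses about 94% more in the project shape and about 133% more in the single-file shape. With one run per cell, these figures describe the recorded bundles rather than a stable effect size.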
+
+### F-2026-02-12-33: Comparator reliability is currently limited by execution-environment issues, not only tool behavior
+
+Evidence:
+
+- current run logs (2026-02-12 bundles) showed Claude auth preflight failures (`401 OAuth token expired`) and skipped Claude lanes.
+- prior LSP-enabled bundle still showed first-call timeout despite LSP availability:
+  - `artifacts/real-agent-runs/20260211-lsp-roslyn-v4/paired-run-summary.json`
+
+Result:
+
+- fresh codex data is clean and reproducible on current version.
+- fresh Claude/LSP data is currently blocked by auth and prior LSP timeout behavior.
+
+Interpretation:
+
+- experimental infrastructure and account/plugin health are still first-order confounds for cross-agent conclusions.
+
+Decision:
+
+- treat Claude auth as a hard precondition for matrix refresh runs.
+- rerun full project-backed comparator (`control`, `treatment`, `treatment-mcp`, `treatment-lsp`) after auth recovery before updating architecture-level claims.
+
 ## Token-to-Information Efficiency (Proxy Metrics)
 
 Current telemetry allows two practical proxies:

benchmarks/AGENT_EVAL_WORKFLOW.md

Lines changed: 2 additions & 0 deletions
@@ -154,6 +154,7 @@ Current guidance from skill-intro ablations (`artifacts/skill-intro-ablation/202
 - Treat `schema-first` as a debugging/contract-validation lane, not a default execution lane.
 - Keep prompt examples shell-specific (PowerShell vs Bash) and avoid inline JSON quoting in profile guidance.
 - For `nav.find_symbol` and `diag.get_file_diagnostics`, require `workspace_context.mode=workspace` on project-backed tasks; if mode is `ad_hoc`, rerun with explicit workspace path (`--workspace-path TargetHarness.csproj` or `workspace_path=TargetHarness.csproj` in MCP query).
+- `-TaskShape project` now excludes `Target.original.cs` from compilation in generated `TargetHarness.csproj` to prevent duplicate-type benchmark confounds.
 
 Isolation and integrity defaults:
 
@@ -173,6 +174,7 @@ Current harness outputs include:
 - control contamination detection,
 - deterministic rename constraint checks,
 - Roslyn attempted/successful call counts,
+- Roslyn workspace-context mode counts (`roslyn_workspace_mode_workspace_count`, `roslyn_workspace_mode_ad_hoc_count`, `roslyn_workspace_mode_last`),
 - `duration_seconds` elapsed time per run,
 - `mcp_enabled` and MCP config file paths when applicable,
 - model token totals and cache-inclusive token totals,
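
Editorial note on the `workspace_context.mode` guidance in this file: the rerun rule for `nav.find_symbol` and `diag.get_file_diagnostics` can be read as a three-way decision. The sketch below assumes the command response is available as a dictionary with a nested `workspace_context` object carrying a `mode` field; that shape is inferred from the dotted name in the guidance and is not confirmed by the diff. `workspace_binding_action` is a hypothetical helper.

```python
def workspace_binding_action(response: dict, project_backed: bool) -> str:
    """Classify a Roslyn command response against the workspace-mode guidance.

    Returns:
      "ok"         - mode is acceptable for this task shape.
      "rerun"      - project-backed task answered in ad_hoc mode; rerun with an
                     explicit workspace path (for example
                     --workspace-path TargetHarness.csproj).
      "unverified" - no usable mode metadata, so the result cannot be checked.
    """
    if not project_backed:
        # Loose-file tasks are expected to run without a workspace.
        return "ok"
    context = response.get("workspace_context") or {}
    mode = context.get("mode")
    if mode == "workspace":
        return "ok"
    if mode == "ad_hoc":
        return "rerun"
    return "unverified"


if __name__ == "__main__":
    # Hypothetical response payloads, for illustration only.
    print(workspace_binding_action({"workspace_context": {"mode": "workspace"}}, True))
    print(workspace_binding_action({"workspace_context": {"mode": "ad_hoc"}}, True))
    print(workspace_binding_action({}, True))
```

Treating missing metadata as "unverified" rather than "ok" is a design choice of this sketch: it keeps silent degradation from passing as a workspace-backed result.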
Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
+# Approach Matrix (v0.1.6-preview.7)
+
+Date: 2026-02-12
+Purpose: compare currently available approaches across scenarios while separating experiment/harness failures from tool-behavior signals.
+
+## Sources
+
+- `artifacts/real-agent-runs/20260212-v0.1.6-preview.7-project-matrix-v5/paired-run-summary.json`
+- `artifacts/real-agent-runs/20260212-v0.1.6-preview.7-singlefile-matrix-v4/paired-run-summary.json`
+- `artifacts/real-agent-runs/20260211-lsp-roslyn-v4/paired-run-summary.json`
+
+## Scenario Matrix
+
+### A) Project-backed task (Codex, current version)
+
+| Approach | Run passed | Duration (s) | Total tokens | Round trips | Roslyn calls (ok/attempted) | Workspace modes (workspace/ad_hoc) |
+| --- | --- | ---: | ---: | ---: | --- | --- |
+| control | true | 22.626 | 34,150 | 2 | 0/0 | 0/0 |
+| treatment (roscli helper) | true | 35.740 | 27,246 | 2 | 1/1 | 0/0 |
+| treatment-mcp | true | 27.003 | 66,108 | 4 | 3/3 | 2/0 |
+
+### B) Single-file task (Codex, current version)
+
+| Approach | Run passed | Duration (s) | Total tokens | Round trips | Roslyn calls (ok/attempted) | Workspace modes (workspace/ad_hoc) |
+| --- | --- | ---: | ---: | ---: | --- | --- |
+| control | true | 19.980 | 34,037 | 2 | 0/0 | 0/0 |
+| treatment (roscli helper) | true | 24.718 | 26,991 | 2 | 1/1 | 0/0 |
+| treatment-mcp | true | 35.227 | 79,416 | 5 | 3/3 | 0/2 |
+
+### C) Single-file comparator snapshot (Claude, prior valid LSP bundle)
+
+| Approach | Run passed | Duration (s) | Total tokens | Round trips | Roslyn calls (ok/attempted) | LSP calls (ok/attempted) | LSP tools available |
+| --- | --- | ---: | ---: | ---: | --- | --- | --- |
+| control | true | 31.772 | 510 | 3 | 0/0 | 0/0 | n/a |
+| treatment (roscli) | true | 38.524 | 649 | 4 | 2/2 | 0/0 | n/a |
+| treatment-mcp | true | 38.225 | 957 | 5 | 3/3 | 0/0 | n/a |
+| treatment-lsp | false | 180.066 | n/a | 2 | 0/0 | 0/1 | true |
+
+## Most Promising Path (Current)
+
+- Default path for practical reliability: `treatment (roscli helper)` in project-backed tasks.
+- Why now:
+  - passed constraints in current project and single-file runs,
+  - lower model-token totals than control in both current codex scenarios,
+  - lower operational overhead than MCP on this task family.
+- MCP remains valuable when explicit workspace-context evidence is required:
+  - project scenario recorded `workspace/ad_hoc = 2/0`,
+  - single-file scenario recorded `workspace/ad_hoc = 0/2`.
+
+## Things To Disentangle
+
+1. Claude auth volatility (execution environment)
+   - Current 2026-02-12 Claude lanes were not runnable due to OAuth expiry (`401`), so no fresh Claude comparator data was produced.
+   - This is an environment gate, not a Roslyn/LSP capability result.
+
+2. LSP reliability vs availability (experimental validity)
+   - In the latest valid LSP comparator bundle (`20260211-lsp-roslyn-v4`), LSP tools were available but the first semantic call timed out (`0/1`, 180s).
+   - Need project-backed replicated LSP runs with valid auth before comparative claims.
+
+3. Helper-path workspace telemetry visibility (instrumentation gap)
+   - `treatment` helper lane uses `roslyn-rename-and-verify.ps1`; it does not currently emit `workspace_context` counts directly, so helper rows show `0/0`.
+   - MCP lane provides clear workspace telemetry; helper lane should gain optional explicit workspace-mode probes for parity.
+
+4. Token comparability across providers (measurement caveat)
+   - Claude rows include large cache-inclusive token components with provider-specific semantics.
+   - Use per-agent comparisons first, cross-agent token comparisons second.
+
+## Immediate Follow-up
+
+1. Re-run full matrix with Claude after auth refresh (`control`, `treatment`, `treatment-mcp`, `treatment-lsp`) on `TaskShape=project`.
+2. Add helper-lane workspace-mode probe option so non-MCP Roslyn runs also report explicit `workspace/ad_hoc` counts.
+3. Add first `dotnet-inspect` comparator lanes (`inspect-only`, `roslyn-only`, `combined`) on package/API-sensitive scenarios.
