You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: AGENTS.md
+2Lines changed: 2 additions & 0 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -298,3 +298,5 @@ Initial seed entries:
298
298
-`2026-02-11`: LSP comparator fairness and reliability gaps were observed under loose-file tasks and unstable auth state -> Added project-backed paired-run task shape (`TargetHarness.csproj` + `Program.cs`) plus Claude auth preflight fail-fast in the harness -> Require project-context comparator runs and valid agent auth before interpreting Roslyn-vs-LSP outcomes.
299
299
-`2026-02-11`: Preview distribution flow required too many manual steps across separate workflows -> Updated `Publish NuGet Preview` to build one artifact set, publish NuGet, and refresh GitHub Release assets in the same run -> Treat this unified workflow as the default regular release path for preview versions.
300
300
-`2026-02-11`: File-scoped Roslyn commands could silently degrade to ad-hoc semantics and report misleading diagnostics -> Added workspace auto-resolution + explicit `workspace_path` override with surfaced `workspace_context` metadata in `nav.find_symbol`/`diag.get_file_diagnostics` and aligned pit-of-success guidance/harness prompts -> Require agents to verify `workspace_context.mode` and force workspace binding when mode is `ad_hoc` on project-backed files.
301
+
-`2026-02-12`: Project-task benchmark runs were falsely failing due harness artifact leakage (`Target.original.cs` compiled into generated project) -> Excluded `Target.original.cs` from `TargetHarness.csproj` and added regression test coverage -> Treat run-harness file layout as part of experiment validity gates.
302
+
-`2026-02-12`: Workspace-mode evidence was hard to aggregate across transcripts -> Added paired-run metadata fields for Roslyn workspace mode counts and updated summary markdown/workflow docs -> Use `workspace/ad_hoc` counters as first-class comparability telemetry in scenario matrices.
Copy file name to clipboardExpand all lines: benchmarks/AGENT_EVAL_WORKFLOW.md
+2Lines changed: 2 additions & 0 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -154,6 +154,7 @@ Current guidance from skill-intro ablations (`artifacts/skill-intro-ablation/202
154
154
- Treat `schema-first` as a debugging/contract-validation lane, not a default execution lane.
155
155
- Keep prompt examples shell-specific (PowerShell vs Bash) and avoid inline JSON quoting in profile guidance.
156
156
- For `nav.find_symbol` and `diag.get_file_diagnostics`, require `workspace_context.mode=workspace` on project-backed tasks; if mode is `ad_hoc`, rerun with explicit workspace path (`--workspace-path TargetHarness.csproj` or `workspace_path=TargetHarness.csproj` in MCP query).
157
+
-`-TaskShape project` now excludes `Target.original.cs` from compilation in generated `TargetHarness.csproj` to prevent duplicate-type benchmark confounds.
157
158
158
159
Isolation and integrity defaults:
159
160
@@ -173,6 +174,7 @@ Current harness outputs include:
-`treatment` helper lane uses `roslyn-rename-and-verify.ps1`; it does not currently emit `workspace_context` counts directly, so helper rows show `0/0`.
62
+
- MCP lane provides clear workspace telemetry; helper lane should gain optional explicit workspace-mode probes for parity.
63
+
64
+
4. Token comparability across providers (measurement caveat)
65
+
- Claude rows include large cache-inclusive token components with provider-specific semantics.
66
+
- Use per-agent comparisons first, cross-agent token comparisons second.
67
+
68
+
## Immediate Follow-up
69
+
70
+
1. Re-run full matrix with Claude after auth refresh (`control`, `treatment`, `treatment-mcp`, `treatment-lsp`) on `TaskShape=project`.
71
+
2. Add helper-lane workspace-mode probe option so non-MCP Roslyn runs also report explicit `workspace/ad_hoc` counts.
72
+
3. Add first `dotnet-inspect` comparator lanes (`inspect-only`, `roslyn-only`, `combined`) on package/API-sensitive scenarios.
0 commit comments