DNAKode
diff --git a/AGENTS.md b/AGENTS.md
Lines changed: 2 additions & 0 deletions
diff --git a/RESEARCH_FINDINGS.md b/RESEARCH_FINDINGS.md
Lines changed: 111 additions & 0 deletions
diff --git a/benchmarks/AGENT_EVAL_WORKFLOW.md b/benchmarks/AGENT_EVAL_WORKFLOW.md
Lines changed: 2 additions & 0 deletions
diff --git a/benchmarks/experiments/20260212-approach-matrix-v0.1.6-preview.7.md b/benchmarks/experiments/20260212-approach-matrix-v0.1.6-preview.7.md
Lines changed: 72 additions & 0 deletions
@@ -298,3 +298,5 @@ Initial seed entries:
 - `2026-02-11`: LSP comparator fairness and reliability gaps were observed under loose-file tasks and unstable auth state -> Added project-backed paired-run task shape (`TargetHarness.csproj` + `Program.cs`) plus Claude auth preflight fail-fast in the harness -> Require project-context comparator runs and valid agent auth before interpreting Roslyn-vs-LSP outcomes.
 - `2026-02-11`: Preview distribution flow required too many manual steps across separate workflows -> Updated `Publish NuGet Preview` to build one artifact set, publish NuGet, and refresh GitHub Release assets in the same run -> Treat this unified workflow as the default regular release path for preview versions.
 - `2026-02-11`: File-scoped Roslyn commands could silently degrade to ad-hoc semantics and report misleading diagnostics -> Added workspace auto-resolution + explicit `workspace_path` override with surfaced `workspace_context` metadata in `nav.find_symbol`/`diag.get_file_diagnostics` and aligned pit-of-success guidance/harness prompts -> Require agents to verify `workspace_context.mode` and force workspace binding when mode is `ad_hoc` on project-backed files.
+- `2026-02-12`: Project-task benchmark runs were falsely failing due to harness artifact leakage (`Target.original.cs` was compiled into the generated project) -> Excluded `Target.original.cs` from `TargetHarness.csproj` and added regression test coverage -> Treat run-harness file layout as part of experiment validity gates.
+- `2026-02-12`: Workspace-mode evidence was hard to aggregate across transcripts -> Added paired-run metadata fields for Roslyn workspace mode counts and updated summary markdown/workflow docs -> Use `workspace/ad_hoc` counters as first-class comparability telemetry in scenario matrices.
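
Review note: the `Target.original.cs` exclusion above can be illustrated with a small layout check. This is a minimal sketch, not the harness's actual implementation: it assumes the generated project is SDK-style (so every `*.cs` file next to it compiles by default) and that the exclusion is expressed as an MSBuild `<Compile Remove="..." />` item. The project text, the `Target.cs` file name, and the `compiled_sources` helper are hypothetical.

```python
import xml.etree.ElementTree as ET


def compiled_sources(csproj_xml: str, files_on_disk: list[str]) -> list[str]:
    """Approximate which .cs files an SDK-style project would compile.

    Simplification: starts from every *.cs file and subtracts literal
    <Compile Remove="..."> entries; wildcards in Remove are not handled.
    """
    root = ET.fromstring(csproj_xml)
    removed = {
        item.attrib["Remove"]
        for item in root.iter("Compile")
        if "Remove" in item.attrib
    }
    return sorted(f for f in files_on_disk if f.endswith(".cs") and f not in removed)


# Hypothetical project text; the real TargetHarness.csproj may differ.
CSPROJ = """<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net8.0</TargetFramework>
  </PropertyGroup>
  <ItemGroup>
    <Compile Remove="Target.original.cs" />
  </ItemGroup>
</Project>"""

# Hypothetical fixture layout: the edited file plus its pristine backup copy.
FILES = ["Program.cs", "Target.cs", "Target.original.cs"]

print(compiled_sources(CSPROJ, FILES))  # prints ['Program.cs', 'Target.cs']
```

Without the `Remove` item, the backup copy would land in the compile surface next to the edited file, which is one way the duplicate-type errors described in this entry can arise.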
@@ -894,6 +894,117 @@ Decision:
 - Treat `workspace_context.mode=workspace` as the expected state for project-backed `nav.find_symbol` and `diag.get_file_diagnostics` calls.
 - Update pit-of-success and paired-run guidance to rerun with explicit workspace binding (`workspace_path`) when mode is `ad_hoc`.
 
+### F-2026-02-12-30: Project-shape paired runs were initially confounded by harness self-collision, now fixed and regression-tested
+
+Evidence:
+
+- harness fix:
+  - `benchmarks/scripts/Run-PairedAgentRuns.ps1` (`TargetHarness.csproj` now excludes `Target.original.cs`)
+- regression test:
+  - `tests/RoslynSkills.Benchmark.Tests/PairedRunHarnessScriptTests.cs`
+- post-fix clean project bundle:
+  - `artifacts/real-agent-runs/20260212-v0.1.6-preview.7-project-matrix-v2/paired-run-summary.json`
+
+Result:
+
+- before fix, project-shape runs produced duplicate-type/member errors unrelated to the edit task.
+- after fix, codex `control`, `treatment`, and `treatment-mcp` all passed constraint checks in project shape.
+
+Interpretation:
+
+- this was a harness validity bug, not a Roslyn capability issue.
+- separating harness defects from tool behavior materially changes interpretation quality.
+
+Decision:
+
+- treat generated fixture compile-surface as part of experiment correctness gates.
+- keep explicit test coverage for task-shape project generation.
+
+### F-2026-02-12-31: Paired harness now emits workspace-context mode telemetry that distinguishes workspace-backed vs ad-hoc runs
+
+Evidence:
+
+- metadata/summary instrumentation:
+  - `benchmarks/scripts/Run-PairedAgentRuns.ps1` (`roslyn_workspace_mode_workspace_count`, `roslyn_workspace_mode_ad_hoc_count`, `roslyn_workspace_mode_last`)
+- refreshed project bundle:
+  - `artifacts/real-agent-runs/20260212-v0.1.6-preview.7-project-matrix-v5/paired-run-summary.json`
+- refreshed single-file bundle:
+  - `artifacts/real-agent-runs/20260212-v0.1.6-preview.7-singlefile-matrix-v4/paired-run-summary.json`
+
+Result:
+
+- codex `treatment-mcp` project shape reports `workspace/ad_hoc = 2/0`.
+- codex `treatment-mcp` single-file shape reports `workspace/ad_hoc = 0/2`.
+
+Interpretation:
+
+- workspace-context mode behavior now appears directly in run metadata, reducing transcript-only ambiguity.
+- scenario-level context differences (project vs loose file) are now measurable and auditable.
+
+Decision:
+
+- include workspace-mode counters in future promotion/readout tables.
+- use `TaskShape=project` as default for context-sensitive comparator claims.
+
+### F-2026-02-12-32: Current cross-scenario approach matrix favors roscli helper as default path, with MCP as explicit-context path
+
+Evidence:
+
+- matrix artifact:
+  - `benchmarks/experiments/20260212-approach-matrix-v0.1.6-preview.7.md`
+- codex bundles:
+  - `artifacts/real-agent-runs/20260212-v0.1.6-preview.7-project-matrix-v5/paired-run-summary.json`
+  - `artifacts/real-agent-runs/20260212-v0.1.6-preview.7-singlefile-matrix-v4/paired-run-summary.json`
+- latest valid Claude comparator with LSP lane:
+  - `artifacts/real-agent-runs/20260211-lsp-roslyn-v4/paired-run-summary.json`
+
+Result:
+
+- codex project shape:
+  - control: `22.626s`, `34,150` tokens
+  - treatment: `35.740s`, `27,246` tokens
+  - treatment-mcp: `27.003s`, `66,108` tokens
+- codex single-file:
+  - control: `19.980s`, `34,037` tokens
+  - treatment: `24.718s`, `26,991` tokens
+  - treatment-mcp: `35.227s`, `79,416` tokens
+- Claude prior LSP comparator (`v4`) kept Roslyn lanes passing, while `treatment-lsp` timed out (`180.066s`, `0/1` successful LSP calls).
+
+Interpretation:
+
+- for this task family, `treatment` (roscli helper) remains the most practical default:
+  - consistent pass behavior,
+  - lower token totals than control in current codex runs,
+  - materially lower token/round-trip overhead than MCP.
+- MCP is useful when explicit workspace-mode evidence is required, but still costs more tokens/round-trips.
+
+Decision:
+
+- keep roscli helper path as default treatment baseline.
+- use MCP selectively for context assurance/debugging and structured multi-step operations.
+
+### F-2026-02-12-33: Comparator reliability is currently limited by execution-environment issues, not only tool behavior
+
+Evidence:
+
+- current run logs (2026-02-12 bundles) showed Claude auth preflight failures (`401 OAuth token expired`) and skipped Claude lanes.
+- prior LSP-enabled bundle still showed first-call timeout despite LSP availability:
+  - `artifacts/real-agent-runs/20260211-lsp-roslyn-v4/paired-run-summary.json`
+
+Result:
+
+- fresh codex data is clean and reproducible on current version.
+- fresh Claude/LSP data is currently blocked: Claude lanes fail auth preflight, and the only LSP evidence is the prior first-call timeout.
+
+Interpretation:
+
+- experimental infrastructure and account/plugin health are still first-order confounds for cross-agent conclusions.
+
+Decision:
+
+- treat Claude auth as a hard precondition for matrix refresh runs.
+- rerun full project-backed comparator (`control`, `treatment`, `treatment-mcp`, `treatment-lsp`) after auth recovery before updating architecture-level claims.
+
 ## Token-to-Information Efficiency (Proxy Metrics)
 
 Current telemetry allows two practical proxies:
 
@@ -154,6 +154,7 @@ Current guidance from skill-intro ablations (`artifacts/skill-intro-ablation/202
 - Treat `schema-first` as a debugging/contract-validation lane, not a default execution lane.
 - Keep prompt examples shell-specific (PowerShell vs Bash) and avoid inline JSON quoting in profile guidance.
 - For `nav.find_symbol` and `diag.get_file_diagnostics`, require `workspace_context.mode=workspace` on project-backed tasks; if mode is `ad_hoc`, rerun with explicit workspace path (`--workspace-path TargetHarness.csproj` or `workspace_path=TargetHarness.csproj` in MCP query).
+- `-TaskShape project` now excludes `Target.original.cs` from compilation in generated `TargetHarness.csproj` to prevent duplicate-type benchmark confounds.
 
 Isolation and integrity defaults:
 
@@ -173,6 +174,7 @@ Current harness outputs include:
   - control contamination detection,
   - deterministic rename constraint checks,
   - Roslyn attempted/successful call counts,
+  - Roslyn workspace-context mode counts (`roslyn_workspace_mode_workspace_count`, `roslyn_workspace_mode_ad_hoc_count`, `roslyn_workspace_mode_last`),
   - `duration_seconds` elapsed time per run,
   - `mcp_enabled` and MCP config file paths when applicable,
   - model token totals and cache-inclusive token totals,
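
Review note: the workspace-mode counters listed above lend themselves to a simple comparability gate. The sketch below is illustrative only. The three field names come from this diff, but the flat per-run dictionary shape, the assumption that `roslyn_workspace_mode_last` holds the same `workspace`/`ad_hoc` strings as `workspace_context.mode`, and the `workspace_mode_issues` helper are hypothetical.

```python
def workspace_mode_issues(run: dict, task_shape: str, expect_telemetry: bool = False) -> list[str]:
    """Return human-readable problems with a run's workspace-mode telemetry.

    Assumes `run` is a flat dict of per-run metadata (hypothetical shape).
    """
    ws = int(run.get("roslyn_workspace_mode_workspace_count") or 0)
    ad_hoc = int(run.get("roslyn_workspace_mode_ad_hoc_count") or 0)
    last = run.get("roslyn_workspace_mode_last")

    issues = []
    if task_shape == "project":
        # Project-backed tasks are expected to bind to the workspace.
        if ad_hoc > 0:
            issues.append(f"{ad_hoc} ad_hoc call(s) on a project-backed task")
        if ws + ad_hoc > 0 and last != "workspace":
            issues.append(f"last mode was {last!r}, expected 'workspace'")
    if expect_telemetry and ws + ad_hoc == 0:
        issues.append("no workspace-mode telemetry recorded")
    return issues


# Example records mirroring the counts reported for the codex MCP lanes.
mcp_project = {
    "roslyn_workspace_mode_workspace_count": 2,
    "roslyn_workspace_mode_ad_hoc_count": 0,
    "roslyn_workspace_mode_last": "workspace",
}
mcp_single_file = {
    "roslyn_workspace_mode_workspace_count": 0,
    "roslyn_workspace_mode_ad_hoc_count": 2,
    "roslyn_workspace_mode_last": "ad_hoc",
}

print(workspace_mode_issues(mcp_project, "project"))          # prints []
print(workspace_mode_issues(mcp_single_file, "single-file"))  # prints []
print(workspace_mode_issues(mcp_single_file, "project"))
```

The `expect_telemetry` switch is there because lanes that never emit these counters (control, and currently the helper lane) report `0/0`, which should not be read as a failure by default.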
 
@@ -0,0 +1,72 @@
+# Approach Matrix (v0.1.6-preview.7)
+
+Date: 2026-02-12  
+Purpose: compare currently available approaches across scenarios while separating experiment/harness failures from tool-behavior signals.
+
+## Sources
+
+- `artifacts/real-agent-runs/20260212-v0.1.6-preview.7-project-matrix-v5/paired-run-summary.json`
+- `artifacts/real-agent-runs/20260212-v0.1.6-preview.7-singlefile-matrix-v4/paired-run-summary.json`
+- `artifacts/real-agent-runs/20260211-lsp-roslyn-v4/paired-run-summary.json`
+
+## Scenario Matrix
+
+### A) Project-backed task (Codex, current version)
+
+| Approach | Run passed | Duration (s) | Total tokens | Round trips | Roslyn calls (ok/attempted) | Workspace modes (workspace/ad_hoc) |
+| --- | --- | ---: | ---: | ---: | --- | --- |
+| control | true | 22.626 | 34,150 | 2 | 0/0 | 0/0 |
+| treatment (roscli helper) | true | 35.740 | 27,246 | 2 | 1/1 | 0/0 |
+| treatment-mcp | true | 27.003 | 66,108 | 4 | 3/3 | 2/0 |
+
+### B) Single-file task (Codex, current version)
+
+| Approach | Run passed | Duration (s) | Total tokens | Round trips | Roslyn calls (ok/attempted) | Workspace modes (workspace/ad_hoc) |
+| --- | --- | ---: | ---: | ---: | --- | --- |
+| control | true | 19.980 | 34,037 | 2 | 0/0 | 0/0 |
+| treatment (roscli helper) | true | 24.718 | 26,991 | 2 | 1/1 | 0/0 |
+| treatment-mcp | true | 35.227 | 79,416 | 5 | 3/3 | 0/2 |
+
+### C) Single-file comparator snapshot (Claude, prior valid LSP bundle)
+
+| Approach | Run passed | Duration (s) | Total tokens | Round trips | Roslyn calls (ok/attempted) | LSP calls (ok/attempted) | LSP tools available |
+| --- | --- | ---: | ---: | ---: | --- | --- | --- |
+| control | true | 31.772 | 510 | 3 | 0/0 | 0/0 | n/a |
+| treatment (roscli) | true | 38.524 | 649 | 4 | 2/2 | 0/0 | n/a |
+| treatment-mcp | true | 38.225 | 957 | 5 | 3/3 | 0/0 | n/a |
+| treatment-lsp | false | 180.066 | n/a | 2 | 0/0 | 0/1 | true |
+
+## Most Promising Path (Current)
+
+- Default path for practical reliability: `treatment (roscli helper)` in project-backed tasks.
+- Why now:
+  - passed constraints in current project and single-file runs,
+  - lower model-token totals than control in both current codex scenarios,
+  - fewer tokens and round trips than MCP on this task family (project shape: `27,246` vs `66,108` tokens, `2` vs `4` round trips).
+- MCP remains valuable when explicit workspace-context evidence is required:
+  - project scenario recorded `workspace/ad_hoc = 2/0`,
+  - single-file scenario recorded `workspace/ad_hoc = 0/2`.
+
+## Things To Disentangle
+
+1. Claude auth volatility (execution environment)
+- Current 2026-02-12 Claude lanes were not runnable due to OAuth expiry (`401`), so no fresh Claude comparator data was produced.
+- This is an environment gate, not a Roslyn/LSP capability result.
+
+2. LSP reliability vs availability (experimental validity)
+- In the latest valid LSP comparator bundle (`20260211-lsp-roslyn-v4`), LSP tools were available but the first semantic call timed out (`0/1`, 180s).
+- Need project-backed replicated LSP runs with valid auth before comparative claims.
+
+3. Helper-path workspace telemetry visibility (instrumentation gap)
+- `treatment` helper lane uses `roslyn-rename-and-verify.ps1`; it does not currently emit `workspace_context` counts directly, so helper rows show `0/0`.
+- MCP lane provides clear workspace telemetry; helper lane should gain optional explicit workspace-mode probes for parity.
+
+4. Token comparability across providers (measurement caveat)
+- Claude rows include large cache-inclusive token components with provider-specific semantics.
+- Use per-agent comparisons first, cross-agent token comparisons second.
+
+## Immediate Follow-up
+
+1. Re-run full matrix with Claude after auth refresh (`control`, `treatment`, `treatment-mcp`, `treatment-lsp`) on `TaskShape=project`.
+2. Add helper-lane workspace-mode probe option so non-MCP Roslyn runs also report explicit `workspace/ad_hoc` counts.
+3. Add first `dotnet-inspect` comparator lanes (`inspect-only`, `roslyn-only`, `combined`) on package/API-sensitive scenarios.
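
Review note: the token claims in the matrix can be checked with a few lines of arithmetic. The sketch below copies the Codex token totals from scenario tables A and B and computes each treatment lane's change relative to `control`. The `TOKENS` layout and the `token_delta_pct` helper are hypothetical conveniences, not part of the harness.

```python
# Total tokens per lane, copied from scenario tables A and B (Codex).
TOKENS = {
    "project": {"control": 34150, "treatment": 27246, "treatment-mcp": 66108},
    "single-file": {"control": 34037, "treatment": 26991, "treatment-mcp": 79416},
}


def token_delta_pct(scenario: dict[str, int], lane: str, baseline: str = "control") -> float:
    """Percent change in total tokens for `lane` relative to `baseline`."""
    base = scenario[baseline]
    return round(100.0 * (scenario[lane] - base) / base, 1)


for name, scenario in TOKENS.items():
    for lane in ("treatment", "treatment-mcp"):
        print(f"{name:12s} {lane:14s} {token_delta_pct(scenario, lane):+.1f}%")
```

On these figures the helper lane uses roughly 20% fewer tokens than control in both shapes, while the MCP lane uses roughly 94% more (project) and 133% more (single-file). Duration is a separate axis: in project shape the helper lane was the slowest of the three at `35.740s`.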