fix: keep dashboard sessions responsive and stop cleanly#1782
fix: keep dashboard sessions responsive and stop cleanly#1782etwk wants to merge 12 commits intoComposioHQ:mainfrom
Conversation
Greptile SummaryThis PR addresses dashboard responsiveness and clean shutdown by bounding live enrichment with deadlines, adding stale-while-revalidate caching, separating metadata-only reads (
Confidence Score: 5/5Safe to merge — enrichment deadline tracking and SWR caching are well-guarded with generation checks, the stop/restore fix correctly moves all sessions into otherProjects for full stops, and the mux polling overhaul is a clean improvement. The enrichment deduplication logic (generation tracking, per-key cooldowns, deadline-based timeouts) is correct and self-consistent. The listCached SWR cold-cache path deliberately returns unenriched data as a conscious tradeoff for responsiveness. The restore-on-start logic now correctly handles full-stop records where sessionIds is empty and all sessions live in otherProjects. No correctness regressions were found in the reviewed paths. packages/core/src/session-manager.ts — the new bounded-enrichment and cache-refresh machinery is the most complex addition and deserves close attention during any future changes to listCached, ensureHandleAndEnrichWithin, or invalidateCache.
|
| Filename | Overview |
|---|---|
| packages/core/src/session-manager.ts | Major enrichment refactor: adds bounded enrichment with deadline tracking, deduplication via boundedEnrichmentPromises, stale-while-revalidate caching, and a new listStored() path. Cold-cache now returns unenriched sessions immediately. Logic is sound with generation tracking to prevent stale cache writes. |
| packages/web/server/mux-websocket.ts | SessionBroadcaster overhauled from setInterval to recursive setTimeout with deduplication via inFlight. Adds lastSnapshot for cold-start coalescing, error suppression threshold, and increases fetch timeout from 4s to 15s. Fixes Windows localhost to 127.0.0.1 for loopback. |
| packages/cli/src/lib/all-projects-config.ts | New module for merging global + running config for stop/shutdown paths. Handles relative path normalization, plugin deduplication, and fallback chains. Config merge semantics are consistent: secondary (running) wins for project definitions and plugins; primary (global) wins for notifiers/reactions/routing. |
| packages/cli/src/commands/start.ts | Stop command broadened to use merged all-projects config; full-stop now writes sessionIds: [] with all sessions in otherProjects keyed by their real project ID, fixing the restore mismatch. killDashboardOnPort skips port-based kill on Windows to avoid false positives. |
| packages/cli/src/lib/shutdown.ts | Shutdown handler now uses loadAllProjectsConfigWithFallback and listStored(). Session cleanup and unregister() are now in separate try/catch blocks so a cleanup failure no longer prevents daemon unregistration. |
| packages/web/src/app/api/sessions/route.ts | PR enrichment moved before settlesWithin barrier so a timed-out metadata enrichment running in background cannot overwrite PR fields. Adds metadataFresh freshness level mapping to listStored() for mux reconciliation. |
| packages/web/src/app/api/sessions/[id]/route.ts | Adds enrichTimeoutMs to sessionManager.get() calls, bounding live enrichment to 3s. PR enrichment is also moved before settlesWithin, consistent with sessions list route. |
| packages/web/src/app/api/sessions/patches/route.ts | Switches mux poll path from list() (live enrichment) to listStored() (metadata-only), correctly avoiding runtime probes on the high-frequency polling endpoint. |
Sequence Diagram
sequenceDiagram
participant UI as Browser/SSR
participant API as /api/sessions
participant SM as SessionManager
participant FS as Filesystem
participant RT as Runtime/Agent
Note over UI,RT: Cold cache (first request)
UI->>API: GET /api/sessions (staleWhileRevalidate)
API->>SM: "listCached({staleWhileRevalidate:true})"
SM->>FS: loadStoredSessions() no enrichment
FS-->>SM: raw metadata sessions
SM-->>API: unenriched sessions (fast)
SM--)RT: refreshSessionCache() background
API-->>UI: immediate response
Note over UI,RT: Background enrichment completes
RT-->>SM: runtime states, agent info
SM->>FS: persist lifecycle updates
SM->>SM: update sessionCache
Note over UI,RT: Mux WebSocket polling
UI->>API: GET /api/sessions/patches
API->>SM: listStored() metadata only
FS-->>SM: raw metadata
SM-->>API: session patches
API-->>UI: broadcast to WS clients
Note over UI,RT: Individual session GET
UI->>API: GET /api/sessions/[id]
API->>SM: get(id, enrichTimeoutMs:3000)
SM->>RT: ensureHandleAndEnrichWithin()
RT-->>SM: enriched or timeout
SM-->>API: session
API-->>UI: response
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 3
packages/core/src/session-manager.ts:519-545
**Cold-cache path stores unenriched sessions for the full TTL**
`listCached({staleWhileRevalidate: true})` on a cold cache stores the output of `loadStoredSessions()` (no runtime/agent enrichment) into `sessionCache` with the full `SESSION_CACHE_TTL_MS`. The background refresh then races to overwrite it. Until it completes, every subsequent SWR call sees the unenriched snapshot — including calls that arrive in the middle of the background refresh. If the background refresh fails (the error is swallowed in `refreshSessionCacheInBackground`), the cache TTL holds stale unenriched data for the full cycle with no visible signal to callers.
This is likely intentional for responsiveness, but it's worth confirming that all consumers of `listCached` with `staleWhileRevalidate: true` tolerate null/undefined `agentInfo`, `runtimeHandle`, and activity fields on the first render pass.
### Issue 2 of 3
packages/core/src/session-manager.ts:1050-1065
**Deduplication keys are path-scoped, not session-scoped — per-call-site cooldowns don't cross-protect**
The `dedupeKey` is prefixed with the call site (`get:3000:...`, `list:cached:...`, `list:fresh:...`), so a timeout in one path sets a cooldown only for that exact key. If `get()` times out for a session and sets a 15s cooldown, a concurrent `list()` call will still start a fresh enrichment for the same session under a different key, potentially timing out again and setting its own cooldown. Under heavy load, a stalling agent or runtime could cause repeated 15s timeouts across all three call-site variants before all three cooldowns are in effect.
### Issue 3 of 3
packages/cli/src/lib/all-projects-config.ts:158-175
**Relative plugin paths in secondary (running) config are normalized, but primary (global) config paths are not**
`normalizeConfigRelativePluginPaths` is applied to `secondary` only. The assumption is that `loadConfig()` already produces absolute paths for the primary config. If `loadConfig` ever returns unresolved relative paths, project-level plugin paths from the global config would remain relative after the merge. The `withAbsoluteProjectPluginPaths` helper only normalizes `tracker.path` and `scm.path` — other path-bearing fields are not touched in either config.
Reviews (10): Last reviewed commit: "Tighten final review edge cases" | Re-trigger Greptile
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e2f3e16969
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: df569c4b21
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
The mux patches feed was doing fresh session enrichment on each high-frequency poll, making the old 4s abort deadline too tight for larger local session histories. Reuse the SessionManager cache for the polling route and give individual snapshots a 15s ceiling while the existing in-flight guard prevents overlap. Constraint: /api/sessions/patches is a high-frequency live hint path consumed by direct-terminal mux. Rejected: Keep fresh list() on every poll | it repeats expensive enrichment and reproduces abort warnings. Rejected: Remove the timeout | slow or hung Next.js route calls still need a bounded failure mode. Confidence: high Scope-risk: narrow Directive: Treat patches as eventually consistent; use fresh session routes for source-of-truth refreshes. Tested: pnpm --filter @aoagents/ao-web test -- server/__tests__/mux-websocket.test.ts src/__tests__/api-routes.test.ts; pnpm --filter @aoagents/ao-web typecheck; git diff --check Not-tested: Full monorepo test suite
Bound session detail enrichment so the web route can return metadata-backed session data instead of waiting indefinitely on live runtime or agent probes. Share bounded enrichment safely across concurrent callers, throttle retries after timeouts, and preserve fresher caller metadata while applying explicit enrichment updates. Constraint: Session detail must degrade to partial data rather than showing the frontend timeout page. Rejected: Raising only the UI/page timeout | it would still leave stalled live enrichment work and duplicate polling pressure. Rejected: Broad plugin AbortSignal migration | too large for this targeted dashboard regression fix. Confidence: high Scope-risk: moderate Directive: Keep live enrichment best-effort on user-facing read paths; do not reintroduce unbounded detail-route waits. Tested: pnpm --filter @aoagents/ao-core test -- src/__tests__/session-manager/query.test.ts; pnpm --filter @aoagents/ao-web test -- src/__tests__/api-routes.test.ts; pnpm --filter @aoagents/ao-core typecheck; pnpm --filter @aoagents/ao-core build; pnpm --filter @aoagents/ao-web typecheck; git diff --check Not-tested: Full monorepo test suite.
Serve persisted session metadata immediately, refresh live enrichment in the background, and coalesce mux/client refresh paths so slow or aborted probes do not blank the dashboard. Constraint: Dashboard and session sidebars must stay responsive even when runtime or agent probes stall or abort. Rejected: Blocking ordinary /api/sessions reads on fresh enrichment | preserves correctness but recreates the reported multi-second blank dashboard and abort noise. Confidence: high Scope-risk: moderate Directive: Keep fresh/live enrichment explicit for callers that need it; passive dashboard reads should remain stale-while-revalidate and bounded by tests. Tested: pnpm --filter @aoagents/ao-web typecheck; pnpm --filter @aoagents/ao-web test -- server/__tests__/mux-websocket.test.ts src/app/sessions/[id]/page.test.tsx src/__tests__/api-routes.test.ts src/hooks/__tests__/useSessionEvents.mux.test.ts; pnpm --filter @aoagents/ao-core typecheck; pnpm --filter @aoagents/ao-core test -- src/__tests__/session-manager/cache.test.ts src/__tests__/session-manager/query.test.ts; pnpm lint; git diff --check; no TS/TSX NUL bytes Not-tested: full pnpm build; full pnpm test
The dashboard can receive fast list refreshes while slower metadata enrichment is still bounded by timeouts. Preserve locally available PR enrichment first, then guard the client against transient unenriched placeholders without letting stale status facts override incoming lifecycle state. Constraint: /api/sessions must stay responsive when tracker or agent metadata enrichment stalls Rejected: Server-only protection | client refreshes can still receive transient unenriched placeholders from older or partial sources Confidence: high Scope-risk: narrow Directive: Keep incoming lifecycle/status authoritative when preserving enriched PR display fields Tested: pnpm --filter @aoagents/ao-web test -- src/__tests__/api-routes.test.ts src/hooks/__tests__/useSessionEvents.mux.test.ts; pnpm --filter @aoagents/ao-web typecheck; pnpm lint; git diff --check; pnpm --filter @aoagents/ao-web build; headed-browser smoke against local dashboard Not-tested: Full monorepo test suite
Use cheaper persisted session metadata for dashboard cache paths while preserving fresh list semantics, and make full-stop/shutdown cleanup cover the global project set plus local running projects. Tighten dashboard process cleanup so Unix stops the supervisor tree and Windows fails safe without ownership proof. Constraint: AO dashboard and stop paths must remain cross-platform across macOS, Linux, and Windows. Rejected: Globally downgrading SessionManager.list() to cached Codex metadata | it weakens the fresh-enrichment contract for CLI/lifecycle callers. Rejected: Killing Windows port listeners by port alone | it can terminate unrelated user processes without a reliable dashboard ownership check. Confidence: high Scope-risk: moderate Directive: Keep future stop/shutdown logic routed through the all-project config loader and preserve separate fresh-vs-cached enrichment dedupe keys. Tested: pnpm --filter @aoagents/ao-core test -- src/__tests__/session-manager/query.test.ts; pnpm --filter @aoagents/ao-cli test -- __tests__/commands/start.test.ts __tests__/lib/project-supervisor.test.ts; pnpm --filter @aoagents/ao-web test -- src/lib/__tests__/serialize.test.ts server/__tests__/server-compatibility.test.ts src/hooks/__tests__/useSessionEvents.mux.test.ts; git diff --check; pnpm typecheck; pnpm lint; pnpm build; two-lane code review with final architecture CLEAR Not-tested: Full manual ao restart/stop smoke against the installed global CLI after committing.
Align the reliability PR with review feedback while preserving the bounded dashboard read model: restore uses all-project config, registered-parent shutdown does not emit a false dashboard warning, cached session refresh periodically refreshes live Codex metadata, detail PR enrichment matches list-route ordering, and mux patches use fresh stored metadata. Constraint: High-frequency dashboard reads must avoid live runtime/agent probes while avoiding stale session visibility. Rejected: Reverting patches to full live list() | it would reintroduce enrichment pressure. Confidence: high Scope-risk: narrow Directive: Keep mux patches metadata-only and reserve live enrichment for explicit refresh paths. Tested: pnpm --filter @aoagents/ao-core test -- src/__tests__/session-manager/cache.test.ts src/__tests__/session-manager/query.test.ts; pnpm --filter @aoagents/ao-web test -- server/__tests__/mux-websocket.test.ts server/__tests__/server-compatibility.test.ts src/__tests__/api-routes.test.ts src/app/sessions/[id]/page.test.tsx src/hooks/__tests__/useSessionEvents.mux.test.ts src/lib/__tests__/serialize.test.ts; pnpm --filter @aoagents/ao-cli test -- __tests__/commands/start.test.ts __tests__/lib/project-supervisor.test.ts; pnpm typecheck; git diff --check Not-tested: Full monorepo test suite
Prevent full stop and restore from silently narrowing project coverage when an existing global or running config fails to parse or validate. Constraint: Code review flagged broad optional config fallback as a restore and shutdown masking risk. Rejected: Catching every loadConfig failure | malformed existing configs must not look absent. Confidence: high Scope-risk: narrow Directive: Optional all-project config loading may ignore missing files only; existing invalid configs must remain observable. Tested: pnpm --filter @aoagents/ao-cli test -- __tests__/lib/all-projects-config.test.ts __tests__/commands/start.test.ts __tests__/lib/project-supervisor.test.ts; pnpm --filter @aoagents/ao-web test -- server/__tests__/mux-websocket.test.ts Not-tested: Full monorepo test suite
Keep startup non-blocking while surfacing all-project restore config failures that would otherwise hide unrestored sessions. Constraint: Follow-up architecture review found loadAllProjectsConfig failures were still swallowed by the last-stop restore wrapper. Rejected: Failing startup on restore bookkeeping errors | restore should remain best-effort, but operators need the warning. Confidence: high Scope-risk: narrow Directive: Last-stop restore errors should be visible and should not clear retry state unless sessions actually restore or the user declines. Tested: pnpm --filter @aoagents/ao-cli test -- __tests__/commands/start.test.ts __tests__/lib/all-projects-config.test.ts __tests__/lib/project-supervisor.test.ts; pnpm --filter @aoagents/ao-cli typecheck; git diff --check Not-tested: Full monorepo test suite
df569c4 to
ffcd6ff
Compare
Make stop/shutdown use strict all-project config when possible but fall back to running, global, or default configs before giving up. Prefer the running config on same-ID merges, and record no-arg stop sessions by actual project groups so restore does not depend on config ordering. Constraint: PR review flagged stop/shutdown regressions where malformed merged configs prevented daemon cleanup and arbitrary primary project selection hid restore prompts. Rejected: Making loadAllProjectsConfig swallow invalid configs | ordinary callers still need malformed config failures surfaced. Confidence: high Scope-risk: moderate Directive: Keep full-process stop/shutdown able to signal the registered daemon even when session cleanup config loading degrades. Tested: pnpm --filter @aoagents/ao-cli test -- __tests__/lib/all-projects-config.test.ts __tests__/lib/shutdown.test.ts __tests__/commands/start.test.ts; pnpm --filter @aoagents/ao-cli typecheck; pnpm exec eslint packages/cli/src/lib/all-projects-config.ts packages/cli/src/commands/start.ts packages/cli/src/lib/shutdown.ts packages/cli/__tests__/lib/all-projects-config.test.ts packages/cli/__tests__/commands/start.test.ts packages/cli/__tests__/lib/shutdown.test.ts; git diff --check Not-tested: Full monorepo test suite
|
Want your agent to iterate on Greptile's feedback? Try greploops. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0855663372
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Address review feedback without reintroducing live-probe latency or unsafe shutdown fallback. Keep passive dashboard reads responsive while giving reconciliation paths deterministic metadata freshness, and guard restore cleanup from deleting still-registered worktrees. Constraint: AO supports Windows first-class, and shutdown paths must not depend on arbitrary CWD config discovery or slow live enrichment.\nRejected: Falling back to default config discovery for stop/shutdown | it can target unrelated local sessions when daemon-owned configs are broken.\nRejected: Keeping SIGINT cleanup on live list() | live enrichment can outlive the force-exit guard.\nRejected: Force-removing an existing registered worktree during restore | it can destroy user work before the safety check.\nConfidence: high\nScope-risk: moderate\nDirective: Keep listStored/listCached/fresh=metadata/includeDefaultFallback semantics explicit when adding new callers.\nTested: pnpm --filter @aoagents/ao-cli test -- __tests__/lib/all-projects-config.test.ts __tests__/lib/shutdown.test.ts __tests__/commands/start.test.ts; pnpm --filter @aoagents/ao-core test -- src/__tests__/session-manager/cache.test.ts src/__tests__/session-manager/query.test.ts; pnpm --filter @aoagents/ao-web test -- server/__tests__/mux-websocket.test.ts src/__tests__/api-routes.test.ts src/app/sessions/[id]/page.test.tsx src/hooks/__tests__/useSessionEvents.mux.test.ts; pnpm --filter @aoagents/ao-integration-tests test -- src/workspace-worktree.integration.test.ts; pnpm typecheck; pnpm lint; git diff --check.\nNot-tested: Full browser E2E/manual dashboard smoke.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 242a01f0af
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Scope bounded get enrichment by timeout so short dashboard probes cannot poison longer callers. Make all-project config merging preserve running-owned plugins and path bases, while keeping mutation-sensitive registries isolated per config object. Clean up full-stop restore display and avoid unsafe Windows port probing after parent stop. Constraint: Fresh Codex and Greptile review flagged timeout-agnostic get dedupe, merged-config plugin path/precedence issues, and full-stop restore display wording. Rejected: Rebase all merged configs to the running configPath | global-only plugin paths and config registry semantics would then regress. Rejected: Cache every equivalent inline-plugin config structurally | plugin registry loading mutates fresh configs with manifest names, so cache hits could skip required side effects. Confidence: high Scope-risk: moderate Directive: Keep Windows dashboard port fallback disabled unless ownership can be proven without risking unrelated listeners. Tested: pnpm --filter @aoagents/ao-cli test -- --run __tests__/lib/create-session-manager.test.ts __tests__/lib/all-projects-config.test.ts __tests__/commands/start.test.ts; pnpm --filter @aoagents/ao-core test -- --run src/__tests__/session-manager/query.test.ts; pnpm typecheck; pnpm lint Not-tested: Full pnpm test suite; lint has existing repository warnings only.
|
@codex review |
|
Codex Review: Didn't find any major issues. 👍 ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Remove the cold-start mux duplicate notification path by deferring the first poll until the coalesced initial snapshot settles. Document the stop-all last-stop grouping invariant and the listCached cold-cache TTL/enrichment behavior so future changes preserve the intended tradeoffs. Constraint: Fresh Greptile review on 213dfe4 marked the PR safe to merge but identified minor edge cases worth eliminating or documenting. Rejected: Leave mux double notification as harmless | fixing it is small and avoids stateful subscriber surprises. Confidence: high Scope-risk: narrow Directive: Keep full-stop restore grouped by actual project; the top-level projectId is only a compatibility anchor when stopAll is true. Tested: pnpm --filter @aoagents/ao-web test -- --run server/__tests__/mux-websocket.test.ts; pnpm --filter @aoagents/ao-cli test -- --run __tests__/lib/create-session-manager.test.ts __tests__/lib/all-projects-config.test.ts __tests__/commands/start.test.ts; pnpm --filter @aoagents/ao-core test -- --run src/__tests__/session-manager/query.test.ts src/__tests__/session-manager/cache.test.ts; pnpm typecheck; pnpm lint Not-tested: Full pnpm test suite; lint has existing repository warnings only.
|
@codex review |
|
Codex Review: Didn't find any major issues. 👍 ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Summary
Testing
pnpm --filter @aoagents/ao-core test -- src/__tests__/session-manager/cache.test.ts src/__tests__/session-manager/query.test.tspnpm --filter @aoagents/ao-web test -- server/__tests__/mux-websocket.test.ts server/__tests__/server-compatibility.test.ts src/__tests__/api-routes.test.ts src/app/sessions/[id]/page.test.tsx src/hooks/__tests__/useSessionEvents.mux.test.ts src/lib/__tests__/serialize.test.tspnpm --filter @aoagents/ao-cli test -- __tests__/commands/start.test.ts __tests__/lib/project-supervisor.test.tspnpm typecheckgit diff --check