Keep workspaces durable without tmux#188
Conversation
mariusvniekerk
commented
Apr 27, 2026
- Add a middleman-owned PTY owner fallback for workspace terminals when tmux is unavailable.
- Keep tmux as the preferred backend and bridge browser terminals through the selected backend.
- Add lifecycle, terminal, server, and Windows compile coverage for the fallback path.
roborev: Combined Review (
|
266f387 to
33a1aa5
Compare
roborev: Combined Review (
|
roborev: Combined Review (
|
roborev: Combined Review (
|
roborev: Combined Review (
|
roborev: Combined Review (
|
roborev: Combined Review (
|
roborev: Combined Review (
|
roborev: Combined Review (
|
roborev: Combined Review (
|
15d4dc8 to
a1641c9
Compare
roborev: Combined Review (
|
roborev: Combined Review (
|
7749e3f to
ba8b139
Compare
roborev: Combined Review (
|
roborev: Combined Review (
|
d63015d to
c1182f9
Compare
roborev: Combined Review (
|
c1182f9 to
d1acb64
Compare
roborev: Combined Review (
|
d1acb64 to
3f77c2a
Compare
roborev: Combined Review (
|
roborev: Combined Review (
|
Runtime agent tmux sessions can outlive the server process, so they need the same durable ownership semantics as workspace tmux sessions. Tagging them at creation time lets startup cleanup reap crash-window sessions that were never persisted, while treating stored runtime cleanup failures as fatal keeps retryable stop/delete state from being erased. The tmux wrapper now passes only a small safe environment through argv because inherited process environment values can contain credentials and are visible to local process inspection. Verified with: - go test ./internal/workspace/localruntime -run 'TestManagerLaunchCommandWrapsAgentsInTmuxWhenEnabled|TestManagerLaunchCommandMarksWrappedAgentTmuxSession|TestManagerLaunchCommandUsesSanitizedEnvForWrappedAgent' -shuffle=on\n- go test ./internal/workspace -run 'TestManagerCleanupTmuxSessionKillsRuntimeSessionsForWorkspace|TestManagerCleanupTmuxSessionPreservesStoredRowsAfterRuntimeKillFailure|TestManagerReapOrphanTmuxSessionsKeepsStoredRuntimeSessions|TestManagerReapOrphanTmuxSessionsKillsUnknownManagedSessions' -shuffle=on\n- go test ./internal/server -run 'TestWorkspaceRuntimeLaunchAgentCreatesProbeableTmuxSessionE2E|TestServerStartupReapsUnrecordedRuntimeTmuxSessionE2E|TestWorkspaceRuntimeStopTmuxCleanupFailureKeepsSessionE2E' -shuffle=on\n- go test ./internal/db ./internal/workspace/localruntime ./internal/workspace ./internal/server ./internal/config -shuffle=on\n- git diff --check\n\nGenerated with Codex\nCo-authored-by: Codex <codex@openai.com>
Runtime tmux sessions are only recoverable after a restart if they are tagged before launch continues. If tmux cannot create the session with the owner marker, or cannot verify the marker afterward, fail the wrapper and clean up the just-created session instead of attaching to something startup cleanup cannot trust. Verified with: - go test ./internal/workspace/localruntime -run 'TestManagerLaunchCommandMarksWrappedAgentTmuxSession|TestManagerLaunchCommandCleansUpWhenOwnerMarkingFails|TestManagerLaunchCommandUsesSanitizedEnvForWrappedAgent' -shuffle=on\n- go test ./internal/db ./internal/workspace/localruntime ./internal/workspace ./internal/server ./internal/config -shuffle=on\n- git diff --check\n\nGenerated with Codex\nCo-authored-by: Codex <codex@openai.com>
The owner-marker failure path is the risky case for restart cleanup: a runtime tmux session may have been created, but cannot be trusted unless marker failure also cleans up and leaves the stored runtime state recoverable. Cover that path directly in the launcher test and through the HTTP API with real SQLite so future changes cannot silently turn marker failures into leaked sessions. Verified with: - go test ./internal/workspace/localruntime -run 'TestManagerLaunchCommandCleansUpWhenOwnerMarkingFails|TestManagerLaunchCommandMarksWrappedAgentTmuxSession' -shuffle=on\n- go test ./internal/server -run 'TestWorkspaceRuntimeLaunchTmuxOwnerMarkerFailureCleansSessionE2E|TestWorkspaceRuntimeLaunchAgentCreatesProbeableTmuxSessionE2E' -shuffle=on\n- go test ./internal/db ./internal/workspace/localruntime ./internal/workspace ./internal/server ./internal/config -shuffle=on\n- git diff --check\n\nGenerated with Codex\nCo-authored-by: Codex <codex@openai.com>
The riskiest marker failure is the tmux command sequence embedded in new-session, because tmux may have created the runtime session before rejecting the owner marker. Make the regression tests fail that embedded marker command and assert cleanup still kills the created runtime session, both in the launcher and through the HTTP/SQLite path. Verified with: - go test ./internal/workspace/localruntime -run 'TestManagerLaunchCommandCleansUpWhenOwnerMarkingFails|TestManagerLaunchCommandMarksWrappedAgentTmuxSession' -shuffle=on\n- go test ./internal/server -run 'TestWorkspaceRuntimeLaunchTmuxOwnerMarkerFailureCleansSessionE2E|TestWorkspaceRuntimeLaunchAgentCreatesProbeableTmuxSessionE2E' -shuffle=on\n- go test ./internal/db ./internal/workspace/localruntime ./internal/workspace ./internal/server ./internal/config -shuffle=on\n- git diff --check\n\nGenerated with Codex\nCo-authored-by: Codex <codex@openai.com>
Middleman currently requires tmux before a workspace can become ready, which blocks durable workspaces on hosts such as Windows. This plan lays out a small pty-owner process, backend selection, attach flow, cleanup, and verification path before implementation starts.
Middleman previously required tmux before a workspace could become ready, which prevented durable workspaces on hosts where tmux is unavailable. This adds a middleman-owned PTY owner process, uses it as the workspace terminal fallback, and bridges the browser terminal to that owner while preserving tmux as the preferred backend.
Persist the terminal backend that created each workspace so attach, activity snapshots, and delete keep using pty-owner even if tmux becomes available later. Also make pty-owner stop tolerate stale owner state and terminate the process group/tree during cleanup.
Leave pre-existing workspace rows unclassified during the backend migration and infer pty-owner ownership from existing owner state on first use. Also make the owner command's Unix session setup explicit before start so process-tree cleanup targets the right group.
Guard pty closure so owner stop and RunOwner return cannot close the go-pty handle concurrently under the race detector.
Add a real pty-owner shutdown regression that stops an in-process owner while RunOwner is returning, covering the race-detector failure path from CI.
Address remaining roborev blockers for the durable pty-owner backend: repair current databases missing terminal_backend, avoid leaking token env into detached helpers, wait for owner state removal on stop, and continue cleanup of stored tmux runtime sessions for pty-owner workspaces. Validation: - go test ./internal/db ./internal/ptyowner ./internal/workspace -shuffle=on - go test ./internal/server -run TestWorkspaceListTmuxActivityStressDoesNotLeakProcesses -shuffle=on - go test ./... -shuffle=on - go test ./internal/ptyowner -race -shuffle=on - go test ./internal/server -race -run TestWorkspaceCreatesPtyOwnerSessionWhenTmuxUnavailableE2E -shuffle=on - GOOS=windows GOARCH=amd64 go test -c -o /tmp/middleman-cmd-windows.test.exe ./cmd/middleman - GOOS=windows GOARCH=amd64 go test -c -o /tmp/middleman-ptyowner-windows.test.exe ./internal/ptyowner Generated with Codex Co-authored-by: Codex <codex@openai.com>
Address follow-up roborev lifecycle feedback: the durable pty-owner helper should not be tied to the request context after launch, and the e2e delete assertion now verifies workspace deletion cleans up owner state before any fallback cleanup runs. Validation: - go test ./internal/ptyowner ./internal/server -run 'TestOwnerHelperEnvironmentStripsCredentials|TestOwnerStopWhileRunOwnerReturns|TestWorkspaceCreatesPtyOwnerSessionWhenTmuxUnavailableE2E' -shuffle=on\n- go test ./internal/ptyowner -race -shuffle=on\n- go test ./internal/server -race -run TestWorkspaceCreatesPtyOwnerSessionWhenTmuxUnavailableE2E -shuffle=on\n- GOOS=windows GOARCH=amd64 go test -c -o /tmp/middleman-cmd-windows.test.exe ./cmd/middleman\n- GOOS=windows GOARCH=amd64 go test -c -o /tmp/middleman-ptyowner-windows.test.exe ./internal/ptyowner\n- go test ./... -shuffle=on\n\nGenerated with Codex\nCo-authored-by: Codex <codex@openai.com>
Address roborev feedback that owner state could be deleted on caller cancellation or transient connection errors. Only positively absent/refused owners are treated as stale, and the owner now bounds unauthenticated loopback connections with a connection cap and first-request deadline. Validation: - go test ./internal/ptyowner -shuffle=on - go test ./internal/ptyowner -race -shuffle=on - go test ./internal/server -race -run TestWorkspaceCreatesPtyOwnerSessionWhenTmuxUnavailableE2E -shuffle=on - GOOS=windows GOARCH=amd64 go test -c -o /tmp/middleman-ptyowner-windows.test.exe ./internal/ptyowner - go test ./... -shuffle=on Generated with Codex Co-authored-by: Codex <codex@openai.com>
Address roborev feedback that owner RPCs could hang after TCP connect and that the e2e path still bypassed the detached helper. Client RPCs now apply bounded connection deadlines and the no-tmux e2e exercises the real subprocess launch path through a test helper. Validation: - go test ./internal/ptyowner ./internal/server -run 'TestClientPingHonorsContextAfterConnect|TestWorkspaceCreatesPtyOwnerSessionWhenTmuxUnavailableE2E' -shuffle=on\n- go test ./internal/ptyowner -race -shuffle=on\n- go test ./internal/server -race -run TestWorkspaceCreatesPtyOwnerSessionWhenTmuxUnavailableE2E -shuffle=on\n- GOOS=windows GOARCH=amd64 go test -c -o /tmp/middleman-cmd-windows.test.exe ./cmd/middleman\n- GOOS=windows GOARCH=amd64 go test -c -o /tmp/middleman-ptyowner-windows.test.exe ./internal/ptyowner\n- go test ./... -shuffle=on\n\nGenerated with Codex\nCo-authored-by: Codex <codex@openai.com>
The pty-owner helper must survive middleman exiting on Windows just like the Unix detached process path. Use Windows process creation flags for a detached process group, and tighten the restart e2e so it reattaches only after the original server has shut down.
After rebasing the PR onto main, several older tmux-runtime test copies were replayed next to newer upstream versions. Remove the duplicate copies so the workspace and localruntime test packages build while preserving the current upstream coverage and this PR's pty-owner tests. Verified with: - go test ./internal/workspace ./internal/workspace/localruntime ./internal/server -shuffle=on Generated with Codex Co-authored-by: Codex <codex@openai.com>
Concurrent Ensure calls for the same session could all observe absent state and launch detached owners, leaving only the last state file manageable. Add a per-session atomic startup lock, recheck liveness while holding it, and cover the race with a concurrent Ensure regression test. Verified with: - go test ./internal/ptyowner -run TestClientEnsureSerializesConcurrentStarts -shuffle=on - go test ./internal/ptyowner -shuffle=on - go test ./internal/server -run TestWorkspaceCreatesPtyOwnerSessionWhenTmuxUnavailableE2E -shuffle=on Generated with Codex Co-authored-by: Codex <codex@openai.com>
Startup tmux pruning now skips workspaces that resolve to the pty-owner backend, including legacy rows with existing owner state. This prevents a restart with tmux available from marking a durable pty-owner workspace errored before it can reattach. Verified with: - go test ./internal/workspace -run TestManagerPruneMissingTmuxSessionsRemovesStaleRecords -shuffle=on - go test ./internal/server -run 'TestWorkspaceCreatesPtyOwnerSessionWhenTmuxUnavailableE2E|TestWorkspaceRuntimeSessionTerminalWebSocketBasePathE2E|TestWorkspaceRuntimeSessionTerminalSkipsAltScreenReplayE2E' -shuffle=on - go test ./... -shuffle=on
Limit the unauthenticated owner control frame before token validation and cap later attach request frames and terminal input chunks. This prevents oversized local JSON payloads from forcing unbounded allocation in the durable owner. Verified with: - go test ./internal/ptyowner -run 'TestOwnerRejectsOversizedUnauthenticatedRequest|TestOwnerAttachInputAndReplay' -shuffle=on - go test ./internal/ptyowner -shuffle=on
Renumber the terminal backend migration after main added repo overview migrations, update the merged runtime delete guard to preserve dirty-delete behavior, and align workspace tests with the runtime tmux session timestamp contract.
The rebase exposed test-only fallout: a merged require.New(t) assertion kept the old package-level signature, and the server runtime helper could trip Go deadlock detection after launch because its blocking modes used select {} in the only active goroutine. That made dirty-delete runtime coverage race the helper process exit instead of the delete path under test.
The OpenAPI spec update is the hook-generated output from the rebased Huma schema.
Validation:
- go test ./internal/db -run TestOpenAndSchema -shuffle=on
- go test ./internal/server -run TestWorkspaceDeleteDirtyKeepsRuntimeSessionsE2E -shuffle=on
- go test ./cmd/middleman ./internal/db ./internal/ptyowner ./internal/terminal ./internal/workspace ./internal/server -shuffle=on
Generated with Codex
Co-authored-by: Codex <codex@openai.com>
The rebased settings API type now requires agents on Settings, and one repo import modal mock still returned the older partial shape. That made the frontend typecheck fail in CI even though the test behavior itself was unchanged. Validation: - bunx vitest run src/lib/components/settings/RepoImportModal.test.ts - bun run typecheck - go test ./internal/db -run TestOpenAndSchema -shuffle=on - go test ./internal/server -run TestWorkspaceDeleteDirtyKeepsRuntimeSessionsE2E -shuffle=on Generated with Codex Co-authored-by: Codex <codex@openai.com>
Hold the runtime stopping marker before the non-force dirty check so a concurrent launch cannot create fresh worktree changes after the preflight passes. Existing runtime sessions are still stopped only after a clean preflight, preserving dirty-delete rejection behavior.
d7474ff to
ddabdaf
Compare
roborev: Combined Review (
|
Keep the runtime stopping guard change while removing the explanatory block and avoiding a second test agent target. The dirty-delete E2E still verifies sessions survive a 409 and launches are unblocked afterward.
roborev: Combined Review (
|
Reject a second active workspace terminal connection so pty-owner workspaces keep the same single-client invariant as the tmux attach flow. Also wait on detached pty-owner helpers from a background goroutine so exited helpers are reaped while the server is still running. Verified with: - go test ./internal/terminal -shuffle=on - go test ./internal/ptyowner -shuffle=on - go test ./internal/server -run TestWorkspaceCreatesPtyOwnerSessionWhenTmuxUnavailableE2E -shuffle=on - go test ./internal/terminal ./internal/ptyowner ./internal/server -run 'TestHandlerRejectsConcurrentWorkspaceTerminals|TestHandlerAttachesPtyOwnerTerminal|TestWorkspaceCreatesPtyOwnerSessionWhenTmuxUnavailableE2E|TestClientEnsureSerializesConcurrentStarts' -shuffle=on - git diff --check 🤖 Generated with [OpenAI Codex](https://openai.com/codex) Co-authored-by: OpenAI Codex <noreply@openai.com>
roborev: Combined Review (
|
Ensure pty-owner attachments close their output stream before signaling done, then wait for the WebSocket bridge output writer to drain before sending the terminal exit frame. Add full-stack e2e coverage for single terminal attachment behavior and final pty-owner output on shell exit. Verified with: - go test ./internal/server -run 'TestWorkspacePtyOwnerTerminalRejectsConcurrentAttachmentsE2E|TestWorkspacePtyOwnerTerminalFlushesFinalOutputOnExitE2E|TestWorkspaceCreatesPtyOwnerSessionWhenTmuxUnavailableE2E' -shuffle=on - go test ./internal/terminal ./internal/ptyowner -shuffle=on - git diff --check 🤖 Generated with [OpenAI Codex](https://openai.com/codex) Co-authored-by: OpenAI Codex <noreply@openai.com>
roborev: Combined Review (
|
Make the pty-owner terminal concurrency e2e wait for the server to observe the first WebSocket close before asserting a later attachment succeeds. The production slot release is intentionally tied to the request handler cleanup path, so the test needs to account for that asynchronous close. Verified with: - go test ./internal/server -run 'TestWorkspacePtyOwnerTerminalRejectsConcurrentAttachmentsE2E|TestWorkspacePtyOwnerTerminalFlushesFinalOutputOnExitE2E|TestWorkspaceCreatesPtyOwnerSessionWhenTmuxUnavailableE2E' -shuffle=on - go test ./internal/terminal ./internal/ptyowner -shuffle=on - git diff --check 🤖 Generated with [OpenAI Codex](https://openai.com/codex) Co-authored-by: OpenAI Codex <noreply@openai.com>
roborev: Combined Review (
|
Wait briefly for the PTY reader to drain after the owned process exits before closing subscribers and completing the owner. The attach writer now drains subscriber output to channel closure before sending the exit response, preventing the done path from racing ahead of final terminal bytes. Verified with: - go test ./internal/ptyowner -shuffle=on - go test ./internal/server -run 'TestWorkspacePtyOwnerTerminalFlushesFinalOutputOnExitE2E|TestWorkspacePtyOwnerTerminalRejectsConcurrentAttachmentsE2E' -shuffle=on - go test ./internal/terminal -shuffle=on - git diff --check 🤖 Generated with [OpenAI Codex](https://openai.com/codex) Co-authored-by: OpenAI Codex <noreply@openai.com>
roborev: Combined Review (
|