Skip to content

Keep workspaces durable without tmux#188

Open
mariusvniekerk wants to merge 28 commits intomainfrom
durable-pty-owner
Open

Keep workspaces durable without tmux#188
mariusvniekerk wants to merge 28 commits intomainfrom
durable-pty-owner

Conversation

@mariusvniekerk
Copy link
Copy Markdown
Collaborator

  • Add a middleman-owned PTY owner fallback for workspace terminals when tmux is unavailable.
  • Keep tmux as the preferred backend and bridge browser terminals through the selected backend.
  • Add lifecycle, terminal, server, and Windows compile coverage for the fallback path.

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 27, 2026

roborev: Combined Review (266f387)

Pty-owner fallback has several Medium/High issues around durability, cleanup, and test coverage.

High

  • internal/server/api_test.go:7566 - The new pty-owner fallback only has an e2e assertion that workspace creation reaches ready; it does not cover the core terminal workflow, restart reattach durability, or delete cleanup through the real HTTP API and SQLite path. Add full-stack e2e coverage for WebSocket terminal read/write via pty-owner, server recreation reattaching to the existing owner, and workspace delete stopping/removing owner state.

Medium

  • internal/server/server.go:365 - Backend choice is server-global and based only on current tmux availability, so existing pty-owner workspaces will be treated as tmux workspaces after tmux becomes available on a later restart. That can start a fresh tmux shell instead of reconnecting to the durable owner and can leave the owner process/state behind on delete. Persist or otherwise determine the terminal backend per workspace/session, and route attach/snapshot/delete to the backend that created that workspace.

  • internal/ptyowner/client.go:210 - Stop treats only missing state files as absent; a stale owner.json with a dead owner returns a dial error and causes workspace delete/retry cleanup to fail. A crashed or killed owner can therefore make the workspace hard to remove. Treat connection-refused/stale owner state as an absent owner during stop, remove the session state directory, and allow cleanup to continue.

  • internal/ptyowner/owner.go:274 - stop() kills only the immediate PTY child process, so background children spawned from the workspace terminal can survive workspace delete/retry or owner stop. For an untrusted PR workflow, that can leave long-lived processes running under the maintainer’s OS user after the UI indicates cleanup is complete. Start the PTY command in its own process group/session and terminate the whole group on Unix, matching localruntime.killSessionProcess; use a job object or equivalent process-tree cleanup on Windows.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 27, 2026

roborev: Combined Review (33a1aa5)

Medium confidence: the PR has a few medium-severity gaps around pty-owner cleanup, process lifecycle, and missing full-stack regression coverage.

Medium

  • internal/ptyowner/client.go:231
    Stop only treats a missing owner.json as absent. If the owner process crashes, is killed, or the host restarts, the state file can remain while the TCP listener is gone. In that case, connect returns a dial error, causing workspace delete/retry cleanup to fail instead of removing stale owner state.
    Fix: Treat unreachable/stale owner errors, such as connection refused, as absent during stop cleanup, remove the session state directory, and return nil.

  • internal/ptyowner/owner.go:267
    owner.stop() kills only the immediate PTY command process. A terminal command can spawn child/background processes that survive workspace delete/retry, leaving processes running under the user account with continued filesystem/network access.
    Fix: Start the PTY command in its own process group/session and kill the whole group on Unix, matching existing session cleanup behavior. Add a regression test that verifies spawned children are terminated.

  • internal/server/api_test.go:7567
    The new e2e coverage only verifies workspace creation when tmux is unavailable. It does not cover the durable terminal behavior added by this feature: terminal WebSocket read/write, reattach after server recreation, or delete stopping the owner and removing state.
    Fix: Add full-stack API/SQLite tests for pty-owner terminal attach/read/write, server restart reattach, and delete cleanup of owner/state files.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 27, 2026

roborev: Combined Review (a1f3334)

High-level verdict: changes need fixes before merge due to a schema migration bug.

High

  • internal/db/queries.go:2257 - Workspace queries now read/write terminal_backend, but the active embedded schema and DB.migrate() path do not create that column. The added internal/db/migrations/000014_* files are not wired into database initialization, so fresh or existing databases can fail with no such column: terminal_backend.

    Fix: Add terminal_backend TEXT NOT NULL DEFAULT 'tmux' to the actual workspace schema and add an idempotent ALTER TABLE middleman_workspaces ADD COLUMN ... migration in the path run by DB.init(), or wire the migrations directory into startup.

Medium

  • internal/server/api_test.go:7581 - The new e2e coverage forces PtyOwnerInProcess: true, so it does not exercise the production durable-owner path that launches the hidden middleman pty-owner subprocess via os.Executable, detaches it, writes state, and reattaches after server recreation. A regression in the actual fallback path could ship while this test still passes.

    Fix: Add a full-stack test using the default out-of-process owner path and verify create, terminal attach, server recreation reattach, and delete cleanup.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 27, 2026

roborev: Combined Review (c424dce)

High-level verdict: Changes are not ready to merge due to one database migration blocker and a credential exposure risk.

High

  • internal/db/queries.go:2254
    Queries now insert/select middleman_workspaces.terminal_backend, but the active embedded schema/startup migration path does not appear to add that column. The added internal/db/migrations/*.sql files are not wired into the shown db.init() flow, so fresh or upgraded databases can fail with no such column: terminal_backend.

    Fix: Add terminal_backend TEXT NOT NULL DEFAULT '' to the middleman_workspaces table definition and add an idempotent ALTER TABLE ... ADD COLUMN migration to the startup path used by db.init().

  • internal/ptyowner/client.go:86
    The detached middleman pty-owner helper is launched with the full inherited server environment. RunOwner strips GitHub tokens from the shell child, but the helper process itself can still retain MIDDLEMAN_GITHUB_TOKEN and related credentials. Code running in an untrusted PR workspace may be able to read the parent/helper environment on common Unix systems, for example through /proc/$PPID/environ.

    Fix: Set cmd.Env before cmd.Start() to a sanitized allowlisted environment, using the same token-stripping approach applied to the shell child.

Medium

  • internal/ptyowner/owner.go:80, internal/ptyowner/owner.go:148
    The owner protocol listens on loopback TCP and decodes an unbounded JSON request before token validation. Any local user who discovers the localhost port can connect, hold connections, or send very large JSON payloads, causing file descriptor, goroutine, or memory exhaustion without knowing the owner token.

    Fix: Prefer a Unix domain socket or Windows named pipe under the 0700 session directory, and add pre-auth connection deadlines plus request size limits before JSON decoding.

  • internal/server/api_test.go:7578
    The full-stack pty-owner E2E path uses PtyOwnerInProcess: true, which bypasses the production path that launches the hidden detached middleman pty-owner helper. This leaves executable resolution, subcommand invocation, process detach, state handoff, restart reattach, and cleanup under-tested.

    Fix: Add an E2E test using the default out-of-process ptyowner.Client path with a test executable or helper binary, covering create, terminal attach, restart reattach, and delete cleanup.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 27, 2026

roborev: Combined Review (13d7633)

High-risk issues remain around schema migration, token exposure, and PTY/tmux cleanup semantics.

High

  • internal/ptyowner/client.go:92: Client.Ensure starts the detached middleman pty-owner helper without setting cmd.Env, so it inherits server secrets such as MIDDLEMAN_GITHUB_TOKEN. Even if the shell child environment is sanitized, untrusted workspace code may be able to read the helper’s environment via /proc/$PPID/environ.

    • Fix: Set a sanitized environment before cmd.Start(), preferably using the same stripping logic as the PTY child or a minimal allowlist, and add a regression test covering the helper process environment.
  • internal/db/queries.go:2257: The code inserts/selects middleman_workspaces.terminal_backend, but the applied schema and db.init() migrations were not updated. Fresh or existing databases can fail workspace create/list/get with no such column: terminal_backend.

    • Fix: Add terminal_backend to the workspace schema for new databases and add an idempotent ALTER TABLE ... ADD COLUMN terminal_backend TEXT NOT NULL DEFAULT '' migration in DB initialization.

Medium

  • internal/workspace/manager.go:890: cleanupTmuxSession returns immediately for pty-owner workspaces after stopping the base owner, bypassing cleanup for stored tmux runtime sessions. Deleting or retrying a pty-owner workspace can leak tmux-backed agent/runtime sessions.

    • Fix: Stop the pty owner for the base session, then continue through stored tmux session cleanup for any non-base runtime sessions before deleting their DB rows.
  • internal/ptyowner/client.go:229: Client.Stop returns once the owner acknowledges the stop request, but the owner acknowledges before the child is fully stopped and state is removed. Delete can report success while PTY owner process/state still exists.

    • Fix: Make Stop wait until the owner exits or the state directory is removed, or have the owner acknowledge only after shutdown completes.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 27, 2026

roborev: Combined Review (c8eaf35)

High-level verdict: changes need fixes before merge due to one lifecycle bug and one test coverage gap.

High

  • internal/ptyowner/client.go:80
    exec.CommandContext is used to launch a detached owner process, but cmd.Wait() is intentionally never called. This can leak the context-monitoring goroutine until cancellation, and when the context is canceled it may race with cmd.Process.Release() and unexpectedly kill the durable owner process.
    Fix: use exec.Command(exe, ...) without a context for a process intended to detach and outlive the request.

Medium

  • internal/server/api_test.go:7655
    The pty-owner delete e2e test manually calls ptyowner.Client.Stop before checking that the owner state directory is gone, so the test would still pass if DeleteWorkspace failed to stop or clean up the owner.
    Fix: remove the manual Stop from the assertion path and verify state directory/process cleanup immediately after DeleteWorkspace; keep any extra stop logic only in t.Cleanup.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 27, 2026

roborev: Combined Review (636150e)

Medium-risk issues found: durable PTY owner state can be incorrectly deleted, and the owner is vulnerable to local unauthenticated connection DoS.

Medium

  • internal/ptyowner/client.go:44: Ensure deletes the session directory and starts a replacement owner after any Ping error, including context cancellation, deadline expiry, or transient dial/read failures against a still-running owner. This can orphan the original durable PTY, overwrite owner.json, and lose cleanup access to the user's running shell.

    Fix: Only remove/recreate state for positively stale or absent owners. Return caller/context/protocol errors without deleting live state.

  • internal/ptyowner/client.go:232: Stop treats every *net.OpError as stale owner state, so a timeout or canceled dial can remove the state directory and report successful cleanup while the owner process is still alive.

    Fix: Classify only explicit connection-refused or absent-owner cases as stale. Preserve state or return the error for cancellation, timeout, and temporary network failures.

  • internal/ptyowner/owner.go:73, internal/ptyowner/owner.go:121, internal/ptyowner/owner.go:148: The PTY owner listens on a random loopback TCP port and spawns a goroutine per accepted connection. handleConn blocks indefinitely on json.Decoder.Decode before token validation or a read deadline, allowing a local process to open many idle connections and exhaust file descriptors/goroutines.

    Fix: Prefer a Unix domain socket or Windows named pipe under the existing 0700 session directory. Also add connection limits and short read deadlines before the first authenticated request. If TCP remains necessary, close unauthenticated idle connections promptly and cap concurrent unauthenticated handlers.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 27, 2026

roborev: Combined Review (e528d93)

Medium-risk issues found around PTY owner RPC timeout handling and missing coverage for the detached helper path.

Medium

  • internal/ptyowner/client.go:121
    PTY owner client RPCs dial with the request context, but the subsequent JSON Encode/Decode calls have no read/write deadline. If a stale owner.json points at a reused localhost port that accepts the TCP connection but never speaks the owner protocol, Ping, Ensure, Stop, Snapshot, or Attach can hang indefinitely and block workspace setup/delete/terminal flows.
    Fix: Set per-connection deadlines or close the connection when ctx.Done() fires around every owner RPC, and treat protocol timeout as stale/failed state where appropriate.

  • internal/server/api_test.go:7569
    The full-stack pty-owner E2E test uses PtyOwnerInProcess: true, so it bypasses the production path that launches the hidden middleman pty-owner helper, detaches it, passes flags/env, writes state, and reconnects after server recreation. That leaves the main durability path for no-tmux workspaces uncovered.
    Fix: Add a full-stack test for the default detached helper path, using the test binary/helper invocation pattern, and assert create, terminal attach, restart reattach, and delete cleanup through the real subprocess path.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 27, 2026

roborev: Combined Review (c267b25)

High-level verdict: changes look mostly sound, but the durable pty-owner path has one High-risk Windows lifecycle bug and one Medium test coverage gap.

High

  • internal/ptyowner/detach_windows.go:7
    The Windows helper is not detached from the parent console/process group. A normal Ctrl+C or console shutdown of middleman can terminate the pty-owner helper as well, which breaks the durable workspace guarantee on the main no-tmux platform.
    Fix: Set Windows process creation flags such as CREATE_NEW_PROCESS_GROUP and DETACHED_PROCESS in detachCommand.

Medium

  • internal/server/api_test.go:7630
    The pty-owner restart e2e path creates a second Server while the original fixture server is still alive, so it does not actually verify that an owner survives middleman shutdown and can be reattached after recreation.
    Fix: Shut down the first server before constructing the replacement server, or add a separate e2e test that explicitly exercises shutdown then reattach.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 27, 2026

roborev: Combined Review (15d4dc8)

No Medium, High, or Critical issues found.

All review agents reported the changes as clean.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Base automatically changed from restore-workspace-agent-activity to main April 27, 2026 18:09
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 28, 2026

roborev: Combined Review (a1641c9)

Medium risk: PTY owner startup is not serialized and can orphan duplicate helpers under concurrent session creation.

Medium

  • internal/ptyowner/client.go:43 - Ensure can race for the same session. Two callers can both observe no live owner, remove the state directory, start detached helpers, and whichever helper writes state last becomes the only manageable one, leaving the other PTY owner orphaned.

    Fix: Add per-session startup serialization, preferably a file lock or atomic owner-state claim, then recheck liveness after acquiring it before launching a helper.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 28, 2026

roborev: Combined Review (7749e3f)

No Medium, High, or Critical findings were reported.

All review agents found no actionable issues to include.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 28, 2026

roborev: Combined Review (ba8b139)

The PR has one medium-severity issue; no high or critical findings were reported.

Medium

  • internal/ptyowner/owner.go:151 - Unauthenticated local DoS via unbounded JSON decode

    The PTY owner accepts loopback TCP connections and decodes the first request into Request before authenticating the token. Since Request includes Data []byte, a local process can connect and send an oversized JSON/base64 data field without knowing the token, causing memory allocation before authentication and potentially exhausting memory or killing the durable PTY owner session.

    Suggested fix: apply a strict decode size limit before authentication and decode the initial frame into a small auth/control-only struct that excludes Data. Also add per-frame limits for later control/input messages, such as an 8-64 KiB cap for JSON control frames and bounded terminal input chunks.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 28, 2026

roborev: Combined Review (d63015d)

No Medium, High, or Critical findings were reported.

All reviewers found the changes clean; the security review identified no exploitable issues.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 29, 2026

roborev: Combined Review (c1182f9)

High-risk delete race remains.

High

  • internal/server/huma_routes.go:3237 - Moving runtime.BeginStopping into the delete cleanup hook leaves a race between the dirty preflight check and the stopping marker. A runtime launch can start after a clean dirty check, write into the worktree, and still be deleted by a non-force workspace delete.

    Fix: Hold the runtime stopping marker before entering workspaces.Delete, and only call StopWorkspace from the post-preflight cleanup hook.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 29, 2026

roborev: Combined Review (d1acb64)

No Medium, High, or Critical issues found.

All review agents reported the code as clean.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 29, 2026

roborev: Combined Review (3f77c2a)

Summary verdict: changes need fixes before merge due to one high-severity delete race and two medium-severity pty-owner issues.

High

  • internal/server/huma_routes.go:3240
    Moving runtime.BeginStopping into the post-preflight delete hook reopens a race: a runtime launch can start after the clean-worktree dirty check passes but before the stopping marker is set. That can create new uncommitted changes which are then deleted without triggering the dirty-worktree rejection.
    Fix: Hold the stopping marker before the dirty preflight as before, or re-run the dirty check after acquiring the marker and stopping runtime sessions before destructive cleanup.

Medium

  • internal/terminal/handler.go:113
    The pty-owner terminal branch returns before the existing claimTerminalSlot path, allowing multiple browser connections to attach to the same base workspace terminal concurrently. That regresses the single-active-terminal invariant and can interleave input into one shell.
    Fix: Claim and release the terminal slot around the pty-owner attach/bridge path as well, and add concurrency coverage for pty-owner terminals.

  • internal/ptyowner/paths.go:63, internal/ptyowner/client.go:409
    Pty-owner state files can be hijacked if the state directory already exists with unsafe permissions. writeState and acquireStartLock create directories with 0700, but they do not verify ownership, symlink status, or permissions when the root/session directory already exists. Client.connect then trusts owner.json and dials the recorded address, sending the stored token and terminal input.
    Fix: Reject symlinks, verify ownership is the current user, reject group/other-writable modes, only chmod directories already owned by the current user, verify owner.json ownership/mode before reading, and consider storing state under the config data dir rather than deriving it from the worktree parent.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 29, 2026

roborev: Combined Review (d7474ff)

PR has one medium-severity issue that should be fixed before merge.

Medium

  • internal/server/huma_routes.go:3240 - runtime.BeginStopping(input.ID) is acquired only in the post-preflight delete hook, after the dirty-worktree check has already passed. A local caller can race a runtime launch between the clean dirty check and this hook; that process can create new uncommitted worktree changes, and delete can continue without rechecking, defeating the non-force dirty-check protection.

    Remediation: acquire the runtime stopping marker before the dirty preflight and hold it through DB removal, or acquire it before destructive cleanup and repeat the dirty check while launches are blocked.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Runtime agent tmux sessions can outlive the server process, so they need the same durable ownership semantics as workspace tmux sessions. Tagging them at creation time lets startup cleanup reap crash-window sessions that were never persisted, while treating stored runtime cleanup failures as fatal keeps retryable stop/delete state from being erased.

The tmux wrapper now passes only a small safe environment through argv because inherited process environment values can contain credentials and are visible to local process inspection.

Verified with:
- go test ./internal/workspace/localruntime -run 'TestManagerLaunchCommandWrapsAgentsInTmuxWhenEnabled|TestManagerLaunchCommandMarksWrappedAgentTmuxSession|TestManagerLaunchCommandUsesSanitizedEnvForWrappedAgent' -shuffle=on\n- go test ./internal/workspace -run 'TestManagerCleanupTmuxSessionKillsRuntimeSessionsForWorkspace|TestManagerCleanupTmuxSessionPreservesStoredRowsAfterRuntimeKillFailure|TestManagerReapOrphanTmuxSessionsKeepsStoredRuntimeSessions|TestManagerReapOrphanTmuxSessionsKillsUnknownManagedSessions' -shuffle=on\n- go test ./internal/server -run 'TestWorkspaceRuntimeLaunchAgentCreatesProbeableTmuxSessionE2E|TestServerStartupReapsUnrecordedRuntimeTmuxSessionE2E|TestWorkspaceRuntimeStopTmuxCleanupFailureKeepsSessionE2E' -shuffle=on\n- go test ./internal/db ./internal/workspace/localruntime ./internal/workspace ./internal/server ./internal/config -shuffle=on\n- git diff --check\n\nGenerated with Codex\nCo-authored-by: Codex <codex@openai.com>
Runtime tmux sessions are only recoverable after a restart if they are tagged before launch continues. If tmux cannot create the session with the owner marker, or cannot verify the marker afterward, fail the wrapper and clean up the just-created session instead of attaching to something startup cleanup cannot trust.

Verified with:
- go test ./internal/workspace/localruntime -run 'TestManagerLaunchCommandMarksWrappedAgentTmuxSession|TestManagerLaunchCommandCleansUpWhenOwnerMarkingFails|TestManagerLaunchCommandUsesSanitizedEnvForWrappedAgent' -shuffle=on\n- go test ./internal/db ./internal/workspace/localruntime ./internal/workspace ./internal/server ./internal/config -shuffle=on\n- git diff --check\n\nGenerated with Codex\nCo-authored-by: Codex <codex@openai.com>
The owner-marker failure path is the risky case for restart cleanup: a runtime tmux session may have been created, but cannot be trusted unless marker failure also cleans up and leaves the stored runtime state recoverable. Cover that path directly in the launcher test and through the HTTP API with real SQLite so future changes cannot silently turn marker failures into leaked sessions.

Verified with:
- go test ./internal/workspace/localruntime -run 'TestManagerLaunchCommandCleansUpWhenOwnerMarkingFails|TestManagerLaunchCommandMarksWrappedAgentTmuxSession' -shuffle=on\n- go test ./internal/server -run 'TestWorkspaceRuntimeLaunchTmuxOwnerMarkerFailureCleansSessionE2E|TestWorkspaceRuntimeLaunchAgentCreatesProbeableTmuxSessionE2E' -shuffle=on\n- go test ./internal/db ./internal/workspace/localruntime ./internal/workspace ./internal/server ./internal/config -shuffle=on\n- git diff --check\n\nGenerated with Codex\nCo-authored-by: Codex <codex@openai.com>
The riskiest marker failure is the tmux command sequence embedded in new-session, because tmux may have created the runtime session before rejecting the owner marker. Make the regression tests fail that embedded marker command and assert cleanup still kills the created runtime session, both in the launcher and through the HTTP/SQLite path.

Verified with:
- go test ./internal/workspace/localruntime -run 'TestManagerLaunchCommandCleansUpWhenOwnerMarkingFails|TestManagerLaunchCommandMarksWrappedAgentTmuxSession' -shuffle=on\n- go test ./internal/server -run 'TestWorkspaceRuntimeLaunchTmuxOwnerMarkerFailureCleansSessionE2E|TestWorkspaceRuntimeLaunchAgentCreatesProbeableTmuxSessionE2E' -shuffle=on\n- go test ./internal/db ./internal/workspace/localruntime ./internal/workspace ./internal/server ./internal/config -shuffle=on\n- git diff --check\n\nGenerated with Codex\nCo-authored-by: Codex <codex@openai.com>
Middleman currently requires tmux before a workspace can become ready, which blocks durable workspaces on hosts such as Windows. This plan lays out a small pty-owner process, backend selection, attach flow, cleanup, and verification path before implementation starts.
mariusvniekerk and others added 18 commits April 29, 2026 17:14
Middleman previously required tmux before a workspace could become ready, which prevented durable workspaces on hosts where tmux is unavailable. This adds a middleman-owned PTY owner process, uses it as the workspace terminal fallback, and bridges the browser terminal to that owner while preserving tmux as the preferred backend.
Persist the terminal backend that created each workspace so attach, activity snapshots, and delete keep using pty-owner even if tmux becomes available later. Also make pty-owner stop tolerate stale owner state and terminate the process group/tree during cleanup.
Leave pre-existing workspace rows unclassified during the backend migration and infer pty-owner ownership from existing owner state on first use. Also make the owner command's Unix session setup explicit before start so process-tree cleanup targets the right group.
Guard pty closure so owner stop and RunOwner return cannot close the go-pty handle concurrently under the race detector.
Add a real pty-owner shutdown regression that stops an in-process owner while RunOwner is returning, covering the race-detector failure path from CI.
Address remaining roborev blockers for the durable pty-owner backend: repair current databases missing terminal_backend, avoid leaking token env into detached helpers, wait for owner state removal on stop, and continue cleanup of stored tmux runtime sessions for pty-owner workspaces.

Validation:
- go test ./internal/db ./internal/ptyowner ./internal/workspace -shuffle=on
- go test ./internal/server -run TestWorkspaceListTmuxActivityStressDoesNotLeakProcesses -shuffle=on
- go test ./... -shuffle=on
- go test ./internal/ptyowner -race -shuffle=on
- go test ./internal/server -race -run TestWorkspaceCreatesPtyOwnerSessionWhenTmuxUnavailableE2E -shuffle=on
- GOOS=windows GOARCH=amd64 go test -c -o /tmp/middleman-cmd-windows.test.exe ./cmd/middleman
- GOOS=windows GOARCH=amd64 go test -c -o /tmp/middleman-ptyowner-windows.test.exe ./internal/ptyowner

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
Address follow-up roborev lifecycle feedback: the durable pty-owner helper should not be tied to the request context after launch, and the e2e delete assertion now verifies workspace deletion cleans up owner state before any fallback cleanup runs.

Validation:
- go test ./internal/ptyowner ./internal/server -run 'TestOwnerHelperEnvironmentStripsCredentials|TestOwnerStopWhileRunOwnerReturns|TestWorkspaceCreatesPtyOwnerSessionWhenTmuxUnavailableE2E' -shuffle=on\n- go test ./internal/ptyowner -race -shuffle=on\n- go test ./internal/server -race -run TestWorkspaceCreatesPtyOwnerSessionWhenTmuxUnavailableE2E -shuffle=on\n- GOOS=windows GOARCH=amd64 go test -c -o /tmp/middleman-cmd-windows.test.exe ./cmd/middleman\n- GOOS=windows GOARCH=amd64 go test -c -o /tmp/middleman-ptyowner-windows.test.exe ./internal/ptyowner\n- go test ./... -shuffle=on\n\nGenerated with Codex\nCo-authored-by: Codex <codex@openai.com>
Address roborev feedback that owner state could be deleted on caller cancellation or transient connection errors. Only positively absent/refused owners are treated as stale, and the owner now bounds unauthenticated loopback connections with a connection cap and first-request deadline.

Validation:
- go test ./internal/ptyowner -shuffle=on
- go test ./internal/ptyowner -race -shuffle=on
- go test ./internal/server -race -run TestWorkspaceCreatesPtyOwnerSessionWhenTmuxUnavailableE2E -shuffle=on
- GOOS=windows GOARCH=amd64 go test -c -o /tmp/middleman-ptyowner-windows.test.exe ./internal/ptyowner
- go test ./... -shuffle=on

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
Address roborev feedback that owner RPCs could hang after TCP connect and that the e2e path still bypassed the detached helper. Client RPCs now apply bounded connection deadlines and the no-tmux e2e exercises the real subprocess launch path through a test helper.

Validation:
- go test ./internal/ptyowner ./internal/server -run 'TestClientPingHonorsContextAfterConnect|TestWorkspaceCreatesPtyOwnerSessionWhenTmuxUnavailableE2E' -shuffle=on\n- go test ./internal/ptyowner -race -shuffle=on\n- go test ./internal/server -race -run TestWorkspaceCreatesPtyOwnerSessionWhenTmuxUnavailableE2E -shuffle=on\n- GOOS=windows GOARCH=amd64 go test -c -o /tmp/middleman-cmd-windows.test.exe ./cmd/middleman\n- GOOS=windows GOARCH=amd64 go test -c -o /tmp/middleman-ptyowner-windows.test.exe ./internal/ptyowner\n- go test ./... -shuffle=on\n\nGenerated with Codex\nCo-authored-by: Codex <codex@openai.com>
The pty-owner helper must survive middleman exiting on Windows just like the Unix detached process path. Use Windows process creation flags for a detached process group, and tighten the restart e2e so it reattaches only after the original server has shut down.
After rebasing the PR onto main, several older tmux-runtime test copies were replayed next to newer upstream versions. Remove the duplicate copies so the workspace and localruntime test packages build while preserving the current upstream coverage and this PR's pty-owner tests.

Verified with:
- go test ./internal/workspace ./internal/workspace/localruntime ./internal/server -shuffle=on

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
Concurrent Ensure calls for the same session could all observe absent state and launch detached owners, leaving only the last state file manageable. Add a per-session atomic startup lock, recheck liveness while holding it, and cover the race with a concurrent Ensure regression test.

Verified with:
- go test ./internal/ptyowner -run TestClientEnsureSerializesConcurrentStarts -shuffle=on
- go test ./internal/ptyowner -shuffle=on
- go test ./internal/server -run TestWorkspaceCreatesPtyOwnerSessionWhenTmuxUnavailableE2E -shuffle=on

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
Startup tmux pruning now skips workspaces that resolve to the pty-owner backend, including legacy rows with existing owner state. This prevents a restart with tmux available from marking a durable pty-owner workspace errored before it can reattach.

Verified with:

- go test ./internal/workspace -run TestManagerPruneMissingTmuxSessionsRemovesStaleRecords -shuffle=on

- go test ./internal/server -run 'TestWorkspaceCreatesPtyOwnerSessionWhenTmuxUnavailableE2E|TestWorkspaceRuntimeSessionTerminalWebSocketBasePathE2E|TestWorkspaceRuntimeSessionTerminalSkipsAltScreenReplayE2E' -shuffle=on

- go test ./... -shuffle=on
Limit the unauthenticated owner control frame before token validation and cap later attach request frames and terminal input chunks. This prevents oversized local JSON payloads from forcing unbounded allocation in the durable owner.

Verified with:

- go test ./internal/ptyowner -run 'TestOwnerRejectsOversizedUnauthenticatedRequest|TestOwnerAttachInputAndReplay' -shuffle=on

- go test ./internal/ptyowner -shuffle=on
Renumber the terminal backend migration after main added repo overview migrations, update the merged runtime delete guard to preserve dirty-delete behavior, and align workspace tests with the runtime tmux session timestamp contract.
The rebase exposed test-only fallout: a merged require.New(t) assertion kept the old package-level signature, and the server runtime helper could trip Go deadlock detection after launch because its blocking modes used select {} in the only active goroutine. That made dirty-delete runtime coverage race the helper process exit instead of the delete path under test.

The OpenAPI spec update is the hook-generated output from the rebased Huma schema.

Validation:
- go test ./internal/db -run TestOpenAndSchema -shuffle=on
- go test ./internal/server -run TestWorkspaceDeleteDirtyKeepsRuntimeSessionsE2E -shuffle=on
- go test ./cmd/middleman ./internal/db ./internal/ptyowner ./internal/terminal ./internal/workspace ./internal/server -shuffle=on

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
The rebased settings API type now requires agents on Settings, and one repo import modal mock still returned the older partial shape. That made the frontend typecheck fail in CI even though the test behavior itself was unchanged.

Validation:
- bunx vitest run src/lib/components/settings/RepoImportModal.test.ts
- bun run typecheck
- go test ./internal/db -run TestOpenAndSchema -shuffle=on
- go test ./internal/server -run TestWorkspaceDeleteDirtyKeepsRuntimeSessionsE2E -shuffle=on

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
Hold the runtime stopping marker before the non-force dirty check so a concurrent launch cannot create fresh worktree changes after the preflight passes. Existing runtime sessions are still stopped only after a clean preflight, preserving dirty-delete rejection behavior.
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 29, 2026

roborev: Combined Review (ddabdaf)

No Medium, High, or Critical issues found.

All review agents reported the changes as clean.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Keep the runtime stopping guard change while removing the explanatory block and avoiding a second test agent target. The dirty-delete E2E still verifies sessions survive a 409 and launches are unblocked afterward.
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 29, 2026

roborev: Combined Review (6fc3204)

Summary verdict: One medium-severity regression should be addressed before merge.

Medium

  • internal/terminal/handler.go:113
    The pty-owner terminal path returns before the existing claimTerminalSlot flow, allowing multiple WebSocket clients to attach to the same workspace terminal concurrently. This regresses the single-active-terminal invariant and can interleave input or resize events against one owner session.
    Fix: Claim and defer-release the terminal slot before branching on tmux vs pty-owner, or add equivalent slot handling inside the pty-owner branch.

  • internal/ptyowner/client.go:118
    _ = cmd.Process.Release() detaches the Go process handle but does not reap the child if the pty-owner process exits while the parent server is still running, leaving a zombie process.
    Fix: Replace it with go cmd.Wait() so the child process is reaped when it exits.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Reject a second active workspace terminal connection so pty-owner workspaces keep the same single-client invariant as the tmux attach flow. Also wait on detached pty-owner helpers from a background goroutine so exited helpers are reaped while the server is still running.

Verified with:
- go test ./internal/terminal -shuffle=on
- go test ./internal/ptyowner -shuffle=on
- go test ./internal/server -run TestWorkspaceCreatesPtyOwnerSessionWhenTmuxUnavailableE2E -shuffle=on
- go test ./internal/terminal ./internal/ptyowner ./internal/server -run 'TestHandlerRejectsConcurrentWorkspaceTerminals|TestHandlerAttachesPtyOwnerTerminal|TestWorkspaceCreatesPtyOwnerSessionWhenTmuxUnavailableE2E|TestClientEnsureSerializesConcurrentStarts' -shuffle=on
- git diff --check

🤖 Generated with [OpenAI Codex](https://openai.com/codex)
Co-authored-by: OpenAI Codex <noreply@openai.com>
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 30, 2026

roborev: Combined Review (e538ae3)

Medium issues remain: final terminal output may be dropped on shell exit, and the new WebSocket attachment behavior lacks e2e coverage.

Medium

  • internal/ptyowner/client.go:276
    The attachment reader closes Done before Output, while bridgePtyOwnerAttachment returns as soon as Done is ready. Output already queued before the exit frame can be skipped, which can lose final terminal output when the owner shell exits.
    Fix: Close or drain Output before signaling Done, or have the terminal bridge drain attachment.Output until closed before writing the exit message and closing the WebSocket.

  • internal/terminal/handler.go:390
    The new single-attachment behavior is only covered at the slot helper/unit level. This is a user-visible WebSocket workflow change, and the repository requires e2e coverage for maintainer workflow regressions.
    Fix: Add a full-stack WebSocket e2e test that opens one workspace terminal, verifies a second connection is rejected, then verifies a new connection succeeds after the first disconnects.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Ensure pty-owner attachments close their output stream before signaling done, then wait for the WebSocket bridge output writer to drain before sending the terminal exit frame. Add full-stack e2e coverage for single terminal attachment behavior and final pty-owner output on shell exit.

Verified with:
- go test ./internal/server -run 'TestWorkspacePtyOwnerTerminalRejectsConcurrentAttachmentsE2E|TestWorkspacePtyOwnerTerminalFlushesFinalOutputOnExitE2E|TestWorkspaceCreatesPtyOwnerSessionWhenTmuxUnavailableE2E' -shuffle=on
- go test ./internal/terminal ./internal/ptyowner -shuffle=on
- git diff --check

🤖 Generated with [OpenAI Codex](https://openai.com/codex)
Co-authored-by: OpenAI Codex <noreply@openai.com>
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 30, 2026

roborev: Combined Review (a66ab50)

Medium-risk issue found: the PTY attach path can lose final buffered output on process exit.

Medium

  • internal/ptyowner/owner.go:203 - The attach writer may drop buffered final output when the PTY exits because select can choose the <-o.done branch before draining already queued output chunks. Drain the subscriber output channel to closure before sending the exit frame, or otherwise prioritize pending output over the done case.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Make the pty-owner terminal concurrency e2e wait for the server to observe the first WebSocket close before asserting a later attachment succeeds. The production slot release is intentionally tied to the request handler cleanup path, so the test needs to account for that asynchronous close.

Verified with:
- go test ./internal/server -run 'TestWorkspacePtyOwnerTerminalRejectsConcurrentAttachmentsE2E|TestWorkspacePtyOwnerTerminalFlushesFinalOutputOnExitE2E|TestWorkspaceCreatesPtyOwnerSessionWhenTmuxUnavailableE2E' -shuffle=on
- go test ./internal/terminal ./internal/ptyowner -shuffle=on
- git diff --check

🤖 Generated with [OpenAI Codex](https://openai.com/codex)
Co-authored-by: OpenAI Codex <noreply@openai.com>
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 30, 2026

roborev: Combined Review (86e3264)

Verdict: One medium-severity race can drop final PTY output/exit frames; no high or critical findings reported.

Medium

  • internal/ptyowner/owner.go:329
    • wait() / RunOwner can close subscriber channels or exit the helper as soon as the child process exits, while drainOutput() and active connection handlers may still be flushing buffered PTY output. This can drop final terminal bytes and/or the ResponseExit control frame for attached clients.
    • Fix: synchronize process exit with PTY drain completion. Wait for drainOutput() to reach EOF/error before closing subscribers or returning from RunOwner, optionally with a short timeout to avoid hanging on detached background children. Also give active handleConn routines a brief grace period to flush pending writes.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Wait briefly for the PTY reader to drain after the owned process exits before closing subscribers and completing the owner. The attach writer now drains subscriber output to channel closure before sending the exit response, preventing the done path from racing ahead of final terminal bytes.

Verified with:
- go test ./internal/ptyowner -shuffle=on
- go test ./internal/server -run 'TestWorkspacePtyOwnerTerminalFlushesFinalOutputOnExitE2E|TestWorkspacePtyOwnerTerminalRejectsConcurrentAttachmentsE2E' -shuffle=on
- go test ./internal/terminal -shuffle=on
- git diff --check

🤖 Generated with [OpenAI Codex](https://openai.com/codex)
Co-authored-by: OpenAI Codex <noreply@openai.com>
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 30, 2026

roborev: Combined Review (e9ebef6)

No medium-or-higher severity findings were reported.

All agents found the reviewed changes clean.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant