feat: Sessions - bidirectional durable agent streams by ericallam · Pull Request #3417 · triggerdotdev/trigger.dev

ericallam · 2026-04-20T14:18:13Z

⚠️ Not released yet. This PR is the server-side foundation only. The SDK changes that customers will actually use (chat.agent migration, chat.createStartSessionAction, useTriggerChatTransport updates) live on a separate branch and ship together in an upcoming @trigger.dev/sdk prerelease. Until that prerelease is published, this surface is reachable only via direct HTTP.

What this gives Trigger.dev users

A new first-class primitive, Session, for durable, task-bound, bidirectional I/O that outlives any single run. Sessions are the run manager for chat.agent going forward, and they unblock anything else that needs "one identifier, many runs over time" with a stable channel pair the client can write to and subscribe to.

Use cases unblocked

Chat agents that persist across many runs. One session per chat (keyed on your own chatId via externalId), turns 1..N attach to the same Session, the UI subscribes once and keeps receiving output as new runs take over.
Approval loops and long-running tasks with user feedback. The task waits on .in, the client writes to .in, the server enforces no-writes-after-close.
Workflow progress streams that live past the run. Subscribe to .out after the task finishes to replay history.
Resume-next-day flows. A session is a durable row, not a transient stream. Send a message a day later and the server triggers a fresh run on the same session.

How it works (Session-as-run-manager)

A Session row is task-bound (taskIdentifier + triggerConfig are required) and owns its current run via currentRunId + currentRunVersion for optimistic claim. Three trigger paths:

Session create — POST /api/v1/sessions creates the row and triggers the first run synchronously.
Append-time probe — POST /realtime/v1/sessions/:session/in/append checks if the current run is alive; if it has terminated (idle exit, crash, etc.), the server triggers a new run before processing the append.
End-and-continue handoff — POST /api/v1/sessions/:session/end-and-continue, called by the running agent, triggers a fresh run and atomically swaps currentRunId. Used by chat.requestUpgrade() for version handoffs.

Every triggered run is recorded in the SessionRun audit table with a reason (initial, continuation, upgrade, manual).

Public API surface

Control plane

POST /api/v1/sessions — create. Idempotent on (env, externalId). Triggers the first run, returns the session and a session-scoped public access token. Returns 409 if the upserted row is already closed.
GET /api/v1/sessions/:session — retrieve by friendlyId (session_abc...) or by your own externalId (server disambiguates by prefix).
GET /api/v1/sessions — list with filters (type, tag, taskIdentifier, externalId, derived status ACTIVE/CLOSED/EXPIRED, created-at range) and cursor pagination. Backed by ClickHouse.
PATCH /api/v1/sessions/:session — update tags / metadata / externalId.
POST /api/v1/sessions/:session/close — terminate. Idempotent, hard-blocks new server-brokered writes.
POST /api/v1/sessions/:session/end-and-continue — agent-only handoff to a fresh run.

Realtime

PUT /realtime/v1/sessions/:session/:io — initialize a channel. Returns S2 credentials in headers so high-throughput clients can write direct to S2.
GET /realtime/v1/sessions/:session/:io — SSE subscribe. Supports Last-Event-ID resume and an opt-in X-Peek-Settled: 1 header that fast-closes the stream when the upstream is already settled (trigger:turn-complete), eliminating long-poll wait on reconnect-on-reload paths.
POST /realtime/v1/sessions/:session/:io/append — server-side appends.
POST /api/v1/runs/:runFriendlyId/session-streams/wait — runs wait on a session stream as a waitpoint, with a race-check to avoid suspending if data already landed.

Auth scopes

sessions is a new resource type. read:sessions:{id}, write:sessions:{id}, admin:sessions:{id} flow through the existing JWT validator. Session-scoped public access tokens minted by the server replace browser-held trigger-task tokens for chat-style flows — the browser never sees a run identifier or a run-scoped token in steady state.

What's coming after this PR

SDK + chat.agent migration: separate branch, separate PR, ships in the next @trigger.dev/sdk prerelease alongside this server deploy. Customers using the prerelease chat.agent will follow the upgrade guide.
Dashboard surfaces: dedicated agent list, agent playground, agent view on the run dashboard. Tracking separately.

Implementation notes

Postgres Session table: scalar scoping columns (projectId, runtimeEnvironmentId, environmentType, organizationId) without FKs, matching the January TaskRun FK-removal decision. Point-lookup indexes only — list queries go to ClickHouse. Terminal markers (closedAt, expiresAt) are write-once.
ClickHouse sessions_v1: ReplacingMergeTree, partitioned by month, ordered by (org_id, project_id, environment_id, created_at, session_id). Tags indexed via tokenbf_v1 skip index.
SessionsReplicationService: mirrors RunsReplicationService exactly — leader-locked logical replication consumer, ConcurrentFlushScheduler, retry with exponential backoff + jitter, identical metric shape. Dedicated slot + publication so the two consume independently.
S2 keys: sessions/{addressingKey}/{out|in}. The existing runs/{runId}/{streamId} key format for run-scoped streams is untouched.
Optimistic claim: ensureRunForSession triggers a run upfront (cheap to cancel if it loses the race), then attempts an updateMany keyed on currentRunVersion. Loser cancels its triggered run and reuses the winner's. No DB lock held across the trigger.

What did NOT change

Run-scoped streams.pipe / streams.input and the existing /realtime/v1/streams/{runId}/... routes are unchanged. Sessions are net-new — not a reshaping of the current streams API.

Deploy notes

Set SESSION_REPLICATION_CLICKHOUSE_URL and SESSION_REPLICATION_ENABLED=1 to enable the replication consumer.
The Session table needs REPLICA IDENTITY FULL set on the prod source DB before the publication is created (same one-time DDL we did for TaskRun). Required for delete events to carry full column values.
Cross-form authorization on the GET /api/v1/sessions/:session loader (a JWT minted for either form authorizes both URL forms). Action routes are URL-form-specific, matching how the SDK mints PATs.

Verification

Webapp typecheck clean (10/10).
apps/webapp/test/sessionsReplicationService.test.ts — round-trip tests for insert/update/delete through Postgres logical replication into ClickHouse via testcontainers.
Live end-to-end against local dev: create + retrieve (both forms) + update + close, .out.initialize + .out.append x2 + .in.send + .out.subscribe over SSE, list with all filter combinations + pagination, end-and-continue swap, X-Peek-Settled fast-close (verified in browser via reconnect-on-reload and via curl). Replicated row lands in ClickHouse within ~1s.
Multi-round Devin + CodeRabbit review feedback addressed (read-after-write paths use prisma writer, info-leak on auth-routes masked as 403, peek-settled discriminator parsing fix, etc.).

Test plan

pnpm run typecheck --filter webapp
pnpm run test --filter webapp ./test/sessionsReplicationService.test.ts --run
Start the webapp with SESSION_REPLICATION_CLICKHOUSE_URL and SESSION_REPLICATION_ENABLED=1. Confirm the slot and publication auto-create on boot.
POST /api/v1/sessions and verify the row replicates to trigger_dev.sessions_v1 within a couple of seconds.
POST /api/v1/sessions/:id/close, then confirm POST /realtime/v1/sessions/:id/out/append returns 400.
Reuse a closed session's externalId on POST /api/v1/sessions and confirm 409.
GET /realtime/v1/sessions/:id/out with X-Peek-Settled: 1 after a turn completes and confirm X-Session-Settled: true response header + immediate close.

changeset-bot · 2026-04-20T14:18:22Z

🦋 Changeset detected

Latest commit: 188fa43

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 29 packages

Name	Type
@trigger.dev/core	Patch
@trigger.dev/build	Patch
trigger.dev	Patch
@trigger.dev/python	Patch
@trigger.dev/redis-worker	Patch
@trigger.dev/schema-to-json	Patch
@trigger.dev/sdk	Patch
@internal/cache	Patch
@internal/clickhouse	Patch
@internal/llm-model-catalog	Patch
@internal/redis	Patch
@internal/replication	Patch
@internal/run-engine	Patch
@internal/schedule-engine	Patch
@internal/testcontainers	Patch
@internal/tracing	Patch
@internal/tsql	Patch
@internal/zod-worker	Patch
d3-chat	Patch
references-d3-openai-agents	Patch
references-nextjs-realtime	Patch
references-realtime-hooks-test	Patch
references-realtime-streams	Patch
references-telemetry	Patch
@internal/sdk-compat-tests	Patch
@trigger.dev/react-hooks	Patch
@trigger.dev/rsc	Patch
@trigger.dev/database	Patch
@trigger.dev/otlp-importer	Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

coderabbitai · 2026-04-20T14:18:33Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Walkthrough

Introduces a durable Session primitive end-to-end: a new Prisma Session model and migration, a ClickHouse sessions_v1 table and query/insert helpers, ClickHouse-backed SessionsRepository, a SessionsReplicationService that streams Postgres logical replication into ClickHouse (with retry/ack/flush/leader-lock logic), session-friendly ID export (SessionId) and API Zod schemas, multiple REST and realtime routes for session CRUD, streaming and append, session-stream waitpoint support with Redis-backed pending sets, environment config and startup wiring, helper utilities, and end-to-end replication tests.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 35.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'feat: Sessions - bidirectional durable agent streams' clearly summarizes the main change, specifying the new Sessions feature with its core capability of bidirectional streaming.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The pull request provides a comprehensive description covering objectives, use cases, public API surface, implementation notes, and verification steps.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feature/tri-8627-session-primitive-server-side-schema-routes-clickhouse

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Durable, typed, bidirectional I/O primitive that outlives a single run. Ship target is agent/chat use cases; run-scoped streams.pipe/streams.input are untouched and do not create Session rows. Postgres - New Session table: id, friendlyId, externalId, type (plain string), denormalised project/environment/organization scalar columns (no FKs), taskIdentifier, tags String[], metadata Json, closedAt, closedReason, expiresAt, timestamps - Point-lookup indexes only (friendlyId unique, (env, externalId) unique, expiresAt). List queries are served from ClickHouse so Postgres stays minimal and insert-heavy. Control-plane API - POST /api/v1/sessions create (idempotent via externalId) - GET /api/v1/sessions list with filters (type, tag, taskIdentifier, externalId, status ACTIVE|CLOSED|EXPIRED, period/from/to) and cursor pagination, ClickHouse-backed - GET /api/v1/sessions/:session retrieve — polymorphic: `session_` prefix hits friendlyId, otherwise externalId - PATCH /api/v1/sessions/:session update tags/metadata/externalId - POST /api/v1/sessions/:session/close terminal close (idempotent) Realtime (S2-backed) - PUT /realtime/v1/sessions/:session/:io returns S2 creds - GET /realtime/v1/sessions/:session/:io SSE subscribe - POST /realtime/v1/sessions/:session/:io/append server-side append - S2 key format: sessions/{friendlyId}/{out|in} Auth - sessions added to ResourceTypes. read:sessions:{id}, write:sessions:{id}, admin:sessions:{id} scopes work via existing JWT validation. ClickHouse - sessions_v1 ReplacingMergeTree table - SessionsReplicationService mirrors RunsReplicationService exactly: logical replication with leader-locked consumer, ConcurrentFlushScheduler, retry with exponential backoff + jitter, identical metric shape. Dedicated slot + publication (sessions_to_clickhouse_v1[_publication]). - SessionsRepository + ClickHouseSessionsRepository expose list, count, tags with cursor pagination keyed by (created_at DESC, session_id DESC). - Derived status (ACTIVE/CLOSED/EXPIRED) computed from closed_at + expires_at; in-memory fallback on list results to catch pre-replication writes. Verification - Webapp typecheck 10/10 - Core + SDK build 3/3 - sessionsReplicationService.test.ts integration tests 2/2 (insert + update round-trip via testcontainers) - Live round-trip against local dev: create -> retrieve (friendlyId and externalId) -> out.initialize -> out.append x2 -> in.send -> out.subscribe (receives records) -> close -> ClickHouse sessions_v1 shows the replicated row with closed_reason - Live list smoke: tag, type, status CLOSED, externalId, and cursor pagination

…te/update The session_ prefix identifies internal friendlyIds. Allowing it in a user-supplied externalId would misroute subsequent GET/PATCH/close requests through resolveSessionByIdOrExternalId to a friendlyId lookup, returning null or the wrong session. Reject at the schema boundary so both routes surface a clean 422.

Without allowJWT/corsStrategy, frontend clients holding public access tokens hit 401 on GET /api/v1/sessions and browser preflights fail. Matches the single-session GET/PATCH/close routes and the runs list endpoint.

- Derive isCached from the upsert result (id mismatch = pre-existing row) instead of doing a separate findFirst first. The pre-check was racy — two concurrent first-time POSTs could both return 201 with isCached: false. Using the returned row's id is atomic and saves a round-trip. - Scope the list endpoint's authorization to the standard action/resource pattern (matches api.v1.runs.ts): task-scoped JWTs can list sessions filtered by their task, and broader super-scopes (read:sessions, read:all, admin) authorize unfiltered listing. - Log and swallow unexpected errors on POST rather than returning the raw error.message. Prisma/internal messages can leak column names and query fragments.

Give Session channels run-engine waitpoint semantics so a task can suspend while idle on a session channel and resume when an external client sends a record — parallel to what streams.input offers run-scoped streams. Webapp - POST /api/v1/runs/:runFriendlyId/session-streams/wait — creates a manual waitpoint attached to {sessionId, io} and race-checks the S2 stream starting at lastSeqNum so pre-arrived data fires it immediately. Mirrors the existing input-stream waitpoint route. - sessionStreamWaitpointCache.server.ts — Redis set keyed on {sessionFriendlyId, io}, drained atomically on each append so concurrent multi-tab waiters all wake together. - realtime.v1.sessions.$session.$io.append now drains pending waitpoints after every record lands and completes each with the appended body. - S2RealtimeStreams.readSessionStreamRecords — session-channel parallel of readRecords, feeds the race-check path. Core - CreateSessionStreamWaitpoint request/response schemas alongside the existing Session CRUD schemas. Server API contract only — the client ApiClient + SDK wrapper ship on the AI-chat branch.

Two fixes needed by browser clients hitting the public session API (TriggerChatTransport's direct accessToken path, WebSocket-less session drivers, anything origin'd off the dashboard): - POST /api/v1/sessions: allowJWT: true + corsStrategy: "all" on the action. Pre-fix, the create endpoint only accepted secret-key auth, so any browser-originated sessions.create(...) 401'd. The loader (list) already had these; matches that shape. - POST /realtime/v1/sessions/:session/:io/append: export both { action, loader } so Remix routes the OPTIONS preflight to the route builder's CORS handler. With only { action } exported, the preflight returns 400 'No loader for route' and Chrome surfaces the follow-up POST as net::ERR_FAILED. Same pattern as /api/v1/tasks/:id/trigger (which already exports both). Validated by an end-to-end UI smoke on references/ai-chat: new chat → send → streamed assistant reply in ~4s → second turn reuses the same session + run, lastEventId advances 10 → 21.

Nine fixes from CodeRabbit + Devin review: - api.v1.sessions.$session.close.ts: - Export { action, loader } so CORS preflight reaches the route builder's OPTIONS handler. Same fix already applied to the append route — Devin caught that I'd missed this one. Without the loader, browser clients hitting POST /close fail preflight. - Switch to `prisma.session.updateMany({ where: { id, closedAt: null }, ... })` so concurrent closes can't overwrite the original `closedAt` / `closedReason`. Loser hits count === 0 and re-reads the winning row — closedness is write-once at the DB level. (CodeRabbit: TOCTOU.) - entry.server.tsx: Wrap the async `sessionsReplicationInstance.shutdown` in a sync handler with `.catch(...)`. SIGTERM/SIGINT fire during process teardown and a rejection from `_replicationClient.stop()` would become an unhandled promise rejection. Matches the pattern in `dynamicFlushScheduler.server.ts`. (CodeRabbit: unhandled rejection risk.) - api.v1.runs.$runFriendlyId.session-streams.wait.ts: - Swallowed race-check catch now logs `warn` with sessionFriendlyId / io / waitpointId / error. Silent failures in the S2-read / engine-complete / cache-remove path were indistinguishable from the expected cache-drain-on-append fast path. - Outer 500 path no longer forwards `error.message` (Prisma / engine / S2 internals could leak). Logs server-side and returns a generic "Something went wrong"; 422 ServiceValidationError path unchanged. (CodeRabbit: info-leak + logging gap.) - realtime.v1.sessions.$session.$io.ts: Add `method: "PUT"` to the route config so the route builder enforces method validation before the handler runs. Removed the now-redundant `request.method !== "PUT"` check inside the handler. (CodeRabbit: defense-in-depth.) - services/sessionsRepository/sessionsRepository.server.ts: `ISessionsRepository` is now a `type` alias, per repo coding guideline ("use types over interfaces"). Structural-typing means implementing classes don't need source changes. (CodeRabbit.) - services/sessionStreamWaitpointCache.server.ts: Replace separate SADD + PEXPIRE with a single atomic Lua script. Solves two distinct concerns at once: 1. Partial-failure window (CodeRabbit): if SADD succeeded and PEXPIRE failed, the key would persist with no TTL. The Lua script fails both or succeeds both. 2. TTL-race (Devin, twice): each waitpoint registers with its own `ttlMs` derived from the caller's timeout. The old code called PEXPIRE unconditionally, so a short-TTL registration would shrink the shared key's TTL below a longer-TTL sibling — evicting the sibling from Redis and degrading the append-path fast drain to engine-timeout-only. The script only PEXPIREs if the new TTL is greater than the current PTTL (or the key has no TTL yet), so the key lives as long as the longest-TTL member. Outstanding: one unresolved thread asking to rename `CloseSessionRequestBody.reason` → `closedReason` for symmetry with the DB column. Holding that for an API-taste call — will follow up. Validated: `pnpm run typecheck --filter webapp` clean.

Devin catch on #3417 — the ClickHouse sessions list was slicing `sessionIds.slice(1, size + 1)` on the backward path, which skipped the item closest to the cursor and surfaced the sentinel (the `size+1`th item that proves hasMore=true) to the user. Trace, with items c01…c11 and cursor=c07 (page size 3): - Backward query: `session_id > c07 ORDER BY ASC LIMIT 4` → `[c08, c09, c10, c11]`. Legitimate content is the first three (`[c08, c09, c10]`); `c11` is the sentinel. - Previous slice: `[c09, c10, c11]` → displayed DESC `[c11, c10, c09]` — user never sees c08, sees sentinel c11 instead. Fix: collapse both directions to `sessionIds.slice(0, size)`. The sentinel is always the last item regardless of direction, so the two branches had no reason to diverge. Cursor computations (`previousCursor = reversed.at(1)`, `nextCursor = reversed.at(size)`) already line up with the corrected slice — no change needed there. Verified: webapp typecheck clean.

/realtime/v1/sessions/:session/:io=out now peeks the tail record in S2 at connection time. When the tail chunk is trigger:turn-complete, the agent has finished a turn and is either idle-waiting on .in or has exited — either way no more chunks will arrive without further user action. In that case the downstream S2 read switches to wait=0 so the SSE drains and closes in ~1s instead of long-polling for 60s, and the response carries X-Session-Settled: true so the client can tell the close is terminal rather than a normal 60s cycle. Mid-turn tails (streaming UIMessageChunks in flight) fall through to the existing wait=60 long-poll. Crashed-mid-turn is indistinguishable from live-streaming at this point and gets the same 60s retry loop as today — that's a separate hardening, not in scope here. The peek uses GET /records?tail_offset=1&count=1&wait=0 (single-digit ms on S2), then unwraps the agent-side envelope written by StreamsWriterV2: record.body parses to {data: <chunk>, id}, where <chunk> is the raw UIMessageChunk object. No double-parse on data. 404 / 416 from the peek (stream never written / empty stream) short- circuit to settled=false so first-connect on a freshly-created session keeps the long-poll semantics the agent's first chunks depend on. Verified end-to-end against an idle chat-agent-smoke session: caught- up reconnect (Last-Event-ID = tail) closes in 1.08s with the header; behind reconnect (Last-Event-ID < tail) drains remaining records then closes in 0.94s with the header; empty-stream reconnect keeps the 60s long-poll behavior unchanged.

Session is now the run manager for chat.agent and any future task-bound session. Atomically creates the row + triggers the first run + tracks the current run via optimistic claim, with a SessionRun audit log for provenance. Schema: - Session gains `taskIdentifier`, `triggerConfig` (JSON), `currentRunId` (non-FK), `currentRunVersion` (monotonic int for optimistic claim). - New SessionRun audit table — one row per run a session triggers, with `reason: "initial" | "continuation" | "upgrade" | "manual"`. Lifecycle: - `POST /api/v1/sessions`: idempotent on `(env, externalId)`, refreshes triggerConfig on cache hit, runs `ensureRunForSession` (probe + optimistic claim), returns a session-scoped PAT. JWT auth path dropped — secret-key only. The customer's server is the only entry point for session creation. - `POST /api/v1/sessions/:s/end-and-continue`: server-orchestrated handoff (cancels current run, triggers a fresh one, swaps currentRunId via `updateMany where currentRunVersion`). Powers `chat.requestUpgrade()` from inside the agent runtime. - `POST /realtime/v1/sessions/:s/:io/append`: probe + ensureRunForSession before append so messages arriving while no run is alive boot one transparently. Cross-form addressing on write paths: - `createActionApiRoute` now runs `findResource` before `authorization`, matching `createLoaderApiRoute`. Action routes get an optional `resource` argument on `authorization.resource()` — backwards-compatible (existing 4-arg callbacks unchanged). - Append + end-and-continue use the new ordering to authorize against `{paramSession, friendlyId, externalId}` so a JWT minted for either form authorizes either URL form. Helpers: - `mintSessionToken.server.ts`: server-side session-PAT factory (`read:sessions:{key} + write:sessions:{key}`, 1h TTL). - `sessionRunManager.server.ts`: `ensureRunForSession` (probe + claim) and `swapSessionRun` (force handoff with optimistic claim + cancel-on-loss). Pre-mutation existence reads switched to `$replica` (close, end-and- continue, PATCH).

Three fixes after pushing the Sessions-as-run-manager commit: - `api.v1.sessions.$session.end-and-continue.ts` was destructuring only `{ action }` from `createActionApiRoute`, which means Remix had no handler for OPTIONS preflight on this route. Browser CORS would 405. Sibling routes (`close.ts`) already export `{ action, loader }`. Fix: destructure and export both. - `ensureRunForSession`'s pathological "lost the claim race AND the winner's run was already terminal" branch recursed without bound. In practice progress through the run engine bounds it, but a misconfigured task that crashes before being dequeued could blow the stack. Add a hidden `_attempt` counter, throw `SessionRunManagerError` once it exceeds 3. - `sessionsReplicationService.test.ts` was failing in CI because the sessions-as-run-manager schema migration made `taskIdentifier` and `triggerConfig` required on `Session`. The two `prisma.session.create` calls in the test predate the migration. Add the now-required fields to both fixtures.

Two fixes from Devin review on the sessions-as-run-manager commit: - `SessionItem.currentRunId`'s contract is the `run_*` friendlyId, but `serializeSession` returns the raw Prisma cuid. The `POST /sessions` create path overrides correctly via a TaskRun lookup, but GET, PATCH, and the three return paths in close.ts were passing the cuid through. A consumer using `currentRunId` from those endpoints in a downstream `GET /api/v1/runs/:runId` call would 404. Add a `serializeSessionWithFriendlyRunId` helper next to `serializeSession` that resolves via `$replica.taskRun.findFirst` (TaskRun friendlyIds are immutable, so replica lag is harmless), and switch the five affected return sites to use it. List endpoints stay on `serializeSession` to avoid N+1 lookups when paginating. The create endpoint keeps its existing manual lookup because it also needs the friendlyId for the response's `runId` field, and `session.currentRunId` is stale relative to the post-`ensureRunForSession` claim outcome. - Drop dead `lastChunkType` recomputation in `streamResponseFromSessionStream`. The variable was bound but never used; the conditional below it re-evaluated the same expression. Use the bound value in the condition.

Collapse `session-out-settled-signal.md` and `sessions-public-api-cors.md` into the single `session-primitive.md`, and rewrite that one to a high- level two-sentence summary that covers everything actually shipping in this PR (sessions-as-run-manager, end-and-continue, waitpoints, etc.). The CORS/JWT-on-create story is also out of date now that POST /api/v1/sessions is secret-key only.

…friendlyId Switch the two read-after-write taskRun lookups (POST /api/v1/sessions and POST /api/v1/sessions/:s/end-and-continue) from $replica back to prisma. Both reads happen immediately after triggering a run on the writer; replica lag would null the result and turn a successful create into a 500, or fall back to leaking the internal cuid in the end-and-continue response.

…n sessionRunManager The lost-race re-read in ensureRunForSession and swapSessionRun reads the Session row that the winner just wrote on the writer. Reading from $replica could return pre-race state and either (1) cause ensureRunForSession to recurse with a stale currentRunVersion, fail the next claim, and waste runs until max-attempts; or (2) cause swapSessionRun to return swapped: false with the calling run's own id, misleading the caller into thinking it is still authoritative.

The S2 record envelope wraps the agent-written chunk as {data: <chunkAsString>, id: partId} because StreamsWriterV2 hands appendPart an already-stringified chunk. The peek-settled check treated envelope.data as an object, so typeof === 'object' always returned false and the trigger:turn-complete sentinel was never matched. Reconnect-on-reload silently degraded to the full long-poll path. Parse envelope.data once more so the type discriminator is surfaced.

… run lookup Same read-after-write pattern as the other lost-race re-reads: the run was just triggered on the writer milliseconds before, so a $replica.findFirst can return null due to replication lag. The null silently no-ops the cancellation and leaks an orphan run that no session will ever claim.

When the upsert path returns a previously-closed row, return 409 before ensureRunForSession fires. Otherwise we'd trigger a fresh run on a closed session that can't receive .in input (append handler rejects writes to closed sessions), wasting compute on a run that exits the moment it tries to read. close is one-way; callers must use a different externalId to start a new session.

The race-check in api.v1.runs.$runFriendlyId.session-streams.wait was selecting the realtime stream instance via run.realtimeStreamsVersion, but session streams are always v2 (S2) — the writer (appendPartToSessionStream) and the SSE subscribe both hardcode v2. For a v1 run the race-check silently fell back to a non-S2 instance, the instanceof check missed, and the optimization was skipped. Hardcode v2 for parity with the rest of the session surface.

…ized routes createActionApiRoute now runs findResource before authorization so the auth scope check can expand to alternate identifiers of the resolved resource (Sessions are addressable by both friendlyId and externalId). Side-effect: an authenticated-but-underscoped caller could probe resource existence by observing 404 vs 403. Mask the 404 as 403 with the same response shape as the auth-failed branch when the route declares authorization, so the two cases are indistinguishable to callers without scopes. Routes without authorization keep returning 404.

Previous fix unconditionally returned 403 when findResource was null on a route with authorization, breaking PRIVATE-key callers (e.g. server SDK) hitting the existing api.v2.runs.cancel route — they always pass authorization but the new code returned 403 with a factually wrong message ('Unauthorized: missing required scopes') even though they had full permissions. New ordering: run authorization first (with the resolved resource as the 5th arg, so cross-form session auth still works), then check resource-null → 404. This gives: - PRIVATE key + missing resource: auth passes → 404 (correct) - Underscoped JWT + missing resource: auth fails (resource not in scope) → 403 (no info leak vs existing resource) - Underscoped JWT + existing resource: auth fails → 403 (unchanged) Only auth callbacks that destructure the resource (loader for realtime.v1.sessions.$session.$io) need to handle null — they all already do, since findResource was already nullable in pre-PR loaders.