feat: Sessions - bidirectional durable agent streams#3417

Merged
ericallam merged 23 commits into main from
feature/tri-8627-session-primitive-server-side-schema-routes-clickhouse
Apr 28, 2026

Conversation

@ericallam
Member

@ericallam ericallam commented Apr 20, 2026

⚠️ Not released yet. This PR is the server-side foundation only. The SDK changes that customers will actually use (chat.agent migration, chat.createStartSessionAction, useTriggerChatTransport updates) live on a separate branch and ship together in an upcoming @trigger.dev/sdk prerelease. Until that prerelease is published, this surface is reachable only via direct HTTP.

What this gives Trigger.dev users

A new first-class primitive, Session, for durable, task-bound, bidirectional I/O that outlives any single run. Sessions are the run manager for chat.agent going forward, and they unblock anything else that needs "one identifier, many runs over time" with a stable channel pair the client can write to and subscribe to.

Use cases unblocked

  • Chat agents that persist across many runs. One session per chat (keyed on your own chatId via externalId): turns 1..N attach to the same Session, and the UI subscribes once and keeps receiving output as new runs take over.
  • Approval loops and long-running tasks with user feedback. The task waits on .in, the client writes to .in, the server enforces no-writes-after-close.
  • Workflow progress streams that live past the run. Subscribe to .out after the task finishes to replay history.
  • Resume-next-day flows. A session is a durable row, not a transient stream. Send a message a day later and the server triggers a fresh run on the same session.

How it works (Session-as-run-manager)

A Session row is task-bound (taskIdentifier + triggerConfig are required) and owns its current run via currentRunId + currentRunVersion for optimistic claim. Three trigger paths:

  1. Session create: POST /api/v1/sessions creates the row and triggers the first run synchronously.
  2. Append-time probe: POST /realtime/v1/sessions/:session/in/append checks if the current run is alive; if it has terminated (idle exit, crash, etc.), the server triggers a new run before processing the append.
  3. End-and-continue handoff: POST /api/v1/sessions/:session/end-and-continue, called by the running agent, triggers a fresh run and atomically swaps currentRunId. Used by chat.requestUpgrade() for version handoffs.

Every triggered run is recorded in the SessionRun audit table with a reason (initial, continuation, upgrade, manual).
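
The append-time probe and the audit trail can be pictured with a small in-memory model. This is a sketch only: `ensureSessionRun`, `SessionState`, and the `isAlive` / `triggerRun` callbacks are hypothetical stand-ins rather than the webapp's code, and the optimistic claim the real path performs is left out.

```typescript
// Hypothetical in-memory model; only the reason values come from the PR text.
type RunReason = "initial" | "continuation" | "upgrade" | "manual";

type SessionState = {
  currentRunId: string | null;
  audit: { runId: string; reason: RunReason }[];
};

// Trigger a run only when the session has no live run, and record why.
function ensureSessionRun(
  session: SessionState,
  isAlive: (runId: string) => boolean,
  triggerRun: () => string,
  reason: RunReason
): string {
  const current = session.currentRunId;
  if (current !== null && isAlive(current)) {
    return current; // a live run already owns the session
  }
  const runId = triggerRun();
  session.currentRunId = runId;
  session.audit.push({ runId, reason }); // one audit entry per triggered run
  return runId;
}
```

An append that arrives while the run is alive reuses it; an append after an idle exit or crash triggers a new run and adds a second audit entry.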

Public API surface

Control plane

  • POST /api/v1/sessions — create. Idempotent on (env, externalId). Triggers the first run, returns the session and a session-scoped public access token. Returns 409 if the upserted row is already closed.
  • GET /api/v1/sessions/:session — retrieve by friendlyId (session_abc...) or by your own externalId (server disambiguates by prefix).
  • GET /api/v1/sessions — list with filters (type, tag, taskIdentifier, externalId, derived status ACTIVE/CLOSED/EXPIRED, created-at range) and cursor pagination. Backed by ClickHouse.
  • PATCH /api/v1/sessions/:session — update tags / metadata / externalId.
  • POST /api/v1/sessions/:session/close — terminate. Idempotent, hard-blocks new server-brokered writes.
  • POST /api/v1/sessions/:session/end-and-continue — agent-only handoff to a fresh run.
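
The prefix-based disambiguation of `:session` can be sketched as below. The `session_` prefix rule is from this PR; `resolveSessionParam`, `SessionLookup`, and `isAllowedExternalId` are hypothetical names, and the server's actual resolver may apply further checks.

```typescript
// Illustrative helper names; only the "session_" prefix rule is from the PR.
type SessionLookup =
  | { by: "friendlyId"; value: string }
  | { by: "externalId"; value: string };

// A `session_` prefix routes to a friendlyId lookup, anything else to externalId.
function resolveSessionParam(param: string): SessionLookup {
  return param.startsWith("session_")
    ? { by: "friendlyId", value: param }
    : { by: "externalId", value: param };
}

// Consequence of the rule: a user-supplied externalId must not carry the prefix,
// otherwise later lookups would be misrouted to the friendlyId path.
function isAllowedExternalId(externalId: string): boolean {
  return !externalId.startsWith("session_");
}
```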

Realtime

  • PUT /realtime/v1/sessions/:session/:io — initialize a channel. Returns S2 credentials in headers so high-throughput clients can write direct to S2.
  • GET /realtime/v1/sessions/:session/:io — SSE subscribe. Supports Last-Event-ID resume and an opt-in X-Peek-Settled: 1 header that fast-closes the stream when the upstream is already settled (trigger:turn-complete), eliminating long-poll wait on reconnect-on-reload paths.
  • POST /realtime/v1/sessions/:session/:io/append — server-side appends.
  • POST /api/v1/runs/:runFriendlyId/session-streams/wait — runs wait on a session stream as a waitpoint, with a race-check to avoid suspending if data already landed.
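
A client-side view of the SSE subscribe headers might look like the following sketch. The header names (`Last-Event-ID`, `X-Peek-Settled`, and the `X-Session-Settled` response header) are taken from this PR; the helper functions are hypothetical, and sending `Accept: text/event-stream` is an assumption based on standard SSE practice.

```typescript
// Helper names are hypothetical; header names are taken from the PR text.
function buildSubscribeHeaders(
  lastEventId?: string,
  peekSettled: boolean = false
): Record<string, string> {
  const headers: Record<string, string> = { Accept: "text/event-stream" };
  if (lastEventId !== undefined) headers["Last-Event-ID"] = lastEventId; // resume point
  if (peekSettled) headers["X-Peek-Settled"] = "1"; // opt in to fast-close
  return headers;
}

// True when the server marked the close as terminal rather than a normal long-poll cycle.
function isSettledClose(responseHeaders: Record<string, string>): boolean {
  const entry = Object.entries(responseHeaders).find(
    ([name]) => name.toLowerCase() === "x-session-settled"
  );
  return entry !== undefined && entry[1] === "true";
}
```

A reconnect-on-reload path would pass the last seen event id plus `peekSettled = true`, then skip its retry loop when `isSettledClose` returns true.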

Auth scopes

sessions is a new resource type. read:sessions:{id}, write:sessions:{id}, admin:sessions:{id} flow through the existing JWT validator. Session-scoped public access tokens minted by the server replace browser-held trigger-task tokens for chat-style flows — the browser never sees a run identifier or a run-scoped token in steady state.
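
The scope strings follow an `action:sessions:id` shape, sketched below. The function names are hypothetical, and the matching rule shown is exact-match only; the existing JWT validator also honors broader scopes, which this sketch does not model.

```typescript
type SessionAction = "read" | "write" | "admin";

function sessionScope(action: SessionAction, sessionId: string): string {
  return `${action}:sessions:${sessionId}`;
}

// Scopes a session-scoped public access token carries (read + write),
// per the helper description later in this PR.
function sessionTokenScopes(sessionKey: string): string[] {
  return [sessionScope("read", sessionKey), sessionScope("write", sessionKey)];
}

// Assumption: exact-match check only; super-scopes are not modeled here.
function isSessionAuthorized(
  granted: string[],
  action: SessionAction,
  sessionId: string
): boolean {
  return granted.includes(sessionScope(action, sessionId));
}
```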

What's coming after this PR

  • SDK + chat.agent migration: separate branch, separate PR, ships in the next @trigger.dev/sdk prerelease alongside this server deploy. Customers using the prerelease chat.agent will follow the upgrade guide.
  • Dashboard surfaces: dedicated agent list, agent playground, agent view on the run dashboard. Tracking separately.

Implementation notes

  • Postgres Session table: scalar scoping columns (projectId, runtimeEnvironmentId, environmentType, organizationId) without FKs, matching the January TaskRun FK-removal decision. Point-lookup indexes only — list queries go to ClickHouse. Terminal markers (closedAt, expiresAt) are write-once.
  • ClickHouse sessions_v1: ReplacingMergeTree, partitioned by month, ordered by (org_id, project_id, environment_id, created_at, session_id). Tags indexed via tokenbf_v1 skip index.
  • SessionsReplicationService: mirrors RunsReplicationService exactly — leader-locked logical replication consumer, ConcurrentFlushScheduler, retry with exponential backoff + jitter, identical metric shape. Dedicated slot + publication so the two consume independently.
  • S2 keys: sessions/{addressingKey}/{out|in}. The existing runs/{runId}/{streamId} key format for run-scoped streams is untouched.
  • Optimistic claim: ensureRunForSession triggers a run upfront (cheap to cancel if it loses the race), then attempts an updateMany keyed on currentRunVersion. Loser cancels its triggered run and reuses the winner's. No DB lock held across the trigger.
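
The optimistic claim can be illustrated with an in-memory compare-and-swap. This is a sketch under stated assumptions: `claimRun` stands in for the `updateMany` keyed on `currentRunVersion`, the version is assumed to increase by one per successful claim, and `claimOrReuse` and `cancelRun` are hypothetical names.

```typescript
type SessionRow = { currentRunId: string | null; currentRunVersion: number };

// Stand-in for `updateMany ... where currentRunVersion = expected`:
// returns the number of rows updated (0 or 1).
function claimRun(row: SessionRow, expectedVersion: number, runId: string): number {
  if (row.currentRunVersion !== expectedVersion) return 0;
  row.currentRunId = runId;
  row.currentRunVersion = expectedVersion + 1; // assumed increment
  return 1;
}

// The run is triggered before the claim; the loser cancels its run and reuses the winner's.
function claimOrReuse(
  row: SessionRow,
  observedVersion: number,
  triggeredRunId: string,
  cancelRun: (runId: string) => void
): string {
  if (claimRun(row, observedVersion, triggeredRunId) === 1) return triggeredRunId;
  cancelRun(triggeredRunId);
  if (row.currentRunId === null) throw new Error("lost the claim but no winner is recorded");
  return row.currentRunId;
}
```

Two callers that both observed version 0 end up on the same run: the first claim succeeds, the second updates zero rows and falls back to the winner.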

What did NOT change

Run-scoped streams.pipe / streams.input and the existing /realtime/v1/streams/{runId}/... routes are unchanged. Sessions are net-new — not a reshaping of the current streams API.

Deploy notes

  • Set SESSION_REPLICATION_CLICKHOUSE_URL and SESSION_REPLICATION_ENABLED=1 to enable the replication consumer.
  • The Session table needs REPLICA IDENTITY FULL set on the prod source DB before the publication is created (same one-time DDL we did for TaskRun). Required for delete events to carry full column values.
  • The GET /api/v1/sessions/:session loader supports cross-form authorization: a JWT minted for either form authorizes both URL forms. Action routes are URL-form-specific, matching how the SDK mints PATs.

Verification

  • Webapp typecheck clean (10/10).
  • apps/webapp/test/sessionsReplicationService.test.ts — round-trip tests for insert/update/delete through Postgres logical replication into ClickHouse via testcontainers.
  • Live end-to-end against local dev: create + retrieve (both forms) + update + close, .out.initialize + .out.append x2 + .in.send + .out.subscribe over SSE, list with all filter combinations + pagination, end-and-continue swap, X-Peek-Settled fast-close (verified in browser via reconnect-on-reload and via curl). Replicated row lands in ClickHouse within ~1s.
  • Multi-round Devin + CodeRabbit review feedback addressed (read-after-write paths use prisma writer, info-leak on auth-routes masked as 403, peek-settled discriminator parsing fix, etc.).

Test plan

  • pnpm run typecheck --filter webapp
  • pnpm run test --filter webapp ./test/sessionsReplicationService.test.ts --run
  • Start the webapp with SESSION_REPLICATION_CLICKHOUSE_URL and SESSION_REPLICATION_ENABLED=1. Confirm the slot and publication auto-create on boot.
  • POST /api/v1/sessions and verify the row replicates to trigger_dev.sessions_v1 within a couple of seconds.
  • POST /api/v1/sessions/:id/close, then confirm POST /realtime/v1/sessions/:id/out/append returns 400.
  • Reuse a closed session's externalId on POST /api/v1/sessions and confirm 409.
  • GET /realtime/v1/sessions/:id/out with X-Peek-Settled: 1 after a turn completes and confirm X-Session-Settled: true response header + immediate close.

@changeset-bot

changeset-bot Bot commented Apr 20, 2026

🦋 Changeset detected

Latest commit: 188fa43

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 29 packages
Name Type
@trigger.dev/core Patch
@trigger.dev/build Patch
trigger.dev Patch
@trigger.dev/python Patch
@trigger.dev/redis-worker Patch
@trigger.dev/schema-to-json Patch
@trigger.dev/sdk Patch
@internal/cache Patch
@internal/clickhouse Patch
@internal/llm-model-catalog Patch
@internal/redis Patch
@internal/replication Patch
@internal/run-engine Patch
@internal/schedule-engine Patch
@internal/testcontainers Patch
@internal/tracing Patch
@internal/tsql Patch
@internal/zod-worker Patch
d3-chat Patch
references-d3-openai-agents Patch
references-nextjs-realtime Patch
references-realtime-hooks-test Patch
references-realtime-streams Patch
references-telemetry Patch
@internal/sdk-compat-tests Patch
@trigger.dev/react-hooks Patch
@trigger.dev/rsc Patch
@trigger.dev/database Patch
@trigger.dev/otlp-importer Patch

@coderabbitai
Contributor

coderabbitai Bot commented Apr 20, 2026

Walkthrough

Introduces a durable Session primitive end-to-end: a new Prisma Session model and migration, a ClickHouse sessions_v1 table and query/insert helpers, ClickHouse-backed SessionsRepository, a SessionsReplicationService that streams Postgres logical replication into ClickHouse (with retry/ack/flush/leader-lock logic), session-friendly ID export (SessionId) and API Zod schemas, multiple REST and realtime routes for session CRUD, streaming and append, session-stream waitpoint support with Redis-backed pending sets, environment config and startup wiring, helper utilities, and end-to-end replication tests.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 35.00%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name | Status | Explanation
Title check | ✅ Passed | The title 'feat: Sessions - bidirectional durable agent streams' clearly summarizes the main change, specifying the new Sessions feature with its core capability of bidirectional streaming.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Description check | ✅ Passed | The pull request provides a comprehensive description covering objectives, use cases, public API surface, implementation notes, and verification steps.

coderabbitai[bot]

This comment was marked as resolved.


@ericallam ericallam force-pushed the feature/tri-8627-session-primitive-server-side-schema-routes-clickhouse branch from 2210fe2 to 4cadc19 on April 23, 2026 09:10
devin-ai-integration[bot]

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

Durable, typed, bidirectional I/O primitive that outlives a single run.
Ship target is agent/chat use cases; run-scoped streams.pipe/streams.input
are untouched and do not create Session rows.

Postgres
- New Session table: id, friendlyId, externalId, type (plain string),
  denormalised project/environment/organization scalar columns (no FKs),
  taskIdentifier, tags String[], metadata Json, closedAt, closedReason,
  expiresAt, timestamps
- Point-lookup indexes only (friendlyId unique, (env, externalId) unique,
  expiresAt). List queries are served from ClickHouse so Postgres stays
  minimal and insert-heavy.

Control-plane API
- POST   /api/v1/sessions           create (idempotent via externalId)
- GET    /api/v1/sessions           list with filters (type, tag,
                                     taskIdentifier, externalId, status
                                     ACTIVE|CLOSED|EXPIRED, period/from/to)
                                     and cursor pagination, ClickHouse-backed
- GET    /api/v1/sessions/:session  retrieve — polymorphic: `session_` prefix
                                     hits friendlyId, otherwise externalId
- PATCH  /api/v1/sessions/:session  update tags/metadata/externalId
- POST   /api/v1/sessions/:session/close  terminal close (idempotent)

Realtime (S2-backed)
- PUT    /realtime/v1/sessions/:session/:io           returns S2 creds
- GET    /realtime/v1/sessions/:session/:io           SSE subscribe
- POST   /realtime/v1/sessions/:session/:io/append    server-side append
- S2 key format: sessions/{friendlyId}/{out|in}

Auth
- sessions added to ResourceTypes. read:sessions:{id},
  write:sessions:{id}, admin:sessions:{id} scopes work via existing JWT
  validation.

ClickHouse
- sessions_v1 ReplacingMergeTree table
- SessionsReplicationService mirrors RunsReplicationService exactly:
  logical replication with leader-locked consumer, ConcurrentFlushScheduler,
  retry with exponential backoff + jitter, identical metric shape.
  Dedicated slot + publication (sessions_to_clickhouse_v1[_publication]).
- SessionsRepository + ClickHouseSessionsRepository expose list, count,
  tags with cursor pagination keyed by (created_at DESC, session_id DESC).
- Derived status (ACTIVE/CLOSED/EXPIRED) computed from closed_at + expires_at;
  in-memory fallback on list results to catch pre-replication writes.
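
The derived status can be sketched as a pure function of `closed_at` and `expires_at`. Two points are assumptions rather than facts from this PR: that CLOSED takes precedence when both markers are set, and that a session counts as expired once `expires_at` is at or before the current time. `deriveSessionStatus` is a hypothetical name.

```typescript
type SessionStatus = "ACTIVE" | "CLOSED" | "EXPIRED";

// Assumed precedence: CLOSED wins over EXPIRED; expiry boundary is inclusive.
function deriveSessionStatus(
  closedAt: Date | null,
  expiresAt: Date | null,
  now: Date
): SessionStatus {
  if (closedAt !== null) return "CLOSED";
  if (expiresAt !== null && expiresAt.getTime() <= now.getTime()) return "EXPIRED";
  return "ACTIVE";
}
```

Because it needs no database access, the same function could serve as the in-memory fallback applied to list results that have not replicated yet.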

Verification
- Webapp typecheck 10/10
- Core + SDK build 3/3
- sessionsReplicationService.test.ts integration tests 2/2 (insert + update
  round-trip via testcontainers)
- Live round-trip against local dev: create -> retrieve (friendlyId and
  externalId) -> out.initialize -> out.append x2 -> in.send -> out.subscribe
  (receives records) -> close -> ClickHouse sessions_v1 shows the replicated
  row with closed_reason
- Live list smoke: tag, type, status CLOSED, externalId, and cursor pagination
…te/update

The session_ prefix identifies internal friendlyIds. Allowing it in a
user-supplied externalId would misroute subsequent GET/PATCH/close
requests through resolveSessionByIdOrExternalId to a friendlyId lookup,
returning null or the wrong session. Reject at the schema boundary so
both routes surface a clean 422.

Without allowJWT/corsStrategy, frontend clients holding public access
tokens hit 401 on GET /api/v1/sessions and browser preflights fail.
Matches the single-session GET/PATCH/close routes and the runs list
endpoint.

- Derive isCached from the upsert result (id mismatch = pre-existing row)
  instead of doing a separate findFirst first. The pre-check was racy —
  two concurrent first-time POSTs could both return 201 with
  isCached: false. Using the returned row's id is atomic and saves a
  round-trip.

- Scope the list endpoint's authorization to the standard action/resource
  pattern (matches api.v1.runs.ts): task-scoped JWTs can list sessions
  filtered by their task, and broader super-scopes (read:sessions,
  read:all, admin) authorize unfiltered listing.

- Log and swallow unexpected errors on POST rather than returning the
  raw error.message. Prisma/internal messages can leak column names and
  query fragments.
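
Deriving `isCached` from the upsert result can be sketched with a map standing in for the table. This is an illustration only: the real unique key is `(env, externalId)` rather than `externalId` alone, and `upsertByExternalId` and `createSessionRecord` are hypothetical names.

```typescript
type SessionRecord = { id: string; externalId: string };

// Stand-in for a database upsert on a unique key: inserts the candidate when
// absent, otherwise returns the existing row untouched.
function upsertByExternalId(
  table: Map<string, SessionRecord>,
  candidate: SessionRecord
): SessionRecord {
  const existing = table.get(candidate.externalId);
  if (existing !== undefined) return existing;
  table.set(candidate.externalId, candidate);
  return candidate;
}

// No separate pre-check: if the returned id differs from the id we generated,
// the row already existed.
function createSessionRecord(
  table: Map<string, SessionRecord>,
  newId: string,
  externalId: string
): { session: SessionRecord; isCached: boolean } {
  const row = upsertByExternalId(table, { id: newId, externalId });
  return { session: row, isCached: row.id !== newId };
}
```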
Give Session channels run-engine waitpoint semantics so a task can
suspend while idle on a session channel and resume when an external
client sends a record — parallel to what streams.input offers
run-scoped streams.

Webapp
- POST /api/v1/runs/:runFriendlyId/session-streams/wait — creates a
  manual waitpoint attached to {sessionId, io} and race-checks the S2
  stream starting at lastSeqNum so pre-arrived data fires it
  immediately. Mirrors the existing input-stream waitpoint route.
- sessionStreamWaitpointCache.server.ts — Redis set keyed on
  {sessionFriendlyId, io}, drained atomically on each append so
  concurrent multi-tab waiters all wake together.
- realtime.v1.sessions.$session.$io.append now drains pending
  waitpoints after every record lands and completes each with the
  appended body.
- S2RealtimeStreams.readSessionStreamRecords — session-channel
  parallel of readRecords, feeds the race-check path.
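
The register-then-drain behavior of the pending-waitpoint set can be sketched in memory as follows. The real cache is a Redis set; `PendingWaitpoints` and its method names are hypothetical, and this single-threaded model only illustrates why one drain wakes every waiter exactly once.

```typescript
type SessionIo = "in" | "out";

class PendingWaitpoints {
  private pending = new Map<string, Set<string>>();

  private key(sessionFriendlyId: string, io: SessionIo): string {
    return `${sessionFriendlyId}:${io}`;
  }

  register(sessionFriendlyId: string, io: SessionIo, waitpointId: string): void {
    const k = this.key(sessionFriendlyId, io);
    const waiters = this.pending.get(k) ?? new Set<string>();
    waiters.add(waitpointId);
    this.pending.set(k, waiters);
  }

  // Remove and return every waiter in one step, so an append completes all of
  // them and a later append sees an empty set.
  drain(sessionFriendlyId: string, io: SessionIo): string[] {
    const k = this.key(sessionFriendlyId, io);
    const waiters = this.pending.get(k);
    this.pending.delete(k);
    return waiters === undefined ? [] : Array.from(waiters);
  }
}
```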

Core
- CreateSessionStreamWaitpoint request/response schemas alongside
  the existing Session CRUD schemas. Server API contract only —
  the client ApiClient + SDK wrapper ship on the AI-chat branch.

Two fixes needed by browser clients hitting the public session API
(TriggerChatTransport's direct accessToken path, WebSocket-less
session drivers, anything origin'd off the dashboard):

- POST /api/v1/sessions: allowJWT: true + corsStrategy: "all" on
  the action. Pre-fix, the create endpoint only accepted secret-key
  auth, so any browser-originated sessions.create(...) 401'd. The
  loader (list) already had these; matches that shape.
- POST /realtime/v1/sessions/:session/:io/append: export both
  { action, loader } so Remix routes the OPTIONS preflight to the
  route builder's CORS handler. With only { action } exported, the
  preflight returns 400 'No loader for route' and Chrome surfaces
  the follow-up POST as net::ERR_FAILED. Same pattern as
  /api/v1/tasks/:id/trigger (which already exports both).

Validated by an end-to-end UI smoke on references/ai-chat:
new chat → send → streamed assistant reply in ~4s → second turn
reuses the same session + run, lastEventId advances 10 → 21.
@ericallam ericallam force-pushed the feature/tri-8627-session-primitive-server-side-schema-routes-clickhouse branch from f4406d7 to 4f2c0e7 on April 23, 2026 17:07
devin-ai-integration[bot]

This comment was marked as resolved.

Nine fixes from CodeRabbit + Devin review:

- api.v1.sessions.$session.close.ts:
  - Export { action, loader } so CORS preflight reaches the route
    builder's OPTIONS handler. Same fix already applied to the
    append route — Devin caught that I'd missed this one. Without
    the loader, browser clients hitting POST /close fail preflight.
  - Switch to `prisma.session.updateMany({ where: { id, closedAt:
    null }, ... })` so concurrent closes can't overwrite the
    original `closedAt` / `closedReason`. Loser hits count === 0 and
    re-reads the winning row — closedness is write-once at the DB
    level. (CodeRabbit: TOCTOU.)

- entry.server.tsx:
  Wrap the async `sessionsReplicationInstance.shutdown` in a sync
  handler with `.catch(...)`. SIGTERM/SIGINT fire during process
  teardown and a rejection from `_replicationClient.stop()` would
  become an unhandled promise rejection. Matches the pattern in
  `dynamicFlushScheduler.server.ts`. (CodeRabbit: unhandled rejection
  risk.)

- api.v1.runs.$runFriendlyId.session-streams.wait.ts:
  - Swallowed race-check catch now logs `warn` with
    sessionFriendlyId / io / waitpointId / error. Silent failures in
    the S2-read / engine-complete / cache-remove path were
    indistinguishable from the expected cache-drain-on-append fast
    path.
  - Outer 500 path no longer forwards `error.message` (Prisma /
    engine / S2 internals could leak). Logs server-side and returns
    a generic "Something went wrong"; 422 ServiceValidationError
    path unchanged. (CodeRabbit: info-leak + logging gap.)

- realtime.v1.sessions.$session.$io.ts:
  Add `method: "PUT"` to the route config so the route builder
  enforces method validation before the handler runs. Removed the
  now-redundant `request.method !== "PUT"` check inside the handler.
  (CodeRabbit: defense-in-depth.)

- services/sessionsRepository/sessionsRepository.server.ts:
  `ISessionsRepository` is now a `type` alias, per repo coding
  guideline ("use types over interfaces"). Structural-typing means
  implementing classes don't need source changes. (CodeRabbit.)

- services/sessionStreamWaitpointCache.server.ts:
  Replace separate SADD + PEXPIRE with a single atomic Lua script.
  Solves two distinct concerns at once:

  1. Partial-failure window (CodeRabbit): if SADD succeeded and
     PEXPIRE failed, the key would persist with no TTL. The Lua
     script fails both or succeeds both.
  2. TTL-race (Devin, twice): each waitpoint registers with its own
     `ttlMs` derived from the caller's timeout. The old code called
     PEXPIRE unconditionally, so a short-TTL registration would
     shrink the shared key's TTL below a longer-TTL sibling —
     evicting the sibling from Redis and degrading the append-path
     fast drain to engine-timeout-only. The script only PEXPIREs if
     the new TTL is greater than the current PTTL (or the key has
     no TTL yet), so the key lives as long as the longest-TTL
     member.
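
The extend-only TTL rule from the Lua script can be modeled in TypeScript as below. This is a simulation under assumptions, not the script itself: `KeyState` and `registerWithTtl` are hypothetical, and time is passed in explicitly so the behavior is deterministic.

```typescript
// In-memory stand-in for a Redis set key with an optional expiry.
type KeyState = { members: Set<string>; expiresAtMs: number | null };

// Add the member and only ever extend the key's TTL, never shrink it.
function registerWithTtl(key: KeyState, member: string, ttlMs: number, nowMs: number): void {
  key.members.add(member);
  const remainingMs = key.expiresAtMs === null ? null : key.expiresAtMs - nowMs;
  if (remainingMs === null || ttlMs > remainingMs) {
    key.expiresAtMs = nowMs + ttlMs; // new TTL is longer, or there was none
  }
}
```

A 5-second registration that arrives while 59 seconds remain leaves the expiry alone, so the longer-lived sibling stays in the set until its own timeout.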

Outstanding: one unresolved thread asking to rename
`CloseSessionRequestBody.reason` → `closedReason` for symmetry with
the DB column. Holding that for an API-taste call — will follow up.

Validated: `pnpm run typecheck --filter webapp` clean.
devin-ai-integration[bot]

This comment was marked as resolved.

Devin catch on #3417 — the ClickHouse sessions list was slicing
`sessionIds.slice(1, size + 1)` on the backward path, which skipped
the item closest to the cursor and surfaced the sentinel (the
`size+1`th item that proves hasMore=true) to the user.

Trace, with items c01…c11 and cursor=c07 (page size 3):
- Backward query: `session_id > c07 ORDER BY ASC LIMIT 4` →
  `[c08, c09, c10, c11]`. Legitimate content is the first three
  (`[c08, c09, c10]`); `c11` is the sentinel.
- Previous slice: `[c09, c10, c11]` → displayed DESC `[c11, c10, c09]`
  — user never sees c08, sees sentinel c11 instead.

Fix: collapse both directions to `sessionIds.slice(0, size)`. The
sentinel is always the last item regardless of direction, so the two
branches had no reason to diverge. Cursor computations
(`previousCursor = reversed.at(1)`, `nextCursor = reversed.at(size)`)
already line up with the corrected slice — no change needed there.
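
The trace above can be reproduced with a few lines of TypeScript. The functions are hypothetical stand-ins for the ClickHouse query and the repository's slicing, written only to show the two slices side by side.

```typescript
// Stand-in for the backward query: ids after the cursor, ascending,
// limited to size + 1 so the extra row can prove hasMore.
function backwardQuery(ids: string[], cursor: string, size: number): string[] {
  return ids.filter((id) => id > cursor).sort().slice(0, size + 1);
}

// Corrected slice: the sentinel is always last, so keep the first `size` rows.
function fixedPage(rows: string[], size: number): string[] {
  return rows.slice(0, size).reverse(); // displayed DESC
}

// Previous backward-path slice, kept only to reproduce the bug.
function buggyPage(rows: string[], size: number): string[] {
  return rows.slice(1, size + 1).reverse();
}

const traceIds = ["c01", "c02", "c03", "c04", "c05", "c06", "c07", "c08", "c09", "c10", "c11"];
const traceRows = backwardQuery(traceIds, "c07", 3); // ["c08", "c09", "c10", "c11"]
const tracePageFixed = fixedPage(traceRows, 3); // ["c10", "c09", "c08"]
const tracePageBuggy = buggyPage(traceRows, 3); // ["c11", "c10", "c09"]
```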

Verified: webapp typecheck clean.

/realtime/v1/sessions/:session/:io=out now peeks the tail record in S2
at connection time. When the tail chunk is trigger:turn-complete, the
agent has finished a turn and is either idle-waiting on .in or has
exited — either way no more chunks will arrive without further user
action. In that case the downstream S2 read switches to wait=0 so the
SSE drains and closes in ~1s instead of long-polling for 60s, and the
response carries X-Session-Settled: true so the client can tell the
close is terminal rather than a normal 60s cycle.

Mid-turn tails (streaming UIMessageChunks in flight) fall through to
the existing wait=60 long-poll. Crashed-mid-turn is indistinguishable
from live-streaming at this point and gets the same 60s retry loop as
today — that's a separate hardening, not in scope here.

The peek uses GET /records?tail_offset=1&count=1&wait=0 (single-digit
ms on S2), then unwraps the agent-side envelope written by
StreamsWriterV2: record.body parses to {data: <chunk>, id}, where
<chunk> is the raw UIMessageChunk object. No double-parse on data.

404 / 416 from the peek (stream never written / empty stream) short-
circuit to settled=false so first-connect on a freshly-created session
keeps the long-poll semantics the agent's first chunks depend on.

Verified end-to-end against an idle chat-agent-smoke session: caught-
up reconnect (Last-Event-ID = tail) closes in 1.08s with the header;
behind reconnect (Last-Event-ID < tail) drains remaining records then
closes in 0.94s with the header; empty-stream reconnect keeps the 60s
long-poll behavior unchanged.
devin-ai-integration[bot]

This comment was marked as resolved.

Session is now the run manager for chat.agent and any future task-bound
session. Atomically creates the row + triggers the first run + tracks
the current run via optimistic claim, with a SessionRun audit log for
provenance.

Schema:
- Session gains `taskIdentifier`, `triggerConfig` (JSON), `currentRunId`
  (non-FK), `currentRunVersion` (monotonic int for optimistic claim).
- New SessionRun audit table — one row per run a session triggers,
  with `reason: "initial" | "continuation" | "upgrade" | "manual"`.

Lifecycle:
- `POST /api/v1/sessions`: idempotent on `(env, externalId)`, refreshes
  triggerConfig on cache hit, runs `ensureRunForSession` (probe +
  optimistic claim), returns a session-scoped PAT. JWT auth path
  dropped — secret-key only. The customer's server is the only entry
  point for session creation.
- `POST /api/v1/sessions/:s/end-and-continue`: server-orchestrated
  handoff (cancels current run, triggers a fresh one, swaps
  currentRunId via `updateMany where currentRunVersion`). Powers
  `chat.requestUpgrade()` from inside the agent runtime.
- `POST /realtime/v1/sessions/:s/:io/append`: probe + ensureRunForSession
  before append so messages arriving while no run is alive boot one
  transparently.

Cross-form addressing on write paths:
- `createActionApiRoute` now runs `findResource` before `authorization`,
  matching `createLoaderApiRoute`. Action routes get an optional
  `resource` argument on `authorization.resource()` —
  backwards-compatible (existing 4-arg callbacks unchanged).
- Append + end-and-continue use the new ordering to authorize against
  `{paramSession, friendlyId, externalId}` so a JWT minted for either
  form authorizes either URL form.

Helpers:
- `mintSessionToken.server.ts`: server-side session-PAT factory
  (`read:sessions:{key} + write:sessions:{key}`, 1h TTL).
- `sessionRunManager.server.ts`: `ensureRunForSession` (probe + claim)
  and `swapSessionRun` (force handoff with optimistic claim +
  cancel-on-loss).

Pre-mutation existence reads switched to `$replica` (close, end-and-
continue, PATCH).
devin-ai-integration[bot]

This comment was marked as resolved.

Three fixes after pushing the Sessions-as-run-manager commit:

- `api.v1.sessions.$session.end-and-continue.ts` was destructuring only
  `{ action }` from `createActionApiRoute`, which means Remix had no
  handler for OPTIONS preflight on this route. Browser CORS would 405.
  Sibling routes (`close.ts`) already export `{ action, loader }`. Fix:
  destructure and export both.

- `ensureRunForSession`'s pathological "lost the claim race AND the
  winner's run was already terminal" branch recursed without bound. In
  practice progress through the run engine bounds it, but a misconfigured
  task that crashes before being dequeued could blow the stack. Add a
  hidden `_attempt` counter, throw `SessionRunManagerError` once it
  exceeds 3.

- `sessionsReplicationService.test.ts` was failing in CI because the
  sessions-as-run-manager schema migration made `taskIdentifier` and
  `triggerConfig` required on `Session`. The two `prisma.session.create`
  calls in the test predate the migration. Add the now-required fields
  to both fixtures.
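
The bounded-recursion fix can be sketched as follows. `SessionRunManagerError` is named in the commit, but the class below is a local stand-in; `ensureWithRetry`, the `tryClaim` callback, and the choice to count attempts from 1 are assumptions made for illustration.

```typescript
// Local stand-in for the error type named in the commit message.
class SessionRunManagerError extends Error {}

const MAX_ATTEMPTS = 3;

// `tryClaim` returns a run id on success, or null for the pathological case
// (lost the claim race and the winner's run was already terminal).
function ensureWithRetry(tryClaim: () => string | null, attempt: number = 1): string {
  if (attempt > MAX_ATTEMPTS) {
    throw new SessionRunManagerError(`gave up after ${MAX_ATTEMPTS} attempts`);
  }
  const runId = tryClaim();
  return runId !== null ? runId : ensureWithRetry(tryClaim, attempt + 1);
}
```

With the counter in place, a task that always crashes before being dequeued produces a bounded number of attempts and a clear error instead of unbounded recursion.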
devin-ai-integration[bot]

This comment was marked as resolved.

Two fixes from Devin review on the sessions-as-run-manager commit:

- `SessionItem.currentRunId`'s contract is the `run_*` friendlyId, but
  `serializeSession` returns the raw Prisma cuid. The `POST /sessions`
  create path overrides correctly via a TaskRun lookup, but GET, PATCH,
  and the three return paths in close.ts were passing the cuid through.
  A consumer using `currentRunId` from those endpoints in a downstream
  `GET /api/v1/runs/:runId` call would 404. Add a
  `serializeSessionWithFriendlyRunId` helper next to `serializeSession`
  that resolves via `$replica.taskRun.findFirst` (TaskRun friendlyIds
  are immutable, so replica lag is harmless), and switch the five
  affected return sites to use it. List endpoints stay on
  `serializeSession` to avoid N+1 lookups when paginating. The create
  endpoint keeps its existing manual lookup because it also needs the
  friendlyId for the response's `runId` field, and `session.currentRunId`
  is stale relative to the post-`ensureRunForSession` claim outcome.

- Drop dead `lastChunkType` recomputation in
  `streamResponseFromSessionStream`. The variable was bound but never
  used; the conditional below it re-evaluated the same expression.
  Use the bound value in the condition.

Collapse `session-out-settled-signal.md` and `sessions-public-api-cors.md`
into the single `session-primitive.md`, and rewrite that one to a high-
level two-sentence summary that covers everything actually shipping in
this PR (sessions-as-run-manager, end-and-continue, waitpoints, etc.).
The CORS/JWT-on-create story is also out of date now that POST
/api/v1/sessions is secret-key only.
devin-ai-integration[bot]

This comment was marked as resolved.

…friendlyId

Switch the two read-after-write taskRun lookups (POST /api/v1/sessions
and POST /api/v1/sessions/:s/end-and-continue) from $replica back to
prisma. Both reads happen immediately after triggering a run on the
writer; replica lag would null the result and turn a successful create
into a 500, or fall back to leaking the internal cuid in the
end-and-continue response.
devin-ai-integration[bot]

This comment was marked as resolved.

…n sessionRunManager

The lost-race re-read in ensureRunForSession and swapSessionRun reads
the Session row that the winner just wrote on the writer. Reading from
$replica could return pre-race state and either (1) cause
ensureRunForSession to recurse with a stale currentRunVersion, fail the
next claim, and waste runs until max-attempts; or (2) cause
swapSessionRun to return swapped: false with the calling run's own id,
misleading the caller into thinking it is still authoritative.
devin-ai-integration[bot]

This comment was marked as resolved.

The S2 record envelope wraps the agent-written chunk as
{data: <chunkAsString>, id: partId} because StreamsWriterV2 hands
appendPart an already-stringified chunk. The peek-settled check
treated envelope.data as an object, so typeof === 'object' always
returned false and the trigger:turn-complete sentinel was never
matched. Reconnect-on-reload silently degraded to the full long-poll
path. Parse envelope.data once more so the type discriminator is
surfaced.
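
The double-parse described here can be sketched as a small predicate. The envelope shape `{data, id}` and the `trigger:turn-complete` sentinel are from this PR; `isTurnComplete` is a hypothetical name, and reading the discriminator from a `type` field on the chunk is an assumption.

```typescript
// Returns true when the tail record's chunk is the turn-complete sentinel.
function isTurnComplete(recordBody: string): boolean {
  try {
    const envelope: unknown = JSON.parse(recordBody);
    if (typeof envelope !== "object" || envelope === null) return false;
    const data = (envelope as { data?: unknown }).data;
    if (typeof data !== "string") return false; // data is an already-stringified chunk
    const chunk: unknown = JSON.parse(data); // second parse surfaces the discriminator
    return (
      typeof chunk === "object" &&
      chunk !== null &&
      (chunk as { type?: unknown }).type === "trigger:turn-complete"
    );
  } catch {
    return false; // malformed JSON: treat as not settled and keep long-polling
  }
}
```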
devin-ai-integration[bot]

This comment was marked as resolved.

… run lookup

Same read-after-write pattern as the other lost-race re-reads:
the run was just triggered on the writer milliseconds before, so a
$replica.findFirst can return null due to replication lag. The null
silently no-ops the cancellation and leaks an orphan run that no
session will ever claim.
devin-ai-integration[bot]

This comment was marked as resolved.

When the upsert path returns a previously-closed row, return 409 before
ensureRunForSession fires. Otherwise we'd trigger a fresh run on a
closed session that can't receive .in input (append handler rejects
writes to closed sessions), wasting compute on a run that exits the
moment it tries to read. close is one-way; callers must use a different
externalId to start a new session.

The race-check in api.v1.runs.$runFriendlyId.session-streams.wait was
selecting the realtime stream instance via run.realtimeStreamsVersion,
but session streams are always v2 (S2) — the writer (appendPartToSessionStream)
and the SSE subscribe both hardcode v2. For a v1 run the race-check
silently fell back to a non-S2 instance, the instanceof check missed,
and the optimization was skipped. Hardcode v2 for parity with the rest
of the session surface.
…ized routes

createActionApiRoute now runs findResource before authorization so the
auth scope check can expand to alternate identifiers of the resolved
resource (Sessions are addressable by both friendlyId and externalId).
Side-effect: an authenticated-but-underscoped caller could probe
resource existence by observing 404 vs 403. Mask the 404 as 403 with
the same response shape as the auth-failed branch when the route
declares authorization, so the two cases are indistinguishable to
callers without scopes. Routes without authorization keep returning
404.
devin-ai-integration[bot]

This comment was marked as resolved.

Previous fix unconditionally returned 403 when findResource was null on
a route with authorization, breaking PRIVATE-key callers (e.g. server
SDK) hitting the existing api.v2.runs.cancel route — they always pass
authorization but the new code returned 403 with a factually wrong
message ('Unauthorized: missing required scopes') even though they had
full permissions.

New ordering: run authorization first (with the resolved resource as
the 5th arg, so cross-form session auth still works), then check
resource-null → 404. This gives:
- PRIVATE key + missing resource: auth passes → 404 (correct)
- Underscoped JWT + missing resource: auth fails (resource not in
  scope) → 403 (no info leak vs existing resource)
- Underscoped JWT + existing resource: auth fails → 403 (unchanged)

Only auth callbacks that destructure the resource (loader for
realtime.v1.sessions.$session.$io) need to handle null — they all
already do, since findResource was already nullable in pre-PR
loaders.
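
The resulting status matrix can be expressed as a small decision function. This is a sketch with hypothetical names (`Caller`, `routeStatus`) and a simplified scope check; it models only the ordering described above, authorization first and the null-resource check second.

```typescript
type Caller = { kind: "private-key" } | { kind: "jwt"; scopes: string[] };

// Authorization runs first; a missing resource is only reported to callers
// who passed it, so underscoped callers cannot probe for existence.
function routeStatus(
  caller: Caller,
  requiredScope: string,
  resourceExists: boolean
): 200 | 403 | 404 {
  const authorized =
    caller.kind === "private-key" || caller.scopes.includes(requiredScope);
  if (!authorized) return 403;
  return resourceExists ? 200 : 404;
}
```

An underscoped JWT sees 403 whether or not the session exists, while a private-key caller gets an accurate 404 for a missing resource.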
@ericallam ericallam merged commit c69e939 into main Apr 28, 2026
43 checks passed
@ericallam ericallam deleted the feature/tri-8627-session-primitive-server-side-schema-routes-clickhouse branch April 28, 2026 11:35