
feat(llmo): async URL CSV export for agentic traffic#2401

Open
akshaymagapu wants to merge 10 commits into main from feat/agentic-traffic-urls-export-api

Conversation


@akshaymagapu akshaymagapu commented May 12, 2026

What

Two endpoints for the agentic-traffic URL Performance dashboard's async CSV export:

POST /sites/:siteId/agentic-traffic/urls/export
GET  /sites/:siteId/agentic-traffic/urls/export/:exportId

How

POST canonicalises the filter set, hashes it into a deterministic exportId, and checks S3 first. Same filters → same key → cache hit. On a miss, an SQS message is enqueued and the reporting-worker (spacecat-reporting-worker#616) runs the export via the data-service RPC (mysticat-data-service#589). The user polls GET until metadata.json flips to success (presigned download URLs) or failed (reason).

POST  ── ListObjectsV2 + GetObject(metadata.json)
        ├─ status=success    →  200 ready    + presigned URLs
        ├─ status=processing →  202 processing
        └─ status=failed or no metadata → sqs.sendMessage → 202 processing

GET   ── same S3 cache check, no SQS path; pins exportId to /^[a-f0-9]{64}$/
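The branch logic above can be sketched as a pure decision function. This is an illustrative sketch only — `resolveExportResponse` and the return shape are assumed names, not the actual handler, which also performs the S3 and SQS calls:

```javascript
// Maps the metadata.json state to the HTTP response and whether a new
// SQS export job should be enqueued. Failed metadata re-enqueues (retry);
// the worker overwrites the stale metadata.json at the same S3 key.
function resolveExportResponse(metadata) {
  if (!metadata) {
    return { httpStatus: 202, status: 'processing', enqueue: true };
  }
  switch (metadata.status) {
    case 'success':
      return { httpStatus: 200, status: 'ready', enqueue: false };
    case 'processing':
      return { httpStatus: 202, status: 'processing', enqueue: false };
    case 'failed':
    default:
      return { httpStatus: 202, status: 'processing', enqueue: true };
  }
}
```

GET reuses the same read path but never enqueues, and reports `failed` explicitly.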

Design notes

  • exportId = sha256(stableStringify(canonical filter set)) — order-stable JSON serialisation guarantees identical filters always produce the same key.
  • Aurora may split large exports into urls.csv + urls.csv_part2 + …; listExportCsvObjects returns them in stable part order.
  • Presigned URLs expire after 7 days.
  • Status polling reads only from S3 — no DB round-trip, no writer-pool pressure.

Config

Env var                           Fallback
AGENTIC_TRAFFIC_EXPORT_BUCKET     S3_REPORT_BUCKET → ctx.s3.s3Bucket
AGENTIC_TRAFFIC_EXPORT_QUEUE_URL  REPORT_JOBS_QUEUE_URL
AGENTIC_TRAFFIC_EXPORT_REGION     ctx.runtime.region → us-east-1

Today the dedicated env vars are unset; fallbacks resolve to the existing report bucket / queue.
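The fallback chain could look roughly like this — `getExportConfig` matches the helper name mentioned in later commits, but this exact shape is an assumption:

```javascript
// Resolves export config from env vars, preferring the dedicated
// AGENTIC_TRAFFIC_EXPORT_* names and falling back to the shared
// report bucket / queue, then to SDK/runtime defaults.
function getExportConfig(env, ctx) {
  return {
    s3Bucket: env.AGENTIC_TRAFFIC_EXPORT_BUCKET
      || env.S3_REPORT_BUCKET
      || ctx?.s3?.s3Bucket,
    queueUrl: env.AGENTIC_TRAFFIC_EXPORT_QUEUE_URL
      || env.REPORT_JOBS_QUEUE_URL,
    s3Region: env.AGENTIC_TRAFFIC_EXPORT_REGION
      || ctx?.runtime?.region
      || 'us-east-1',
  };
}
```

Missing config (no bucket or queue after the chain) is rejected with a 400 and a descriptive message rather than surfacing as a 500.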

OpenAPI

New AgenticTrafficUrlsExportRequest / AgenticTrafficUrlsExportResponse schemas; new paths in llmo-api.yaml; agentic-traffic-by-url-api.md updated.

Tests

113 passing in llmo-agentic-traffic.test.js. Covers cache hit, cache miss + SQS enqueue, failed-metadata retry, processing short-circuit, platform-code mapping into the hash, missing-config rejection (POST + GET), split-part presigning, exportId-shape validation, S3-error fallthrough, and the fallback/default branches.

Related

akshaymagapu and others added 3 commits May 12, 2026 16:02
Adds POST/GET endpoints for asynchronous URL-level CSV exports of the
agentic traffic dashboard:

  POST /sites/:siteId/agentic-traffic/urls/export
  GET  /sites/:siteId/agentic-traffic/urls/export/:exportId

Flow:

  UI clicks export
    -> API endpoint receives filters
    -> API computes deterministic exportId (sha256 of canonical filter set)
    -> API checks S3 cache (S3_REPORT_BUCKET, agentic-traffic/url-exports/...)
    -> if CSV present + metadata=success: return presigned URL(s) (200 ready)
    -> if metadata=failed: return 200 failed + reason
    -> if metadata=processing: return 202 processing
    -> otherwise enqueue SQS job (REPORT_JOBS_QUEUE_URL) -> 202 processing

  Worker handles the SQS job, calls the data-service RPC
  (wrpc_agentic_traffic_urls_export_to_s3), and writes metadata.json.

  UI polls the status endpoint until ready/failed.

Implementation notes:

- Filter set is canonicalized (version + siteId + startDate/endDate +
  platform/categoryName/agentType/userAgent/contentType/successRate/
  urlPathSearch + format) and stable-stringified before hashing, so the
  same filters always produce the same exportId regardless of JSON key
  order. Same filters -> same S3 key -> cache hit on retry.

- Listing handles Aurora's split-file convention: when query_export_to_s3
  splits a large export, additional objects appear as urls.csv_part2 /
  urls.csv_part3 / ... alongside urls.csv. The list step returns them in
  stable part order so the presigned-URL array matches.

- Presigned URLs expire after 7 days (the SQS-driven export is async and
  the user may walk away from the polling tab).

- Export bucket/queue/region resolve from env in priority order:
  AGENTIC_TRAFFIC_EXPORT_BUCKET > S3_REPORT_BUCKET > ctx.s3.s3Bucket, and
  AGENTIC_TRAFFIC_EXPORT_QUEUE_URL > REPORT_JOBS_QUEUE_URL. Missing
  config returns 400 with a descriptive message rather than 500.

- parseAgenticTrafficParams now captures urlPathSearch (already supported
  by the data-service by-url RPC). Other handlers ignore the extra
  field; only the export hashes it into the exportId.
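The split-file ordering described above can be sketched like this — a minimal sketch with illustrative helper names; the real listExportCsvObjects also performs the ListObjectsV2 call:

```javascript
// Aurora's query_export_to_s3 writes urls.csv, then urls.csv_part2,
// urls.csv_part3, ... for large exports. The base file is part 1.
function partNumber(key) {
  if (key.endsWith('urls.csv')) return 1;
  const m = key.match(/_part(\d+)$/);
  return m ? Number(m[1]) : 1; // fallback unreachable after the filter below
}

// Keeps only the export's CSV objects and returns them in stable
// part order, so the presigned-URL array matches the part sequence.
function sortCsvParts(keys) {
  return keys
    .filter((k) => k.endsWith('urls.csv') || /urls\.csv_part\d+$/.test(k))
    .sort((a, b) => partNumber(a) - partNumber(b));
}
```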

OpenAPI:

- New AgenticTrafficUrlsExportRequest / AgenticTrafficUrlsExportResponse
  schemas.
- New llmo-api paths for export + status with 200/202 distinction.

Tests:

- 10 controller tests covering cache hit, queueing, processing
  short-circuit, platform-code mapping into the hash, missing config
  rejection, status processing/ready/failed states, split-part presigning,
  and exportId-shape validation.
- Routes index test updated to include the new endpoints in both the
  controller mock and the route listing.

Requires:
- spacecat-infrastructure: aurora s3Export role association (PR #518)
- mysticat-data-service: wrpc_agentic_traffic_urls_export_to_s3 RPC
- spacecat-reporting-worker: agentic-traffic-urls-export SQS handler

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The reporting-worker (spacecat-reporting-worker#616) resolves its
allowed bucket from S3_REPORTING_BUCKET_NAME (the existing reports
bucket env in that Lambda's environment). The API service was only
checking S3_REPORT_BUCKET — fine in envs where both env vars resolve
to the same bucket, but a mismatch in any env where only one is set
would make the worker reject the SQS message with 's3Bucket must
match the configured export bucket'.

Add S3_REPORTING_BUCKET_NAME as an additional fallback in the API's
getExportConfig so both names work and resolution stays consistent
with the worker regardless of which env var the deploy config sets.

Order: AGENTIC_TRAFFIC_EXPORT_BUCKET (preferred, dedicated)
     → S3_REPORTING_BUCKET_NAME (worker's name)
     → S3_REPORT_BUCKET (older name, some envs still have it)
     → ctx.s3.s3Bucket (SDK default).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
API service convention is S3_REPORT_BUCKET. The worker has its own
S3_REPORTING_BUCKET_NAME convention in its Lambda env; both names
resolve to the same spacecat-{env}-reports bucket at deploy time so
cross-service consistency isn't an issue. Reverting the extra fallback
to keep each repo using its native env-var name.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions

This PR will trigger a minor release when merged.

CI flagged two uncovered branches in
createAgenticTrafficUrlsExportStatusHandler that the existing tests
missed:

  - lines 748-749: `if (!hasText(s3Bucket)) return badRequest(...)` —
    missing-config branch on the GET endpoint (the POST endpoint
    version was already tested, but the GET version wasn't).
  - lines 776-778: the `catch (error)` block — unexpected S3 PUT/GET
    failure during status check.

Two added tests close both. Project-wide coverage threshold is 100%
and the prior failure was at 99.96% / 99.89% / 99.96% — these tests
push it back to 100%.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@akshaymagapu changed the title from "feat(llmo): async S3-backed CSV export for agentic URL traffic" to "feat(llmo): async URL CSV export for agentic traffic" on May 12, 2026
akshaymagapu and others added 4 commits May 12, 2026 16:36
POST is the user's "I want this export" signal. Returning the prior
failure verbatim permanently locked the cache key for the same filter
set — users had to either manually delete metadata.json from S3 or
tweak filters to change the exportId before they could retry.

Drop the isExportFailed early-return from the POST handler so failed
metadata falls through to the enqueue path, identical to "no metadata".
The worker overwrites the failed metadata.json on the retry; the cache
contract still holds (same filters → same exportId → same S3 key) so
retries are free of side effects.

GET keeps the explicit 'failed' branch — status polling is a pure
read; reporting the failure is still its job.

Test added asserting the new POST-with-failed-metadata behavior. 103 →
104 passing.

OpenAPI POST description updated to document the new status semantics.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…be/spacecat-api-service into feat/agentic-traffic-urls-export-api
CI flagged two more uncovered branches in llmo-agentic-traffic.js:

  - lines 712-714: catch (error) in the POST handler — unexpected S3
    or SQS error inside the try block. Added a test that rejects the
    S3 send stub and asserts 500 + the log line.
  - lines 740-741: missing s3Client / ListObjectsV2Command /
    GetObjectCommand / getSignedUrl guard on the GET status handler.
    Added a test that strips ctx.s3 and asserts 400.

Tests 104 → 106 passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@akshaymagapu akshaymagapu added the enhancement New feature or request label May 12, 2026
akshaymagapu and others added 2 commits May 12, 2026 17:40
CI was still at 99.6% lines / 96.97% branches with two uncovered ranges:

- Lines 160-161: stableStringify array branch. The canonical export
  payload is a flat object of primitives — the array branch was
  unreachable. Removed; the recursive object branch is sufficient.
- Lines 663-664: POST !hasText(s3Bucket) || !hasText(queueUrl) — the
  second config-check that trips after the s3?.s3Client guard. The
  existing 'not configured' test strips s3/sqs entirely and trips the
  earlier check, so this branch was never exercised. Added a test where
  S3/SQS SDKs are present but env vars / s3Bucket are stripped.

107 tests passing; targeted coverage on llmo-agentic-traffic.js shows
100% statements/lines/functions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously verified 100% statements/lines/functions but missed that
branches were still at 97.58%. CI's 100% threshold caught it. Going
through every reachable `||` / `??` fallback now.

Tests added (107 → 113):
- defaults s3Region to 'us-east-1' when AGENTIC_TRAFFIC_EXPORT_REGION
  and ctx.runtime.region are both missing, plus requestedBy fallback
  to 'unknown' when ctx.attributes.authInfo.profile.email is absent.
- ListObjectsV2 returning a response without the Contents field —
  exercises the `result.Contents || []` defense.
- success metadata that lacks rowCount/filesUploaded/bytesUploaded —
  exercises the `?? null` and `?? csvKeys.length` paths.
- failed metadata without failureReason — exercises the 'Export
  failed' default reason.
- GET status with undefined ctx.params.exportId — distinct from the
  'not-a-hash' truthy-but-invalid case; covers `exportId || ''`.
- GetObject error shaped as `error.$metadata.httpStatusCode = 404`
  rather than `error.name = 'NoSuchKey'` — both are treated as a
  missing-metadata signal.
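The two error shapes above can be collapsed by a small predicate — `isMissingMetadataError` is an illustrative name, not necessarily the helper in the diff:

```javascript
// Treats both SDK error shapes as "no metadata.json yet":
// a NoSuchKey error name, or a bare 404 in $metadata.
function isMissingMetadataError(error) {
  return error?.name === 'NoSuchKey'
    || error?.$metadata?.httpStatusCode === 404;
}
```

Any other error shape falls through to the catch-all S3-error handling instead of being mistaken for a cache miss.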

Dead-branch annotation:
- The `Number(key.match(...)?.[1] || 1)` fallback in
  listExportCsvObjects' sort comparator is unreachable —
  listExportCsvObjects' own filter already guarantees keys end in
  `_partN` for the non-csvKey branch. Added `/* c8 ignore next */`
  with an explanatory comment so the dead branch is documented rather
  than silently failing coverage.

Local verify before pushing this time: 100% statements / branches /
functions / lines on llmo-agentic-traffic.js.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

codecov Bot commented May 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

