Skip to content

feat: persist RUM domain key availability on Site config#2367

Open
Sugandhgyl wants to merge 11 commits into
mainfrom
feat/site-rum-config
Open

feat: persist RUM domain key availability on Site config#2367
Sugandhgyl wants to merge 11 commits into
mainfrom
feat/site-rum-config

Conversation

@Sugandhgyl
Copy link
Copy Markdown
Contributor

@Sugandhgyl Sugandhgyl commented May 8, 2026

WHAT CHANGED

  • New updateRumConfig() service checks the RUM API with a 3-second timeout, then persists hasDomainKey on site.config.rumConfig
  • Write paths covered: REST Site.create, add-site Slack command, approve-site-candidate Slack action, onboardSingleSite helper (reuses existing steps.rumVerified — no extra API call)
  • { save: false } option prevents a double-save in onboardSingleSite
  • 5 new tests for the service + updated tests for all 4 write paths (291 tests passing)
Screenshot 2026-05-12 at 10 34 07 AM

Please ensure your pull request adheres to the following guidelines:

  • make sure to link the related issues in this description. Or if there's no issue created, make sure you
    describe here the problem you're solving.
  • when merging / squashing, make sure the fixed issue references are visible in the commits, for easy compilation of release notes

If the PR is changing the API specification:

  • make sure you add a "Not implemented yet" note the endpoint description, if the implementation is not ready
    yet. Ideally, return a 501 status code with a message explaining the feature is not implemented yet.
  • make sure you add at least one example of the request and response.

If the PR is changing the API implementation or an entity exposed through the API:

  • make sure you update the API specification and the examples to reflect the changes.

If the PR is introducing a new audit type:

  • make sure you update the API specification with the type, schema of the audit result and an example

Related Issues

Thanks for contributing!

@Sugandhgyl Sugandhgyl changed the title Feat/site rum config feat: persist RUM domain key availability on Site config May 8, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 8, 2026

This PR will trigger a minor release when merged.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown
Contributor

@sandsinh sandsinh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @Sugandhgyl,

Strengths

  • Extracting updateRumConfig as a shared helper in rum-config-service.js is the right instinct - centralizing the RUM check logic avoids copy-paste across callers.
  • The { save: false } option gives callers control over save timing, which is useful when multiple mutations need to be batched before a single site.save().
  • The backfill script (scripts/backfill-rum-config.mjs) includes --dry-run and --site-ids flags, per-site rate limiting (500ms), and progress logging - all appropriate for a one-time operational script.
  • Test files use esmock to stub updateRumConfig at the call sites, keeping unit tests focused on the caller's behavior without re-testing the helper itself.
  • plg-onboarding.js correctly follows the setConfig(Config.toDynamoItem(...)) pattern before site.save().

Issues

Critical (Must Fix)

1. @adobe/spacecat-shared-data-access is resolved from a personal GitHub Gist, not npm - package.json
The dependency points to a raw tarball URL on gist.github.com. The Gist tarball is approximately 1.35 MB; the real npm package of the same version is 172 KB. The Gist contains an embedded nested archive (adobe-spacecat-shared-data-access-2.96.0.tgz, ~1.17 MB) with file mtimes backdated to 1985. These are different packages with different integrity hashes. This is an unverified artifact from an untrusted source and must not reach main. Replace with the published npm version before merging.

2. deploy-dev npm script targets a personal label (-l sugandhg) - package.json
Same issue as the companion audit-worker PR. Any developer or CI job running npm run deploy-dev after this merges will deploy to a personal slot rather than the shared dev environment. Revert to -l latest or the appropriate shared label.

3. rum-config-service.js mutates config but does not call site.setConfig(Config.toDynamoItem(...)) before saving - src/support/rum-config-service.js
updateRumConfig calls site.getConfig().updateRumConfig(hasDomainKey) and then site.save() without first calling site.setConfig(Config.toDynamoItem(siteConfig)). The plg-onboarding.js path in this same PR calls site.setConfig(Config.toDynamoItem(siteConfig)) correctly before save. These two paths have different persistence behavior. In production, the service's save will either silently discard the rumConfig update or store data that does not match the expected DynamoDB shape, depending on how setConfig/save handles an in-place config mutation.

4. utils.js mutation ordering bug silently overwrites the rumConfig update - src/support/utils.js
utils.js calls updateRumConfig(site, context, { save: false }) which mutates site.getConfig() in place, then immediately calls site.setConfig(Config.toDynamoItem(siteConfig)) where siteConfig is a separate Config object captured earlier in the function. Since the two object references are different, the setConfig call with the earlier snapshot overwrites the rumConfig mutation. The update is lost silently, and tests will not catch this because they stub updateRumConfig at the import boundary.

5. RUM check failure propagates as a 5xx to the site creation caller - src/controllers/sites.js
sites.js calls await updateRumConfig(site, context) inside the try block after the site has already been written to the database. If the RUM check or its internal save throws, the site creation handler returns a 5xx even though the site was successfully created. The caller will likely retry, creating a duplicate site or a confusing "site already exists" error on retry. The RUM config update must not share an error boundary with the site creation response. Wrap it in its own try/catch or move it to a background job.

6. Slack action handlers propagate RUM check errors to unrelated user-facing responses - src/support/slack/actions/approve-site-candidate.js
approve-site-candidate.js calls await updateRumConfig(site, lambdaContext) without catching errors. If the RUM check fails, the Slack action handler throws, the approval response is not sent, and the approving user sees a failure with no actionable explanation. An RUM availability check is a non-critical enrichment step and must not block the approval workflow. Wrap in try/catch and log the error without re-throwing.

Important (Should Fix)

7. Four different write patterns exist for the same operation within this single PR

  • rum-config-service.js: mutates site.getConfig() directly, calls site.save() without setConfig (buggy)
  • plg-onboarding.js: calls site.getConfig(), mutates, calls site.setConfig(Config.toDynamoItem(...)), calls site.save() (correct)
  • utils.js: calls updateRumConfig with save:false, then calls site.setConfig(Config.toDynamoItem(siteConfig)) with the wrong object reference (buggy)
  • approve-site-candidate.js and add-site.js: delegate to updateRumConfig which has the missing setConfig bug

Consolidate to a single pattern. The plg-onboarding.js pattern is correct. Fix rum-config-service.js to call site.setConfig(Config.toDynamoItem(siteConfig)) after the mutation and before save, then all other paths that delegate to the service will inherit the fix.

8. PLG path derives hasDomainKey from steps.rumVerified rather than a live check - src/controllers/plg/plg-onboarding.js
plg-onboarding.js sets hasDomainKey = steps.rumVerified === true instead of calling updateRumConfig or performing a live check. This means the PLG path persists whatever the onboarding wizard recorded at the time the form was submitted, which may be stale by the time the site is created. The source of truth for hasDomainKey is now ambiguous across write paths.

9. Naive URL parsing will produce incorrect domain values - src/support/rum-config-service.js
site.getBaseURL().replace(/^https?:\/\//, '') passes path segments, ports, query strings, and trailing slashes to retrieveDomainkey. Use new URL(site.getBaseURL()).hostname to extract only the hostname.

10. setTimeout inside Promise.race is never cleared on success - src/support/rum-config-service.js
Identical to the issue in the companion audit-worker PR. When retrieveDomainkey resolves before the timeout, the pending timer is not cancelled. In Lambda this holds the event loop open past the handler's return. Use clearTimeout in a finally block.

11. Synchronous 3-second RUM check blocks the site creation response - src/controllers/sites.js
A blocking await updateRumConfig(site, context) with a 3-second timeout is now in the critical path of the site creation endpoint. Site creation latency will spike by up to 3 seconds on every call. Move this to a fire-and-forget call with its own error handling, or enqueue a background job.

12. Backfill script reads credentials from environment without upfront validation - scripts/backfill-rum-config.mjs
POSTGREST_URL and RUM_ADMIN_KEY are read from process.env without checking for presence. If either is missing, the script will fail mid-run with an unhelpful error after it has already processed some records, leaving the dataset partially updated. Validate both variables at startup with a clear error message before entering the site loop.

Minor (Nice to Have)

  • The backfill script has no documented cleanup plan. One-time scripts committed without a removal ticket tend to become permanent fixtures in the repository. Add a note in the script or the PR description about when and how it should be removed.
  • approve-site-candidate.js calls updateRumConfig in both branches of an if/else that appear structurally symmetric in this regard. If both branches always call it, move the call to after the if/else to reduce duplication.
  • updateRumConfig returns hasDomainKey but all callers in this PR ignore the return value. Either remove the return value or use it at the call sites - an ignored return value is a signal that the interface is not well-matched to its callers.
  • The save: false option is easy to misuse: a caller that passes save: false and then forgets to call site.setConfig(Config.toDynamoItem(...)) before saving will silently discard the mutation. A JSDoc comment on the option explaining the caller's responsibility would reduce this risk.

Recommendations

  • Fix rum-config-service.js to call site.setConfig(Config.toDynamoItem(siteConfig)) before site.save(). This single fix, combined with removing the utils.js mutation ordering bug, would make the service the canonical correct pattern and all delegating callers would inherit the fix.
  • Move the RUM check out of the synchronous request path in sites.js. An enrichment step that can timeout for 3 seconds and throw should not affect the latency or success rate of a core write endpoint.
  • Add a cleanup ticket for scripts/backfill-rum-config.mjs before merging.

Assessment

Ready to merge? No
The persistence bug in rum-config-service.js, the mutation ordering bug in utils.js, and the error propagation in sites.js are all blockers that will cause silent data loss or incorrect API behavior in production. The Gist URL dependency must also be resolved. The shared library PR (spacecat-shared) should be published to npm and referenced here before these downstream PRs merge.

@Sugandhgyl
Copy link
Copy Markdown
Contributor Author

  • The save: false option is easy to misuse: a caller that passes save: false and then forgets to call site.setConfig(Config.toDynamoItem(...)) before saving will silently discard the mutation. A JSDoc comment on the option explaining the caller's responsibility would reduce this risk.

Hey @sandsinh, thanks for the detailed review! All Critical and Important findings have been addressed. Here's a summary:

Critical fixes

  1. Gist URL removed — @adobe/spacecat-shared-data-access reverted to the published npm version (3.60.0).

  2. deploy-dev personal label removed — changed -l sugandhg back to -l latest.

  3. rum-config-service.js — missing setConfig call fixed — the service now calls site.setConfig(Config.toDynamoItem(siteConfig)) before site.save() when save: true. When save: false, the function returns hasDomainKey and leaves the caller fully responsible for config mutation and serialization (no internal site.getConfig() mutation performed).

  4. utils.js — mutation ordering bug fixed — the returned hasDomainKey value from updateRumConfig(site, context, { save: false }) is now applied to the siteConfig snapshot via siteConfig.updateRumConfig(hasDomainKey) before site.setConfig(Config.toDynamoItem(siteConfig)). All three mutations (fetchConfig, onboardConfig, rumConfig) are now serialized together in a single setConfig call.

  5. sites.js error boundary fixed — updateRumConfig is now wrapped in its own try/catch. A RUM check failure logs a warning but returns 201 — site creation succeeds regardless.

  6. approve-site-candidate.js and add-site.js errors isolated — both updateRumConfig call sites are now wrapped in try/catch. A RUM failure logs a warning without re-throwing, so the Slack approval/add-site flow completes normally.

Important fixes
7. Single canonical write pattern — fixing issue 3 (adding setConfig in the service) makes the service the correct pattern. All delegating callers (sites.js, add-site.js, approve-site-candidate.js) inherit the fix automatically.

  1. Added import { updateRumConfig } from '../../support/rum-config-service.js'
  • Replaced siteConfig.updateRumConfig(steps.rumVerified === true) with a live RUM call: const hasDomainKey = await updateRumConfig(site, context, { save: false }); siteConfig.updateRumConfig(hasDomainKey)
  • Added rum-config-service.js mock to all 12 esmock instances in the test so the stub's { save: false } path is exercised without making real HTTP calls
  1. URL parsing — changed site.getBaseURL().replace(...) to new URL(site.getBaseURL()).hostname in rum-config-service.js to correctly handle ports, paths, and trailing slashes.

  2. clearTimeout in finally block — added to rum-config-service.js so the pending timer is always cancelled after the race settles.

  3. Changed try { await updateRumConfig(...) } catch {...} to updateRumConfig(site, context).catch((e) => { log.warn(...) }) — the 201 response returns immediately without waiting for the RUM check

Tests updated
rum-config-service.test.js: added Config.toDynamoItem mock, asserts site.setConfig is called on save: true paths; save: false test updated to verify no config mutation or setConfig occurs.
sites.test.js: added test verifying site creation returns 201 even when updateRumConfig throws.

JSDoc (rum-config-service.js):

Replaced the duplicated/incomplete JSDoc block with a single clear doc that explicitly shows the caller contract for { save: false } (the three required steps — siteConfig.updateRumConfig, site.setConfig(Config.toDynamoItem(...)), site.save()) and warns that omitting them silently discards the result

Sugandhgyl and others added 10 commits May 13, 2026 11:28
Adds updateRumConfig() service that checks the RUM API for a domain key
and stores the result in Site.config.rumConfig, giving the ASO UI a
deterministic signal for the "No RUM" dialog and banner.

Write paths covered:
- Site.create via REST (sites.js)
- add-site Slack command
- approve-site-candidate Slack action
- onboardSingleSite helper (reuses steps.rumVerified, no extra API call)

A 3-second Promise.race timeout prevents the RUM check from blocking
any of these flows. The { save: false } option avoids a double-save
in onboardSingleSite where site.save() is already called downstream.

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
…l mocks

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sites

Adds scripts/backfill-rum-config.mjs — a one-time idempotent script that
checks RUM domain key availability for all sites that predate the
rum-config-service write path and persists hasDomainKey + lastCheckedAt
into site.config.rumConfig.

Supports --site-ids for targeted runs (e.g. stage validation) and
--dry-run to preview without writing. Rate-limited to 500 ms per site.

Co-Authored-By: Claude <noreply@anthropic.com>
- Move copyright header before eslint-disable comment (header/header rule)
- Fix copyright year from 2026 to 2025
- Add curly braces to if-block (curly rule)
- Simplify Promise to single-statement arrow (max-statements-per-line rule)

Co-Authored-By: Claude <noreply@anthropic.com>
Expand inline arrow to a block body so setTimeout's return value
is not implicitly returned from the Promise executor.

Co-Authored-By: Claude <noreply@anthropic.com>
- fix(sites): make RUM config update fire-and-forget so 201 is not blocked
- fix(plg): replace cached steps.rumVerified with live updateRumConfig call
- fix(utils): apply hasDomainKey return value from updateRumConfig to caller's config snapshot
- fix(slack): isolate RUM config update errors in approve-site-candidate and add-site handlers
- fix(package): use published npm version of spacecat-shared-data-access instead of Gist URL
- refactor(rum-config-service): consolidate JSDoc and document save:false caller contract
- test: add toDynamoItem stub, setConfig assertions, and fire-and-forget test to sites suite
- test: add updateRumConfig stub to all 12 PLG esmock instances for transitive isolation

Co-Authored-By: Claude <noreply@anthropic.com>
…d in main

Main changed UPHELD decision to transition to REJECTED (not INACTIVE), so the
two INACTIVE notification tests no longer match the production behaviour.
Keep main's new 'uses custom EXPERIENCE_URL' test unchanged.

Co-Authored-By: Claude <noreply@anthropic.com>
…e tests

Add tests for the catch blocks in approve-site-candidate.js (both new-site
and existing-site paths) and add-site.js where updateRumConfig rejects.
Brings Codecov patch coverage to 100% for those files.

Co-Authored-By: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants