feat: persist RUM domain key availability on Site config#2367
feat: persist RUM domain key availability on Site config#2367Sugandhgyl wants to merge 11 commits into
Conversation
|
This PR will trigger a minor release when merged. |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
sandsinh
left a comment
There was a problem hiding this comment.
Hey @Sugandhgyl,
Strengths
- Extracting
updateRumConfigas a shared helper inrum-config-service.jsis the right instinct - centralizing the RUM check logic avoids copy-paste across callers. - The
{ save: false }option gives callers control over save timing, which is useful when multiple mutations need to be batched before a singlesite.save(). - The backfill script (
scripts/backfill-rum-config.mjs) includes--dry-runand--site-idsflags, per-site rate limiting (500ms), and progress logging - all appropriate for a one-time operational script. - Test files use
esmockto stubupdateRumConfigat the call sites, keeping unit tests focused on the caller's behavior without re-testing the helper itself. plg-onboarding.jscorrectly follows thesetConfig(Config.toDynamoItem(...))pattern beforesite.save().
Issues
Critical (Must Fix)
1. @adobe/spacecat-shared-data-access is resolved from a personal GitHub Gist, not npm - package.json
The dependency points to a raw tarball URL on gist.github.com. The Gist tarball is approximately 1.35 MB; the real npm package of the same version is 172 KB. The Gist contains an embedded nested archive (adobe-spacecat-shared-data-access-2.96.0.tgz, ~1.17 MB) with file mtimes backdated to 1985. These are different packages with different integrity hashes. This is an unverified artifact from an untrusted source and must not reach main. Replace with the published npm version before merging.
2. deploy-dev npm script targets a personal label (-l sugandhg) - package.json
Same issue as the companion audit-worker PR. Any developer or CI job running npm run deploy-dev after this merges will deploy to a personal slot rather than the shared dev environment. Revert to -l latest or the appropriate shared label.
3. rum-config-service.js mutates config but does not call site.setConfig(Config.toDynamoItem(...)) before saving - src/support/rum-config-service.js
updateRumConfig calls site.getConfig().updateRumConfig(hasDomainKey) and then site.save() without first calling site.setConfig(Config.toDynamoItem(siteConfig)). The plg-onboarding.js path in this same PR calls site.setConfig(Config.toDynamoItem(siteConfig)) correctly before save. These two paths have different persistence behavior. In production, the service's save will either silently discard the rumConfig update or store data that does not match the expected DynamoDB shape, depending on how setConfig/save handles an in-place config mutation.
4. utils.js mutation ordering bug silently overwrites the rumConfig update - src/support/utils.js
utils.js calls updateRumConfig(site, context, { save: false }) which mutates site.getConfig() in place, then immediately calls site.setConfig(Config.toDynamoItem(siteConfig)) where siteConfig is a separate Config object captured earlier in the function. Since the two object references are different, the setConfig call with the earlier snapshot overwrites the rumConfig mutation. The update is lost silently, and tests will not catch this because they stub updateRumConfig at the import boundary.
5. RUM check failure propagates as a 5xx to the site creation caller - src/controllers/sites.js
sites.js calls await updateRumConfig(site, context) inside the try block after the site has already been written to the database. If the RUM check or its internal save throws, the site creation handler returns a 5xx even though the site was successfully created. The caller will likely retry, creating a duplicate site or a confusing "site already exists" error on retry. The RUM config update must not share an error boundary with the site creation response. Wrap it in its own try/catch or move it to a background job.
6. Slack action handlers propagate RUM check errors to unrelated user-facing responses - src/support/slack/actions/approve-site-candidate.js
approve-site-candidate.js calls await updateRumConfig(site, lambdaContext) without catching errors. If the RUM check fails, the Slack action handler throws, the approval response is not sent, and the approving user sees a failure with no actionable explanation. An RUM availability check is a non-critical enrichment step and must not block the approval workflow. Wrap in try/catch and log the error without re-throwing.
Important (Should Fix)
7. Four different write patterns exist for the same operation within this single PR
rum-config-service.js: mutatessite.getConfig()directly, callssite.save()withoutsetConfig(buggy)plg-onboarding.js: callssite.getConfig(), mutates, callssite.setConfig(Config.toDynamoItem(...)), callssite.save()(correct)utils.js: callsupdateRumConfigwithsave:false, then callssite.setConfig(Config.toDynamoItem(siteConfig))with the wrong object reference (buggy)approve-site-candidate.jsandadd-site.js: delegate toupdateRumConfigwhich has the missingsetConfigbug
Consolidate to a single pattern. The plg-onboarding.js pattern is correct. Fix rum-config-service.js to call site.setConfig(Config.toDynamoItem(siteConfig)) after the mutation and before save, then all other paths that delegate to the service will inherit the fix.
8. PLG path derives hasDomainKey from steps.rumVerified rather than a live check - src/controllers/plg/plg-onboarding.js
plg-onboarding.js sets hasDomainKey = steps.rumVerified === true instead of calling updateRumConfig or performing a live check. This means the PLG path persists whatever the onboarding wizard recorded at the time the form was submitted, which may be stale by the time the site is created. The source of truth for hasDomainKey is now ambiguous across write paths.
9. Naive URL parsing will produce incorrect domain values - src/support/rum-config-service.js
site.getBaseURL().replace(/^https?:\/\//, '') passes path segments, ports, query strings, and trailing slashes to retrieveDomainkey. Use new URL(site.getBaseURL()).hostname to extract only the hostname.
10. setTimeout inside Promise.race is never cleared on success - src/support/rum-config-service.js
Identical to the issue in the companion audit-worker PR. When retrieveDomainkey resolves before the timeout, the pending timer is not cancelled. In Lambda this holds the event loop open past the handler's return. Use clearTimeout in a finally block.
11. Synchronous 3-second RUM check blocks the site creation response - src/controllers/sites.js
A blocking await updateRumConfig(site, context) with a 3-second timeout is now in the critical path of the site creation endpoint. Site creation latency will spike by up to 3 seconds on every call. Move this to a fire-and-forget call with its own error handling, or enqueue a background job.
12. Backfill script reads credentials from environment without upfront validation - scripts/backfill-rum-config.mjs
POSTGREST_URL and RUM_ADMIN_KEY are read from process.env without checking for presence. If either is missing, the script will fail mid-run with an unhelpful error after it has already processed some records, leaving the dataset partially updated. Validate both variables at startup with a clear error message before entering the site loop.
Minor (Nice to Have)
- The backfill script has no documented cleanup plan. One-time scripts committed without a removal ticket tend to become permanent fixtures in the repository. Add a note in the script or the PR description about when and how it should be removed.
approve-site-candidate.jscallsupdateRumConfigin both branches of an if/else that appear structurally symmetric in this regard. If both branches always call it, move the call to after the if/else to reduce duplication.updateRumConfigreturnshasDomainKeybut all callers in this PR ignore the return value. Either remove the return value or use it at the call sites - an ignored return value is a signal that the interface is not well-matched to its callers.- The
save: falseoption is easy to misuse: a caller that passessave: falseand then forgets to callsite.setConfig(Config.toDynamoItem(...))before saving will silently discard the mutation. A JSDoc comment on the option explaining the caller's responsibility would reduce this risk.
Recommendations
- Fix
rum-config-service.jsto callsite.setConfig(Config.toDynamoItem(siteConfig))beforesite.save(). This single fix, combined with removing theutils.jsmutation ordering bug, would make the service the canonical correct pattern and all delegating callers would inherit the fix. - Move the RUM check out of the synchronous request path in
sites.js. An enrichment step that can timeout for 3 seconds and throw should not affect the latency or success rate of a core write endpoint. - Add a cleanup ticket for
scripts/backfill-rum-config.mjsbefore merging.
Assessment
Ready to merge? No
The persistence bug in rum-config-service.js, the mutation ordering bug in utils.js, and the error propagation in sites.js are all blockers that will cause silent data loss or incorrect API behavior in production. The Gist URL dependency must also be resolved. The shared library PR (spacecat-shared) should be published to npm and referenced here before these downstream PRs merge.
Hey @sandsinh, thanks for the detailed review! All Critical and Important findings have been addressed. Here's a summary: Critical fixes
Important fixes
Tests updated JSDoc (rum-config-service.js): Replaced the duplicated/incomplete JSDoc block with a single clear doc that explicitly shows the caller contract for { save: false } (the three required steps — siteConfig.updateRumConfig, site.setConfig(Config.toDynamoItem(...)), site.save()) and warns that omitting them silently discards the result |
Adds updateRumConfig() service that checks the RUM API for a domain key
and stores the result in Site.config.rumConfig, giving the ASO UI a
deterministic signal for the "No RUM" dialog and banner.
Write paths covered:
- Site.create via REST (sites.js)
- add-site Slack command
- approve-site-candidate Slack action
- onboardSingleSite helper (reuses steps.rumVerified, no extra API call)
A 3-second Promise.race timeout prevents the RUM check from blocking
any of these flows. The { save: false } option avoids a double-save
in onboardSingleSite where site.save() is already called downstream.
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
…l mocks Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sites Adds scripts/backfill-rum-config.mjs — a one-time idempotent script that checks RUM domain key availability for all sites that predate the rum-config-service write path and persists hasDomainKey + lastCheckedAt into site.config.rumConfig. Supports --site-ids for targeted runs (e.g. stage validation) and --dry-run to preview without writing. Rate-limited to 500 ms per site. Co-Authored-By: Claude <noreply@anthropic.com>
- Move copyright header before eslint-disable comment (header/header rule) - Fix copyright year from 2026 to 2025 - Add curly braces to if-block (curly rule) - Simplify Promise to single-statement arrow (max-statements-per-line rule) Co-Authored-By: Claude <noreply@anthropic.com>
Expand inline arrow to a block body so setTimeout's return value is not implicitly returned from the Promise executor. Co-Authored-By: Claude <noreply@anthropic.com>
- fix(sites): make RUM config update fire-and-forget so 201 is not blocked - fix(plg): replace cached steps.rumVerified with live updateRumConfig call - fix(utils): apply hasDomainKey return value from updateRumConfig to caller's config snapshot - fix(slack): isolate RUM config update errors in approve-site-candidate and add-site handlers - fix(package): use published npm version of spacecat-shared-data-access instead of Gist URL - refactor(rum-config-service): consolidate JSDoc and document save:false caller contract - test: add toDynamoItem stub, setConfig assertions, and fire-and-forget test to sites suite - test: add updateRumConfig stub to all 12 PLG esmock instances for transitive isolation Co-Authored-By: Claude <noreply@anthropic.com>
…d in main Main changed UPHELD decision to transition to REJECTED (not INACTIVE), so the two INACTIVE notification tests no longer match the production behaviour. Keep main's new 'uses custom EXPERIENCE_URL' test unchanged. Co-Authored-By: Claude <noreply@anthropic.com>
388c391 to
026f066
Compare
…e tests Add tests for the catch blocks in approve-site-candidate.js (both new-site and existing-site paths) and add-site.js where updateRumConfig rejects. Brings Codecov patch coverage to 100% for those files. Co-Authored-By: Claude <noreply@anthropic.com>
WHAT CHANGED
Please ensure your pull request adheres to the following guidelines:
describe here the problem you're solving.
If the PR is changing the API specification:
yet. Ideally, return a 501 status code with a message explaining the feature is not implemented yet.
If the PR is changing the API implementation or an entity exposed through the API:
If the PR is introducing a new audit type:
Related Issues
Thanks for contributing!