Skip to content

fix: site-level handler overrides were ignored, org-level disable always won#1590

Open
tkotthakota-adobe wants to merge 4 commits into
mainfrom
SITES-41916
Open

fix: site-level handler overrides were ignored, org-level disable always won#1590
tkotthakota-adobe wants to merge 4 commits into
mainfrom
SITES-41916

Conversation

@tkotthakota-adobe
Copy link
Copy Markdown
Contributor

@tkotthakota-adobe tkotthakota-adobe commented May 6, 2026

https://jira.corp.adobe.com/browse/SITES-41916

Problem:
Site-level handler overrides were ignored — org-level disable always won, so upgrading a site from
free-trial to paid couldn't re-enable audits. Also, enabling a handler could crash with a TypeError if
the disabled.sites array didn't exist.

  • isHandlerEnabledForSite now checks site-level before org-level — site enable overrides org disable

  • #updatedHandler cleans up both enabled/disabled lists to keep them consistent

  • Safe array access with || [] fallback to prevent TypeError on missing keys

    Rollback

    Path A — small blast radius (data fix, no code revert)

    Trigger: 1–2 spurious site re-enables visible in the [isHandlerEnabledForSite] site-override INFO log, or a small number of reported incorrect audit
    invocations.

    1. Identify the affected (siteId, orgId, handlerType) tuple(s) from Coralogix.
    2. Either:
    • Restore the prior S3 Configuration object version (versioning is enabled) for the affected site's config, or
    • Write a corrective config that removes siteId from enabled.sites for the affected handler.
    1. No code revert required. Resolution time: minutes.

    Path B — large blast radius (code revert)

    Trigger: Precedence test regressions, broad unexpected behavior change in audit fan-out across multiple sites/orgs, or [isHandlerEnabledForSite] INFO log
    firing at unexpected volume.

    1. Revert this PR on spacecat-shared.
    2. Publish a patch release of @adobe/spacecat-shared-data-access.
    3. Bump the dependency and deploy in each consumer (separate deploy per repo):
    • spacecat-api-service
    • spacecat-audit-worker
    • spacecat-autofix-worker
    • spacecat-jobs-dispatcher
    • spacecat-reporting-worker
    1. Recovery time scales with the slowest consumer pipeline.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 6, 2026

This PR will trigger no release when merged.

if (disabledSites.includes(siteId)) {
return false;
}
if (enabledSites.includes(siteId)) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment by @solaris007
Important: This flips existing config rows where a site is in enabled.sites and the org is in disabled.orgs from disabled to enabled at deploy time. Run a query against prod to enumerate affected rows before merging.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not have prod access @solaris007 What is the best way to do this query?

Copy link
Copy Markdown
Member

@solaris007 solaris007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @tkotthakota-adobe,

Strengths

  • The new precedence ladder in isHandlerEnabledForSite (configuration.model.js:186-207) reads cleanly top-to-bottom: site-disable > site-enable > org-disable > default > org-enable. Much easier to reason about than the old nested-if approach, and the local destructuring makes each check self-documenting.
  • #updatedHandler now keeps enabled and disabled lists mutually exclusive on every transition (configuration.model.js:249-265), closing a stale-state hazard where the same entity could end up in both lists.
  • The (handler.disabled[entityKey] || []).filter(...) pattern (configuration.model.js:250-251) fixes a real correctness issue - the old .filter(...) || [] was dead code since .filter() always returns an array, but the parent could still be undefined.
  • Previously requested test for headline behavior change (Thread 2) has been addressed - configuration.model.test.js:248-256 directly asserts site-override-org-disable.
  • No new dependencies, CI passing, well-scoped diff.

Issues

Important (Should Fix)

1. Write path does not deliver the site-override for enabledByDefault: true handlers (configuration.model.js:253)

#updatedHandler only adds the entity to enabled[entityKey] when !handler.enabledByDefault. For enabledByDefault: true handlers, calling enableHandlerForSite when the org is in disabled.orgs results in: site not added to enabled.sites -> isHandlerEnabledForSite falls through to disabledOrgs check -> returns false.

The read path claims "site enable overrides org disable" universally, but the write path only delivers it for non-default handlers. The test at line 248 passes because it uses addHandler to pre-populate enabled.sites directly, bypassing the public API.

This may not affect the immediate use case (paid-tier audits are typically enabledByDefault: false), but it's the same shape of bug this PR exists to fix - a latent inconsistency between what the config reads and what the API writes.

Fix: Always add to enabled[entityKey] when enabled === true, regardless of enabledByDefault. Remove the !handler.enabledByDefault guard at line 253.

2. Pre-merge gate: enumerate prod config rows affected by the behavior flip

Building on the open thread at line 197 - @solaris007 correctly flagged that any handler where (site in enabled.sites AND org in disabled.orgs) will flip from "audit off" to "audit on" at deploy time. The author replied "I do not have prod access."

Path forward: someone with prod read access fetches the S3 singleton config and enumerates the affected (handler, siteId, orgId) triples. This is a read-only operation on a single JSON blob - should take under 30 minutes. Based on the count:

  • Zero: merge with confidence.
  • Small (<10): list them in the PR description so on-call knows what to expect.
  • Large: stage the rollout or pre-clean stale enabled.sites entries.

This is not a code fix but a deploy-readiness gate. The PR should not merge until this is answered.

3. New cleanup branches in #updatedHandler lack direct tests (configuration.model.test.js)

The PR adds write-side logic that removes entities from the opposite list on every transition (lines 250-251, 261-262 of the model). The existing tests start with handlers where the entity is NOT in the opposite list, so the cleanup is a no-op and never exercised. No test fails if those .filter(...) lines are removed.

Fix: Add round-trip tests, plus a enabledByDefault: false override test (the current test at line 248 uses enabledByDefault: true only). See inline comment for examples.

Minor (Nice to Have)

4. isHandlerEnabledForSite and isHandlerEnabledForOrg now use different precedence models, undocumented

After this PR, isHandlerEnabledForSite applies site-then-org precedence while isHandlerEnabledForOrg still uses "any disabled wins." A config where org O is in disabled.orgs and site S (in O) is in enabled.sites will report isHandlerEnabledForOrg(O) = false while isHandlerEnabledForSite(S) = true. This is likely intentional but undocumented.

Fix: Add a brief JSDoc on both methods noting the divergence.

5. "Site in both enabled.sites and disabled.sites" conflict resolution is implicit

The new code makes disabled.sites win (checked first at line 194). #updatedHandler keeps these disjoint on writes, but legacy data or manual edits could violate this. A one-line comment ("assumes enabled.sites and disabled.sites are disjoint; see #updatedHandler") or a test pinning the conflict resolution would make the invariant explicit.

Recommendations

  • Add a DEBUG-level log line when the site-override fires (enabledSites.includes(siteId) && disabledOrgs.includes(orgId)) to give on-call a grep target for post-deploy validation.
  • Update the PR description to include a "Deployment impact" section noting the behavior change for existing config rows.
  • Consider adding a JSDoc block on isHandlerEnabledForSite enumerating the full precedence rules.

Assessment

Ready to merge? No, with fixes.

Reasoning: The read-path fix is clean and well-tested for the headline case, but the write path leaves the enabledByDefault: true override unreachable via the public API (finding 1), the new cleanup branches lack direct test coverage (finding 3), and the prod-state enumeration from Thread 1 remains unanswered (finding 2). Fixing findings 1 and 3 are small code changes; finding 2 is a read-only prod query that someone with access can run.

Next Steps

  1. Fix the write path in #updatedHandler to always add to enabled[entityKey] when enabling (finding 1).
  2. Add the missing tests: enabledByDefault: false override, round-trip cleanup assertions (finding 3).
  3. Resolve Thread 1 by running the prod config enumeration before merging (finding 2).
  4. Minor items (findings 4-5) are optional but improve future readability.

@tkotthakota-adobe
Copy link
Copy Markdown
Contributor Author

Hi @solaris007 thank you for the review. I addressed review comments.

Copy link
Copy Markdown
Member

@solaris007 solaris007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @tkotthakota-adobe,

Thanks for the thorough iteration. All five prior findings + two of three recommendations cleanly addressed. Two Important test gaps remain and the prod-enumeration thread (Important issue 2 from the prior review) needs a reviewer-side action - that one is on me, not you.

Strengths

Previously flagged issues now addressed:

  • Important issue 1 (write path doesn't deliver site-override for enabledByDefault: true handlers): cleanly fixed at configuration.model.js:280-286. The if (!handler.enabledByDefault) guard is gone; enableHandlerForSite now reliably populates enabled.sites for both default and non-default handlers, making the override actually reachable via the public API. The new comment ("Always add to enabled so isHandlerEnabledForSite can find the site in enabled.sites and return true even when the org is in disabled.orgs (site-level override)") accurately describes the behavior.
  • Important issue 3 (cleanup branches lack direct tests): addressed via three new tests. Test 2 ("removes site from disabled.sites when re-enabling for a non-default handler") genuinely exercises the new write path - handler starts with disabled.sites: [siteId], after updateHandlerSites(..., true) both the cleanup branch AND the unconditional-add fix from Important issue 1 are exercised. Test 3 covers the symmetric case for enabledByDefault: true going through the else if branch.
  • Minor issue 4 (precedence divergence undocumented): JSDoc on both methods is accurate. I traced isHandlerEnabledForSite line-by-line against the 6-step ladder and it matches. JSDoc explicitly calls out the divergence between the two methods.
  • Minor issue 5 (implicit conflict resolution between enabled.sites/disabled.sites): JSDoc on isHandlerEnabledForSite documents both the disjointness assumption AND the invariant maintained by #updatedHandler. The "lists disjoint" assumption is now actively enforced by the writer rather than just assumed.
  • Recommendation 1 (DEBUG log on site-override fire): implemented at configuration.model.js:213-215. The log fires only when the override is actually triggered (site in enabled.sites AND org in disabled.orgs). Includes type, siteId, orgId for correlation.
  • Recommendation 3 (JSDoc precedence enumeration): addressed via the Minor issue 4 fix.

This iteration:

  • Write-path simplification reads cleanly. #updatedHandler no longer branches on enabledByDefault for the enable case; always-add-to-enabled / always-remove-from-disabled is easier to reason about and matches the new read path.
  • Asymmetric precedence model (site-overrides-org for isHandlerEnabledForSite; org-disable-always-wins for isHandlerEnabledForOrg) is now intentional and self-documented. The JSDoc contrasts the two methods explicitly. The asymmetry maps cleanly to how each method is used at its respective fan-out boundary (per-site vs per-org consumer in jobs-dispatcher reports).
  • The 6-step precedence table is now the de-facto contract. Future readers can cite it instead of reverse-engineering branching code.
  • Authorization model integrity is sound. (enabled.sites, disabled.orgs) configuration is written only by trusted admin paths in spacecat-api-service (admin x-api-key controllers, internal onboarding flows, Slack admin commands). End users cannot self-promote a site into enabled.sites. The "site overrides org" semantics are an explicit, policy-driven business rule (paid upgrade), not a privilege-escalation surface.
  • DEBUG log payload { type, siteId, orgId } contains only non-secret identifiers. No PII, tokens, or credentials.
  • Diff is tightly scoped to addressing prior findings; no drive-by changes.

Issues

Important (Should Fix)

  1. Headline production scenario has no end-to-end test. The PR exists to fix the free-trial -> paid upgrade path, but no test exercises the full flow: updateHandlerSites('paid-handler', siteId, true) on a state where the org is in disabled.orgs, then assert isHandlerEnabledForSite('paid-handler', site) is true. The new "paid-handler" test is read-path-only (uses addHandler to pre-populate, bypassing the write path); Test 2 exercises the write path but without org in disabled.orgs. Add one combining test - this is what tells future maintainers what the PR is for. See inline comment for the test shape.

  2. enabledByDefault: false disable branch is not directly tested. #updatedHandler now has three write paths: enable (any handler), disable with enabledByDefault: true, and disable with enabledByDefault: false (the else branch at configuration.model.js:292-294). The first two are covered; the third is not. With this PR's "always add to enabled on enable" change, the asymmetry of the third branch is load-bearing: a paid handler that gets enabled (now in enabled.sites) and then disabled must NOT linger as enabled. See inline comment for the test shape.

Recommendations

  • Prod-state enumeration (prior Important issue 2) - shifting to my side. Author correctly stated they lack prod access; this is a deployment-readiness gate, not a code fix. I will run the query against the prod Configuration singleton: pull the current S3 config, compute intersect(enabled.sites, sitesOf(disabled.orgs)) per handler, and report the count of (handler, siteId, orgId) triples that flip from disabled to enabled at deploy. If zero or small + expected, merge proceeds; if surprising, the team can decide whether to merge-as-is, pre-clean, or notify affected customers. Tracking this on the PR before approval.
  • DEBUG vs INFO for the override log. The author followed the prior review recommendation (DEBUG). Worth re-evaluating: DEBUG is typically filtered out in prod observability, so the post-deploy validation purpose may not work without changing log filters. For the first 30 days post-deploy, INFO would make the canary signal greppable by on-call without filter changes; afterwards demote to DEBUG. Reasonable counter-argument: keep DEBUG and document in the rollout runbook. Author's call - this reverses my prior recommendation, which I now think was suboptimal for the canary purpose.
  • Add "Deployment impact" section to PR description (recommendation 2 from prior review, still not addressed). One paragraph noting: configs where site-in-enabled-sites + org-in-disabled-orgs flip from disabled to enabled at deploy time, the prod-enumeration result once it is run, and the rollback path. The merge commit body should be accurate.
  • Document the rollback recovery path. If a customer reports unexpected audits-on after deploy: Path A (small blast radius) - clean up the prod config singleton (single S3 write, recovers immediately). Path B (large blast radius or unclear scope) - revert this PR in spacecat-shared, publish patch release, bump consumers (slow, multi-repo). Worth noting in the PR description or a follow-up runbook.
  • Future: precedence test matrix derived from JSDoc. The 6-row precedence ladder is now the de-facto contract. A small data-driven test where each row of the table is one assertion would lock the spec and the implementation together and prevent silent drift. Cheap to add, exactly the shape of test that does not regress when refactoring the body. Out of scope for this PR; worth a follow-up issue.

Minor

  1. Disjointness invariant lacks a pinning test. JSDoc says "disabled.sites and enabled.sites are assumed disjoint; disabled.sites wins if violated." A one-line test seeding both lists with the same siteId and asserting false would prevent silent precedence drift on legacy data.

  2. Idempotency of enable not pinned. Array.from(new Set([...])) deduplication is the only thing keeping enabled.sites from growing unbounded if updateHandlerSites(..., true) is called repeatedly. A single test asserting enabled.sites.length === 1 after two consecutive enables would lock this behavior.

  3. updateHandlerOrgs cleanup not directly tested. Same #updatedHandler code path as sites, but the read-side semantics for orgs are now formally asymmetric (no site-level override). One test verifying updateHandlerOrgs(type, orgId, true) followed by isHandlerEnabledForOrg(...) returns true after a prior disable would mirror the site-side assertion.

Assessment

Ready to merge? With fixes.

Reasoning: code-level work is clean. The write-path simplification reads well, the JSDoc captures the contract, and the cleanup branch tests address the prior review properly. Two Important test gaps remain - the headline end-to-end scenario and the enabledByDefault: false disable branch - both small additions. The prod-enumeration deployment-readiness gate (prior Important issue 2) is now reviewer-side; I will run the query and report results on the PR before approval.

Next Steps

  1. Add the two missing tests (Important issues 1 and 2). Both are small additions to the existing test file.
  2. Reviewer (me) runs the prod-state enumeration before approving. I will paste results into the PR.
  3. Decide on the DEBUG vs INFO log level question.
  4. Optional: update PR description with deployment impact and rollback path; add the Minor tests if you have appetite for completeness.

enabledByDefault: false,
enabled: { sites: [site.getId()], orgs: [] },
disabled: { sites: [], orgs: [site.getOrganizationId()] },
});
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important - the headline production scenario lacks an end-to-end test.

The PR exists to fix the free-trial -> paid upgrade path: org has handler X with enabledByDefault: false and is in disabled.orgs; we upgrade site S via updateHandlerSites('X', siteId, true); we expect isHandlerEnabledForSite('X', S) to return true afterwards. The new "paid-handler" test above is a READ-PATH-only assertion - it pre-populates state directly via addHandler and never calls updateHandlerSites. Test 2 below ("removes site from disabled.sites when re-enabling for a non-default handler") exercises the write path but with disabled.orgs: [], so the override-against-disabled-org case is not asserted post-write.

Without a test that combines write + read, a future regression that breaks updateHandlerSites for non-default handlers (e.g., someone re-introducing the if (!enabledByDefault) guard the prior review asked you to remove) would not be caught: the read-path "paid-handler" test would still pass because it bypasses the write path. Add one test:

it('upgrades non-default handler from disabled org via updateHandlerSites', () => {
  instance.addHandler('paid-handler', {
    enabledByDefault: false,
    enabled: { sites: [], orgs: [] },
    disabled: { sites: [], orgs: [site.getOrganizationId()] },
  });
  instance.updateHandlerSites('paid-handler', site.getId(), true);
  expect(instance.isHandlerEnabledForSite('paid-handler', site)).to.be.true;
});

This is the test that tells future maintainers what the PR is for.


instance.updateHandlerSites('default-cleanup-handler', site.getId(), false);

expect(instance.getHandler('default-cleanup-handler').enabled.sites).to.not.include(site.getId());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important - enabledByDefault: false disable branch is not directly tested.

#updatedHandler has three write paths after this PR's simplification:

  1. Enable: remove from disabled + add to enabled (any handler) - covered by Test 2 ("cleanup-handler")
  2. Disable + enabledByDefault: true: add to disabled + remove from enabled - covered by Test 3 ("default-cleanup-handler")
  3. Disable + enabledByDefault: false (the else branch at configuration.model.js:292-294): only remove from enabled, do NOT add to disabled - NOT covered.

With the new "always add to enabled on enable" change in this PR, the asymmetry of branch 3 is now load-bearing: a paid handler that gets enabled (now in enabled.sites) and then disabled must NOT linger in enabled.sites. Add at least:

it('removes site from enabled.sites when disabling for a non-default handler', () => {
  instance.addHandler('paid-disable-handler', {
    enabledByDefault: false,
    enabled: { sites: [site.getId()], orgs: [] },
    disabled: { sites: [], orgs: [] },
  });
  instance.updateHandlerSites('paid-disable-handler', site.getId(), false);
  expect(instance.getHandler('paid-disable-handler').enabled.sites).to.not.include(site.getId());
  expect(instance.getHandler('paid-disable-handler').disabled.sites).to.not.include(site.getId());
});

This pins the contract that disabling a non-default handler is "not in either list" (default-off applies), distinct from disabling a default handler ("in disabled list, default-on overridden").

@tkotthakota-adobe
Copy link
Copy Markdown
Contributor Author

Hi @solaris007 Let me what you find from production data and if anything else needed from me.

Copy link
Copy Markdown
Member

@solaris007 solaris007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @tkotthakota-adobe,

Returning to this after the 2026-05-08 review with no new commits to assess. The Round 1 fixes still hold cleanly, the two Round 2 test gaps remain, and tracing the consumer call sites surfaces a system-level consequence of the asymmetric precedence model that should be documented before merge.

Strengths

The prior round's findings stay addressed in commit 7462cfe3:

  • Read-path rewrite at configuration.model.js:192-226 reads top-to-bottom with destructured array bindings; the 6-step precedence ladder in JSDoc (lines 177-191) matches the code 1:1 and is a usable on-call reference.
  • Write-path simplification at configuration.model.js:279-295: enable is unconditional remove-from-disabled + add-to-enabled (the Round 1 Important 1 fix), disable is symmetric. Inline comments accurately describe intent.
  • JSDoc on isHandlerEnabledForOrg (lines 228-239) explicitly contrasts itself with the site variant - the asymmetry is now named, not implicit.
  • Cleanup branches have direct tests: Test 4 (line 268) exercises re-enable from disabled state, Test 5 (line 281) exercises disable cleanup of enabled.sites for default handlers.
  • Disjointness invariant documented in JSDoc (line 188) and actively enforced by #updatedHandler.
  • DEBUG log on override fire (line 214) is well placed and the payload { type, siteId, orgId } is non-sensitive.
  • Authorization model verified independently: the public write paths (updateHandlerSites, enableHandlerForSite) are reached only through admin-gated controllers (sites-audits-toggle.js short-circuits on accessControlUtil.hasAdminAccess()), internal onboarding flows, and the Slack admin command. End users cannot self-promote a site into enabled.sites.
  • Diff is tightly scoped to the bug fix and prior review feedback.

Issues

Important (Should Fix)

  1. Headline production scenario has no end-to-end test (test/unit/models/configuration/configuration.model.test.js).

    Re-raising from Round 2. The PR exists for the free-trial -> paid upgrade path, but no test composes write-path-then-read-path with the full org-disabled context. Test 3 (line 258) pre-populates enabled.sites via addHandler (read path only). Test 4 (line 268) exercises the write path but without org in disabled.orgs. Neither tests the exact scenario the PR is for. A future refactor of #updatedHandler could silently regress the headline behavior with all tests still green.

    Fix: one chained test, approximately 10 lines:

    it('enables a paid handler for a site whose org is in disabled.orgs (free-trial -> paid)', () => {
      instance.addHandler('paid-handler', {
        enabledByDefault: false,
        enabled: { sites: [], orgs: [] },
        disabled: { sites: [], orgs: [site.getOrganizationId()] },
      });
      expect(instance.isHandlerEnabledForSite('paid-handler', site)).to.be.false;
      instance.updateHandlerSites('paid-handler', site.getId(), true);
      expect(instance.isHandlerEnabledForSite('paid-handler', site)).to.be.true;
      expect(instance.getHandler('paid-handler').enabled.sites).to.include(site.getId());
    });
  2. enabledByDefault: false disable branch is reached but not behaviorally asserted (configuration.model.js:292-294).

    Re-raising from Round 2. The existing test at line 349-353 does exercise this branch through disableHandlerForSite('organic-keywords', site), but its assertion is expect(...disabled.sites).to.not.include(siteId) - a tautology since organic-keywords.disabled.sites was never populated by the prior enable. The branch's actual job (removing siteId from enabled.sites so the round-trip leaves no stale override) is not asserted.

    With this PR's "always add to enabled on enable" change, a paid handler can land in enabled.sites for sites whose org is in disabled.orgs. If the disable path's enabled.sites filter is broken or removed in a future refactor, a disabled handler would still read true via step 2 of the precedence ladder. Line-coverage tools would not catch it.

    Fix: add expect(...enabled.sites).to.not.include(siteId) to the line 349-353 test, or add a dedicated round-trip test on a paid handler. 3-5 lines.

  3. Cross-system consequence of the asymmetric model is not documented (configuration.model.js:228-239 JSDoc on isHandlerEnabledForOrg, PR description).

    Tracing the consumer call sites: isHandlerEnabledForOrg is used as a gate in spacecat-jobs-dispatcher/src/handlers/reports.js (filters orgs before sending org-targeted report payloads) and spacecat-reporting-worker/src/digest/handler-external.js (when a digest message arrives keyed on orgId, returns noContent() if the org is disabled AND enumerates all sites in the org without re-checking per site).

    Implication of the new asymmetric semantics: a paid site in enabled.sites whose org is in disabled.orgs will start receiving site-targeted reports (correct, this is what the PR enables) but will NOT receive org-level digests routed through orgId - the digest handler short-circuits at the org gate. The PR's stated invariant "site-level handler overrides were ignored, org-level disable always won" still holds in the digest fan-out path.

    This is an in-repo doc fix, not a cross-repo code fix. Two ways forward, your call:

    • (a) Declare it intentional: org-level reports follow org policy, site-level reports follow site policy. Document in the JSDoc on isHandlerEnabledForOrg ("intended for org-scoped report fan-out; site-level overrides do not propagate here") and in the PR description.
    • (b) Acknowledge the gap and file a Jira for the digest handler to re-check per-site. Link it from the PR description as a known limitation.

    Either is defensible. The point is the asymmetric model is documented at the method level but the cross-system consequence is invisible today.

  4. DEBUG log level defeats the canary purpose (configuration.model.js:214).

    The log fires only when the override path is taken - exactly the post-deploy validation signal. In SpaceCat Lambdas shipped to Coralogix, DEBUG is typically filtered before shipment. For the first 30 days post-deploy, on-call needs a greppable canary signal without per-subsystem filter changes. If the prod-state enumeration count is non-zero (Round 2 reviewer-side gate), the override fires on production traffic and the responder needs to confirm the behavior matches expectations.

    Fix: change this.log.debug(...) to this.log.info(...) for the first 30 days, then demote in a follow-up commit. Track the demotion ticket so it does not become permanent INFO noise. Author's call: if you prefer keeping DEBUG, write the grep target into the rollout runbook explicitly.

  5. Rollback path is undocumented in the PR description.

    The Configuration is a versioned S3 singleton, so two rollback levers exist with very different cost profiles:

    • Path A (preferred for small blast radius): restore the prior S3 object version, or write a corrective config that removes the now-conflicting (enabled.sites, disabled.orgs) rows. Recovers immediately. No redeploy.
    • Path B (large blast radius or root-cause unclear): revert this PR, publish a patch release, bump consumers in spacecat-api-service, spacecat-audit-worker, spacecat-autofix-worker, spacecat-jobs-dispatcher, spacecat-reporting-worker. Multi-repo, multi-hour.

    Fix: add a "Rollback" section to the PR description naming the S3 versioned-object lever first, then the revert-and-bump fallback.

Minor (Nice to Have)

  1. Disjointness invariant lacks a pinning test (test file). JSDoc says "disabled.sites and enabled.sites are assumed disjoint; disabled.sites wins if violated." A one-line test seeding both lists with the same siteId and asserting false would pin the conflict resolution against silent precedence drift on legacy data.

  2. Idempotency of enable not pinned (test file). Array.from(new Set([...])) deduplication is the only thing keeping enabled.sites from growing unbounded if updateHandlerSites(..., true) is called repeatedly (relevant for at-least-once admin endpoints). A single test asserting enabled.sites.length === 1 after two consecutive enables would lock this behavior.

  3. New write-shape side effect on default-enabled handlers is invisible to tests (test file). The existing tests for enabledByDefault: true handlers do not assert that enabled[entityKey] now grows on enable (it stayed empty pre-PR). Downstream consumers of getEnabledSiteIdsForHandler and direct readers of handler.enabled.* may see unexpected new entries. Tighten the existing two tests to also assert the entity now appears in enabled[entityKey].

  4. Hot-path log-volume risk (configuration.model.js:213-215). isHandlerEnabledForSite is called per-site in the jobs-dispatcher fan-out (audits, reports handlers). If the prod-state enumeration shows N affected (handler, siteId, orgId) triples, the log fires N times per dispatch tick. For a 5-min cron with thousands of sites, this multiplies. If the enumeration is non-trivial, sample the log (e.g. once per minute per handler-org pair) or move to a counter. Deferred until prod enumeration result is known.

  5. Downstream consumer set not enumerated in PR description. The fix lives in spacecat-shared but the runtime behavior change touches spacecat-api-service, spacecat-audit-worker, spacecat-autofix-worker, spacecat-jobs-dispatcher, and spacecat-reporting-worker (all read the same S3 singleton at runtime). Listing the consumers makes the deploy posture explicit for reviewers and on-call.

Recommendations

  • Codify the precedence ladder as a parametric test: the 6-row JSDoc table is now the de-facto contract for isHandlerEnabledForSite. A small data-driven test where each row is one assertion would lock spec and implementation together; this is the kind of test that does not regress when the body is refactored. Out of scope for this PR; worth a follow-up issue.
  • Move the prod-state enumeration into a PR template question, not a reviewer-side comment thread: Round 1 / Round 2 raised it as Important; the author correctly noted they lack prod access; the reviewer agreed to run it. That pattern places deploy-readiness knowledge in a reviewer's head. For behavior-flip PRs of this shape, the template should require a "Deployment impact" section with the enumeration answered before review. Workspace-level follow-up.
  • Forward-looking: align with the Entitlement model: as more handlers move to paid tiers, the Configuration.handler.enabled/disabled partition becomes the de-facto product entitlement system. The codebase already has an Entitlement model. Two sources of truth for "is this site entitled to this handler" is a brittle pattern. Worth a design conversation: should handler enable/disable per (site, org) be derived from Entitlements, or vice versa? Out of scope for this PR.
  • Post-deploy validation hooks: once Important 4 is resolved (DEBUG -> INFO), schedule explicit post-deploy checks at T+1h, T+24h, T+7d: count of override log fires, any unexpected handler types, consumer-side error spikes ([isHandlerEnabledForSite] error logs). Pairs with the rollback path documentation.
  • Add a write-time audit log at the admin entry points (sites-audits-toggle controller, Slack command, onboarding flows): the current DEBUG read-path log captures override evaluation, but the security-relevant event - WHO promoted the site WHEN - is not logged at the boundary. INFO-level audit at the controller layer with admin principal, target siteId/orgId, handler type, and before/after override state. Out of scope for this PR.

Out of scope, worth tracking

  • Reviewer-side prod-state enumeration (Round 1 Important 2, owned by solaris007): legitimate deploy-readiness gate, not a code change. Should be resolved before merge.
  • Cross-system fan-out gap if treated as a follow-up (Important 3 option b): Jira-tracked, not just commented.
  • getEnabledSiteIdsForHandler(type) will now return different (non-empty) data for default-enabled handlers after any explicit re-enable. If any consumer treats "site in enabled.sites" as "explicitly opted in" vs "default", that semantic distinction is now lost. Worth a grep follow-up in dependent repos.
  • The S3 singleton write protection is unchanged by this PR; the admin-only argument depends on IAM/bucket policy enforcement, which is an infra review item separate from this code.
  • Migration to PlatformConfiguration (mysticat-data-service) is in flight; the new API will need to inherit these precedence semantics.
  • Periodic reconciliation job that flags entries in enabled.sites whose org is in disabled.orgs against an entitlement source of truth. Hygiene control.
  • Precedence-table parametric test as a follow-up issue.
  • events2metrics rule on the override log line for dashboarding.

Assessment

Ready to merge? No, with fixes.

Reasoning: code-level work is clean and the architectural moves hold up. Three items keep this from a clean approval - two unresolved Round 2 test gaps (Importants 1 and 2, both small additions to an existing file), one new system-level consequence that should be named in JSDoc or the PR description (Important 3), one operational signal that should be addressable in this PR (Important 4 - one-line log level change), and the rollback path that should be in the description before merge (Important 5). The reviewer-side prod enumeration remains a deploy-readiness gate, not a code fix.

The long-term direction (Configuration handler enable/disable becoming a de-facto entitlement engine, alignment with the Entitlement model) deserves its own design pass before more paid handlers land. Not this PR's job to fix.

Next Steps

  1. Add the headline end-to-end test (Important 1).
  2. Tighten the enabledByDefault: false disable branch assertion (Important 2).
  3. Pick (a) or (b) for the cross-system fan-out gap and write it down (Important 3).
  4. Change DEBUG to INFO for the override log for the first 30 days (Important 4).
  5. Add a Rollback section to the PR description (Important 5).
  6. Reviewer-side prod-state enumeration before approving (out of scope, owned by solaris007).
  7. Minor items are optional but several are small additions to the existing test file.

disabled: { sites: [], orgs: [site.getOrganizationId()] },
});

expect(instance.isHandlerEnabledForSite('paid-handler', site)).to.be.true;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: headline end-to-end test missing (Round 2 unresolved).

The PR fixes the free-trial -> paid upgrade path, but no test composes write-path-then-read-path with the full org-disabled context. Test 3 (this test) pre-populates enabled.sites via addHandler (read path only). Test 4 below exercises the write path but without org in disabled.orgs. Neither tests the exact scenario the PR is for.

Fix: one chained test, approximately 10 lines:

it('enables a paid handler for a site whose org is in disabled.orgs (free-trial -> paid)', () => {
  instance.addHandler('paid-handler', {
    enabledByDefault: false,
    enabled: { sites: [], orgs: [] },
    disabled: { sites: [], orgs: [site.getOrganizationId()] },
  });
  expect(instance.isHandlerEnabledForSite('paid-handler', site)).to.be.false;
  instance.updateHandlerSites('paid-handler', site.getId(), true);
  expect(instance.isHandlerEnabledForSite('paid-handler', site)).to.be.true;
  expect(instance.getHandler('paid-handler').enabled.sites).to.include(site.getId());
});

instance.updateHandlerSites('default-cleanup-handler', site.getId(), false);

expect(instance.getHandler('default-cleanup-handler').enabled.sites).to.not.include(site.getId());
expect(instance.getHandler('default-cleanup-handler').disabled.sites).to.include(site.getId());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: enabledByDefault: false disable branch is reached but not behaviorally asserted (Round 2 unresolved).

The existing test at line 349-353 exercises this branch through disableHandlerForSite('organic-keywords', site), but its assertion is expect(...disabled.sites).to.not.include(siteId) - a tautology since organic-keywords.disabled.sites was never populated by the prior enable. The branch's actual job (removing siteId from enabled.sites so the round-trip leaves no stale override) is not asserted.

With this PR's "always add to enabled on enable" change, a paid handler can land in enabled.sites for sites whose org is in disabled.orgs. If the disable path's enabled.sites filter is broken or removed in a future refactor, a disabled handler would still read true via step 2 of the precedence ladder. Line-coverage tools would not catch it.

Fix: add expect(...enabled.sites).to.not.include(siteId) to the line 349-353 test, or add a dedicated round-trip test on a paid handler. 3-5 lines.

* Note: unlike isHandlerEnabledForSite, there is no site-level override here.
* An org in disabled.orgs is always disabled regardless of site-level entries.
*/
isHandlerEnabledForOrg(type, org) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: cross-system consequence of the asymmetric model is not documented.

Tracing the consumer call sites: isHandlerEnabledForOrg is used as a gate in spacecat-jobs-dispatcher/src/handlers/reports.js and spacecat-reporting-worker/src/digest/handler-external.js (the latter enumerates all sites in the org without re-checking per site when the org is disabled).

Implication: a paid site in enabled.sites whose org is in disabled.orgs will receive site-targeted reports (correct, this is what the PR enables) but will NOT receive org-level digests routed through orgId - the digest handler short-circuits at the org gate. The PR's stated invariant "site-level handler overrides were ignored, org-level disable always won" still holds in the digest fan-out path.

This is an in-repo doc fix. Pick one:

  • (a) Declare it intentional: extend this JSDoc to say "intended for org-scoped report fan-out; site-level overrides do not propagate here", document in PR description.
  • (b) File a Jira for the digest handler to re-check per-site, link from PR description as a known limitation.

Either is defensible. The asymmetric model is documented at the method level but the cross-system consequence is invisible today.

}
if (enabledSites.includes(siteId)) {
if (disabledOrgs.includes(orgId)) {
this.log.debug('[isHandlerEnabledForSite] site-override: site in enabled.sites overrides org in disabled.orgs', { type, siteId, orgId });
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: DEBUG log level defeats the canary purpose.

This log fires only when the override path is taken - exactly the post-deploy validation signal. In SpaceCat Lambdas shipped to Coralogix, DEBUG is typically filtered before shipment. For the first 30 days post-deploy, on-call needs a greppable canary signal without per-subsystem filter changes. If the prod-state enumeration count (reviewer-side gate) is non-zero, the override fires on production traffic and the responder needs to confirm the behavior matches expectations.

Fix: change this.log.debug(...) to this.log.info(...) for the first 30 days, then demote in a follow-up commit. Track the demotion ticket so it does not become permanent INFO noise. If you prefer keeping DEBUG, write the grep target into the rollout runbook explicitly.

@tkotthakota-adobe
Copy link
Copy Markdown
Contributor Author

Hi @solaris007 I must have missed it. Here is what addressed
Important fixes (all done):

  1. ✅ Write path bug — !handler.enabledByDefault guard removed
  2. ⚠️ Prod config enumeration — ops task, needs someone with S3 access
  3. ✅ Missing tests — e2e headline test, cleanup round-trip tests, enabledByDefault: false case
  4. ✅ JSDoc divergence — both methods document their precedence models + cross-system consequence
  5. ✅ Disjointness comment — in JSDoc + #updatedHandler
  6. ✅ DEBUG → INFO log
  7. ⚠️ Rollback section in PR description — needs to be written by you

Minor (all done):

  • ✅ Disjointness pinning test
  • ✅ Idempotency test
  • ✅ Disable branch tightened (enabled.sites cleared assertion added)

@solaris007 solaris007 self-requested a review May 12, 2026 03:51
Copy link
Copy Markdown
Member

@solaris007 solaris007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @tkotthakota-adobe,

Thanks for the thorough iteration on Round 3 feedback. Most prior findings are cleanly addressed; two operational items remain and one Minor test gap from Round 3 was not picked up.

Strengths

Previously flagged issues now addressed (Round 3):

  • Important 1 (headline e2e test missing): closed at configuration.model.test.js:268-279. The new test composes the full write-then-read round-trip starting from disabled.orgs=[orgId], asserts false initially, calls updateHandlerSites(..., true), then verifies both the read predicate flips to true AND enabled.sites was populated. Exactly the production scenario.
  • Important 2 (enabledByDefault: false disable round-trip): closed at configuration.model.test.js:385-391. Both write-shape sides pinned with enabled.sites includes before / excludes after. organic-keywords is enabledByDefault: false, so the test exercises the correct branch of #updatedHandler.
  • Important 3 (cross-system consequence not documented): closed at configuration.model.js:240-246. JSDoc verified accurate against both consumer call sites - in spacecat-jobs-dispatcher/src/handlers/reports.js, processInternal runs two independent fan-outs (sites via createSiteMapper, orgs via createOrgsMapper), so a paid site in enabled.sites whose org is in disabled.orgs does receive a siteId payload but is excluded from the org fan-out exactly as the JSDoc claims. The reporting worker re-applies isHandlerEnabledForOrg as a second gate on the orgId branch. The "asymmetry is intentional" sentence is a faithful summary.
  • Important 4 (DEBUG log defeats canary purpose): closed at configuration.model.js:214.
  • Minor 6 (disjointness pinning test): closed at configuration.model.test.js:307-315.
  • Minor 7 (idempotency test): closed at configuration.model.test.js:317-328.

This iteration:

  • Log payload { type, siteId, orgId } confirmed non-sensitive; supply-chain surface unchanged (no package.json/lockfile delta).
  • Authorization model integrity holds; the new commit is auth-neutral.
  • JSDoc + new behavioral tests together form a coherent contract.

Issues

Important (Should Fix)

1. Rollback section in PR description still missing; the procedural pushback should not stand.

You wrote in the latest top-level comment: "Rollback section in PR description - needs to be written by you." Respectfully, that inverts the standard. The person changing behavior is the person who knows what to undo and in what order. The reviewer's job is to require the section, not to write it.

This PR specifically needs two levers, both of which you are best positioned to specify:

  • Path A (small blast radius, data fix): if one or a small number of (siteId, orgId) tuples re-enable incorrectly after deploy, restore the prior S3 Configuration object version (versioning is enabled) OR write a corrective config removing the siteId from enabled.sites for the affected handler. No code revert; resolves in minutes.
  • Path B (large blast radius, code fix): if the read-path logic is broadly wrong, revert this PR, publish a patch release of spacecat-shared, bump the dependency in 5 consumers (spacecat-api-service, spacecat-audit-worker, spacecat-autofix-worker, spacecat-jobs-dispatcher, spacecat-reporting-worker). Each is a separate deploy. Recovery time scales with the slowest consumer.

The PR description should call out which signal triggers which path (e.g., 1-2 spurious site re-enables in the INFO override log -> Path A; precedence test regressions or broad behavior change in audit fan-out -> Path B). The merge commit body inherits the PR description; this is the document on-call reads when the page fires.

2. INFO log demotion follow-up is not tracked.

Round 3 explicitly required: "change to INFO for the first 30 days, then demote in a follow-up commit. Track the demotion ticket so it does not become permanent INFO noise." The DEBUG -> INFO change at configuration.model.js:214 is done, but no Jira ticket or TODO marker exists for the demotion. Without it, INFO becomes permanent and the canary signal degrades into background log volume.

Fix: file a Jira ticket for the T+30 demotion and add a code comment near line 213 referencing it (e.g., // TODO(SITES-NNNNN): demote to debug after 30-day canary window). One line of code, one ticket.

Minor (Nice to Have)

3. Minor 8 from Round 3 (write-shape side effect on default-enabled handlers) remains unresolved.

The existing tests for enabledByDefault: true handlers around configuration.model.test.js:330 and configuration.model.test.js:431 still assert only the negative case (disabled.sites does not include siteId). With this PR's write-path simplification, updateHandlerSites('<default-handler>', siteId, true) now also grows enabled.sites to [siteId] (previously stayed empty for default-true handlers). A regression where #updatedHandler stopped growing enabled.sites would slip past tests today.

One-line fix in at least one of those tests: expect(instance.getHandler('404').enabled.sites).to.include(site.getId());

4. No paired test for the documented asymmetry.

The JSDoc at configuration.model.js:240-246 makes a substantive behavioral claim - site override does NOT propagate to org-level. No test pins this. A two-line addition asserting both isHandlerEnabledForSite (true) and isHandlerEnabledForOrg (false) on the same handler with site-in-enabled-sites + org-in-disabled-orgs would lock the asymmetry against future drift.

5. Tautology kept alongside new assertion in "disables a handler for a site" test.

expect(...disabled.sites).to.not.include(site.getId()) remains in the test at configuration.model.test.js:387. For organic-keywords (enabledByDefault: false), the disable path enters the else branch and never touches disabled.sites, so this assertion is structurally guaranteed to pass. The new enabled.sites assertion is the real check; the dead line can be removed for clarity. Pure cleanup.

Recommendations

  • Optional: emit a counter alongside the INFO log keyed only on handler type (cardinality-bounded). configuration.override.site_over_org{type=X} gives Coralogix-independent observability and supports alerting once the canary window closes. Cheaper long-term signal than INFO logs.
  • Rephrase the cross-system JSDoc note more abstractly to avoid naming specific downstream services. If either changes which method gates their fan-out, the comment goes stale silently. Phrase: "callers that fan out on org policy" or point to a cross-repo spec.
  • Rename the new test handler at configuration.model.test.js:268-279 to paid-handler-migration or similar - the name paid-handler is reused from the immediately preceding test, and while beforeEach makes it safe, it's mildly confusing when diffing the file.
  • Assert mockLogger.info was called in the new e2e test if you want the canary log itself pinned by the test suite. Optional.
  • Forward-looking: handler enable/disable is becoming a de-facto entitlement engine. Configuration's isHandlerEnabledForSite / isHandlerEnabledForOrg are now expressing policy with non-obvious override semantics. The codebase already has an Entitlement model. Over time this deserves a dedicated abstraction. Not for this PR; roadmap.

Out of scope, worth tracking

  • Inverse asymmetry on the reporting-worker org-digest path: spacecat-reporting-worker/src/digest/handler-external.js calls Site.allByOrganizationId(orgId) and does NOT filter the returned sites through isHandlerEnabledForSite. A site in disabled.sites whose org is in enabled.orgs is still included in the org-level digest. In-consumer behavior, separate PR if it turns out to be wrong.
  • Prod-state enumeration (Round 1 Important 2) still owned by reviewer (me). Tracking.

Assessment

Ready to merge? With fixes.

Reasoning: code-level work is clean and the Round 3 Important findings 1-4 plus Minors 6-7 are all properly addressed. Two operational items remain - Rollback section in PR description (the author should write, not the reviewer), and a tracked demotion follow-up for the INFO log (one Jira + one code comment). Plus one Minor test gap from Round 3 (Minor 8) that wasn't picked up, and two new Minor cleanup items in the test file. None of these are code redesign; all are quick fixes to land.

Next Steps

  1. Write the Rollback section in the PR description (Path A + Path B + which signal triggers which).
  2. File a Jira ticket for INFO -> DEBUG demotion at T+30 days and add a TODO comment referencing it near configuration.model.js:213.
  3. Add a one-line assertion to one of the existing default-true handler tests to pin the new enabled.sites growth behavior (Minor 8).
  4. Optional: add the asymmetry-pinning test, drop the tautology assertion, and apply other Minor cleanups.
  5. Reviewer-side (me): run the prod-state enumeration before approving.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants