fix: onboard flow fixes around bot detection and resolve canonical urls for some sites #1556

Open: tkotthakota-adobe wants to merge 19 commits into main from SITES-43237

Conversation

@tkotthakota-adobe (Contributor) commented Apr 23, 2026

https://jira.corp.adobe.com/browse/SITES-43237

spacecat-shared-utils

resolveCanonicalUrl crashed the Lambda silently on sites with Brotli-compressed responses (e.g.,
otempo.com.br). The function only needs the final URL and status code but @adobe/fetch was
auto-decompressing the body. The abandoned stream caused an unhandled Z_BUF_ERROR that killed the
process with no error message.

  • Added decode: false to skip unnecessary body decompression — fixes the crash
  • Added 7s total deadline across all recursive calls (HEAD → GET, redirects)
  • Changed detectBotBlocker from HEAD to GET so HTML body is available for challenge page detection
    (Cloudflare, Imperva, CAPTCHA)
  • Added 3s body-read timeout in detectBotBlocker to handle slow-streaming servers
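
For illustration, a minimal sketch of the resolver shape after these changes — decode: false plus one deadline shared across hops. The function name and plumbing here are illustrative; the real code lives in url-helpers.js.

import { fetch } from '@adobe/fetch';

const TOTAL_TIMEOUT_MS = 7000; // one budget shared by HEAD, GET, and all redirect hops

// Hypothetical shape of the fix, not the repo's actual implementation.
async function resolveFinalUrl(urlString, deadline = Date.now() + TOTAL_TIMEOUT_MS) {
  const remaining = deadline - Date.now();
  if (remaining <= 0) return null; // deadline already spent by earlier hops

  const resp = await fetch(urlString, {
    method: 'HEAD',
    decode: false, // skip body decompression; only resp.url and the status are needed
    signal: AbortSignal.timeout(remaining), // bound this hop by the shared deadline
  });
  return resp.ok ? resp.url : null;
}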

@github-actions

This PR will trigger a patch release when merged.

@solaris007 (Member) left a comment

Hey @tkotthakota-adobe,

Good root-cause analysis on the Z_BUF_ERROR Lambda crash. The decode: false fix targets the actual problem, and the 7s deadline shared across recursive calls is the right shape for bounding total wall time. A few things need attention before this can merge.

Strengths

  • decode: false in resolveCanonicalUrl (url-helpers.js:167) is the correct root-cause fix for the Brotli decompression crash - skipping body decompression when only the URL and status code are needed.
  • The absolute deadline parameter (url-helpers.js:137, 150) shared across recursive calls bounds worst-case latency. Previously each hop had its own 10-20s budget and could chain unboundedly.
  • BODY_READ_TIMEOUT with Promise.race in detectBotBlocker (bot-blocker-detect.js:352-355) prevents indefinite hangs on servers that stream body data slowly after headers arrive.
  • #updatedHandler (configuration.model.js:251-269) now consistently maintains both enabled and disabled lists with defensive || [] fallback, fixing a real TypeError path.
  • The esmock-based test for response.text() rejection (bot-blocker-detect.test.js:714-733) is a clean approach for exercising the body-read fallback path.

Issues

Important (Should Fix)

Headline behavior change has no test - configuration.model.test.js:238

The PR's stated bug fix is "site enable overrides org disable." The new test only pins the opposite direction (site not listed + org in disabled.orgs -> false). There is no test constructing a handler with enabled: { sites: [siteId] } AND disabled: { orgs: [orgId] } and asserting the result is true. If a future refactor reintroduces the old precedence, CI stays green. Add at minimum:

it('returns true when site is in enabled.sites even if org is in disabled.orgs', () => {
  instance.addHandler('site-override-handler', {
    enabledByDefault: false,
    enabled: { sites: [site.getId()], orgs: [] },
    disabled: { sites: [], orgs: [site.getOrganizationId()] },
  });
  expect(instance.isHandlerEnabledForSite('site-override-handler', site)).to.be.true;
});

Similarly, the new #updatedHandler cleanup at configuration.model.js:267-268 (remove from enabled when disabling an enabled-by-default handler) is not exercised by any test. The removed c8 ignore next directives suggest these paths are now expected to be reachable.
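
A sketch of that missing cleanup test — enableHandlerForSite / disableHandlerForSite are assumed method names here; adjust to the model's actual API:

it('removes the site from enabled.sites when disabling an enabled-by-default handler', () => {
  instance.addHandler('default-on-handler', { enabledByDefault: true });
  instance.enableHandlerForSite('default-on-handler', site);  // site lands in enabled.sites
  instance.disableHandlerForSite('default-on-handler', site); // cleanup path under test
  const handler = instance.getHandler('default-on-handler');
  expect(handler.enabled?.sites || []).to.not.include(site.getId());
  expect(handler.disabled.sites).to.include(site.getId());
});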

Silent fallback paths have no operational signal - url-helpers.js:153, url-helpers.js:182, bot-blocker-detect.js:356

The PR adds three new failure paths that produce no logs, metrics, or traces:

  • url-helpers.js:153 - deadline expiry returns null, indistinguishable from "URL not resolvable"
  • url-helpers.js:182 - bare catch {} after retry exhaustion
  • bot-blocker-detect.js:356 - bare catch {} for body-read timeout, confidence silently drops to header-only

The PR description notes the original Z_BUF_ERROR was "killing the process with no error message." These new code paths reproduce the same observability gap for different failure modes. If body reads start timing out fleet-wide (e.g., a CDN change), on-call has no signal that detection quality has degraded. Add a log.warn with structured context ({ fn, url, cause }) on each fallback path. The tracingFetch import already implies a logging context is available.
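
Something like the following on each path (a sketch of the body-read fallback in detectBotBlocker; the message text and fields are illustrative):

async function readBodyOrWarn(response, url, timeoutPromise, log = console) {
  try {
    return await Promise.race([response.text(), timeoutPromise]);
  } catch (e) {
    // previously a bare catch {}; emit a structured signal before degrading
    log.warn('[detectBotBlocker] body read failed, falling back to header-only analysis', {
      fn: 'detectBotBlocker', url, cause: e?.message,
    });
    return null; // caller proceeds with header-only analysis
  }
}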

Config precedence flip needs prod data audit before deploy - configuration.model.js:186-205

The new resolution order (enabledSites before disabledOrgs) means any existing configuration row where a site is in enabled.sites AND the org is in disabled.orgs will silently flip from disabled to enabled at deploy time. This is the intended behavior for the paid-upgrade scenario, but it should be verified against production data before release. Run a query against the configuration table counting rows where any handler has both enabled.sites and disabled.orgs populated for an overlapping (site, org) pair. If the count is zero, note it in the PR. If non-zero, review those rows individually to confirm they are all intended paid upgrades.
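
For illustration, a counting sketch over an exported configuration snapshot — the handlers shape follows this repo's configuration model, but the site-to-org mapping source is an assumption:

// Counts (site, org) pairs whose effective state flips under the new precedence.
function countPrecedenceFlips(handlers, siteIdToOrgId) {
  let flips = 0;
  for (const handler of Object.values(handlers || {})) {
    const enabledSites = handler.enabled?.sites || [];
    const disabledOrgs = new Set(handler.disabled?.orgs || []);
    for (const siteId of enabledSites) {
      if (disabledOrgs.has(siteIdToOrgId[siteId])) flips += 1; // site-enable now wins
    }
  }
  return flips;
}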

SSRF blast radius expanded by HEAD -> GET - bot-blocker-detect.js:340

Both detectBotBlocker and resolveCanonicalUrl accept caller-supplied URLs and fetch them server-side from a Lambda with no allowlist or private-CIDR filter. This PR widens the surface: previously HEAD requests meant response bodies never reached the Lambda; now GET with a 3s body read means internal service responses (EC2/ECS metadata, internal APIs) could be buffered in the html variable and passed to analyzeResponse.

Recommend adding a guard that rejects URLs whose hostname is an IP literal in private/loopback/link-local ranges (169.254.169.254, 10/8, 172.16/12, 192.168/16, 127/8). Even if the api-service onboarding endpoint validates upstream, defense-in-depth at the library level prevents future callers from inheriting the same exposure.

detectBotBlocker body read has no size cap - bot-blocker-detect.js:352

response.text() buffers the full HTML body up to the 3s timeout. For large SPA shells (10MB+), this materially increases Lambda heap during the window. Challenge markers (cf-ray, recaptcha sitekey, "Just a moment...") appear in the first KB. Cap the read at the first 64-256KB, or check Content-Length and bail to header-only analysis above some threshold. At onboarding cadence this is manageable; if detectBotBlocker is later reused at audit cadence for 1000+ sites, it becomes a cost and memory concern.

Minor (Nice to Have)

Test names and assertions no longer reflect actual behavior - url-helpers.test.js:409, bot-blocker-detect.test.js:700

url-helpers.test.js:409 is named "should use 15s timeout for GET requests" with an assertion expect(duration).to.be.below(16000). The actual budget is 7s (RESOLVE_CANONICAL_URL_TOTAL_TIMEOUT). The assertion passes vacuously. Rename to reflect the 7s deadline and tighten the assertion (e.g., below(8000)).

bot-blocker-detect.test.js:700 is named "proceeds with header-only analysis when GET body cannot be read" but the test setup provides a normal 200 response with readable body content. It tests the happy-path "body has no challenge markers" branch, not the body-read failure. Rename to match actual behavior.

BODY_READ_TIMEOUT value not pinned by any test - bot-blocker-detect.js:32

Changing the constant to 30000, 0, or removing the timeout entirely would not break any test. Add a test with a slow-streaming body (e.g., nock.delay({ body: 5000 }) or fake timers) that verifies the function returns header-only analysis within approximately 3s.
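
A sketch with fake timers — the wrapper under test and the stubbed response are illustrative:

it('falls back to header-only analysis when the body streams past BODY_READ_TIMEOUT', async () => {
  const clock = sinon.useFakeTimers();
  const response = { text: () => new Promise(() => {}) }; // body read that never completes
  const pending = readBodyWithTimeout(response); // hypothetical wrapper around the Promise.race
  await clock.tickAsync(BODY_READ_TIMEOUT + 1);
  expect(await pending).to.equal(null); // timed out: header-only fallback
  clock.restore();
});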

Recommendations

  • Consider splitting this into two PRs (shared-utils URL/bot-blocker fixes and shared-data-access config precedence). They have different blast radii and rollback stories. If the precedence change needs reverting, the URL resolver crash fix gets reverted with it.
  • Document the deadline contract in the resolveCanonicalUrl JSDoc: the deadline is shared across HEAD, GET, and all redirect hops.
  • The asymmetry "HEAD retries with GET on error, GET does not retry" (url-helpers.js:184) deserves a one-line comment explaining why - it is a chosen behavior that someone will "fix" during a future incident.
  • Add an operator-facing comment on the precedence change noting: "Emergency org-wide disable via disabled.orgs will not override sites already in enabled.sites - clear those entries explicitly."

Companion PR note

PR 2243 in spacecat-api-service references this PR's packages via gist tarballs. That PR was reviewed separately with REQUEST_CHANGES (gist-pinning flagged as Critical). Some discrepancies between the two PR descriptions are worth noting:

  • PR 2243 claims "8s safety timeout via Promise.race" - this PR implements a 7s deadline via AbortSignal (not Promise.race)
  • PR 2243 claims "30s timeout on site.save()" - no such change exists in this PR
  • The gist versions (data-access 3.55.0, shared-utils 1.112.5) are behind main

These are informational - the resolution is to wait for this PR to ship through the registry and then update PR 2243 to point at the published versions.

Assessment

Ready to merge? No - with fixes

The production code changes are well-motivated and the root-cause fix is correct. But the headline behavior change (site-enable-overrides-org-disable) has no test, the new fallback paths ship without observability (the same gap that hid the original crash), and the body read has no size cap. A prod data audit for the precedence change should be attached before deploy. All fixable in this branch.

Next Steps

  1. Add the missing test for site-enable-overrides-org-disable, plus the updatedHandler cleanup test.
  2. Add structured log.warn on the three new silent fallback paths.
  3. Run the prod data audit query for the config precedence change and attach results to the PR.
  4. Add a body-size cap or Content-Length check in detectBotBlocker.
  5. The remaining items (SSRF guard, test name fixes, timeout pinning) are lower priority.

tkotthakota-adobe changed the title from "fix: onboard flow fixes" to "fix: onboard flow fixes around bot detection and resolve canonical urls for some sites" on May 6, 2026
@tkotthakota-adobe (Contributor, Author) commented May 6, 2026

Hi @solaris007, thanks for the review. I've addressed the comments below.

  1. Deadline expiry silent — added log.warn when resolveCanonicalUrl hits the deadline
  2. GET exhaustion silent — added log.warn in catch after GET retry fails
  3. Body-read timeout silent — added log.warn in detectBotBlocker body-read catch
  4. No body size cap — added 64 KB Content-Length check; skips body + warns if exceeded
  5. Wrong test name (15s) — renamed to reflect actual 7s deadline; tightened assertion to below(8000)
  6. Wrong test name (body unreadable) — renamed to reflect actual happy-path behavior
  7. BODY_READ_TIMEOUT unpinned — exported constant; added Content-Length cap test and fake-timer timeout test
  8. Moved the audit enable changes to a separate PR: "fix: site-level handler overrides were ignored, org-level disable always won" #1590

@solaris007 (Member) left a comment

Hey @tkotthakota-adobe,

Good progress since the last review - most prior findings are well-addressed and the configuration changes are cleanly split to PR 1590.

Strengths

  • Root-cause fix (decode: false at url-helpers.js:170) correctly targets the Brotli decompression crash - skipping decompression on a call that only needs resp.url and resp.ok.
  • Three structured log.warn calls with { fn, url, cause/method/contentLength } context close the observability gap that hid the original Z_BUF_ERROR crash (prior finding on silent fallback paths - addressed).
  • 64KB Content-Length check (bot-blocker-detect.js:349-351) adds first-pass body-size gating (prior finding on size cap - addressed).
  • BODY_READ_TIMEOUT exported and pinned by fake-timer test at bot-blocker-detect.test.js:449 that ticks BODY_READ_TIMEOUT + 1 and asserts the warn fires (prior finding on timeout pinning - addressed).
  • Test names and assertions corrected to match the actual 7s deadline behavior (prior finding on test names - addressed).
  • Shared deadline threaded cleanly through all recursive hops with accurate JSDoc (url-helpers.js:140-145).
  • Configuration changes cleanly split to PR 1590, narrowing this PR to the URL/bot-blocker scope.

Issues

Important (Should Fix)

1. Chunked encoding bypasses the 64KB body-size cap (bot-blocker-detect.js:349)

The Content-Length check resolves to 0 when the header is missing (chunked transfer encoding), causing the cap to be skipped entirely:

const contentLength = parseInt(response.headers.get('content-length') || '0', 10);
if (contentLength > 0 && contentLength > BODY_READ_MAX_BYTES) { ... skip ... }
else { ... read body ... }

Body read then proceeds bounded only by the 3s wall-clock timeout. A server streaming chunked data can push multi-MB into response.text() within that window. Chunked encoding is the more common case for dynamic responses (reverse proxies, internal APIs). The 64KB cap was added to address the prior review's size concern, but it only covers servers that advertise Content-Length.

Fix: Read the body via a stream reader with a running byte counter that aborts at BODY_READ_MAX_BYTES, or accept the wall-clock-only bound and document it explicitly (e.g., a comment: "Content-Length check is best-effort; chunked responses are bounded by BODY_READ_TIMEOUT only").
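
A sketch of the stream-reader option — this assumes the response body is async-iterable, as a Node-style Readable from @adobe/fetch would be; the helper name is illustrative:

const BODY_READ_MAX_BYTES = 64 * 1024;

async function readBodyCapped(response, maxBytes = BODY_READ_MAX_BYTES) {
  const chunks = [];
  let total = 0;
  for await (const chunk of response.body) {
    chunks.push(Buffer.from(chunk));
    total += chunk.length;
    if (total >= maxBytes) break; // cap hit: stop pulling further chunks
  }
  return Buffer.concat(chunks).subarray(0, maxBytes).toString('utf8');
}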

2. SSRF surface widened by HEAD to GET with no private-IP guard (unresolved from prior review) (bot-blocker-detect.js:341)

This was raised in the prior review and the response skipped it (item numbering jumped from 5 to 7). Re-evaluating against the current diff:

isValidUrl (functions.js) only checks the URL parses and uses http/https - it accepts http://169.254.169.254/, http://10.0.0.1/, http://localhost/. Both detectBotBlocker and resolveCanonicalUrl fetch caller-supplied URLs server-side from a Lambda VPC.

The HEAD-to-GET change means internal service response bodies (up to 64KB, or unbounded if chunked per finding 1) are now buffered in the Lambda. analyzeResponse does not echo the body back to the caller, so this is not a direct read primitive. However: (a) timing and type/confidence fields leak reachability information, (b) resolveCanonicalUrl returns redirect targets - including internal URLs if a site 301s to one, and (c) log.warn calls include the URL, visible to anyone reading CloudWatch.

Fix: Either add a hostname guard that rejects URLs whose hostname is a private/loopback/link-local IP literal (169.254.x.x, 10/8, 172.16/12, 192.168/16, 127/8, ::1), or document the trust model explicitly in the JSDoc: "callers MUST validate that urlString resolves to a public hostname; this library does not enforce private-network protections." The second option is acceptable if the api-service onboarding endpoint already validates upstream.

Minor (Nice to Have)

3. GET-failure log.warn not asserted by any test (url-helpers.test.js:446)

The log.warn at url-helpers.js:194 ("[resolveCanonicalUrl] GET request failed") was added to address the prior "silent fallback" finding. The test at line 446 ("should return null when GET request errors") exercises the path but doesn't inject a log stub. If the log.warn is later removed, no test breaks.

Fix: Pass log = { warn: sinon.stub() } and assert expect(log.warn).to.have.been.calledWithMatch('[resolveCanonicalUrl] GET request failed').

4. Timer leak on body-read Promise.race (bot-blocker-detect.js:356-358)

When response.text() resolves before the 3s timeout, the setTimeout keeps the timer handle alive. At onboarding cadence this is negligible, but capturing the handle and clearing it in .finally is a cheap correctness improvement.
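
Concretely, a fragment-level sketch of the fix:

let timer;
const timeout = new Promise((resolve) => {
  timer = setTimeout(() => resolve(null), BODY_READ_TIMEOUT);
});
const html = await Promise.race([response.text(), timeout])
  .finally(() => clearTimeout(timer)); // release the handle once either side settles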

Recommendations

  • Both functions default log to console. In Lambda this bypasses structured logging (Powertools/Coralogix). Callers should pass their framework logger explicitly. Consider documenting this in the JSDoc.
  • The package-lock.json includes version bumps for sibling packages (data-access 3.55.0, drs-client 1.5.0, etc.) not touched in this PR's src/. This is likely monorepo churn from merging main. Consider regenerating the lockfile to only include changes this PR requires, or note the bumps in the PR description.
  • Consider adding an analyzeBody flag (default true) to detectBotBlocker so future high-cadence callers can opt into the cheaper header-only path without paying for the GET body read on every invocation.

Assessment

Ready to merge? With fixes.

Reasoning: The root-cause fix is correct and most prior findings are well-addressed. The two remaining Important items are: (1) the Content-Length cap is incomplete for chunked encoding - a real memory-pressure path reachable from user input, and (2) the SSRF concern from the prior review that was skipped without engagement - the HEAD-to-GET change widens the surface and the library has no private-IP guard. Both are addressable in this branch. The Minor items and Recommendations are non-blocking.

Next Steps

  1. Address the chunked encoding bypass - either stream-based byte cap or document the accepted risk.
  2. Address the SSRF concern - either add a hostname guard or document the trust model in JSDoc.
  3. Optional: pin the GET-failure log.warn in the test and clear the timer handle.

@tkotthakota-adobe (Contributor, Author) commented

Hi @solaris007, thanks for the re-review. I believe I have addressed all concerns.

@tkotthakota-adobe (Contributor, Author) commented

Hi @solaris007, let me know if anything else is needed.

@tkotthakota-adobe (Contributor, Author) commented

Hi @solaris007, sorry for the back and forth on this PR. I believe I have addressed all comments; I will re-check against the review comments tomorrow, but I am pretty sure everything is covered. Once this one is approved, I will fix the tarball comment on the API PR below:

adobe/spacecat-api-service#2243

@solaris007 (Member) left a comment

Hey @tkotthakota-adobe,

Good progress on this PR. All four prior findings (P1-P4) are cleanly addressed - the SSRF guard wiring, the timer leak fix, the chunked-encoding caveat, and the GET-failure log assertion all landed well. First, a few things the guard already does correctly: IPv4 normalization (hex/octal/decimal/compressed forms like 0x7f.0.0.1, 2130706433, 127.1), case folding (LOCALHOST), userinfo (http://evil.com@127.0.0.1/), and trailing-dot IPv4 (127.0.0.1.) are all handled by the WHATWG URL parser, so those classes of bypass are not concerns. Verified empirically.

That said, this re-review surfaces a set of gaps in the new SSRF guard itself that are worth addressing before merge.

Strengths

  • All four prior findings (P1-P4) addressed:
    • P1 (chunked encoding caveat): documented at bot-blocker-detect.js:384-385 with the right level of honesty about Content-Length being best-effort.
    • P2 (SSRF guard): wired into both detectBotBlocker (throws) and resolveCanonicalUrl (returns null + log.warn). The non-2xx recursive branch correctly re-validates each hop.
    • P3 (GET-failure log assertion): test now injects a log stub and asserts log.warn.calledWithMatch (url-helpers.test.js:451-454).
    • P4 (timer leak): clearTimeout(timer) cleanly captured outside the race and cleared in .finally (bot-blocker-detect.js:392-395).
  • 13 new SSRF tests cover the documented ranges with tight error-message assertions (rejectedWith('Private IP addresses are not allowed'), not a generic rejected check).
  • resolveCanonicalUrl tests pair each rejection with a log.warn calledWithMatch assertion - observability is pinned alongside behavior.

Issues

Important (Should Fix)

1. SSRF guard has empirically-verified bypass primitives (bot-blocker-detect.js:347-361, url-helpers.js:145-158)

The blocklist documentation implies broader coverage than the code provides. The following inputs all bypass both isNonPublicUrl and isNonPublicHostname:

Input                        Hostname after URL parse   Blocked?
http://0.0.0.0/              0.0.0.0                    NO - on Linux connects to 127.0.0.1
http://0/                    0.0.0.0                    NO - single-zero shorthand
http://[::]/                 [::]                       NO - IPv6 INADDR_ANY
http://[fe80::1]/            [fe80::1]                  NO - IPv6 link-local
http://[fc00::1]/            [fc00::1]                  NO - IPv6 ULA (RFC 4193)
http://[::ffff:127.0.0.1]/   [::ffff:7f00:1]            NO - IPv4-mapped IPv6 loopback
http://localhost./           localhost.                 NO - trailing dot preserved for domain hosts

Fix: extend both functions to handle these cases, or replace the hand-rolled checks with ipaddr.js (well-audited, small, handles every IPv4/IPv6 category natively via parse(host).range() returning loopback/private/linkLocal/uniqueLocal/unspecified).
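
A sketch of the ipaddr.js route, using the 2.x API (isValid, parse, kind, range, isIPv4MappedAddress, toIPv4Address); the helper name mirrors the existing one:

import ipaddr from 'ipaddr.js';

const BLOCKED_RANGES = new Set(['loopback', 'private', 'linkLocal', 'uniqueLocal', 'unspecified']);

function isNonPublicHostname(hostname) {
  const host = hostname.toLowerCase().replace(/\.$/, ''); // fold case, strip trailing dot
  if (host === 'localhost') return true;
  const bare = host.replace(/^\[|\]$/g, ''); // unbracket IPv6 literals from URL parsing
  if (!ipaddr.isValid(bare)) return false; // domain names pass (DNS resolution is a separate layer)
  let addr = ipaddr.parse(bare);
  if (addr.kind() === 'ipv6' && addr.isIPv4MappedAddress()) {
    addr = addr.toIPv4Address(); // catches [::ffff:127.0.0.1] and friends
  }
  return BLOCKED_RANGES.has(addr.range());
}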

2. Post-redirect URL not re-validated (bot-blocker-detect.js:368-376, url-helpers.js:189-216)

tracingFetch and fetch follow HTTP redirects transparently by default (redirect: 'follow'). The current SSRF guard checks the input URL, but neither function passes redirect: 'manual', and resolveCanonicalUrl only re-validates the final URL on the non-2xx recursive branch - the 2xx success path returns resp.url directly via if (resp.ok) return ensureHttps(resp.url) (url-helpers.js:207).

Attack chain: attacker controls https://evil.example.com/ which returns 302 Location: http://127.0.0.1:9001/. Caller passes https://evil.example.com/ - guard passes - fetch internally follows the redirect, lands on 127.0.0.1:9001 - resp.ok is true - the internal URL is returned to the caller.

The JSDoc at url-helpers.js:167-168 promises "rejected on every hop including redirect targets" - the implementation only delivers this for non-2xx hops.

Fix: pass redirect: 'manual' and follow redirects manually with the guard applied on each hop. Re-validating resp.url after the fact does not prevent the outbound network connection that has already happened.
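
A sketch of the manual-follow shape — the hop budget and option plumbing are illustrative, and isNonPublicHostname is the guard above:

import { fetch } from '@adobe/fetch';

const MAX_REDIRECTS = 10;

async function guardedFetch(urlString, opts = {}) {
  let current = urlString;
  let response;
  for (let hop = 0; hop <= MAX_REDIRECTS; hop += 1) {
    const { hostname } = new URL(current);
    if (isNonPublicHostname(hostname)) {
      throw new Error('Private IP addresses are not allowed'); // guard BEFORE each connection
    }
    response = await fetch(current, { ...opts, redirect: 'manual' });
    if (response.status < 300 || response.status >= 400) break; // not a redirect: done
    const location = response.headers.get('location');
    if (!location) break; // 3xx without Location: surface it to the caller
    current = new URL(location, current).toString(); // resolve relative Location values
  }
  return response;
}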

3. SSRF guard logic duplicated across two files (bot-blocker-detect.js:341-362, url-helpers.js:144-158)

isNonPublicUrl and isNonPublicHostname implement byte-for-byte identical blocklist logic. They share the same gaps from item 1; any fix to the boundary logic needs to be applied in lockstep. Future call sites that need SSRF protection will copy whichever version is closer, and the policy diverges.

Fix: extract to a single shared helper in packages/spacecat-shared-utils/src/ (e.g., network-policy.js exposing isPublicHttpHost(hostname)). Both files import. Pairs naturally with item 1.

4. detectBotBlocker JSDoc is orphaned from its declaration (bot-blocker-detect.js:314-366)

/** detectBotBlocker JSDoc */    // <-- now orphaned
/** isNonPublicUrl JSDoc */      // <-- absorbs the previous block
function isNonPublicUrl(...)
export async function detectBotBlocker(...)  // <-- has no JSDoc attached

JSDoc parsers (TypeScript, jsdoc CLI, VS Code IntelliSense) attach docstrings to the immediately-following declaration. The detailed detectBotBlocker documentation (param types, return type enum, throws clause) is now silently attached to isNonPublicUrl. Hovering over detectBotBlocker in an IDE shows empty docs.

Fix: move isNonPublicUrl and its JSDoc above the detectBotBlocker JSDoc block. If you take the refactor in item 3, this issue goes away.

5. Silent URL parse failure in resolveCanonicalUrl (url-helpers.js:188-190, url-helpers.test.js:455)

try {
  const { hostname } = new URL(urlString);
  ...
} catch {
  return null;  // silent
}

Inconsistent with the rest of the PR's observability work: deadline expiry, GET failure, and private hostname rejection all emit structured warnings. Pre-PR, a malformed URL would propagate to the outer catch which now logs [resolveCanonicalUrl] GET request failed. Post-PR, parse errors short-circuit before fetch and produce no signal. The test at url-helpers.test.js:455 exercises the path but does not inject a log stub - the silent behavior is unintentionally pinned.

Fix: } catch (e) { log.warn('[resolveCanonicalUrl] invalid URL', { fn: 'resolveCanonicalUrl', url: urlString, cause: e?.message }); return null; }, and update the test to assert.

6. 172.x boundary edges untested (bot-blocker-detect.test.js:60-72, url-helpers.test.js:485-505)

The check a === 172 && b >= 16 && b <= 31 is the textbook off-by-one shape. Only 172.16.0.1 is tested in both files. A refactor changing b <= 31 to b < 31 would still pass every current test.

Fix: add two boundary cases per file - 172.15.255.255 (must be public) and 172.32.0.1 (must be public).
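
For example, mirroring the existing test style:

it('treats 172.15.255.255 as public (just below the 172.16/12 range)', () => {
  expect(isNonPublicHostname('172.15.255.255')).to.equal(false);
});

it('treats 172.32.0.1 as public (just above the 172.16/12 range)', () => {
  expect(isNonPublicHostname('172.32.0.1')).to.equal(false);
});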

Minor (Nice to Have)

7. clearTimeout fix has no test pinning it (bot-blocker-detect.js:392-395)

No test asserts clearTimeout is called when response.text() resolves first. Removing the .finally(() => clearTimeout(timer)) would not break any current test - the existing fake-timer test only exercises the timeout-wins branch.

Fix: with sinon.useFakeTimers(), after the body resolves assert clock.countTimers() === 0; or spy global.clearTimeout.
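
A sketch of that assertion (the wrapper under test is hypothetical):

it('clears the body-read timer when text() resolves first', async () => {
  const clock = sinon.useFakeTimers();
  const response = { text: () => Promise.resolve('<html></html>') };
  await readBodyWithTimeout(response); // hypothetical wrapper around the Promise.race
  expect(clock.countTimers()).to.equal(0); // no dangling setTimeout handle
  clock.restore();
});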

8. IPv6 loopback [::1] coverage gap in resolveCanonicalUrl tests (url-helpers.test.js)

detectBotBlocker has an [::1] test; resolveCanonicalUrl does not. Asymmetric coverage means a refactor that drops or alters one but not the other ships undetected.

Fix: add it('should reject IPv6 loopback ([::1])', ...) mirroring the detectBotBlocker test.

Recommendations

  • Extract a single shared network-policy helper (pairs with item 3). End state: one place that defines "do not fetch private hosts", fully tested, backed by ipaddr.js for IPv4 + IPv6 + boundary correctness.
  • Add a test that pins the chunked-encoding best-effort path (no Content-Length, body large but completes within timeout). The current slow-body test exercises the timeout-wins branch but not the documented "bounded by timeout only" path.
  • The magic constant 0.7 in 'detects generic CAPTCHA challenge via GET body' could use an exported symbol (e.g., CONFIDENCE_GENERIC_CHALLENGE) matching the existing pattern of CONFIDENCE_HIGH/CONFIDENCE_MEDIUM/CONFIDENCE_ABSOLUTE.
  • Add a single test feeding an unparseable input to isNonPublicUrl and asserting false. The /* c8 ignore next 3 */ suppresses coverage on a security-critical path.

Out of scope, worth tracking

  • DNS-resolved IP validation: an attacker registers a domain whose A record points directly to a private IP (no rebinding needed). Currently exploitable today and distinct from the documented rebinding caveat. If the threat model accepts this, document explicitly alongside the rebinding note. Long-term fix is dns.lookup then fetch by IP literal with the resolved Host header.
  • package-lock.json carries sibling-package version bumps (data-access 3.55.0, drs-client 1.5.0, html-analyzer 1.2.10, seo-client 1.3.0) plus new "peer": true annotations - likely main-merge churn but worth a sanity check.
  • Companion PR adobe/spacecat-api-service#2243 (gist tarball install) is reviewed separately.

Assessment

Ready to merge? With fixes recommended.

Reasoning: All four prior findings are addressed cleanly and the bug fix itself is correct. The SSRF guard is the right addition, but as currently implemented it has empirically-verified bypass primitives and does not re-validate URLs after fetch follows redirects, leaving the guard largely cosmetic against the documented threat model. Both gaps are addressable on this branch. The smaller Important items (duplication, JSDoc placement, silent parse, boundary tests) are quick refinements that pair naturally with the SSRF fix.

Next Steps

  1. Address the two SSRF guard gaps (items 1 and 2) - extending the blocklist (or swapping in ipaddr.js) and handling redirects manually.
  2. Collapse the duplicated logic into a single shared helper (item 3) - pairs naturally with item 1.
  3. Move the isNonPublicUrl block to fix the JSDoc placement (item 4) - or it goes away if item 3 is taken.
  4. Add the log.warn for malformed URL paths (item 5) and the 172.x boundary tests (item 6).
  5. The Minor items and Recommendations are non-blocking.

Inline comment on packages/spacecat-shared-utils/src/bot-blocker-detect/bot-blocker-detect.js:

* @param {string} urlString - URL to check
* @returns {boolean}
*/
function isNonPublicUrl(urlString) {

@solaris007 (Member) commented

Important: SSRF guard has empirically-verified bypass primitives. The blocklist documentation implies broader coverage than the code provides. The following inputs all bypass this function (and isNonPublicHostname in url-helpers.js):

Input                        Hostname after URL parse   Blocked?
http://0.0.0.0/              0.0.0.0                    NO - on Linux connects to 127.0.0.1
http://0/                    0.0.0.0                    NO - single-zero shorthand
http://[::]/                 [::]                       NO - IPv6 INADDR_ANY
http://[fe80::1]/            [fe80::1]                  NO - IPv6 link-local
http://[fc00::1]/            [fc00::1]                  NO - IPv6 ULA
http://[::ffff:127.0.0.1]/   [::ffff:7f00:1]            NO - IPv4-mapped IPv6 loopback
http://localhost./           localhost.                 NO - trailing dot preserved

Note: alternate IPv4 forms (0x7f.0.0.1, 0177.0.0.1, 2130706433, 127.1) ARE handled because the WHATWG URL parser normalizes them. IPv6 has no comparable normalization.

Fix: extend the blocklist, or replace with ipaddr.js (parse(host).range() returns loopback/private/linkLocal/uniqueLocal/unspecified natively).

Inline comment on packages/spacecat-shared-utils/src/bot-blocker-detect/bot-blocker-detect.js:

throw new Error('Invalid baseUrl');
}

if (isNonPublicUrl(baseUrl)) {

@solaris007 (Member) commented

Important: Post-redirect URL is not re-validated. tracingFetch and fetch follow HTTP redirects transparently by default (redirect: 'follow'). The guard above checks baseUrl, but neither function passes redirect: 'manual'. In resolveCanonicalUrl, the 2xx success path returns resp.url directly without re-validation (url-helpers.js:207).

Attack chain: attacker controls https://evil.example.com/ returning 302 Location: http://127.0.0.1:9001/. The guard passes on the input - fetch follows the redirect internally - the Lambda's outbound connection lands on the private IP. The JSDoc at url-helpers.js:167-168 claims redirect targets are validated, but the implementation only does this on the non-2xx recursive branch.

Fix: pass redirect: 'manual' and follow redirects manually with the guard applied per hop. Re-validating resp.url after the fact does not prevent the outbound network connection that has already happened.

@tkotthakota-adobe (Contributor, Author) commented

Hi @solaris007, thanks for patiently reviewing the PR. I've addressed the concerns; here's a summary of all changes made to spacecat-shared-utils:

New file: src/network-policy.js

  • Single shared SSRF guard using ipaddr.js
  • Blocks all bypass primitives: 0.0.0.0, [::], [fe80::1], [fc00::1], [::ffff:127.0.0.1], localhost., plus all
    standard private/loopback/link-local ranges
  • Handles IPv4-mapped IPv6 via toIPv4Address().range()
  • One fix point for both call sites

src/bot-blocker-detect/bot-blocker-detect.js

  • Removed hand-rolled isNonPublicUrl + orphaned JSDoc — imports isNonPublicHostname from network-policy.js
  • Added redirect: 'manual' loop (up to 10 hops) — SSRF guard runs on every redirect target before the network
    connection
  • Private-IP redirect returns { crawlable: false, type: 'ssrf-redirect-blocked', confidence: 1.0 }
  • Fixed timer leak: response.text().finally(() => clearTimeout(timer))
  • Added chunked-encoding comment on Content-Length best-effort check

src/url-helpers.js

  • Removed hand-rolled isNonPublicHostname — imports from network-policy.js
  • Added redirect: 'manual' to fetch + manual Location-header redirect following — guard runs on each hop before
    connecting
  • catch (e) with log.warn('[resolveCanonicalUrl] invalid URL', ...) replacing silent catch { return null }
  • Updated JSDoc documenting redirect: 'manual' rationale

package.json

  • Added "ipaddr.js": "2.2.0" to dependencies

Tests

  • test/network-policy.test.js (new) — full coverage of isNonPublicHostname: all bypass cases, 172.x boundaries,
    public IPs allowed
  • bot-blocker-detect.test.js — bypass cases (0.0.0.0, [::], [fe80::1], [fc00::1], [::ffff:127.0.0.1], localhost.),
    172.x boundaries, redirect-to-private-IP blocked, safe redirect followed, redirect with no Location header,
    redirect with unparseable Location header
  • url-helpers.test.js — same bypass/boundary cases, invalid URL log assertion, redirect-to-private-IP blocked,
    [::1] for resolveCanonicalUrl

@solaris007 (Member) left a comment

Hey @tkotthakota-adobe,

Solid progress on this re-review. All 6 Important findings and 1 of 2 Minor findings from the prior review are cleanly addressed - the network-policy.js extraction with ipaddr.js is the right shape, the manual redirect handling is real (verified at the network level), and the bypass primitive coverage holds up empirically. The remaining concerns are concentrated in the new redirect-loop code in detectBotBlocker - code-quality and observability polish, not security regressions.

Strengths

  • All bypass primitives from the prior review are genuinely blocked, verified end-to-end (test + implementation, not just test names): 0.0.0.0 (network-policy.test.js:32, unspecified range), [::] (line 60), [fe80::1] (65), [fc00::1]/[fd00::1] (70, 74), [::ffff:127.0.0.1] (79, via toIPv4Address().range() returning 'loopback'), [::ffff:10.0.0.1] (83), localhost. (46, trailing dot stripped), LOCALHOST (42, case-insensitive).
  • Manual redirect handling is the real thing, not a label. redirect: 'manual' is honored end-to-end (tracingFetch -> @adobe/fetch). The "blocks redirect to private IP before connecting" tests (bot-blocker-detect.test.js:696, url-helpers.test.js:1407) deliberately leave the 127.0.0.1 endpoint un-mocked; if auto-follow happened, nock would throw "No match for request". Both tests pass, proving no second-hop connection is attempted.
  • SSRF logic centralized in network-policy.js. Both call sites import; future fixes apply once.
  • 172.x boundary edges tested in both files (172.15.255.255 and 172.32.0.1 as public, 172.31.255.255 as private). Off-by-one shape locked down.
  • Structured log.warn on every fallback path with { fn, url, cause }: redirect-to-private-IP, body-too-large, body-read-failed, private-hostname-rejected, invalid-URL, GET-failure. Six observability gaps closed.
  • BODY_READ_TIMEOUT export + fake-timer test at bot-blocker-detect.test.js:1041 pins the body-read timeout behavior.
  • Prior Minor: IPv6 loopback [::1] test for resolveCanonicalUrl added at url-helpers.test.js:1353.
  • Silent URL parse failure now logged at url-helpers.js:537 with cause; test at url-helpers.test.js:1402 pins it.
  • ipaddr.js@^2.2.0 dependency is clean. No CVEs on the 2.x branch, MIT license, recommended over the vulnerable ip package. The semver pin is appropriate.

Issues

Important (Should Fix)

1. Redirect-limit exhaustion silently downgrades to "crawlable: true" (bot-blocker-detect.js:355-388)

The loop runs at most 11 iterations and breaks early on non-3xx. If every response is 3xx with a valid Location, the loop exits naturally with response.status still in [300, 400). analyzeResponse() has no 3xx branch - the code falls through to the default return { crawlable: true, type: 'unknown', confidence: 0.5 }. A site stuck in an 11+ hop redirect loop, or a deliberate redirect trap, gets reported as crawlable. No log, no distinguishing return, no test covers MAX_REDIRECTS+1.

A redirect loop is also a relevant bot-detection signal in its own right (browser challenge pages that bounce through many redirects). Misreporting it as crawlable degrades the detection quality the function is meant to provide.

Fix: after the loop, if response.status >= 300 && response.status < 400, return { crawlable: false, type: 'redirect-limit-exceeded', confidence: CONFIDENCE_HIGH }, add a log.warn, and a test that chains MAX_REDIRECTS+1 redirects.
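
A sketch of the post-loop check; the log fields and constant names follow patterns already in this PR:

// After the redirect loop in detectBotBlocker:
if (response.status >= 300 && response.status < 400) {
  log.warn('[detectBotBlocker] redirect limit exceeded', {
    fn: 'detectBotBlocker', url: baseUrl, hops: MAX_REDIRECTS,
  });
  return { crawlable: false, type: 'redirect-limit-exceeded', confidence: CONFIDENCE_HIGH };
}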

2. detectBotBlocker JSDoc no longer matches the function (bot-blocker-detect.js, JSDoc block above the function declaration)

Three drifts:

  • Method changed from HEAD to GET, and now up to 11 requests get issued. Material rate/quota implications that consumers will not see in the contract.
  • The type taxonomy listed in the JSDoc does not include the new 'ssrf-redirect-blocked' sentinel produced at line 385. Consumers branching on type (the api-service onboarding flow does) will not know to handle it.
  • The log parameter is missing from @param, and @throws {Error} If baseUrl is invalid is incomplete - the function also throws 'Private IP addresses are not allowed'.

Fix: update the JSDoc to reflect GET semantics, the redirect-following budget, the new sentinel type, the log param, and the additional throw condition. This is the public contract; consumers will read it before they read the implementation.

3. SSRF response asymmetry inside detectBotBlocker (bot-blocker-detect.js:343, 385)

The same security condition (target hostname is non-public) surfaces two different ways depending on where it is detected:

  • Initial URL -> throw new Error('Private IP addresses are not allowed')
  • Redirect target -> return { crawlable: false, type: 'ssrf-redirect-blocked', confidence: 1.0 }

Callers now have to handle both surfaces for the same policy, and no comment explains why. Pick one (the sentinel return is more consumer-friendly because it lets the caller route on type instead of catching and re-parsing an Error message) and document it. This is also what forces the throw-catch-rethrow gymnastics in Minor item 1 below.

Minor (Nice to Have)

1. Throw-catch-rethrow with string-match on error message (bot-blocker-detect.js:340-351)

try {
  const { hostname } = new URL(baseUrl);
  if (isNonPublicHostname(hostname)) {
    throw new Error('Private IP addresses are not allowed');
  }
/* c8 ignore next 6 */
} catch (e) {
  if (e.message === 'Private IP addresses are not allowed') { throw e; }
  throw new Error('Invalid baseUrl');
}

The flow throws in order to be caught and re-thrown, then matches on e.message to distinguish the security throw from a new URL() parse error. The magic-string match is brittle, and the construction is what makes the /* c8 ignore next 6 */ necessary (which also suppresses coverage on the security rethrow path). Cleaner:

let hostname;
try { ({ hostname } = new URL(baseUrl)); } catch { throw new Error('Invalid baseUrl'); }
if (isNonPublicHostname(hostname)) {
  throw new Error('Private IP addresses are not allowed');
}

No string match, no c8 ignore breadth issue, same behavior. If Important item 3 above converts the security path to a sentinel return, the whole block collapses further.

2. BLOCKED_RANGES omits non-public ranges that ipaddr.js identifies (network-policy.js:21-25)

The set has 5 entries (loopback, private, linkLocal, uniqueLocal, unspecified). ipaddr.js v2.2.0 returns range strings for other non-publicly-routable categories not in the set:

  • 'carrierGradeNat' (100.64.0.0/10, RFC 6598 shared address space; overlaps with some AWS service ranges in certain VPC configurations)
  • 'broadcast' (255.255.255.255)
  • 'multicast' (224.0.0.0/4 and IPv6 ff00::/8)
  • 'reserved' (240.0.0.0/4 and IPv6 reserved)
  • IPv6 transition ranges: '6to4' (2002::/16), 'teredo' (2001::/32), 'rfc6052' (NAT64 64:ff9b::/96)

CGN is the most material - it is shared address space the attacker can put directly in baseUrl (no DNS rebinding needed). The IPv6 transition ranges depend on whether the network path enables 6to4 relays (not the default on Lambda runtimes). Defense-in-depth, not a present-day exploit.

Fix: extend BLOCKED_RANGES:

const BLOCKED_RANGES = new Set([
  'loopback', 'private', 'linkLocal', 'uniqueLocal', 'unspecified',
  'carrierGradeNat',  // 100.64.0.0/10 (RFC 6598)
  'broadcast',        // 255.255.255.255
  'multicast',        // 224.0.0.0/4, ff00::/8
  'reserved',         // 240.0.0.0/4, IPv6 reserved
  '6to4', 'teredo', 'rfc6052',  // IPv6 transition addresses
]);

Add a corresponding test for each.
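
For example, one case per newly blocked range:

[
  ['100.64.0.1', 'carrierGradeNat'],
  ['255.255.255.255', 'broadcast'],
  ['224.0.0.1', 'multicast'],
  ['240.0.0.1', 'reserved'],
].forEach(([ip, range]) => {
  it(`blocks ${ip} (${range})`, () => {
    expect(isNonPublicHostname(ip)).to.equal(true);
  });
});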

Recommendations

  • Manual redirect-following diverges between the two callsites. detectBotBlocker follows redirects with a for-loop bounded by MAX_REDIRECTS = 10; resolveCanonicalUrl recurses, bounded only by deadline. Both implement the same idea with subtly different semantics (the recursive variant is unbounded by hop count). Any future improvement (cross-origin credential stripping, same-scheme enforcement, hostname diversity caps) has to be applied in two places. Worth a follow-up to extract a shared followRedirects(url, opts, { guard, maxHops, deadline }) helper and consume it from both callsites. Pairs with the forward-looking recommendation below.

  • Forward-looking: SSRF guard belongs at the fetch layer, not per-caller. As spacecat-shared-utils is consumed by scrapers, audits, autofix workers, and onboarding flows, every future URL-fetching callsite has to remember: validate upfront, set redirect: 'manual', drive the loop, re-validate every hop, choose an error contract. That contract will eventually be forgotten somewhere. The natural home is the shared fetch primitive itself - either inside tracingFetch (with an explicit opt-out for known-internal callsites) or behind a safeFetch wrapper. network-policy.js is the right first step; consider filing a tracking issue for the next.

  • MAX_REDIRECTS = 10 is off-by-one (bot-blocker-detect.js:354-356). The loop condition hop <= MAX_REDIRECTS runs 11 iterations. With the initial fetch counted, that is "up to 11 fetches following up to 10 redirects". Behavior is correct; the constant just slightly misnames. Either rename to MAX_HOPS = 11 or change the condition to hop < MAX_REDIRECTS.

  • Pin result.type === 'ssrf-redirect-blocked' in the "blocks redirect to private IP" test (bot-blocker-detect.test.js:696). Currently only result.crawlable === false and the log message are asserted. Locking in the type makes the new public contract testable.

  • clearTimeout fix still has no explicit assertion (prior review Minor item 7). The protective behavior is in code and the body-read-timeout test exercises the timeout-fires path. The opposite path (text() resolves first, timer must be cleared) is not directly pinned. With sinon.useFakeTimers + clock.countTimers() after a fast-resolving text(), it is a small assertion. Not material; flagging for follow-up.

Out of scope, worth tracking

  • DNS-resolved private-IP attack is the natural next layer of defense (attacker registers a domain pointing to a private IP, no rebinding needed). Mitigation requires dns.lookup + fetch-by-IP-literal with the resolved Host header, which is a larger change. Fits naturally into the safeFetch direction in the recommendation above.

Assessment

Ready to merge? With fixes.

Reasoning: The security fundamentals are sound. All 6 prior Important findings are addressed and the new commit improves the SSRF posture substantively. The three Important items remaining are concentrated in the new redirect-loop code in detectBotBlocker (silent exhaustion, JSDoc drift, response asymmetry) - they affect consumers of the library's public contract, not the security boundary. All three are small, in-branch fixes. The Minor items and Recommendations are non-blocking polish.

Next Steps

  1. Address the three Important items - redirect-limit exhaustion sentinel + log, JSDoc refresh, and pick a single SSRF response shape inside detectBotBlocker.
  2. The Minor items (throw/catch refactor, extended BLOCKED_RANGES) are quick refinements that pair naturally with item 3.
  3. Recommendations and Out-of-scope items belong in a follow-up PR or tracking issue.
