fix: onboard flow fixes around bot detection and resolve canonical urls for some sites #1556

tkotthakota-adobe wants to merge 15 commits into main
Conversation
This PR will trigger a patch release when merged.
solaris007 left a comment
Hey @tkotthakota-adobe,
Good root-cause analysis on the Z_BUF_ERROR Lambda crash. The decode: false fix targets the actual problem, and the 7s deadline shared across recursive calls is the right shape for bounding total wall time. A few things need attention before this can merge.
Strengths
- `decode: false` in `resolveCanonicalUrl` (url-helpers.js:167) is the correct root-cause fix for the Brotli decompression crash - skipping body decompression when only the URL and status code are needed.
- The absolute deadline parameter (url-helpers.js:137, 150) shared across recursive calls bounds worst-case latency. Previously each hop had its own 10-20s budget and could chain unboundedly.
- `BODY_READ_TIMEOUT` with `Promise.race` in `detectBotBlocker` (bot-blocker-detect.js:352-355) prevents indefinite hangs on servers that stream body data slowly after headers arrive.
- `#updatedHandler` (configuration.model.js:251-269) now consistently maintains both `enabled` and `disabled` lists with a defensive `|| []` fallback, fixing a real TypeError path.
- The esmock-based test for `response.text()` rejection (bot-blocker-detect.test.js:714-733) is a clean approach for exercising the body-read fallback path.
Issues
Important (Should Fix)
Headline behavior change has no test - configuration.model.test.js:238
The PR's stated bug fix is "site enable overrides org disable." The new test only pins the opposite direction (site not listed + org in disabled.orgs -> false). There is no test constructing a handler with enabled: { sites: [siteId] } AND disabled: { orgs: [orgId] } and asserting the result is true. If a future refactor reintroduces the old precedence, CI stays green. Add at minimum:
```js
it('returns true when site is in enabled.sites even if org is in disabled.orgs', () => {
  instance.addHandler('site-override-handler', {
    enabledByDefault: false,
    enabled: { sites: [site.getId()], orgs: [] },
    disabled: { sites: [], orgs: [site.getOrganizationId()] },
  });
  expect(instance.isHandlerEnabledForSite('site-override-handler', site)).to.be.true;
});
```

Similarly, the new `#updatedHandler` cleanup at configuration.model.js:267-268 (remove from enabled when disabling an enabled-by-default handler) is not exercised by any test. The removed c8 ignore next directives suggest these paths are now expected to be reachable.
Silent fallback paths have no operational signal - url-helpers.js:153, url-helpers.js:182, bot-blocker-detect.js:356
The PR adds three new failure paths that produce no logs, metrics, or traces:
- url-helpers.js:153 - deadline expiry returns null, indistinguishable from "URL not resolvable"
- url-helpers.js:182 - bare `catch {}` after retry exhaustion
- bot-blocker-detect.js:356 - bare `catch {}` for body-read timeout; confidence silently drops to header-only
The PR description notes the original Z_BUF_ERROR was "killing the process with no error message." These new code paths reproduce the same observability gap for different failure modes. If body reads start timing out fleet-wide (e.g., a CDN change), on-call has no signal that detection quality has degraded. Add a log.warn with structured context ({ fn, url, cause }) on each fallback path. The tracingFetch import already implies a logging context is available.
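A minimal sketch of what such a warn could look like on the deadline-expiry path. The `checkDeadline` helper and its signature are illustrative, not the actual code; the real fallback sits inside `resolveCanonicalUrl`:

```javascript
// Illustrative sketch only - checkDeadline is a hypothetical helper showing
// the shape of the structured warn on the deadline-expiry fallback.
function checkDeadline(deadline, url, log = console) {
  const remaining = deadline - Date.now();
  if (remaining <= 0) {
    // Structured context lets on-call distinguish "deadline expired"
    // from "URL not resolvable" in the logs.
    log.warn('[resolveCanonicalUrl] deadline expired, returning null', {
      fn: 'resolveCanonicalUrl',
      url,
      cause: 'deadline-expired',
    });
    return null;
  }
  return remaining;
}
```

The same `{ fn, url, cause }` shape applies to the bare `catch {}` blocks at url-helpers.js:182 and bot-blocker-detect.js:356.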
Config precedence flip needs prod data audit before deploy - configuration.model.js:186-205
The new resolution order (enabledSites before disabledOrgs) means any existing configuration row where a site is in enabled.sites AND the org is in disabled.orgs will silently flip from disabled to enabled at deploy time. This is the intended behavior for the paid-upgrade scenario, but it should be verified against production data before release. Run a query against the configuration table counting rows where any handler has both enabled.sites and disabled.orgs populated for an overlapping (site, org) pair. If the count is zero, note it in the PR. If non-zero, review those rows individually to confirm they are all intended paid upgrades.
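The core of that audit can be sketched as below, assuming the configuration rows have been exported and a site-to-org lookup is available. The handler/row shapes and the `siteToOrg` map are guesses for illustration, not the actual schema:

```javascript
// Hypothetical audit helper - handler shapes and the siteToOrg map are
// assumptions about exported data, not the real configuration schema.
function countPrecedenceFlips(handlers, siteToOrg) {
  let flips = 0;
  for (const [name, h] of Object.entries(handlers)) {
    const enabledSites = h.enabled?.sites || [];
    const disabledOrgs = new Set(h.disabled?.orgs || []);
    for (const siteId of enabledSites) {
      if (disabledOrgs.has(siteToOrg[siteId])) {
        // This (site, org) pair flips from disabled to enabled at deploy time.
        console.log(`flip: handler=${name} site=${siteId}`);
        flips += 1;
      }
    }
  }
  return flips;
}
```

A zero count makes the deploy a no-op for existing rows; any non-zero rows should be reviewed individually.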
SSRF blast radius expanded by HEAD -> GET - bot-blocker-detect.js:340
Both detectBotBlocker and resolveCanonicalUrl accept caller-supplied URLs and fetch them server-side from a Lambda with no allowlist or private-CIDR filter. This PR widens the surface: previously HEAD requests meant response bodies never reached the Lambda; now GET with a 3s body read means internal service responses (EC2/ECS metadata, internal APIs) could be buffered in the html variable and passed to analyzeResponse.
Recommend adding a guard that rejects URLs whose hostname is an IP literal in private/loopback/link-local ranges (169.254.169.254, 10/8, 172.16/12, 192.168/16, 127/8). Even if the api-service onboarding endpoint validates upstream, defense-in-depth at the library level prevents future callers from inheriting the same exposure.
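A sketch of such a guard for IPv4 literals follows. Function names are illustrative; IPv6 literals and DNS names that resolve to private addresses (DNS rebinding) would need separate handling:

```javascript
// Hypothetical guard sketch: reject private/loopback/link-local IPv4
// literals before fetching. Non-IP hostnames pass through unchanged.
function isPrivateIpLiteral(hostname) {
  const m = hostname.match(/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/);
  if (!m) return false;
  const a = Number(m[1]);
  const b = Number(m[2]);
  return a === 10 // 10.0.0.0/8
    || a === 127 // 127.0.0.0/8 loopback
    || (a === 169 && b === 254) // link-local, incl. 169.254.169.254
    || (a === 172 && b >= 16 && b <= 31) // 172.16.0.0/12
    || (a === 192 && b === 168); // 192.168.0.0/16
}

function assertPublicUrl(urlString) {
  const { hostname } = new URL(urlString);
  if (isPrivateIpLiteral(hostname)) {
    throw new Error(`refusing to fetch private address: ${hostname}`);
  }
}
```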
detectBotBlocker body read has no size cap - bot-blocker-detect.js:352
response.text() buffers the full HTML body up to the 3s timeout. For large SPA shells (10MB+), this materially increases Lambda heap during the window. Challenge markers (cf-ray, recaptcha sitekey, "Just a moment...") appear in the first KB. Cap the read at the first 64-256KB, or check Content-Length and bail to header-only analysis above some threshold. At onboarding cadence this is manageable; if detectBotBlocker is later reused at audit cadence for 1000+ sites, it becomes a cost and memory concern.
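One possible shape of the Content-Length bail, where the threshold value and helper name are assumptions:

```javascript
// Hypothetical sketch: bail to header-only analysis when the server
// advertises a body larger than the cap. This only helps when a
// Content-Length header is present; chunked responses need a stream-level cap.
const BODY_READ_MAX_BYTES = 64 * 1024; // assumed threshold

function shouldSkipBodyRead(headers) {
  const contentLength = parseInt(headers.get('content-length') || '0', 10);
  return contentLength > BODY_READ_MAX_BYTES;
}
```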
Minor (Nice to Have)
Test names and assertions no longer reflect actual behavior - url-helpers.test.js:409, bot-blocker-detect.test.js:700
url-helpers.test.js:409 is named "should use 15s timeout for GET requests" with an assertion expect(duration).to.be.below(16000). The actual budget is 7s (RESOLVE_CANONICAL_URL_TOTAL_TIMEOUT). The assertion passes vacuously. Rename to reflect the 7s deadline and tighten the assertion (e.g., below(8000)).
bot-blocker-detect.test.js:700 is named "proceeds with header-only analysis when GET body cannot be read" but the test setup provides a normal 200 response with readable body content. It tests the happy-path "body has no challenge markers" branch, not the body-read failure. Rename to match actual behavior.
BODY_READ_TIMEOUT value not pinned by any test - bot-blocker-detect.js:32
Changing the constant to 30000, 0, or removing the timeout entirely would not break any test. Add a test with a slow-streaming body (e.g., nock.delay({ body: 5000 }) or fake timers) that verifies the function returns header-only analysis within approximately 3s.
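A dependency-free sketch of the behavior such a test should pin; the real test would use nock or sinon fake timers against `detectBotBlocker`, and the timeout is shrunk here from the assumed 3000ms only to keep the sketch fast:

```javascript
// Illustrative sketch of the race the test should pin: a body read
// against a timeout. TIMEOUT_MS stands in for BODY_READ_TIMEOUT.
const TIMEOUT_MS = 50;

function readBodyOrTimeout(bodyPromise, timeoutMs = TIMEOUT_MS) {
  return Promise.race([
    bodyPromise,
    new Promise((_, reject) => {
      setTimeout(() => reject(new Error('body-read-timeout')), timeoutMs);
    }),
  ]);
}

// A body promise that never settles stands in for a slow-streaming server.
readBodyOrTimeout(new Promise(() => {}))
  .catch((err) => console.log(err.message)); // logs 'body-read-timeout' after ~50ms
```

Asserting that the rejection arrives close to `TIMEOUT_MS` is what pins the constant: setting it to 30000 or removing the race would then fail the test.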
Recommendations
- Consider splitting this into two PRs (shared-utils URL/bot-blocker fixes and shared-data-access config precedence). They have different blast radii and rollback stories. If the precedence change needs reverting, the URL resolver crash fix gets reverted with it.
- Document the deadline contract in the `resolveCanonicalUrl` JSDoc: the deadline is shared across HEAD, GET, and all redirect hops.
- The asymmetry "HEAD retries with GET on error, GET does not retry" (url-helpers.js:184) deserves a one-line comment explaining why - it is a chosen behavior that someone will "fix" during a future incident.
- Add an operator-facing comment on the precedence change noting: "Emergency org-wide disable via `disabled.orgs` will not override sites already in `enabled.sites` - clear those entries explicitly."
Companion PR note
PR 2243 in spacecat-api-service references this PR's packages via gist tarballs. That PR was reviewed separately with REQUEST_CHANGES (gist-pinning flagged as Critical). Some discrepancies between the two PR descriptions are worth noting:
- PR 2243 claims "8s safety timeout via Promise.race" - this PR implements a 7s deadline via AbortSignal (not Promise.race)
- PR 2243 claims "30s timeout on site.save()" - no such change exists in this PR
- The gist versions (data-access 3.55.0, shared-utils 1.112.5) are behind main
These are informational - the resolution is to wait for this PR to ship through the registry and then update PR 2243 to point at the published versions.
Assessment
Ready to merge? No - with fixes
The production code changes are well-motivated and the root-cause fix is correct. But the headline behavior change (site-enable-overrides-org-disable) has no test, the new fallback paths ship without observability (the same gap that hid the original crash), and the body read has no size cap. A prod data audit for the precedence change should be attached before deploy. All fixable in this branch.
Next Steps
- Add the missing test for site-enable-overrides-org-disable, plus the updatedHandler cleanup test.
- Add structured `log.warn` on the three new silent fallback paths.
- Run the prod data audit query for the config precedence change and attach results to the PR.
- Add a body-size cap or Content-Length check in detectBotBlocker.
- The remaining items (SSRF guard, test name fixes, timeout pinning) are lower priority.
```js
    response.text(),
    new Promise((_, reject) => { setTimeout(() => reject(new Error('body-read-timeout')), BODY_READ_TIMEOUT); }),
  ]);
} catch {
```
Important: This catch block swallows body-read timeouts with no log. If body reads start failing fleet-wide, detection silently degrades to header-only with no operational signal. Add a log.warn with baseUrl and error.message.
```js
) {
  const remaining = deadline - Date.now();
  if (remaining <= 0) {
    return null;
```
Important: Deadline expiry returns null with no log or metric - indistinguishable from URL not resolvable. Same observability gap that hid the original Z_BUF_ERROR. Add a log.warn here and on the bare catch at line 182.
```js
// Promise.race guards against servers that stream body slowly after headers arrive.
// tracingFetch clears its AbortSignal in finally{} before returning, so response.text()
// has no built-in timeout.
html = await Promise.race([
```
Important: response.text() buffers the full body up to 3s with no size cap. Challenge markers appear in the first KB - consider capping the read at 64-256KB to limit Lambda memory pressure.
Hi @solaris007, thanks for the review. Addressed the comments below.
solaris007 left a comment
Hey @tkotthakota-adobe,
Good progress since the last review - most prior findings are well-addressed and the configuration changes are cleanly split to PR 1590.
Strengths
- Root-cause fix (`decode: false` at url-helpers.js:170) correctly targets the Brotli decompression crash - skipping decompression on a call that only needs `resp.url` and `resp.ok`.
- Three structured `log.warn` calls with `{ fn, url, cause/method/contentLength }` context close the observability gap that hid the original Z_BUF_ERROR crash (prior finding on silent fallback paths - addressed).
- 64KB Content-Length check (bot-blocker-detect.js:349-351) adds first-pass body-size gating (prior finding on size cap - addressed).
- `BODY_READ_TIMEOUT` exported and pinned by a fake-timer test at bot-blocker-detect.test.js:449 that ticks `BODY_READ_TIMEOUT + 1` and asserts the warn fires (prior finding on timeout pinning - addressed).
- Test names and assertions corrected to match the actual 7s deadline behavior (prior finding on test names - addressed).
- Shared deadline threaded cleanly through all recursive hops with accurate JSDoc (url-helpers.js:140-145).
- Configuration changes cleanly split to PR 1590, narrowing this PR to the URL/bot-blocker scope.
Issues
Important (Should Fix)
1. Chunked encoding bypasses the 64KB body-size cap (bot-blocker-detect.js:349)
The Content-Length check resolves to 0 when the header is missing (chunked transfer encoding), causing the cap to be skipped entirely:
```js
const contentLength = parseInt(response.headers.get('content-length') || '0', 10);
if (contentLength > 0 && contentLength > BODY_READ_MAX_BYTES) { /* ... skip ... */ }
else { /* ... read body ... */ }
```

Body read then proceeds bounded only by the 3s wall-clock timeout. A server streaming chunked data can push multi-MB into response.text() within that window. Chunked encoding is the more common case for dynamic responses (reverse proxies, internal APIs). The 64KB cap was added to address the prior review's size concern, but it only covers servers that advertise Content-Length.
Fix: Read the body via a stream reader with a running byte counter that aborts at BODY_READ_MAX_BYTES, or accept the wall-clock-only bound and document it explicitly (e.g., a comment: "Content-Length check is best-effort; chunked responses are bounded by BODY_READ_TIMEOUT only").
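The stream-reader option can be sketched as below. This assumes a WHATWG ReadableStream body (a Node Readable would use async iteration instead), and the cap is chunk-granular, so it can overshoot by up to one chunk:

```javascript
// Hypothetical sketch of a stream-level byte cap for chunked responses;
// BODY_READ_MAX_BYTES and the Response body shape are assumptions.
const BODY_READ_MAX_BYTES = 64 * 1024;

async function readBodyCapped(body, maxBytes = BODY_READ_MAX_BYTES) {
  const decoder = new TextDecoder();
  const reader = body.getReader();
  let bytes = 0;
  let html = '';
  while (bytes < maxBytes) {
    const { done, value } = await reader.read();
    if (done) break;
    bytes += value.byteLength;
    // stream: true tolerates multibyte sequences split across chunks.
    html += decoder.decode(value, { stream: true });
  }
  // Cancel so the server stops streaming once the cap is reached.
  await reader.cancel().catch(() => {});
  return html;
}
```

The first 64KB is more than enough for the challenge markers the analyzer looks for, so truncation does not hurt detection quality.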
2. SSRF surface widened by HEAD to GET with no private-IP guard (unresolved from prior review) (bot-blocker-detect.js:341)
This was raised in the prior review and the response skipped it (item numbering jumped from 5 to 7). Re-evaluating against the current diff:
isValidUrl (functions.js) only checks the URL parses and uses http/https - it accepts http://169.254.169.254/, http://10.0.0.1/, http://localhost/. Both detectBotBlocker and resolveCanonicalUrl fetch caller-supplied URLs server-side from a Lambda VPC.
The HEAD-to-GET change means internal service response bodies (up to 64KB, or unbounded if chunked per finding 1) are now buffered in the Lambda. analyzeResponse does not echo the body back to the caller, so this is not a direct read primitive. However: (a) timing and type/confidence fields leak reachability information, (b) resolveCanonicalUrl returns redirect targets - including internal URLs if a site 301s to one, and (c) log.warn calls include the URL, visible to anyone reading CloudWatch.
Fix: Either add a hostname guard that rejects URLs whose hostname is a private/loopback/link-local IP literal (169.254.x.x, 10/8, 172.16/12, 192.168/16, 127/8, ::1), or document the trust model explicitly in the JSDoc: "callers MUST validate that urlString resolves to a public hostname; this library does not enforce private-network protections." The second option is acceptable if the api-service onboarding endpoint already validates upstream.
Minor (Nice to Have)
3. GET-failure log.warn not asserted by any test (url-helpers.test.js:446)
The log.warn at url-helpers.js:194 ("[resolveCanonicalUrl] GET request failed") was added to address the prior "silent fallback" finding. The test at line 446 ("should return null when GET request errors") exercises the path but doesn't inject a log stub. If the log.warn is later removed, no test breaks.
Fix: Pass log = { warn: sinon.stub() } and assert expect(log.warn).to.have.been.calledWithMatch('[resolveCanonicalUrl] GET request failed').
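A dependency-free sketch of that assertion pattern, using a plain stub in place of sinon. Here `fakeResolve` is an illustrative stand-in for `resolveCanonicalUrl` with an injected failing fetch; the real test would stub `tracingFetch` and call the actual function:

```javascript
// Illustrative only: fakeResolve mimics the GET-failure fallback so the
// stub-assertion pattern is runnable in isolation.
function makeLogStub() {
  const warnCalls = [];
  return { warnCalls, warn: (...args) => warnCalls.push(args) };
}

async function fakeResolve(url, { fetchImpl, log }) {
  try {
    const resp = await fetchImpl(url);
    return resp.url;
  } catch (err) {
    log.warn('[resolveCanonicalUrl] GET request failed', { url, cause: err.message });
    return null;
  }
}
```

With the stub injected, the test asserts both the null return and the warn message, so deleting the `log.warn` later breaks CI.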
4. Timer leak on body-read Promise.race (bot-blocker-detect.js:356-358)
When response.text() resolves before the 3s timeout, the setTimeout keeps the timer handle alive. At onboarding cadence this is negligible, but capturing the handle and clearing it in .finally is a cheap correctness improvement.
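A minimal sketch of that cleanup (the wrapper name is illustrative):

```javascript
// Sketch: capture the timeout handle and clear it once the race settles,
// so a fast body read does not leave a live timer behind.
function raceBodyRead(bodyPromise, timeoutMs) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('body-read-timeout')), timeoutMs);
  });
  return Promise.race([bodyPromise, timeout]).finally(() => clearTimeout(timer));
}
```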
Recommendations
- Both functions default `log` to `console`. In Lambda this bypasses structured logging (Powertools/Coralogix). Callers should pass their framework logger explicitly. Consider documenting this in the JSDoc.
- The package-lock.json includes version bumps for sibling packages (data-access 3.55.0, drs-client 1.5.0, etc.) not touched in this PR's src/. This is likely monorepo churn from merging main. Consider regenerating the lockfile to only include changes this PR requires, or note the bumps in the PR description.
- Consider adding an `analyzeBody` flag (default `true`) to `detectBotBlocker` so future high-cadence callers can opt into the cheaper header-only path without paying for the GET body read on every invocation.
Assessment
Ready to merge? With fixes.
Reasoning: The root-cause fix is correct and most prior findings are well-addressed. The two remaining Important items are: (1) the Content-Length cap is incomplete for chunked encoding - a real memory-pressure path reachable from user input, and (2) the SSRF concern from the prior review that was skipped without engagement - the HEAD-to-GET change widens the surface and the library has no private-IP guard. Both are addressable in this branch. The Minor items and Recommendations are non-blocking.
Next Steps
- Address the chunked encoding bypass - either stream-based byte cap or document the accepted risk.
- Address the SSRF concern - either add a hostname guard or document the trust model in JSDoc.
- Optional: pin the GET-failure log.warn in the test and clear the timer handle.
Hi @solaris007, thanks for the re-review. I believe I addressed all concerns.

Hi @solaris007, let me know if anything else is needed.
https://jira.corp.adobe.com/browse/SITES-43237
spacecat-shared-utils
resolveCanonicalUrl crashed the Lambda silently on sites with Brotli-compressed responses (e.g.,
otempo.com.br). The function only needs the final URL and status code but @adobe/fetch was
auto-decompressing the body. The abandoned stream caused an unhandled Z_BUF_ERROR that killed the
process with no error message.
(Cloudflare, Imperva, CAPTCHA)