fix: onboard flow fixes around bot detection and resolve canonical urls for some sites #1556

tkotthakota-adobe wants to merge 15 commits into main
Conversation
This PR will trigger a patch release when merged.
solaris007 left a comment
Hey @tkotthakota-adobe,
Good root-cause analysis on the Z_BUF_ERROR Lambda crash. The decode: false fix targets the actual problem, and the 7s deadline shared across recursive calls is the right shape for bounding total wall time. A few things need attention before this can merge.
Strengths
- `decode: false` in `resolveCanonicalUrl` (url-helpers.js:167) is the correct root-cause fix for the Brotli decompression crash - skipping body decompression when only the URL and status code are needed.
- The absolute deadline parameter (url-helpers.js:137, 150) shared across recursive calls bounds worst-case latency. Previously each hop had its own 10-20s budget and could chain unboundedly.
- `BODY_READ_TIMEOUT` with `Promise.race` in `detectBotBlocker` (bot-blocker-detect.js:352-355) prevents indefinite hangs on servers that stream body data slowly after headers arrive.
- `#updatedHandler` (configuration.model.js:251-269) now consistently maintains both `enabled` and `disabled` lists with a defensive `|| []` fallback, fixing a real TypeError path.
- The esmock-based test for `response.text()` rejection (bot-blocker-detect.test.js:714-733) is a clean approach for exercising the body-read fallback path.
Issues
Important (Should Fix)
Headline behavior change has no test - configuration.model.test.js:238
The PR's stated bug fix is "site enable overrides org disable." The new test only pins the opposite direction (site not listed + org in disabled.orgs -> false). There is no test constructing a handler with enabled: { sites: [siteId] } AND disabled: { orgs: [orgId] } and asserting the result is true. If a future refactor reintroduces the old precedence, CI stays green. Add at minimum:
```js
it('returns true when site is in enabled.sites even if org is in disabled.orgs', () => {
  instance.addHandler('site-override-handler', {
    enabledByDefault: false,
    enabled: { sites: [site.getId()], orgs: [] },
    disabled: { sites: [], orgs: [site.getOrganizationId()] },
  });
  expect(instance.isHandlerEnabledForSite('site-override-handler', site)).to.be.true;
});
```

Similarly, the new `#updatedHandler` cleanup at configuration.model.js:267-268 (remove from enabled when disabling an enabled-by-default handler) is not exercised by any test. The removed c8 ignore next directives suggest these paths are now expected to be reachable.
Silent fallback paths have no operational signal - url-helpers.js:153, url-helpers.js:182, bot-blocker-detect.js:356
The PR adds three new failure paths that produce no logs, metrics, or traces:
- url-helpers.js:153 - deadline expiry returns null, indistinguishable from "URL not resolvable"
- url-helpers.js:182 - bare `catch {}` after retry exhaustion
- bot-blocker-detect.js:356 - bare `catch {}` for body-read timeout; confidence silently drops to header-only
The PR description notes the original Z_BUF_ERROR was "killing the process with no error message." These new code paths reproduce the same observability gap for different failure modes. If body reads start timing out fleet-wide (e.g., a CDN change), on-call has no signal that detection quality has degraded. Add a log.warn with structured context ({ fn, url, cause }) on each fallback path. The tracingFetch import already implies a logging context is available.
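A minimal sketch of what such a warn could look like on the deadline-expiry path. The `checkDeadline` helper and its signature are illustrative, not the actual code; the real fallback sits inside `resolveCanonicalUrl`:

```javascript
// Illustrative sketch only - checkDeadline is a hypothetical helper showing
// the shape of the structured warn on the deadline-expiry fallback.
function checkDeadline(deadline, url, log = console) {
  const remaining = deadline - Date.now();
  if (remaining <= 0) {
    // Structured context lets on-call distinguish "deadline expired"
    // from "URL not resolvable" in the logs.
    log.warn('[resolveCanonicalUrl] deadline expired, returning null', {
      fn: 'resolveCanonicalUrl',
      url,
      cause: 'deadline-expired',
    });
    return null;
  }
  return remaining;
}
```

The same `{ fn, url, cause }` shape applies to the bare `catch {}` blocks at url-helpers.js:182 and bot-blocker-detect.js:356.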
Config precedence flip needs prod data audit before deploy - configuration.model.js:186-205
The new resolution order (enabledSites before disabledOrgs) means any existing configuration row where a site is in enabled.sites AND the org is in disabled.orgs will silently flip from disabled to enabled at deploy time. This is the intended behavior for the paid-upgrade scenario, but it should be verified against production data before release. Run a query against the configuration table counting rows where any handler has both enabled.sites and disabled.orgs populated for an overlapping (site, org) pair. If the count is zero, note it in the PR. If non-zero, review those rows individually to confirm they are all intended paid upgrades.
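The core of that audit can be sketched as below, assuming the configuration rows have been exported and a site-to-org lookup is available. The handler/row shapes and the `siteToOrg` map are guesses for illustration, not the actual schema:

```javascript
// Hypothetical audit helper - handler shapes and the siteToOrg map are
// assumptions about exported data, not the real configuration schema.
function countPrecedenceFlips(handlers, siteToOrg) {
  let flips = 0;
  for (const [name, h] of Object.entries(handlers)) {
    const enabledSites = h.enabled?.sites || [];
    const disabledOrgs = new Set(h.disabled?.orgs || []);
    for (const siteId of enabledSites) {
      if (disabledOrgs.has(siteToOrg[siteId])) {
        // This (site, org) pair flips from disabled to enabled at deploy time.
        console.log(`flip: handler=${name} site=${siteId}`);
        flips += 1;
      }
    }
  }
  return flips;
}
```

A zero count makes the deploy a no-op for existing rows; any non-zero rows should be reviewed individually.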
SSRF blast radius expanded by HEAD -> GET - bot-blocker-detect.js:340
Both detectBotBlocker and resolveCanonicalUrl accept caller-supplied URLs and fetch them server-side from a Lambda with no allowlist or private-CIDR filter. This PR widens the surface: previously HEAD requests meant response bodies never reached the Lambda; now GET with a 3s body read means internal service responses (EC2/ECS metadata, internal APIs) could be buffered in the html variable and passed to analyzeResponse.
Recommend adding a guard that rejects URLs whose hostname is an IP literal in private/loopback/link-local ranges (169.254.169.254, 10/8, 172.16/12, 192.168/16, 127/8). Even if the api-service onboarding endpoint validates upstream, defense-in-depth at the library level prevents future callers from inheriting the same exposure.
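A sketch of such a guard for IPv4 literals follows. Function names are illustrative; IPv6 literals and DNS names that resolve to private addresses (DNS rebinding) would need separate handling:

```javascript
// Hypothetical guard sketch: reject private/loopback/link-local IPv4
// literals before fetching. Non-IP hostnames pass through unchanged.
function isPrivateIpLiteral(hostname) {
  const m = hostname.match(/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/);
  if (!m) return false;
  const a = Number(m[1]);
  const b = Number(m[2]);
  return a === 10 // 10.0.0.0/8
    || a === 127 // 127.0.0.0/8 loopback
    || (a === 169 && b === 254) // link-local, incl. 169.254.169.254
    || (a === 172 && b >= 16 && b <= 31) // 172.16.0.0/12
    || (a === 192 && b === 168); // 192.168.0.0/16
}

function assertPublicUrl(urlString) {
  const { hostname } = new URL(urlString);
  if (isPrivateIpLiteral(hostname)) {
    throw new Error(`refusing to fetch private address: ${hostname}`);
  }
}
```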
detectBotBlocker body read has no size cap - bot-blocker-detect.js:352
response.text() buffers the full HTML body up to the 3s timeout. For large SPA shells (10MB+), this materially increases Lambda heap during the window. Challenge markers (cf-ray, recaptcha sitekey, "Just a moment...") appear in the first KB. Cap the read at the first 64-256KB, or check Content-Length and bail to header-only analysis above some threshold. At onboarding cadence this is manageable; if detectBotBlocker is later reused at audit cadence for 1000+ sites, it becomes a cost and memory concern.
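One possible shape of the Content-Length bail, where the threshold value and helper name are assumptions:

```javascript
// Hypothetical sketch: bail to header-only analysis when the server
// advertises a body larger than the cap. This only helps when a
// Content-Length header is present; chunked responses need a stream-level cap.
const BODY_READ_MAX_BYTES = 64 * 1024; // assumed threshold

function shouldSkipBodyRead(headers) {
  const contentLength = parseInt(headers.get('content-length') || '0', 10);
  return contentLength > BODY_READ_MAX_BYTES;
}
```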
Minor (Nice to Have)
Test names and assertions no longer reflect actual behavior - url-helpers.test.js:409, bot-blocker-detect.test.js:700
url-helpers.test.js:409 is named "should use 15s timeout for GET requests" with an assertion expect(duration).to.be.below(16000). The actual budget is 7s (RESOLVE_CANONICAL_URL_TOTAL_TIMEOUT). The assertion passes vacuously. Rename to reflect the 7s deadline and tighten the assertion (e.g., below(8000)).
bot-blocker-detect.test.js:700 is named "proceeds with header-only analysis when GET body cannot be read" but the test setup provides a normal 200 response with readable body content. It tests the happy-path "body has no challenge markers" branch, not the body-read failure. Rename to match actual behavior.
BODY_READ_TIMEOUT value not pinned by any test - bot-blocker-detect.js:32
Changing the constant to 30000, 0, or removing the timeout entirely would not break any test. Add a test with a slow-streaming body (e.g., nock.delay({ body: 5000 }) or fake timers) that verifies the function returns header-only analysis within approximately 3s.
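A dependency-free sketch of the behavior such a test should pin; the real test would use nock or sinon fake timers against `detectBotBlocker`, and the timeout is shrunk here from the assumed 3000ms only to keep the sketch fast:

```javascript
// Illustrative sketch of the race the test should pin: a body read
// against a timeout. TIMEOUT_MS stands in for BODY_READ_TIMEOUT.
const TIMEOUT_MS = 50;

function readBodyOrTimeout(bodyPromise, timeoutMs = TIMEOUT_MS) {
  return Promise.race([
    bodyPromise,
    new Promise((_, reject) => {
      setTimeout(() => reject(new Error('body-read-timeout')), timeoutMs);
    }),
  ]);
}

// A body promise that never settles stands in for a slow-streaming server.
readBodyOrTimeout(new Promise(() => {}))
  .catch((err) => console.log(err.message)); // logs 'body-read-timeout' after ~50ms
```

Asserting that the rejection arrives close to `TIMEOUT_MS` is what pins the constant: setting it to 30000 or removing the race would then fail the test.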
Recommendations
- Consider splitting this into two PRs (shared-utils URL/bot-blocker fixes and shared-data-access config precedence). They have different blast radii and rollback stories. If the precedence change needs reverting, the URL resolver crash fix gets reverted with it.
- Document the deadline contract in the `resolveCanonicalUrl` JSDoc: the deadline is shared across HEAD, GET, and all redirect hops.
- The asymmetry "HEAD retries with GET on error, GET does not retry" (url-helpers.js:184) deserves a one-line comment explaining why - it is a chosen behavior that someone will "fix" during a future incident.
- Add an operator-facing comment on the precedence change noting: "Emergency org-wide disable via `disabled.orgs` will not override sites already in `enabled.sites` - clear those entries explicitly."
Companion PR note
PR 2243 in spacecat-api-service references this PR's packages via gist tarballs. That PR was reviewed separately with REQUEST_CHANGES (gist-pinning flagged as Critical). Some discrepancies between the two PR descriptions are worth noting:
- PR 2243 claims "8s safety timeout via Promise.race" - this PR implements a 7s deadline via AbortSignal (not Promise.race)
- PR 2243 claims "30s timeout on site.save()" - no such change exists in this PR
- The gist versions (data-access 3.55.0, shared-utils 1.112.5) are behind main
These are informational - the resolution is to wait for this PR to ship through the registry and then update PR 2243 to point at the published versions.
Assessment
Ready to merge? No - with fixes
The production code changes are well-motivated and the root-cause fix is correct. But the headline behavior change (site-enable-overrides-org-disable) has no test, the new fallback paths ship without observability (the same gap that hid the original crash), and the body read has no size cap. A prod data audit for the precedence change should be attached before deploy. All fixable in this branch.
Next Steps
- Add the missing test for site-enable-overrides-org-disable, plus the updatedHandler cleanup test.
- Add structured `log.warn` on the three new silent fallback paths.
- Run the prod data audit query for the config precedence change and attach results to the PR.
- Add a body-size cap or Content-Length check in detectBotBlocker.
- The remaining items (SSRF guard, test name fixes, timeout pinning) are lower priority.
```js
    response.text(),
    new Promise((_, reject) => { setTimeout(() => reject(new Error('body-read-timeout')), BODY_READ_TIMEOUT); }),
  ]);
} catch {
```
Important: This catch block swallows body-read timeouts with no log. If body reads start failing fleet-wide, detection silently degrades to header-only with no operational signal. Add a log.warn with baseUrl and error.message.
```js
) {
  const remaining = deadline - Date.now();
  if (remaining <= 0) {
    return null;
```
Important: Deadline expiry returns null with no log or metric - indistinguishable from URL not resolvable. Same observability gap that hid the original Z_BUF_ERROR. Add a log.warn here and on the bare catch at line 182.
```js
// Promise.race guards against servers that stream body slowly after headers arrive.
// tracingFetch clears its AbortSignal in finally{} before returning, so response.text()
// has no built-in timeout.
html = await Promise.race([
```
Important: response.text() buffers the full body up to 3s with no size cap. Challenge markers appear in the first KB - consider capping the read at 64-256KB to limit Lambda memory pressure.
Hi @solaris007, thanks for the review. Addressed the comments below.
solaris007 left a comment
Hey @tkotthakota-adobe,
Good progress since the last review - most prior findings are well-addressed and the configuration changes are cleanly split to PR 1590.
Strengths
- Root-cause fix (`decode: false` at url-helpers.js:170) correctly targets the Brotli decompression crash - skipping decompression on a call that only needs `resp.url` and `resp.ok`.
- Three structured `log.warn` calls with `{ fn, url, cause/method/contentLength }` context close the observability gap that hid the original Z_BUF_ERROR crash (prior finding on silent fallback paths - addressed).
- 64KB Content-Length check (bot-blocker-detect.js:349-351) adds first-pass body-size gating (prior finding on size cap - addressed).
- `BODY_READ_TIMEOUT` exported and pinned by a fake-timer test at bot-blocker-detect.test.js:449 that ticks `BODY_READ_TIMEOUT + 1` and asserts the warn fires (prior finding on timeout pinning - addressed).
- Test names and assertions corrected to match the actual 7s deadline behavior (prior finding on test names - addressed).
- Shared deadline threaded cleanly through all recursive hops with accurate JSDoc (url-helpers.js:140-145).
- Configuration changes cleanly split to PR 1590, narrowing this PR to the URL/bot-blocker scope.
Issues
Important (Should Fix)
1. Chunked encoding bypasses the 64KB body-size cap (bot-blocker-detect.js:349)
The Content-Length check resolves to 0 when the header is missing (chunked transfer encoding), causing the cap to be skipped entirely:
```js
const contentLength = parseInt(response.headers.get('content-length') || '0', 10);
if (contentLength > 0 && contentLength > BODY_READ_MAX_BYTES) { /* ... skip ... */ }
else { /* ... read body ... */ }
```

Body read then proceeds bounded only by the 3s wall-clock timeout. A server streaming chunked data can push multi-MB into response.text() within that window. Chunked encoding is the more common case for dynamic responses (reverse proxies, internal APIs). The 64KB cap was added to address the prior review's size concern, but it only covers servers that advertise Content-Length.
Fix: Read the body via a stream reader with a running byte counter that aborts at BODY_READ_MAX_BYTES, or accept the wall-clock-only bound and document it explicitly (e.g., a comment: "Content-Length check is best-effort; chunked responses are bounded by BODY_READ_TIMEOUT only").
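The stream-reader option can be sketched as below. This assumes a WHATWG ReadableStream body (a Node Readable would use async iteration instead), and the cap is chunk-granular, so it can overshoot by up to one chunk:

```javascript
// Hypothetical sketch of a stream-level byte cap for chunked responses;
// BODY_READ_MAX_BYTES and the Response body shape are assumptions.
const BODY_READ_MAX_BYTES = 64 * 1024;

async function readBodyCapped(body, maxBytes = BODY_READ_MAX_BYTES) {
  const decoder = new TextDecoder();
  const reader = body.getReader();
  let bytes = 0;
  let html = '';
  while (bytes < maxBytes) {
    const { done, value } = await reader.read();
    if (done) break;
    bytes += value.byteLength;
    // stream: true tolerates multibyte sequences split across chunks.
    html += decoder.decode(value, { stream: true });
  }
  // Cancel so the server stops streaming once the cap is reached.
  await reader.cancel().catch(() => {});
  return html;
}
```

The first 64KB is more than enough for the challenge markers the analyzer looks for, so truncation does not hurt detection quality.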
2. SSRF surface widened by HEAD to GET with no private-IP guard (unresolved from prior review) (bot-blocker-detect.js:341)
This was raised in the prior review and the response skipped it (item numbering jumped from 5 to 7). Re-evaluating against the current diff:
isValidUrl (functions.js) only checks the URL parses and uses http/https - it accepts http://169.254.169.254/, http://10.0.0.1/, http://localhost/. Both detectBotBlocker and resolveCanonicalUrl fetch caller-supplied URLs server-side from a Lambda VPC.
The HEAD-to-GET change means internal service response bodies (up to 64KB, or unbounded if chunked per finding 1) are now buffered in the Lambda. analyzeResponse does not echo the body back to the caller, so this is not a direct read primitive. However: (a) timing and type/confidence fields leak reachability information, (b) resolveCanonicalUrl returns redirect targets - including internal URLs if a site 301s to one, and (c) log.warn calls include the URL, visible to anyone reading CloudWatch.
Fix: Either add a hostname guard that rejects URLs whose hostname is a private/loopback/link-local IP literal (169.254.x.x, 10/8, 172.16/12, 192.168/16, 127/8, ::1), or document the trust model explicitly in the JSDoc: "callers MUST validate that urlString resolves to a public hostname; this library does not enforce private-network protections." The second option is acceptable if the api-service onboarding endpoint already validates upstream.
Minor (Nice to Have)
3. GET-failure log.warn not asserted by any test (url-helpers.test.js:446)
The log.warn at url-helpers.js:194 ("[resolveCanonicalUrl] GET request failed") was added to address the prior "silent fallback" finding. The test at line 446 ("should return null when GET request errors") exercises the path but doesn't inject a log stub. If the log.warn is later removed, no test breaks.
Fix: Pass log = { warn: sinon.stub() } and assert expect(log.warn).to.have.been.calledWithMatch('[resolveCanonicalUrl] GET request failed').
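A dependency-free sketch of that assertion pattern, using a plain stub in place of sinon. Here `fakeResolve` is an illustrative stand-in for `resolveCanonicalUrl` with an injected failing fetch; the real test would stub `tracingFetch` and call the actual function:

```javascript
// Illustrative only: fakeResolve mimics the GET-failure fallback so the
// stub-assertion pattern is runnable in isolation.
function makeLogStub() {
  const warnCalls = [];
  return { warnCalls, warn: (...args) => warnCalls.push(args) };
}

async function fakeResolve(url, { fetchImpl, log }) {
  try {
    const resp = await fetchImpl(url);
    return resp.url;
  } catch (err) {
    log.warn('[resolveCanonicalUrl] GET request failed', { url, cause: err.message });
    return null;
  }
}
```

With the stub injected, the test asserts both the null return and the warn message, so deleting the `log.warn` later breaks CI.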
4. Timer leak on body-read Promise.race (bot-blocker-detect.js:356-358)
When response.text() resolves before the 3s timeout, the setTimeout keeps the timer handle alive. At onboarding cadence this is negligible, but capturing the handle and clearing it in .finally is a cheap correctness improvement.
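A minimal sketch of that cleanup (the wrapper name is illustrative):

```javascript
// Sketch: capture the timeout handle and clear it once the race settles,
// so a fast body read does not leave a live timer behind.
function raceBodyRead(bodyPromise, timeoutMs) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('body-read-timeout')), timeoutMs);
  });
  return Promise.race([bodyPromise, timeout]).finally(() => clearTimeout(timer));
}
```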
Recommendations
- Both functions default `log` to `console`. In Lambda this bypasses structured logging (Powertools/Coralogix). Callers should pass their framework logger explicitly. Consider documenting this in the JSDoc.
- The package-lock.json includes version bumps for sibling packages (data-access 3.55.0, drs-client 1.5.0, etc.) not touched in this PR's src/. This is likely monorepo churn from merging main. Consider regenerating the lockfile to only include changes this PR requires, or note the bumps in the PR description.
- Consider adding an `analyzeBody` flag (default `true`) to `detectBotBlocker` so future high-cadence callers can opt into the cheaper header-only path without paying for the GET body read on every invocation.
Assessment
Ready to merge? With fixes.
Reasoning: The root-cause fix is correct and most prior findings are well-addressed. The two remaining Important items are: (1) the Content-Length cap is incomplete for chunked encoding - a real memory-pressure path reachable from user input, and (2) the SSRF concern from the prior review that was skipped without engagement - the HEAD-to-GET change widens the surface and the library has no private-IP guard. Both are addressable in this branch. The Minor items and Recommendations are non-blocking.
Next Steps
- Address the chunked encoding bypass - either stream-based byte cap or document the accepted risk.
- Address the SSRF concern - either add a hostname guard or document the trust model in JSDoc.
- Optional: pin the GET-failure log.warn in the test and clear the timer handle.
Hi @solaris007, thanks for the re-review. I believe I addressed all concerns.

Hi @solaris007, let me know if anything else is needed.
https://jira.corp.adobe.com/browse/SITES-43237
spacecat-shared-utils
resolveCanonicalUrl crashed the Lambda silently on sites with Brotli-compressed responses (e.g.,
otempo.com.br). The function only needs the final URL and status code but @adobe/fetch was
auto-decompressing the body. The abandoned stream caused an unhandled Z_BUF_ERROR that killed the
process with no error message.
(Cloudflare, Imperva, CAPTCHA)