docs: add headless crawler integration RFC #350

Open
mason5052 wants to merge 2 commits into vxcontrol:main from mason5052:codex/issue-336-headless-crawler-rfc

@mason5052

Contributor

Summary

Add examples/proposals/headless_crawler_integration.md, a docs-only RFC proposing an optional, tool-agnostic headless crawler / URL discovery capability for PentAGI agents. The RFC frames crawler discovery as disabled-by-default design work for maintainer review before any runtime implementation, and keeps it separate from the BrowserOS MCP interactive browser backend RFC (issue #342, PR #345).

Problem

Issue #336 asks for a crawler capability. PentAGI's current web flow can brute-force directories with dictionary tools such as ffuf, but it cannot crawl an application to discover its real routes, links, forms, parameters, and JavaScript endpoints. The browser tool in backend/pkg/tools/browser.go extracts content from a single URL (markdown/html/links); it does not crawl. Crawler-adjacent tools (for example katana, hakrawler) appear in the pentester prompt only as ad-hoc terminal commands, so their output is unstructured, unscoped, and not reusable across subtasks. Dictionary fuzzing guesses paths from a wordlist; crawling observes the paths an application actually exposes. The two are complementary, but there is no first-class, scoped, structured crawler capability today.
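
To make the fuzzing-versus-crawling distinction concrete, here is a minimal Python sketch. It is illustrative only and is not taken from the RFC or from PentAGI's Go backend: `LinkCollector`, `fuzz_candidates`, `crawl_page`, the sample page, and the `target.example` host are all hypothetical names chosen for this example.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects link, form, and script targets referenced by one HTML page."""

    # Tag -> attribute that carries a URL (a deliberately small, hypothetical set).
    URL_ATTRS = {"a": "href", "form": "action", "script": "src"}

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.found: List[str] = []

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative references against the page URL.
                self.found.append(urljoin(self.base_url, value))


def fuzz_candidates(base_url: str, wordlist: List[str]) -> List[str]:
    """Dictionary fuzzing: guessed URLs that still have to be requested one by one."""
    return [urljoin(base_url, word) for word in wordlist]


def crawl_page(base_url: str, html: str) -> List[str]:
    """Crawling: URLs that the page itself exposes, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.found


# Hypothetical page content used only for illustration.
PAGE = (
    '<a href="/account/settings">Settings</a>'
    '<form action="/api/v2/search"></form>'
    '<script src="/static/app.js"></script>'
)
```

A wordlist such as `["admin", "login"]` would never produce `/api/v2/search`, while `crawl_page("https://target.example/", PAGE)` reads it straight from the form tag. A real crawler would also fetch pages, follow links to a bounded depth, and handle JavaScript-rendered content, none of which is shown here.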

Solution

Add a focused RFC under examples/proposals/ that:

  • Proposes a tool-agnostic "crawler backend" / "discovery tool" abstraction over candidate tools (katana, crawlergo, rad, jsfinder) with no mandatory default.
  • Keeps dictionary fuzzing (ffuf/dirsearch) unchanged and complementary, not replaced.
  • Defines structured discovery artifacts: URLs, methods, status codes, forms, parameters, JavaScript endpoints, source page, depth, and scope decision.
  • Explains how agents consume the artifact: seed ffuf/dirsearch, guide browser checks, reduce repeated manual enumeration, and enrich reports.
  • Specifies safety: disabled/explicit by default, obey flow target scope, depth/page/request limits, rate limiting, same-origin/allowed-host controls, SSRF/private-network protection reusing the existing browser URL classification, robots.txt as an operator policy question (not a hard rule), no credentialed crawling by default, and active form submission gated behind separate approval.
  • Notes that JavaScript-heavy crawling overlaps with the BrowserOS MCP backend (issue #342) but stays focused on URL discovery artifacts, not interactive session automation.
  • Explicitly excludes runtime code, a new tool handler, Docker image/Compose changes, .env.example/env-var changes, GraphQL schema, DB migrations, generated files, frontend UI, installer logic, and provider configuration.
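
The artifact fields and scope controls listed above could look roughly like the following Python sketch. It is a hedged illustration, not the RFC's schema: the class and function names, the `"in_scope"` / `"out_of_scope"` values, and the reason strings are assumptions. The IP check is a simplified stand-in for the existing browser URL classification that the RFC proposes to reuse, and it does not cover hostnames that resolve to private addresses.

```python
import ipaddress
from dataclasses import dataclass, field
from typing import List, Optional, Set
from urllib.parse import urlsplit


@dataclass
class ScopeDecision:
    decision: str  # hypothetical values: "in_scope" or "out_of_scope"
    reason: str


@dataclass
class DiscoveredURL:
    url: str
    depth: int
    method: str = "GET"
    status_code: Optional[int] = None
    source_page: Optional[str] = None
    parameters: List[str] = field(default_factory=list)
    scope: Optional[ScopeDecision] = None


def classify(url: str, depth: int, allowed_hosts: Set[str], max_depth: int) -> ScopeDecision:
    """Decide whether a discovered URL may be followed, and record why."""
    if depth > max_depth:
        return ScopeDecision("out_of_scope", "depth limit exceeded")
    host = urlsplit(url).hostname or ""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        ip = None  # host is a name, not an IP literal
    if ip is not None and (ip.is_private or ip.is_loopback or ip.is_link_local):
        return ScopeDecision("out_of_scope", "private or loopback address")
    if host not in allowed_hosts:
        return ScopeDecision("out_of_scope", "host not in allowed_hosts")
    return ScopeDecision("in_scope", "host allowed")


def fuzz_seeds(artifacts: List[DiscoveredURL]) -> List[str]:
    """Directories of in-scope URLs, usable as starting points for ffuf/dirsearch."""
    seeds = set()
    for item in artifacts:
        if item.scope is None or item.scope.decision != "in_scope":
            continue
        path = urlsplit(item.url).path
        seeds.add(path.rsplit("/", 1)[0] + "/")
    return sorted(seeds)
```

For example, an in-scope artifact for `https://app.example/api/v2/users?id=1` yields the seed `/api/v2/`, which a fuzzing step could use as a base path instead of guessing from the site root. Storing the decision together with its reason keeps out-of-scope findings auditable rather than silently dropped.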

User Impact

  • Gives maintainers a concrete design artifact to review before any crawler runtime work lands.
  • Clarifies how crawler-based URL discovery would complement existing fuzzing without changing today's behavior.
  • No runtime impact. This is documentation only.

Test Plan

  • git diff --check clean (staged and unstaged).
  • git diff --name-only upstream/main...HEAD shows only examples/proposals/headless_crawler_integration.md.
  • Confirmed default branch is still main.
  • Confirmed issue #336 ([Enhancement]: add Headless Browser Crawler) is still open.
  • Confirmed no open or closed PR already covers crawler / Katana / Crawlergo / Rad / jsfinder integration.
  • Verified diff is docs-only: no code, schema, generated, config, Docker image/Compose, frontend, env-var, or .env.example changes.
  • RFC is pure ASCII with no emojis or special characters.
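
The docs-only and pure-ASCII checks above could be scripted along these lines. This is a sketch under stated assumptions, not the procedure actually used for this PR: `is_docs_only` and `non_ascii_positions` are hypothetical helpers, and the changed-file list is assumed to come from the output of `git diff --name-only upstream/main...HEAD`.

```python
from typing import List, Tuple


def is_docs_only(changed_files: List[str], allowed_prefix: str = "examples/proposals/") -> bool:
    """True when every changed path is a Markdown file under the proposals directory."""
    return bool(changed_files) and all(
        path.startswith(allowed_prefix) and path.endswith(".md")
        for path in changed_files
    )


def non_ascii_positions(text: str) -> List[Tuple[int, int]]:
    """1-based (line, column) positions of every non-ASCII character in text."""
    positions = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                positions.append((lineno, col))
    return positions
```

An empty result from `non_ascii_positions` corresponds to the "pure ASCII" claim, and a non-empty one points at the exact characters to fix.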

Refs #336

Add examples/proposals/headless_crawler_integration.md, a docs-only RFC proposing an optional, tool-agnostic crawler / URL discovery capability for PentAGI agents. Candidate backends are katana, crawlergo, rad, and jsfinder, framed as candidates with no mandatory default. The RFC keeps dictionary fuzzing (ffuf/dirsearch) unchanged, defines structured discovery artifacts (URLs, forms, parameters, JavaScript endpoints, status codes, source page, depth, scope decision), and keeps crawler URL discovery separate from the BrowserOS MCP interactive browser backend.

Refs vxcontrol#336
Copilot AI review requested due to automatic review settings June 14, 2026 02:34

Copilot AI left a comment

Contributor

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a docs-only RFC proposing an optional headless crawler / URL discovery capability for PentAGI, including tool abstraction, artifact schema, safety constraints, and rollout milestones.

Changes:

  • Introduces an RFC describing crawler goals/non-goals and differentiation from BrowserOS MCP.
  • Proposes a tool-agnostic backend selection model and a normalized discovery artifact shape.
  • Sketches configuration, safety properties, failure modes, and incremental implementation milestones.

Comment thread examples/proposals/headless_crawler_integration.md Outdated
Comment thread examples/proposals/headless_crawler_integration.md Outdated
Comment thread examples/proposals/headless_crawler_integration.md Outdated
Comment thread examples/proposals/headless_crawler_integration.md
Address Copilot review on PR vxcontrol#350: note that default_backend selects a crawl backend and passive extractors such as jsfinder are not selectable there; model scope_decision as a decision plus reason in both the field list and the JSON example; add an illustrative scope_entries example and clarify when allowed_hosts applies.

Refs vxcontrol#336