docs: add headless crawler integration RFC by mason5052 · Pull Request #350 · vxcontrol/pentagi

mason5052 · 2026-06-14T02:34:50Z

Summary

Add examples/proposals/headless_crawler_integration.md, a docs-only RFC proposing an optional, tool-agnostic headless crawler / URL discovery capability for PentAGI agents. The RFC frames crawler discovery as disabled-by-default design work for maintainer review before any runtime implementation, and keeps it separate from the BrowserOS MCP interactive browser backend RFC (issue #342, PR #345).

Problem

Issue #336 asks for a crawler capability. PentAGI's current web flow can brute-force directories with dictionary tools such as ffuf, but it cannot crawl an application to discover its real routes, links, forms, parameters, and JavaScript endpoints. The browser tool in backend/pkg/tools/browser.go only extracts content from a single URL (markdown/html/links); it does not crawl. Crawler-adjacent tools (for example katana, hakrawler) appear in the pentester prompt only as ad-hoc terminal commands, so their output is unstructured, unscoped, and not reusable across subtasks. Dictionary fuzzing guesses paths from a wordlist; crawling observes the paths an application actually exposes. The two are complementary, but there is no first-class, scoped, structured crawler capability today.

Solution

Add a focused RFC under examples/proposals/ that:

Proposes a tool-agnostic "crawler backend" / "discovery tool" abstraction over candidate tools (katana, crawlergo, rad, jsfinder) with no mandatory default.
Keeps dictionary fuzzing (ffuf/dirsearch) unchanged and complementary, not replaced.
Defines structured discovery artifacts: URLs, methods, status codes, forms, parameters, JavaScript endpoints, source page, depth, and scope decision.
Explains how agents consume the artifact: seed ffuf/dirsearch, guide browser checks, reduce repeated manual enumeration, and enrich reports.
Specifies safety: disabled/explicit by default, obey flow target scope, depth/page/request limits, rate limiting, same-origin/allowed-host controls, SSRF/private-network protection reusing the existing browser URL classification, robots.txt as an operator policy question (not a hard rule), no credentialed crawling by default, and active form submission gated behind separate approval.
Notes that JavaScript-heavy crawling overlaps with the BrowserOS MCP backend (issue #342, "[New Feature]: Add BrowserOS MCP support for better agent-controlled browsing") but stays focused on URL discovery artifacts, not interactive session automation.
Explicitly excludes runtime code, a new tool handler, Docker image/Compose changes, .env.example/env-var changes, GraphQL schema, DB migrations, generated files, frontend UI, installer logic, and provider configuration.

User Impact

Gives maintainers a concrete design artifact to review before any crawler runtime work lands.
Clarifies how crawler-based URL discovery would complement existing fuzzing without changing today's behavior.
No runtime impact. This is documentation only.

Test Plan

git diff --check clean (staged and unstaged).
git diff --name-only upstream/main...HEAD shows only examples/proposals/headless_crawler_integration.md.
Confirmed default branch is still main.
Confirmed issue #336 ("[Enhancement]: add Headless Browser Crawler") is still open.
Confirmed no open or closed PR already covers crawler / Katana / Crawlergo / Rad / jsfinder integration.
Verified diff is docs-only: no code, schema, generated, config, Docker image/Compose, frontend, env-var, or .env.example changes.
RFC is pure ASCII with no emojis or special characters.

Refs #336

Add examples/proposals/headless_crawler_integration.md, a docs-only RFC proposing an optional, tool-agnostic crawler / URL discovery capability for PentAGI agents. Candidate backends are katana, crawlergo, rad, and jsfinder, framed as candidates with no mandatory default. The RFC keeps dictionary fuzzing (ffuf/dirsearch) unchanged, defines structured discovery artifacts (URLs, forms, parameters, JavaScript endpoints, status codes, source page, depth, scope decision), and keeps crawler URL discovery separate from the BrowserOS MCP interactive browser backend. Refs vxcontrol#336

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a docs-only RFC proposing an optional headless crawler / URL discovery capability for PentAGI, including tool abstraction, artifact schema, safety constraints, and rollout milestones.

Changes:

Introduces an RFC describing crawler goals/non-goals and differentiation from BrowserOS MCP.
Proposes a tool-agnostic backend selection model and a normalized discovery artifact shape.
Sketches configuration, safety properties, failure modes, and incremental implementation milestones.


Address Copilot review on PR vxcontrol#350: note that default_backend selects a crawl backend and passive extractors such as jsfinder are not selectable there; model scope_decision as a decision plus reason in both the field list and the JSON example; add an illustrative scope_entries example and clarify when allowed_hosts applies. Refs vxcontrol#336
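The "decision plus reason" model for scope_decision can be illustrated with a small Go sketch. It is an assumption-laden example, not the RFC's logic: the function name DecideScope, the decision strings, and the reading of allowed_hosts as an exact-match host allowlist are all hypothetical. It also does not reuse PentAGI's existing browser URL classification; it only checks literal IP addresses with the standard library (ip.IsPrivate needs Go 1.17 or later) and performs no DNS resolution.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strings"
)

// ScopeDecision is a hypothetical pairing of a decision with its reason.
type ScopeDecision struct {
	Decision string
	Reason   string
}

// DecideScope is an illustrative check, assuming allowedHosts is an
// exact-match allowlist. Literal private, loopback, link-local, and
// unspecified IPs are rejected before the allowlist is consulted.
func DecideScope(rawURL string, allowedHosts []string) ScopeDecision {
	u, err := url.Parse(rawURL)
	if err != nil || u.Hostname() == "" {
		return ScopeDecision{"out_of_scope", "unparseable URL or missing host"}
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return ScopeDecision{"out_of_scope", "unsupported scheme"}
	}

	host := strings.ToLower(u.Hostname())
	if ip := net.ParseIP(host); ip != nil {
		if ip.IsLoopback() || ip.IsPrivate() || ip.IsLinkLocalUnicast() || ip.IsUnspecified() {
			return ScopeDecision{"out_of_scope", "private or local address"}
		}
	}

	for _, h := range allowedHosts {
		if host == strings.ToLower(h) {
			return ScopeDecision{"in_scope", "host is in allowed_hosts"}
		}
	}
	return ScopeDecision{"out_of_scope", "host not in allowed_hosts"}
}

func main() {
	allowed := []string{"app.example.com"}
	for _, raw := range []string{
		"https://app.example.com/login",
		"https://other.example.net/",
		"http://127.0.0.1/admin",
	} {
		d := DecideScope(raw, allowed)
		fmt.Printf("%s -> %s (%s)\n", raw, d.Decision, d.Reason)
	}
}
```

Rejecting private addresses unconditionally is a simplification; an engagement whose target is an internal network would need the operator to widen this policy explicitly.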

Copilot AI review requested due to automatic review settings June 14, 2026 02:34

Copilot AI reviewed Jun 14, 2026


Comment thread examples/proposals/headless_crawler_integration.md Outdated

Comment thread examples/proposals/headless_crawler_integration.md Outdated

Comment thread examples/proposals/headless_crawler_integration.md Outdated

Comment thread examples/proposals/headless_crawler_integration.md

mason5052 wants to merge 2 commits into vxcontrol:main from mason5052:codex/issue-336-headless-crawler-rfc
