docs: add headless crawler integration RFC #350
Open
mason5052 wants to merge 2 commits into
Add examples/proposals/headless_crawler_integration.md, a docs-only RFC proposing an optional, tool-agnostic crawler / URL discovery capability for PentAGI agents. Candidate backends are katana, crawlergo, rad, and jsfinder, framed as candidates with no mandatory default. The RFC keeps dictionary fuzzing (ffuf/dirsearch) unchanged, defines structured discovery artifacts (URLs, forms, parameters, JavaScript endpoints, status codes, source page, depth, scope decision), and keeps crawler URL discovery separate from the BrowserOS MCP interactive browser backend. Refs vxcontrol#336
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a docs-only RFC proposing an optional headless crawler / URL discovery capability for PentAGI, including tool abstraction, artifact schema, safety constraints, and rollout milestones.
Changes:
- Introduces an RFC describing crawler goals/non-goals and differentiation from BrowserOS MCP.
- Proposes a tool-agnostic backend selection model and a normalized discovery artifact shape.
- Sketches configuration, safety properties, failure modes, and incremental implementation milestones.
Address Copilot review on PR vxcontrol#350: note that default_backend selects a crawl backend and passive extractors such as jsfinder are not selectable there; model scope_decision as a decision plus reason in both the field list and the JSON example; add an illustrative scope_entries example and clarify when allowed_hosts applies. Refs vxcontrol#336
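The "decision plus reason" modeling mentioned in this commit can be illustrated with a small Go sketch. Everything here is an assumption for illustration: the `DecideScope` function, the `in_scope` / `out_of_scope` values, the reason strings, and the rule that `allowed_hosts` is an exact, case-insensitive host match. The RFC's `scope_entries` structure is not modeled.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// ScopeDecision pairs the outcome with the reason it was reached, so that
// out-of-scope URLs remain explainable in the recorded artifact.
type ScopeDecision struct {
	Decision string // assumed values: "in_scope" or "out_of_scope"
	Reason   string
}

// DecideScope classifies a discovered URL against a list of allowed hosts.
func DecideScope(rawURL string, allowedHosts []string) ScopeDecision {
	u, err := url.Parse(rawURL)
	if err != nil || u.Hostname() == "" {
		return ScopeDecision{"out_of_scope", "unparseable URL or missing host"}
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return ScopeDecision{"out_of_scope", "unsupported scheme " + u.Scheme}
	}
	host := strings.ToLower(u.Hostname())
	for _, allowed := range allowedHosts {
		if host == strings.ToLower(allowed) {
			return ScopeDecision{"in_scope", "host matches allowed_hosts entry " + allowed}
		}
	}
	return ScopeDecision{"out_of_scope", "host not in allowed_hosts"}
}

func main() {
	allowed := []string{"app.example.test"}
	for _, raw := range []string{
		"https://app.example.test/login",
		"https://cdn.other.test/lib.js",
		"ftp://app.example.test/backup",
	} {
		d := DecideScope(raw, allowed)
		fmt.Printf("%s -> %s (%s)\n", raw, d.Decision, d.Reason)
	}
}
```

Returning the reason alongside the decision means a skipped URL can still be stored and reviewed later instead of being silently dropped.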
Summary
Add examples/proposals/headless_crawler_integration.md, a docs-only RFC proposing an optional, tool-agnostic headless crawler / URL discovery capability for PentAGI agents. The RFC frames crawler discovery as disabled-by-default design work for maintainer review before any runtime implementation, and keeps it separate from the BrowserOS MCP interactive browser backend RFC (issue #342, PR #345).
Problem
Issue #336 asks for a crawler capability. PentAGI's current web flow can brute-force directories with dictionary tools such as ffuf, but it cannot crawl an application to discover its real routes, links, forms, parameters, and JavaScript endpoints. The browser tool in backend/pkg/tools/browser.go is single-URL scraper extraction (markdown/html/links), not crawling. Crawler-adjacent tools (for example katana, hakrawler) appear in the pentester prompt only as ad-hoc terminal commands, so their output is unstructured, unscoped, and not reusable across subtasks. Dictionary fuzzing guesses paths from a wordlist; crawling observes the paths an application actually exposes. The two are complementary, and there is no first-class, scoped, structured crawler capability today.
Solution
Add a focused RFC under examples/proposals/ that:
- Frames the crawler backends as candidates (katana, crawlergo, rad, jsfinder) with no mandatory default.
- Keeps dictionary fuzzing (ffuf/dirsearch) unchanged and complementary, not replaced.
- [...] ffuf/dirsearch, guide browser checks, reduce repeated manual enumeration, and enrich reports.
- [...] robots.txt as an operator policy question (not a hard rule), no credentialed crawling by default, and active form submission gated behind separate approval.
- [...] .env.example/env-var changes, GraphQL schema, DB migrations, generated files, frontend UI, installer logic, and provider configuration.
User Impact
Test Plan
- git diff --check clean (staged and unstaged).
- git diff --name-only upstream/main...HEAD shows only examples/proposals/headless_crawler_integration.md.
- [...] main.
- [...] .env.example changes.
Refs #336