Agentic fuzzer + 7 filter bug fixes #12

Open
PLNech wants to merge 42 commits into main from feature/agentic-fuzzing

@PLNech commented Mar 20, 2026

Summary

  • Built an agentic fuzzer (139 static tests, 35 command families, 7 heuristics)
  • Found 15 filter bugs, fixed 7 in this PR
  • Failure rate: down from 28% to 16% (the remaining failures are by design or known limitations)

Fixes

  1. Docker ps/images - accept arbitrary flags with smart passthrough (FUZZ-006)
  2. pip - respect user --format flag instead of overriding with --format=json (FUZZ-007)
  3. npm - route subcommands correctly instead of hardcoding npm run (FUZZ-008)
  4. find - delegate to system find when predicates are detected (FUZZ-009)
  5. cargo - preserve stderr channel for clippy/check output (FUZZ-010)
  6. git branch -a - show all remotes without dedup when user asks (FUZZ-011)
  7. gh pr diff - raise max_lines from 100 to 500 to avoid silent truncation (FUZZ-009/gh)

Remaining issues

Test plan

  • cargo fmt, clippy, test (427 pass)
  • Fuzzer: 15.8% failure rate, down from 28%
  • Manual: rtk docker ps -a, rtk pip list, rtk npm list, rtk find . -maxdepth 2 -name '*.rs'

PLNech and others added 30 commits March 17, 2026 21:15
- install.sh now points to algolia/rtk at pinned v0.22.2
  (overridable via RTK_VERSION env var)
- All docs, scripts, CI workflows updated from rtk-ai/rtk to algolia/rtk
- CHANGELOG historical upstream links preserved for attribution
- All GitHub Actions workflows target 'main' branch (not 'master')
- New ci.yml: build/test/clippy/fmt on ubuntu + macOS
- New telemetry-guard CI job: blocks telemetry.rs, ureq, sha2,
  hostname deps, and phone-home patterns in source
- CLAUDE.md documents fork maintenance strategy, sync policy,
  and telemetry exclusion rules
fork: establish algolia/rtk identity with no-telemetry CI guard
feat: support git -C directory option
gh: when --json flag is present, bypass all filtering and passthrough
raw JSON output. RTK was reformatting structured JSON into lossy
human-readable summaries, breaking downstream jq/python consumers.

grep: detect -c/--count mode and passthrough rg output verbatim.
The file:count format was being misinterpreted as file:linenum:content,
producing gibberish like "📄 101 (1): prefix_saturated".

Closes bug-report: 2026-03-19-gh-json-output-rewritten.md
scripts/fuzz-rtk.py uses Qwen 3.5 (via Algolia inference proxy) to
generate diverse command invocations targeting format-changing flags,
then compares raw vs RTK output with 6 heuristic checks:
JSON integrity, line expansion, emoji injection, exit code match,
data loss, and format preservation.
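Three of the checks named above could look roughly like this. This is a minimal sketch, not the code in scripts/fuzz-rtk.py: the `Run` record and all function names are hypothetical, and it assumes "line expansion" means the RTK output has more lines than the raw output.

```python
import json
from dataclasses import dataclass


@dataclass
class Run:
    """Captured result of one invocation (hypothetical record)."""
    stdout: str
    exit_code: int


def json_integrity(raw: Run, rtk: Run) -> bool:
    """Fail only when raw stdout is valid JSON and RTK stdout is not."""
    try:
        json.loads(raw.stdout)
    except ValueError:
        return True  # raw output was not JSON, so the check does not apply
    try:
        json.loads(rtk.stdout)
    except ValueError:
        return False
    return True


def exit_code_match(raw: Run, rtk: Run) -> bool:
    """Pass when RTK returns the same exit code as the raw command."""
    return raw.exit_code == rtk.exit_code


def line_expansion(raw: Run, rtk: Run) -> bool:
    """Pass when the filtered output has no more lines than the raw output."""
    return len(rtk.stdout.splitlines()) <= len(raw.stdout.splitlines())
```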

First run: 64 tests, 30 failures (47% failure rate). Key findings:
- grep: most rg flags crash with exit 2 (-l, --vimgrep, -A/-B/-C, --json)
- git log: custom --format/--pretty output mangled by compact filter
- git log: --graph characters stripped, --stat/--patch data lost

Bug reports documented in bug-reports/2026-03-19-fuzz-findings.md
RTK's grep had short flags colliding with rg:
  -l (RTK: --max-len) vs rg -l (--files-with-matches)
  -c (RTK: --context-only) vs rg -c (--count)
  -m (RTK: --max) vs rg -m (--max-count)

Removed colliding short flags from Clap. Now -l, -c, -m flow through
to rg via extra_args as users expect.

Generalized passthrough to all format-changing rg flags: -l, --json,
--vimgrep, -h, --count-matches, and context flags (-A/-B/-C). These
bypass RTK's grouping filter since they produce different output
formats.

Found via agentic fuzzing — 14/16 grep tests were failing before.
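The passthrough decision for grep can be illustrated in Python. The real logic lives in RTK's Rust source and is not shown in this PR text, so the function name and the exact matching rules below are assumptions; only the flag list comes from the commit messages.

```python
# Flags named in the commit messages as changing rg's output format.
FORMAT_FLAGS = {"-l", "--json", "--vimgrep", "-h", "--count-matches", "-c", "--count"}
CONTEXT_PREFIXES = ("-A", "-B", "-C")


def needs_passthrough(extra_args: list[str]) -> bool:
    """Return True when any argument would change rg's output format."""
    for arg in extra_args:
        if arg == "--":
            break  # everything after "--" is a pattern or path, not a flag
        if arg in FORMAT_FLAGS:
            return True
        if arg.startswith(CONTEXT_PREFIXES):  # matches "-A" and "-A3"
            return True
    return False
```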
When user provides --format, --pretty, --graph, --stat, --numstat,
--shortstat, --patch, -p, --raw, --name-only, or --name-status,
RTK now passes output through verbatim instead of applying its
compact filter. These flags indicate the user wants specific output
formatting that RTK's compression would mangle.

--oneline without other detail flags still gets RTK's default
compression since it produces compatible output.

Found via agentic fuzzing — 13/16 git-log tests were failing.
- Add 11 new command families: find, cat, tree, cargo-test, cargo-clippy,
  git-branch, git-stash, curl, wc, env, diff
- Add STATIC_TESTS dict for deterministic regression without LLM
- Extract _run_single_test() reusable helper
- Switch to qwen3.5-35b-fp8 for faster throughput
- Add FUZZ-RTK.md project README documenting architecture and usage
New families: git-diff, git-show, git-branch, find, cargo-build,
cargo-clippy, diff, env, curl. Fix cargo 2>&1 safety false positive.
Found 8+ new bugs across ls, find, diff, git-branch, git-show, wc.
Fixes discovered by agentic fuzzer (Run 2, bugs 8/10/11/12):

- diff: exit code 1 when files differ (matches diff convention)
- git branch: passthrough for --format/--sort flags
- git show: passthrough for --name-only/--name-status/--raw/--no-patch/-p
- ls: passthrough when -l flag present (preserves metadata users expect)

Static fuzzer: 20 fail → 13 fail (65 tests)
- git diff: passthrough for --name-only/--name-status flags
- cargo build/clippy/check: passthrough for --message-format=json
- grep: disable_help_flag so -h flows to rg instead of Clap
- fuzz-rtk.py: remove false cat -b test case
Maps -q to --brief flag, shells out to real diff for brief mode.
Preserves exit codes (0=identical, 1=differ).
…rmat check

- grep/rg: sort lines before FORMAT_ALTERED comparison to handle
  non-deterministic rg thread ordering (was causing false positives)
- wc: skip FORMAT_ALTERED check entirely (filename stripping is
  intentional compression, not a format bug)

Results: 47% → 8% failure rate (5 remaining are design choices)
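The order-insensitive comparison for grep/rg might be sketched like this. The function name and the `order_sensitive` parameter are hypothetical; the actual FORMAT_ALTERED check may compare more than whole lines.

```python
def format_altered(raw: str, rtk: str, order_sensitive: bool = True) -> bool:
    """Return True when the two outputs differ line by line."""
    raw_lines = raw.splitlines()
    rtk_lines = rtk.splitlines()
    if not order_sensitive:
        # rg searches in parallel, so file order can differ between runs
        raw_lines, rtk_lines = sorted(raw_lines), sorted(rtk_lines)
    return raw_lines != rtk_lines
```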
…UZZ-005)

Five bug classes discovered by LLM-powered fuzzing:
- FUZZ-001: git format flags not passthrough
- FUZZ-002: cargo --message-format=json mangled
- FUZZ-003: git diff --name-only filtered incorrectly
- FUZZ-004: grep -h intercepted by Clap
- FUZZ-005: missing flag support (diff -q, ls -l, git branch)
New bug classes tested:
- separator: -- handling between tool flags and args
- large-output: truncation behavior on 100+ line results
- stderr-commands: commands producing output on stderr
- gh-issue/gh-api: GitHub CLI subcommands
- empty-output: no-match/clean-state edge cases

New heuristic (#7): STDERR_LOSS — detects when raw stderr
content vanishes through RTK's filter pipeline.
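A STDERR_LOSS check could be as small as the sketch below. It assumes the heuristic compares only the stderr channel of the two runs; the function name and that definition of "vanishes" are assumptions, not the fuzzer's actual code.

```python
def stderr_loss(raw_stderr: str, rtk_stderr: str) -> bool:
    """Flag runs where raw stderr had content but RTK's stderr is empty."""
    return bool(raw_stderr.strip()) and not rtk_stderr.strip()
```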

Preflight now skips vault/API checks for static-only runs.
Multi-command families use RTK_MAP for command translation.

Found 3 new bugs: cargo test -- separator loss, cargo clippy
stderr loss, find -name Clap misinterpretation.
PLNech added 12 commits March 20, 2026 11:15
New families: docker-ps, docker-images, pip, go-test, npm, ruff,
pytest, git-tag, git-remote, cargo-test (expanded), cargo-clippy (expanded).

New bugs: FUZZ-006 (docker flags rejected), FUZZ-007 (pip format
override), FUZZ-008 (npm hardcoded to run), FUZZ-009-012 (find,
cargo stderr, git branch, diff data loss).

LLM round on weak families: 94% failure rate confirms docker/pip/npm/find
are comprehensively broken.
Add trailing_var_arg to DockerCommands::Ps and Images so Clap accepts
extra flags. Format-changing flags (--format, -q, --quiet) trigger
passthrough; content flags (-a, --filter, --no-trunc) pass through
while RTK still filters the output.

Also adds shared has_output_format_flag() utility to utils.rs for
reuse across modules.

Fixes 12 fuzzer failures.
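The two-tier handling described above (format-changing flags disable filtering, content flags do not) can be sketched in Python. The real helper is the Rust `has_output_format_flag()` in utils.rs, whose body is not shown here; `has_format_flag` and `plan_docker_ps` are hypothetical stand-ins.

```python
def has_format_flag(args: list[str]) -> bool:
    """True for --format, --format=..., -q and --quiet (combined short
    flags such as -aq are deliberately not handled in this sketch)."""
    return any(
        a in ("-q", "--quiet", "--format") or a.startswith("--format=")
        for a in args
    )


def plan_docker_ps(extra_args: list[str]) -> dict:
    """Forward every flag to docker; filter output unless a format flag is set."""
    return {
        "argv": ["docker", "ps", *extra_args],
        "filter": not has_format_flag(extra_args),
    }
```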
…ZZ-007)

When user specifies --format=freeze/columns/etc, skip injecting
--format=json and passthrough raw output. Unknown subcommands now
passthrough instead of erroring.

Fixes format override for pip list/outdated. uv compatibility for
edge flags (--not-required) is a separate uv issue.
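The pip fix amounts to injecting `--format=json` only when the user has not chosen a format. A rough Python illustration follows; the function name and the returned `(argv, filter_output)` tuple are hypothetical.

```python
def build_pip_list(user_args: list[str]) -> tuple[list[str], bool]:
    """Return (argv, filter_output) for a `pip list` invocation."""
    user_chose_format = any(
        a == "--format" or a.startswith("--format=") for a in user_args
    )
    if user_chose_format:
        # Respect the user's format and leave the output untouched.
        return ["pip", "list", *user_args], False
    return ["pip", "list", "--format=json", *user_args], True
```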
…-008)

Only npm run/run-script goes through the boilerplate filter. All other
subcommands (list, outdated, config, view, install, etc.) passthrough
to npm directly. Previously all input was prepended with "run", breaking
npm list, npm config, npm --version, etc.

Fixes 3 fuzzer failures.
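The routing rule for npm can be written out as below. This is an illustrative sketch of the behaviour the commit message describes, with a hypothetical function name and return shape.

```python
def route_npm(args: list[str]) -> tuple[str, list[str]]:
    """Only `run` and `run-script` go through the boilerplate filter."""
    if args and args[0] in ("run", "run-script"):
        return "filter", ["npm", *args]
    # Everything else, including bare flags like --version, is untouched.
    return "passthrough", ["npm", *args]
```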
Restructured Find command to use trailing_var_arg with manual parsing.
When system-find predicates (-name, -type, -maxdepth, etc.) are present,
delegates to system find. Without predicates, uses RTK's glob finder.

Fixes 3 fuzzer failures where Clap rejected valid find predicates.
When cargo subcommand output was primarily on stderr (clippy, check),
the filtered output was printed to stdout, causing STDERR_LOSS. Now
detects the original channel and preserves it.
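One way to read "primarily on stderr" is to compare how much each channel produced. That criterion, and both function names, are assumptions made for this sketch; the Rust implementation may decide differently.

```python
import sys


def choose_channel(raw_stdout: str, raw_stderr: str) -> str:
    """Pick the channel that carried most of the original output."""
    if len(raw_stderr.strip()) > len(raw_stdout.strip()):
        return "stderr"
    return "stdout"


def emit(filtered: str, raw_stdout: str, raw_stderr: str) -> None:
    """Print filtered output on the channel the raw command used."""
    channel = choose_channel(raw_stdout, raw_stderr)
    stream = sys.stderr if channel == "stderr" else sys.stdout
    print(filtered, file=stream)
```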
When user explicitly passes -a/--all or -r/--remotes, show full remote
list instead of deduplicating against local branches. RTK's auto-added
-a (when no list flag given) still deduplicates for token savings.
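The dedup rule can be illustrated as follows. It assumes remote branches are named `<remote>/<branch>` and that dedup means hiding a remote branch when a local branch with the same name exists; the function and its parameters are hypothetical.

```python
def remotes_to_show(local: list[str], remote: list[str], user_asked: bool) -> list[str]:
    """Return the remote branches to list."""
    if user_asked:
        return list(remote)  # explicit -a/-r: show everything
    local_names = set(local)
    # Strip the remote name ("origin/") before comparing with local branches.
    return [r for r in remote if r.split("/", 1)[-1] not in local_names]
```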
…9/gh)

compact_diff() was called with max_lines=100 for PR diffs, silently
dropping files beyond ~2 in multi-file PRs. Agents doing PR review
received incomplete diffs with no truncation signal. Raised to 500.
Includes regression test with synthetic 5-file PR diff.
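To see why a 100-line cap loses files, consider a plain line cap applied to a synthetic diff. This is a simplified model, not `compact_diff()` itself, and both helpers are hypothetical.

```python
def synthetic_diff(n_files: int, lines_per_file: int) -> str:
    """Build a fake diff where each file occupies `lines_per_file` lines."""
    out = []
    for i in range(n_files):
        out.append(f"diff --git a/f{i}.rs b/f{i}.rs")
        out.extend(f"+line {j}" for j in range(lines_per_file - 1))
    return "\n".join(out)


def files_kept(diff_text: str, max_lines: int) -> list[str]:
    """List the files whose header survives a hard cap of `max_lines`."""
    kept = diff_text.splitlines()[:max_lines]
    return [l.split(" b/", 1)[1] for l in kept if l.startswith("diff --git ")]
```

With five files of 60 lines each, `files_kept(diff, 100)` keeps only the first two files, while a cap of 500 keeps all five.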
Replace clap subcommand enum with flat trailing_var_arg for compose,
then manually route subcommands after splitting global flags.
Fixes: docker compose -f deploy/docker-compose.yml build
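Splitting global flags from the subcommand might look like the sketch below. The set of value-taking global flags is an assumption limited to `-f/--file` and `-p/--project-name`, and the function name is hypothetical.

```python
# Global flags that consume the next argument (assumed, not exhaustive).
GLOBAL_FLAGS_WITH_VALUE = {"-f", "--file", "-p", "--project-name"}


def split_compose_args(args: list[str]):
    """Return (global_flags, subcommand, subcommand_args)."""
    global_flags = []
    i = 0
    while i < len(args) and args[i].startswith("-"):
        global_flags.append(args[i])
        if args[i] in GLOBAL_FLAGS_WITH_VALUE and i + 1 < len(args):
            global_flags.append(args[i + 1])  # keep the flag's value with it
            i += 1
        i += 1
    subcommand = args[i] if i < len(args) else None
    return global_flags, subcommand, args[i + 1:]
```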
filter_curl_output now pretty-prints JSON with serde_json, keeping
actual data values intact. Previous behaviour replaced values with
type names (string, int) via json_cmd, destroying API response data.
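The difference between the two behaviours can be shown in Python. The Rust code uses serde_json; this sketch uses the standard `json` module, and both function names are hypothetical. `type_skeleton` approximates the old lossy output for flat objects only.

```python
import json


def pretty_json_or_raw(body: str) -> str:
    """Pretty-print JSON bodies and return anything else unchanged."""
    try:
        parsed = json.loads(body)
    except ValueError:
        return body
    return json.dumps(parsed, indent=2, ensure_ascii=False)


def type_skeleton(body: str) -> dict:
    """Approximate the previous behaviour: values replaced by type names."""
    return {key: type(value).__name__ for key, value in json.loads(body).items()}
```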
Commands containing =$( or =` are now skipped entirely by the rewrite
hook. The ENV_PREFIX regex misparsed VAR=$(grep ...) as an env prefix,
mangling the command. Safer to skip than to risk broken shell syntax.
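The skip rule is a plain substring test. A sketch with a hypothetical function name:

```python
def should_skip_rewrite(command: str) -> bool:
    """Skip commands that assign a command substitution to a variable."""
    return "=$(" in command or "=`" in command
```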
@PLNech force-pushed the main branch 2 times, most recently from 43f54e2 to 37ac0de, March 31, 2026 13:35