Bench-runner tool fixes: timeouts, ignore-list, result caps (ToolPolicy) #134

Merged: adithyn7 merged 2 commits into main from feat/bench-runner-tool-policy on Jun 11, 2026
@adithyn7 adithyn7 commented Jun 11, 2026

Summary

Three agent-tooling fixes for the eval harness, all scoped to bench-runner only via a new ToolPolicy on ToolContext. The default policy is byte-identical to today's desktop-app behavior (no product risk); bench-runner opts into the strict variant. No tool-schema or prompt changes — tool definitions stay identical to v0.1.4.

Targets the pilot's three measured issues (runaway searches, result pollution, oversized context payloads) ahead of the 173×ON/OFF grid.

The three fixes (bench policy only)

  1. Per-tool timeouts: bash clamps the model timeout to min(arg, 300s); grep/glob get a 60s wall timeout that returns an adaptable error; codebase_* keep their existing 30s HTTP timeout.
  2. Default ignore-list: grep (rg --glob '!' + grep --exclude-dir) and glob (walk-time filter_entry, so it never descends) skip .git, node_modules, target, dist, build, __pycache__, .next, vendor, coverage + *.min.js/*.min.css. Overridable, no new param: if the model explicitly names a blocked dir in path/include/pattern, it's searched anyway.
  3. Result caps: codebase_search defaults limit=20 and caps each chunk's content to 2 KB (…[truncated]); codebase_graph renders ≤50 rows per section (… and N more).
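As an illustration of fix 2, here is a minimal std-only sketch of the skip decision with the explicit-name override. It is not the PR's code: `should_skip`, `explicitly_named`, and the two constants are hypothetical names. Two behaviors are assumptions: "explicitly names" is read as an exact path-component match in any model-supplied argument, and minified assets are treated as not overridable (the description only mentions overriding directories).

```rust
use std::path::{Component, Path};

/// Directory names skipped by default under the bench policy (list taken from the PR text).
const IGNORED_DIRS: &[&str] = &[
    ".git", "node_modules", "target", "dist", "build",
    "__pycache__", ".next", "vendor", "coverage",
];

/// Minified-asset suffixes skipped by default.
const IGNORED_SUFFIXES: &[&str] = &[".min.js", ".min.css"];

/// Assumption: a directory counts as "explicitly named" when it appears as a
/// whole path component in any model-supplied path/include/pattern argument.
fn explicitly_named(dir: &str, model_args: &[&str]) -> bool {
    model_args.iter().any(|arg| {
        arg.split(|c: char| c == '/' || c == '\\')
            .any(|part| part == dir)
    })
}

/// Returns true when `candidate` should be left out of grep/glob results.
fn should_skip(candidate: &Path, model_args: &[&str]) -> bool {
    // Skip if any component is an ignored directory the model did not name.
    let in_ignored_dir = candidate.components().any(|c| match c {
        Component::Normal(name) => name.to_str().map_or(false, |n| {
            IGNORED_DIRS.contains(&n) && !explicitly_named(n, model_args)
        }),
        _ => false,
    });
    // Assumption: minified assets are always skipped, with no override.
    let minified = candidate
        .file_name()
        .and_then(|f| f.to_str())
        .map_or(false, |f| IGNORED_SUFFIXES.iter().any(|s| f.ends_with(s)));
    in_ignored_dir || minified
}
```

In a directory walker, the same predicate would be applied to directory entries before descending, which is what lets the bench policy avoid walking node_modules at all rather than filtering it afterwards.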

Architecture

ToolPolicy carried on ToolContext, threaded from AgentConfig through the agent loop. ToolPolicy::default() = permissive (app); ToolPolicy::bench() = strict. bench-runner sets bench() unconditionally (frozen harness identity). Subagents + desktop app keep the default.
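A rough sketch of how such a policy type might look, assuming a plain struct of optional limits. Only `ToolPolicy`, `ToolContext`, `default()`, and `bench()` come from this PR; every field name and the `clamp_bash_timeout` helper are hypothetical, and `None` here means only that the policy imposes no limit of its own.

```rust
use std::time::Duration;

/// Hypothetical shape of the policy; field names are illustrative only.
/// The derived `Default` leaves every limit off, matching the permissive app path.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct ToolPolicy {
    /// Ceiling applied to the model-supplied bash timeout (None = no clamp).
    pub bash_timeout_ceiling: Option<Duration>,
    /// Wall-clock timeout for grep/glob (None = no policy timeout).
    pub search_wall_timeout: Option<Duration>,
    /// Whether grep/glob apply the default ignore-list.
    pub apply_default_ignores: bool,
    /// Default `limit` for codebase_search when the model supplies none.
    pub search_default_limit: Option<usize>,
    /// Per-chunk content cap in bytes for codebase_search.
    pub chunk_content_cap: Option<usize>,
    /// Maximum rows rendered per codebase_graph section.
    pub graph_rows_per_section: Option<usize>,
}

impl ToolPolicy {
    /// Strict variant with the values stated in the PR description.
    pub fn bench() -> Self {
        Self {
            bash_timeout_ceiling: Some(Duration::from_secs(300)),
            search_wall_timeout: Some(Duration::from_secs(60)),
            apply_default_ignores: true,
            search_default_limit: Some(20),
            chunk_content_cap: Some(2 * 1024),
            graph_rows_per_section: Some(50),
        }
    }

    /// min(arg, ceiling) when a ceiling is set; otherwise the request passes through.
    pub fn clamp_bash_timeout(&self, requested: Duration) -> Duration {
        match self.bash_timeout_ceiling {
            Some(ceiling) => requested.min(ceiling),
            None => requested,
        }
    }
}

/// Hypothetical minimal context: tools read the policy from here.
#[derive(Debug, Clone, Default)]
pub struct ToolContext {
    pub policy: ToolPolicy,
}
```

Under this sketch, bench-runner would build its context with `ToolContext { policy: ToolPolicy::bench() }`, while subagents and the desktop app would keep `ToolContext::default()`.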

Testing

  • Unit: 14 new tests — each fix has a bench-policy test and a default-policy test proving the app path is unchanged. Full agent lib suite: 442 passed. cargo check --workspace --all-targets clean.

  • Live scale test (real compiled glob/grep, 50k-file node_modules):

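    | policy        | glob **/*.js | grep    | node_modules in results | src found | min.js   |
    |---------------|--------------|---------|-------------------------|-----------|----------|
    | default (app) | 752 ms       | 1049 ms | yes (98–99)             | yes       | included |
    | bench         | 0 ms         | 10 ms   | none                    | yes       | excluded |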

    ~100× faster, node_modules fully excluded, real source still found — this is the regression that caused the 18-min monorepo hang.

Invariants

  • Zero new dependencies (no Cargo.toml/Cargo.lock changes) → static-musl R1 invariant structurally untouched; CI bench-runner-musl is the gate.
  • No schema/prompt changes → treatment unchanged vs v0.1.4.

adithyn7 added 2 commits June 11, 2026 21:09
Introduce a ToolPolicy carried on ToolContext so search/exec tools can run
stricter under headless eval without changing the desktop app. Default policy
is permissive (byte-identical to current app behavior); ToolPolicy::bench()
enables:

- bash: clamp the model-supplied timeout to a 300s ceiling
- grep/glob: 60s wall timeout + skip build/vendor/coverage dirs (node_modules,
  target, dist, build, vendor, .next, __pycache__, coverage, .git) and minified
  assets, overridable when the model explicitly targets a dir via path/glob
- codebase_search: default limit 20, per-chunk content capped to 2KB
- codebase_graph: render at most 50 rows per section

No tool-schema or prompt changes. Subagents and the desktop app keep the
default policy.

The eval harness sets ToolPolicy::bench() so the frozen binary is robust to
runaway searches and oversized result payloads. Always on — part of the
harness identity, no CLI flag.
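The result caps named in the first commit message (2KB per chunk, 50 rows per section) could be sketched roughly as follows. `cap_chunk` and `render_section` are hypothetical helper names, not functions from this PR. Two details are assumptions: the cut is moved back to a UTF-8 character boundary, and the truncation marker is appended on top of the byte cap rather than counted within it.

```rust
/// Cap `content` to at most `max_bytes` bytes of original text, then append a marker.
/// Assumption: the cut backs up to the nearest UTF-8 char boundary so slicing is safe.
fn cap_chunk(content: &str, max_bytes: usize) -> String {
    if content.len() <= max_bytes {
        return content.to_string();
    }
    let mut end = max_bytes;
    // Index 0 is always a char boundary, so this loop terminates.
    while !content.is_char_boundary(end) {
        end -= 1;
    }
    format!("{}…[truncated]", &content[..end])
}

/// Render a section title, at most `max_rows` rows, and a summary line for the rest.
fn render_section(title: &str, rows: &[String], max_rows: usize) -> String {
    let mut out = format!("{}\n", title);
    for row in rows.iter().take(max_rows) {
        out.push_str(row);
        out.push('\n');
    }
    if rows.len() > max_rows {
        out.push_str(&format!("… and {} more\n", rows.len() - max_rows));
    }
    out
}
```

With the bench values this would be called as `cap_chunk(chunk, 2048)` and `render_section(title, &rows, 50)`. Inputs already under the cap come back unchanged, which is how the default policy's output can stay identical.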
@adithyn7 adithyn7 added the patch Patch version bump label Jun 11, 2026
@adithyn7 adithyn7 merged commit 908afb7 into main Jun 11, 2026
4 checks passed
