Bench-runner tool fixes: timeouts, ignore-list, result caps (ToolPolicy) #134
Merged
Introduce a ToolPolicy carried on ToolContext so search/exec tools can run stricter under headless eval without changing the desktop app. Default policy is permissive (byte-identical to current app behavior); ToolPolicy::bench() enables:
- bash: clamp the model-supplied timeout to a 300s ceiling
- grep/glob: 60s wall timeout + skip build/vendor/coverage dirs (node_modules, target, dist, build, vendor, .next, __pycache__, coverage, .git) and minified assets, overridable when the model explicitly targets a dir via path/glob
- codebase_search: default limit 20, per-chunk content capped to 2KB
- codebase_graph: render at most 50 rows per section
No tool-schema or prompt changes. Subagents and the desktop app keep the default policy.
The eval harness sets ToolPolicy::bench() so the frozen binary is robust to runaway searches and oversized result payloads. Always on — part of the harness identity, no CLI flag.
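A minimal sketch of how such a policy type might look, assuming a plain struct of optional limits. The field names and the clamp_bash_timeout helper are hypothetical and not taken from the PR; only ToolPolicy, default(), bench(), and the numeric limits come from the description above.

```rust
use std::time::Duration;

/// Hypothetical shape of the policy. Every field name here is illustrative.
/// `None` / `false` means "no restriction", which is what `default()` yields.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct ToolPolicy {
    /// Ceiling applied to the model-supplied bash timeout.
    pub bash_timeout_ceiling: Option<Duration>,
    /// Wall-clock timeout for grep/glob.
    pub search_wall_timeout: Option<Duration>,
    /// Skip build/vendor/coverage dirs and minified assets.
    pub skip_noise_dirs: bool,
    /// Default `limit` for codebase_search.
    pub search_default_limit: Option<usize>,
    /// Per-chunk content cap in bytes.
    pub chunk_byte_cap: Option<usize>,
    /// Maximum rows rendered per codebase_graph section.
    pub graph_row_cap: Option<usize>,
}

impl ToolPolicy {
    /// Strict variant with the limits listed in the commit message.
    pub fn bench() -> Self {
        Self {
            bash_timeout_ceiling: Some(Duration::from_secs(300)),
            search_wall_timeout: Some(Duration::from_secs(60)),
            skip_noise_dirs: true,
            search_default_limit: Some(20),
            chunk_byte_cap: Some(2 * 1024),
            graph_row_cap: Some(50),
        }
    }

    /// Clamp the requested timeout to the ceiling, i.e. min(arg, ceiling).
    /// With no ceiling configured the request passes through unchanged.
    pub fn clamp_bash_timeout(&self, requested: Duration) -> Duration {
        match self.bash_timeout_ceiling {
            Some(ceiling) => requested.min(ceiling),
            None => requested,
        }
    }
}
```

Modelling "permissive" as the derived Default keeps the app path trivially unchanged: every limit is absent unless a caller opts into bench().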
Summary
Three agent-tooling fixes for the eval harness, all scoped to bench-runner only via a new ToolPolicy on ToolContext. The default policy is byte-identical to today's desktop-app behavior (no product risk); bench-runner opts into the strict variant. No tool-schema or prompt changes: tool definitions stay identical to v0.1.4.
Targets the pilot's three measured issues (runaway searches, result pollution, oversized context payloads) ahead of the 173×ON/OFF grid.
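As an illustration of the caps aimed at oversized context payloads, the following sketch truncates a chunk to a byte budget and a section to a row budget. The function names cap_chunk and cap_rows are hypothetical; the marker strings mirror the ones quoted in this PR, and backing up to a UTF-8 character boundary is an assumption about how a byte cap would avoid splitting a character.

```rust
/// Cap `content` at `max_bytes` bytes and append a truncation marker.
/// The cut backs up to the nearest UTF-8 char boundary so slicing never panics.
pub fn cap_chunk(content: &str, max_bytes: usize) -> String {
    if content.len() <= max_bytes {
        return content.to_string();
    }
    let mut end = max_bytes;
    // Index 0 is always a boundary, so this loop terminates.
    while !content.is_char_boundary(end) {
        end -= 1;
    }
    format!("{}…[truncated]", &content[..end])
}

/// Keep at most `max_rows` rows and summarize the remainder in one extra line.
pub fn cap_rows(rows: &[String], max_rows: usize) -> Vec<String> {
    if rows.len() <= max_rows {
        return rows.to_vec();
    }
    let mut out = rows[..max_rows].to_vec();
    out.push(format!("… and {} more", rows.len() - max_rows));
    out
}
```

Under the bench policy these would be called with 2048 bytes and 50 rows respectively; under the default policy they would simply not be called.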
The three fixes (bench policy only)
- Timeouts: bash clamps the model timeout to min(arg, 300s); grep/glob get a 60s wall timeout that returns an adaptable error; codebase_* keep their existing 30s HTTP timeout.
- Ignore-list: grep (rg --glob '!' + grep --exclude-dir) and glob (walk-time filter_entry, so it never descends) skip .git, node_modules, target, dist, build, __pycache__, .next, vendor, coverage + *.min.js/*.min.css. Overridable, no new param: if the model explicitly names a blocked dir in path/include/pattern, it's searched anyway.
- Result caps: codebase_search defaults limit=20 and caps each chunk's content to 2 KB (…[truncated]); codebase_graph renders ≤50 rows per section (… and N more).
Architecture
ToolPolicy carried on ToolContext, threaded from AgentConfig through the agent loop. ToolPolicy::default() = permissive (app); ToolPolicy::bench() = strict. bench-runner sets bench() unconditionally (frozen harness identity). Subagents + desktop app keep the default.
Testing
Unit: 14 new tests — each fix has a bench-policy test and a default-policy test proving the app path is unchanged. Full agent lib suite: 442 passed.
cargo check --workspace --all-targets clean.
Live scale test (real compiled glob/grep, 50k-file node_modules): **/*.js ~100× faster, node_modules fully excluded, real source still found. This is the regression that caused the 18-min monorepo hang.
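The ignore-list and its override rule, which the live scale test exercises, could be sketched as a path predicate of the kind a walk-time filter would consult. This is a std-only illustration, not the PR's code: explicit_overrides and should_visit are hypothetical names, the override detection is a naive path-segment match, and treating minified assets as always skipped is an assumption.

```rust
use std::path::{Component, Path};

/// Directory names skipped under the bench policy (list taken from the PR text).
const SKIPPED_DIRS: &[&str] = &[
    ".git", "node_modules", "target", "dist", "build",
    "__pycache__", ".next", "vendor", "coverage",
];

fn is_minified(name: &str) -> bool {
    name.ends_with(".min.js") || name.ends_with(".min.css")
}

/// Blocked dirs that the model named explicitly in its path/include/pattern
/// arguments. Assumption: a dir counts as "named" when it appears as a whole
/// path segment of any argument.
pub fn explicit_overrides(args: &[&str]) -> Vec<&'static str> {
    SKIPPED_DIRS
        .iter()
        .copied()
        .filter(|dir| {
            args.iter().any(|arg| {
                arg.split(|c| c == '/' || c == '\\').any(|seg| seg == *dir)
            })
        })
        .collect()
}

/// Returns false when any path component is a skipped dir name that was not
/// overridden, or when the file name looks like a minified asset.
pub fn should_visit(rel_path: &Path, overrides: &[&str]) -> bool {
    for comp in rel_path.components() {
        if let Component::Normal(os) = comp {
            let lossy = os.to_string_lossy();
            let name: &str = &lossy;
            let blocked = SKIPPED_DIRS.iter().any(|d| *d == name);
            let overridden = overrides.iter().any(|o| *o == name);
            if blocked && !overridden {
                return false;
            }
        }
    }
    match rel_path.file_name() {
        Some(f) => !is_minified(&f.to_string_lossy()),
        None => true,
    }
}
```

Applying a predicate like this while walking, rather than filtering results afterwards, is what lets the traversal avoid descending into a 50k-file node_modules at all.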
Invariants
bench-runner-musl is the gate.