Bench-runner tool fixes: timeouts, ignore-list, result caps (ToolPolicy) by adithyn7 · Pull Request #134 · TransformerOptimus/SuperCoder

adithyn7 · 2026-06-11T15:52:59Z

Summary

Three agent-tooling fixes for the eval harness, all scoped to bench-runner only via a new ToolPolicy on ToolContext. The default policy is byte-identical to today's desktop-app behavior (no product risk); bench-runner opts into the strict variant. No tool-schema or prompt changes — tool definitions stay identical to v0.1.4.

Targets the pilot's three measured issues (runaway searches, result pollution, oversized context payloads) ahead of the 173×ON/OFF grid.

The three fixes (bench policy only)

1. Per-tool timeouts: bash clamps the model-supplied timeout to min(arg, 300s); grep/glob get a 60s wall timeout and return an error the model can adapt to; codebase_* keep their existing 30s HTTP timeout.
2. Default ignore-list: grep (rg --glob '!' + grep --exclude-dir) and glob (walk-time filter_entry, so it never descends) skip .git, node_modules, target, dist, build, __pycache__, .next, vendor, coverage, plus *.min.js/*.min.css. Overridable without a new param: if the model explicitly names a blocked dir in path/include/pattern, it is searched anyway.
3. Result caps: codebase_search defaults to limit=20 and caps each chunk's content at 2 KB (…[truncated]); codebase_graph renders ≤50 rows per section (… and N more).
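As a rough illustration of the timeout clamp and the two result caps, here is a minimal std-only sketch. The function and constant names are hypothetical (the PR does not show the implementation), and appending the truncation marker after the 2 KB cut, rather than counting it toward the cap, is an assumption.

```rust
use std::time::Duration;

// Values taken from the PR description; names are illustrative only.
const BASH_TIMEOUT_CEILING: Duration = Duration::from_secs(300);
const CHUNK_CAP_BYTES: usize = 2 * 1024;
const GRAPH_ROWS_PER_SECTION: usize = 50;

/// min(arg, ceiling): the model may ask for less time, never more.
fn clamp_timeout(requested: Duration, ceiling: Duration) -> Duration {
    requested.min(ceiling)
}

/// Cut `content` to at most `cap` bytes without splitting a UTF-8
/// character, appending a marker only when something was dropped.
fn truncate_chunk(content: &str, cap: usize) -> String {
    if content.len() <= cap {
        return content.to_string();
    }
    let mut end = cap;
    // Step back to the nearest character boundary (index 0 always is one).
    while !content.is_char_boundary(end) {
        end -= 1;
    }
    format!("{}…[truncated]", &content[..end])
}

/// Render at most `max_rows` rows, then one summary line for the rest.
fn render_rows(rows: &[String], max_rows: usize) -> String {
    let mut out: Vec<String> = rows.iter().take(max_rows).cloned().collect();
    if rows.len() > max_rows {
        out.push(format!("… and {} more", rows.len() - max_rows));
    }
    out.join("\n")
}
```

Cutting on a character boundary matters because slicing a `&str` in the middle of a multi-byte character panics, so a naive `&content[..2048]` could crash on non-ASCII source files.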

Architecture

ToolPolicy carried on ToolContext, threaded from AgentConfig through the agent loop. ToolPolicy::default() = permissive (app); ToolPolicy::bench() = strict. bench-runner sets bench() unconditionally (frozen harness identity). Subagents + desktop app keep the default.
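The shape described above might look roughly like the following sketch. Only the names ToolPolicy, ToolContext, AgentConfig, default() and bench() come from the PR; every field name, the use of `Option` to mean "keep existing app behavior", and the `from_config` and `bash_timeout` helpers are assumptions made for illustration.

```rust
use std::time::Duration;

/// Hypothetical field layout; `None`/`false` means "leave the tool's
/// existing behavior untouched", which is how the default stays permissive.
#[derive(Debug, Clone, PartialEq)]
struct ToolPolicy {
    bash_timeout_ceiling: Option<Duration>,
    search_wall_timeout: Option<Duration>,
    default_ignores: bool,
    search_default_limit: Option<usize>,
    chunk_cap_bytes: Option<usize>,
    graph_rows_per_section: Option<usize>,
}

impl Default for ToolPolicy {
    /// Permissive policy used by the desktop app and subagents.
    fn default() -> Self {
        ToolPolicy {
            bash_timeout_ceiling: None,
            search_wall_timeout: None,
            default_ignores: false,
            search_default_limit: None,
            chunk_cap_bytes: None,
            graph_rows_per_section: None,
        }
    }
}

impl ToolPolicy {
    /// Strict policy for the eval harness (values from the PR description).
    fn bench() -> Self {
        ToolPolicy {
            bash_timeout_ceiling: Some(Duration::from_secs(300)),
            search_wall_timeout: Some(Duration::from_secs(60)),
            default_ignores: true,
            search_default_limit: Some(20),
            chunk_cap_bytes: Some(2 * 1024),
            graph_rows_per_section: Some(50),
        }
    }
}

/// Stand-ins for the real types; only the policy field is modeled here.
struct AgentConfig {
    tool_policy: ToolPolicy,
}

struct ToolContext {
    policy: ToolPolicy,
}

impl ToolContext {
    /// Thread the policy from the agent config into the tool context.
    fn from_config(config: &AgentConfig) -> Self {
        ToolContext { policy: config.tool_policy.clone() }
    }

    /// A tool consults the policy instead of hard-coding a limit.
    fn bash_timeout(&self, requested: Duration) -> Duration {
        match self.policy.bash_timeout_ceiling {
            Some(ceiling) => requested.min(ceiling),
            None => requested,
        }
    }
}
```

Keeping the limits as data on the context, rather than as a CLI flag or a compile-time feature, is what lets the same tool code serve both callers while the tool schema stays unchanged.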

Testing

Unit: 14 new tests — each fix has a bench-policy test and a default-policy test proving the app path is unchanged. Full agent lib suite: 442 passed. cargo check --workspace --all-targets clean.
Live scale test (real compiled glob/grep, 50k-file node_modules):

| policy | glob `**/*.js` | grep | node_modules in results | src found | min.js |
|---|---|---|---|---|---|
| default (app) | 752 ms | 1049 ms | yes (98–99) | yes | included |
| bench | 0 ms | 10 ms | none | yes | excluded |

~100× faster, node_modules fully excluded, real source still found — this is the regression that caused the 18-min monorepo hang.
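The speedup comes from deciding about a directory before reading it, so an ignored subtree costs nothing. The sketch below shows that idea with a plain std::fs recursion standing in for the walk-time filter the PR mentions; it is not the PR's code. The function names are hypothetical, and treating "explicitly names" as "the directory appears as a path component of some model argument" is an assumption.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Directory names skipped by default (list from the PR description).
const IGNORED_DIRS: &[&str] = &[
    ".git", "node_modules", "target", "dist", "build",
    "__pycache__", ".next", "vendor", "coverage",
];

/// True if `dir` appears as a path component in any model-supplied argument.
fn explicitly_named(dir: &str, model_args: &[&str]) -> bool {
    model_args.iter().any(|arg| {
        arg.split(|c: char| c == '/' || c == '\\').any(|part| part == dir)
    })
}

/// Skip a blocked directory unless the model asked for it by name.
fn should_skip(dir_name: &str, model_args: &[&str]) -> bool {
    IGNORED_DIRS.iter().any(|d| *d == dir_name) && !explicitly_named(dir_name, model_args)
}

/// Minified assets are excluded by file name.
fn is_minified(file_name: &str) -> bool {
    file_name.ends_with(".min.js") || file_name.ends_with(".min.css")
}

/// Collect files under `root`, pruning ignored directories before descending.
fn walk(root: &Path, model_args: &[&str], out: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(root)? {
        let entry = entry?;
        let name = entry.file_name().to_string_lossy().into_owned();
        if entry.file_type()?.is_dir() {
            if should_skip(&name, model_args) {
                continue; // pruned: the subtree is never read
            }
            walk(&entry.path(), model_args, out)?;
        } else if !is_minified(&name) {
            out.push(entry.path());
        }
    }
    Ok(())
}
```

Filtering results after the walk would produce the same output but still pay for reading all 50k files, which is the cost the bench row avoids.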

Invariants

Zero new dependencies (no Cargo.toml/Cargo.lock changes) → static-musl R1 invariant structurally untouched; CI bench-runner-musl is the gate.
No schema/prompt changes → treatment unchanged vs v0.1.4.

Introduce a ToolPolicy carried on ToolContext so search/exec tools can run stricter under headless eval without changing the desktop app. Default policy is permissive (byte-identical to current app behavior); ToolPolicy::bench() enables:

- bash: clamp the model-supplied timeout to a 300s ceiling
- grep/glob: 60s wall timeout + skip build/vendor/coverage dirs (node_modules, target, dist, build, vendor, .next, __pycache__, coverage, .git) and minified assets, overridable when the model explicitly targets a dir via path/glob
- codebase_search: default limit 20, per-chunk content capped to 2KB
- codebase_graph: render at most 50 rows per section

No tool-schema or prompt changes. Subagents and the desktop app keep the default policy.

adithyn7 added 2 commits June 11, 2026 21:09

feat(bench-runner): opt into strict ToolPolicy

b5815ae

The eval harness sets ToolPolicy::bench() so the frozen binary is robust to runaway searches and oversized result payloads. Always on — part of the harness identity, no CLI flag.

adithyn7 added the patch Patch version bump label Jun 11, 2026

adithyn7 merged commit 908afb7 into main Jun 11, 2026
4 checks passed

adithyn7 mentioned this pull request Jun 12, 2026

Bench-runner LLM transport fix: HTTP/1.1 + no pool + 15s header timeout (LlmPolicy) #135

Merged

adithyn7 merged 2 commits into main from feat/bench-runner-tool-policy
