Bench-runner v0.1.7: expose LLM + tool policy knobs as CLI / env tunables #136
Merged
…ields

Replace ToolPolicy::codebase_result_caps (a single bool gating the hardcoded 20/2048/50 constants in codebase_search.rs + codebase_graph.rs) with three Option fields:

  codebase_search_limit: Option<u32> (None = no default-limit injection)
  codebase_search_chunk_bytes: Option<usize> (None = no content cap)
  codebase_graph_section_cap: Option<usize> (None = no section cap)

ToolPolicy::bench() still resolves to the same effective values (Some(20) / Some(2048) / Some(50)), so existing behavior is byte-identical. The refactor enables bench-runner to tune these from CLI flags / env vars in the next commit without rebuilding the binary.
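As a rough illustration of the shape this refactor describes, the sketch below models the three fields and how a `None` versus `Some` cap might be applied. The struct is reduced to the three fields named above; the `Default` derive, the `cap_chunk` helper, and its char-boundary truncation are assumptions for illustration, not the code in `codebase_search.rs`.

```rust
/// Reduced, hypothetical stand-in for ToolPolicy: only the three fields
/// named in the commit message. The real struct has more fields.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct ToolPolicy {
    /// None = no default-limit injection.
    pub codebase_search_limit: Option<u32>,
    /// None = no content cap.
    pub codebase_search_chunk_bytes: Option<usize>,
    /// None = no section cap.
    pub codebase_graph_section_cap: Option<usize>,
}

impl ToolPolicy {
    /// Effective bench values stated in the commit message.
    pub fn bench() -> Self {
        Self {
            codebase_search_limit: Some(20),
            codebase_search_chunk_bytes: Some(2048),
            codebase_graph_section_cap: Some(50),
        }
    }
}

/// Hypothetical helper: apply an optional byte cap to a result chunk.
/// Truncation backs off to the nearest UTF-8 char boundary (an assumption).
pub fn cap_chunk(content: &str, cap: Option<usize>) -> &str {
    match cap {
        None => content,
        Some(max) if content.len() <= max => content,
        Some(max) => {
            let mut end = max;
            // Index 0 is always a char boundary, so this loop terminates.
            while !content.is_char_boundary(end) {
                end -= 1;
            }
            &content[..end]
        }
    }
}
```

Modelling each cap as an `Option` lets a permissive policy skip the cap entirely (`None`) while the bench policy keeps its previous numbers, which is the property the commit relies on.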
Add 10 CLI flags (each with SUPERCODER_<NAME> env fallback) for tuning the bench binary without rebuilding. Defaults match the frozen v0.1.6 LlmPolicy::bench() and ToolPolicy::bench() values exactly: zero behavior change unless overridden.

Required by the grid team after kimi-k2p6 hit the hardcoded 15s header timeout on legitimately slow upstream cold starts (20-60s TTFB).

LLM transport / retry:
  --llm-header-timeout-ms (default 15000, 0 disables)
  --llm-max-retries (default 3)
  --llm-retry-initial-ms (default 1000)
  --llm-retry-multiplier (default 2.0)
  --llm-retry-max-ms (default 30000)

Tool policy:
  --bash-timeout-ms (default 300000)
  --search-timeout-ms (default 60000)
  --codebase-search-limit (default 20)
  --codebase-search-chunk-bytes (default 2048)
  --codebase-graph-section-cap (default 50)

http1_only and no_pool stay hardcoded: they ARE the v0.1.6 transport fix, and toggling them risks bringing back the decode disease.

For the kimi-k2p6 cells:
  SUPERCODER_LLM_HEADER_TIMEOUT_MS=90000
  SUPERCODER_LLM_RETRY_MULTIPLIER=3
  SUPERCODER_LLM_RETRY_MAX_MS=60000

clap 'env' feature added: pure activation of code already in clap; no new transitive deps (Cargo.lock unchanged). R1 invariant intact.
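The precedence these flags follow (command-line value first, then the env var, then the frozen default) and the "0 disables" rule for the header timeout can be sketched without clap. This is a std-only illustration; `resolve_ms` and `header_timeout` are hypothetical names, and the actual wiring in `run()` goes through clap's `env` support rather than code like this.

```rust
use std::time::Duration;

/// Hypothetical helper: resolve a millisecond knob with the precedence
/// CLI flag > env var > built-in default. The env value is passed in as a
/// string so the function stays pure and easy to test.
pub fn resolve_ms(
    cli: Option<u64>,
    env_value: Option<&str>,
    default_ms: u64,
) -> Result<u64, String> {
    if let Some(v) = cli {
        return Ok(v);
    }
    match env_value {
        // A present but unparsable env value is an error, not a silent fallback.
        Some(raw) => raw
            .trim()
            .parse::<u64>()
            .map_err(|e| format!("invalid value {:?}: {}", raw, e)),
        None => Ok(default_ms),
    }
}

/// Hypothetical helper: 0 means "no header timeout", anything else is a
/// timeout of that many milliseconds.
pub fn header_timeout(ms: u64) -> Option<Duration> {
    if ms == 0 {
        None
    } else {
        Some(Duration::from_millis(ms))
    }
}
```

A caller could feed it `std::env::var("SUPERCODER_LLM_HEADER_TIMEOUT_MS").ok().as_deref()` as the `env_value` argument and pass the result to `header_timeout`.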
Summary
Ship the bench-runner frozen policy values as tunable CLI flags / env vars so the eval team can adjust them without rebuilding the binary. Required after kimi-k2p6 cells hit the hardcoded 15s header timeout on legitimately slow upstream cold starts (20-60s TTFB) — see the grid team's report in the PR thread.
Defaults match v0.1.6 exactly — zero behavior change unless overridden. Bench-runner-only; desktop app code path unchanged. No schema/prompt changes.
What's new
10 CLI flags, each with a `SUPERCODER_<NAME>` env fallback. Same defaults as v0.1.6's hardcoded `LlmPolicy::bench()` / `ToolPolicy::bench()`.

| Flag | Env var | Default |
| --- | --- | --- |
| `--llm-header-timeout-ms` | `SUPERCODER_LLM_HEADER_TIMEOUT_MS` | `15000` (`0` disables) |
| `--llm-max-retries` | `SUPERCODER_LLM_MAX_RETRIES` | `3` |
| `--llm-retry-initial-ms` | `SUPERCODER_LLM_RETRY_INITIAL_MS` | `1000` |
| `--llm-retry-multiplier` | `SUPERCODER_LLM_RETRY_MULTIPLIER` | `2.0` |
| `--llm-retry-max-ms` | `SUPERCODER_LLM_RETRY_MAX_MS` | `30000` |
| `--bash-timeout-ms` | `SUPERCODER_BASH_TIMEOUT_MS` | `300000` |
| `--search-timeout-ms` | `SUPERCODER_SEARCH_TIMEOUT_MS` | `60000` |
| `--codebase-search-limit` | `SUPERCODER_CODEBASE_SEARCH_LIMIT` | `20` |
| `--codebase-search-chunk-bytes` | `SUPERCODER_CODEBASE_SEARCH_CHUNK_BYTES` | `2048` |
| `--codebase-graph-section-cap` | `SUPERCODER_CODEBASE_GRAPH_SECTION_CAP` | `50` |

What stays hardcoded (intentional)
`http1_only`, `no_pool`, `search_default_ignores`: these ARE the v0.1.6 transport fix (#135) and the bench treatment. Exposing them risks bringing back the decode disease or changing the experimental treatment.

For the kimi-k2p6 failing cells
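Set `SUPERCODER_LLM_HEADER_TIMEOUT_MS=90000`, `SUPERCODER_LLM_RETRY_MULTIPLIER=3`, and `SUPERCODER_LLM_RETRY_MAX_MS=60000`. That gives 90s for headers (covering the measured 20-60s TTFB with headroom) and a 1s → 3s → 9s → 27s retry backoff (capped at 60s), which leaves enough room for proxy circuit breaker windows to drain.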
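The backoff schedule quoted above can be reproduced with a small calculation. This is a sketch assuming plain exponential backoff (initial delay times multiplier to the power of the attempt index, clamped to the max) with no jitter; the function name `backoff_delays_ms` is hypothetical, and the real retry loop may differ in detail.

```rust
/// Hypothetical helper: delays (in ms) before each retry, assuming
/// delay(attempt) = min(initial_ms * multiplier^attempt, max_ms), no jitter.
pub fn backoff_delays_ms(
    initial_ms: u64,
    multiplier: f64,
    max_ms: u64,
    attempts: u32,
) -> Vec<u64> {
    (0..attempts)
        .map(|attempt| {
            let raw = initial_ms as f64 * multiplier.powi(attempt as i32);
            // Clamp to the configured ceiling before converting back to u64.
            raw.min(max_ms as f64) as u64
        })
        .collect()
}
```

With the kimi-k2p6 settings, `backoff_delays_ms(1000, 3.0, 60_000, 5)` yields 1000, 3000, 9000, 27000 and 60000 ms; the fifth delay would be 81000 ms uncapped, so that is where the 60s ceiling starts to apply.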
Implementation
Two commits:
1. `refactor(agent): expose codebase result caps as per-knob ToolPolicy fields`: replaces the single `codebase_result_caps: bool` (which gated hardcoded 20/2048/50 constants in `codebase_search.rs` + `codebase_graph.rs`) with three `Option<…>` fields on `ToolPolicy`. `ToolPolicy::bench()` still resolves to the same effective values, so existing behavior is byte-identical.
2. `feat(bench-runner): tunable LLM + tool policy knobs via CLI / env`: the 10 flags + their wire-up in `run()`. Adds clap's `env` feature (pure activation of code already in clap; no new transitive deps).

Testing
- `cli_defaults_match_frozen_policy`: locks every default to v0.1.6 values so the eval team's prior runs stay reproducible.
- `cli_flag_overrides`: verifies all 10 flags override correctly.
- `header_timeout_zero_means_disabled`: locks the `0` escape hatch.
- `cargo check --workspace --all-targets` clean.
- `alpine:3.20`: `--help` shows all 10 new flags, env-var overrides accepted.

Invariants
- No new transitive deps (`Cargo.lock` unchanged) → static-musl R1 invariant intact; CI `bench-runner-musl` is the gate.
- Desktop app code path unchanged: same `SHARED_HTTP_CLIENT`, same `ToolPolicy::default()` (permissive).