Bench-runner v0.1.7: expose LLM + tool policy knobs as CLI / env tunables by adithyn7 · Pull Request #136 · TransformerOptimus/SuperCoder

adithyn7 · 2026-06-13T08:25:47Z

Summary

Ship the bench-runner frozen policy values as tunable CLI flags / env vars so the eval team can adjust them without rebuilding the binary. Required after kimi-k2p6 cells hit the hardcoded 15s header timeout on legitimately slow upstream cold starts (20-60s TTFB) — see the grid team's report in the PR thread.

Defaults match v0.1.6 exactly — zero behavior change unless overridden. Bench-runner-only; desktop app code path unchanged. No schema/prompt changes.

What's new

10 CLI flags, each with a SUPERCODER_<NAME> env fallback. Same defaults as v0.1.6's hardcoded LlmPolicy::bench() / ToolPolicy::bench().

| Flag | Env var | Default |
| --- | --- | --- |
| `--llm-header-timeout-ms` | `SUPERCODER_LLM_HEADER_TIMEOUT_MS` | `15000` (0 disables) |
| `--llm-max-retries` | `SUPERCODER_LLM_MAX_RETRIES` | `3` |
| `--llm-retry-initial-ms` | `SUPERCODER_LLM_RETRY_INITIAL_MS` | `1000` |
| `--llm-retry-multiplier` | `SUPERCODER_LLM_RETRY_MULTIPLIER` | `2.0` |
| `--llm-retry-max-ms` | `SUPERCODER_LLM_RETRY_MAX_MS` | `30000` |
| `--bash-timeout-ms` | `SUPERCODER_BASH_TIMEOUT_MS` | `300000` |
| `--search-timeout-ms` | `SUPERCODER_SEARCH_TIMEOUT_MS` | `60000` |
| `--codebase-search-limit` | `SUPERCODER_CODEBASE_SEARCH_LIMIT` | `20` |
| `--codebase-search-chunk-bytes` | `SUPERCODER_CODEBASE_SEARCH_CHUNK_BYTES` | `2048` |
| `--codebase-graph-section-cap` | `SUPERCODER_CODEBASE_GRAPH_SECTION_CAP` | `50` |
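
For readers unfamiliar with the lookup order, here is a minimal std-only sketch of how one millisecond knob could resolve. clap's `env` feature takes the command-line value first, then the named environment variable, then the declared default, and the sketch mirrors that order without depending on clap. `resolve_ms` and `header_timeout` are hypothetical helper names, not functions from this PR, and the env value is passed in as a parameter (an assumption made to keep the example pure and testable).

```rust
use std::time::Duration;

/// Resolve one knob: an explicit CLI value wins, then the env var, then the default.
/// `env_value` is the raw string a caller would have read from e.g.
/// SUPERCODER_LLM_HEADER_TIMEOUT_MS (hypothetical wiring, not the PR's code).
fn resolve_ms(cli: Option<u64>, env_value: Option<&str>, default_ms: u64) -> Result<u64, String> {
    if let Some(v) = cli {
        return Ok(v);
    }
    match env_value {
        // A set-but-malformed env var is reported as an error rather than ignored.
        Some(raw) => raw
            .trim()
            .parse::<u64>()
            .map_err(|e| format!("invalid millisecond value {raw:?}: {e}")),
        None => Ok(default_ms),
    }
}

/// Map the header-timeout knob to an optional duration; 0 means "no timeout".
fn header_timeout(ms: u64) -> Option<Duration> {
    if ms == 0 {
        None
    } else {
        Some(Duration::from_millis(ms))
    }
}
```

With no flag and no env var, `resolve_ms(None, None, 15000)` yields the v0.1.6 default; setting only the env var to `90000` yields 90000; passing both lets the flag win.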

What stays hardcoded (intentional)

http1_only, no_pool, search_default_ignores — these ARE the v0.1.6 transport fix (#135) and the bench treatment. Exposing them risks bringing back the decode disease or changing the experimental treatment.

For the kimi-k2p6 failing cells

SUPERCODER_LLM_HEADER_TIMEOUT_MS=90000 \
SUPERCODER_LLM_RETRY_MULTIPLIER=3      \
SUPERCODER_LLM_RETRY_MAX_MS=60000

90s for headers (covers the measured 20-60s TTFB with headroom) plus a 1s → 3s → 9s → 27s retry backoff (capped at 60s): enough room for proxy circuit breaker windows to drain.
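
The backoff arithmetic above can be checked with a short sketch. It assumes plain exponential backoff with no jitter (delay = initial × multiplier^attempt, clamped to the max); the actual retry loop in the agent may differ, and `backoff_ms` / `backoff_schedule` are hypothetical names.

```rust
/// Delay before retry number `attempt` (0-based):
/// initial * multiplier^attempt, capped at `max_ms`.
/// Assumes no jitter, which the real policy may or may not add.
fn backoff_ms(initial_ms: u64, multiplier: f64, max_ms: u64, attempt: u32) -> u64 {
    let raw = initial_ms as f64 * multiplier.powi(attempt as i32);
    // Apply the cap in floating point, then convert back to whole milliseconds.
    raw.min(max_ms as f64) as u64
}

/// The first `retries` delays for a given policy.
fn backoff_schedule(initial_ms: u64, multiplier: f64, max_ms: u64, retries: u32) -> Vec<u64> {
    (0..retries)
        .map(|attempt| backoff_ms(initial_ms, multiplier, max_ms, attempt))
        .collect()
}
```

Under these assumptions, `backoff_schedule(1000, 3.0, 60000, 5)` gives 1000, 3000, 9000, 27000, 60000 ms: the fifth delay would be 81s uncapped and is clamped to the 60s max. How many of these delays are actually taken depends on `--llm-max-retries`.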

Implementation

Two commits:

- refactor(agent): expose codebase result caps as per-knob ToolPolicy fields: replaces the single codebase_result_caps: bool (which gated hardcoded 20/2048/50 constants in codebase_search.rs + codebase_graph.rs) with three Option<…> fields on ToolPolicy. ToolPolicy::bench() still resolves to the same effective values, so existing behavior is byte-identical.
- feat(bench-runner): tunable LLM + tool policy knobs via CLI / env: the 10 flags + their wire-up in run(). Adds clap's env feature (pure activation of code already in clap; no new transitive deps).
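
A reduced sketch of what the first commit describes, for orientation only. The three field names, their types, and the `bench()` values come from the commit message; the struct here omits every other `ToolPolicy` field, the derived `Default` (all `None`) is an assumption standing in for the permissive desktop policy, and `cap_chunk` is a hypothetical helper showing one way a `None`-means-uncapped byte limit could be applied.

```rust
/// Reduced stand-in for ToolPolicy: only the three cap fields named in this PR.
#[derive(Debug, Clone, PartialEq, Default)]
struct ToolPolicy {
    codebase_search_limit: Option<u32>,         // None = no default-limit injection
    codebase_search_chunk_bytes: Option<usize>, // None = no content cap
    codebase_graph_section_cap: Option<usize>,  // None = no section cap
}

impl ToolPolicy {
    /// Bench preset: same effective values as the old hardcoded 20/2048/50.
    fn bench() -> Self {
        Self {
            codebase_search_limit: Some(20),
            codebase_search_chunk_bytes: Some(2048),
            codebase_graph_section_cap: Some(50),
        }
    }
}

/// Hypothetical helper: truncate `content` to at most `cap` bytes
/// without splitting a UTF-8 character. `None` leaves it untouched.
fn cap_chunk(content: &str, cap: Option<usize>) -> &str {
    match cap {
        Some(max) if content.len() > max => {
            let mut end = max;
            // Walk back to the nearest character boundary (0 is always one).
            while !content.is_char_boundary(end) {
                end -= 1;
            }
            &content[..end]
        }
        _ => content,
    }
}
```

Modeling each cap as an `Option` keeps "uncapped" distinct from any numeric value, so a CLI override only has to replace the `Some(..)` payload.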

Testing

577 tests pass workspace-wide (445 agent + 45 context-sync + 57 git-ops + 27 desktop + 3 new bench-runner CLI tests):
- cli_defaults_match_frozen_policy — locks every default to v0.1.6 values so the eval team's prior runs stay reproducible.
- cli_flag_overrides — verifies all 10 flags override correctly.
- header_timeout_zero_means_disabled — locks the 0 escape hatch.
cargo check --workspace --all-targets clean.
Cross-built amd64 static musl, verified in clean alpine:3.20 — --help shows all 10 new flags, env-var overrides accepted.

Invariants

Zero new transitive deps (Cargo.lock unchanged) → static-musl R1 invariant intact; CI bench-runner-musl is the gate.
No schema/prompt changes → treatment unchanged vs v0.1.6.
Desktop app agent unchanged → same SHARED_HTTP_CLIENT, same ToolPolicy::default() (permissive).

…ields

Replace ToolPolicy::codebase_result_caps (single bool, hardcoded 20/2048/50 constants in codebase_search.rs + codebase_graph.rs) with three Option fields:
- codebase_search_limit: Option<u32> (None = no default-limit injection)
- codebase_search_chunk_bytes: Option<usize> (None = no content cap)
- codebase_graph_section_cap: Option<usize> (None = no section cap)

ToolPolicy::bench() still resolves to the same effective values (Some(20) / Some(2048) / Some(50)) so existing behavior is byte-identical. The refactor enables bench-runner to tune these from CLI flags / env vars in the next commit without rebuilding the binary.

Add 10 CLI flags (each with SUPERCODER_<NAME> env fallback) for tuning the bench binary without rebuilding. Defaults match the frozen v0.1.6 LlmPolicy::bench() and ToolPolicy::bench() values exactly: zero behavior change unless overridden. Required by the grid team after kimi-k2p6 hit the hardcoded 15s header timeout on legitimately slow upstream cold starts (20-60s TTFB).

LLM transport / retry:
- --llm-header-timeout-ms (default 15000, 0 disables)
- --llm-max-retries (default 3)
- --llm-retry-initial-ms (default 1000)
- --llm-retry-multiplier (default 2.0)
- --llm-retry-max-ms (default 30000)

Tool policy:
- --bash-timeout-ms (default 300000)
- --search-timeout-ms (default 60000)
- --codebase-search-limit (default 20)
- --codebase-search-chunk-bytes (default 2048)
- --codebase-graph-section-cap (default 50)

http1_only and no_pool stay hardcoded: they ARE the v0.1.6 transport fix and toggling them risks bringing back the decode disease.

For the kimi-k2p6 cells:
SUPERCODER_LLM_HEADER_TIMEOUT_MS=90000
SUPERCODER_LLM_RETRY_MULTIPLIER=3
SUPERCODER_LLM_RETRY_MAX_MS=60000

clap 'env' feature added: pure activation of code already in clap; no new transitive deps (Cargo.lock unchanged). R1 invariant intact.

adithyn7 added 2 commits June 13, 2026 13:52

adithyn7 added the patch Patch version bump label Jun 13, 2026

adithyn7 merged commit eb9a1d9 into main Jun 13, 2026
4 checks passed

adithyn7 deleted the feat/bench-runner-tunable-policy branch June 13, 2026 08:28
