Bench-runner v0.1.7: expose LLM + tool policy knobs as CLI / env tunables#136

Merged
adithyn7 merged 2 commits into
mainfrom
feat/bench-runner-tunable-policy
Jun 13, 2026
@adithyn7 adithyn7 commented Jun 13, 2026

Contributor

Summary

Ship the bench-runner frozen policy values as tunable CLI flags / env vars so the eval team can adjust them without rebuilding the binary. Required after kimi-k2p6 cells hit the hardcoded 15s header timeout on legitimately slow upstream cold starts (20-60s TTFB) — see the grid team's report in the PR thread.

Defaults match v0.1.6 exactly — zero behavior change unless overridden. Bench-runner-only; desktop app code path unchanged. No schema/prompt changes.

What's new

10 CLI flags, each with a SUPERCODER_<NAME> env fallback. Same defaults as v0.1.6's hardcoded LlmPolicy::bench() / ToolPolicy::bench().

| Flag | Env var | Default |
| --- | --- | --- |
| --llm-header-timeout-ms | SUPERCODER_LLM_HEADER_TIMEOUT_MS | 15000 (0 disables) |
| --llm-max-retries | SUPERCODER_LLM_MAX_RETRIES | 3 |
| --llm-retry-initial-ms | SUPERCODER_LLM_RETRY_INITIAL_MS | 1000 |
| --llm-retry-multiplier | SUPERCODER_LLM_RETRY_MULTIPLIER | 2.0 |
| --llm-retry-max-ms | SUPERCODER_LLM_RETRY_MAX_MS | 30000 |
| --bash-timeout-ms | SUPERCODER_BASH_TIMEOUT_MS | 300000 |
| --search-timeout-ms | SUPERCODER_SEARCH_TIMEOUT_MS | 60000 |
| --codebase-search-limit | SUPERCODER_CODEBASE_SEARCH_LIMIT | 20 |
| --codebase-search-chunk-bytes | SUPERCODER_CODEBASE_SEARCH_CHUNK_BYTES | 2048 |
| --codebase-graph-section-cap | SUPERCODER_CODEBASE_GRAPH_SECTION_CAP | 50 |
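
For readers unfamiliar with how a flag with an env fallback resolves, here is a minimal std-only sketch of the precedence this PR relies on: an explicit CLI value wins, then the environment variable, then the frozen default. This is not the bench-runner code (the PR uses clap's env feature for this); `resolve_knob` is a hypothetical helper written only for illustration.

```rust
use std::str::FromStr;

/// Hypothetical helper: resolve one knob from an optional CLI value,
/// an optional environment value, and a built-in default.
/// Precedence is CLI > env > default; unparseable input is an error
/// rather than a silent fall-through to the default.
fn resolve_knob<T: FromStr>(
    cli: Option<&str>,
    env: Option<&str>,
    default: T,
) -> Result<T, String> {
    match cli.or(env) {
        Some(raw) => raw
            .trim()
            .parse::<T>()
            .map_err(|_| format!("invalid value: {raw:?}")),
        None => Ok(default),
    }
}
```

A caller would pass something like `std::env::var("SUPERCODER_LLM_HEADER_TIMEOUT_MS").ok().as_deref()` as the `env` argument. Failing loudly on a malformed value matters for a benchmark: silently reverting to 15000 would make an overridden run indistinguishable from a default one.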

What stays hardcoded (intentional)

http1_only, no_pool, search_default_ignores — these ARE the v0.1.6 transport fix (#135) and the bench treatment. Exposing them risks bringing back the decode disease or changing the experimental treatment.

For the kimi-k2p6 failing cells

SUPERCODER_LLM_HEADER_TIMEOUT_MS=90000 \
SUPERCODER_LLM_RETRY_MULTIPLIER=3      \
SUPERCODER_LLM_RETRY_MAX_MS=60000

90s for headers (matches measured 20-60s TTFB) + 1s → 3s → 9s → 27s retry backoff (capped at 60s) — enough room for proxy circuit breaker windows to drain.
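
The backoff arithmetic above can be checked with a short sketch. It assumes the delay before retry n (0-based) is `initial_ms * multiplier^n`, clamped to `max_ms`; the actual retry loop is not shown in this PR, so `backoff_schedule` is a hypothetical function and the formula is an assumption consistent with the 1s → 3s → 9s → 27s sequence quoted.

```rust
/// Hypothetical sketch: delay (in ms) before each retry, assuming
/// delay_n = min(initial_ms * multiplier^n, max_ms) for n = 0, 1, 2, ...
fn backoff_schedule(initial_ms: u64, multiplier: f64, max_ms: u64, retries: u32) -> Vec<u64> {
    (0..retries)
        .map(|n| {
            let raw = initial_ms as f64 * multiplier.powi(n as i32);
            // Clamp in floating point first so a large exponent cannot overflow u64.
            raw.min(max_ms as f64) as u64
        })
        .collect()
}
```

Under that assumption, `backoff_schedule(1000, 3.0, 60000, 4)` yields 1000, 3000, 9000, 27000 ms, and a fifth delay would be clamped from 81000 to 60000 ms. With the default `--llm-max-retries 3`, only the first three delays would be used if each retry waits exactly once; whether the 27s step is reached depends on how the runner counts attempts, which this PR does not show.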

Implementation

Two commits:

  1. refactor(agent): expose codebase result caps as per-knob ToolPolicy fields — replaces the single codebase_result_caps: bool (which gated hardcoded 20/2048/50 constants in codebase_search.rs + codebase_graph.rs) with three Option<…> fields on ToolPolicy. ToolPolicy::bench() still resolves to the same effective values, so existing behavior is byte-identical.
  2. feat(bench-runner): tunable LLM + tool policy knobs via CLI / env — the 10 flags + their wire-up in run(). Adds clap's env feature (pure activation of code already in clap; no new transitive deps).
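
To make the first commit's shape concrete, here is a reduced sketch of the three `Option` fields and how a `None` cap behaves. The field names, types, and bench values come from the commit message; everything else is an assumption. The real `ToolPolicy` carries more fields, and `cap_chunk` is a hypothetical helper, not the code in codebase_search.rs.

```rust
/// Reduced stand-in for ToolPolicy: only the three fields this PR adds.
/// `Default` gives all `None`, i.e. no caps (assumed to mirror the permissive default).
#[derive(Debug, Clone, PartialEq, Default)]
struct ToolPolicy {
    codebase_search_limit: Option<u32>,         // None = no default-limit injection
    codebase_search_chunk_bytes: Option<usize>, // None = no content cap
    codebase_graph_section_cap: Option<usize>,  // None = no section cap
}

impl ToolPolicy {
    /// Frozen bench values quoted in the PR: 20 / 2048 / 50.
    fn bench() -> Self {
        Self {
            codebase_search_limit: Some(20),
            codebase_search_chunk_bytes: Some(2048),
            codebase_graph_section_cap: Some(50),
        }
    }
}

/// Hypothetical helper: cut `content` to at most `cap` bytes without
/// splitting a UTF-8 character. `None` returns the content untouched.
fn cap_chunk(content: &str, cap: Option<usize>) -> &str {
    match cap {
        Some(max) if content.len() > max => {
            let mut end = max;
            // Index 0 is always a char boundary, so this loop terminates.
            while !content.is_char_boundary(end) {
                end -= 1;
            }
            &content[..end]
        }
        _ => content,
    }
}
```

Modelling each cap as an `Option` rather than a shared bool lets the bench-runner override one value from the CLI while leaving the other two at their frozen defaults.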

Testing

  • 577 tests pass workspace-wide (445 agent + 45 context-sync + 57 git-ops + 27 desktop + 3 new bench-runner CLI tests):
    • cli_defaults_match_frozen_policy — locks every default to v0.1.6 values so the eval team's prior runs stay reproducible.
    • cli_flag_overrides — verifies all 10 flags override correctly.
    • header_timeout_zero_means_disabled — locks the 0 escape hatch.
  • cargo check --workspace --all-targets clean.
  • Cross-built amd64 static musl and verified in a clean alpine:3.20 container: --help shows all 10 new flags, and env-var overrides are accepted.
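
As an illustration of what header_timeout_zero_means_disabled locks in, the sketch below maps the millisecond knob onto an optional duration, with 0 meaning "no header timeout". The function name `header_timeout` and the `Option<Duration>` representation are assumptions; the PR states only that 0 disables the timeout.

```rust
use std::time::Duration;

/// Hypothetical mapping from the CLI/env value to a timeout:
/// 0 disables the header timeout, any other value is taken as milliseconds.
fn header_timeout(ms: u64) -> Option<Duration> {
    if ms == 0 {
        None
    } else {
        Some(Duration::from_millis(ms))
    }
}
```

Representing "disabled" as `None` rather than as a zero-length `Duration` avoids the failure mode where a 0 ms timeout fires immediately instead of never.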

Invariants

  • Zero new transitive deps (Cargo.lock unchanged) → static-musl R1 invariant intact; CI bench-runner-musl is the gate.
  • No schema/prompt changes → treatment unchanged vs v0.1.6.
  • Desktop app agent unchanged → same SHARED_HTTP_CLIENT, same ToolPolicy::default() (permissive).

adithyn7 added 2 commits June 13, 2026 13:52
…ields

Replace ToolPolicy::codebase_result_caps (single bool, hardcoded 20/2048/50
constants in codebase_search.rs + codebase_graph.rs) with three Option fields:

  codebase_search_limit:       Option<u32>     (None = no default-limit injection)
  codebase_search_chunk_bytes: Option<usize>   (None = no content cap)
  codebase_graph_section_cap:  Option<usize>   (None = no section cap)

ToolPolicy::bench() still resolves to the same effective values (Some(20)
/ Some(2048) / Some(50)) so existing behavior is byte-identical. The
refactor enables bench-runner to tune these from CLI flags / env vars in
the next commit without rebuilding the binary.

Add 10 CLI flags (each with SUPERCODER_<NAME> env fallback) for tuning the
bench binary without rebuilding. Defaults match the frozen v0.1.6 LlmPolicy::
bench() and ToolPolicy::bench() values exactly — zero behavior change unless
overridden. Required by the grid team after kimi-k2p6 hit the hardcoded 15s
header timeout on legitimately slow upstream cold starts (20-60s TTFB).

LLM transport / retry:
  --llm-header-timeout-ms   (default 15000, 0 disables)
  --llm-max-retries         (default 3)
  --llm-retry-initial-ms    (default 1000)
  --llm-retry-multiplier    (default 2.0)
  --llm-retry-max-ms        (default 30000)

Tool policy:
  --bash-timeout-ms         (default 300000)
  --search-timeout-ms       (default 60000)
  --codebase-search-limit   (default 20)
  --codebase-search-chunk-bytes (default 2048)
  --codebase-graph-section-cap  (default 50)

http1_only and no_pool stay hardcoded — they ARE the v0.1.6 transport fix
and toggling them risks bringing back the decode disease.

For the kimi-k2p6 cells:
  SUPERCODER_LLM_HEADER_TIMEOUT_MS=90000 \
  SUPERCODER_LLM_RETRY_MULTIPLIER=3      \
  SUPERCODER_LLM_RETRY_MAX_MS=60000

clap 'env' feature added — pure activation of code already in clap; no new
transitive deps (Cargo.lock unchanged). R1 invariant intact.
@adithyn7 adithyn7 added the patch Patch version bump label Jun 13, 2026
@adithyn7 adithyn7 merged commit eb9a1d9 into main Jun 13, 2026
4 checks passed
@adithyn7 adithyn7 deleted the feat/bench-runner-tunable-policy branch June 13, 2026 08:28