Skip to content

feat(thinking): replace preserve_* bools with thinking_retention, respected by the bridge#88

Open
hallerite wants to merge 7 commits into
mainfrom
refactor/thinking-retention
Open

feat(thinking): replace preserve_* bools with thinking_retention, respected by the bridge#88
hallerite wants to merge 7 commits into
mainfrom
refactor/thinking-retention

Conversation

@hallerite

@hallerite hallerite commented Jun 18, 2026

Copy link
Copy Markdown
Member

What

Two changes that go hand in hand: a single, clearer thinking-retention knob, and making the bridge actually respect it during RL.

1. thinking_retention replaces the preserve_* bools

Replaces the two confusing boolean flags on _BaseRendererConfig:

preserve_all_thinking: bool = False
preserve_thinking_between_tool_calls: bool = False

with a single ordinal field whose default names the floor:

thinking_retention: Literal["template", "tool_cycle", "all"] = "template"
  • "template" (default) — defer to the chat template (the floor; never drops what the template keeps)
  • "tool_cycle" — additionally keep thinking inside the in-flight tool cycle (the A-T-…-A block after the last user turn, when it contains a tool response)
  • "all" — additionally keep thinking on every past assistant turn

preserve_all_thinking=False never meant "drop" — it meant "defer to the template." A plain bool reads as a symmetric on/off switch, so False looked like "drop." The ordinal gives a named baseline ("template"), encodes the "all""tool_cycle""template" nesting, and removes the nonsensical both=True combo.

2. The bridge now respects thinking_retention (faithfulness fix)

render() is byte-faithful to each model's chat template, but bridge_to_next_turn carried prior tokens verbatim — so a multi-turn RL rollout kept <think> blocks the template strips once a new user turn arrives, feeding the model context its own template would never produce (unfaithful to native serving, OOD).

The bridge now consults thinking_retention:

Situation Bridge behavior
new user query + prior thinking + template/tool_cycle declines → caller re-renders, byte-faithful to apply_chat_template (stale thinking dropped)
thinking_retention="all" extends verbatim (template keeps it too)
in-flight tool cycle (no new user query) extends verbatim (template keeps in-cycle thinking)
no prior thinking extends verbatim (nothing to strip)

This makes thinking_retention the single knob governing thinking end-to-end (render and the bridge). The extension property and "drop old thinking across a user turn" are mutually exclusive, so faithfulness requires a re-render at that boundary — exactly the path the caller already falls back to.

Wired across every thinking renderer via one shared reject_thinking_strip_in_extension helper (sibling to reject_assistant_in_extension), so each bridge is a one-liner. The helper detects a sampled thinking block by a token-subsequence marker:

  • qwen3, qwen35, qwen36, glm45, glm5, nemotron3 — single </think> token.
  • kimi_k25 — multi-token </think> close.
  • gpt_oss — harmony analysis-channel header <|channel|>analysis<|message|>.

(minimax_m2 has no bridge — it always re-renders, already faithful. deepseek_r1 strips history unconditionally with no override — edge case, not wired.)

3. Reconcile the per-renderer thinking-kwargs with thinking_retention

GLM's clear_thinking and Nemotron's truncate_history_thinking are byte-equivalent to thinking_retention (=False"all"). Rather than silently resolving a contradiction, a model_validator on GLM5/GLM51/Nemotron3/Nemotron3Ultra raises when both are explicitly set to conflicting values (e.g. clear_thinking=False + thinking_retention="template"). It only fires when both are explicit, so the per-kwarg parity matrix (one field at a time) is unaffected. When the keep-history kwarg is set alone (a legit "keep all" intent), the glm5/nemotron3 bridges fold it into the effective retention (=False"all") so the guard doesn't over-decline and re-tokenize sampled thinking.

Changes

  • should_preserve_past_thinking branches on the level (all → True, template → False, else tool-cycle logic).
  • All 8 renderer call sites, the AutoRendererConfig carry-over + guard, and DefaultRenderer's guard updated for thinking_retention.
  • New shared introduces_user_query(), _contains_subsequence(), reject_thinking_strip_in_extension() in base.py; wired into all eight thinking-renderer bridges.
  • _reject_thinking_retention_conflict() + model_validators on the four configs that carry a thinking-kwarg; DefaultRendererConfig (which is extra="allow") also rejects the removed preserve_* keys instead of silently forwarding them to apply_chat_template.
  • Docstrings/comments across the no-op renderers, plus README and the renderer-config doc.

Breaking change

The old boolean keys are no longer accepted (extra="forbid" rejects them at config-load). Downstream TOML/configs (prime-rl, verifiers) must migrate:

Old New
(unset) thinking_retention="template"
preserve_thinking_between_tool_calls=True thinking_retention="tool_cycle"
preserve_all_thinking=True thinking_retention="all"

The README + renderer-config doc are reframed to match: faithfully reproducing the chat template is the goal (the bridge preserves sampled tokens as a means to that, not a license to diverge); thinking-stripping across a user turn is intended template behaviour the renderer reproduces, not an "avoided failure mode"; thinking_retention is the explicit override.

Follow-ups (separate PRs)

  • Downstream preserve_*thinking_retention migration in prime-rl / verifiers.

Test plan

  • tests/test_bridge.pyreject_thinking_strip_in_extension wired per renderer; test_bridge_declines_across_user_query_when_template_drops_thinking (qwen3 deep, diffs bridge decision + fallback render vs apply_chat_template) and a parametrized test_bridge_declines_across_user_turn_when_thinking_present across all nine guarded renderers; mechanic tests isolated via a thinking_retention="all" fixture
  • tests/test_renderer_config.py — new test_thinking_retention_conflict_raises, test_default_renderer_config_rejects_legacy_preserve_flags; tests/test_bridge.pytest_bridge_keeps_thinking_when_history_kwarg_disables_truncation
  • tests/test_renderer_config_parity.py (367 passed — validator does not break the per-kwarg matrix), test_preserve_thinking.py, test_incremental.py, test_client.py
  • ruff check / ruff format --check clean (isolated defaults, line-length 88)

🤖 Generated with Claude Code


Note

Medium Risk
Breaking config API (preserve_* removed) and changed bridge behavior on multi-turn RL when past thinking meets a new user turn under default retention—more full re-renders but more faithful to apply_chat_template.

Overview
Replaces preserve_all_thinking / preserve_thinking_between_tool_calls with a single nested thinking_retention level ("template""tool_cycle""all") on all renderer configs, with should_preserve_past_thinking and create_renderer auto-carry updated accordingly. DefaultRenderer only allows "template"; legacy preserve_* keys are rejected (including on DefaultRendererConfig so they are not forwarded to Jinja).

bridge_to_next_turn now uses shared helpers (introduces_user_query, reject_thinking_strip_in_extension) so thinking renderers return None when a new real user message would cross a boundary where the chat template drops prior thinking but the bridge would keep those tokens verbatim—unless thinking_retention="all", thinking is off, or there is no thinking marker. GLM/Nemotron bridges fold clear_thinking=False / truncate_history_thinking=False into effective "all" retention; conflicting explicit template kwargs vs thinking_retention raise at config load.

Docs and tests are updated for the migration and for bridge decline/extend behavior across guarded models.

Reviewed by Cursor Bugbot for commit b140e28. Bugbot is set up for automated code reviews on this repo. Configure here.

Note

Replace preserve_all_thinking/preserve_thinking_between_tool_calls bools with a single thinking_retention level in renderer configs and bridge logic

  • Introduces ThinkingRetention as a Literal["template", "tool_cycle", "all"] type in renderers/configs.py, replacing the two boolean preserve_* flags on BaseRendererConfig.
  • Updates should_preserve_past_thinking in renderers/base.py to accept thinking_retention and implement three distinct behaviors: "template" never overrides, "tool_cycle" preserves only within the current tool cycle, and "all" always preserves.
  • Adds reject_thinking_strip_in_extension helper that causes bridge_to_next_turn to return None (forcing a full re-render) when a new user turn arrives and prior thinking markers are present, unless thinking_retention="all" or thinking is disabled.
  • All thinking renderers (Qwen3, Qwen3.5, Nemotron3, GLM-4.5, GLM-5, KimiK2.5, GPT-OSS) updated to use thinking_retention in both render and bridge paths.
  • Config validators now raise ValueError on conflicting settings (e.g., clear_thinking=False with thinking_retention!="all").
  • Risk: bridges that previously extended across user turns may now return None and trigger a full re-render when past thinking is present and thinking_retention is not "all".

Macroscope summarized b140e28.

@hallerite hallerite force-pushed the refactor/thinking-retention branch from ce78cae to 5d18ccf Compare June 18, 2026 04:53
Comment thread renderers/default.py
"selectively re-emit dropped reasoning_content. Configure a "
"model-specific renderer if you need preserve_*_thinking."
"model-specific renderer if you need thinking_retention != "
"'template'."

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DefaultRenderer legacy flags slip through

Medium Severity

The PR treats legacy preserve_* booleans as rejected at load time, but DefaultRendererConfig uses extra="allow", so those keys become model_extra instead of typed fields. DefaultRenderer only checks thinking_retention, so construction no longer raises NotImplementedError and extras may still reach apply_chat_template.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit ce78cae. Configure here.

@macroscopeapp

macroscopeapp Bot commented Jun 18, 2026

Copy link
Copy Markdown

Approvability

Verdict: Needs human review

While presented as a config refactor (two bools → one enum), this PR introduces new bridge behavior that declines to span user-query boundaries when thinking would be stripped. This runtime behavior change affects multiple renderers and warrants human review beyond the mechanical rename aspects.

No code changes detected at b140e28. Prior analysis still applies.

You can customize Macroscope's approvability policy. Learn more.

…ention

The two booleans `preserve_all_thinking` and
`preserve_thinking_between_tool_calls` were confusing: `False` never
meant "drop", it meant "defer to the chat template", and the second flag
was a strict subset of the first — so four bool combos collapsed to three
real states with one nonsensical combination.

Replace both with a single ordinal field whose default names the floor:

    thinking_retention: Literal["template", "tool_cycle", "all"] = "template"

- "template" (default) — defer entirely to the chat template (the floor;
  never drops)
- "tool_cycle" — also keep thinking in the in-flight tool cycle
- "all" — also keep all past thinking

This makes the baseline explicit (no value reads as "drop"), encodes the
"all" ⊇ "tool_cycle" ⊇ "template" nesting, and removes the redundant
combo. `should_preserve_past_thinking` now branches on the level; all 8
renderer call sites, the auto carry-over/guard, and DefaultRenderer's
guard are updated.

Breaking change: the old boolean keys are no longer accepted — `extra=
"forbid"` rejects them at config-load. Downstream TOML/configs (prime-rl,
verifiers) must migrate to `thinking_retention`.

Also fixes a stale Llama3RendererConfig docstring that claimed the flags
raise; the renderer treats them as no-ops (matched by test_llama_3.py).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@hallerite hallerite force-pushed the refactor/thinking-retention branch from 5d18ccf to d163ef3 Compare June 18, 2026 05:18
@hallerite hallerite changed the title refactor(config): replace preserve_*_thinking bools with thinking_retention feat(thinking): replace preserve_* bools with thinking_retention, respected by the bridge Jun 19, 2026
…daries

Qwen3's template drops a past assistant block's thinking once a new user
turn arrives, but bridge_to_next_turn carried prior tokens verbatim — so a
multi-turn RL rollout kept <think> the template strips, feeding the model
context its own template would never produce (unfaithful to the chat
template / native serving distribution).

The bridge now consults thinking_retention: when the prior turns carry
sampled thinking and new_messages opens a real user query (not a tool
response), it declines under "template"/"tool_cycle" so the caller
re-renders — which honors thinking_retention and drops the stale thinking,
byte-faithful to apply_chat_template. thinking_retention="all" keeps
thinking on every path, and in-flight tool cycles (no new user query) still
bridge verbatim, matching the template there too.

Adds shared introduces_user_query helper (sibling to
reject_assistant_in_extension) and a regression test diffing the bridge
decision + fallback render against the chat template.

Qwen3 only for now; glm45/glm5/qwen35/qwen36/nemotron3/minimax to follow.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@hallerite hallerite force-pushed the refactor/thinking-retention branch from 44117ee to efece49 Compare June 19, 2026 20:56
…5/nemotron3 bridges

Extract the qwen3 bridge faithfulness check into a shared
reject_thinking_strip_in_extension (base.py) and wire it into every
token-based thinking renderer: a bridge refuses to span a new user-query
boundary while carrying a <think> block the template strips, so the caller
re-renders faithfully. thinking_retention="all" keeps thinking on every
path (no decline); in-flight tool cycles still bridge verbatim.

- qwen36 inherits qwen35's bridge.
- glm5/nemotron3: the guard is correct regardless of the clear_thinking /
  truncate_history_thinking kwargs (it only over-declines, never keeps
  unfaithfully, in the clear_thinking=False + retention!="all" combo).
- Parametrized regression test across all guarded renderers.

Deferred (need different detection): kimi_k25 (multi-token </think>),
gpt_oss (harmony analysis channel); minimax has no bridge (always
re-renders). Reconciliation of clear_thinking/truncate_history_thinking vs
thinking_retention (raise-on-conflict) to follow.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread renderers/base.py Outdated
hallerite and others added 4 commits June 19, 2026 21:48
…aise on thinking-kwarg conflict

Generalize reject_thinking_strip_in_extension to a token-subsequence marker
(_contains_subsequence) so it covers every thinking renderer:
- kimi_k25: multi-token </think> (passes self._think_close_ids).
- gpt_oss: harmony analysis-channel header <|channel|>analysis<|message|>.
- qwen3/qwen35/glm45/glm5/nemotron3 now pass [self._think_end].

Reconcile the per-renderer template thinking-kwargs with thinking_retention:
they're byte-equivalent (clear_thinking=False / truncate_history_thinking=False
mean keep-all, i.e. thinking_retention="all"), so a model_validator on
GLM5/GLM51/Nemotron3/Nemotron3Ultra raises when both are explicitly set to
conflicting values. It only fires when BOTH are explicit, so the per-kwarg
parity matrix (which sets one field at a time) keeps working.

Tests: breadth bridge-decline test now covers Kimi-K2.5 + gpt-oss-20b; new
config validator test. minimax has no bridge (already faithful); deepseek_r1
strips unconditionally (no override, edge case).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e keep-history toggles

Bug 1 (DefaultRenderer legacy flags slip through): DefaultRendererConfig is
extra="allow", so the removed preserve_*_thinking bools landed in model_extra
and were forwarded to apply_chat_template silently (the guard only checked
thinking_retention). Add a model_validator that rejects them with a migration
message, matching every other config's extra="forbid" rejection.

Bug 2 (bridge guard ignores keep-history toggles): the guard declined a
user-query boundary whenever thinking_retention != "all", ignoring GLM
clear_thinking=False / Nemotron truncate_history_thinking=False — which keep
all past thinking. Set alone (retention defaulted), that over-declined into a
full re-render that re-tokenizes model-sampled thinking bytes. Fold the toggle
into the effective retention in the glm5 / nemotron3 bridges so they keep
verbatim when the template keeps all thinking.

Tests: test_default_renderer_config_rejects_legacy_preserve_flags;
test_bridge_keeps_thinking_when_history_kwarg_disables_truncation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…retention is the override

Renderers exist to faithfully reproduce each model's chat template (via
per-renderer assumptions). The bridge preserves sampled tokens verbatim as a
means to that (avoid bool/BPE/XML corruption of model output), not a license
to diverge from how the template renders history.

- Drop "thinking stripped from non-latest assistants" from the bridge's
  avoided-failure-modes list — that's intended template behaviour the renderer
  now reproduces (bridge declines a user-query boundary; caller re-renders
  faithfully), not token-identity corruption like the other entries.
- thinking_retention reframed as the explicit override, honoured end-to-end
  (render() + bridge); an explicit keep-history kwarg contradicting
  thinking_retention raises rather than silently OR-resolving.
- Roadmap: auto-stripping thinking is intended behaviour, not a use_patched
  target.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Brings in PR #90 (ungated Llama-3.2 tokenizer mirrors + canonical Llama
classified preserve-thinking no-op). Conflict in tests/test_preserve_thinking.py
resolved: keep the renamed thinking_retention wording, add the canonical
meta-llama no-op entries from main.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit b140e28. Configure here.

Comment thread renderers/glm5.py
stop_ids={self._endoftext, self._user, self._observation},
think_id=self._think,
think_end_id=self._think_end,
thinking_marker_ids=[self._think_end],

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parse calls use wrong kwarg

High Severity

parse_glm and parse_qwen35 still require think_end_id, but GLM and Qwen3.5-family parse_response paths now pass thinking_marker_ids instead. That raises a TypeError on the first parse call for those renderers, so completion parsing breaks entirely.

Additional Locations (2)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit b140e28. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant