feat(thinking): replace preserve_* bools with thinking_retention, respected by the bridge#88
feat(thinking): replace preserve_* bools with thinking_retention, respected by the bridge#88hallerite wants to merge 7 commits into
Conversation
ce78cae to
5d18ccf
Compare
| "selectively re-emit dropped reasoning_content. Configure a " | ||
| "model-specific renderer if you need preserve_*_thinking." | ||
| "model-specific renderer if you need thinking_retention != " | ||
| "'template'." |
There was a problem hiding this comment.
DefaultRenderer legacy flags slip through
Medium Severity
The PR treats legacy preserve_* booleans as rejected at load time, but DefaultRendererConfig uses extra="allow", so those keys become model_extra instead of typed fields. DefaultRenderer only checks thinking_retention, so construction no longer raises NotImplementedError and extras may still reach apply_chat_template.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit ce78cae. Configure here.
ApprovabilityVerdict: Needs human review While presented as a config refactor (two bools → one enum), this PR introduces new bridge behavior that declines to span user-query boundaries when thinking would be stripped. This runtime behavior change affects multiple renderers and warrants human review beyond the mechanical rename aspects. No code changes detected at You can customize Macroscope's approvability policy. Learn more. |
…ention
The two booleans `preserve_all_thinking` and
`preserve_thinking_between_tool_calls` were confusing: `False` never
meant "drop", it meant "defer to the chat template", and the second flag
was a strict subset of the first — so four bool combos collapsed to three
real states with one nonsensical combination.
Replace both with a single ordinal field whose default names the floor:
thinking_retention: Literal["template", "tool_cycle", "all"] = "template"
- "template" (default) — defer entirely to the chat template (the floor;
never drops)
- "tool_cycle" — also keep thinking in the in-flight tool cycle
- "all" — also keep all past thinking
This makes the baseline explicit (no value reads as "drop"), encodes the
"all" ⊇ "tool_cycle" ⊇ "template" nesting, and removes the redundant
combo. `should_preserve_past_thinking` now branches on the level; all 8
renderer call sites, the auto carry-over/guard, and DefaultRenderer's
guard are updated.
Breaking change: the old boolean keys are no longer accepted — `extra=
"forbid"` rejects them at config-load. Downstream TOML/configs (prime-rl,
verifiers) must migrate to `thinking_retention`.
Also fixes a stale Llama3RendererConfig docstring that claimed the flags
raise; the renderer treats them as no-ops (matched by test_llama_3.py).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
5d18ccf to
d163ef3
Compare
…daries Qwen3's template drops a past assistant block's thinking once a new user turn arrives, but bridge_to_next_turn carried prior tokens verbatim — so a multi-turn RL rollout kept <think> the template strips, feeding the model context its own template would never produce (unfaithful to the chat template / native serving distribution). The bridge now consults thinking_retention: when the prior turns carry sampled thinking and new_messages opens a real user query (not a tool response), it declines under "template"/"tool_cycle" so the caller re-renders — which honors thinking_retention and drops the stale thinking, byte-faithful to apply_chat_template. thinking_retention="all" keeps thinking on every path, and in-flight tool cycles (no new user query) still bridge verbatim, matching the template there too. Adds shared introduces_user_query helper (sibling to reject_assistant_in_extension) and a regression test diffing the bridge decision + fallback render against the chat template. Qwen3 only for now; glm45/glm5/qwen35/qwen36/nemotron3/minimax to follow. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
44117ee to
efece49
Compare
…5/nemotron3 bridges Extract the qwen3 bridge faithfulness check into a shared reject_thinking_strip_in_extension (base.py) and wire it into every token-based thinking renderer: a bridge refuses to span a new user-query boundary while carrying a <think> block the template strips, so the caller re-renders faithfully. thinking_retention="all" keeps thinking on every path (no decline); in-flight tool cycles still bridge verbatim. - qwen36 inherits qwen35's bridge. - glm5/nemotron3: the guard is correct regardless of the clear_thinking / truncate_history_thinking kwargs (it only over-declines, never keeps unfaithfully, in the clear_thinking=False + retention!="all" combo). - Parametrized regression test across all guarded renderers. Deferred (need different detection): kimi_k25 (multi-token </think>), gpt_oss (harmony analysis channel); minimax has no bridge (always re-renders). Reconciliation of clear_thinking/truncate_history_thinking vs thinking_retention (raise-on-conflict) to follow. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…aise on thinking-kwarg conflict Generalize reject_thinking_strip_in_extension to a token-subsequence marker (_contains_subsequence) so it covers every thinking renderer: - kimi_k25: multi-token </think> (passes self._think_close_ids). - gpt_oss: harmony analysis-channel header <|channel|>analysis<|message|>. - qwen3/qwen35/glm45/glm5/nemotron3 now pass [self._think_end]. Reconcile the per-renderer template thinking-kwargs with thinking_retention: they're byte-equivalent (clear_thinking=False / truncate_history_thinking=False mean keep-all, i.e. thinking_retention="all"), so a model_validator on GLM5/GLM51/Nemotron3/Nemotron3Ultra raises when both are explicitly set to conflicting values. It only fires when BOTH are explicit, so the per-kwarg parity matrix (which sets one field at a time) keeps working. Tests: breadth bridge-decline test now covers Kimi-K2.5 + gpt-oss-20b; new config validator test. minimax has no bridge (already faithful); deepseek_r1 strips unconditionally (no override, edge case). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e keep-history toggles Bug 1 (DefaultRenderer legacy flags slip through): DefaultRendererConfig is extra="allow", so the removed preserve_*_thinking bools landed in model_extra and were forwarded to apply_chat_template silently (the guard only checked thinking_retention). Add a model_validator that rejects them with a migration message, matching every other config's extra="forbid" rejection. Bug 2 (bridge guard ignores keep-history toggles): the guard declined a user-query boundary whenever thinking_retention != "all", ignoring GLM clear_thinking=False / Nemotron truncate_history_thinking=False — which keep all past thinking. Set alone (retention defaulted), that over-declined into a full re-render that re-tokenizes model-sampled thinking bytes. Fold the toggle into the effective retention in the glm5 / nemotron3 bridges so they keep verbatim when the template keeps all thinking. Tests: test_default_renderer_config_rejects_legacy_preserve_flags; test_bridge_keeps_thinking_when_history_kwarg_disables_truncation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…retention is the override Renderers exist to faithfully reproduce each model's chat template (via per-renderer assumptions). The bridge preserves sampled tokens verbatim as a means to that (avoid bool/BPE/XML corruption of model output), not a license to diverge from how the template renders history. - Drop "thinking stripped from non-latest assistants" from the bridge's avoided-failure-modes list — that's intended template behaviour the renderer now reproduces (bridge declines a user-query boundary; caller re-renders faithfully), not token-identity corruption like the other entries. - thinking_retention reframed as the explicit override, honoured end-to-end (render() + bridge); an explicit keep-history kwarg contradicting thinking_retention raises rather than silently OR-resolving. - Roadmap: auto-stripping thinking is intended behaviour, not a use_patched target. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Brings in PR #90 (ungated Llama-3.2 tokenizer mirrors + canonical Llama classified preserve-thinking no-op). Conflict in tests/test_preserve_thinking.py resolved: keep the renamed thinking_retention wording, add the canonical meta-llama no-op entries from main.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit b140e28. Configure here.
| stop_ids={self._endoftext, self._user, self._observation}, | ||
| think_id=self._think, | ||
| think_end_id=self._think_end, | ||
| thinking_marker_ids=[self._think_end], |
There was a problem hiding this comment.
Parse calls use wrong kwarg
High Severity
parse_glm and parse_qwen35 still require think_end_id, but GLM and Qwen3.5-family parse_response paths now pass thinking_marker_ids instead. That raises a TypeError on the first parse call for those renderers, so completion parsing breaks entirely.
Additional Locations (2)
Reviewed by Cursor Bugbot for commit b140e28. Configure here.


What
Two changes that go hand in hand: a single, clearer thinking-retention knob, and making the bridge actually respect it during RL.
1.
thinking_retentionreplaces thepreserve_*boolsReplaces the two confusing boolean flags on
_BaseRendererConfig:with a single ordinal field whose default names the floor:
"template"(default) — defer to the chat template (the floor; never drops what the template keeps)"tool_cycle"— additionally keep thinking inside the in-flight tool cycle (the A-T-…-A block after the lastuserturn, when it contains atoolresponse)"all"— additionally keep thinking on every past assistant turnpreserve_all_thinking=Falsenever meant "drop" — it meant "defer to the template." A plainboolreads as a symmetric on/off switch, soFalselooked like "drop." The ordinal gives a named baseline ("template"), encodes the"all"⊇"tool_cycle"⊇"template"nesting, and removes the nonsensicalboth=Truecombo.2. The bridge now respects
thinking_retention(faithfulness fix)render()is byte-faithful to each model's chat template, butbridge_to_next_turncarried prior tokens verbatim — so a multi-turn RL rollout kept<think>blocks the template strips once a new user turn arrives, feeding the model context its own template would never produce (unfaithful to native serving, OOD).The bridge now consults
thinking_retention:template/tool_cycleapply_chat_template(stale thinking dropped)thinking_retention="all"This makes
thinking_retentionthe single knob governing thinking end-to-end (renderand the bridge). The extension property and "drop old thinking across a user turn" are mutually exclusive, so faithfulness requires a re-render at that boundary — exactly the path the caller already falls back to.Wired across every thinking renderer via one shared
reject_thinking_strip_in_extensionhelper (sibling toreject_assistant_in_extension), so each bridge is a one-liner. The helper detects a sampled thinking block by a token-subsequence marker:</think>token.</think>close.<|channel|>analysis<|message|>.(
minimax_m2has no bridge — it always re-renders, already faithful.deepseek_r1strips history unconditionally with no override — edge case, not wired.)3. Reconcile the per-renderer thinking-kwargs with
thinking_retentionGLM's
clear_thinkingand Nemotron'struncate_history_thinkingare byte-equivalent tothinking_retention(=False≡"all"). Rather than silently resolving a contradiction, amodel_validatoron GLM5/GLM51/Nemotron3/Nemotron3Ultra raises when both are explicitly set to conflicting values (e.g.clear_thinking=False+thinking_retention="template"). It only fires when both are explicit, so the per-kwarg parity matrix (one field at a time) is unaffected. When the keep-history kwarg is set alone (a legit "keep all" intent), the glm5/nemotron3 bridges fold it into the effective retention (=False→"all") so the guard doesn't over-decline and re-tokenize sampled thinking.Changes
should_preserve_past_thinkingbranches on the level (all → True,template → False, else tool-cycle logic).AutoRendererConfigcarry-over + guard, andDefaultRenderer's guard updated forthinking_retention.introduces_user_query(),_contains_subsequence(),reject_thinking_strip_in_extension()inbase.py; wired into all eight thinking-renderer bridges._reject_thinking_retention_conflict()+model_validators on the four configs that carry a thinking-kwarg;DefaultRendererConfig(which isextra="allow") also rejects the removedpreserve_*keys instead of silently forwarding them toapply_chat_template.Breaking change
The old boolean keys are no longer accepted (
extra="forbid"rejects them at config-load). Downstream TOML/configs (prime-rl,verifiers) must migrate:thinking_retention="template"preserve_thinking_between_tool_calls=Truethinking_retention="tool_cycle"preserve_all_thinking=Truethinking_retention="all"The README + renderer-config doc are reframed to match: faithfully reproducing the chat template is the goal (the bridge preserves sampled tokens as a means to that, not a license to diverge); thinking-stripping across a user turn is intended template behaviour the renderer reproduces, not an "avoided failure mode";
thinking_retentionis the explicit override.Follow-ups (separate PRs)
preserve_*→thinking_retentionmigration inprime-rl/verifiers.Test plan
tests/test_bridge.py—reject_thinking_strip_in_extensionwired per renderer;test_bridge_declines_across_user_query_when_template_drops_thinking(qwen3 deep, diffs bridge decision + fallback render vsapply_chat_template) and a parametrizedtest_bridge_declines_across_user_turn_when_thinking_presentacross all nine guarded renderers; mechanic tests isolated via athinking_retention="all"fixturetests/test_renderer_config.py— newtest_thinking_retention_conflict_raises,test_default_renderer_config_rejects_legacy_preserve_flags;tests/test_bridge.py—test_bridge_keeps_thinking_when_history_kwarg_disables_truncationtests/test_renderer_config_parity.py(367 passed — validator does not break the per-kwarg matrix),test_preserve_thinking.py,test_incremental.py,test_client.pyruff check/ruff format --checkclean (isolated defaults, line-length 88)🤖 Generated with Claude Code
Note
Medium Risk
Breaking config API (
preserve_*removed) and changed bridge behavior on multi-turn RL when past thinking meets a new user turn under default retention—more full re-renders but more faithful toapply_chat_template.Overview
Replaces
preserve_all_thinking/preserve_thinking_between_tool_callswith a single nestedthinking_retentionlevel ("template"→"tool_cycle"→"all") on all renderer configs, withshould_preserve_past_thinkingandcreate_rendererauto-carry updated accordingly.DefaultRendereronly allows"template"; legacypreserve_*keys are rejected (including onDefaultRendererConfigso they are not forwarded to Jinja).bridge_to_next_turnnow uses shared helpers (introduces_user_query,reject_thinking_strip_in_extension) so thinking renderers returnNonewhen a new real user message would cross a boundary where the chat template drops prior thinking but the bridge would keep those tokens verbatim—unlessthinking_retention="all", thinking is off, or there is no thinking marker. GLM/Nemotron bridges foldclear_thinking=False/truncate_history_thinking=Falseinto effective"all"retention; conflicting explicit template kwargs vsthinking_retentionraise at config load.Docs and tests are updated for the migration and for bridge decline/extend behavior across guarded models.
Reviewed by Cursor Bugbot for commit b140e28. Bugbot is set up for automated code reviews on this repo. Configure here.
Note
Replace
preserve_all_thinking/preserve_thinking_between_tool_callsbools with a singlethinking_retentionlevel in renderer configs and bridge logicThinkingRetentionas aLiteral["template", "tool_cycle", "all"]type inrenderers/configs.py, replacing the two booleanpreserve_*flags onBaseRendererConfig.should_preserve_past_thinkinginrenderers/base.pyto acceptthinking_retentionand implement three distinct behaviors:"template"never overrides,"tool_cycle"preserves only within the current tool cycle, and"all"always preserves.reject_thinking_strip_in_extensionhelper that causesbridge_to_next_turnto returnNone(forcing a full re-render) when a new user turn arrives and prior thinking markers are present, unlessthinking_retention="all"or thinking is disabled.thinking_retentionin both render and bridge paths.ValueErroron conflicting settings (e.g.,clear_thinking=Falsewiththinking_retention!="all").Noneand trigger a full re-render when past thinking is present andthinking_retentionis not"all".Macroscope summarized b140e28.