feat(thinking): replace preserve_* bools with thinking_retention, respected by the bridge by hallerite · Pull Request #88 · PrimeIntellect-ai/renderers

hallerite · 2026-06-18T04:49:16Z

What

Two changes that go hand in hand: a single, clearer thinking-retention knob, and making the bridge actually respect it during RL.

1. `thinking_retention` replaces the `preserve_*` bools

Replaces the two confusing boolean flags on _BaseRendererConfig:

preserve_all_thinking: bool = False
preserve_thinking_between_tool_calls: bool = False

with a single ordinal field whose default names the floor:

thinking_retention: Literal["template", "tool_cycle", "all"] = "template"

"template" (default) — defer to the chat template (the floor; never drops what the template keeps)
"tool_cycle" — additionally keep thinking inside the in-flight tool cycle (the A-T-…-A block after the last user turn, when it contains a tool response)
"all" — additionally keep thinking on every past assistant turn

preserve_all_thinking=False never meant "drop" — it meant "defer to the template." A plain bool reads as a symmetric on/off switch, so False looked like "drop." The ordinal gives a named baseline ("template"), encodes the "all" ⊇ "tool_cycle" ⊇ "template" nesting, and removes the nonsensical both=True combo.

2. The bridge now respects `thinking_retention` (faithfulness fix)

render() is byte-faithful to each model's chat template, but bridge_to_next_turn carried prior tokens verbatim — so a multi-turn RL rollout kept <think> blocks the template strips once a new user turn arrives, feeding the model context its own template would never produce (unfaithful to native serving, OOD).

The bridge now consults thinking_retention:

Situation	Bridge behavior
new user query + prior thinking + `template`/`tool_cycle`	declines → caller re-renders, byte-faithful to `apply_chat_template` (stale thinking dropped)
`thinking_retention="all"`	extends verbatim (template keeps it too)
in-flight tool cycle (no new user query)	extends verbatim (template keeps in-cycle thinking)
no prior thinking	extends verbatim (nothing to strip)

This makes thinking_retention the single knob governing thinking end-to-end (render and the bridge). The extension property and "drop old thinking across a user turn" are mutually exclusive, so faithfulness requires a re-render at that boundary — exactly the path the caller already falls back to.

Wired across every thinking renderer via one shared reject_thinking_strip_in_extension helper (sibling to reject_assistant_in_extension), so each bridge is a one-liner. The helper detects a sampled thinking block by a token-subsequence marker:

qwen3, qwen35, qwen36, glm45, glm5, nemotron3 — single </think> token.
kimi_k25 — multi-token </think> close.
gpt_oss — harmony analysis-channel header <|channel|>analysis<|message|>.

(minimax_m2 has no bridge — it always re-renders, already faithful. deepseek_r1 strips history unconditionally with no override — edge case, not wired.)

3. Reconcile the per-renderer thinking-kwargs with `thinking_retention`

GLM's clear_thinking and Nemotron's truncate_history_thinking are byte-equivalent to thinking_retention (=False ≡ "all"). Rather than silently resolving a contradiction, a model_validator on GLM5/GLM51/Nemotron3/Nemotron3Ultra raises when both are explicitly set to conflicting values (e.g. clear_thinking=False + thinking_retention="template"). It only fires when both are explicit, so the per-kwarg parity matrix (one field at a time) is unaffected. When the keep-history kwarg is set alone (a legit "keep all" intent), the glm5/nemotron3 bridges fold it into the effective retention (=False → "all") so the guard doesn't over-decline and re-tokenize sampled thinking.

Changes

should_preserve_past_thinking branches on the level (all → True, template → False, else tool-cycle logic).
All 8 renderer call sites, the AutoRendererConfig carry-over + guard, and DefaultRenderer's guard updated for thinking_retention.
New shared introduces_user_query(), _contains_subsequence(), reject_thinking_strip_in_extension() in base.py; wired into all eight thinking-renderer bridges.
_reject_thinking_retention_conflict() + model_validators on the four configs that carry a thinking-kwarg; DefaultRendererConfig (which is extra="allow") also rejects the removed preserve_* keys instead of silently forwarding them to apply_chat_template.
Docstrings/comments across the no-op renderers, plus README and the renderer-config doc.

Breaking change

The old boolean keys are no longer accepted (extra="forbid" rejects them at config-load). Downstream TOML/configs (prime-rl, verifiers) must migrate:

Old	New
(unset)	`thinking_retention="template"`
`preserve_thinking_between_tool_calls=True`	`thinking_retention="tool_cycle"`
`preserve_all_thinking=True`	`thinking_retention="all"`

The README + renderer-config doc are reframed to match: faithfully reproducing the chat template is the goal (the bridge preserves sampled tokens as a means to that, not a license to diverge); thinking-stripping across a user turn is intended template behaviour the renderer reproduces, not an "avoided failure mode"; thinking_retention is the explicit override.

Follow-ups (separate PRs)

Downstream preserve_* → thinking_retention migration in prime-rl / verifiers.

Test plan

tests/test_bridge.py — reject_thinking_strip_in_extension wired per renderer; test_bridge_declines_across_user_query_when_template_drops_thinking (qwen3 deep, diffs bridge decision + fallback render vs apply_chat_template) and a parametrized test_bridge_declines_across_user_turn_when_thinking_present across all nine guarded renderers; mechanic tests isolated via a thinking_retention="all" fixture
tests/test_renderer_config.py — new test_thinking_retention_conflict_raises, test_default_renderer_config_rejects_legacy_preserve_flags; tests/test_bridge.py — test_bridge_keeps_thinking_when_history_kwarg_disables_truncation
tests/test_renderer_config_parity.py (367 passed — validator does not break the per-kwarg matrix), test_preserve_thinking.py, test_incremental.py, test_client.py
ruff check / ruff format --check clean (isolated defaults, line-length 88)

🤖 Generated with Claude Code

Note

Medium Risk
Breaking config API (preserve_* removed) and changed bridge behavior on multi-turn RL when past thinking meets a new user turn under default retention—more full re-renders but more faithful to apply_chat_template.

Overview
Replaces preserve_all_thinking / preserve_thinking_between_tool_calls with a single nested thinking_retention level ("template" → "tool_cycle" → "all") on all renderer configs, with should_preserve_past_thinking and create_renderer auto-carry updated accordingly. DefaultRenderer only allows "template"; legacy preserve_* keys are rejected (including on DefaultRendererConfig so they are not forwarded to Jinja).

bridge_to_next_turn now uses shared helpers (introduces_user_query, reject_thinking_strip_in_extension) so thinking renderers return None when a new real user message would cross a boundary where the chat template drops prior thinking but the bridge would keep those tokens verbatim—unless thinking_retention="all", thinking is off, or there is no thinking marker. GLM/Nemotron bridges fold clear_thinking=False / truncate_history_thinking=False into effective "all" retention; conflicting explicit template kwargs vs thinking_retention raise at config load.

Docs and tests are updated for the migration and for bridge decline/extend behavior across guarded models.

^{Reviewed by Cursor Bugbot for commit b140e28. Bugbot is set up for automated code reviews on this repo. Configure here.}

Note

Replace `preserve_all_thinking`/`preserve_thinking_between_tool_calls` bools with a single `thinking_retention` level in renderer configs and bridge logic

Introduces ThinkingRetention as a Literal["template", "tool_cycle", "all"] type in renderers/configs.py, replacing the two boolean preserve_* flags on BaseRendererConfig.
Updates should_preserve_past_thinking in renderers/base.py to accept thinking_retention and implement three distinct behaviors: "template" never overrides, "tool_cycle" preserves only within the current tool cycle, and "all" always preserves.
Adds reject_thinking_strip_in_extension helper that causes bridge_to_next_turn to return None (forcing a full re-render) when a new user turn arrives and prior thinking markers are present, unless thinking_retention="all" or thinking is disabled.
All thinking renderers (Qwen3, Qwen3.5, Nemotron3, GLM-4.5, GLM-5, KimiK2.5, GPT-OSS) updated to use thinking_retention in both render and bridge paths.
Config validators now raise ValueError on conflicting settings (e.g., clear_thinking=False with thinking_retention!="all").
Risk: bridges that previously extended across user turns may now return None and trigger a full re-render when past thinking is present and thinking_retention is not "all".

^{Macroscope summarized b140e28.}

cursor · 2026-06-18T04:53:31Z

                "selectively re-emit dropped reasoning_content. Configure a "
-                "model-specific renderer if you need preserve_*_thinking."
+                "model-specific renderer if you need thinking_retention != "
+                "'template'."


DefaultRenderer legacy flags slip through

Medium Severity

The PR treats legacy preserve_* booleans as rejected at load time, but DefaultRendererConfig uses extra="allow", so those keys become model_extra instead of typed fields. DefaultRenderer only checks thinking_retention, so construction no longer raises NotImplementedError and extras may still reach apply_chat_template.

Additional Locations (1)

renderers/configs.py#L108-L117

^{Reviewed by Cursor Bugbot for commit ce78cae. Configure here.}

macroscopeapp · 2026-06-18T04:54:00Z

Approvability

Verdict: Needs human review

While presented as a config refactor (two bools → one enum), this PR introduces new bridge behavior that declines to span user-query boundaries when thinking would be stripped. This runtime behavior change affects multiple renderers and warrants human review beyond the mechanical rename aspects.

No code changes detected at b140e28. Prior analysis still applies.

^{You can customize Macroscope's approvability policy. Learn more.}

…ention The two booleans `preserve_all_thinking` and `preserve_thinking_between_tool_calls` were confusing: `False` never meant "drop", it meant "defer to the chat template", and the second flag was a strict subset of the first — so four bool combos collapsed to three real states with one nonsensical combination. Replace both with a single ordinal field whose default names the floor: thinking_retention: Literal["template", "tool_cycle", "all"] = "template" - "template" (default) — defer entirely to the chat template (the floor; never drops) - "tool_cycle" — also keep thinking in the in-flight tool cycle - "all" — also keep all past thinking This makes the baseline explicit (no value reads as "drop"), encodes the "all" ⊇ "tool_cycle" ⊇ "template" nesting, and removes the redundant combo. `should_preserve_past_thinking` now branches on the level; all 8 renderer call sites, the auto carry-over/guard, and DefaultRenderer's guard are updated. Breaking change: the old boolean keys are no longer accepted — `extra= "forbid"` rejects them at config-load. Downstream TOML/configs (prime-rl, verifiers) must migrate to `thinking_retention`. Also fixes a stale Llama3RendererConfig docstring that claimed the flags raise; the renderer treats them as no-ops (matched by test_llama_3.py). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…daries Qwen3's template drops a past assistant block's thinking once a new user turn arrives, but bridge_to_next_turn carried prior tokens verbatim — so a multi-turn RL rollout kept <think> the template strips, feeding the model context its own template would never produce (unfaithful to the chat template / native serving distribution). The bridge now consults thinking_retention: when the prior turns carry sampled thinking and new_messages opens a real user query (not a tool response), it declines under "template"/"tool_cycle" so the caller re-renders — which honors thinking_retention and drops the stale thinking, byte-faithful to apply_chat_template. thinking_retention="all" keeps thinking on every path, and in-flight tool cycles (no new user query) still bridge verbatim, matching the template there too. Adds shared introduces_user_query helper (sibling to reject_assistant_in_extension) and a regression test diffing the bridge decision + fallback render against the chat template. Qwen3 only for now; glm45/glm5/qwen35/qwen36/nemotron3/minimax to follow. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…5/nemotron3 bridges Extract the qwen3 bridge faithfulness check into a shared reject_thinking_strip_in_extension (base.py) and wire it into every token-based thinking renderer: a bridge refuses to span a new user-query boundary while carrying a <think> block the template strips, so the caller re-renders faithfully. thinking_retention="all" keeps thinking on every path (no decline); in-flight tool cycles still bridge verbatim. - qwen36 inherits qwen35's bridge. - glm5/nemotron3: the guard is correct regardless of the clear_thinking / truncate_history_thinking kwargs (it only over-declines, never keeps unfaithfully, in the clear_thinking=False + retention!="all" combo). - Parametrized regression test across all guarded renderers. Deferred (need different detection): kimi_k25 (multi-token </think>), gpt_oss (harmony analysis channel); minimax has no bridge (always re-renders). Reconciliation of clear_thinking/truncate_history_thinking vs thinking_retention (raise-on-conflict) to follow. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…aise on thinking-kwarg conflict Generalize reject_thinking_strip_in_extension to a token-subsequence marker (_contains_subsequence) so it covers every thinking renderer: - kimi_k25: multi-token </think> (passes self._think_close_ids). - gpt_oss: harmony analysis-channel header <|channel|>analysis<|message|>. - qwen3/qwen35/glm45/glm5/nemotron3 now pass [self._think_end]. Reconcile the per-renderer template thinking-kwargs with thinking_retention: they're byte-equivalent (clear_thinking=False / truncate_history_thinking=False mean keep-all, i.e. thinking_retention="all"), so a model_validator on GLM5/GLM51/Nemotron3/Nemotron3Ultra raises when both are explicitly set to conflicting values. It only fires when BOTH are explicit, so the per-kwarg parity matrix (which sets one field at a time) keeps working. Tests: breadth bridge-decline test now covers Kimi-K2.5 + gpt-oss-20b; new config validator test. minimax has no bridge (already faithful); deepseek_r1 strips unconditionally (no override, edge case). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…e keep-history toggles Bug 1 (DefaultRenderer legacy flags slip through): DefaultRendererConfig is extra="allow", so the removed preserve_*_thinking bools landed in model_extra and were forwarded to apply_chat_template silently (the guard only checked thinking_retention). Add a model_validator that rejects them with a migration message, matching every other config's extra="forbid" rejection. Bug 2 (bridge guard ignores keep-history toggles): the guard declined a user-query boundary whenever thinking_retention != "all", ignoring GLM clear_thinking=False / Nemotron truncate_history_thinking=False — which keep all past thinking. Set alone (retention defaulted), that over-declined into a full re-render that re-tokenizes model-sampled thinking bytes. Fold the toggle into the effective retention in the glm5 / nemotron3 bridges so they keep verbatim when the template keeps all thinking. Tests: test_default_renderer_config_rejects_legacy_preserve_flags; test_bridge_keeps_thinking_when_history_kwarg_disables_truncation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…retention is the override Renderers exist to faithfully reproduce each model's chat template (via per-renderer assumptions). The bridge preserves sampled tokens verbatim as a means to that (avoid bool/BPE/XML corruption of model output), not a license to diverge from how the template renders history. - Drop "thinking stripped from non-latest assistants" from the bridge's avoided-failure-modes list — that's intended template behaviour the renderer now reproduces (bridge declines a user-query boundary; caller re-renders faithfully), not token-identity corruption like the other entries. - thinking_retention reframed as the explicit override, honoured end-to-end (render() + bridge); an explicit keep-history kwarg contradicting thinking_retention raises rather than silently OR-resolving. - Roadmap: auto-stripping thinking is intended behaviour, not a use_patched target. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Brings in PR #90 (ungated Llama-3.2 tokenizer mirrors + canonical Llama classified preserve-thinking no-op). Conflict in tests/test_preserve_thinking.py resolved: keep the renamed thinking_retention wording, add the canonical meta-llama no-op entries from main.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit b140e28. Configure here.}

cursor · 2026-06-20T05:12:38Z

            stop_ids={self._endoftext, self._user, self._observation},
            think_id=self._think,
-            think_end_id=self._think_end,
+            thinking_marker_ids=[self._think_end],


Parse calls use wrong kwarg

High Severity

parse_glm and parse_qwen35 still require think_end_id, but GLM and Qwen3.5-family parse_response paths now pass thinking_marker_ids instead. That raises a TypeError on the first parse call for those renderers, so completion parsing breaks entirely.

Additional Locations (2)

renderers/glm45.py#L295-L296

renderers/qwen35.py#L596-L597

^{Reviewed by Cursor Bugbot for commit b140e28. Configure here.}

hallerite force-pushed the refactor/thinking-retention branch from ce78cae to 5d18ccf Compare June 18, 2026 04:53

cursor Bot reviewed Jun 18, 2026

View reviewed changes

hallerite force-pushed the refactor/thinking-retention branch from 5d18ccf to d163ef3 Compare June 18, 2026 05:18

hallerite mentioned this pull request Jun 18, 2026

refactor(renderer): migrate to renderers thinking_retention PrimeIntellect-ai/verifiers#1743

Open

3 tasks

hallerite changed the title ~~refactor(config): replace preserve_*_thinking bools with thinking_retention~~ feat(thinking): replace preserve_* bools with thinking_retention, respected by the bridge Jun 19, 2026

hallerite force-pushed the refactor/thinking-retention branch from 44117ee to efece49 Compare June 19, 2026 20:56

cursor Bot reviewed Jun 19, 2026

View reviewed changes

Comment thread renderers/base.py Outdated

hallerite and others added 4 commits June 19, 2026 21:48

cursor Bot reviewed Jun 20, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(thinking): replace preserve_* bools with thinking_retention, respected by the bridge#88

feat(thinking): replace preserve_* bools with thinking_retention, respected by the bridge#88
hallerite wants to merge 7 commits into
mainfrom
refactor/thinking-retention

hallerite commented Jun 18, 2026 •

edited by cursor Bot

Loading

Uh oh!

cursor Bot Jun 18, 2026

Uh oh!

macroscopeapp Bot commented Jun 18, 2026 •

edited

Loading

Uh oh!

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

hallerite commented Jun 18, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

1. thinking_retention replaces the preserve_* bools

2. The bridge now respects thinking_retention (faithfulness fix)

3. Reconcile the per-renderer thinking-kwargs with thinking_retention

Changes

Breaking change

Follow-ups (separate PRs)

Test plan

Replace preserve_all_thinking/preserve_thinking_between_tool_calls bools with a single thinking_retention level in renderer configs and bridge logic

Uh oh!

cursor Bot Jun 18, 2026

Choose a reason for hiding this comment

DefaultRenderer legacy flags slip through

Uh oh!

macroscopeapp Bot commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Approvability

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot Jun 20, 2026

Choose a reason for hiding this comment

Parse calls use wrong kwarg

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

hallerite commented Jun 18, 2026 •

edited by cursor Bot

Loading

1. `thinking_retention` replaces the `preserve_*` bools

2. The bridge now respects `thinking_retention` (faithfulness fix)

3. Reconcile the per-renderer thinking-kwargs with `thinking_retention`

Replace `preserve_all_thinking`/`preserve_thinking_between_tool_calls` bools with a single `thinking_retention` level in renderer configs and bridge logic

macroscopeapp Bot commented Jun 18, 2026 •

edited

Loading