diff --git a/README.md b/README.md
index d153163..25184be 100644
--- a/README.md
+++ b/README.md
@@ -95,7 +95,6 @@ For RL the trainer must see the exact token ids the sampler saw. The standard al
- **Boolean round-trip.** Engine emits `false`; client parses to Python `bool(False)`; `apply_chat_template` re-renders via `str(False)` → `"False"`. Capital F. Reproducible on Qwen3.5-35B-A3B + mini-swe-agent-plus at ~50% break rate per rollout.
- **BPE retokenization drift.** The same substring tokenizes differently depending on neighbouring bytes. `json` + `p` + `enderer` (3 tokens) vs `jsonp` + `enderer` (2 tokens) when whitespace shifts by one character. Every subsequent token is shifted from there on.
- **Tool-call XML drift.** The engine emits a no-arg call with a stylistic empty ``; the Jinja re-render of the reconstructed dict drops it. Extension property broken at every such call.
-- **Thinking stripped from non-latest assistants.** Some templates strip `…` blocks from prior assistant turns when re-rendering. The recorded stream has the thinking; the next prompt does not.
- **Max-seq-len truncation zeroing the anchor.** Client-side `max_seq_len` enforcement zeros `completion_ids` when `prompt_len > max_seq_len`. The bridge anchor is empty, falling back to full re-render — triggering every mode above.
- **Scaffold-level history rewriting.** Some agent scaffolds (e.g. opencode's `experimental_repairToolCall`) rewrite tool calls before sending them back as history. The next turn's prompt contains a tool call the model never emitted. *A renderer cannot fix this — the drift happens before rendering.*
@@ -122,9 +121,9 @@ from renderers import (
)
# Auto-resolve renderer from the tokenizer's model name. Carries the
-# shared preserve_* flags; template kwargs require an explicit choice.
+# shared thinking_retention flag; template kwargs require an explicit choice.
renderer = create_renderer(tokenizer)
-renderer = create_renderer(tokenizer, AutoRendererConfig(preserve_all_thinking=True))
+renderer = create_renderer(tokenizer, AutoRendererConfig(thinking_retention="all"))
# Explicit choice — the typed config exposes exactly the fields that
# renderer's chat template honours.
@@ -142,12 +141,13 @@ renderer = create_renderer(
Discriminated union: every per-renderer config is a variant of `RendererConfig`, dispatched on the `name` field. Bogus combinations (e.g. `add_vision_id` under `name="qwen3"`) error at construction with a `pydantic.ValidationError`. Downstream pydantic configs (prime-rl orchestrator, verifiers `ClientConfig`) hold a single field typed as `RendererConfig` and inherit the same strict-per-variant validation.
-Two shared behaviour flags live on every variant via `_BaseRendererConfig`:
+One shared behaviour flag lives on every variant via `_BaseRendererConfig`: `thinking_retention`, an ascending scale (`"template"` < `"tool_cycle"` < `"all"`) whose floor is the chat template's own decision.
-- `preserve_all_thinking=True` — every past assistant's `reasoning_content` is kept, even when the chat template would drop it.
-- `preserve_thinking_between_tool_calls=True` — reasoning is kept on assistants in the in-flight tool cycle (post-last-user A-T-…-A block when it contains a tool response). A new user turn closes the block and drops its thinking.
+- `thinking_retention="template"` (default) — defer entirely to the chat template.
+- `thinking_retention="tool_cycle"` — additionally keep reasoning on assistants in the in-flight tool cycle (post-last-user A-T-…-A block when it contains a tool response). A new user turn closes the block and drops its thinking.
+- `thinking_retention="all"` — additionally keep every past assistant's `reasoning_content`, even when the chat template would drop it.
-These OR-compose with template-level toggles (e.g. GLM-5 `clear_thinking`, Nemotron-3 `truncate_history_thinking`): either flag saying "keep" wins. preserve_* can only ever *extend* retention — never override a template kwarg into a "drop" decision. The canonical use case is **compaction**: injecting a `user` turn like *"summarize the work so far"* puts every prior assistant in a past cycle, and `preserve_all_thinking=True` keeps reasoning visible end-to-end.
+`thinking_retention` is honoured end-to-end — both `render()` and `bridge_to_next_turn` consult it. So a multi-turn rollout reproduces the chat template's history handling **faithfully by default**: when a new user turn arrives, a past block's reasoning is dropped exactly as `apply_chat_template` would, and the bridge declines that boundary (letting the caller re-render) rather than carrying the stale `` forward. The override can only ever *extend* retention above the template floor — it never forces a drop the template would keep. GLM-5 `clear_thinking` / Nemotron-3 `truncate_history_thinking` are byte-equivalent template kwargs (`False` ≡ `"all"`); setting one of them *and* a contradictory `thinking_retention` raises at config-load rather than silently resolving. The canonical override use case is **compaction**: injecting a `user` turn like *"summarize the work so far"* puts every prior assistant in a past cycle, and `thinking_retention="all"` keeps reasoning visible end-to-end.
## `DefaultRenderer`
@@ -156,7 +156,7 @@ Fallback for unsupported models. Wraps `apply_chat_template` and accepts `tool_p
## Roadmap
- **VLM support.** `ContentPart` is text-only today; `Qwen3VLRenderer` ships only because Qwen3-VL's text-only chat template differs from Qwen3's. Plan: add `ImagePart` / `VideoPart`, multimodal bridges, validate against a Qwen3-VL RL run.
-- **Patched chat templates.** Some shipped templates re-tokenize history, normalize JSON, or auto-strip thinking — each breaks the extension property. Plan: a `use_patched` opt-in per renderer that renders the same surface form while avoiding known-bad patterns.
+- **Patched chat templates.** Some shipped templates re-tokenize history or normalize JSON in ways that break token identity. Plan: a `use_patched` opt-in per renderer that renders the same surface form while avoiding known-bad patterns. (Auto-stripping thinking from past turns is *not* one of these — that's intended template behaviour the renderer reproduces; use `thinking_retention` to override it.)
## Testing
diff --git a/docs/renderer-config.md b/docs/renderer-config.md
index 3d5fa2a..93cced3 100644
--- a/docs/renderer-config.md
+++ b/docs/renderer-config.md
@@ -50,10 +50,10 @@ Configs are frozen. To override a field, construct a new instance or call
```python
r = create_renderer(tokenizer) # AutoRendererConfig() is the default
-r = create_renderer(tokenizer, AutoRendererConfig(preserve_all_thinking=True))
+r = create_renderer(tokenizer, AutoRendererConfig(thinking_retention="all"))
```
-`AutoRendererConfig` carries only the shared `preserve_*` flags. Template
+`AutoRendererConfig` carries only the shared `thinking_retention` flag. Template
kwargs depend on the renderer, so overriding them requires naming the
renderer explicitly:
@@ -67,33 +67,43 @@ falling back for a VLM would produce token streams the trainer can't
reconstruct. Text-only fine-tunes without a registered renderer fall back to
`DefaultRenderer` and log the choice at INFO.
-## `preserve_*` flags
-
-Every variant carries two renderer-agnostic flags on `_BaseRendererConfig`:
-
-- `preserve_all_thinking: bool = False` — re-emit `reasoning_content` on
- every past assistant turn, even when the chat template would drop it.
-- `preserve_thinking_between_tool_calls: bool = False` — re-emit
- `reasoning_content` only inside the in-flight tool cycle (the contiguous
- A-T-…-A block after the most recent `user` message, when it contains at
- least one `tool` response). A new user turn closes the block and drops
- its thinking.
-
-These OR-compose with template-level toggles. GLM-5's `clear_thinking` and
-Nemotron-3's `truncate_history_thinking` already gate past thinking; the
-`preserve_*` flags add to that:
-
-| `clear_thinking` | `preserve_all_thinking` | past thinking? |
-|------------------|-------------------------|----------------|
-| `True` (default — drop) | `False` (default) | dropped |
-| `True` | `True` | kept |
-| `False` (keep) | `False` | kept |
-| `False` | `True` | kept |
-
-`preserve_*` can only extend retention, never force a drop. The canonical
-use case is **compaction**: injecting a `user` turn like *"summarize the work
-so far"* puts every prior assistant in a past cycle, and
-`preserve_all_thinking=True` keeps reasoning visible end-to-end.
+## `thinking_retention`
+
+Every variant carries one renderer-agnostic flag on `_BaseRendererConfig`,
+an ascending scale whose floor is the chat template's own decision:
+
+- `thinking_retention: Literal["template", "tool_cycle", "all"] = "template"`
+ - `"template"` (default) — defer entirely to the chat template.
+ - `"tool_cycle"` — additionally re-emit `reasoning_content` inside the
+ in-flight tool cycle (the contiguous A-T-…-A block after the most
+ recent `user` message, when it contains at least one `tool` response).
+ A new user turn closes the block and drops its thinking.
+ - `"all"` — additionally re-emit `reasoning_content` on every past
+ assistant turn, even when the chat template would drop it.
+
+The levels are nested: `"all"` ⊇ `"tool_cycle"` ⊇ `"template"`, and the
+level is honoured end-to-end — `render()` and `bridge_to_next_turn` both
+consult it, so multi-turn rollouts reproduce the template's history handling
+faithfully by default. GLM-5's `clear_thinking` and Nemotron-3's
+`truncate_history_thinking` are byte-equivalent template kwargs (`False` ≡
+`"all"`) gating the same past thinking; `thinking_retention` composes with
+them as:
+
+| `clear_thinking` | `thinking_retention` | past thinking? |
+|------------------|----------------------|----------------|
+| `True` (default — drop) | `"template"` (default) | dropped |
+| `True` | `"all"` | kept |
+| `False` (keep) | `"template"` | kept |
+| `False` | `"all"` | kept |
+
+`thinking_retention` can only extend retention, never force a drop — the
+template is the floor. Because the kwarg and `thinking_retention` name the
+same thing, explicitly setting a keep-history kwarg to `False` *and* a
+non-`"all"` `thinking_retention` is contradictory and raises at config-load
+(set `thinking_retention="all"` instead). The canonical use case is **compaction**: injecting
+a `user` turn like *"summarize the work so far"* puts every prior assistant
+in a past cycle, and `thinking_retention="all"` keeps reasoning visible
+end-to-end.
## `DefaultRendererConfig` accepts arbitrary Jinja kwargs
@@ -139,7 +149,7 @@ In TOML / YAML, the discriminator routes deserialization:
name = "qwen3.5"
enable_thinking = false
add_vision_id = true
-preserve_all_thinking = true
+thinking_retention = "all"
```
Pydantic dispatches on `name = "qwen3.5"` to `Qwen35RendererConfig`. Bogus
diff --git a/renderers/base.py b/renderers/base.py
index ed5bc7e..7d5da7c 100644
--- a/renderers/base.py
+++ b/renderers/base.py
@@ -20,7 +20,11 @@
)
if TYPE_CHECKING:
- from renderers.configs import AutoRendererConfig, RendererConfig
+ from renderers.configs import (
+ AutoRendererConfig,
+ RendererConfig,
+ ThinkingRetention,
+ )
logger = logging.getLogger("renderers.base")
@@ -666,14 +670,13 @@ def render(
"""Render messages to token IDs with per-token message attribution.
Behaviour around historical ``reasoning_content`` is owned by the
- renderer instance — the ``preserve_all_thinking`` and
- ``preserve_thinking_between_tool_calls`` flags are constructor
- kwargs, not call-site kwargs. To render with a different
- configuration, build a different renderer (or different pool).
- Defaults preserve byte-identity with each model's chat template;
- flipping a flag at construction restores ``reasoning_content``
- the template would otherwise drop. See
- ``should_preserve_past_thinking`` for the per-message
+ renderer instance — the ``thinking_retention`` level is a
+ constructor kwarg, not a call-site kwarg. To render with a
+ different configuration, build a different renderer (or different
+ pool). The ``"template"`` default preserves byte-identity with
+ each model's chat template; raising the level at construction
+ restores ``reasoning_content`` the template would otherwise drop.
+ See ``should_preserve_past_thinking`` for the per-message
classification.
"""
...
@@ -769,7 +772,13 @@ def bridge_to_next_turn(
Return ``None`` whenever the renderer can't prove that contract
holds — the caller falls back to a full re-render. In particular,
bridges refuse assistant messages in ``new_messages`` (those would
- re-tokenize model-sampled content). Hand-coded renderers know their
+ re-tokenize model-sampled content), and a renderer whose template
+ drops a past block's thinking once a new user query arrives refuses
+ to span that boundary while carrying ```` verbatim would keep
+ it (it falls back to a re-render, which honours ``thinking_retention``
+ and drops the stale thinking — keeping the bridge faithful to the
+ template). ``thinking_retention="all"`` keeps thinking on every path,
+ so no decline is needed there. Hand-coded renderers know their
canonical close and synthesise it on truncated priors;
DefaultRenderer always returns ``None`` because the template's
close is unknown.
@@ -1527,10 +1536,7 @@ def _resolve_auto(tokenizer, auto: AutoRendererConfig) -> Renderer:
model_name = getattr(tokenizer, "name_or_path", "")
renderer_name = MODEL_RENDERER_MAP.get(model_name)
- preserve_carry = {
- "preserve_all_thinking": auto.preserve_all_thinking,
- "preserve_thinking_between_tool_calls": auto.preserve_thinking_between_tool_calls,
- }
+ preserve_carry = {"thinking_retention": auto.thinking_retention}
if renderer_name is not None:
cfg_cls = _config_class_for(renderer_name)
@@ -1555,11 +1561,12 @@ def _resolve_auto(tokenizer, auto: AutoRendererConfig) -> Renderer:
# Text-only fall back to default (apply_chat_template). For fine-tunes
# with customized chat templates this is the *correct* choice, so we
# don't warn. Note the pick at INFO and advertise the parser knobs.
- if auto.preserve_all_thinking or auto.preserve_thinking_between_tool_calls:
+ if auto.thinking_retention != "template":
raise NotImplementedError(
"Auto-resolved DefaultRenderer can't selectively re-emit "
"dropped reasoning_content. Pass an explicit typed renderer "
- "config (model-specific) if you need preserve_*_thinking."
+ "config (model-specific) if you need thinking_retention != "
+ "'template'."
)
logger.info(
"No model-specific renderer matched %r. Using DefaultRenderer "
@@ -1912,12 +1919,92 @@ def reject_assistant_in_extension(new_messages: list[Message]) -> bool:
return any(m.get("role") == "assistant" for m in new_messages)
+def introduces_user_query(new_messages: list[Message]) -> bool:
+ """Return True if ``new_messages`` opens a new user-query turn.
+
+ A query boundary is a ``user`` message whose content isn't a
+ ``...`` wrapper — tool responses
+ (``role="tool"``, or folded into a wrapped user turn) continue the
+ in-flight cycle rather than starting a new query. This mirrors the
+ boundary chat templates use to decide when a past assistant block's
+ thinking becomes "older" and is dropped, so a bridge can tell whether
+ spanning ``new_messages`` would cross that boundary (cf.
+ ``should_preserve_past_thinking`` and per-renderer ``_last_query_index``).
+ """
+ for m in new_messages:
+ if m.get("role") != "user":
+ continue
+ content = m.get("content")
+ if (
+ isinstance(content, str)
+ and content.startswith("")
+ and content.endswith("")
+ ):
+ continue
+ return True
+ return False
+
+
+def _contains_subsequence(haystack: list[int], needle: list[int]) -> bool:
+ """True if ``needle`` appears as a contiguous run inside ``haystack``."""
+ n = len(needle)
+ if n == 0:
+ return False
+ if n == 1:
+ return needle[0] in haystack
+ first = needle[0]
+ for i in range(len(haystack) - n + 1):
+ if haystack[i] == first and haystack[i : i + n] == needle:
+ return True
+ return False
+
+
+def reject_thinking_strip_in_extension(
+ previous_prompt_ids: list[int],
+ previous_completion_ids: list[int],
+ new_messages: list[Message],
+ *,
+ thinking_retention: ThinkingRetention,
+ thinking_marker_ids: list[int],
+ enable_thinking: bool = True,
+) -> bool:
+ """Return True if a bridge must refuse to span ``new_messages``.
+
+ A renderer whose template drops a past block's thinking once a new user
+ query arrives can't span that boundary while keeping prior tokens
+ verbatim — the kept thinking would diverge from a faithful re-render
+ (which honours ``thinking_retention`` and drops the stale thinking).
+ Thinking renderers OR this into their bridge's reject check.
+
+ ``thinking_marker_ids`` is a token subsequence whose presence in the
+ prior tokens signals a sampled thinking block — a single-token close
+ like ``[]`` (Qwen3, GLM, Nemotron), a multi-token close
+ (``Kimi-K2.5``), or a harmony analysis-channel header
+ (``[<|channel|>, …"analysis"…, <|message|>]`` for gpt-oss).
+ ``thinking_retention="all"`` keeps thinking on every path, so it never
+ declines.
+
+ Returns ``False`` (safe to bridge) when: ``thinking_retention="all"``;
+ thinking is disabled (no sampled thinking to strip — also avoids the
+ empty ```` generation-prompt scaffold a disabled model
+ leaves in ``previous_prompt_ids``); ``new_messages`` stays in the
+ in-flight cycle (no new user query); or the prior tokens carry no
+ thinking marker to strip.
+ """
+ if thinking_retention == "all" or not enable_thinking:
+ return False
+ if not introduces_user_query(new_messages):
+ return False
+ return _contains_subsequence(
+ previous_completion_ids, thinking_marker_ids
+ ) or _contains_subsequence(previous_prompt_ids, thinking_marker_ids)
+
+
def should_preserve_past_thinking(
messages: list[Message],
msg_idx: int,
*,
- preserve_all_thinking: bool,
- preserve_thinking_between_tool_calls: bool,
+ thinking_retention: ThinkingRetention,
) -> bool:
"""Should ``messages[msg_idx]``'s ``reasoning_content`` be emitted as
thinking even when the chat template would drop it?
@@ -1925,25 +2012,25 @@ def should_preserve_past_thinking(
Returns ``True`` only as an override above the template default. Each
renderer ORs this into its own "render thinking?" condition; a result
of ``False`` means "follow the template" (drop or keep as the template
- decides), not "force-drop".
-
- Override rules:
-
- - ``preserve_all_thinking`` — every past-asst's thinking is kept.
- - ``preserve_thinking_between_tool_calls`` — keeps thinking only
- inside the *current* tool cycle: the contiguous A-T-...-A block
- after the most recent ``user`` message, and only if that block
- contains at least one ``tool`` response. As soon as a new
- ``user`` turn arrives, the previous block becomes "older" and
- its thinking is dropped (template default), matching how most
- chat templates already handle multi-turn contexts. Use
- ``preserve_all_thinking`` if you need thinking on older blocks
- to survive the user-turn boundary too.
+ decides), not "force-drop". ``thinking_retention`` selects how far the
+ override reaches:
+
+ - ``"template"`` — no override; defer to the template (always ``False``).
+ - ``"all"`` — every past-asst's thinking is kept (always ``True``).
+ - ``"tool_cycle"`` — keeps thinking only inside the *current* tool
+ cycle: the contiguous A-T-...-A block after the most recent ``user``
+ message, and only if that block contains at least one ``tool``
+ response. As soon as a new ``user`` turn arrives, the previous block
+ becomes "older" and its thinking is dropped (template default),
+ matching how most chat templates handle multi-turn contexts. Use
+ ``"all"`` if you need thinking on older blocks to survive the
+ user-turn boundary too.
"""
- if preserve_all_thinking:
+ if thinking_retention == "all":
return True
- if not preserve_thinking_between_tool_calls:
+ if thinking_retention == "template":
return False
+ # thinking_retention == "tool_cycle"
# Most recent user message (or -1 if none).
last_user = -1
for j in range(len(messages) - 1, -1, -1):
diff --git a/renderers/configs.py b/renderers/configs.py
index d500f8e..b87cbe1 100644
--- a/renderers/configs.py
+++ b/renderers/configs.py
@@ -4,14 +4,15 @@
Each renderer accepts its own typed config; bad combinations (e.g.
``add_vision_id`` under ``name="qwen3"``) fail at config-load time with a
pydantic ``ValidationError`` rather than at runtime via an allowlist
-check. The shared ``preserve_*`` flags live on ``BaseRendererConfig``
-and OR-compose with template-level toggles (e.g. GLM-5
-``clear_thinking``) inside each renderer — they extend retention, never
-override the template into a drop.
+check. The shared ``thinking_retention`` flag lives on
+``BaseRendererConfig`` and OR-composes with template-level toggles (e.g.
+GLM-5 ``clear_thinking``) inside each renderer — it extends retention,
+never overrides the template into a drop.
``AutoRendererConfig`` is a placeholder variant: ``create_renderer``
resolves it via ``MODEL_RENDERER_MAP`` and constructs the matching
-typed config with the auto config's ``preserve_*`` fields carried over.
+typed config with the auto config's ``thinking_retention`` field carried
+over.
``DefaultRendererConfig`` uses ``extra="allow"`` to accept arbitrary
Jinja kwargs as ``model_extra`` — ``DefaultRenderer`` doesn't know which
@@ -22,10 +23,45 @@
from typing import Annotated, ClassVar, Literal, Union
-from pydantic import ConfigDict, Field
+from pydantic import ConfigDict, Field, model_validator
from pydantic_config import BaseConfig
+def _reject_thinking_retention_conflict(config: BaseConfig, kwarg_name: str) -> None:
+ """Raise if an explicit template thinking-kwarg contradicts an explicit
+ ``thinking_retention``.
+
+ ``clear_thinking`` (GLM) / ``truncate_history_thinking`` (Nemotron) are
+ byte-equivalent to ``thinking_retention``: setting one ``False`` keeps
+ all past thinking, i.e. ``thinking_retention="all"``. They're the same
+ knob, so a user who explicitly sets both to disagreeing values almost
+ certainly has a bug — surface it rather than silently resolving. Only
+ fires when BOTH are explicit; a defaulted value is not an intent (and
+ this keeps the per-kwarg parity matrix, which sets one field at a time,
+ working).
+ """
+ fields_set = config.__pydantic_fields_set__
+ if (
+ kwarg_name in fields_set
+ and "thinking_retention" in fields_set
+ and getattr(config, kwarg_name) is False
+ and config.thinking_retention != "all"
+ ):
+ raise ValueError(
+ f"{kwarg_name}=False keeps all past thinking (the same as "
+ f"thinking_retention='all'), which conflicts with the explicit "
+ f"thinking_retention={config.thinking_retention!r}. They are the "
+ f"same knob — set thinking_retention='all' (or drop {kwarg_name})."
+ )
+
+
+ThinkingRetention = Literal["template", "tool_cycle", "all"]
+"""How far past-assistant ``reasoning_content`` is retained, as an override
+*above* the chat-template floor. ``"template"`` defers to the template;
+``"tool_cycle"`` and ``"all"`` progressively extend retention and never force a
+drop. See :attr:`BaseRendererConfig.thinking_retention`."""
+
+
class BaseRendererConfig(BaseConfig):
"""Shared fields and config for every renderer config variant.
@@ -35,28 +71,32 @@ class BaseRendererConfig(BaseConfig):
this class adds ``frozen=True`` so configs are hashable value
objects.
- ``preserve_all_thinking`` and ``preserve_thinking_between_tool_calls``
- are renderer-internal behaviour flags — they don't map to any Jinja
- chat-template kwarg. They OR-compose with template-level toggles on
- renderers that expose one (GLM-5 ``clear_thinking``, Nemotron-3
- ``truncate_history_thinking``): either flag saying "keep this
- thinking" wins. preserve_* can only ever extend retention; setting
- ``preserve_all_thinking=True`` always keeps past thinking, regardless
- of the template kwarg. See ``renderers.base.should_preserve_past_thinking``.
+ ``thinking_retention`` is a renderer-internal behaviour flag — it
+ doesn't map to any Jinja chat-template kwarg. It OR-composes with
+ template-level toggles on renderers that expose one (GLM-5
+ ``clear_thinking``, Nemotron-3 ``truncate_history_thinking``):
+ whichever side says "keep this thinking" wins. The chat template is
+ the floor — ``thinking_retention`` can only ever *extend* retention,
+ never override the template into a drop. See
+ ``renderers.base.should_preserve_past_thinking``.
"""
model_config = ConfigDict(frozen=True)
- preserve_all_thinking: bool = False
- """Restore ``reasoning_content`` on every past assistant turn, even
- when the chat template would drop it. Strict superset of
- ``preserve_thinking_between_tool_calls``."""
+ thinking_retention: ThinkingRetention = "template"
+ """How much past-assistant ``reasoning_content`` to re-emit, layered as
+ an override *above* the chat template's own decision. The template is
+ the floor — each level only ever *extends* retention, never forces a
+ drop:
+
+ - ``"template"`` (default) — defer entirely to the chat template.
+ - ``"tool_cycle"`` — additionally keep thinking inside the in-flight
+ tool cycle: the contiguous A-T-...-A block after the most recent
+ ``user`` turn, when it contains at least one ``tool`` response. A new
+ user turn closes the block.
+ - ``"all"`` — additionally keep thinking on every past assistant turn.
- preserve_thinking_between_tool_calls: bool = False
- """Restore ``reasoning_content`` only inside the in-flight tool cycle:
- the contiguous A-T-...-A block after the most recent ``user`` turn,
- and only if it contains at least one ``tool`` response. A new user
- turn closes the block and drops its thinking (template default)."""
+ Ascending and nested: ``"all"`` ⊇ ``"tool_cycle"`` ⊇ ``"template"``."""
# Fields that are renderer-internal — not forwarded to (or mirrored
# by) ``apply_chat_template``. Override in subclasses that hold
@@ -87,8 +127,9 @@ def template_field_names(cls) -> frozenset[str]:
class AutoRendererConfig(BaseRendererConfig):
"""Resolve the renderer from ``tokenizer.name_or_path`` at construction
- time via ``MODEL_RENDERER_MAP``. Carries only the shared ``preserve_*``
- fields; template kwargs require an explicit renderer choice so that
+ time via ``MODEL_RENDERER_MAP``. Carries only the shared
+ ``thinking_retention`` field; template kwargs require an explicit
+ renderer choice so that
template-dependent behaviour stays visible at the call site."""
name: Literal["auto"] = "auto"
@@ -119,6 +160,26 @@ class DefaultRendererConfig(BaseRendererConfig):
# template. Jinja kwargs live in ``model_extra`` (extra="allow").
_internal_fields = frozenset({"tool_parser", "reasoning_parser"})
+ @model_validator(mode="after")
+ def _reject_legacy_preserve_flags(self):
+ # ``extra="allow"`` would otherwise swallow the removed ``preserve_*``
+ # bools into ``model_extra`` and forward them to apply_chat_template,
+ # silently dropping the user's intent (DefaultRenderer can't
+ # selectively re-emit reasoning_content). Reject them like every other
+ # config's ``extra="forbid"`` does, pointing at the replacement.
+ legacy = {
+ "preserve_all_thinking",
+ "preserve_thinking_between_tool_calls",
+ } & set(self.model_extra or {})
+ if legacy:
+ raise ValueError(
+ f"{sorted(legacy)} were replaced by thinking_retention. "
+ "DefaultRenderer falls back to apply_chat_template and can't "
+ "selectively re-emit reasoning_content — use thinking_retention "
+ "on a model-specific renderer."
+ )
+ return self
+
class Qwen3RendererConfig(BaseRendererConfig):
"""Qwen3 (text-only) renderer config."""
@@ -199,8 +260,13 @@ class GLM5RendererConfig(BaseRendererConfig):
"""When ``False``, the renderer keeps ``{reasoning}``
on past-cycle assistant turns instead of dropping them. Mirrors the
chat template's ``clear_thinking`` toggle. OR-composes with
- ``preserve_all_thinking`` / ``preserve_thinking_between_tool_calls``
- — see :class:`BaseRendererConfig` for the contract."""
+ ``thinking_retention`` — see :class:`BaseRendererConfig` for the
+ contract."""
+
+ @model_validator(mode="after")
+ def _check_thinking_retention(self):
+ _reject_thinking_retention_conflict(self, "clear_thinking")
+ return self
class GLM51RendererConfig(BaseRendererConfig):
@@ -215,6 +281,11 @@ class GLM51RendererConfig(BaseRendererConfig):
clear_thinking: bool = True
"""See :class:`GLM5RendererConfig.clear_thinking`."""
+ @model_validator(mode="after")
+ def _check_thinking_retention(self):
+ _reject_thinking_retention_conflict(self, "clear_thinking")
+ return self
+
class GLM45RendererConfig(BaseRendererConfig):
"""GLM-4.5 Air renderer config."""
@@ -321,10 +392,11 @@ class LagunaXS2RendererConfig(BaseRendererConfig):
class Llama3RendererConfig(BaseRendererConfig):
"""Llama-3.x Instruct renderer config.
- Llama-3 ships no reasoning channel, so the base ``preserve_*_thinking``
- flags don't apply: ``Llama3Renderer`` raises ``NotImplementedError``
- if either is set (matching ``DefaultRenderer``'s contract for the
- same case). Both fields below mirror real ``apply_chat_template``
+ Llama-3 ships no reasoning channel, so the base ``thinking_retention``
+ flag is a no-op: there's never any past-assistant thinking to retain
+ or drop, so any level leaves the token stream unchanged (same contract
+ as Kimi-K2 / Qwen3-VL). Both fields below mirror real
+ ``apply_chat_template``
kwargs.
"""
@@ -373,8 +445,13 @@ class Nemotron3RendererConfig(BaseRendererConfig):
"""When ``False``, keep ``{reasoning}`` on past-cycle
assistant turns instead of dropping them. Mirrors the chat
template's ``truncate_history_thinking`` toggle. OR-composes with
- ``preserve_all_thinking`` / ``preserve_thinking_between_tool_calls``
- — see :class:`BaseRendererConfig` for the contract."""
+ ``thinking_retention`` — see :class:`BaseRendererConfig` for the
+ contract."""
+
+ @model_validator(mode="after")
+ def _check_thinking_retention(self):
+ _reject_thinking_retention_conflict(self, "truncate_history_thinking")
+ return self
low_effort: bool = False
"""When ``True``, append ``\\n\\n{reasoning effort: low}`` to the last user
@@ -404,6 +481,11 @@ class Nemotron3UltraRendererConfig(BaseRendererConfig):
truncate_history_thinking: bool = True
"""See :class:`Nemotron3RendererConfig.truncate_history_thinking`."""
+ @model_validator(mode="after")
+ def _check_thinking_retention(self):
+ _reject_thinking_retention_conflict(self, "truncate_history_thinking")
+ return self
+
medium_effort: bool = False
"""When ``True``, append ``\\n\\n{reasoning effort: efficient}`` to the last
user message. Mirrors the Ultra chat template's ``medium_effort`` kwarg."""
@@ -426,8 +508,8 @@ class DeepSeekR1RendererConfig(BaseRendererConfig):
R1 always reasons — its chat template unconditionally prefills
``\\n`` at the generation prompt and strips ```` from
historical assistant turns. There is therefore no ``enable_thinking``
- knob (thinking is not optional), and ``preserve_*`` flags are no-ops
- (history reasoning is always dropped); both stored for protocol
+ knob (thinking is not optional), and ``thinking_retention`` is a no-op
+ (history reasoning is always dropped); stored for protocol
uniformity. Applies to full ``deepseek-ai/DeepSeek-R1`` / ``-R1-0528``
— NOT the R1-Distill-Qwen/Llama models, which use those base
tokenizers and route to the Qwen3 / Llama-3 renderers.
@@ -473,7 +555,7 @@ class DeepSeekR1RendererConfig(BaseRendererConfig):
# Map discriminator → config class. Used by ``create_renderer`` when
# resolving ``AutoRendererConfig`` against ``MODEL_RENDERER_MAP``: the
# resolved renderer name picks the corresponding typed config, and the
-# auto config's ``preserve_*`` fields are carried over.
+# auto config's ``thinking_retention`` field is carried over.
_CONFIG_BY_NAME: dict[str, type[BaseRendererConfig]] = {
"auto": AutoRendererConfig,
"default": DefaultRendererConfig,
@@ -543,5 +625,6 @@ def config_from_name(name: str) -> BaseRendererConfig | None:
"Qwen3RendererConfig",
"Qwen3VLRendererConfig",
"RendererConfig",
+ "ThinkingRetention",
"config_from_name",
]
diff --git a/renderers/deepseek_v3.py b/renderers/deepseek_v3.py
index 5f4840a..3cb82d3 100644
--- a/renderers/deepseek_v3.py
+++ b/renderers/deepseek_v3.py
@@ -48,7 +48,7 @@ class DeepSeekV3Renderer:
assistant content is emitted verbatim. The reasoning variant
(````-prefilled prompt, history reasoning stripped) lives in
:class:`renderers.deepseek_r1.DeepSeekR1Renderer`, which subclasses
- this one. ``preserve_*`` flags are no-ops here (no reasoning channel),
+ this one. ``thinking_retention`` is a no-op here (no reasoning channel),
stored for protocol uniformity.
"""
diff --git a/renderers/default.py b/renderers/default.py
index a662097..2e9c7d2 100644
--- a/renderers/default.py
+++ b/renderers/default.py
@@ -95,11 +95,12 @@ def __init__(
config: DefaultRendererConfig | None = None,
):
cfg = config or DefaultRendererConfig()
- if cfg.preserve_all_thinking or cfg.preserve_thinking_between_tool_calls:
+ if cfg.thinking_retention != "template":
raise NotImplementedError(
"DefaultRenderer falls back to apply_chat_template and can't "
"selectively re-emit dropped reasoning_content. Configure a "
- "model-specific renderer if you need preserve_*_thinking."
+ "model-specific renderer if you need thinking_retention != "
+ "'template'."
)
self._tokenizer = tokenizer
self.config = cfg
diff --git a/renderers/glm45.py b/renderers/glm45.py
index 7af9259..7ca4d1c 100644
--- a/renderers/glm45.py
+++ b/renderers/glm45.py
@@ -23,6 +23,7 @@
attribute_text_segments,
extract_message_tool_names,
reject_assistant_in_extension,
+ reject_thinking_strip_in_extension,
should_preserve_past_thinking,
)
from renderers.configs import GLM45RendererConfig
@@ -228,8 +229,7 @@ def emit_text_segments(
preserve_thinking = should_preserve_past_thinking(
messages,
i,
- preserve_all_thinking=self.config.preserve_all_thinking,
- preserve_thinking_between_tool_calls=self.config.preserve_thinking_between_tool_calls,
+ thinking_retention=self.config.thinking_retention,
)
self._render_assistant(
msg,
@@ -293,7 +293,7 @@ def parse_response(
token_ids,
stop_ids={self._endoftext, self._user, self._observation},
think_id=self._think,
- think_end_id=self._think_end,
+ thinking_marker_ids=[self._think_end],
tool_call_id=self._tool_call_tok,
tool_call_end_id=self._tool_call_end_tok,
arg_key_id=self._arg_key,
@@ -321,6 +321,19 @@ def bridge_to_next_turn(
):
return None
+ # Faithfulness across a user-query boundary: the template drops a past
+ # block's thinking once a new user turn arrives (see
+ # reject_thinking_strip_in_extension).
+ if reject_thinking_strip_in_extension(
+ previous_prompt_ids,
+ previous_completion_ids,
+ new_messages,
+ thinking_retention=self.config.thinking_retention,
+ thinking_marker_ids=[self._think_end],
+ enable_thinking=self.config.enable_thinking,
+ ):
+ return None
+
# Same next-turn-marker scheme as GLM-5, but role markers are
# followed by a literal ``\n`` in the prompt text.
previous_ids = list(previous_prompt_ids) + list(previous_completion_ids)
diff --git a/renderers/glm5.py b/renderers/glm5.py
index 924d754..ad215e5 100644
--- a/renderers/glm5.py
+++ b/renderers/glm5.py
@@ -24,6 +24,7 @@
attribute_text_segments,
extract_message_tool_names,
reject_assistant_in_extension,
+ reject_thinking_strip_in_extension,
should_preserve_past_thinking,
)
from renderers.configs import GLM5RendererConfig, GLM51RendererConfig
@@ -241,8 +242,7 @@ def emit_text_segments(
preserve_thinking = should_preserve_past_thinking(
messages,
i,
- preserve_all_thinking=self.config.preserve_all_thinking,
- preserve_thinking_between_tool_calls=self.config.preserve_thinking_between_tool_calls,
+ thinking_retention=self.config.thinking_retention,
)
self._render_assistant(
msg,
@@ -309,7 +309,7 @@ def parse_response(
token_ids,
stop_ids={self._endoftext, self._user, self._observation},
think_id=self._think,
- think_end_id=self._think_end,
+ thinking_marker_ids=[self._think_end],
tool_call_id=self._tool_call_tok,
tool_call_end_id=self._tool_call_end_tok,
arg_key_id=self._arg_key,
@@ -337,6 +337,25 @@ def bridge_to_next_turn(
):
return None
+ # Faithfulness across a user-query boundary: the template drops a past
+ # block's thinking once a new user turn arrives (see
+ # reject_thinking_strip_in_extension). ``clear_thinking=False`` keeps
+ # all past thinking (≡ thinking_retention="all"), so the bridge stays
+ # faithful carrying it verbatim — fold it into the effective level so
+ # we don't over-decline and re-tokenize sampled thinking.
+ retention = (
+ "all" if not self.config.clear_thinking else self.config.thinking_retention
+ )
+ if reject_thinking_strip_in_extension(
+ previous_prompt_ids,
+ previous_completion_ids,
+ new_messages,
+ thinking_retention=retention,
+ thinking_marker_ids=[self._think_end],
+ enable_thinking=self.config.enable_thinking,
+ ):
+ return None
+
# GLM has no per-turn close token. An assistant turn ends when the
# next turn's role marker appears, OR the model emits <|endoftext|>.
# vLLM includes these in ``stop_token_ids`` so a clean stop leaves
@@ -501,7 +520,7 @@ def _render_assistant(
# ``preserve_thinking`` is the override output of
# ``should_preserve_past_thinking`` — it adds historical assistants
# back when the renderer was constructed with
- # ``preserve_all_thinking=True``. ``clear_thinking=False`` mirrors
+ # ``thinking_retention="all"``. ``clear_thinking=False`` mirrors
# the template's per-call ``clear_thinking is defined and not
# clear_thinking`` gate: a chat_template_kwarg surface for the
# same behaviour, gated explicitly by the caller per render.
diff --git a/renderers/gpt_oss.py b/renderers/gpt_oss.py
index 2a9c5ca..9dbc6b7 100644
--- a/renderers/gpt_oss.py
+++ b/renderers/gpt_oss.py
@@ -58,6 +58,7 @@
ToolSpec,
extract_message_tool_names,
reject_assistant_in_extension,
+ reject_thinking_strip_in_extension,
should_preserve_past_thinking,
trim_to_turn_close,
)
@@ -154,6 +155,14 @@ def __init__(
self._channel = self._token_id("<|channel|>")
self._message = self._token_id("<|message|>")
self._constrain = self._token_id("<|constrain|>")
+ # Harmony marks reasoning as an analysis channel:
+ # ``<|channel|>analysis<|message|>``. Its presence in prior tokens
+ # signals a sampled thinking block the template strips from history.
+ self._analysis_marker_ids = [
+ self._channel,
+ *self._encode("analysis"),
+ self._message,
+ ]
# ── token utilities ──────────────────────────────────────────────────────
@@ -425,8 +434,7 @@ def emit_harmony_message(
should_preserve_past_thinking(
messages,
i,
- preserve_all_thinking=self.config.preserve_all_thinking,
- preserve_thinking_between_tool_calls=self.config.preserve_thinking_between_tool_calls,
+ thinking_retention=self.config.thinking_retention,
)
)
for hm in self._to_harmony_messages(
@@ -524,6 +532,20 @@ def bridge_to_next_turn(
):
return None
+ # Faithfulness across a user-query boundary: harmony strips the
+ # analysis (reasoning) channel from history once a new user turn
+ # arrives, so the bridge can't carry it verbatim there (see
+ # reject_thinking_strip_in_extension). gpt-oss always reasons.
+ if reject_thinking_strip_in_extension(
+ previous_prompt_ids,
+ previous_completion_ids,
+ new_messages,
+ thinking_retention=self.config.thinking_retention,
+ thinking_marker_ids=self._analysis_marker_ids,
+ enable_thinking=True,
+ ):
+ return None
+
previous_ids = trim_to_turn_close(
previous_prompt_ids,
previous_completion_ids,
diff --git a/renderers/kimi_k2.py b/renderers/kimi_k2.py
index e99dfa7..8cdc548 100644
--- a/renderers/kimi_k2.py
+++ b/renderers/kimi_k2.py
@@ -38,9 +38,9 @@ class KimiK2Renderer:
Kimi K2's chat template doesn't read any thinking-related variable —
``content`` renders verbatim with no reasoning branch. The
- ``enable_thinking`` / ``preserve_*`` fields on the config are stored
- for protocol uniformity with the rest of the renderer family but
- have no effect on the byte-level output.
+ ``enable_thinking`` / ``thinking_retention`` fields on the config are
+ stored for protocol uniformity with the rest of the renderer family
+ but have no effect on the byte-level output.
"""
def __init__(
diff --git a/renderers/kimi_k25.py b/renderers/kimi_k25.py
index bca4464..e90a9ba 100644
--- a/renderers/kimi_k25.py
+++ b/renderers/kimi_k25.py
@@ -38,6 +38,7 @@
ToolSpec,
extract_message_tool_names,
reject_assistant_in_extension,
+ reject_thinking_strip_in_extension,
should_preserve_past_thinking,
trim_to_turn_close,
)
@@ -888,8 +889,7 @@ def emit_image(
preserve_thinking = should_preserve_past_thinking(
messages,
i,
- preserve_all_thinking=self.config.preserve_all_thinking,
- preserve_thinking_between_tool_calls=self.config.preserve_thinking_between_tool_calls,
+ thinking_retention=self.config.thinking_retention,
)
self._render_assistant_body(
msg,
@@ -1036,6 +1036,20 @@ def bridge_to_next_turn(
):
return None
+ # Faithfulness across a user-query boundary: the template drops a past
+ # block's thinking once a new user turn arrives. ```` is
+ # multi-token here, so pass the full close subsequence (see
+ # reject_thinking_strip_in_extension).
+ if reject_thinking_strip_in_extension(
+ previous_prompt_ids,
+ previous_completion_ids,
+ new_messages,
+ thinking_retention=self.config.thinking_retention,
+ thinking_marker_ids=self._think_close_ids,
+ enable_thinking=self.config.thinking,
+ ):
+ return None
+
close_ids: set[int] = {self._im_end}
if self._endoftext is not None:
close_ids.add(self._endoftext)
diff --git a/renderers/laguna_xs2.py b/renderers/laguna_xs2.py
index bd6b64f..71953b9 100644
--- a/renderers/laguna_xs2.py
+++ b/renderers/laguna_xs2.py
@@ -17,10 +17,8 @@
a ``### Tools`` header with an ```` listing and prose
format instructions that vary on ``enable_thinking``).
- Reasoning is rendered for every assistant message — no last-user-index
- gating. ``preserve_all_thinking`` and
- ``preserve_thinking_between_tool_calls`` are accepted for protocol
- uniformity but are effectively no-ops since past reasoning is preserved
- by default.
+ gating. ``thinking_retention`` is accepted for protocol uniformity but
+ is effectively a no-op since past reasoning is preserved by default.
"""
from __future__ import annotations
diff --git a/renderers/llama_3.py b/renderers/llama_3.py
index d15792a..5dd2798 100644
--- a/renderers/llama_3.py
+++ b/renderers/llama_3.py
@@ -9,7 +9,7 @@
Notable differences from the Qwen / GLM family renderers:
* No ```` / reasoning channel — Llama-3 doesn't ship a
- reasoning-content concept, so ``preserve_*_thinking`` flags don't
+ reasoning-content concept, so the ``thinking_retention`` flag doesn't
apply.
* ``<|begin_of_text|>`` (BOS) is emitted at the very start of every
render. The chat template never omits it.
@@ -94,10 +94,10 @@ def __init__(
tokenizer: PreTrainedTokenizer,
config: Llama3RendererConfig | None = None,
):
- # ``preserve_*_thinking`` are accepted but no-ops: Llama-3 ships no
+ # ``thinking_retention`` is accepted but a no-op: Llama-3 ships no
# reasoning_content channel, so there's never any past-assistant
- # thinking to retain or drop. The flags are stored on ``self.config``
- # for cross-renderer uniformity but never change the token stream —
+ # thinking to retain or drop. The level is stored on ``self.config``
+ # for cross-renderer uniformity but never changes the token stream —
# the same contract as Kimi-K2 / Qwen3-VL (see the never-preserves
# renderers in tests/test_preserve_thinking.py).
self._tokenizer = tokenizer
diff --git a/renderers/minimax_m2.py b/renderers/minimax_m2.py
index f990274..9c92353 100644
--- a/renderers/minimax_m2.py
+++ b/renderers/minimax_m2.py
@@ -240,8 +240,7 @@ def emit_token_overlap_body(
preserve_thinking = should_preserve_past_thinking(
messages,
orig_idx,
- preserve_all_thinking=self.config.preserve_all_thinking,
- preserve_thinking_between_tool_calls=self.config.preserve_thinking_between_tool_calls,
+ thinking_retention=self.config.thinking_retention,
)
self._render_assistant(
msg,
diff --git a/renderers/nemotron3.py b/renderers/nemotron3.py
index c29129c..7f98c87 100644
--- a/renderers/nemotron3.py
+++ b/renderers/nemotron3.py
@@ -27,6 +27,7 @@
attribute_text_segments,
extract_message_tool_names,
reject_assistant_in_extension,
+ reject_thinking_strip_in_extension,
should_preserve_past_thinking,
trim_to_turn_close,
)
@@ -424,14 +425,13 @@ def emit_text_segments(
elif role == "assistant":
# Template: ``include_content = not (truncate_history_thinking
# and loop.index0 < last_user_idx)``. The renderer-internal
- # preserve_* overrides only ever *extend* retention, so OR them
- # in (a preserved turn keeps its thinking even when the
- # template default would drop it).
+ # ``thinking_retention`` override only ever *extends*
+ # retention, so OR it in (a preserved turn keeps its thinking
+ # even when the template default would drop it).
preserve_thinking = msg_orig_idx >= 0 and should_preserve_past_thinking(
original_messages,
msg_orig_idx,
- preserve_all_thinking=self.config.preserve_all_thinking,
- preserve_thinking_between_tool_calls=self.config.preserve_thinking_between_tool_calls,
+ thinking_retention=self.config.thinking_retention,
)
include_content = (
not self.config.truncate_history_thinking
@@ -510,7 +510,7 @@ def parse_response(
token_ids,
stop_ids=stop_ids,
think_id=self._think,
- think_end_id=self._think_end,
+ thinking_marker_ids=[self._think_end],
tool_call_id=self._tool_call,
tool_call_end_id=self._tool_call_end,
tools=tools,
@@ -542,6 +542,28 @@ def bridge_to_next_turn(
):
return None
+ # Faithfulness across a user-query boundary: the template drops a past
+ # block's thinking once a new user turn arrives (see
+ # reject_thinking_strip_in_extension). ``truncate_history_thinking=
+ # False`` keeps all past thinking (≡ thinking_retention="all"), so the
+ # bridge stays faithful carrying it verbatim — fold it into the
+ # effective level so we don't over-decline and re-tokenize sampled
+ # thinking.
+ retention = (
+ "all"
+ if not self.config.truncate_history_thinking
+ else self.config.thinking_retention
+ )
+ if reject_thinking_strip_in_extension(
+ previous_prompt_ids,
+ previous_completion_ids,
+ new_messages,
+ thinking_retention=retention,
+ thinking_marker_ids=[self._think_end],
+ enable_thinking=self.config.enable_thinking,
+ ):
+ return None
+
close_ids: set[int] = {self._im_end}
if self._endoftext is not None:
close_ids.add(self._endoftext)
@@ -704,8 +726,9 @@ def _assistant_body(
"""Assemble the assistant body string exactly as the chat template.
``include_content`` is the template's ``not (truncate_history_thinking
- and loop.index0 < last_user_idx)`` (already OR-ed with the preserve_*
- overrides by the caller): ``True`` keeps the full think+content block,
+ and loop.index0 < last_user_idx)`` (already OR-ed with the
+ ``thinking_retention`` override by the caller): ``True`` keeps the
+ full think+content block,
``False`` collapses historical thinking to an empty ````.
"""
ultra = self._ultra
diff --git a/renderers/qwen3.py b/renderers/qwen3.py
index 9253b3d..19141a5 100644
--- a/renderers/qwen3.py
+++ b/renderers/qwen3.py
@@ -21,6 +21,7 @@
attribute_text_segments,
extract_message_tool_names,
reject_assistant_in_extension,
+ reject_thinking_strip_in_extension,
should_preserve_past_thinking,
trim_to_turn_close,
)
@@ -209,8 +210,7 @@ def emit_text_segments(
preserve_thinking = should_preserve_past_thinking(
messages,
i,
- preserve_all_thinking=self.config.preserve_all_thinking,
- preserve_thinking_between_tool_calls=self.config.preserve_thinking_between_tool_calls,
+ thinking_retention=self.config.thinking_retention,
)
self._render_assistant(
msg,
@@ -298,6 +298,19 @@ def bridge_to_next_turn(
):
return None
+ # Faithfulness across a user-query boundary: Qwen3's template drops a
+ # past block's thinking once a new user turn arrives, so the bridge
+ # can't carry it verbatim there (see reject_thinking_strip_in_extension).
+ if reject_thinking_strip_in_extension(
+ previous_prompt_ids,
+ previous_completion_ids,
+ new_messages,
+ thinking_retention=self.config.thinking_retention,
+ thinking_marker_ids=[self._think_end],
+ enable_thinking=self.config.enable_thinking,
+ ):
+ return None
+
previous_ids = trim_to_turn_close(
previous_prompt_ids,
previous_completion_ids,
diff --git a/renderers/qwen35.py b/renderers/qwen35.py
index cdb8ee1..5878487 100644
--- a/renderers/qwen35.py
+++ b/renderers/qwen35.py
@@ -29,6 +29,7 @@
attribute_text_segments,
extract_message_tool_names,
reject_assistant_in_extension,
+ reject_thinking_strip_in_extension,
should_preserve_past_thinking,
trim_to_turn_close,
)
@@ -511,8 +512,7 @@ def flush_buf() -> None:
preserve_thinking = should_preserve_past_thinking(
messages,
i,
- preserve_all_thinking=self.config.preserve_all_thinking,
- preserve_thinking_between_tool_calls=self.config.preserve_thinking_between_tool_calls,
+ thinking_retention=self.config.thinking_retention,
)
self._render_assistant(
msg,
@@ -594,7 +594,7 @@ def parse_response(
token_ids,
stop_ids={self._im_end, self._endoftext},
think_id=self._think,
- think_end_id=self._think_end,
+ thinking_marker_ids=[self._think_end],
tool_call_id=self._tool_call,
tool_call_end_id=self._tool_call_end,
tools=tools,
@@ -619,6 +619,19 @@ def bridge_to_next_turn(
):
return None
+ # Faithfulness across a user-query boundary: the template drops a past
+ # block's thinking once a new user turn arrives (see
+ # reject_thinking_strip_in_extension).
+ if reject_thinking_strip_in_extension(
+ previous_prompt_ids,
+ previous_completion_ids,
+ new_messages,
+ thinking_retention=self.config.thinking_retention,
+ thinking_marker_ids=[self._think_end],
+ enable_thinking=self.config.enable_thinking,
+ ):
+ return None
+
previous_ids = trim_to_turn_close(
previous_prompt_ids,
previous_completion_ids,
diff --git a/renderers/qwen36.py b/renderers/qwen36.py
index 6adf867..938efd8 100644
--- a/renderers/qwen36.py
+++ b/renderers/qwen36.py
@@ -10,8 +10,7 @@
Historical-thinking retention follows Qwen3.5's default (drop past
```` blocks). The upstream template carries a ``preserve_thinking``
Jinja toggle for the opposite polarity; on the renderer side that intent
-maps to the renderer-agnostic ``preserve_all_thinking`` /
-``preserve_thinking_between_tool_calls`` flags on
+maps to the renderer-agnostic ``thinking_retention`` flag on
:class:`renderers.Qwen36RendererConfig`.
Everything else — tool system prompt, tool-call XML structure, thinking
diff --git a/renderers/qwen3_vl.py b/renderers/qwen3_vl.py
index 7b82d7e..f223ff9 100644
--- a/renderers/qwen3_vl.py
+++ b/renderers/qwen3_vl.py
@@ -302,9 +302,9 @@ class Qwen3VLRenderer:
keyed off ``tokenizer.name_or_path`` the first time a
multimodal part is seen.
- ``preserve_all_thinking`` / ``preserve_thinking_between_tool_calls``
- on the config are no-ops here — the chat template drops past
- ```` blocks unconditionally. Stored for Protocol parity.
+ ``thinking_retention`` on the config is a no-op here — the chat
+ template drops past ```` blocks unconditionally. Stored for
+ Protocol parity.
"""
def __init__(
diff --git a/tests/test_bridge.py b/tests/test_bridge.py
index 8b7bf77..714092c 100644
--- a/tests/test_bridge.py
+++ b/tests/test_bridge.py
@@ -67,7 +67,25 @@ def br_renderer(br_model, br_renderer_name):
return _load(br_model, br_renderer_name)[1]
-def _simulate_prior_turn(renderer):
+@pytest.fixture
+def br_renderer_all(br_model, br_renderer_name):
+ """Renderer forced to ``thinking_retention="all"``.
+
+ The verbatim-extension mechanic tests below cross a user-query
+ boundary. For thinking models the template drops a past block's
+ thinking there, so the faithful bridge declines (covered by
+ ``test_bridge_declines_across_user_query_when_template_drops_thinking``).
+ ``"all"`` keeps thinking on every path, isolating the pure extension
+ mechanic from the retention policy across all renderers.
+ """
+ from renderers import create_renderer
+
+ tok, base = _load(br_model, br_renderer_name)
+ cfg = base.config.model_copy(update={"thinking_retention": "all"})
+ return create_renderer(tok, cfg)
+
+
+def _simulate_prior_turn(renderer, assistant=None):
"""Build a (prev_prompt, prev_completion) pair that a real rollout
would produce for a one-turn prior with a clean stop.
@@ -76,13 +94,15 @@ def _simulate_prior_turn(renderer):
gen_prompt, and take the diff as prev_completion. We then trim
prev_completion to the last close token so it matches what vLLM
actually hands back (vLLM stops at the close token and excludes the
- trailing template scaffolding).
+ trailing template scaffolding). Pass ``assistant`` to override the
+ default no-thinking turn (e.g. one carrying ``reasoning_content``).
"""
prior = [
{"role": "system", "content": "You are helpful."},
{"role": "user", "content": "Hi."},
]
- assistant = [{"role": "assistant", "content": "Hello!"}]
+ if assistant is None:
+ assistant = [{"role": "assistant", "content": "Hello!"}]
prev_prompt = renderer.render_ids(prior, add_generation_prompt=True)
full_with_assistant = renderer.render_ids(
@@ -105,11 +125,11 @@ def _simulate_prior_turn(renderer):
return prev_prompt, prev_completion
-def test_bridge_extends_prev_verbatim_on_clean_stop(br_renderer, br_model):
- prev_prompt, prev_completion = _simulate_prior_turn(br_renderer)
+def test_bridge_extends_prev_verbatim_on_clean_stop(br_renderer_all, br_model):
+ prev_prompt, prev_completion = _simulate_prior_turn(br_renderer_all)
new_messages = [{"role": "user", "content": "What's 2+2?"}]
- bridged = br_renderer.bridge_to_next_turn(
+ bridged = br_renderer_all.bridge_to_next_turn(
prev_prompt, prev_completion, new_messages
)
assert bridged is not None, f"{br_model}: bridge returned None on clean stop"
@@ -148,8 +168,8 @@ def test_bridge_rejects_empty_prev_or_new(br_renderer):
assert br_renderer.bridge_to_next_turn(prev_prompt, prev_completion, []) is None
-def test_bridge_synthesises_close_on_truncation(br_renderer, br_model):
- prev_prompt, prev_completion = _simulate_prior_turn(br_renderer)
+def test_bridge_synthesises_close_on_truncation(br_renderer_all, br_model):
+ prev_prompt, prev_completion = _simulate_prior_turn(br_renderer_all)
# Drop the final close token to simulate a max_tokens truncation.
prev_completion_trunc = prev_completion[:-1] if prev_completion else prev_completion
if len(prev_completion_trunc) == 0:
@@ -157,7 +177,7 @@ def test_bridge_synthesises_close_on_truncation(br_renderer, br_model):
f"{br_model}: simulated prior had no completion tokens — can't truncate"
)
- bridged = br_renderer.bridge_to_next_turn(
+ bridged = br_renderer_all.bridge_to_next_turn(
prev_prompt,
prev_completion_trunc,
[{"role": "user", "content": "What's 2+2?"}],
@@ -176,12 +196,12 @@ def test_bridge_synthesises_close_on_truncation(br_renderer, br_model):
def test_bridge_extension_includes_new_message_text(
- br_renderer, br_tokenizer, br_model
+ br_renderer_all, br_tokenizer, br_model
):
- prev_prompt, prev_completion = _simulate_prior_turn(br_renderer)
+ prev_prompt, prev_completion = _simulate_prior_turn(br_renderer_all)
new_messages = [{"role": "user", "content": "HELLO_SENTINEL_XYZ"}]
- bridged = br_renderer.bridge_to_next_turn(
+ bridged = br_renderer_all.bridge_to_next_turn(
prev_prompt, prev_completion, new_messages
)
assert bridged is not None
@@ -190,3 +210,131 @@ def test_bridge_extension_includes_new_message_text(
assert "HELLO_SENTINEL_XYZ" in decoded, (
f"{br_model}: new-message content missing from extension; got {decoded!r}"
)
+
+
+def test_bridge_declines_across_user_query_when_template_drops_thinking():
+ """Qwen3's template drops a past block's thinking once a new user turn
+ arrives. The bridge carries prior tokens verbatim, so spanning that
+ boundary would keep thinking the template strips. The faithful bridge
+ must DECLINE there (the caller then re-renders, byte-faithfully) — but
+ only when it actually matters:
+
+ - new user query + prior thinking + retention!="all" -> decline,
+ and the fallback re-render equals ``apply_chat_template``.
+ - thinking_retention="all" keeps thinking on every path -> extend.
+ - a tool response (in-flight cycle, no new query) keeps thinking in
+ the template too -> extend.
+ - no prior thinking -> nothing to strip -> extend.
+ """
+ from renderers import create_renderer
+ from renderers.base import load_tokenizer
+ from renderers.configs import Qwen3RendererConfig
+
+ tok = load_tokenizer("Qwen/Qwen3-8B")
+ im_end = tok.convert_tokens_to_ids("<|im_end|>")
+
+ u1 = {"role": "user", "content": "What is 2+2?"}
+ u2 = {"role": "user", "content": "Multiply that by 3."}
+ tool = {"role": "tool", "content": "ok"}
+ think = "\n2 plus 2 is 4.\n\n\n4"
+
+ def prior(r, asst_text):
+ p = r.render_ids([u1], add_generation_prompt=True)
+ completion = tok.encode(asst_text, add_special_tokens=False) + [im_end]
+ return p, completion
+
+ # new user query + prior thinking + default retention -> decline
+ r = create_renderer(tok, Qwen3RendererConfig())
+ p, comp = prior(r, think)
+ assert r.bridge_to_next_turn(p, comp, [u2]) is None
+ # ...and the caller's faithful re-render matches the chat template
+ hist = [u1, {"role": "assistant", "content": think}, u2]
+ rendered = tok.decode(r.render_ids(hist, add_generation_prompt=True))
+ assert rendered == tok.apply_chat_template(
+ hist, tokenize=False, add_generation_prompt=True
+ )
+
+ # thinking_retention="all" keeps thinking everywhere -> extend
+ r_all = create_renderer(tok, Qwen3RendererConfig(thinking_retention="all"))
+ p, comp = prior(r_all, think)
+ assert r_all.bridge_to_next_turn(p, comp, [u2]) is not None
+
+ # tool response continues the in-flight cycle (no new query) -> extend
+ p, comp = prior(r, think)
+ assert r.bridge_to_next_turn(p, comp, [tool]) is not None
+
+ # no prior thinking -> nothing to strip -> extend even across a user turn
+ p, comp = prior(r, "4")
+ assert r.bridge_to_next_turn(p, comp, [u2]) is not None
+
+
+# Renderers wired with the bridge faithfulness guard. Detection markers
+# vary: single ```` token (Qwen3, GLM, Nemotron), multi-token
+# ```` (Kimi-K2.5), or harmony analysis-channel header (gpt-oss).
+# minimax has no bridge; non-thinking models (llama, deepseek-v3) are out
+# of scope.
+_GUARDED_THINKING_MODELS = {
+ "Qwen/Qwen3-8B",
+ "Qwen/Qwen3.5-9B",
+ "Qwen/Qwen3.6-35B-A3B",
+ "zai-org/GLM-5",
+ "zai-org/GLM-5.1",
+ "THUDM/GLM-4.5-Air",
+ "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
+ "moonshotai/Kimi-K2.5",
+ "openai/gpt-oss-20b",
+}
+
+
+def test_bridge_declines_across_user_turn_when_thinking_present(br_renderer, br_model):
+ """Across every guarded thinking renderer: a prior turn carrying
+ ``reasoning_content`` makes the bridge decline a new *user* query (the
+ template would strip that thinking) but still extend a *tool* response
+ (in-flight cycle keeps it). Non-guarded models are skipped."""
+ if br_model not in _GUARDED_THINKING_MODELS:
+ pytest.skip(f"{br_model}: no bridge thinking-guard")
+
+ asst = [
+ {"role": "assistant", "reasoning_content": "Let me think.", "content": "Hello!"}
+ ]
+ prev_prompt, prev_completion = _simulate_prior_turn(br_renderer, asst)
+
+ declined = br_renderer.bridge_to_next_turn(
+ prev_prompt, prev_completion, [{"role": "user", "content": "next"}]
+ )
+ assert declined is None, f"{br_model}: expected faithful decline across a user turn"
+
+ extended = br_renderer.bridge_to_next_turn(
+ prev_prompt, prev_completion, [{"role": "tool", "content": "result"}]
+ )
+ assert extended is not None, f"{br_model}: should still bridge within a tool cycle"
+
+
+def test_bridge_keeps_thinking_when_history_kwarg_disables_truncation():
+ """GLM ``clear_thinking=False`` / Nemotron ``truncate_history_thinking=
+ False`` keep all past thinking (the template doesn't strip it), so the
+ bridge must NOT decline across a user turn — declining would re-render and
+ re-tokenize model-sampled thinking bytes."""
+ from renderers import create_renderer
+ from renderers.base import load_tokenizer
+ from renderers.configs import GLM5RendererConfig, Nemotron3RendererConfig
+
+ asst = [
+ {"role": "assistant", "reasoning_content": "Let me think.", "content": "Hi"}
+ ]
+ cases = [
+ ("zai-org/GLM-5", GLM5RendererConfig(clear_thinking=False)),
+ (
+ "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
+ Nemotron3RendererConfig(truncate_history_thinking=False),
+ ),
+ ]
+ for model, cfg in cases:
+ r = create_renderer(load_tokenizer(model), cfg)
+ prev_prompt, prev_completion = _simulate_prior_turn(r, asst)
+ bridged = r.bridge_to_next_turn(
+ prev_prompt, prev_completion, [{"role": "user", "content": "next"}]
+ )
+ assert bridged is not None, (
+ f"{model}: bridge must keep verbatim when the template keeps all thinking"
+ )
diff --git a/tests/test_llama_3.py b/tests/test_llama_3.py
index c6e4c75..8c16232 100644
--- a/tests/test_llama_3.py
+++ b/tests/test_llama_3.py
@@ -80,8 +80,8 @@ def test_default_date_matches_chat_template_strftime_fallback(llama_pair):
def test_preserve_thinking_flags_are_noops(llama_pair):
- """Llama-3 has no reasoning channel, so the ``preserve_*_thinking``
- flags are accepted but never change the token stream — the same
+ """Llama-3 has no reasoning channel, so any ``thinking_retention``
+ level is accepted but never changes the token stream — the same
never-preserves contract as Kimi-K2 / Qwen3-VL. (Cross-renderer
coverage lives in tests/test_preserve_thinking.py.)"""
_, _, tok, _ = llama_pair
@@ -94,10 +94,12 @@ def test_preserve_thinking_flags_are_noops(llama_pair):
},
]
base = Llama3Renderer(tok).render_ids(msgs)
- for flag in ("preserve_all_thinking", "preserve_thinking_between_tool_calls"):
- r = Llama3Renderer(tok, Llama3RendererConfig(**{flag: True}))
- assert r.config.__getattribute__(flag) is True
- assert r.render_ids(msgs) == base, f"{flag} must be a no-op for Llama-3"
+ for level in ("tool_cycle", "all"):
+ r = Llama3Renderer(tok, Llama3RendererConfig(thinking_retention=level))
+ assert r.config.thinking_retention == level
+ assert r.render_ids(msgs) == base, (
+ f"thinking_retention={level!r} must be a no-op for Llama-3"
+ )
# ---------------------------------------------------------------------------
diff --git a/tests/test_preserve_thinking.py b/tests/test_preserve_thinking.py
index daa4836..3d3a372 100644
--- a/tests/test_preserve_thinking.py
+++ b/tests/test_preserve_thinking.py
@@ -1,16 +1,17 @@
-"""Smoke coverage for the ``preserve_*_thinking`` override flags.
+"""Smoke coverage for the ``thinking_retention`` override flag.
-Flags live on the typed renderer config (e.g.
-``Qwen3RendererConfig(preserve_all_thinking=True)``) and are stored on
-the renderer as ``self.config.preserve_*``. Each test that wants a
-non-default flag builds a fresh renderer for that configuration via
+The flag lives on the typed renderer config (e.g.
+``Qwen3RendererConfig(thinking_retention="all")``) and is stored on the
+renderer as ``self.config.thinking_retention``. Each test that wants a
+non-default level builds a fresh renderer for that configuration via
``_make`` below.
Two invariants per renderer:
-1. Default render (no flags) is byte-identical to the existing
- ``apply_chat_template`` parity baseline — covered exhaustively elsewhere.
-2. Setting either flag never *removes* tokens compared to the default and,
+1. Default render (``thinking_retention="template"``) is byte-identical to
+ the existing ``apply_chat_template`` parity baseline — covered
+ exhaustively elsewhere.
+2. Raising the level never *removes* tokens compared to the default and,
for renderers whose template would drop past-asst thinking, actually
adds tokens for a conversation containing past-asst ``reasoning_content``.
@@ -31,7 +32,7 @@
def _make(tokenizer, renderer_name, **flags):
- """Build a fresh renderer with the given preserve_*_thinking flags
+ """Build a fresh renderer with the given thinking_retention level
bound at construction. Reuses the cached tokenizer fixture."""
if renderer_name == "auto":
renderer_name = MODEL_RENDERER_MAP.get(
@@ -51,8 +52,8 @@ def _make(tokenizer, renderer_name, **flags):
"Qwen/Qwen3-VL-8B-Instruct",
"Qwen/Qwen3-VL-30B-A3B-Instruct",
"poolside/Laguna-XS.2",
- # Llama-3 has no reasoning channel at all — preserve flags can't add
- # or drop anything, so they're pure no-ops.
+ # Llama-3 has no reasoning channel at all — thinking_retention can't
+ # add or drop anything, so it's a pure no-op.
"meta-llama/Llama-3.2-1B-Instruct",
"meta-llama/Llama-3.2-3B-Instruct",
"unsloth/Llama-3.2-1B-Instruct",
@@ -80,7 +81,7 @@ def _make(tokenizer, renderer_name, **flags):
def test_should_preserve_past_thinking_classification():
- # CURRENT-block-only behaviour. between_tool_calls preserves thinking
+ # CURRENT-block-only behaviour. "tool_cycle" preserves thinking
# ONLY for asst messages that sit AFTER the last user turn AND are in
# a segment that contains a tool. Anything before the last user turn
# falls back to template default (typically dropped).
@@ -100,44 +101,38 @@ def test_should_preserve_past_thinking_classification():
assert should_preserve_past_thinking(
live_cycle,
1,
- preserve_all_thinking=False,
- preserve_thinking_between_tool_calls=True,
+ thinking_retention="tool_cycle",
)
assert should_preserve_past_thinking(
live_cycle,
3,
- preserve_all_thinking=False,
- preserve_thinking_between_tool_calls=True,
+ thinking_retention="tool_cycle",
)
# Same shape with a NEW user appended → now the prior tool block is
- # "older" and between_tool_calls must drop its thinking (template
- # default). Only preserve_all_thinking would keep them.
+ # "older" and "tool_cycle" must drop its thinking (template default).
+ # Only thinking_retention="all" would keep them.
closed_cycle = live_cycle + [{"role": "user", "content": "next"}]
assert not should_preserve_past_thinking(
closed_cycle,
1,
- preserve_all_thinking=False,
- preserve_thinking_between_tool_calls=True,
+ thinking_retention="tool_cycle",
)
assert not should_preserve_past_thinking(
closed_cycle,
3,
- preserve_all_thinking=False,
- preserve_thinking_between_tool_calls=True,
+ thinking_retention="tool_cycle",
)
- # preserve_all_thinking still keeps them.
+ # thinking_retention="all" still keeps them.
assert should_preserve_past_thinking(
closed_cycle,
1,
- preserve_all_thinking=True,
- preserve_thinking_between_tool_calls=False,
+ thinking_retention="all",
)
assert should_preserve_past_thinking(
closed_cycle,
3,
- preserve_all_thinking=True,
- preserve_thinking_between_tool_calls=False,
+ thinking_retention="all",
)
# Current segment without a tool → not a tool cycle → not preserved.
@@ -148,37 +143,35 @@ def test_should_preserve_past_thinking_classification():
assert not should_preserve_past_thinking(
no_tool_yet,
1,
- preserve_all_thinking=False,
- preserve_thinking_between_tool_calls=True,
+ thinking_retention="tool_cycle",
)
- # Both flags False → always False.
+ # thinking_retention="template" → always False.
assert not should_preserve_past_thinking(
live_cycle,
1,
- preserve_all_thinking=False,
- preserve_thinking_between_tool_calls=False,
+ thinking_retention="template",
)
-def test_preserve_flags_default_unchanged(
+def test_thinking_retention_template_unchanged(
model_name, tokenizer, renderer_name, renderer
):
- # A renderer constructed with both flags explicitly ``False`` must
- # produce byte-identical output to one constructed with the defaults.
+ # A renderer constructed with thinking_retention="template" explicitly
+ # must produce byte-identical output to one constructed with the defaults.
bare = renderer.render_ids(CONVERSATION)
explicit_off = _make(
tokenizer,
renderer_name,
- preserve_all_thinking=False,
- preserve_thinking_between_tool_calls=False,
+ thinking_retention="template",
).render_ids(CONVERSATION)
assert bare == explicit_off, (
- f"{model_name}: explicit flags=False must equal bare default render"
+ f"{model_name}: explicit thinking_retention='template' must equal "
+ f"bare default render"
)
-def test_preserve_all_thinking_grows_or_no_op(
+def test_thinking_retention_all_grows_or_no_op(
model_name, tokenizer, renderer_name, renderer
):
from renderers.default import DefaultRenderer
@@ -186,37 +179,37 @@ def test_preserve_all_thinking_grows_or_no_op(
if isinstance(renderer, DefaultRenderer):
pytest.skip("DefaultRenderer raises on these flags — covered separately")
default = renderer.render_ids(CONVERSATION)
- preserved = _make(tokenizer, renderer_name, preserve_all_thinking=True).render_ids(
+ preserved = _make(tokenizer, renderer_name, thinking_retention="all").render_ids(
CONVERSATION
)
if model_name in NO_OP_MODELS:
assert preserved == default, (
- f"{model_name} is a no-op renderer; preserve_all_thinking must "
+ f"{model_name} is a no-op renderer; thinking_retention='all' must "
f"not change output (got {len(default)} → {len(preserved)})"
)
else:
assert len(preserved) > len(default), (
- f"{model_name}: preserve_all_thinking should add tokens for a "
+ f"{model_name}: thinking_retention='all' should add tokens for a "
f"conversation with past-asst reasoning_content "
f"(default={len(default)}, preserved={len(preserved)})"
)
-def test_preserve_between_tool_calls_strict_subset(
+def test_thinking_retention_tool_cycle_strict_subset(
model_name, tokenizer, renderer_name, renderer
):
- """``preserve_thinking_between_tool_calls`` is strictly weaker than
- ``preserve_all_thinking``: token count satisfies default <= between <= all."""
+ """``thinking_retention="tool_cycle"`` is strictly weaker than
+ ``"all"``: token count satisfies default <= tool_cycle <= all."""
from renderers.default import DefaultRenderer
if isinstance(renderer, DefaultRenderer):
pytest.skip("DefaultRenderer raises on these flags — covered separately")
default = renderer.render_ids(CONVERSATION)
between = _make(
- tokenizer, renderer_name, preserve_thinking_between_tool_calls=True
+ tokenizer, renderer_name, thinking_retention="tool_cycle"
).render_ids(CONVERSATION)
- all_ = _make(tokenizer, renderer_name, preserve_all_thinking=True).render_ids(
+ all_ = _make(tokenizer, renderer_name, thinking_retention="all").render_ids(
CONVERSATION
)
assert len(default) <= len(between) <= len(all_), (
@@ -244,28 +237,28 @@ def test_preserve_between_tool_calls_strict_subset(
]
-def test_preserve_btc_on_live_cycle_matches_all(
+def test_thinking_retention_tool_cycle_matches_all_on_live_cycle(
model_name, tokenizer, renderer_name, renderer
):
"""In a live tool cycle (no trailing user), every past-asst sits in
- the current tool-bearing segment. ``preserve_thinking_between_tool_calls``
+ the current tool-bearing segment. ``thinking_retention="tool_cycle"``
should preserve all of their thinking — same set of asst messages as
- ``preserve_all_thinking``, so the resulting token sequences must be
+ ``"all"``, so the resulting token sequences must be
identical (independent of which template-default condition each
renderer uses internally)."""
from renderers.default import DefaultRenderer
if isinstance(renderer, DefaultRenderer):
pytest.skip("DefaultRenderer raises on these flags — covered separately")
- btc = _make(
- tokenizer, renderer_name, preserve_thinking_between_tool_calls=True
- ).render_ids(LIVE_TOOL_CYCLE)
- all_ = _make(tokenizer, renderer_name, preserve_all_thinking=True).render_ids(
+ btc = _make(tokenizer, renderer_name, thinking_retention="tool_cycle").render_ids(
+ LIVE_TOOL_CYCLE
+ )
+ all_ = _make(tokenizer, renderer_name, thinking_retention="all").render_ids(
LIVE_TOOL_CYCLE
)
assert btc == all_, (
- f"{model_name}: in a live tool cycle btc must match preserve_all "
- f"(got len(btc)={len(btc)}, len(all)={len(all_)})"
+ f"{model_name}: in a live tool cycle tool_cycle must match all "
+ f"(got len(tool_cycle)={len(btc)}, len(all)={len(all_)})"
)
@@ -325,17 +318,17 @@ def test_preserve_btc_on_live_cycle_matches_all(
"Qwen/Qwen3-VL-8B-Instruct",
"Qwen/Qwen3-VL-30B-A3B-Instruct",
# Llama-3 ships no rendering path, so reasoning_content never
- # surfaces in the output regardless of the preserve flags.
+ # surfaces in the output regardless of thinking_retention.
"meta-llama/Llama-3.2-1B-Instruct",
"meta-llama/Llama-3.2-3B-Instruct",
"unsloth/Llama-3.2-1B-Instruct",
}
-def test_preserve_all_thinking_emits_every_asst_reasoning(
+def test_thinking_retention_all_emits_every_asst_reasoning(
model_name, tokenizer, renderer_name, renderer
):
- """``preserve_all_thinking=True`` must surface every past-asst's
+ """``thinking_retention="all"`` must surface every past-asst's
``reasoning_content`` in the decoded output — for renderers that
have any pathway to render reasoning at all."""
from renderers.default import DefaultRenderer
@@ -343,7 +336,7 @@ def test_preserve_all_thinking_emits_every_asst_reasoning(
if isinstance(renderer, DefaultRenderer):
pytest.skip("DefaultRenderer raises on these flags — covered separately")
- ids = _make(tokenizer, renderer_name, preserve_all_thinking=True).render_ids(
+ ids = _make(tokenizer, renderer_name, thinking_retention="all").render_ids(
TWO_BLOCK_CONV, tools=TWO_BLOCK_TOOLS
)
text = tokenizer.decode(ids)
@@ -352,38 +345,38 @@ def test_preserve_all_thinking_emits_every_asst_reasoning(
for sentinel in ALL_SENTINELS:
assert sentinel not in text, (
f"{model_name}: never-preserves renderer leaked {sentinel} "
- f"under preserve_all_thinking"
+ f"under thinking_retention='all'"
)
else:
for sentinel in ALL_SENTINELS:
assert sentinel in text, (
- f"{model_name}: preserve_all_thinking did not emit {sentinel} "
+ f"{model_name}: thinking_retention='all' did not emit {sentinel} "
f"in decoded output"
)
-def test_preserve_btc_emits_current_block_reasoning(
+def test_thinking_retention_tool_cycle_emits_current_block_reasoning(
model_name, tokenizer, renderer_name, renderer
):
- """``preserve_thinking_between_tool_calls=True`` must surface the
- current (post-last-user) tool block's reasoning. Older blocks fall
- back to template default, which varies per renderer — no universal
- assertion there."""
+ """``thinking_retention="tool_cycle"`` must surface the current
+ (post-last-user) tool block's reasoning. Older blocks fall back to
+ template default, which varies per renderer — no universal assertion
+ there."""
from renderers.default import DefaultRenderer
if isinstance(renderer, DefaultRenderer):
pytest.skip("DefaultRenderer raises on these flags — covered separately")
- ids = _make(
- tokenizer, renderer_name, preserve_thinking_between_tool_calls=True
- ).render_ids(TWO_BLOCK_CONV, tools=TWO_BLOCK_TOOLS)
+ ids = _make(tokenizer, renderer_name, thinking_retention="tool_cycle").render_ids(
+ TWO_BLOCK_CONV, tools=TWO_BLOCK_TOOLS
+ )
text = tokenizer.decode(ids)
if model_name in NEVER_PRESERVES_MODELS:
for sentinel in ALL_SENTINELS:
assert sentinel not in text, (
f"{model_name}: never-preserves renderer leaked {sentinel} "
- f"under preserve_thinking_between_tool_calls"
+ f"under thinking_retention='tool_cycle'"
)
else:
for sentinel in CURRENT_BLOCK_SENTINELS:
@@ -393,23 +386,23 @@ def test_preserve_btc_emits_current_block_reasoning(
)
-def test_default_renderer_raises_on_flags():
+def test_default_renderer_raises_above_template():
"""``DefaultRenderer`` falls back to apply_chat_template with no
- selective re-emit pathway, so constructing one with either flag set
- must raise — fail fast, before any render is attempted."""
+ selective re-emit pathway, so constructing one with thinking_retention
+ above ``"template"`` must raise — fail fast, before any render."""
from renderers import DefaultRendererConfig
from renderers.base import load_tokenizer
tok = load_tokenizer("Qwen/Qwen2.5-0.5B-Instruct")
- # No flags → constructs cleanly.
+ # Default "template" → constructs cleanly.
create_renderer(tok, DefaultRendererConfig())
- # Either flag set → raises at construction.
+ # Any level above "template" → raises at construction.
with pytest.raises(NotImplementedError):
- create_renderer(tok, DefaultRendererConfig(preserve_all_thinking=True))
+ create_renderer(tok, DefaultRendererConfig(thinking_retention="all"))
with pytest.raises(NotImplementedError):
create_renderer(
tok,
- DefaultRendererConfig(preserve_thinking_between_tool_calls=True),
+ DefaultRendererConfig(thinking_retention="tool_cycle"),
)
@@ -425,21 +418,16 @@ def test_create_renderer_records_flag_state(model_name, renderer_name, tokenizer
from renderers.default import DefaultRenderer
bare = _make(tokenizer, renderer_name)
- assert bare.config.preserve_all_thinking is False
- assert bare.config.preserve_thinking_between_tool_calls is False
+ assert bare.config.thinking_retention == "template"
if not isinstance(bare, DefaultRenderer):
- # DefaultRenderer raises at construction with either flag set —
+ # DefaultRenderer raises at construction above "template" —
# covered by ``test_default_renderer_raises_on_flags``.
- all_on = _make(tokenizer, renderer_name, preserve_all_thinking=True)
- assert all_on.config.preserve_all_thinking is True
- assert all_on.config.preserve_thinking_between_tool_calls is False
+ all_on = _make(tokenizer, renderer_name, thinking_retention="all")
+ assert all_on.config.thinking_retention == "all"
- btc_on = _make(
- tokenizer, renderer_name, preserve_thinking_between_tool_calls=True
- )
- assert btc_on.config.preserve_all_thinking is False
- assert btc_on.config.preserve_thinking_between_tool_calls is True
+ btc_on = _make(tokenizer, renderer_name, thinking_retention="tool_cycle")
+ assert btc_on.config.thinking_retention == "tool_cycle"
# ---------------------------------------------------------------------------
@@ -467,8 +455,8 @@ def test_glm5_config_accepts_clear_thinking():
def test_qwen36_config_rejects_unknown_field():
"""``preserve_thinking`` on Qwen3.6 was a chat-template-kwarg
pass-through in an earlier revision. It is superseded by the
- renderer-agnostic ``preserve_all_thinking`` override and must not
- appear on the typed config — its default-False semantics are now
+ renderer-agnostic ``thinking_retention`` override and must not
+ appear on the typed config — its default semantics are now
inherited from Qwen3.5's render gate. ``extra="forbid"`` on the
pydantic model enforces this at construction."""
from renderers import Qwen36RendererConfig
diff --git a/tests/test_renderer_config.py b/tests/test_renderer_config.py
index 4bc0a31..0fe951c 100644
--- a/tests/test_renderer_config.py
+++ b/tests/test_renderer_config.py
@@ -85,7 +85,7 @@ def __init__(self, tokenizer, config):
def test_create_renderer_auto_resolves_via_model_map(monkeypatch):
"""``AutoRendererConfig`` (or ``config=None``) routes through
``MODEL_RENDERER_MAP`` to pick the matching renderer + typed config,
- carrying the shared ``preserve_*`` flags over from the auto config."""
+ carrying the shared ``thinking_retention`` field over from the auto config."""
class _FakeQwen35:
def __init__(self, tokenizer, config):
@@ -97,13 +97,13 @@ def __init__(self, tokenizer, config):
renderer = create_renderer(
SimpleNamespace(name_or_path="fake/qwen35"),
- AutoRendererConfig(preserve_all_thinking=True),
+ AutoRendererConfig(thinking_retention="all"),
)
assert isinstance(renderer.config, Qwen35RendererConfig)
- assert renderer.config.preserve_all_thinking is True
+ assert renderer.config.thinking_retention == "all"
# Template-level kwargs stay at their per-renderer defaults — auto
- # carries only the preserve_* flags.
+ # carries only the thinking_retention flag.
assert renderer.config.add_vision_id is False
@@ -114,3 +114,42 @@ def test_create_renderer_default_argument_is_auto():
renderer = create_renderer(tok)
# Falls through to DefaultRenderer when no match and no vision config.
assert renderer.__class__.__name__ == "DefaultRenderer"
+
+
+def test_thinking_retention_conflict_raises():
+ """``clear_thinking`` / ``truncate_history_thinking`` are byte-equivalent
+ to ``thinking_retention`` (``False`` ≡ ``"all"``). Explicitly setting both
+ to disagreeing values raises; a consistent pair, or a single field (the
+ parity matrix sets one at a time), is fine."""
+ from renderers import Nemotron3RendererConfig
+
+ # explicit conflict: keep-all kwarg vs a non-"all" retention
+ with pytest.raises(ValidationError, match="same knob"):
+ GLM5RendererConfig(clear_thinking=False, thinking_retention="template")
+ with pytest.raises(ValidationError, match="same knob"):
+ GLM5RendererConfig(clear_thinking=False, thinking_retention="tool_cycle")
+ with pytest.raises(ValidationError, match="same knob"):
+ Nemotron3RendererConfig(
+ truncate_history_thinking=False, thinking_retention="template"
+ )
+
+ # consistent (both keep all), single-field, and the normal override
+ # (template-default drop + retention="all") are all accepted.
+ GLM5RendererConfig(clear_thinking=False, thinking_retention="all")
+ GLM5RendererConfig(clear_thinking=False)
+ GLM5RendererConfig(thinking_retention="tool_cycle")
+ GLM5RendererConfig(clear_thinking=True, thinking_retention="all")
+
+
+def test_default_renderer_config_rejects_legacy_preserve_flags():
+ """``DefaultRendererConfig`` is ``extra="allow"``, so the removed
+ ``preserve_*`` bools would otherwise slip into ``model_extra`` and be
+ forwarded to ``apply_chat_template`` silently. A validator rejects them
+ with a migration message; genuine Jinja kwargs still pass through."""
+ with pytest.raises(ValidationError, match="thinking_retention"):
+ DefaultRendererConfig(preserve_all_thinking=True)
+ with pytest.raises(ValidationError, match="thinking_retention"):
+ DefaultRendererConfig(preserve_thinking_between_tool_calls=True)
+
+ cfg = DefaultRendererConfig(some_jinja_kwarg=True)
+ assert cfg.model_extra["some_jinja_kwarg"] is True