diff --git a/README.md b/README.md index d153163..25184be 100644 --- a/README.md +++ b/README.md @@ -95,7 +95,6 @@ For RL the trainer must see the exact token ids the sampler saw. The standard al - **Boolean round-trip.** Engine emits `false`; client parses to Python `bool(False)`; `apply_chat_template` re-renders via `str(False)` → `"False"`. Capital F. Reproducible on Qwen3.5-35B-A3B + mini-swe-agent-plus at ~50% break rate per rollout. - **BPE retokenization drift.** The same substring tokenizes differently depending on neighbouring bytes. `json` + `p` + `enderer` (3 tokens) vs `jsonp` + `enderer` (2 tokens) when whitespace shifts by one character. Every subsequent token is shifted from there on. - **Tool-call XML drift.** The engine emits a no-arg call with a stylistic empty ``; the Jinja re-render of the reconstructed dict drops it. Extension property broken at every such call. -- **Thinking stripped from non-latest assistants.** Some templates strip `` blocks from prior assistant turns when re-rendering. The recorded stream has the thinking; the next prompt does not. - **Max-seq-len truncation zeroing the anchor.** Client-side `max_seq_len` enforcement zeros `completion_ids` when `prompt_len > max_seq_len`. The bridge anchor is empty, falling back to full re-render — triggering every mode above. - **Scaffold-level history rewriting.** Some agent scaffolds (e.g. opencode's `experimental_repairToolCall`) rewrite tool calls before sending them back as history. The next turn's prompt contains a tool call the model never emitted. *A renderer cannot fix this — the drift happens before rendering.* @@ -122,9 +121,9 @@ from renderers import ( ) # Auto-resolve renderer from the tokenizer's model name. Carries the -# shared preserve_* flags; template kwargs require an explicit choice. +# shared thinking_retention flag; template kwargs require an explicit choice. renderer = create_renderer(tokenizer) -renderer = create_renderer(tokenizer, AutoRendererConfig(preserve_all_thinking=True)) +renderer = create_renderer(tokenizer, AutoRendererConfig(thinking_retention="all")) # Explicit choice — the typed config exposes exactly the fields that # renderer's chat template honours. @@ -142,12 +141,13 @@ renderer = create_renderer( Discriminated union: every per-renderer config is a variant of `RendererConfig`, dispatched on the `name` field. Bogus combinations (e.g. `add_vision_id` under `name="qwen3"`) error at construction with a `pydantic.ValidationError`. Downstream pydantic configs (prime-rl orchestrator, verifiers `ClientConfig`) hold a single field typed as `RendererConfig` and inherit the same strict-per-variant validation. -Two shared behaviour flags live on every variant via `_BaseRendererConfig`: +One shared behaviour flag lives on every variant via `_BaseRendererConfig`: `thinking_retention`, an ascending scale (`"template"` < `"tool_cycle"` < `"all"`) whose floor is the chat template's own decision. -- `preserve_all_thinking=True` — every past assistant's `reasoning_content` is kept, even when the chat template would drop it. -- `preserve_thinking_between_tool_calls=True` — reasoning is kept on assistants in the in-flight tool cycle (post-last-user A-T-…-A block when it contains a tool response). A new user turn closes the block and drops its thinking. +- `thinking_retention="template"` (default) — defer entirely to the chat template. +- `thinking_retention="tool_cycle"` — additionally keep reasoning on assistants in the in-flight tool cycle (post-last-user A-T-…-A block when it contains a tool response). A new user turn closes the block and drops its thinking. +- `thinking_retention="all"` — additionally keep every past assistant's `reasoning_content`, even when the chat template would drop it. -These OR-compose with template-level toggles (e.g. GLM-5 `clear_thinking`, Nemotron-3 `truncate_history_thinking`): either flag saying "keep" wins. preserve_* can only ever *extend* retention — never override a template kwarg into a "drop" decision. The canonical use case is **compaction**: injecting a `user` turn like *"summarize the work so far"* puts every prior assistant in a past cycle, and `preserve_all_thinking=True` keeps reasoning visible end-to-end. +`thinking_retention` is honoured end-to-end — both `render()` and `bridge_to_next_turn` consult it. So a multi-turn rollout reproduces the chat template's history handling **faithfully by default**: when a new user turn arrives, a past block's reasoning is dropped exactly as `apply_chat_template` would, and the bridge declines that boundary (letting the caller re-render) rather than carrying the stale `` forward. The override can only ever *extend* retention above the template floor — it never forces a drop the template would keep. GLM-5 `clear_thinking` / Nemotron-3 `truncate_history_thinking` are byte-equivalent template kwargs (`False` ≡ `"all"`); setting one of them *and* a contradictory `thinking_retention` raises at config-load rather than silently resolving. The canonical override use case is **compaction**: injecting a `user` turn like *"summarize the work so far"* puts every prior assistant in a past cycle, and `thinking_retention="all"` keeps reasoning visible end-to-end. ## `DefaultRenderer` @@ -156,7 +156,7 @@ Fallback for unsupported models. Wraps `apply_chat_template` and accepts `tool_p ## Roadmap - **VLM support.** `ContentPart` is text-only today; `Qwen3VLRenderer` ships only because Qwen3-VL's text-only chat template differs from Qwen3's. Plan: add `ImagePart` / `VideoPart`, multimodal bridges, validate against a Qwen3-VL RL run. -- **Patched chat templates.** Some shipped templates re-tokenize history, normalize JSON, or auto-strip thinking — each breaks the extension property. Plan: a `use_patched` opt-in per renderer that renders the same surface form while avoiding known-bad patterns. +- **Patched chat templates.** Some shipped templates re-tokenize history or normalize JSON in ways that break token identity. Plan: a `use_patched` opt-in per renderer that renders the same surface form while avoiding known-bad patterns. (Auto-stripping thinking from past turns is *not* one of these — that's intended template behaviour the renderer reproduces; use `thinking_retention` to override it.) ## Testing diff --git a/docs/renderer-config.md b/docs/renderer-config.md index 3d5fa2a..93cced3 100644 --- a/docs/renderer-config.md +++ b/docs/renderer-config.md @@ -50,10 +50,10 @@ Configs are frozen. To override a field, construct a new instance or call ```python r = create_renderer(tokenizer) # AutoRendererConfig() is the default -r = create_renderer(tokenizer, AutoRendererConfig(preserve_all_thinking=True)) +r = create_renderer(tokenizer, AutoRendererConfig(thinking_retention="all")) ``` -`AutoRendererConfig` carries only the shared `preserve_*` flags. Template +`AutoRendererConfig` carries only the shared `thinking_retention` flag. Template kwargs depend on the renderer, so overriding them requires naming the renderer explicitly: @@ -67,33 +67,43 @@ falling back for a VLM would produce token streams the trainer can't reconstruct. Text-only fine-tunes without a registered renderer fall back to `DefaultRenderer` and log the choice at INFO. -## `preserve_*` flags - -Every variant carries two renderer-agnostic flags on `_BaseRendererConfig`: - -- `preserve_all_thinking: bool = False` — re-emit `reasoning_content` on - every past assistant turn, even when the chat template would drop it. -- `preserve_thinking_between_tool_calls: bool = False` — re-emit - `reasoning_content` only inside the in-flight tool cycle (the contiguous - A-T-…-A block after the most recent `user` message, when it contains at - least one `tool` response). A new user turn closes the block and drops - its thinking. - -These OR-compose with template-level toggles. GLM-5's `clear_thinking` and -Nemotron-3's `truncate_history_thinking` already gate past thinking; the -`preserve_*` flags add to that: - -| `clear_thinking` | `preserve_all_thinking` | past thinking? | -|------------------|-------------------------|----------------| -| `True` (default — drop) | `False` (default) | dropped | -| `True` | `True` | kept | -| `False` (keep) | `False` | kept | -| `False` | `True` | kept | - -`preserve_*` can only extend retention, never force a drop. The canonical -use case is **compaction**: injecting a `user` turn like *"summarize the work -so far"* puts every prior assistant in a past cycle, and -`preserve_all_thinking=True` keeps reasoning visible end-to-end. +## `thinking_retention` + +Every variant carries one renderer-agnostic flag on `_BaseRendererConfig`, +an ascending scale whose floor is the chat template's own decision: + +- `thinking_retention: Literal["template", "tool_cycle", "all"] = "template"` + - `"template"` (default) — defer entirely to the chat template. + - `"tool_cycle"` — additionally re-emit `reasoning_content` inside the + in-flight tool cycle (the contiguous A-T-…-A block after the most + recent `user` message, when it contains at least one `tool` response). + A new user turn closes the block and drops its thinking. + - `"all"` — additionally re-emit `reasoning_content` on every past + assistant turn, even when the chat template would drop it. + +The levels are nested: `"all"` ⊇ `"tool_cycle"` ⊇ `"template"`, and the +level is honoured end-to-end — `render()` and `bridge_to_next_turn` both +consult it, so multi-turn rollouts reproduce the template's history handling +faithfully by default. GLM-5's `clear_thinking` and Nemotron-3's +`truncate_history_thinking` are byte-equivalent template kwargs (`False` ≡ +`"all"`) gating the same past thinking; `thinking_retention` composes with +them as: + +| `clear_thinking` | `thinking_retention` | past thinking? | +|------------------|----------------------|----------------| +| `True` (default — drop) | `"template"` (default) | dropped | +| `True` | `"all"` | kept | +| `False` (keep) | `"template"` | kept | +| `False` | `"all"` | kept | + +`thinking_retention` can only extend retention, never force a drop — the +template is the floor. Because the kwarg and `thinking_retention` name the +same thing, explicitly setting a keep-history kwarg to `False` *and* a +non-`"all"` `thinking_retention` is contradictory and raises at config-load +(set `thinking_retention="all"` instead). The canonical use case is **compaction**: injecting +a `user` turn like *"summarize the work so far"* puts every prior assistant +in a past cycle, and `thinking_retention="all"` keeps reasoning visible +end-to-end. ## `DefaultRendererConfig` accepts arbitrary Jinja kwargs @@ -139,7 +149,7 @@ In TOML / YAML, the discriminator routes deserialization: name = "qwen3.5" enable_thinking = false add_vision_id = true -preserve_all_thinking = true +thinking_retention = "all" ``` Pydantic dispatches on `name = "qwen3.5"` to `Qwen35RendererConfig`. Bogus diff --git a/renderers/base.py b/renderers/base.py index ed5bc7e..7d5da7c 100644 --- a/renderers/base.py +++ b/renderers/base.py @@ -20,7 +20,11 @@ ) if TYPE_CHECKING: - from renderers.configs import AutoRendererConfig, RendererConfig + from renderers.configs import ( + AutoRendererConfig, + RendererConfig, + ThinkingRetention, + ) logger = logging.getLogger("renderers.base") @@ -666,14 +670,13 @@ def render( """Render messages to token IDs with per-token message attribution. Behaviour around historical ``reasoning_content`` is owned by the - renderer instance — the ``preserve_all_thinking`` and - ``preserve_thinking_between_tool_calls`` flags are constructor - kwargs, not call-site kwargs. To render with a different - configuration, build a different renderer (or different pool). - Defaults preserve byte-identity with each model's chat template; - flipping a flag at construction restores ``reasoning_content`` - the template would otherwise drop. See - ``should_preserve_past_thinking`` for the per-message + renderer instance — the ``thinking_retention`` level is a + constructor kwarg, not a call-site kwarg. To render with a + different configuration, build a different renderer (or different + pool). The ``"template"`` default preserves byte-identity with + each model's chat template; raising the level at construction + restores ``reasoning_content`` the template would otherwise drop. + See ``should_preserve_past_thinking`` for the per-message classification. """ ... @@ -769,7 +772,13 @@ def bridge_to_next_turn( Return ``None`` whenever the renderer can't prove that contract holds — the caller falls back to a full re-render. In particular, bridges refuse assistant messages in ``new_messages`` (those would - re-tokenize model-sampled content). Hand-coded renderers know their + re-tokenize model-sampled content), and a renderer whose template + drops a past block's thinking once a new user query arrives refuses + to span that boundary while carrying ```` verbatim would keep + it (it falls back to a re-render, which honours ``thinking_retention`` + and drops the stale thinking — keeping the bridge faithful to the + template). ``thinking_retention="all"`` keeps thinking on every path, + so no decline is needed there. Hand-coded renderers know their canonical close and synthesise it on truncated priors; DefaultRenderer always returns ``None`` because the template's close is unknown. @@ -1527,10 +1536,7 @@ def _resolve_auto(tokenizer, auto: AutoRendererConfig) -> Renderer: model_name = getattr(tokenizer, "name_or_path", "") renderer_name = MODEL_RENDERER_MAP.get(model_name) - preserve_carry = { - "preserve_all_thinking": auto.preserve_all_thinking, - "preserve_thinking_between_tool_calls": auto.preserve_thinking_between_tool_calls, - } + preserve_carry = {"thinking_retention": auto.thinking_retention} if renderer_name is not None: cfg_cls = _config_class_for(renderer_name) @@ -1555,11 +1561,12 @@ def _resolve_auto(tokenizer, auto: AutoRendererConfig) -> Renderer: # Text-only fall back to default (apply_chat_template). For fine-tunes # with customized chat templates this is the *correct* choice, so we # don't warn. Note the pick at INFO and advertise the parser knobs. - if auto.preserve_all_thinking or auto.preserve_thinking_between_tool_calls: + if auto.thinking_retention != "template": raise NotImplementedError( "Auto-resolved DefaultRenderer can't selectively re-emit " "dropped reasoning_content. Pass an explicit typed renderer " - "config (model-specific) if you need preserve_*_thinking." + "config (model-specific) if you need thinking_retention != " + "'template'." ) logger.info( "No model-specific renderer matched %r. Using DefaultRenderer " @@ -1912,12 +1919,92 @@ def reject_assistant_in_extension(new_messages: list[Message]) -> bool: return any(m.get("role") == "assistant" for m in new_messages) +def introduces_user_query(new_messages: list[Message]) -> bool: + """Return True if ``new_messages`` opens a new user-query turn. + + A query boundary is a ``user`` message whose content isn't a + ``...`` wrapper — tool responses + (``role="tool"``, or folded into a wrapped user turn) continue the + in-flight cycle rather than starting a new query. This mirrors the + boundary chat templates use to decide when a past assistant block's + thinking becomes "older" and is dropped, so a bridge can tell whether + spanning ``new_messages`` would cross that boundary (cf. + ``should_preserve_past_thinking`` and per-renderer ``_last_query_index``). + """ + for m in new_messages: + if m.get("role") != "user": + continue + content = m.get("content") + if ( + isinstance(content, str) + and content.startswith("") + and content.endswith("") + ): + continue + return True + return False + + +def _contains_subsequence(haystack: list[int], needle: list[int]) -> bool: + """True if ``needle`` appears as a contiguous run inside ``haystack``.""" + n = len(needle) + if n == 0: + return False + if n == 1: + return needle[0] in haystack + first = needle[0] + for i in range(len(haystack) - n + 1): + if haystack[i] == first and haystack[i : i + n] == needle: + return True + return False + + +def reject_thinking_strip_in_extension( + previous_prompt_ids: list[int], + previous_completion_ids: list[int], + new_messages: list[Message], + *, + thinking_retention: ThinkingRetention, + thinking_marker_ids: list[int], + enable_thinking: bool = True, +) -> bool: + """Return True if a bridge must refuse to span ``new_messages``. + + A renderer whose template drops a past block's thinking once a new user + query arrives can't span that boundary while keeping prior tokens + verbatim — the kept thinking would diverge from a faithful re-render + (which honours ``thinking_retention`` and drops the stale thinking). + Thinking renderers OR this into their bridge's reject check. + + ``thinking_marker_ids`` is a token subsequence whose presence in the + prior tokens signals a sampled thinking block — a single-token close + like ``[]`` (Qwen3, GLM, Nemotron), a multi-token close + (``Kimi-K2.5``), or a harmony analysis-channel header + (``[<|channel|>, …"analysis"…, <|message|>]`` for gpt-oss). + ``thinking_retention="all"`` keeps thinking on every path, so it never + declines. + + Returns ``False`` (safe to bridge) when: ``thinking_retention="all"``; + thinking is disabled (no sampled thinking to strip — also avoids the + empty ```` generation-prompt scaffold a disabled model + leaves in ``previous_prompt_ids``); ``new_messages`` stays in the + in-flight cycle (no new user query); or the prior tokens carry no + thinking marker to strip. + """ + if thinking_retention == "all" or not enable_thinking: + return False + if not introduces_user_query(new_messages): + return False + return _contains_subsequence( + previous_completion_ids, thinking_marker_ids + ) or _contains_subsequence(previous_prompt_ids, thinking_marker_ids) + + def should_preserve_past_thinking( messages: list[Message], msg_idx: int, *, - preserve_all_thinking: bool, - preserve_thinking_between_tool_calls: bool, + thinking_retention: ThinkingRetention, ) -> bool: """Should ``messages[msg_idx]``'s ``reasoning_content`` be emitted as thinking even when the chat template would drop it? @@ -1925,25 +2012,25 @@ def should_preserve_past_thinking( Returns ``True`` only as an override above the template default. Each renderer ORs this into its own "render thinking?" condition; a result of ``False`` means "follow the template" (drop or keep as the template - decides), not "force-drop". - - Override rules: - - - ``preserve_all_thinking`` — every past-asst's thinking is kept. - - ``preserve_thinking_between_tool_calls`` — keeps thinking only - inside the *current* tool cycle: the contiguous A-T-...-A block - after the most recent ``user`` message, and only if that block - contains at least one ``tool`` response. As soon as a new - ``user`` turn arrives, the previous block becomes "older" and - its thinking is dropped (template default), matching how most - chat templates already handle multi-turn contexts. Use - ``preserve_all_thinking`` if you need thinking on older blocks - to survive the user-turn boundary too. + decides), not "force-drop". ``thinking_retention`` selects how far the + override reaches: + + - ``"template"`` — no override; defer to the template (always ``False``). + - ``"all"`` — every past-asst's thinking is kept (always ``True``). + - ``"tool_cycle"`` — keeps thinking only inside the *current* tool + cycle: the contiguous A-T-...-A block after the most recent ``user`` + message, and only if that block contains at least one ``tool`` + response. As soon as a new ``user`` turn arrives, the previous block + becomes "older" and its thinking is dropped (template default), + matching how most chat templates handle multi-turn contexts. Use + ``"all"`` if you need thinking on older blocks to survive the + user-turn boundary too. """ - if preserve_all_thinking: + if thinking_retention == "all": return True - if not preserve_thinking_between_tool_calls: + if thinking_retention == "template": return False + # thinking_retention == "tool_cycle" # Most recent user message (or -1 if none). last_user = -1 for j in range(len(messages) - 1, -1, -1): diff --git a/renderers/configs.py b/renderers/configs.py index d500f8e..b87cbe1 100644 --- a/renderers/configs.py +++ b/renderers/configs.py @@ -4,14 +4,15 @@ Each renderer accepts its own typed config; bad combinations (e.g. ``add_vision_id`` under ``name="qwen3"``) fail at config-load time with a pydantic ``ValidationError`` rather than at runtime via an allowlist -check. The shared ``preserve_*`` flags live on ``BaseRendererConfig`` -and OR-compose with template-level toggles (e.g. GLM-5 -``clear_thinking``) inside each renderer — they extend retention, never -override the template into a drop. +check. The shared ``thinking_retention`` flag lives on +``BaseRendererConfig`` and OR-composes with template-level toggles (e.g. +GLM-5 ``clear_thinking``) inside each renderer — it extends retention, +never overrides the template into a drop. ``AutoRendererConfig`` is a placeholder variant: ``create_renderer`` resolves it via ``MODEL_RENDERER_MAP`` and constructs the matching -typed config with the auto config's ``preserve_*`` fields carried over. +typed config with the auto config's ``thinking_retention`` field carried +over. ``DefaultRendererConfig`` uses ``extra="allow"`` to accept arbitrary Jinja kwargs as ``model_extra`` — ``DefaultRenderer`` doesn't know which @@ -22,10 +23,45 @@ from typing import Annotated, ClassVar, Literal, Union -from pydantic import ConfigDict, Field +from pydantic import ConfigDict, Field, model_validator from pydantic_config import BaseConfig +def _reject_thinking_retention_conflict(config: BaseConfig, kwarg_name: str) -> None: + """Raise if an explicit template thinking-kwarg contradicts an explicit + ``thinking_retention``. + + ``clear_thinking`` (GLM) / ``truncate_history_thinking`` (Nemotron) are + byte-equivalent to ``thinking_retention``: setting one ``False`` keeps + all past thinking, i.e. ``thinking_retention="all"``. They're the same + knob, so a user who explicitly sets both to disagreeing values almost + certainly has a bug — surface it rather than silently resolving. Only + fires when BOTH are explicit; a defaulted value is not an intent (and + this keeps the per-kwarg parity matrix, which sets one field at a time, + working). + """ + fields_set = config.__pydantic_fields_set__ + if ( + kwarg_name in fields_set + and "thinking_retention" in fields_set + and getattr(config, kwarg_name) is False + and config.thinking_retention != "all" + ): + raise ValueError( + f"{kwarg_name}=False keeps all past thinking (the same as " + f"thinking_retention='all'), which conflicts with the explicit " + f"thinking_retention={config.thinking_retention!r}. They are the " + f"same knob — set thinking_retention='all' (or drop {kwarg_name})." + ) + + +ThinkingRetention = Literal["template", "tool_cycle", "all"] +"""How far past-assistant ``reasoning_content`` is retained, as an override +*above* the chat-template floor. ``"template"`` defers to the template; +``"tool_cycle"`` and ``"all"`` progressively extend retention and never force a +drop. See :attr:`BaseRendererConfig.thinking_retention`.""" + + class BaseRendererConfig(BaseConfig): """Shared fields and config for every renderer config variant. @@ -35,28 +71,32 @@ class BaseRendererConfig(BaseConfig): this class adds ``frozen=True`` so configs are hashable value objects. - ``preserve_all_thinking`` and ``preserve_thinking_between_tool_calls`` - are renderer-internal behaviour flags — they don't map to any Jinja - chat-template kwarg. They OR-compose with template-level toggles on - renderers that expose one (GLM-5 ``clear_thinking``, Nemotron-3 - ``truncate_history_thinking``): either flag saying "keep this - thinking" wins. preserve_* can only ever extend retention; setting - ``preserve_all_thinking=True`` always keeps past thinking, regardless - of the template kwarg. See ``renderers.base.should_preserve_past_thinking``. + ``thinking_retention`` is a renderer-internal behaviour flag — it + doesn't map to any Jinja chat-template kwarg. It OR-composes with + template-level toggles on renderers that expose one (GLM-5 + ``clear_thinking``, Nemotron-3 ``truncate_history_thinking``): + whichever side says "keep this thinking" wins. The chat template is + the floor — ``thinking_retention`` can only ever *extend* retention, + never override the template into a drop. See + ``renderers.base.should_preserve_past_thinking``. """ model_config = ConfigDict(frozen=True) - preserve_all_thinking: bool = False - """Restore ``reasoning_content`` on every past assistant turn, even - when the chat template would drop it. Strict superset of - ``preserve_thinking_between_tool_calls``.""" + thinking_retention: ThinkingRetention = "template" + """How much past-assistant ``reasoning_content`` to re-emit, layered as + an override *above* the chat template's own decision. The template is + the floor — each level only ever *extends* retention, never forces a + drop: + + - ``"template"`` (default) — defer entirely to the chat template. + - ``"tool_cycle"`` — additionally keep thinking inside the in-flight + tool cycle: the contiguous A-T-...-A block after the most recent + ``user`` turn, when it contains at least one ``tool`` response. A new + user turn closes the block. + - ``"all"`` — additionally keep thinking on every past assistant turn. - preserve_thinking_between_tool_calls: bool = False - """Restore ``reasoning_content`` only inside the in-flight tool cycle: - the contiguous A-T-...-A block after the most recent ``user`` turn, - and only if it contains at least one ``tool`` response. A new user - turn closes the block and drops its thinking (template default).""" + Ascending and nested: ``"all"`` ⊇ ``"tool_cycle"`` ⊇ ``"template"``.""" # Fields that are renderer-internal — not forwarded to (or mirrored # by) ``apply_chat_template``. Override in subclasses that hold @@ -87,8 +127,9 @@ def template_field_names(cls) -> frozenset[str]: class AutoRendererConfig(BaseRendererConfig): """Resolve the renderer from ``tokenizer.name_or_path`` at construction - time via ``MODEL_RENDERER_MAP``. Carries only the shared ``preserve_*`` - fields; template kwargs require an explicit renderer choice so that + time via ``MODEL_RENDERER_MAP``. Carries only the shared + ``thinking_retention`` field; template kwargs require an explicit + renderer choice so that template-dependent behaviour stays visible at the call site.""" name: Literal["auto"] = "auto" @@ -119,6 +160,26 @@ class DefaultRendererConfig(BaseRendererConfig): # template. Jinja kwargs live in ``model_extra`` (extra="allow"). _internal_fields = frozenset({"tool_parser", "reasoning_parser"}) + @model_validator(mode="after") + def _reject_legacy_preserve_flags(self): + # ``extra="allow"`` would otherwise swallow the removed ``preserve_*`` + # bools into ``model_extra`` and forward them to apply_chat_template, + # silently dropping the user's intent (DefaultRenderer can't + # selectively re-emit reasoning_content). Reject them like every other + # config's ``extra="forbid"`` does, pointing at the replacement. + legacy = { + "preserve_all_thinking", + "preserve_thinking_between_tool_calls", + } & set(self.model_extra or {}) + if legacy: + raise ValueError( + f"{sorted(legacy)} were replaced by thinking_retention. " + "DefaultRenderer falls back to apply_chat_template and can't " + "selectively re-emit reasoning_content — use thinking_retention " + "on a model-specific renderer." + ) + return self + class Qwen3RendererConfig(BaseRendererConfig): """Qwen3 (text-only) renderer config.""" @@ -199,8 +260,13 @@ class GLM5RendererConfig(BaseRendererConfig): """When ``False``, the renderer keeps ``{reasoning}`` on past-cycle assistant turns instead of dropping them. Mirrors the chat template's ``clear_thinking`` toggle. OR-composes with - ``preserve_all_thinking`` / ``preserve_thinking_between_tool_calls`` - — see :class:`BaseRendererConfig` for the contract.""" + ``thinking_retention`` — see :class:`BaseRendererConfig` for the + contract.""" + + @model_validator(mode="after") + def _check_thinking_retention(self): + _reject_thinking_retention_conflict(self, "clear_thinking") + return self class GLM51RendererConfig(BaseRendererConfig): @@ -215,6 +281,11 @@ class GLM51RendererConfig(BaseRendererConfig): clear_thinking: bool = True """See :class:`GLM5RendererConfig.clear_thinking`.""" + @model_validator(mode="after") + def _check_thinking_retention(self): + _reject_thinking_retention_conflict(self, "clear_thinking") + return self + class GLM45RendererConfig(BaseRendererConfig): """GLM-4.5 Air renderer config.""" @@ -321,10 +392,11 @@ class LagunaXS2RendererConfig(BaseRendererConfig): class Llama3RendererConfig(BaseRendererConfig): """Llama-3.x Instruct renderer config. - Llama-3 ships no reasoning channel, so the base ``preserve_*_thinking`` - flags don't apply: ``Llama3Renderer`` raises ``NotImplementedError`` - if either is set (matching ``DefaultRenderer``'s contract for the - same case). Both fields below mirror real ``apply_chat_template`` + Llama-3 ships no reasoning channel, so the base ``thinking_retention`` + flag is a no-op: there's never any past-assistant thinking to retain + or drop, so any level leaves the token stream unchanged (same contract + as Kimi-K2 / Qwen3-VL). Both fields below mirror real + ``apply_chat_template`` kwargs. """ @@ -373,8 +445,13 @@ class Nemotron3RendererConfig(BaseRendererConfig): """When ``False``, keep ``{reasoning}`` on past-cycle assistant turns instead of dropping them. Mirrors the chat template's ``truncate_history_thinking`` toggle. OR-composes with - ``preserve_all_thinking`` / ``preserve_thinking_between_tool_calls`` - — see :class:`BaseRendererConfig` for the contract.""" + ``thinking_retention`` — see :class:`BaseRendererConfig` for the + contract.""" + + @model_validator(mode="after") + def _check_thinking_retention(self): + _reject_thinking_retention_conflict(self, "truncate_history_thinking") + return self low_effort: bool = False """When ``True``, append ``\\n\\n{reasoning effort: low}`` to the last user @@ -404,6 +481,11 @@ class Nemotron3UltraRendererConfig(BaseRendererConfig): truncate_history_thinking: bool = True """See :class:`Nemotron3RendererConfig.truncate_history_thinking`.""" + @model_validator(mode="after") + def _check_thinking_retention(self): + _reject_thinking_retention_conflict(self, "truncate_history_thinking") + return self + medium_effort: bool = False """When ``True``, append ``\\n\\n{reasoning effort: efficient}`` to the last user message. Mirrors the Ultra chat template's ``medium_effort`` kwarg.""" @@ -426,8 +508,8 @@ class DeepSeekR1RendererConfig(BaseRendererConfig): R1 always reasons — its chat template unconditionally prefills ``\\n`` at the generation prompt and strips ```` from historical assistant turns. There is therefore no ``enable_thinking`` - knob (thinking is not optional), and ``preserve_*`` flags are no-ops - (history reasoning is always dropped); both stored for protocol + knob (thinking is not optional), and ``thinking_retention`` is a no-op + (history reasoning is always dropped); stored for protocol uniformity. Applies to full ``deepseek-ai/DeepSeek-R1`` / ``-R1-0528`` — NOT the R1-Distill-Qwen/Llama models, which use those base tokenizers and route to the Qwen3 / Llama-3 renderers. @@ -473,7 +555,7 @@ class DeepSeekR1RendererConfig(BaseRendererConfig): # Map discriminator → config class. Used by ``create_renderer`` when # resolving ``AutoRendererConfig`` against ``MODEL_RENDERER_MAP``: the # resolved renderer name picks the corresponding typed config, and the -# auto config's ``preserve_*`` fields are carried over. +# auto config's ``thinking_retention`` field is carried over. _CONFIG_BY_NAME: dict[str, type[BaseRendererConfig]] = { "auto": AutoRendererConfig, "default": DefaultRendererConfig, @@ -543,5 +625,6 @@ def config_from_name(name: str) -> BaseRendererConfig | None: "Qwen3RendererConfig", "Qwen3VLRendererConfig", "RendererConfig", + "ThinkingRetention", "config_from_name", ] diff --git a/renderers/deepseek_v3.py b/renderers/deepseek_v3.py index 5f4840a..3cb82d3 100644 --- a/renderers/deepseek_v3.py +++ b/renderers/deepseek_v3.py @@ -48,7 +48,7 @@ class DeepSeekV3Renderer: assistant content is emitted verbatim. The reasoning variant (````-prefilled prompt, history reasoning stripped) lives in :class:`renderers.deepseek_r1.DeepSeekR1Renderer`, which subclasses - this one. ``preserve_*`` flags are no-ops here (no reasoning channel), + this one. ``thinking_retention`` is a no-op here (no reasoning channel), stored for protocol uniformity. """ diff --git a/renderers/default.py b/renderers/default.py index a662097..2e9c7d2 100644 --- a/renderers/default.py +++ b/renderers/default.py @@ -95,11 +95,12 @@ def __init__( config: DefaultRendererConfig | None = None, ): cfg = config or DefaultRendererConfig() - if cfg.preserve_all_thinking or cfg.preserve_thinking_between_tool_calls: + if cfg.thinking_retention != "template": raise NotImplementedError( "DefaultRenderer falls back to apply_chat_template and can't " "selectively re-emit dropped reasoning_content. Configure a " - "model-specific renderer if you need preserve_*_thinking." + "model-specific renderer if you need thinking_retention != " + "'template'." ) self._tokenizer = tokenizer self.config = cfg diff --git a/renderers/glm45.py b/renderers/glm45.py index 7af9259..7ca4d1c 100644 --- a/renderers/glm45.py +++ b/renderers/glm45.py @@ -23,6 +23,7 @@ attribute_text_segments, extract_message_tool_names, reject_assistant_in_extension, + reject_thinking_strip_in_extension, should_preserve_past_thinking, ) from renderers.configs import GLM45RendererConfig @@ -228,8 +229,7 @@ def emit_text_segments( preserve_thinking = should_preserve_past_thinking( messages, i, - preserve_all_thinking=self.config.preserve_all_thinking, - preserve_thinking_between_tool_calls=self.config.preserve_thinking_between_tool_calls, + thinking_retention=self.config.thinking_retention, ) self._render_assistant( msg, @@ -293,7 +293,7 @@ def parse_response( token_ids, stop_ids={self._endoftext, self._user, self._observation}, think_id=self._think, - think_end_id=self._think_end, + thinking_marker_ids=[self._think_end], tool_call_id=self._tool_call_tok, tool_call_end_id=self._tool_call_end_tok, arg_key_id=self._arg_key, @@ -321,6 +321,19 @@ def bridge_to_next_turn( ): return None + # Faithfulness across a user-query boundary: the template drops a past + # block's thinking once a new user turn arrives (see + # reject_thinking_strip_in_extension). + if reject_thinking_strip_in_extension( + previous_prompt_ids, + previous_completion_ids, + new_messages, + thinking_retention=self.config.thinking_retention, + thinking_marker_ids=[self._think_end], + enable_thinking=self.config.enable_thinking, + ): + return None + # Same next-turn-marker scheme as GLM-5, but role markers are # followed by a literal ``\n`` in the prompt text. previous_ids = list(previous_prompt_ids) + list(previous_completion_ids) diff --git a/renderers/glm5.py b/renderers/glm5.py index 924d754..ad215e5 100644 --- a/renderers/glm5.py +++ b/renderers/glm5.py @@ -24,6 +24,7 @@ attribute_text_segments, extract_message_tool_names, reject_assistant_in_extension, + reject_thinking_strip_in_extension, should_preserve_past_thinking, ) from renderers.configs import GLM5RendererConfig, GLM51RendererConfig @@ -241,8 +242,7 @@ def emit_text_segments( preserve_thinking = should_preserve_past_thinking( messages, i, - preserve_all_thinking=self.config.preserve_all_thinking, - preserve_thinking_between_tool_calls=self.config.preserve_thinking_between_tool_calls, + thinking_retention=self.config.thinking_retention, ) self._render_assistant( msg, @@ -309,7 +309,7 @@ def parse_response( token_ids, stop_ids={self._endoftext, self._user, self._observation}, think_id=self._think, - think_end_id=self._think_end, + thinking_marker_ids=[self._think_end], tool_call_id=self._tool_call_tok, tool_call_end_id=self._tool_call_end_tok, arg_key_id=self._arg_key, @@ -337,6 +337,25 @@ def bridge_to_next_turn( ): return None + # Faithfulness across a user-query boundary: the template drops a past + # block's thinking once a new user turn arrives (see + # reject_thinking_strip_in_extension). ``clear_thinking=False`` keeps + # all past thinking (≡ thinking_retention="all"), so the bridge stays + # faithful carrying it verbatim — fold it into the effective level so + # we don't over-decline and re-tokenize sampled thinking. + retention = ( + "all" if not self.config.clear_thinking else self.config.thinking_retention + ) + if reject_thinking_strip_in_extension( + previous_prompt_ids, + previous_completion_ids, + new_messages, + thinking_retention=retention, + thinking_marker_ids=[self._think_end], + enable_thinking=self.config.enable_thinking, + ): + return None + # GLM has no per-turn close token. An assistant turn ends when the # next turn's role marker appears, OR the model emits <|endoftext|>. # vLLM includes these in ``stop_token_ids`` so a clean stop leaves @@ -501,7 +520,7 @@ def _render_assistant( # ``preserve_thinking`` is the override output of # ``should_preserve_past_thinking`` — it adds historical assistants # back when the renderer was constructed with - # ``preserve_all_thinking=True``. ``clear_thinking=False`` mirrors + # ``thinking_retention="all"``. ``clear_thinking=False`` mirrors # the template's per-call ``clear_thinking is defined and not # clear_thinking`` gate: a chat_template_kwarg surface for the # same behaviour, gated explicitly by the caller per render. diff --git a/renderers/gpt_oss.py b/renderers/gpt_oss.py index 2a9c5ca..9dbc6b7 100644 --- a/renderers/gpt_oss.py +++ b/renderers/gpt_oss.py @@ -58,6 +58,7 @@ ToolSpec, extract_message_tool_names, reject_assistant_in_extension, + reject_thinking_strip_in_extension, should_preserve_past_thinking, trim_to_turn_close, ) @@ -154,6 +155,14 @@ def __init__( self._channel = self._token_id("<|channel|>") self._message = self._token_id("<|message|>") self._constrain = self._token_id("<|constrain|>") + # Harmony marks reasoning as an analysis channel: + # ``<|channel|>analysis<|message|>``. Its presence in prior tokens + # signals a sampled thinking block the template strips from history. + self._analysis_marker_ids = [ + self._channel, + *self._encode("analysis"), + self._message, + ] # ── token utilities ────────────────────────────────────────────────────── @@ -425,8 +434,7 @@ def emit_harmony_message( should_preserve_past_thinking( messages, i, - preserve_all_thinking=self.config.preserve_all_thinking, - preserve_thinking_between_tool_calls=self.config.preserve_thinking_between_tool_calls, + thinking_retention=self.config.thinking_retention, ) ) for hm in self._to_harmony_messages( @@ -524,6 +532,20 @@ def bridge_to_next_turn( ): return None + # Faithfulness across a user-query boundary: harmony strips the + # analysis (reasoning) channel from history once a new user turn + # arrives, so the bridge can't carry it verbatim there (see + # reject_thinking_strip_in_extension). gpt-oss always reasons. + if reject_thinking_strip_in_extension( + previous_prompt_ids, + previous_completion_ids, + new_messages, + thinking_retention=self.config.thinking_retention, + thinking_marker_ids=self._analysis_marker_ids, + enable_thinking=True, + ): + return None + previous_ids = trim_to_turn_close( previous_prompt_ids, previous_completion_ids, diff --git a/renderers/kimi_k2.py b/renderers/kimi_k2.py index e99dfa7..8cdc548 100644 --- a/renderers/kimi_k2.py +++ b/renderers/kimi_k2.py @@ -38,9 +38,9 @@ class KimiK2Renderer: Kimi K2's chat template doesn't read any thinking-related variable — ``content`` renders verbatim with no reasoning branch. The - ``enable_thinking`` / ``preserve_*`` fields on the config are stored - for protocol uniformity with the rest of the renderer family but - have no effect on the byte-level output. + ``enable_thinking`` / ``thinking_retention`` fields on the config are + stored for protocol uniformity with the rest of the renderer family + but have no effect on the byte-level output. """ def __init__( diff --git a/renderers/kimi_k25.py b/renderers/kimi_k25.py index bca4464..e90a9ba 100644 --- a/renderers/kimi_k25.py +++ b/renderers/kimi_k25.py @@ -38,6 +38,7 @@ ToolSpec, extract_message_tool_names, reject_assistant_in_extension, + reject_thinking_strip_in_extension, should_preserve_past_thinking, trim_to_turn_close, ) @@ -888,8 +889,7 @@ def emit_image( preserve_thinking = should_preserve_past_thinking( messages, i, - preserve_all_thinking=self.config.preserve_all_thinking, - preserve_thinking_between_tool_calls=self.config.preserve_thinking_between_tool_calls, + thinking_retention=self.config.thinking_retention, ) self._render_assistant_body( msg, @@ -1036,6 +1036,20 @@ def bridge_to_next_turn( ): return None + # Faithfulness across a user-query boundary: the template drops a past + # block's thinking once a new user turn arrives. ```` is + # multi-token here, so pass the full close subsequence (see + # reject_thinking_strip_in_extension). + if reject_thinking_strip_in_extension( + previous_prompt_ids, + previous_completion_ids, + new_messages, + thinking_retention=self.config.thinking_retention, + thinking_marker_ids=self._think_close_ids, + enable_thinking=self.config.thinking, + ): + return None + close_ids: set[int] = {self._im_end} if self._endoftext is not None: close_ids.add(self._endoftext) diff --git a/renderers/laguna_xs2.py b/renderers/laguna_xs2.py index bd6b64f..71953b9 100644 --- a/renderers/laguna_xs2.py +++ b/renderers/laguna_xs2.py @@ -17,10 +17,8 @@ a ``### Tools`` header with an ```` listing and prose format instructions that vary on ``enable_thinking``). - Reasoning is rendered for every assistant message — no last-user-index - gating. ``preserve_all_thinking`` and - ``preserve_thinking_between_tool_calls`` are accepted for protocol - uniformity but are effectively no-ops since past reasoning is preserved - by default. + gating. ``thinking_retention`` is accepted for protocol uniformity but + is effectively a no-op since past reasoning is preserved by default. """ from __future__ import annotations diff --git a/renderers/llama_3.py b/renderers/llama_3.py index d15792a..5dd2798 100644 --- a/renderers/llama_3.py +++ b/renderers/llama_3.py @@ -9,7 +9,7 @@ Notable differences from the Qwen / GLM family renderers: * No ```` / reasoning channel — Llama-3 doesn't ship a - reasoning-content concept, so ``preserve_*_thinking`` flags don't + reasoning-content concept, so the ``thinking_retention`` flag doesn't apply. * ``<|begin_of_text|>`` (BOS) is emitted at the very start of every render. The chat template never omits it. @@ -94,10 +94,10 @@ def __init__( tokenizer: PreTrainedTokenizer, config: Llama3RendererConfig | None = None, ): - # ``preserve_*_thinking`` are accepted but no-ops: Llama-3 ships no + # ``thinking_retention`` is accepted but a no-op: Llama-3 ships no # reasoning_content channel, so there's never any past-assistant - # thinking to retain or drop. The flags are stored on ``self.config`` - # for cross-renderer uniformity but never change the token stream — + # thinking to retain or drop. The level is stored on ``self.config`` + # for cross-renderer uniformity but never changes the token stream — # the same contract as Kimi-K2 / Qwen3-VL (see the never-preserves # renderers in tests/test_preserve_thinking.py). self._tokenizer = tokenizer diff --git a/renderers/minimax_m2.py b/renderers/minimax_m2.py index f990274..9c92353 100644 --- a/renderers/minimax_m2.py +++ b/renderers/minimax_m2.py @@ -240,8 +240,7 @@ def emit_token_overlap_body( preserve_thinking = should_preserve_past_thinking( messages, orig_idx, - preserve_all_thinking=self.config.preserve_all_thinking, - preserve_thinking_between_tool_calls=self.config.preserve_thinking_between_tool_calls, + thinking_retention=self.config.thinking_retention, ) self._render_assistant( msg, diff --git a/renderers/nemotron3.py b/renderers/nemotron3.py index c29129c..7f98c87 100644 --- a/renderers/nemotron3.py +++ b/renderers/nemotron3.py @@ -27,6 +27,7 @@ attribute_text_segments, extract_message_tool_names, reject_assistant_in_extension, + reject_thinking_strip_in_extension, should_preserve_past_thinking, trim_to_turn_close, ) @@ -424,14 +425,13 @@ def emit_text_segments( elif role == "assistant": # Template: ``include_content = not (truncate_history_thinking # and loop.index0 < last_user_idx)``. The renderer-internal - # preserve_* overrides only ever *extend* retention, so OR them - # in (a preserved turn keeps its thinking even when the - # template default would drop it). + # ``thinking_retention`` override only ever *extends* + # retention, so OR it in (a preserved turn keeps its thinking + # even when the template default would drop it). preserve_thinking = msg_orig_idx >= 0 and should_preserve_past_thinking( original_messages, msg_orig_idx, - preserve_all_thinking=self.config.preserve_all_thinking, - preserve_thinking_between_tool_calls=self.config.preserve_thinking_between_tool_calls, + thinking_retention=self.config.thinking_retention, ) include_content = ( not self.config.truncate_history_thinking @@ -510,7 +510,7 @@ def parse_response( token_ids, stop_ids=stop_ids, think_id=self._think, - think_end_id=self._think_end, + thinking_marker_ids=[self._think_end], tool_call_id=self._tool_call, tool_call_end_id=self._tool_call_end, tools=tools, @@ -542,6 +542,28 @@ def bridge_to_next_turn( ): return None + # Faithfulness across a user-query boundary: the template drops a past + # block's thinking once a new user turn arrives (see + # reject_thinking_strip_in_extension). ``truncate_history_thinking= + # False`` keeps all past thinking (≡ thinking_retention="all"), so the + # bridge stays faithful carrying it verbatim — fold it into the + # effective level so we don't over-decline and re-tokenize sampled + # thinking. + retention = ( + "all" + if not self.config.truncate_history_thinking + else self.config.thinking_retention + ) + if reject_thinking_strip_in_extension( + previous_prompt_ids, + previous_completion_ids, + new_messages, + thinking_retention=retention, + thinking_marker_ids=[self._think_end], + enable_thinking=self.config.enable_thinking, + ): + return None + close_ids: set[int] = {self._im_end} if self._endoftext is not None: close_ids.add(self._endoftext) @@ -704,8 +726,9 @@ def _assistant_body( """Assemble the assistant body string exactly as the chat template. ``include_content`` is the template's ``not (truncate_history_thinking - and loop.index0 < last_user_idx)`` (already OR-ed with the preserve_* - overrides by the caller): ``True`` keeps the full think+content block, + and loop.index0 < last_user_idx)`` (already OR-ed with the + ``thinking_retention`` override by the caller): ``True`` keeps the + full think+content block, ``False`` collapses historical thinking to an empty ````. """ ultra = self._ultra diff --git a/renderers/qwen3.py b/renderers/qwen3.py index 9253b3d..19141a5 100644 --- a/renderers/qwen3.py +++ b/renderers/qwen3.py @@ -21,6 +21,7 @@ attribute_text_segments, extract_message_tool_names, reject_assistant_in_extension, + reject_thinking_strip_in_extension, should_preserve_past_thinking, trim_to_turn_close, ) @@ -209,8 +210,7 @@ def emit_text_segments( preserve_thinking = should_preserve_past_thinking( messages, i, - preserve_all_thinking=self.config.preserve_all_thinking, - preserve_thinking_between_tool_calls=self.config.preserve_thinking_between_tool_calls, + thinking_retention=self.config.thinking_retention, ) self._render_assistant( msg, @@ -298,6 +298,19 @@ def bridge_to_next_turn( ): return None + # Faithfulness across a user-query boundary: Qwen3's template drops a + # past block's thinking once a new user turn arrives, so the bridge + # can't carry it verbatim there (see reject_thinking_strip_in_extension). + if reject_thinking_strip_in_extension( + previous_prompt_ids, + previous_completion_ids, + new_messages, + thinking_retention=self.config.thinking_retention, + thinking_marker_ids=[self._think_end], + enable_thinking=self.config.enable_thinking, + ): + return None + previous_ids = trim_to_turn_close( previous_prompt_ids, previous_completion_ids, diff --git a/renderers/qwen35.py b/renderers/qwen35.py index cdb8ee1..5878487 100644 --- a/renderers/qwen35.py +++ b/renderers/qwen35.py @@ -29,6 +29,7 @@ attribute_text_segments, extract_message_tool_names, reject_assistant_in_extension, + reject_thinking_strip_in_extension, should_preserve_past_thinking, trim_to_turn_close, ) @@ -511,8 +512,7 @@ def flush_buf() -> None: preserve_thinking = should_preserve_past_thinking( messages, i, - preserve_all_thinking=self.config.preserve_all_thinking, - preserve_thinking_between_tool_calls=self.config.preserve_thinking_between_tool_calls, + thinking_retention=self.config.thinking_retention, ) self._render_assistant( msg, @@ -594,7 +594,7 @@ def parse_response( token_ids, stop_ids={self._im_end, self._endoftext}, think_id=self._think, - think_end_id=self._think_end, + thinking_marker_ids=[self._think_end], tool_call_id=self._tool_call, tool_call_end_id=self._tool_call_end, tools=tools, @@ -619,6 +619,19 @@ def bridge_to_next_turn( ): return None + # Faithfulness across a user-query boundary: the template drops a past + # block's thinking once a new user turn arrives (see + # reject_thinking_strip_in_extension). + if reject_thinking_strip_in_extension( + previous_prompt_ids, + previous_completion_ids, + new_messages, + thinking_retention=self.config.thinking_retention, + thinking_marker_ids=[self._think_end], + enable_thinking=self.config.enable_thinking, + ): + return None + previous_ids = trim_to_turn_close( previous_prompt_ids, previous_completion_ids, diff --git a/renderers/qwen36.py b/renderers/qwen36.py index 6adf867..938efd8 100644 --- a/renderers/qwen36.py +++ b/renderers/qwen36.py @@ -10,8 +10,7 @@ Historical-thinking retention follows Qwen3.5's default (drop past ```` blocks). The upstream template carries a ``preserve_thinking`` Jinja toggle for the opposite polarity; on the renderer side that intent -maps to the renderer-agnostic ``preserve_all_thinking`` / -``preserve_thinking_between_tool_calls`` flags on +maps to the renderer-agnostic ``thinking_retention`` flag on :class:`renderers.Qwen36RendererConfig`. Everything else — tool system prompt, tool-call XML structure, thinking diff --git a/renderers/qwen3_vl.py b/renderers/qwen3_vl.py index 7b82d7e..f223ff9 100644 --- a/renderers/qwen3_vl.py +++ b/renderers/qwen3_vl.py @@ -302,9 +302,9 @@ class Qwen3VLRenderer: keyed off ``tokenizer.name_or_path`` the first time a multimodal part is seen. - ``preserve_all_thinking`` / ``preserve_thinking_between_tool_calls`` - on the config are no-ops here — the chat template drops past - ```` blocks unconditionally. Stored for Protocol parity. + ``thinking_retention`` on the config is a no-op here — the chat + template drops past ```` blocks unconditionally. Stored for + Protocol parity. """ def __init__( diff --git a/tests/test_bridge.py b/tests/test_bridge.py index 8b7bf77..714092c 100644 --- a/tests/test_bridge.py +++ b/tests/test_bridge.py @@ -67,7 +67,25 @@ def br_renderer(br_model, br_renderer_name): return _load(br_model, br_renderer_name)[1] -def _simulate_prior_turn(renderer): +@pytest.fixture +def br_renderer_all(br_model, br_renderer_name): + """Renderer forced to ``thinking_retention="all"``. + + The verbatim-extension mechanic tests below cross a user-query + boundary. For thinking models the template drops a past block's + thinking there, so the faithful bridge declines (covered by + ``test_bridge_declines_across_user_query_when_template_drops_thinking``). + ``"all"`` keeps thinking on every path, isolating the pure extension + mechanic from the retention policy across all renderers. + """ + from renderers import create_renderer + + tok, base = _load(br_model, br_renderer_name) + cfg = base.config.model_copy(update={"thinking_retention": "all"}) + return create_renderer(tok, cfg) + + +def _simulate_prior_turn(renderer, assistant=None): """Build a (prev_prompt, prev_completion) pair that a real rollout would produce for a one-turn prior with a clean stop. @@ -76,13 +94,15 @@ def _simulate_prior_turn(renderer): gen_prompt, and take the diff as prev_completion. We then trim prev_completion to the last close token so it matches what vLLM actually hands back (vLLM stops at the close token and excludes the - trailing template scaffolding). + trailing template scaffolding). Pass ``assistant`` to override the + default no-thinking turn (e.g. one carrying ``reasoning_content``). """ prior = [ {"role": "system", "content": "You are helpful."}, {"role": "user", "content": "Hi."}, ] - assistant = [{"role": "assistant", "content": "Hello!"}] + if assistant is None: + assistant = [{"role": "assistant", "content": "Hello!"}] prev_prompt = renderer.render_ids(prior, add_generation_prompt=True) full_with_assistant = renderer.render_ids( @@ -105,11 +125,11 @@ def _simulate_prior_turn(renderer): return prev_prompt, prev_completion -def test_bridge_extends_prev_verbatim_on_clean_stop(br_renderer, br_model): - prev_prompt, prev_completion = _simulate_prior_turn(br_renderer) +def test_bridge_extends_prev_verbatim_on_clean_stop(br_renderer_all, br_model): + prev_prompt, prev_completion = _simulate_prior_turn(br_renderer_all) new_messages = [{"role": "user", "content": "What's 2+2?"}] - bridged = br_renderer.bridge_to_next_turn( + bridged = br_renderer_all.bridge_to_next_turn( prev_prompt, prev_completion, new_messages ) assert bridged is not None, f"{br_model}: bridge returned None on clean stop" @@ -148,8 +168,8 @@ def test_bridge_rejects_empty_prev_or_new(br_renderer): assert br_renderer.bridge_to_next_turn(prev_prompt, prev_completion, []) is None -def test_bridge_synthesises_close_on_truncation(br_renderer, br_model): - prev_prompt, prev_completion = _simulate_prior_turn(br_renderer) +def test_bridge_synthesises_close_on_truncation(br_renderer_all, br_model): + prev_prompt, prev_completion = _simulate_prior_turn(br_renderer_all) # Drop the final close token to simulate a max_tokens truncation. prev_completion_trunc = prev_completion[:-1] if prev_completion else prev_completion if len(prev_completion_trunc) == 0: @@ -157,7 +177,7 @@ def test_bridge_synthesises_close_on_truncation(br_renderer, br_model): f"{br_model}: simulated prior had no completion tokens — can't truncate" ) - bridged = br_renderer.bridge_to_next_turn( + bridged = br_renderer_all.bridge_to_next_turn( prev_prompt, prev_completion_trunc, [{"role": "user", "content": "What's 2+2?"}], @@ -176,12 +196,12 @@ def test_bridge_synthesises_close_on_truncation(br_renderer, br_model): def test_bridge_extension_includes_new_message_text( - br_renderer, br_tokenizer, br_model + br_renderer_all, br_tokenizer, br_model ): - prev_prompt, prev_completion = _simulate_prior_turn(br_renderer) + prev_prompt, prev_completion = _simulate_prior_turn(br_renderer_all) new_messages = [{"role": "user", "content": "HELLO_SENTINEL_XYZ"}] - bridged = br_renderer.bridge_to_next_turn( + bridged = br_renderer_all.bridge_to_next_turn( prev_prompt, prev_completion, new_messages ) assert bridged is not None @@ -190,3 +210,131 @@ def test_bridge_extension_includes_new_message_text( assert "HELLO_SENTINEL_XYZ" in decoded, ( f"{br_model}: new-message content missing from extension; got {decoded!r}" ) + + +def test_bridge_declines_across_user_query_when_template_drops_thinking(): + """Qwen3's template drops a past block's thinking once a new user turn + arrives. The bridge carries prior tokens verbatim, so spanning that + boundary would keep thinking the template strips. The faithful bridge + must DECLINE there (the caller then re-renders, byte-faithfully) — but + only when it actually matters: + + - new user query + prior thinking + retention!="all" -> decline, + and the fallback re-render equals ``apply_chat_template``. + - thinking_retention="all" keeps thinking on every path -> extend. + - a tool response (in-flight cycle, no new query) keeps thinking in + the template too -> extend. + - no prior thinking -> nothing to strip -> extend. + """ + from renderers import create_renderer + from renderers.base import load_tokenizer + from renderers.configs import Qwen3RendererConfig + + tok = load_tokenizer("Qwen/Qwen3-8B") + im_end = tok.convert_tokens_to_ids("<|im_end|>") + + u1 = {"role": "user", "content": "What is 2+2?"} + u2 = {"role": "user", "content": "Multiply that by 3."} + tool = {"role": "tool", "content": "ok"} + think = "\n2 plus 2 is 4.\n\n\n4" + + def prior(r, asst_text): + p = r.render_ids([u1], add_generation_prompt=True) + completion = tok.encode(asst_text, add_special_tokens=False) + [im_end] + return p, completion + + # new user query + prior thinking + default retention -> decline + r = create_renderer(tok, Qwen3RendererConfig()) + p, comp = prior(r, think) + assert r.bridge_to_next_turn(p, comp, [u2]) is None + # ...and the caller's faithful re-render matches the chat template + hist = [u1, {"role": "assistant", "content": think}, u2] + rendered = tok.decode(r.render_ids(hist, add_generation_prompt=True)) + assert rendered == tok.apply_chat_template( + hist, tokenize=False, add_generation_prompt=True + ) + + # thinking_retention="all" keeps thinking everywhere -> extend + r_all = create_renderer(tok, Qwen3RendererConfig(thinking_retention="all")) + p, comp = prior(r_all, think) + assert r_all.bridge_to_next_turn(p, comp, [u2]) is not None + + # tool response continues the in-flight cycle (no new query) -> extend + p, comp = prior(r, think) + assert r.bridge_to_next_turn(p, comp, [tool]) is not None + + # no prior thinking -> nothing to strip -> extend even across a user turn + p, comp = prior(r, "4") + assert r.bridge_to_next_turn(p, comp, [u2]) is not None + + +# Renderers wired with the bridge faithfulness guard. Detection markers +# vary: single ```` token (Qwen3, GLM, Nemotron), multi-token +# ```` (Kimi-K2.5), or harmony analysis-channel header (gpt-oss). +# minimax has no bridge; non-thinking models (llama, deepseek-v3) are out +# of scope. +_GUARDED_THINKING_MODELS = { + "Qwen/Qwen3-8B", + "Qwen/Qwen3.5-9B", + "Qwen/Qwen3.6-35B-A3B", + "zai-org/GLM-5", + "zai-org/GLM-5.1", + "THUDM/GLM-4.5-Air", + "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16", + "moonshotai/Kimi-K2.5", + "openai/gpt-oss-20b", +} + + +def test_bridge_declines_across_user_turn_when_thinking_present(br_renderer, br_model): + """Across every guarded thinking renderer: a prior turn carrying + ``reasoning_content`` makes the bridge decline a new *user* query (the + template would strip that thinking) but still extend a *tool* response + (in-flight cycle keeps it). Non-guarded models are skipped.""" + if br_model not in _GUARDED_THINKING_MODELS: + pytest.skip(f"{br_model}: no bridge thinking-guard") + + asst = [ + {"role": "assistant", "reasoning_content": "Let me think.", "content": "Hello!"} + ] + prev_prompt, prev_completion = _simulate_prior_turn(br_renderer, asst) + + declined = br_renderer.bridge_to_next_turn( + prev_prompt, prev_completion, [{"role": "user", "content": "next"}] + ) + assert declined is None, f"{br_model}: expected faithful decline across a user turn" + + extended = br_renderer.bridge_to_next_turn( + prev_prompt, prev_completion, [{"role": "tool", "content": "result"}] + ) + assert extended is not None, f"{br_model}: should still bridge within a tool cycle" + + +def test_bridge_keeps_thinking_when_history_kwarg_disables_truncation(): + """GLM ``clear_thinking=False`` / Nemotron ``truncate_history_thinking= + False`` keep all past thinking (the template doesn't strip it), so the + bridge must NOT decline across a user turn — declining would re-render and + re-tokenize model-sampled thinking bytes.""" + from renderers import create_renderer + from renderers.base import load_tokenizer + from renderers.configs import GLM5RendererConfig, Nemotron3RendererConfig + + asst = [ + {"role": "assistant", "reasoning_content": "Let me think.", "content": "Hi"} + ] + cases = [ + ("zai-org/GLM-5", GLM5RendererConfig(clear_thinking=False)), + ( + "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16", + Nemotron3RendererConfig(truncate_history_thinking=False), + ), + ] + for model, cfg in cases: + r = create_renderer(load_tokenizer(model), cfg) + prev_prompt, prev_completion = _simulate_prior_turn(r, asst) + bridged = r.bridge_to_next_turn( + prev_prompt, prev_completion, [{"role": "user", "content": "next"}] + ) + assert bridged is not None, ( + f"{model}: bridge must keep verbatim when the template keeps all thinking" + ) diff --git a/tests/test_llama_3.py b/tests/test_llama_3.py index c6e4c75..8c16232 100644 --- a/tests/test_llama_3.py +++ b/tests/test_llama_3.py @@ -80,8 +80,8 @@ def test_default_date_matches_chat_template_strftime_fallback(llama_pair): def test_preserve_thinking_flags_are_noops(llama_pair): - """Llama-3 has no reasoning channel, so the ``preserve_*_thinking`` - flags are accepted but never change the token stream — the same + """Llama-3 has no reasoning channel, so any ``thinking_retention`` + level is accepted but never changes the token stream — the same never-preserves contract as Kimi-K2 / Qwen3-VL. (Cross-renderer coverage lives in tests/test_preserve_thinking.py.)""" _, _, tok, _ = llama_pair @@ -94,10 +94,12 @@ def test_preserve_thinking_flags_are_noops(llama_pair): }, ] base = Llama3Renderer(tok).render_ids(msgs) - for flag in ("preserve_all_thinking", "preserve_thinking_between_tool_calls"): - r = Llama3Renderer(tok, Llama3RendererConfig(**{flag: True})) - assert r.config.__getattribute__(flag) is True - assert r.render_ids(msgs) == base, f"{flag} must be a no-op for Llama-3" + for level in ("tool_cycle", "all"): + r = Llama3Renderer(tok, Llama3RendererConfig(thinking_retention=level)) + assert r.config.thinking_retention == level + assert r.render_ids(msgs) == base, ( + f"thinking_retention={level!r} must be a no-op for Llama-3" + ) # --------------------------------------------------------------------------- diff --git a/tests/test_preserve_thinking.py b/tests/test_preserve_thinking.py index daa4836..3d3a372 100644 --- a/tests/test_preserve_thinking.py +++ b/tests/test_preserve_thinking.py @@ -1,16 +1,17 @@ -"""Smoke coverage for the ``preserve_*_thinking`` override flags. +"""Smoke coverage for the ``thinking_retention`` override flag. -Flags live on the typed renderer config (e.g. -``Qwen3RendererConfig(preserve_all_thinking=True)``) and are stored on -the renderer as ``self.config.preserve_*``. Each test that wants a -non-default flag builds a fresh renderer for that configuration via +The flag lives on the typed renderer config (e.g. +``Qwen3RendererConfig(thinking_retention="all")``) and is stored on the +renderer as ``self.config.thinking_retention``. Each test that wants a +non-default level builds a fresh renderer for that configuration via ``_make`` below. Two invariants per renderer: -1. Default render (no flags) is byte-identical to the existing - ``apply_chat_template`` parity baseline — covered exhaustively elsewhere. -2. Setting either flag never *removes* tokens compared to the default and, +1. Default render (``thinking_retention="template"``) is byte-identical to + the existing ``apply_chat_template`` parity baseline — covered + exhaustively elsewhere. +2. Raising the level never *removes* tokens compared to the default and, for renderers whose template would drop past-asst thinking, actually adds tokens for a conversation containing past-asst ``reasoning_content``. @@ -31,7 +32,7 @@ def _make(tokenizer, renderer_name, **flags): - """Build a fresh renderer with the given preserve_*_thinking flags + """Build a fresh renderer with the given thinking_retention level bound at construction. Reuses the cached tokenizer fixture.""" if renderer_name == "auto": renderer_name = MODEL_RENDERER_MAP.get( @@ -51,8 +52,8 @@ def _make(tokenizer, renderer_name, **flags): "Qwen/Qwen3-VL-8B-Instruct", "Qwen/Qwen3-VL-30B-A3B-Instruct", "poolside/Laguna-XS.2", - # Llama-3 has no reasoning channel at all — preserve flags can't add - # or drop anything, so they're pure no-ops. + # Llama-3 has no reasoning channel at all — thinking_retention can't + # add or drop anything, so it's a pure no-op. "meta-llama/Llama-3.2-1B-Instruct", "meta-llama/Llama-3.2-3B-Instruct", "unsloth/Llama-3.2-1B-Instruct", @@ -80,7 +81,7 @@ def _make(tokenizer, renderer_name, **flags): def test_should_preserve_past_thinking_classification(): - # CURRENT-block-only behaviour. between_tool_calls preserves thinking + # CURRENT-block-only behaviour. "tool_cycle" preserves thinking # ONLY for asst messages that sit AFTER the last user turn AND are in # a segment that contains a tool. Anything before the last user turn # falls back to template default (typically dropped). @@ -100,44 +101,38 @@ def test_should_preserve_past_thinking_classification(): assert should_preserve_past_thinking( live_cycle, 1, - preserve_all_thinking=False, - preserve_thinking_between_tool_calls=True, + thinking_retention="tool_cycle", ) assert should_preserve_past_thinking( live_cycle, 3, - preserve_all_thinking=False, - preserve_thinking_between_tool_calls=True, + thinking_retention="tool_cycle", ) # Same shape with a NEW user appended → now the prior tool block is - # "older" and between_tool_calls must drop its thinking (template - # default). Only preserve_all_thinking would keep them. + # "older" and "tool_cycle" must drop its thinking (template default). + # Only thinking_retention="all" would keep them. closed_cycle = live_cycle + [{"role": "user", "content": "next"}] assert not should_preserve_past_thinking( closed_cycle, 1, - preserve_all_thinking=False, - preserve_thinking_between_tool_calls=True, + thinking_retention="tool_cycle", ) assert not should_preserve_past_thinking( closed_cycle, 3, - preserve_all_thinking=False, - preserve_thinking_between_tool_calls=True, + thinking_retention="tool_cycle", ) - # preserve_all_thinking still keeps them. + # thinking_retention="all" still keeps them. assert should_preserve_past_thinking( closed_cycle, 1, - preserve_all_thinking=True, - preserve_thinking_between_tool_calls=False, + thinking_retention="all", ) assert should_preserve_past_thinking( closed_cycle, 3, - preserve_all_thinking=True, - preserve_thinking_between_tool_calls=False, + thinking_retention="all", ) # Current segment without a tool → not a tool cycle → not preserved. @@ -148,37 +143,35 @@ def test_should_preserve_past_thinking_classification(): assert not should_preserve_past_thinking( no_tool_yet, 1, - preserve_all_thinking=False, - preserve_thinking_between_tool_calls=True, + thinking_retention="tool_cycle", ) - # Both flags False → always False. + # thinking_retention="template" → always False. assert not should_preserve_past_thinking( live_cycle, 1, - preserve_all_thinking=False, - preserve_thinking_between_tool_calls=False, + thinking_retention="template", ) -def test_preserve_flags_default_unchanged( +def test_thinking_retention_template_unchanged( model_name, tokenizer, renderer_name, renderer ): - # A renderer constructed with both flags explicitly ``False`` must - # produce byte-identical output to one constructed with the defaults. + # A renderer constructed with thinking_retention="template" explicitly + # must produce byte-identical output to one constructed with the defaults. bare = renderer.render_ids(CONVERSATION) explicit_off = _make( tokenizer, renderer_name, - preserve_all_thinking=False, - preserve_thinking_between_tool_calls=False, + thinking_retention="template", ).render_ids(CONVERSATION) assert bare == explicit_off, ( - f"{model_name}: explicit flags=False must equal bare default render" + f"{model_name}: explicit thinking_retention='template' must equal " + f"bare default render" ) -def test_preserve_all_thinking_grows_or_no_op( +def test_thinking_retention_all_grows_or_no_op( model_name, tokenizer, renderer_name, renderer ): from renderers.default import DefaultRenderer @@ -186,37 +179,37 @@ def test_preserve_all_thinking_grows_or_no_op( if isinstance(renderer, DefaultRenderer): pytest.skip("DefaultRenderer raises on these flags — covered separately") default = renderer.render_ids(CONVERSATION) - preserved = _make(tokenizer, renderer_name, preserve_all_thinking=True).render_ids( + preserved = _make(tokenizer, renderer_name, thinking_retention="all").render_ids( CONVERSATION ) if model_name in NO_OP_MODELS: assert preserved == default, ( - f"{model_name} is a no-op renderer; preserve_all_thinking must " + f"{model_name} is a no-op renderer; thinking_retention='all' must " f"not change output (got {len(default)} → {len(preserved)})" ) else: assert len(preserved) > len(default), ( - f"{model_name}: preserve_all_thinking should add tokens for a " + f"{model_name}: thinking_retention='all' should add tokens for a " f"conversation with past-asst reasoning_content " f"(default={len(default)}, preserved={len(preserved)})" ) -def test_preserve_between_tool_calls_strict_subset( +def test_thinking_retention_tool_cycle_strict_subset( model_name, tokenizer, renderer_name, renderer ): - """``preserve_thinking_between_tool_calls`` is strictly weaker than - ``preserve_all_thinking``: token count satisfies default <= between <= all.""" + """``thinking_retention="tool_cycle"`` is strictly weaker than + ``"all"``: token count satisfies default <= tool_cycle <= all.""" from renderers.default import DefaultRenderer if isinstance(renderer, DefaultRenderer): pytest.skip("DefaultRenderer raises on these flags — covered separately") default = renderer.render_ids(CONVERSATION) between = _make( - tokenizer, renderer_name, preserve_thinking_between_tool_calls=True + tokenizer, renderer_name, thinking_retention="tool_cycle" ).render_ids(CONVERSATION) - all_ = _make(tokenizer, renderer_name, preserve_all_thinking=True).render_ids( + all_ = _make(tokenizer, renderer_name, thinking_retention="all").render_ids( CONVERSATION ) assert len(default) <= len(between) <= len(all_), ( @@ -244,28 +237,28 @@ def test_preserve_between_tool_calls_strict_subset( ] -def test_preserve_btc_on_live_cycle_matches_all( +def test_thinking_retention_tool_cycle_matches_all_on_live_cycle( model_name, tokenizer, renderer_name, renderer ): """In a live tool cycle (no trailing user), every past-asst sits in - the current tool-bearing segment. ``preserve_thinking_between_tool_calls`` + the current tool-bearing segment. ``thinking_retention="tool_cycle"`` should preserve all of their thinking — same set of asst messages as - ``preserve_all_thinking``, so the resulting token sequences must be + ``"all"``, so the resulting token sequences must be identical (independent of which template-default condition each renderer uses internally).""" from renderers.default import DefaultRenderer if isinstance(renderer, DefaultRenderer): pytest.skip("DefaultRenderer raises on these flags — covered separately") - btc = _make( - tokenizer, renderer_name, preserve_thinking_between_tool_calls=True - ).render_ids(LIVE_TOOL_CYCLE) - all_ = _make(tokenizer, renderer_name, preserve_all_thinking=True).render_ids( + btc = _make(tokenizer, renderer_name, thinking_retention="tool_cycle").render_ids( + LIVE_TOOL_CYCLE + ) + all_ = _make(tokenizer, renderer_name, thinking_retention="all").render_ids( LIVE_TOOL_CYCLE ) assert btc == all_, ( - f"{model_name}: in a live tool cycle btc must match preserve_all " - f"(got len(btc)={len(btc)}, len(all)={len(all_)})" + f"{model_name}: in a live tool cycle tool_cycle must match all " + f"(got len(tool_cycle)={len(btc)}, len(all)={len(all_)})" ) @@ -325,17 +318,17 @@ def test_preserve_btc_on_live_cycle_matches_all( "Qwen/Qwen3-VL-8B-Instruct", "Qwen/Qwen3-VL-30B-A3B-Instruct", # Llama-3 ships no rendering path, so reasoning_content never - # surfaces in the output regardless of the preserve flags. + # surfaces in the output regardless of thinking_retention. "meta-llama/Llama-3.2-1B-Instruct", "meta-llama/Llama-3.2-3B-Instruct", "unsloth/Llama-3.2-1B-Instruct", } -def test_preserve_all_thinking_emits_every_asst_reasoning( +def test_thinking_retention_all_emits_every_asst_reasoning( model_name, tokenizer, renderer_name, renderer ): - """``preserve_all_thinking=True`` must surface every past-asst's + """``thinking_retention="all"`` must surface every past-asst's ``reasoning_content`` in the decoded output — for renderers that have any pathway to render reasoning at all.""" from renderers.default import DefaultRenderer @@ -343,7 +336,7 @@ def test_preserve_all_thinking_emits_every_asst_reasoning( if isinstance(renderer, DefaultRenderer): pytest.skip("DefaultRenderer raises on these flags — covered separately") - ids = _make(tokenizer, renderer_name, preserve_all_thinking=True).render_ids( + ids = _make(tokenizer, renderer_name, thinking_retention="all").render_ids( TWO_BLOCK_CONV, tools=TWO_BLOCK_TOOLS ) text = tokenizer.decode(ids) @@ -352,38 +345,38 @@ def test_preserve_all_thinking_emits_every_asst_reasoning( for sentinel in ALL_SENTINELS: assert sentinel not in text, ( f"{model_name}: never-preserves renderer leaked {sentinel} " - f"under preserve_all_thinking" + f"under thinking_retention='all'" ) else: for sentinel in ALL_SENTINELS: assert sentinel in text, ( - f"{model_name}: preserve_all_thinking did not emit {sentinel} " + f"{model_name}: thinking_retention='all' did not emit {sentinel} " f"in decoded output" ) -def test_preserve_btc_emits_current_block_reasoning( +def test_thinking_retention_tool_cycle_emits_current_block_reasoning( model_name, tokenizer, renderer_name, renderer ): - """``preserve_thinking_between_tool_calls=True`` must surface the - current (post-last-user) tool block's reasoning. Older blocks fall - back to template default, which varies per renderer — no universal - assertion there.""" + """``thinking_retention="tool_cycle"`` must surface the current + (post-last-user) tool block's reasoning. Older blocks fall back to + template default, which varies per renderer — no universal assertion + there.""" from renderers.default import DefaultRenderer if isinstance(renderer, DefaultRenderer): pytest.skip("DefaultRenderer raises on these flags — covered separately") - ids = _make( - tokenizer, renderer_name, preserve_thinking_between_tool_calls=True - ).render_ids(TWO_BLOCK_CONV, tools=TWO_BLOCK_TOOLS) + ids = _make(tokenizer, renderer_name, thinking_retention="tool_cycle").render_ids( + TWO_BLOCK_CONV, tools=TWO_BLOCK_TOOLS + ) text = tokenizer.decode(ids) if model_name in NEVER_PRESERVES_MODELS: for sentinel in ALL_SENTINELS: assert sentinel not in text, ( f"{model_name}: never-preserves renderer leaked {sentinel} " - f"under preserve_thinking_between_tool_calls" + f"under thinking_retention='tool_cycle'" ) else: for sentinel in CURRENT_BLOCK_SENTINELS: @@ -393,23 +386,23 @@ def test_preserve_btc_emits_current_block_reasoning( ) -def test_default_renderer_raises_on_flags(): +def test_default_renderer_raises_above_template(): """``DefaultRenderer`` falls back to apply_chat_template with no - selective re-emit pathway, so constructing one with either flag set - must raise — fail fast, before any render is attempted.""" + selective re-emit pathway, so constructing one with thinking_retention + above ``"template"`` must raise — fail fast, before any render.""" from renderers import DefaultRendererConfig from renderers.base import load_tokenizer tok = load_tokenizer("Qwen/Qwen2.5-0.5B-Instruct") - # No flags → constructs cleanly. + # Default "template" → constructs cleanly. create_renderer(tok, DefaultRendererConfig()) - # Either flag set → raises at construction. + # Any level above "template" → raises at construction. with pytest.raises(NotImplementedError): - create_renderer(tok, DefaultRendererConfig(preserve_all_thinking=True)) + create_renderer(tok, DefaultRendererConfig(thinking_retention="all")) with pytest.raises(NotImplementedError): create_renderer( tok, - DefaultRendererConfig(preserve_thinking_between_tool_calls=True), + DefaultRendererConfig(thinking_retention="tool_cycle"), ) @@ -425,21 +418,16 @@ def test_create_renderer_records_flag_state(model_name, renderer_name, tokenizer from renderers.default import DefaultRenderer bare = _make(tokenizer, renderer_name) - assert bare.config.preserve_all_thinking is False - assert bare.config.preserve_thinking_between_tool_calls is False + assert bare.config.thinking_retention == "template" if not isinstance(bare, DefaultRenderer): - # DefaultRenderer raises at construction with either flag set — + # DefaultRenderer raises at construction above "template" — # covered by ``test_default_renderer_raises_on_flags``. - all_on = _make(tokenizer, renderer_name, preserve_all_thinking=True) - assert all_on.config.preserve_all_thinking is True - assert all_on.config.preserve_thinking_between_tool_calls is False + all_on = _make(tokenizer, renderer_name, thinking_retention="all") + assert all_on.config.thinking_retention == "all" - btc_on = _make( - tokenizer, renderer_name, preserve_thinking_between_tool_calls=True - ) - assert btc_on.config.preserve_all_thinking is False - assert btc_on.config.preserve_thinking_between_tool_calls is True + btc_on = _make(tokenizer, renderer_name, thinking_retention="tool_cycle") + assert btc_on.config.thinking_retention == "tool_cycle" # --------------------------------------------------------------------------- @@ -467,8 +455,8 @@ def test_glm5_config_accepts_clear_thinking(): def test_qwen36_config_rejects_unknown_field(): """``preserve_thinking`` on Qwen3.6 was a chat-template-kwarg pass-through in an earlier revision. It is superseded by the - renderer-agnostic ``preserve_all_thinking`` override and must not - appear on the typed config — its default-False semantics are now + renderer-agnostic ``thinking_retention`` override and must not + appear on the typed config — its default semantics are now inherited from Qwen3.5's render gate. ``extra="forbid"`` on the pydantic model enforces this at construction.""" from renderers import Qwen36RendererConfig diff --git a/tests/test_renderer_config.py b/tests/test_renderer_config.py index 4bc0a31..0fe951c 100644 --- a/tests/test_renderer_config.py +++ b/tests/test_renderer_config.py @@ -85,7 +85,7 @@ def __init__(self, tokenizer, config): def test_create_renderer_auto_resolves_via_model_map(monkeypatch): """``AutoRendererConfig`` (or ``config=None``) routes through ``MODEL_RENDERER_MAP`` to pick the matching renderer + typed config, - carrying the shared ``preserve_*`` flags over from the auto config.""" + carrying the shared ``thinking_retention`` field over from the auto config.""" class _FakeQwen35: def __init__(self, tokenizer, config): @@ -97,13 +97,13 @@ def __init__(self, tokenizer, config): renderer = create_renderer( SimpleNamespace(name_or_path="fake/qwen35"), - AutoRendererConfig(preserve_all_thinking=True), + AutoRendererConfig(thinking_retention="all"), ) assert isinstance(renderer.config, Qwen35RendererConfig) - assert renderer.config.preserve_all_thinking is True + assert renderer.config.thinking_retention == "all" # Template-level kwargs stay at their per-renderer defaults — auto - # carries only the preserve_* flags. + # carries only the thinking_retention flag. assert renderer.config.add_vision_id is False @@ -114,3 +114,42 @@ def test_create_renderer_default_argument_is_auto(): renderer = create_renderer(tok) # Falls through to DefaultRenderer when no match and no vision config. assert renderer.__class__.__name__ == "DefaultRenderer" + + +def test_thinking_retention_conflict_raises(): + """``clear_thinking`` / ``truncate_history_thinking`` are byte-equivalent + to ``thinking_retention`` (``False`` ≡ ``"all"``). Explicitly setting both + to disagreeing values raises; a consistent pair, or a single field (the + parity matrix sets one at a time), is fine.""" + from renderers import Nemotron3RendererConfig + + # explicit conflict: keep-all kwarg vs a non-"all" retention + with pytest.raises(ValidationError, match="same knob"): + GLM5RendererConfig(clear_thinking=False, thinking_retention="template") + with pytest.raises(ValidationError, match="same knob"): + GLM5RendererConfig(clear_thinking=False, thinking_retention="tool_cycle") + with pytest.raises(ValidationError, match="same knob"): + Nemotron3RendererConfig( + truncate_history_thinking=False, thinking_retention="template" + ) + + # consistent (both keep all), single-field, and the normal override + # (template-default drop + retention="all") are all accepted. + GLM5RendererConfig(clear_thinking=False, thinking_retention="all") + GLM5RendererConfig(clear_thinking=False) + GLM5RendererConfig(thinking_retention="tool_cycle") + GLM5RendererConfig(clear_thinking=True, thinking_retention="all") + + +def test_default_renderer_config_rejects_legacy_preserve_flags(): + """``DefaultRendererConfig`` is ``extra="allow"``, so the removed + ``preserve_*`` bools would otherwise slip into ``model_extra`` and be + forwarded to ``apply_chat_template`` silently. A validator rejects them + with a migration message; genuine Jinja kwargs still pass through.""" + with pytest.raises(ValidationError, match="thinking_retention"): + DefaultRendererConfig(preserve_all_thinking=True) + with pytest.raises(ValidationError, match="thinking_retention"): + DefaultRendererConfig(preserve_thinking_between_tool_calls=True) + + cfg = DefaultRendererConfig(some_jinja_kwarg=True) + assert cfg.model_extra["some_jinja_kwarg"] is True