PrimeIntellect-ai · hallerite · Jun 18, 2026 · Jun 19, 2026 · Jun 19, 2026 · Jun 19, 2026
diff --git a/README.md b/README.md
@@ -95,7 +95,6 @@ For RL the trainer must see the exact token ids the sampler saw. The standard al
 - **Boolean round-trip.** Engine emits `false`; client parses to Python `bool(False)`; `apply_chat_template` re-renders via `str(False)` → `"False"`. Capital F. Reproducible on Qwen3.5-35B-A3B + mini-swe-agent-plus at ~50% break rate per rollout.
 - **BPE retokenization drift.** The same substring tokenizes differently depending on neighbouring bytes. `json` + `p` + `enderer` (3 tokens) vs `jsonp` + `enderer` (2 tokens) when whitespace shifts by one character. Every subsequent token is shifted from there on.
 - **Tool-call XML drift.** The engine emits a no-arg call with a stylistic empty `</parameter>`; the Jinja re-render of the reconstructed dict drops it. Extension property broken at every such call.
-- **Thinking stripped from non-latest assistants.** Some templates strip `<think>…</think>` blocks from prior assistant turns when re-rendering. The recorded stream has the thinking; the next prompt does not.
 - **Max-seq-len truncation zeroing the anchor.** Client-side `max_seq_len` enforcement zeros `completion_ids` when `prompt_len > max_seq_len`. The bridge anchor is empty, falling back to full re-render — triggering every mode above.
 - **Scaffold-level history rewriting.** Some agent scaffolds (e.g. opencode's `experimental_repairToolCall`) rewrite tool calls before sending them back as history. The next turn's prompt contains a tool call the model never emitted. *A renderer cannot fix this — the drift happens before rendering.*
 
@@ -122,9 +121,9 @@ from renderers import (
 )
 
 # Auto-resolve renderer from the tokenizer's model name. Carries the
-# shared preserve_* flags; template kwargs require an explicit choice.
+# shared thinking_retention flag; template kwargs require an explicit choice.
 renderer = create_renderer(tokenizer)
-renderer = create_renderer(tokenizer, AutoRendererConfig(preserve_all_thinking=True))
+renderer = create_renderer(tokenizer, AutoRendererConfig(thinking_retention="all"))
 
 # Explicit choice — the typed config exposes exactly the fields that
 # renderer's chat template honours.
@@ -142,12 +141,13 @@ renderer = create_renderer(
 
 Discriminated union: every per-renderer config is a variant of `RendererConfig`, dispatched on the `name` field. Bogus combinations (e.g. `add_vision_id` under `name="qwen3"`) error at construction with a `pydantic.ValidationError`. Downstream pydantic configs (prime-rl orchestrator, verifiers `ClientConfig`) hold a single field typed as `RendererConfig` and inherit the same strict-per-variant validation.
 
-Two shared behaviour flags live on every variant via `_BaseRendererConfig`:
+One shared behaviour flag lives on every variant via `_BaseRendererConfig`: `thinking_retention`, an ascending scale (`"template"` < `"tool_cycle"` < `"all"`) whose floor is the chat template's own decision.
 
-- `preserve_all_thinking=True` — every past assistant's `reasoning_content` is kept, even when the chat template would drop it.
-- `preserve_thinking_between_tool_calls=True` — reasoning is kept on assistants in the in-flight tool cycle (post-last-user A-T-…-A block when it contains a tool response). A new user turn closes the block and drops its thinking.
+- `thinking_retention="template"` (default) — defer entirely to the chat template.
+- `thinking_retention="tool_cycle"` — additionally keep reasoning on assistants in the in-flight tool cycle (post-last-user A-T-…-A block when it contains a tool response). A new user turn closes the block and drops its thinking.
+- `thinking_retention="all"` — additionally keep every past assistant's `reasoning_content`, even when the chat template would drop it.
 
-These OR-compose with template-level toggles (e.g. GLM-5 `clear_thinking`, Nemotron-3 `truncate_history_thinking`): either flag saying "keep" wins. preserve_* can only ever *extend* retention — never override a template kwarg into a "drop" decision. The canonical use case is **compaction**: injecting a `user` turn like *"summarize the work so far"* puts every prior assistant in a past cycle, and `preserve_all_thinking=True` keeps reasoning visible end-to-end.
+`thinking_retention` is honoured end-to-end — both `render()` and `bridge_to_next_turn` consult it. So a multi-turn rollout reproduces the chat template's history handling **faithfully by default**: when a new user turn arrives, a past block's reasoning is dropped exactly as `apply_chat_template` would, and the bridge declines that boundary (letting the caller re-render) rather than carrying the stale `<think>` forward. The override can only ever *extend* retention above the template floor — it never forces a drop the template would keep. GLM-5 `clear_thinking` / Nemotron-3 `truncate_history_thinking` are byte-equivalent template kwargs (`False` ≡ `"all"`); setting one of them *and* a contradictory `thinking_retention` raises at config-load rather than silently resolving. The canonical override use case is **compaction**: injecting a `user` turn like *"summarize the work so far"* puts every prior assistant in a past cycle, and `thinking_retention="all"` keeps reasoning visible end-to-end.
 
 ## `DefaultRenderer`
 
@@ -156,7 +156,7 @@ Fallback for unsupported models. Wraps `apply_chat_template` and accepts `tool_p
 ## Roadmap
 
 - **VLM support.** `ContentPart` is text-only today; `Qwen3VLRenderer` ships only because Qwen3-VL's text-only chat template differs from Qwen3's. Plan: add `ImagePart` / `VideoPart`, multimodal bridges, validate against a Qwen3-VL RL run.
-- **Patched chat templates.** Some shipped templates re-tokenize history, normalize JSON, or auto-strip thinking — each breaks the extension property. Plan: a `use_patched` opt-in per renderer that renders the same surface form while avoiding known-bad patterns.
+- **Patched chat templates.** Some shipped templates re-tokenize history or normalize JSON in ways that break token identity. Plan: a `use_patched` opt-in per renderer that renders the same surface form while avoiding known-bad patterns. (Auto-stripping thinking from past turns is *not* one of these — that's intended template behaviour the renderer reproduces; use `thinking_retention` to override it.)
 
 ## Testing
 

diff --git a/docs/renderer-config.md b/docs/renderer-config.md
@@ -50,10 +50,10 @@ Configs are frozen. To override a field, construct a new instance or call
 
 ```python
 r = create_renderer(tokenizer)                                 # AutoRendererConfig() is the default
-r = create_renderer(tokenizer, AutoRendererConfig(preserve_all_thinking=True))
+r = create_renderer(tokenizer, AutoRendererConfig(thinking_retention="all"))
 ```
 
-`AutoRendererConfig` carries only the shared `preserve_*` flags. Template
+`AutoRendererConfig` carries only the shared `thinking_retention` flag. Template
 kwargs depend on the renderer, so overriding them requires naming the
 renderer explicitly:
 
@@ -67,33 +67,43 @@ falling back for a VLM would produce token streams the trainer can't
 reconstruct. Text-only fine-tunes without a registered renderer fall back to
 `DefaultRenderer` and log the choice at INFO.
 
-## `preserve_*` flags
-
-Every variant carries two renderer-agnostic flags on `_BaseRendererConfig`:
-
-- `preserve_all_thinking: bool = False` — re-emit `reasoning_content` on
-  every past assistant turn, even when the chat template would drop it.
-- `preserve_thinking_between_tool_calls: bool = False` — re-emit
-  `reasoning_content` only inside the in-flight tool cycle (the contiguous
-  A-T-…-A block after the most recent `user` message, when it contains at
-  least one `tool` response). A new user turn closes the block and drops
-  its thinking.
-
-These OR-compose with template-level toggles. GLM-5's `clear_thinking` and
-Nemotron-3's `truncate_history_thinking` already gate past thinking; the
-`preserve_*` flags add to that:
-
-| `clear_thinking` | `preserve_all_thinking` | past thinking? |
-|------------------|-------------------------|----------------|
-| `True` (default — drop) | `False` (default) | dropped |
-| `True`           | `True`                  | kept           |
-| `False` (keep)   | `False`                 | kept           |
-| `False`          | `True`                  | kept           |
-
-`preserve_*` can only extend retention, never force a drop. The canonical
-use case is **compaction**: injecting a `user` turn like *"summarize the work
-so far"* puts every prior assistant in a past cycle, and
-`preserve_all_thinking=True` keeps reasoning visible end-to-end.
+## `thinking_retention`
+
+Every variant carries one renderer-agnostic flag on `_BaseRendererConfig`,
+an ascending scale whose floor is the chat template's own decision:
+
+- `thinking_retention: Literal["template", "tool_cycle", "all"] = "template"`
+  - `"template"` (default) — defer entirely to the chat template.
+  - `"tool_cycle"` — additionally re-emit `reasoning_content` inside the
+    in-flight tool cycle (the contiguous A-T-…-A block after the most
+    recent `user` message, when it contains at least one `tool` response).
+    A new user turn closes the block and drops its thinking.
+  - `"all"` — additionally re-emit `reasoning_content` on every past
+    assistant turn, even when the chat template would drop it.
+
+The levels are nested: `"all"` ⊇ `"tool_cycle"` ⊇ `"template"`, and the
+level is honoured end-to-end — `render()` and `bridge_to_next_turn` both
+consult it, so multi-turn rollouts reproduce the template's history handling
+faithfully by default. GLM-5's `clear_thinking` and Nemotron-3's
+`truncate_history_thinking` are byte-equivalent template kwargs (`False` ≡
+`"all"`) gating the same past thinking; `thinking_retention` composes with
+them as:
+
+| `clear_thinking` | `thinking_retention` | past thinking? |
+|------------------|----------------------|----------------|
+| `True` (default — drop) | `"template"` (default) | dropped |
+| `True`           | `"all"`              | kept           |
+| `False` (keep)   | `"template"`         | kept           |
+| `False`          | `"all"`              | kept           |
+
+`thinking_retention` can only extend retention, never force a drop — the
+template is the floor. Because the kwarg and `thinking_retention` name the
+same thing, explicitly setting a keep-history kwarg to `False` *and* a
+non-`"all"` `thinking_retention` is contradictory and raises at config-load
+(set `thinking_retention="all"` instead). The canonical use case is **compaction**: injecting
+a `user` turn like *"summarize the work so far"* puts every prior assistant
+in a past cycle, and `thinking_retention="all"` keeps reasoning visible
+end-to-end.
 
 ## `DefaultRendererConfig` accepts arbitrary Jinja kwargs
 
@@ -139,7 +149,7 @@ In TOML / YAML, the discriminator routes deserialization:
 name = "qwen3.5"
 enable_thinking = false
 add_vision_id = true
-preserve_all_thinking = true
+thinking_retention = "all"
 ```
 
 Pydantic dispatches on `name = "qwen3.5"` to `Qwen35RendererConfig`. Bogus