16 changes: 8 additions & 8 deletions README.md
@@ -95,7 +95,6 @@ For RL the trainer must see the exact token ids the sampler saw. The standard al
- **Boolean round-trip.** Engine emits `false`; client parses to Python `bool(False)`; `apply_chat_template` re-renders via `str(False)` → `"False"`. Capital F. Reproducible on Qwen3.5-35B-A3B + mini-swe-agent-plus at ~50% break rate per rollout.
- **BPE retokenization drift.** The same substring tokenizes differently depending on neighbouring bytes. `json` + `p` + `enderer` (3 tokens) vs `jsonp` + `enderer` (2 tokens) when whitespace shifts by one character. Every subsequent token is shifted from there on.
- **Tool-call XML drift.** The engine emits a no-arg call with a stylistic empty `</parameter>`; the Jinja re-render of the reconstructed dict drops it. Extension property broken at every such call.
- **Thinking stripped from non-latest assistants.** Some templates strip `<think>…</think>` blocks from prior assistant turns when re-rendering. The recorded stream has the thinking; the next prompt does not.
- **Max-seq-len truncation zeroing the anchor.** Client-side `max_seq_len` enforcement zeros `completion_ids` when `prompt_len > max_seq_len`. With an empty anchor, the bridge falls back to a full re-render, which triggers every mode above.
- **Scaffold-level history rewriting.** Some agent scaffolds (e.g. opencode's `experimental_repairToolCall`) rewrite tool calls before sending them back as history. The next turn's prompt contains a tool call the model never emitted. *A renderer cannot fix this — the drift happens before rendering.*
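
The boolean round-trip in the first bullet can be reproduced with the standard library alone. This is a minimal sketch, assuming the tool-call arguments travel as JSON text; the `engine_output` string and the `recursive` key are made up for illustration, and the `str()` call stands in for what a template does when it interpolates a Python value.

```python
import json

# Hypothetical engine output: tool-call arguments exactly as the model emitted them.
engine_output = '{"recursive": false}'

# The client parses the JSON into Python objects; JSON `false` becomes Python `False`.
args = json.loads(engine_output)

# Re-rendering through str() changes the spelling: capital F.
rerendered_naive = str(args["recursive"])

# Serialising back through JSON restores the spelling the engine produced.
rerendered_json = json.dumps(args["recursive"])

print(rerendered_naive)  # False
print(rerendered_json)   # false
```

The two strings differ by one character, which is enough to change the token ids the trainer sees relative to what the sampler produced.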

@@ -122,9 +121,9 @@ from renderers import (
)

# Auto-resolve renderer from the tokenizer's model name. Carries the
# shared preserve_* flags; template kwargs require an explicit choice.
# shared thinking_retention flag; template kwargs require an explicit choice.
renderer = create_renderer(tokenizer)
renderer = create_renderer(tokenizer, AutoRendererConfig(preserve_all_thinking=True))
renderer = create_renderer(tokenizer, AutoRendererConfig(thinking_retention="all"))

# Explicit choice — the typed config exposes exactly the fields that
# renderer's chat template honours.
@@ -142,12 +141,13 @@ renderer = create_renderer(

Discriminated union: every per-renderer config is a variant of `RendererConfig`, dispatched on the `name` field. Bogus combinations (e.g. `add_vision_id` under `name="qwen3"`) error at construction with a `pydantic.ValidationError`. Downstream pydantic configs (prime-rl orchestrator, verifiers `ClientConfig`) hold a single field typed as `RendererConfig` and inherit the same strict-per-variant validation.

Two shared behaviour flags live on every variant via `_BaseRendererConfig`:
One shared behaviour flag lives on every variant via `_BaseRendererConfig`: `thinking_retention`, an ascending scale (`"template"` < `"tool_cycle"` < `"all"`) whose floor is the chat template's own decision.

- `preserve_all_thinking=True` — every past assistant's `reasoning_content` is kept, even when the chat template would drop it.
- `preserve_thinking_between_tool_calls=True` — reasoning is kept on assistants in the in-flight tool cycle (post-last-user A-T-…-A block when it contains a tool response). A new user turn closes the block and drops its thinking.
- `thinking_retention="template"` (default) — defer entirely to the chat template.
- `thinking_retention="tool_cycle"` — additionally keep reasoning on assistants in the in-flight tool cycle (post-last-user A-T-…-A block when it contains a tool response). A new user turn closes the block and drops its thinking.
- `thinking_retention="all"` — additionally keep every past assistant's `reasoning_content`, even when the chat template would drop it.

These OR-compose with template-level toggles (e.g. GLM-5 `clear_thinking`, Nemotron-3 `truncate_history_thinking`): either flag saying "keep" wins. `preserve_*` can only ever *extend* retention; it never overrides a template kwarg into a "drop" decision. The canonical use case is **compaction**: injecting a `user` turn like *"summarize the work so far"* puts every prior assistant in a past cycle, and `preserve_all_thinking=True` keeps reasoning visible end-to-end.
`thinking_retention` is honoured end-to-end — both `render()` and `bridge_to_next_turn` consult it. So a multi-turn rollout reproduces the chat template's history handling **faithfully by default**: when a new user turn arrives, a past block's reasoning is dropped exactly as `apply_chat_template` would, and the bridge declines that boundary (letting the caller re-render) rather than carrying the stale `<think>` forward. The override can only ever *extend* retention above the template floor — it never forces a drop the template would keep. GLM-5 `clear_thinking` / Nemotron-3 `truncate_history_thinking` are byte-equivalent template kwargs (`False` ≡ `"all"`); setting one of them *and* a contradictory `thinking_retention` raises at config-load rather than silently resolving. The canonical override use case is **compaction**: injecting a `user` turn like *"summarize the work so far"* puts every prior assistant in a past cycle, and `thinking_retention="all"` keeps reasoning visible end-to-end.
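
The three levels can be read as a per-assistant decision rule. The sketch below is an illustration of the semantics described above, not the library's implementation: `in_flight_tool_cycle`, `keeps_reasoning`, and the `template_keeps` argument (standing in for whatever the chat template would decide on its own) are hypothetical names, and messages are reduced to a list of role strings.

```python
from typing import List, Set

# Ascending scale: a higher rank keeps at least what a lower rank keeps.
_RANK = {"template": 0, "tool_cycle": 1, "all": 2}


def in_flight_tool_cycle(roles: List[str]) -> Set[int]:
    """Indices of assistants in the block after the last user message,
    provided that block contains at least one tool response."""
    last_user = max((i for i, r in enumerate(roles) if r == "user"), default=-1)
    tail = range(last_user + 1, len(roles))
    if not any(roles[i] == "tool" for i in tail):
        return set()
    return {i for i in tail if roles[i] == "assistant"}


def keeps_reasoning(roles: List[str], index: int, level: str,
                    template_keeps: bool) -> bool:
    """Whether the assistant at `index` keeps its reasoning (hypothetical helper)."""
    if template_keeps:
        return True  # the template is the floor; the level never forces a drop
    if _RANK[level] >= _RANK["all"]:
        return True
    if _RANK[level] >= _RANK["tool_cycle"]:
        return index in in_flight_tool_cycle(roles)
    return False


open_cycle = ["user", "assistant", "tool", "assistant"]
closed_cycle = open_cycle + ["user"]  # a new user turn closes the block

print(keeps_reasoning(open_cycle, 1, "tool_cycle", template_keeps=False))
print(keeps_reasoning(closed_cycle, 1, "tool_cycle", template_keeps=False))
print(keeps_reasoning(closed_cycle, 1, "all", template_keeps=False))
```

In the compaction scenario, the injected `user` turn plays the role of the last element of `closed_cycle`: under `"tool_cycle"` the earlier assistant's reasoning is dropped, and only `"all"` keeps it.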

## `DefaultRenderer`

@@ -156,7 +156,7 @@ Fallback for unsupported models. Wraps `apply_chat_template` and accepts `tool_p
## Roadmap

- **VLM support.** `ContentPart` is text-only today; `Qwen3VLRenderer` ships only because Qwen3-VL's text-only chat template differs from Qwen3's. Plan: add `ImagePart` / `VideoPart`, multimodal bridges, validate against a Qwen3-VL RL run.
- **Patched chat templates.** Some shipped templates re-tokenize history, normalize JSON, or auto-strip thinking — each breaks the extension property. Plan: a `use_patched` opt-in per renderer that renders the same surface form while avoiding known-bad patterns.
- **Patched chat templates.** Some shipped templates re-tokenize history or normalize JSON in ways that break token identity. Plan: a `use_patched` opt-in per renderer that renders the same surface form while avoiding known-bad patterns. (Auto-stripping thinking from past turns is *not* one of these — that's intended template behaviour the renderer reproduces; use `thinking_retention` to override it.)

## Testing

70 changes: 40 additions & 30 deletions docs/renderer-config.md
@@ -50,10 +50,10 @@ Configs are frozen. To override a field, construct a new instance or call

```python
r = create_renderer(tokenizer) # AutoRendererConfig() is the default
r = create_renderer(tokenizer, AutoRendererConfig(preserve_all_thinking=True))
r = create_renderer(tokenizer, AutoRendererConfig(thinking_retention="all"))
```

`AutoRendererConfig` carries only the shared `preserve_*` flags. Template
`AutoRendererConfig` carries only the shared `thinking_retention` flag. Template
kwargs depend on the renderer, so overriding them requires naming the
renderer explicitly:

@@ -67,33 +67,43 @@ falling back for a VLM would produce token streams the trainer can't
reconstruct. Text-only fine-tunes without a registered renderer fall back to
`DefaultRenderer` and log the choice at INFO.

## `preserve_*` flags

Every variant carries two renderer-agnostic flags on `_BaseRendererConfig`:

- `preserve_all_thinking: bool = False` — re-emit `reasoning_content` on
every past assistant turn, even when the chat template would drop it.
- `preserve_thinking_between_tool_calls: bool = False` — re-emit
`reasoning_content` only inside the in-flight tool cycle (the contiguous
A-T-…-A block after the most recent `user` message, when it contains at
least one `tool` response). A new user turn closes the block and drops
its thinking.

These OR-compose with template-level toggles. GLM-5's `clear_thinking` and
Nemotron-3's `truncate_history_thinking` already gate past thinking; the
`preserve_*` flags add to that:

| `clear_thinking` | `preserve_all_thinking` | past thinking? |
|------------------|-------------------------|----------------|
| `True` (default — drop) | `False` (default) | dropped |
| `True` | `True` | kept |
| `False` (keep) | `False` | kept |
| `False` | `True` | kept |

`preserve_*` can only extend retention, never force a drop. The canonical
use case is **compaction**: injecting a `user` turn like *"summarize the work
so far"* puts every prior assistant in a past cycle, and
`preserve_all_thinking=True` keeps reasoning visible end-to-end.
## `thinking_retention`

Every variant carries one renderer-agnostic flag on `_BaseRendererConfig`,
an ascending scale whose floor is the chat template's own decision:

- `thinking_retention: Literal["template", "tool_cycle", "all"] = "template"`
  - `"template"` (default): defer entirely to the chat template.
  - `"tool_cycle"`: additionally re-emit `reasoning_content` inside the
    in-flight tool cycle (the contiguous A-T-…-A block after the most
    recent `user` message, when it contains at least one `tool` response).
    A new user turn closes the block and drops its thinking.
  - `"all"`: additionally re-emit `reasoning_content` on every past
    assistant turn, even when the chat template would drop it.

The levels are nested: `"all"` ⊇ `"tool_cycle"` ⊇ `"template"`, and the
level is honoured end-to-end — `render()` and `bridge_to_next_turn` both
consult it, so multi-turn rollouts reproduce the template's history handling
faithfully by default. GLM-5's `clear_thinking` and Nemotron-3's
`truncate_history_thinking` are byte-equivalent template kwargs (`False` ≡
`"all"`) gating the same past thinking; `thinking_retention` composes with
them as:

| `clear_thinking` | `thinking_retention` | past thinking? |
|------------------|----------------------|----------------|
| `True` (default — drop) | `"template"` (default) | dropped |
| `True` | `"all"` | kept |
| `False` (keep) | `"template"` | kept |
| `False` | `"all"` | kept |

`thinking_retention` can only extend retention, never force a drop — the
template is the floor. Because the kwarg and `thinking_retention` name the
same thing, explicitly setting a keep-history kwarg to `False` *and* a
non-`"all"` `thinking_retention` is contradictory and raises at config-load
(set `thinking_retention="all"` instead). The canonical use case is **compaction**: injecting
a `user` turn like *"summarize the work so far"* puts every prior assistant
in a past cycle, and `thinking_retention="all"` keeps reasoning visible
end-to-end.
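
The four rows of the table above reduce to a single OR. The function below is a minimal sketch of that reading only: `past_thinking_kept` is a hypothetical name, it models just the `"template"` and `"all"` levels shown in the table, and it does not model the config-load validation.

```python
def past_thinking_kept(clear_thinking: bool, thinking_retention: str) -> bool:
    """Hypothetical helper mirroring the composition table for GLM-5."""
    template_keeps = not clear_thinking           # clear_thinking=False means "keep"
    override_keeps = thinking_retention == "all"  # the override can only extend
    return template_keeps or override_keeps


# Rows in the same order as the table.
rows = [
    (True, "template"),
    (True, "all"),
    (False, "template"),
    (False, "all"),
]
for clear_thinking, level in rows:
    outcome = "kept" if past_thinking_kept(clear_thinking, level) else "dropped"
    print(clear_thinking, level, outcome)
```

Because the result is an OR of the two inputs, no value of `thinking_retention` can turn a "kept" into a "dropped", which is the floor property the section describes.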

## `DefaultRendererConfig` accepts arbitrary Jinja kwargs

@@ -139,7 +149,7 @@ In TOML / YAML, the discriminator routes deserialization:
name = "qwen3.5"
enable_thinking = false
add_vision_id = true
preserve_all_thinking = true
thinking_retention = "all"
```

Pydantic dispatches on `name = "qwen3.5"` to `Qwen35RendererConfig`. Bogus