feat(base): drop `transformers` from the offset path by hallerite · Pull Request #87 · PrimeIntellect-ai/renderers

hallerite · 2026-06-17T23:50:20Z

Motivation

_get_offset_tokenizer used to require transformers on the render hot path — every renderer that hits a mixed-is_content segment list loaded a vanilla AutoTokenizer to read offset_mapping, because fastokens-patched tokenizers don't track offsets. That transformers requirement defeats the goal of making renderers lightweight enough for tokenizers-only downstreams (TorchTitan / TorchTune).

This PR moves the offset path to tokenizers.Tokenizer.from_pretrained() for the common case, and only falls back to transformers.AutoTokenizer for models whose tokenizer.json on disk doesn't match the AutoTokenizer-wrapped backend (MiniMax-M2.5).

Mechanism

_get_offset_tokenizer resolution order:

Input is already a tokenizers.Tokenizer → return as-is.
Input is a vanilla PreTrainedTokenizerFast (its backend_tokenizer is a tokenizers.Tokenizer) → return the backend directly, no extra load.
Load via tokenizers.Tokenizer.from_pretrained(name_or_path) and verify it encodes a probe string ("Hello, world.\n\n# Test") identically to the user's tokenizer. If yes, cache and use.
If the probe diverges, fall back to transformers.AutoTokenizer and use its backend_tokenizer. The bare tokenizer.json for MiniMax has a custom regex pre-tokenizer that AutoTokenizer replaces with ByteLevel at construction; without that mutation, .\n\n gets merged into one token where the rendering tokenizer would split it as ., \n, \n. This path is the only branch that needs transformers.

attribute_text_segments is rewritten to use tokenizers.Encoding.ids / .offsets directly (no return_offsets_mapping=True dict API). Companion changes to minimax_m2.emit_token_overlap_body and qwen3_vl._Emitter._flush for the same API swap.

Closure refactor (collapse-or-fallback)

Every renderer's emit_text_segments closure now collapses adjacent same-label segments before encoding:

collapsed: list[tuple[str, bool]] = []
for text, label in segments:
    if not text:
        continue
    if collapsed and collapsed[-1][1] == label:
        collapsed[-1] = (collapsed[-1][0] + text, label)
    else:
        collapsed.append((text, label))
if not collapsed:
    return
if len(collapsed) == 1:
    text, label = collapsed[0]
    emit_text(text, msg_idx, is_sampled=is_sampled, is_content=label)
    return
for tok_id, is_content in attribute_text_segments(self._tokenizer, collapsed):
    tokens.append(tok_id); indices.append(msg_idx); ...

Homogeneous after collapse (_TOOLS_HEADER + per-tool JSON, all scaffold): single joined emit_text. Preserves all internal BPE merges (e.g., <tools>\n → token 397) — no offset path needed.
Mixed labels remaining (("system\n", False), (sys_content, True), ("\n\n# Tools...", False)): falls back to attribute_text_segments which goes through the new tokenizers-only offset path.

What this fixes

transformers is no longer a render-time requirement for the common cases (Qwen3 family, GLM family, Nemotron-3, DeepSeek-V3, Laguna-XS.2, Kimi family). Path 3 catches them.
For MiniMax-M2.5, the fallback (path 4) still needs transformers. The error message points users to pip install renderers[transformers] if it's not installed.
The _offset_tokenizers cache now stores tokenizers.Tokenizer instances directly; lookups skip the HF wrapper.

Verification

Full pytest suite: 1854 passed, 53 skipped, 1 xfailed — same as baseline.
The previously-failing tool-call / tool-response tests (51 failures during exploration) all pass.
test_is_content.py (170 tests, the body-extraction contract) green — no semantic changes to is_content.

🤖 Generated with Claude Code

Note

Medium Risk
Changes core tokenization/attribution on the render hot path for mixed scaffold/body segments; wrong offset tokenizer choice would skew training masks, though probe verification and full test suite parity mitigate this.

Overview
Moves body/scaffold attribution (is_content) off HuggingFace’s return_offsets_mapping path onto the tokenizers Rust API, and adds tokenizers>=0.20 as a direct dependency.

_get_offset_tokenizer now resolves a tokenizers.Tokenizer in order: use the input if it’s already one; reuse backend_tokenizer on vanilla fast HF tokenizers; otherwise load Tokenizer.from_pretrained and probe-encode against the user tokenizer; only if that diverges (e.g. MiniMax) fall back to AutoTokenizer and its backend. attribute_text_segments encodes via .encode() and reads Encoding.ids / .offsets.

Across hand-coded renderers, emit_text_segments collapses adjacent segments with the same label, uses a single joined encode when only one label remains (preserving internal BPE merges), and calls attribute_text_segments only when scaffold/body labels are still mixed. MiniMax overlap attribution and Qwen3-VL flush comments follow the same encoding API.

^{Reviewed by Cursor Bugbot for commit c3c51a7. Bugbot is set up for automated code reviews on this repo. Configure here.}

Note

Drop `transformers` dependency from the offset encoding path in favor of `tokenizers`

Rewrites _get_offset_tokenizer in renderers/base.py to use the HuggingFace tokenizers library directly for offset-aware encoding, falling back to transformers.AutoTokenizer only when necessary.
Adds tokenizers>=0.20 as a direct dependency in pyproject.toml, avoiding the heavier transformers import on the offset path.
Updates attribute_text_segments to use tokenizers.Tokenizer.encode() and its ids/offsets fields instead of the transformers fast tokenizer API.
Adds a homogeneous-segment fast-path (collapse adjacent same-label segments, skip empty, emit via emit_text) to emit_text_segments across all affected renderers, calling attribute_text_segments only when mixed labels remain.
Behavioral Change: _get_offset_tokenizer now accepts a tokenizers.Tokenizer directly; error messages and fallback behavior when a matching tokenizer cannot be found have changed.

^{Macroscope summarized c3c51a7.}

``_get_offset_tokenizer`` now loads via ``tokenizers.Tokenizer.from_pretrained`` and only falls back to ``transformers.AutoTokenizer`` when the bare load diverges from the user's tokenizer (e.g. MiniMax's ``GPT2Tokenizer`` wrapper, where ``AutoTokenizer`` mutates the backend's pre_tokenizer at construction). Most models clear the bare-tokenizers path with no extra load and no ``transformers`` dependency on the render path. Companion change: ``emit_text_segments`` closures across all renderers now collapse adjacent same-label segments before encoding. Homogeneous collapses to a single joined ``emit_text`` (preserves all BPE merges). Genuinely-mixed-label segments go through ``attribute_text_segments`` — which now uses ``tokenizers.Encoding.ids`` / ``.offsets`` directly. ``minimax_m2.emit_token_overlap_body`` and ``qwen3_vl._Emitter._flush`` get the same API swap. Net effect: - ``transformers`` is no longer required on the render path for the common cases (Qwen3/3.5/3.6/3-VL, GLM-4.5/5/5.1/4.7, Nemotron-3, DeepSeek-V3, Laguna-XS.2, Kimi-K2/2.5/2.6). It is still required for MiniMax-M2.5 (path-4 fallback) and for ``load_tokenizer`` / VLM processors / ``create_renderer*`` helpers. - ``_offset_tokenizers`` cache now stores ``tokenizers.Tokenizer`` instances; lookups skip the HF wrapper indirection. - BPE-merge fidelity preserved everywhere: 1854 / 1854 baseline parity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit c3c51a7. Configure here.}

cursor · 2026-06-17T23:51:12Z

+            # BPE merges across label-transition boundaries.
            for tok_id, is_content in attribute_text_segments(
-                self._tokenizer, segments
+                self._tokenizer, collapsed


MiniMax extend overlap API mismatch

High Severity

The bridge extend helper’s emit_token_overlap_body still treats _get_offset_tokenizer’s result like a HuggingFace fast tokenizer (__call__ with return_offsets_mapping), but _get_offset_tokenizer now returns a tokenizers.Tokenizer. Non-empty tool responses on the extend path can crash or fail when overlap encoding runs.

^{Reviewed by Cursor Bugbot for commit c3c51a7. Configure here.}

macroscopeapp · 2026-06-17T23:51:38Z

Approvability

Verdict: Needs human review

This PR refactors core tokenization infrastructure to use tokenizers directly instead of transformers, a significant runtime behavior change. Additionally, there is an unresolved high-severity comment identifying an API mismatch bug in the minimax_m2 extend path that could cause crashes.

^{You can customize Macroscope's approvability policy. Learn more.}

cursor Bot reviewed Jun 17, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(base): drop `transformers` from the offset path#87

feat(base): drop `transformers` from the offset path#87
hallerite wants to merge 1 commit into
mainfrom
drop-emit-text-segments

hallerite commented Jun 17, 2026 •

edited by macroscopeapp Bot

Loading

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot Jun 17, 2026

Uh oh!

macroscopeapp Bot commented Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

hallerite commented Jun 17, 2026 • edited by macroscopeapp Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Mechanism

Closure refactor (collapse-or-fallback)

What this fixes

Verification

Drop transformers dependency from the offset encoding path in favor of tokenizers

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot Jun 17, 2026

Choose a reason for hiding this comment

MiniMax extend overlap API mismatch

Uh oh!

macroscopeapp Bot commented Jun 17, 2026

Approvability

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

hallerite commented Jun 17, 2026 •

edited by macroscopeapp Bot

Loading

Drop `transformers` dependency from the offset encoding path in favor of `tokenizers`