Skip to content

feat(base): drop transformers from the offset path#87

Open
hallerite wants to merge 1 commit into
mainfrom
drop-emit-text-segments
Open

feat(base): drop transformers from the offset path#87
hallerite wants to merge 1 commit into
mainfrom
drop-emit-text-segments

Conversation

@hallerite

@hallerite hallerite commented Jun 17, 2026

Copy link
Copy Markdown
Member

Motivation

_get_offset_tokenizer used to require transformers on the render hot path — every renderer that hits a mixed-is_content segment list loaded a vanilla AutoTokenizer to read offset_mapping, because fastokens-patched tokenizers don't track offsets. That transformers requirement defeats the goal of making renderers lightweight enough for tokenizers-only downstreams (TorchTitan / TorchTune).

This PR moves the offset path to tokenizers.Tokenizer.from_pretrained() for the common case, and only falls back to transformers.AutoTokenizer for models whose tokenizer.json on disk doesn't match the AutoTokenizer-wrapped backend (MiniMax-M2.5).

Mechanism

_get_offset_tokenizer resolution order:

  1. Input is already a tokenizers.Tokenizer → return as-is.
  2. Input is a vanilla PreTrainedTokenizerFast (its backend_tokenizer is a tokenizers.Tokenizer) → return the backend directly, no extra load.
  3. Load via tokenizers.Tokenizer.from_pretrained(name_or_path) and verify it encodes a probe string ("Hello, world.\n\n# Test") identically to the user's tokenizer. If yes, cache and use.
  4. If the probe diverges, fall back to transformers.AutoTokenizer and use its backend_tokenizer. The bare tokenizer.json for MiniMax has a custom regex pre-tokenizer that AutoTokenizer replaces with ByteLevel at construction; without that mutation, .\n\n gets merged into one token where the rendering tokenizer would split it as ., \n, \n. This path is the only branch that needs transformers.

attribute_text_segments is rewritten to use tokenizers.Encoding.ids / .offsets directly (no return_offsets_mapping=True dict API). Companion changes to minimax_m2.emit_token_overlap_body and qwen3_vl._Emitter._flush for the same API swap.

Closure refactor (collapse-or-fallback)

Every renderer's emit_text_segments closure now collapses adjacent same-label segments before encoding:

collapsed: list[tuple[str, bool]] = []
for text, label in segments:
    if not text:
        continue
    if collapsed and collapsed[-1][1] == label:
        collapsed[-1] = (collapsed[-1][0] + text, label)
    else:
        collapsed.append((text, label))
if not collapsed:
    return
if len(collapsed) == 1:
    text, label = collapsed[0]
    emit_text(text, msg_idx, is_sampled=is_sampled, is_content=label)
    return
for tok_id, is_content in attribute_text_segments(self._tokenizer, collapsed):
    tokens.append(tok_id); indices.append(msg_idx); ...
  • Homogeneous after collapse (_TOOLS_HEADER + per-tool JSON, all scaffold): single joined emit_text. Preserves all internal BPE merges (e.g., <tools>\n → token 397) — no offset path needed.
  • Mixed labels remaining (("system\n", False), (sys_content, True), ("\n\n# Tools...", False)): falls back to attribute_text_segments which goes through the new tokenizers-only offset path.

What this fixes

  • transformers is no longer a render-time requirement for the common cases (Qwen3 family, GLM family, Nemotron-3, DeepSeek-V3, Laguna-XS.2, Kimi family). Path 3 catches them.
  • For MiniMax-M2.5, the fallback (path 4) still needs transformers. The error message points users to pip install renderers[transformers] if it's not installed.
  • The _offset_tokenizers cache now stores tokenizers.Tokenizer instances directly; lookups skip the HF wrapper.

Verification

  • Full pytest suite: 1854 passed, 53 skipped, 1 xfailed — same as baseline.
  • The previously-failing tool-call / tool-response tests (51 failures during exploration) all pass.
  • test_is_content.py (170 tests, the body-extraction contract) green — no semantic changes to is_content.

🤖 Generated with Claude Code


Note

Medium Risk
Changes core tokenization/attribution on the render hot path for mixed scaffold/body segments; wrong offset tokenizer choice would skew training masks, though probe verification and full test suite parity mitigate this.

Overview
Moves body/scaffold attribution (is_content) off HuggingFace’s return_offsets_mapping path onto the tokenizers Rust API, and adds tokenizers>=0.20 as a direct dependency.

_get_offset_tokenizer now resolves a tokenizers.Tokenizer in order: use the input if it’s already one; reuse backend_tokenizer on vanilla fast HF tokenizers; otherwise load Tokenizer.from_pretrained and probe-encode against the user tokenizer; only if that diverges (e.g. MiniMax) fall back to AutoTokenizer and its backend. attribute_text_segments encodes via .encode() and reads Encoding.ids / .offsets.

Across hand-coded renderers, emit_text_segments collapses adjacent segments with the same label, uses a single joined encode when only one label remains (preserving internal BPE merges), and calls attribute_text_segments only when scaffold/body labels are still mixed. MiniMax overlap attribution and Qwen3-VL flush comments follow the same encoding API.

Reviewed by Cursor Bugbot for commit c3c51a7. Bugbot is set up for automated code reviews on this repo. Configure here.

Note

Drop transformers dependency from the offset encoding path in favor of tokenizers

  • Rewrites _get_offset_tokenizer in renderers/base.py to use the HuggingFace tokenizers library directly for offset-aware encoding, falling back to transformers.AutoTokenizer only when necessary.
  • Adds tokenizers>=0.20 as a direct dependency in pyproject.toml, avoiding the heavier transformers import on the offset path.
  • Updates attribute_text_segments to use tokenizers.Tokenizer.encode() and its ids/offsets fields instead of the transformers fast tokenizer API.
  • Adds a homogeneous-segment fast-path (collapse adjacent same-label segments, skip empty, emit via emit_text) to emit_text_segments across all affected renderers, calling attribute_text_segments only when mixed labels remain.
  • Behavioral Change: _get_offset_tokenizer now accepts a tokenizers.Tokenizer directly; error messages and fallback behavior when a matching tokenizer cannot be found have changed.

Macroscope summarized c3c51a7.

``_get_offset_tokenizer`` now loads via ``tokenizers.Tokenizer.from_pretrained``
and only falls back to ``transformers.AutoTokenizer`` when the bare load
diverges from the user's tokenizer (e.g. MiniMax's ``GPT2Tokenizer``
wrapper, where ``AutoTokenizer`` mutates the backend's pre_tokenizer at
construction). Most models clear the bare-tokenizers path with no extra
load and no ``transformers`` dependency on the render path.

Companion change: ``emit_text_segments`` closures across all renderers
now collapse adjacent same-label segments before encoding. Homogeneous
collapses to a single joined ``emit_text`` (preserves all BPE merges).
Genuinely-mixed-label segments go through ``attribute_text_segments`` —
which now uses ``tokenizers.Encoding.ids`` / ``.offsets`` directly.
``minimax_m2.emit_token_overlap_body`` and ``qwen3_vl._Emitter._flush``
get the same API swap.

Net effect:
- ``transformers`` is no longer required on the render path for the
  common cases (Qwen3/3.5/3.6/3-VL, GLM-4.5/5/5.1/4.7, Nemotron-3,
  DeepSeek-V3, Laguna-XS.2, Kimi-K2/2.5/2.6). It is still required
  for MiniMax-M2.5 (path-4 fallback) and for ``load_tokenizer`` /
  VLM processors / ``create_renderer*`` helpers.
- ``_offset_tokenizers`` cache now stores ``tokenizers.Tokenizer``
  instances; lookups skip the HF wrapper indirection.
- BPE-merge fidelity preserved everywhere: 1854 / 1854 baseline parity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit c3c51a7. Configure here.

Comment thread renderers/minimax_m2.py
# BPE merges across label-transition boundaries.
for tok_id, is_content in attribute_text_segments(
self._tokenizer, segments
self._tokenizer, collapsed

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MiniMax extend overlap API mismatch

High Severity

The bridge extend helper’s emit_token_overlap_body still treats _get_offset_tokenizer’s result like a HuggingFace fast tokenizer (__call__ with return_offsets_mapping), but _get_offset_tokenizer now returns a tokenizers.Tokenizer. Non-empty tool responses on the extend path can crash or fail when overlap encoding runs.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit c3c51a7. Configure here.

@macroscopeapp

macroscopeapp Bot commented Jun 17, 2026

Copy link
Copy Markdown

Approvability

Verdict: Needs human review

This PR refactors core tokenization infrastructure to use tokenizers directly instead of transformers, a significant runtime behavior change. Additionally, there is an unresolved high-severity comment identifying an API mismatch bug in the minimax_m2 extend path that could cause crashes.

You can customize Macroscope's approvability policy. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant