feat(base): drop transformers from the offset path#87
Conversation
``_get_offset_tokenizer`` now loads via ``tokenizers.Tokenizer.from_pretrained`` and only falls back to ``transformers.AutoTokenizer`` when the bare load diverges from the user's tokenizer (e.g. MiniMax's ``GPT2Tokenizer`` wrapper, where ``AutoTokenizer`` mutates the backend's pre_tokenizer at construction). Most models clear the bare-tokenizers path with no extra load and no ``transformers`` dependency on the render path. Companion change: ``emit_text_segments`` closures across all renderers now collapse adjacent same-label segments before encoding. Homogeneous collapses to a single joined ``emit_text`` (preserves all BPE merges). Genuinely-mixed-label segments go through ``attribute_text_segments`` — which now uses ``tokenizers.Encoding.ids`` / ``.offsets`` directly. ``minimax_m2.emit_token_overlap_body`` and ``qwen3_vl._Emitter._flush`` get the same API swap. Net effect: - ``transformers`` is no longer required on the render path for the common cases (Qwen3/3.5/3.6/3-VL, GLM-4.5/5/5.1/4.7, Nemotron-3, DeepSeek-V3, Laguna-XS.2, Kimi-K2/2.5/2.6). It is still required for MiniMax-M2.5 (path-4 fallback) and for ``load_tokenizer`` / VLM processors / ``create_renderer*`` helpers. - ``_offset_tokenizers`` cache now stores ``tokenizers.Tokenizer`` instances; lookups skip the HF wrapper indirection. - BPE-merge fidelity preserved everywhere: 1854 / 1854 baseline parity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit c3c51a7. Configure here.
| # BPE merges across label-transition boundaries. | ||
| for tok_id, is_content in attribute_text_segments( | ||
| self._tokenizer, segments | ||
| self._tokenizer, collapsed |
There was a problem hiding this comment.
MiniMax extend overlap API mismatch
High Severity
The bridge extend helper’s emit_token_overlap_body still treats _get_offset_tokenizer’s result like a HuggingFace fast tokenizer (__call__ with return_offsets_mapping), but _get_offset_tokenizer now returns a tokenizers.Tokenizer. Non-empty tool responses on the extend path can crash or fail when overlap encoding runs.
Reviewed by Cursor Bugbot for commit c3c51a7. Configure here.
ApprovabilityVerdict: Needs human review This PR refactors core tokenization infrastructure to use You can customize Macroscope's approvability policy. Learn more. |


Motivation
_get_offset_tokenizerused to requiretransformerson the render hot path — every renderer that hits a mixed-is_contentsegment list loaded a vanillaAutoTokenizerto readoffset_mapping, because fastokens-patched tokenizers don't track offsets. Thattransformersrequirement defeats the goal of makingrendererslightweight enough fortokenizers-only downstreams (TorchTitan / TorchTune).This PR moves the offset path to
tokenizers.Tokenizer.from_pretrained()for the common case, and only falls back totransformers.AutoTokenizerfor models whose tokenizer.json on disk doesn't match the AutoTokenizer-wrapped backend (MiniMax-M2.5).Mechanism
_get_offset_tokenizerresolution order:tokenizers.Tokenizer→ return as-is.PreTrainedTokenizerFast(itsbackend_tokenizeris atokenizers.Tokenizer) → return the backend directly, no extra load.tokenizers.Tokenizer.from_pretrained(name_or_path)and verify it encodes a probe string ("Hello, world.\n\n# Test") identically to the user's tokenizer. If yes, cache and use.transformers.AutoTokenizerand use itsbackend_tokenizer. The baretokenizer.jsonfor MiniMax has a custom regex pre-tokenizer thatAutoTokenizerreplaces withByteLevelat construction; without that mutation,.\n\ngets merged into one token where the rendering tokenizer would split it as.,\n,\n. This path is the only branch that needstransformers.attribute_text_segmentsis rewritten to usetokenizers.Encoding.ids/.offsetsdirectly (noreturn_offsets_mapping=Truedict API). Companion changes tominimax_m2.emit_token_overlap_bodyandqwen3_vl._Emitter._flushfor the same API swap.Closure refactor (collapse-or-fallback)
Every renderer's
emit_text_segmentsclosure now collapses adjacent same-label segments before encoding:_TOOLS_HEADER+ per-tool JSON, all scaffold): single joinedemit_text. Preserves all internal BPE merges (e.g.,<tools>\n→ token 397) — no offset path needed.("system\n", False), (sys_content, True), ("\n\n# Tools...", False)): falls back toattribute_text_segmentswhich goes through the newtokenizers-only offset path.What this fixes
transformersis no longer a render-time requirement for the common cases (Qwen3 family, GLM family, Nemotron-3, DeepSeek-V3, Laguna-XS.2, Kimi family). Path 3 catches them.transformers. The error message points users topip install renderers[transformers]if it's not installed._offset_tokenizerscache now storestokenizers.Tokenizerinstances directly; lookups skip the HF wrapper.Verification
test_is_content.py(170 tests, the body-extraction contract) green — no semantic changes tois_content.🤖 Generated with Claude Code
Note
Medium Risk
Changes core tokenization/attribution on the render hot path for mixed scaffold/body segments; wrong offset tokenizer choice would skew training masks, though probe verification and full test suite parity mitigate this.
Overview
Moves body/scaffold attribution (
is_content) off HuggingFace’sreturn_offsets_mappingpath onto thetokenizersRust API, and addstokenizers>=0.20as a direct dependency._get_offset_tokenizernow resolves atokenizers.Tokenizerin order: use the input if it’s already one; reusebackend_tokenizeron vanilla fast HF tokenizers; otherwise loadTokenizer.from_pretrainedand probe-encode against the user tokenizer; only if that diverges (e.g. MiniMax) fall back toAutoTokenizerand its backend.attribute_text_segmentsencodes via.encode()and readsEncoding.ids/.offsets.Across hand-coded renderers,
emit_text_segmentscollapses adjacent segments with the same label, uses a single joined encode when only one label remains (preserving internal BPE merges), and callsattribute_text_segmentsonly when scaffold/body labels are still mixed. MiniMax overlap attribution and Qwen3-VL flush comments follow the same encoding API.Reviewed by Cursor Bugbot for commit c3c51a7. Bugbot is set up for automated code reviews on this repo. Configure here.
Note
Drop
transformersdependency from the offset encoding path in favor oftokenizers_get_offset_tokenizerin renderers/base.py to use the HuggingFacetokenizerslibrary directly for offset-aware encoding, falling back totransformers.AutoTokenizeronly when necessary.tokenizers>=0.20as a direct dependency in pyproject.toml, avoiding the heaviertransformersimport on the offset path.attribute_text_segmentsto usetokenizers.Tokenizer.encode()and itsids/offsetsfields instead of the transformers fast tokenizer API.emit_text) toemit_text_segmentsacross all affected renderers, callingattribute_text_segmentsonly when mixed labels remain._get_offset_tokenizernow accepts atokenizers.Tokenizerdirectly; error messages and fallback behavior when a matching tokenizer cannot be found have changed.Macroscope summarized c3c51a7.