[codex] Support raw image refs for multimodal rendering #89
Draft
eligotts wants to merge 2 commits into
This was referenced Jun 18, 2026
Summary
- `mmraw:v2` raw multimodal refs in `renderers.mm_store`, parsed as `RawMMRef` objects with `family`, `fingerprint`, `modality`, hash, asset id, and adapter-owned payload
- `prime_raw_mm_item` envelopes instead of processed image payloads for Qwen-VL and Kimi K2.5 image rendering
- `image_grid_thw` for Qwen, `grid_thws`/media token metadata for Kimi
Companion PRs
Notes
- `file://.../assets/images/...` refs before rendering.
Validation
- `/home/ubuntu/renderers`, `/home/ubuntu/verifiers`, and `/home/ubuntu/prime-rl-v1-raw-mm-offload` completed inference, env rollouts, train batch creation, trainer step 0, and decoded strict trainer-bound raw image refs.
Note
Support raw image refs for multimodal rendering in Qwen and Qwen3-VL renderers
- Updates `Qwen35Renderer` and `Qwen3VLRenderer` to emit image descriptors (hash, `image_grid_thw`, placeholder counts) instead of pixel tensors, removing HF processor and per-instance image cache dependencies.
- Adds `materialize_image_refs` to both renderers and `RendererPool`, converting descriptor-only image items into run-scoped image references at request time.
- Extends `generate()` in client.py with a `materialize_all_image_refs` flag; when set, it materializes image refs before request dispatch and builds image-ref selectors for Qwen instead of base64-encoded tensor payloads.
- Replaces `image_cache_max` with explicit image layout parameters (`patch_size`, `merge_size`, `min/max_pixels`, etc.) in Qwen renderer configs.
- `generate()` raises `NotImplementedError` if `materialize_all_image_refs=True` and the renderer lacks `materialize_image_refs`; pixel tensors are no longer embedded in `multi_modal_data`.
Macroscope summarized 32d5a9d. (Automatic summaries will resume when PR exits draft mode or review begins).
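For reviewers, here is a minimal sketch of what a descriptor-only image item and the `generate()` guard might look like. It is not the code in this PR. `ImageDescriptor`, `describe_image`, and `materialize_or_fail` are hypothetical names, the default layout values are illustrative Qwen2-VL-style numbers, and the resize rule is a simplified resize-to-multiple scheme assumed for the example. Only the ideas come from the summary above: a hash, an `image_grid_thw`, a placeholder count, and a `NotImplementedError` when the renderer lacks `materialize_image_refs`.

```python
import hashlib
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageDescriptor:
    """Hypothetical descriptor-only image item: no pixel tensor is carried."""
    hash: str
    image_grid_thw: tuple
    num_placeholders: int


def describe_image(image_bytes, height, width, patch_size=14, merge_size=2,
                   min_pixels=56 * 56, max_pixels=14 * 14 * 4 * 1280):
    """Derive layout metadata from image dimensions alone (assumed resize rule)."""
    factor = patch_size * merge_size
    # Snap each side to the nearest multiple of patch_size * merge_size.
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Too large: shrink, rounding down so the pixel budget is respected.
        scale = math.sqrt(height * width / max_pixels)
        h = max(factor, math.floor(height / scale / factor) * factor)
        w = max(factor, math.floor(width / scale / factor) * factor)
    elif h * w < min_pixels:
        # Too small: grow, rounding up so the minimum is reached.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    grid = (1, h // patch_size, w // patch_size)  # (t, h, w) in patches
    # Each merge_size x merge_size group of patches becomes one placeholder token.
    placeholders = (grid[1] * grid[2]) // (merge_size ** 2)
    digest = hashlib.sha256(image_bytes).hexdigest()
    return ImageDescriptor(digest, grid, placeholders)


def materialize_or_fail(renderer, items, materialize_all_image_refs):
    """Hypothetical guard mirroring the behavior described for generate()."""
    if not materialize_all_image_refs:
        return items
    materialize = getattr(renderer, "materialize_image_refs", None)
    if materialize is None:
        raise NotImplementedError(
            f"{type(renderer).__name__} does not support materialize_image_refs"
        )
    return materialize(items)
```

Under these assumptions, a 448x448 image with the default parameters yields a grid of `(1, 32, 32)` and 256 placeholder tokens, so the renderer can lay out the prompt without holding pixel data.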