[codex] Support raw image refs for multimodal rendering#89

Draft
eligotts wants to merge 2 commits into main from codex/raw-image-assets-renderers

@eligotts eligotts commented Jun 18, 2026

Contributor

Summary

  • adds generic mmraw:v2 raw multimodal refs in renderers.mm_store, parsed as RawMMRef objects with family, fingerprint, modality, hash, asset id, and adapter-owned payload
  • emits strict prime_raw_mm_item envelopes instead of processed image payloads for Qwen-VL and Kimi K2.5 image rendering
  • keeps adapter-specific layout details in renderer-owned payloads (image_grid_thw for Qwen, grid_thws/media token metadata for Kimi)
  • supports materializing all raw image refs for retry paths after vLLM multimodal cache misses
  • keeps run-scoped image asset refs file-backed so that the downstream Prime-RL trainer materializes images with its own processor
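
To make the shape of a raw ref concrete, here is a minimal sketch of a `RawMMRef`-like object carrying the fields listed above. Only the field list and the `mmraw:v2` scheme name come from this PR; the string encoding (scheme prefix followed by JSON), the helper names `encode_ref` and `parse_ref`, and the example values are assumptions for illustration and may not match `renderers.mm_store`.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any

SCHEME = "mmraw:v2"  # scheme name from the PR; the wire encoding below is assumed
REQUIRED = ("family", "fingerprint", "modality", "hash", "asset_id")


@dataclass(frozen=True)
class RawMMRef:
    """Hypothetical stand-in for the parsed raw multimodal ref."""
    family: str        # renderer family, e.g. "qwen-vl" (illustrative value)
    fingerprint: str   # layout fingerprint of the renderer config
    modality: str      # e.g. "image"
    hash: str          # content hash of the raw asset
    asset_id: str      # run-scoped asset identifier
    payload: dict[str, Any] = field(default_factory=dict)  # adapter-owned


def encode_ref(ref: RawMMRef) -> str:
    """Serialize a ref as '<scheme>:<json>' (assumed encoding)."""
    return f"{SCHEME}:{json.dumps(asdict(ref), sort_keys=True)}"


def parse_ref(text: str) -> RawMMRef:
    """Parse the assumed encoding back into a RawMMRef, failing loudly."""
    prefix = SCHEME + ":"
    if not text.startswith(prefix):
        raise ValueError(f"not a {SCHEME} ref")
    body = json.loads(text[len(prefix):])
    missing = [key for key in REQUIRED if key not in body]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return RawMMRef(
        family=body["family"],
        fingerprint=body["fingerprint"],
        modality=body["modality"],
        hash=body["hash"],
        asset_id=body["asset_id"],
        payload=body.get("payload", {}),
    )


ref = RawMMRef(
    family="qwen-vl",
    fingerprint="abc123",
    modality="image",
    hash="deadbeef",
    asset_id="img-0",
    payload={"image_grid_thw": [1, 34, 46]},
)
assert parse_ref(encode_ref(ref)) == ref
```

Keeping the layout details inside `payload` means the generic parser never needs to know about `image_grid_thw` or `grid_thws`; only the adapter that wrote the payload reads it.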

Companion PRs

Notes

  • Draft/WIP: stacked with the Verifiers and Prime-RL raw image offload PRs.
  • Verifiers is expected to offload image content to file://.../assets/images/... refs before rendering.
  • This intentionally treats raw image refs as the supported path, not processed multimodal feature sidecars.
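
The offload step expected from Verifiers can be pictured with a small content-addressed writer. This is a sketch under assumptions: the function name `offload_image`, the SHA-256 file naming, and the `.png` default are hypothetical; only the `file://.../assets/images/...` ref shape is taken from the note above.

```python
import hashlib
from pathlib import Path


def offload_image(data: bytes, run_dir: Path, suffix: str = ".png") -> str:
    """Write raw image bytes under <run_dir>/assets/images and return a file:// ref.

    The file name is the SHA-256 of the bytes (an assumption), so offloading
    the same image twice reuses one file and yields the same ref.
    """
    digest = hashlib.sha256(data).hexdigest()
    images_dir = Path(run_dir) / "assets" / "images"
    images_dir.mkdir(parents=True, exist_ok=True)
    path = images_dir / f"{digest}{suffix}"
    if not path.exists():
        path.write_bytes(data)
    # as_uri() needs an absolute path, hence resolve().
    return path.resolve().as_uri()
```

Because the ref points at the raw bytes rather than processed features, the trainer is free to run its own processor when it materializes the image.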

Validation

  • End-to-end hosted-style smoke test through Prime-RL, using /home/ubuntu/renderers, /home/ubuntu/verifiers, and /home/ubuntu/prime-rl-v1-raw-mm-offload: inference, env rollouts, train batch creation, and trainer step 0 completed, and strict trainer-bound raw image refs decoded.

Note

Support raw image refs for multimodal rendering in Qwen and Qwen3-VL renderers

  • Reworks Qwen35Renderer and Qwen3VLRenderer to emit image descriptors (hash, image_grid_thw, placeholder counts) instead of pixel tensors, removing HF processor and per-instance image cache dependencies.
  • Adds materialize_image_refs to both renderers and RendererPool, converting descriptor-only image items into run-scoped image references at request time.
  • Introduces mm_store.py, a new module providing utilities for run-scoped image reference construction, on-disk asset offloading, and layout fingerprinting.
  • Updates generate() in client.py with a materialize_all_image_refs flag; when set, it materializes image refs before request dispatch and builds image-ref selectors for Qwen instead of base64-encoded tensor payloads.
  • Replaces image_cache_max with explicit image layout parameters (patch_size, merge_size, min/max_pixels, etc.) in Qwen renderer configs.
  • Risk: generate() raises NotImplementedError if materialize_all_image_refs=True and the renderer lacks materialize_image_refs; pixel tensors are no longer embedded in multi_modal_data.
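
As an illustration of a descriptor-only image item, the sketch below derives a placeholder count from `image_grid_thw` and fingerprints the layout parameters. It assumes the renderers follow the convention of the Hugging Face Qwen2-VL-style processors, where the count is the grid product divided by `merge_size` squared. The names `ImageLayout`, `layout_fingerprint`, `placeholder_count`, and `describe_image`, the dictionary keys, and the numeric values are hypothetical and not taken from `mm_store.py`.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ImageLayout:
    """Explicit layout parameters (illustrative subset of those in the PR)."""
    patch_size: int
    merge_size: int
    min_pixels: int
    max_pixels: int


def layout_fingerprint(layout: ImageLayout) -> str:
    """Stable short hash of the layout parameters (assumed scheme)."""
    blob = json.dumps(asdict(layout), sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:16]


def placeholder_count(image_grid_thw: tuple[int, int, int], merge_size: int) -> int:
    """Image placeholder tokens for a (t, h, w) patch grid.

    Each merge_size x merge_size block of patches becomes one token.
    """
    t, h, w = image_grid_thw
    if h % merge_size or w % merge_size:
        raise ValueError("grid height and width must be multiples of merge_size")
    return t * (h // merge_size) * (w // merge_size)


def describe_image(data: bytes, image_grid_thw: tuple[int, int, int],
                   layout: ImageLayout) -> dict:
    """Build a descriptor with no pixel tensors in it."""
    return {
        "hash": hashlib.sha256(data).hexdigest(),
        "image_grid_thw": list(image_grid_thw),
        "num_placeholders": placeholder_count(image_grid_thw, layout.merge_size),
        "layout_fingerprint": layout_fingerprint(layout),
    }


# Example values only; not the renderer defaults.
layout = ImageLayout(patch_size=14, merge_size=2, min_pixels=3136, max_pixels=1003520)
item = describe_image(b"raw-image-bytes", (1, 34, 46), layout)
assert item["num_placeholders"] == 391  # 1 * (34 // 2) * (46 // 2)
```

Including the layout fingerprint alongside the hash lets a consumer detect that a descriptor was produced under different layout parameters than the ones it would apply itself.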

Macroscope summarized 32d5a9d. (Automatic summaries will resume when PR exits draft mode or review begins).
