
[skyrl][tinker] Use VLLMRenderer in SkyRL train backend#1496

Open
nithinvc wants to merge 2 commits into NovaSky-AI:main from nithinvc:nithinc/tinker-apply-renderer

Conversation

@nithinvc (Contributor) commented Apr 10, 2026

Summary

Integrates the VLLMRenderer (landed in #1464) into the SkyRL train backend so that VLM training batches include image placeholder tokens and decoded vision tensors (pixel_values, image_grid_thw).

  • When using new inference (_SKYRL_USE_NEW_INFERENCE), _to_training_batch lazily creates a VLLMRenderer and renders all ModelInputs through it.
  • Extracts pixel_values and image_grid_thw from rendered outputs and adds them to the TrainingInputBatch as TensorList entries (one tensor per batch element, since patch counts vary per image).
  • Extends _pad_batch to handle TensorList fields by cycling and cloning entries, matching the existing padding strategy for regular tensors.
  • Reorders forward_backward and forward to call _to_training_batch before _sleep_inference_engines, since the renderer needs the inference servers to be initialized. Note that this does not wake the KV cache or model GPU memory, since that is done explicitly in save_weights_for_sampler via the dispatcher.
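The TensorList padding described above can be sketched roughly as follows. This is a hedged illustration, not the actual _pad_batch implementation: plain Python lists stand in for tensors with varying patch counts, and pad_tensor_list is a hypothetical helper name. The idea is to grow the per-element list to the padded batch size by cycling over existing entries and cloning them, mirroring the padding strategy used for regular tensors.

```python
import itertools
from copy import deepcopy

def pad_tensor_list(entries, target_size):
    """Hypothetical sketch: pad a TensorList-style field (one entry per
    batch element, since patch counts vary per image) up to target_size
    by cycling over the existing entries and cloning each one."""
    cycler = itertools.cycle(entries)
    padded = list(entries)
    while len(padded) < target_size:
        padded.append(deepcopy(next(cycler)))  # clone, don't alias
    return padded

pixel_values = [[0] * 4, [0] * 7]  # two images with 4 and 7 patches
padded = pad_tensor_list(pixel_values, target_size=5)
print(len(padded), [len(t) for t in padded])  # prints: 5 [4, 7, 4, 7, 4]
```

Cloning (rather than aliasing) the cycled entries keeps the padded elements independent, so downstream mutation of one padding entry cannot corrupt another.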

E2E Tinker VLM Classifier Curves

With #1484, we can now run Tinker vision-language recipes against the SkyRL backend. Merging both closes #1200.

Example:

 _SKYRL_USE_NEW_INFERENCE=1 uv run --extra tinker --extra fsdp -m skyrl.tinker.api \
    --base-model "Qwen/Qwen3-VL-8B-Instruct" \
    --backend fsdp \
    --backend-config '{"trainer.placement.policy_num_gpus_per_node": 8, "generator.inference_engine.num_engines": 8, "trainer.placement.colocate_all": true, "trainer.use_sample_packing": false}'

Cookbook

 TINKER_API_KEY=tml-dummy uv run --with tinker --with datasets --with torch python -m \
    tinker_cookbook.recipes.vlm_classifier.train  \
    base_url=http://localhost:8000 \
    model_name="Qwen/Qwen3-VL-4B-Instruct" \
    dataset=caltech101 

Train NLL: (train_nll curve)

Val NLL: (val_nll curve)

Val accuracy: (accuracy curve)


@gemini-code-assist (bot) left a comment


Code Review

This pull request integrates VLLMRenderer and TensorList into the SkyRLTrainBackend to support multi-modal data processing. Key changes include updating _to_training_batch to utilize the new renderer, handling vision-related tensors (pixel_values, image_grid_thw), and adjusting the sequence of engine sleep calls. Feedback highlights a potential ValueError when mixing multi-modal and text-only inputs due to inconsistent batch sizes, concerns regarding the renderer's dependency on active inference engines after reordering sleep calls, and performance overhead from using asyncio.run in the training hot path.

Comment thread: skyrl/backends/skyrl_train_backend.py
 if self._renderer is None:
     self._ensure_inference_engines()
     self._renderer = VLLMRenderer(self._inference_engine_client, self._cfg.trainer.policy.model.path)
 rendered_inputs = asyncio.run(self._renderer(prepared_batch.all_model_inputs))
Contributor


medium

Using asyncio.run inside _to_training_batch introduces significant overhead because it creates and tears down a new event loop for every call. Since _to_training_batch is called on every training step, this can impact performance.

Consider using a persistent event loop or an alternative approach to bridge the synchronous backend methods with the asynchronous renderer client, especially since this is a hot path during training.
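One way to realize the persistent-event-loop suggestion is a small runner that keeps a single loop alive on a background thread and submits coroutines to it, blocking for results. This is a minimal sketch under assumptions (the _LoopRunner class and render coroutine are hypothetical names, not part of the PR); it shows the bridging pattern, not SkyRL's actual code:

```python
import asyncio
import threading

class _LoopRunner:
    """Hypothetical helper: one long-lived event loop on a daemon thread,
    reused across calls, avoiding asyncio.run()'s per-call loop
    creation and teardown in a hot path."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro):
        # Submit the coroutine to the persistent loop and block until done.
        return asyncio.run_coroutine_threadsafe(coro, self._loop).result()

runner = _LoopRunner()

async def render(x):
    # Stand-in for an async renderer call.
    return x * 2

print(runner.run(render(21)))  # prints: 42
```

A synchronous backend method could then call runner.run(self._renderer(...)) on every training step without paying loop setup costs each time.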

Contributor Author

This is good, and in general I think the entire backend should move to a single persistent event loop. However, the scope of this PR is limited to correctness for now, unless requested.



Development

Successfully merging this pull request may close these issues.

[skyrl][RFC] VLM Training in SkyRL with the SkyRL train backend
