[skyrl][tinker] Use VLLMRenderer in SkyRL train backend #1496

nithinvc wants to merge 2 commits into NovaSky-AI:main
Conversation
Code Review
This pull request integrates VLLMRenderer and TensorList into the SkyRLTrainBackend to support multi-modal data processing. Key changes include updating _to_training_batch to utilize the new renderer, handling vision-related tensors (pixel_values, image_grid_thw), and adjusting the sequence of engine sleep calls. Feedback highlights a potential ValueError when mixing multi-modal and text-only inputs due to inconsistent batch sizes, concerns regarding the renderer's dependency on active inference engines after reordering sleep calls, and performance overhead from using asyncio.run in the training hot path.
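The batch-size concern above can be illustrated with a small standalone sketch (illustrative only, not code from this PR): if vision tensors are gathered only from multi-modal samples, a mixed batch yields fewer `pixel_values` entries than `input_ids` rows, which is the inconsistency that would surface as a `ValueError` during collation.

```python
# Illustrative sketch of the mixed-batch inconsistency flagged in the review.
# Field names mirror the PR; the batch structure here is hypothetical.
batch = [
    {"input_ids": [1, 2, 3], "pixel_values": [[0.1, 0.2]]},  # multi-modal sample
    {"input_ids": [4, 5]},                                   # text-only sample
]

input_ids = [sample["input_ids"] for sample in batch]
pixel_values = [sample["pixel_values"] for sample in batch if "pixel_values" in sample]

# One vision entry for two samples: downstream collation that expects either
# zero entries or one per sample would raise a ValueError on this batch.
mismatch = 0 < len(pixel_values) < len(input_ids)
print(mismatch)  # True
```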
```python
if self._renderer is None:
    self._ensure_inference_engines()
    self._renderer = VLLMRenderer(self._inference_engine_client, self._cfg.trainer.policy.model.path)
rendered_inputs = asyncio.run(self._renderer(prepared_batch.all_model_inputs))
```
Using asyncio.run inside _to_training_batch introduces significant overhead because it creates and tears down a new event loop for every call. Since _to_training_batch is called on every training step, this can impact performance.
Consider using a persistent event loop or an alternative approach to bridge the synchronous backend methods with the asynchronous renderer client, especially since this is a hot path during training.
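One way to avoid per-call event-loop churn is to keep a single loop alive on a background thread for the backend's lifetime and submit coroutines to it with `asyncio.run_coroutine_threadsafe`. A minimal sketch, where `AsyncBridge` and `render_async` are hypothetical stand-ins for the backend wiring and the renderer call:

```python
import asyncio
import threading

class AsyncBridge:
    """Runs one persistent event loop on a background thread and lets
    synchronous code block on coroutines submitted to it."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro):
        # Submit to the persistent loop; block until the result is ready.
        return asyncio.run_coroutine_threadsafe(coro, self._loop).result()

    def close(self):
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()

async def render_async(inputs):
    # Hypothetical stand-in for the async renderer client.
    return [f"rendered:{x}" for x in inputs]

bridge = AsyncBridge()
out = bridge.run(render_async(["a", "b"]))  # ["rendered:a", "rendered:b"]
bridge.close()
```

Compared with `asyncio.run`, this pays the loop-creation cost once instead of on every training step, at the price of owning the bridge's lifecycle (the `close` call) in the backend.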
This is good, and in general I think the entire backend should move to a single persistent event loop. However, the scope of this PR is limited to correctness for now, unless requested.
Summary
- Integrates the `VLLMRenderer` (landed in #1464) into the SkyRL train backend so that VLM training batches include image placeholder tokens and decoded vision tensors (`pixel_values`, `image_grid_thw`).
- With the new inference path enabled (`_SKYRL_USE_NEW_INFERENCE`), `_to_training_batch` lazily creates a `VLLMRenderer` and renders all `ModelInput`s through it.
- Extracts `pixel_values` and `image_grid_thw` from rendered outputs and adds them to the `TrainingInputBatch` as `TensorList` entries (one tensor per batch element, since patch counts vary per image).
- Updates `_pad_batch` to handle `TensorList` fields by cycling and cloning entries, matching the existing padding strategy for regular tensors.
- Reorders `forward_backward` and `forward` to call `_to_training_batch` before `_sleep_inference_engines`, since the renderer needs the inference servers to be initialized. Note that this does not wake the KV cache or model GPU memory, since that is explicitly done in `save_weights_for_sampler` via the dispatcher.

E2E Tinker VLM Classifier Curves
With #1484, we can now run Tinker vision-language recipes against SkyRL. Merging both closes #1200.
Example:

```shell
_SKYRL_USE_NEW_INFERENCE=1 uv run --extra tinker --extra fsdp -m skyrl.tinker.api \
  --base-model "Qwen/Qwen3-VL-8B-Instruct" \
  --backend fsdp \
  --backend-config '{"trainer.placement.policy_num_gpus_per_node": 8, "generator.inference_engine.num_engines": 8, "trainer.placement.colocate_all": true, "trainer.use_sample_packing": false}'
```

Cookbook:
```shell
TINKER_API_KEY=tml-dummy uv run --with tinker --with datasets --with torch python -m \
  tinker_cookbook.recipes.vlm_classifier.train \
  base_url=http://localhost:8000 \
  model_name="Qwen/Qwen3-VL-4B-Instruct" \
  dataset=caltech101
```

Train nll:
*(train nll curve image)*

Val nll:

*(val nll curve image)*

Val accuracy:

*(val accuracy curve image)*