[docs][example] VLM Examples#1531
SumanthRH left a comment:
Overall LGTM. Next time around, let's break such changes up into smaller PRs. Have you run the dataset preparation scripts end-to-end? Let's make sure we do that.
The YAML is legacy and will be deleted soon. Revert?
```python
print(f"Saved full training set ({len(train_dataset)} examples) to {train_parquet_path}")

# Process and save the val split
if "val" in dataset:
```
The split name should be `validation`, not `val`, according to https://huggingface.co/datasets/hiyouga/geometry3k/viewer/default/validation.
```python
# Process and save the val split
if "val" in dataset:
    val_dataset = dataset["val"]
```
The split name is `validation`, not `val`, according to https://huggingface.co/datasets/hiyouga/geometry3k/viewer/default/validation.
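A minimal sketch of the corrected split handling (a hypothetical helper, not the PR's actual code; it assumes `dataset` is a dict-like of splits as returned by `datasets.load_dataset`, and `process_fn` stands in for the script's per-example processing):

```python
def save_val_split(dataset, process_fn):
    # The hub dataset hiyouga/geometry3k names its split "validation",
    # not "val", so key on that name.
    if "validation" in dataset:
        return [process_fn(example) for example in dataset["validation"]]
    return None
```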
```python
    return None


def grade_answer_verl(solution_str: str, ground_truth: str) -> bool:
```
Let's instead call this `grade_answer_from_boxed` and add the reference used (looks like it's VERL) as a comment.
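A minimal sketch of what the renamed grader could look like (a simplified stand-in, not the actual VERL implementation; the exact extraction and normalization logic are assumptions):

```python
import re


def grade_answer_from_boxed(solution_str: str, ground_truth: str) -> bool:
    # Adapted from the \boxed{} extraction approach used in VERL's math
    # grading utilities (simplified here: no nested braces, plain string
    # equality after stripping whitespace).
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution_str)
    if not matches:
        return False
    # Grade against the last boxed span, i.e. the final committed answer.
    return matches[-1].strip() == ground_truth.strip()
```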
```bash
# Algorithm
trainer.algorithm.advantage_estimator="grpo" \
trainer.algorithm.use_kl_loss=false \
generator.n_samples_per_prompt=8 \
```
This snippet says `n_samples_per_prompt=8`, but the actual script at `examples/train/geometry3k/run_geometry3k.sh` says `n_samples_per_prompt=4`.
Fixed; both are now 4 samples per prompt.
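For reference, the aligned snippet would read (reproduced from the diff excerpt above with only the sample count changed; this is a config fragment, not a runnable script):

```bash
# Algorithm
trainer.algorithm.advantage_estimator="grpo" \
trainer.algorithm.use_kl_loss=false \
generator.n_samples_per_prompt=4 \
```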
**Local vLLM source override required (temporary).** VLM training needs a newer vLLM than the `vllm==0.19.0` pinned in the root `pyproject.toml`. Until the next vLLM release ships with the multimodal rendering support used by SkyRL's new inference stack, clone vLLM locally and point uv at it by adding one line under `[tool.uv.sources]` in the repo root `pyproject.toml`:
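A sketch of the override described above (the local checkout path is a placeholder; check out a vLLM commit with the required multimodal rendering support before pointing uv at it):

```toml
# Added under [tool.uv.sources] in the repo root pyproject.toml.
# /path/to/vllm is a placeholder for your local clone.
[tool.uv.sources]
vllm = { path = "/path/to/vllm", editable = true }
```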
Can you specify the exact commit required? It looks like it needs to be after 80b1823
Yes, the "validation" key got past me, since the script's training run uses the test split.
Summary

Adds two end-to-end multi-turn VLM RL examples (Geometry-3K and VisGym) along with a Vision-Language RL tutorial that documents the shared VLM setup (flags, dataset record shape, local vLLM override). Also wires the `SkyRLVLMGymGenerator` from #1486 into the main entrypoint behind a config flag so VLM runs can be launched end-to-end from `ppo_base_config.yaml`.

- Geometry-3K (`examples/train/geometry3k/`) — multi-turn GRPO on hiyouga/geometry3k with `Qwen/Qwen3-VL-8B-Instruct`. Up to 3 turns per episode; the model checks candidate answers with a `calc_score` tool before committing to a final `\boxed{}` answer. Binary reward.
- VisGym (`examples/train/visgym/`) — multi-image multi-turn RL where every env step returns a new image observation. Two recipes:
  - `run_visgym_from_instruct.sh` — vanilla `Qwen3-VL-8B-Instruct`, keyword actions, task-only reward, KL on.
  - `run_visgym_from_sft.sh` — starts from a structured `<observation>`/`<justification>`/`<action>` SFT checkpoint with tuple actions and a mixed task+format reward.
- Docs — `tutorials/vision_language_rl.mdx` (shared VLM setup, required flags, dataset shape, support matrix) and example pages for each recipe under `examples/`. Docs pages include reward curves and a VisGym rollout GIF.
- `generator.vision_language_generator` config flag — when true, `main_base.py` constructs `SkyRLVLMGymGenerator` instead of `SkyRLGymGenerator`. Defaults to false; no behavior change for existing runs.
- `mm_token_type_ids` shim (`model_wrapper.py`) — transformers v5 expects `mm_token_type_ids` to be populated at tokenization to distinguish text vs. multimodal tokens, but vLLM doesn't support transformers v5 yet and doesn't return them. Populate here from `image_token_id` when images are present and the field is missing. Remove once vLLM ships transformers v5 support.

Test plan

- `bash examples/train/geometry3k/run_geometry3k.sh` trains end-to-end on 1×8×H100; reward curve matches `docs/public/images/examples/geometry3k_reward.png`.
- `bash examples/train/visgym/run_visgym_from_instruct.sh` trains end-to-end on 1×8×H100; reward curve matches `docs/public/images/examples/visgym_maze2d_reward.png`.
- `MODEL_PATH=/path/to/sft_ckpt bash examples/train/visgym/run_visgym_from_sft.sh` trains end-to-end.
- Existing non-VLM runs are unaffected (`vision_language_generator: false` is the default).
- Docs build cleanly: `cd docs && npm run build`.
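The `mm_token_type_ids` shim described in the summary could look roughly like this (a hypothetical sketch; the function name, dict layout, and `has_images` flag are assumptions, not the actual `model_wrapper.py` code):

```python
def ensure_mm_token_type_ids(model_inputs: dict, image_token_id: int) -> dict:
    # transformers v5 expects mm_token_type_ids (1 = multimodal token,
    # 0 = text token) to be populated at tokenization time. vLLM does not
    # return them yet, so derive them from the image token id when images
    # are present and the field is missing.
    if "mm_token_type_ids" not in model_inputs and model_inputs.get("has_images"):
        model_inputs["mm_token_type_ids"] = [
            1 if token_id == image_token_id else 0
            for token_id in model_inputs["input_ids"]
        ]
    return model_inputs
```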