[docs][example] VLM Examples#1531
SumanthRH left a comment:
Overall LGTM. Next time around, let's break such changes up into smaller PRs. Have you run the dataset preparation scripts end-to-end? Let's make sure we do that.
The YAML is legacy and will be deleted soon. Revert?
```python
print(f"Saved full training set ({len(train_dataset)} examples) to {train_parquet_path}")

# Process and save the val split
if "val" in dataset:
```
The split name should be `validation`, not `val`, according to https://huggingface.co/datasets/hiyouga/geometry3k/viewer/default/validation.
```python
# Process and save the val split
if "val" in dataset:
    val_dataset = dataset["val"]
```
The split name is `validation`, not `val`, according to https://huggingface.co/datasets/hiyouga/geometry3k/viewer/default/validation.
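A minimal sketch of the corrected split handling (a hypothetical helper, not the PR's actual code; it assumes `dataset` is a dict-like of splits as returned by `datasets.load_dataset`, and `process_fn` stands in for the script's per-example processing):

```python
def save_val_split(dataset, process_fn):
    # The hub dataset hiyouga/geometry3k names its split "validation",
    # not "val", so key on that name.
    if "validation" in dataset:
        return [process_fn(example) for example in dataset["validation"]]
    return None
```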
```python
    return None


def grade_answer_verl(solution_str: str, ground_truth: str) -> bool:
```
Let's instead call this `grade_answer_from_boxed` and add the reference used (looks like it's VERL) as a comment.
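A minimal sketch of what the renamed grader could look like (a simplified stand-in, not the actual VERL implementation; the exact extraction and normalization logic are assumptions):

```python
import re


def grade_answer_from_boxed(solution_str: str, ground_truth: str) -> bool:
    # Adapted from the \boxed{} extraction approach used in VERL's math
    # grading utilities (simplified here: no nested braces, plain string
    # equality after stripping whitespace).
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution_str)
    if not matches:
        return False
    # Grade against the last boxed span, i.e. the final committed answer.
    return matches[-1].strip() == ground_truth.strip()
```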
```bash
# Algorithm
trainer.algorithm.advantage_estimator="grpo" \
trainer.algorithm.use_kl_loss=false \
generator.n_samples_per_prompt=8 \
```
This snippet says `n_samples_per_prompt=8`, but the actual script at `examples/train/geometry3k/run_geometry3k.sh` says `n_samples_per_prompt=4`.
Fixed; both are now 4 samples per prompt.
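For reference, the aligned snippet would read (reproduced from the diff excerpt above with only the sample count changed; this is a config fragment, not a runnable script):

```bash
# Algorithm
trainer.algorithm.advantage_estimator="grpo" \
trainer.algorithm.use_kl_loss=false \
generator.n_samples_per_prompt=4 \
```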
**Local vLLM source override required (temporary).** VLM training needs a newer vLLM than the `vllm==0.19.0` pinned in the root `pyproject.toml`. Until the next vLLM release ships with the multimodal rendering support used by SkyRL's new inference stack, clone vLLM locally and point uv at it by adding one line under `[tool.uv.sources]` in the repo root `pyproject.toml`:
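A sketch of the override described above (the local checkout path is a placeholder; check out a vLLM commit with the required multimodal rendering support before pointing uv at it):

```toml
# Added under [tool.uv.sources] in the repo root pyproject.toml.
# /path/to/vllm is a placeholder for your local clone.
[tool.uv.sources]
vllm = { path = "/path/to/vllm", editable = true }
```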
Can you specify the exact commit required? It looks like it needs to be after 80b1823
Yes, the "validation" key got past me, since the script's training run uses the test split.
Summary

Adds two end-to-end multi-turn VLM RL examples (Geometry-3K and VisGym) along with a Vision-Language RL tutorial that documents the shared VLM setup (flags, dataset record shape, local vLLM override). Also wires the `SkyRLVLMGymGenerator` from #1486 into the main entrypoint behind a config flag so VLM runs can be launched end-to-end from `ppo_base_config.yaml`.

- Geometry-3K (`examples/train/geometry3k/`) — multi-turn GRPO on hiyouga/geometry3k with `Qwen/Qwen3-VL-8B-Instruct`. Up to 3 turns per episode; the model checks candidate answers with a `calc_score` tool before committing to a final `\boxed{}` answer. Binary reward.
- VisGym (`examples/train/visgym/`) — multi-image multi-turn RL where every env step returns a new image observation. Two recipes:
  - `run_visgym_from_instruct.sh` — vanilla `Qwen3-VL-8B-Instruct`, keyword actions, task-only reward, KL on.
  - `run_visgym_from_sft.sh` — starts from a structured `<observation>`/`<justification>`/`<action>` SFT checkpoint with tuple actions and a mixed task+format reward.
- Docs — `tutorials/vision_language_rl.mdx` (shared VLM setup, required flags, dataset shape, support matrix) and example pages for each recipe under `examples/`. Docs pages include reward curves and a VisGym rollout GIF.
- `generator.vision_language_generator` config flag — when true, `main_base.py` constructs `SkyRLVLMGymGenerator` instead of `SkyRLGymGenerator`. Defaults to false; no behavior change for existing runs.
- `mm_token_type_ids` shim (`model_wrapper.py`) — transformers v5 expects `mm_token_type_ids` to be populated at tokenization to distinguish text vs. multimodal tokens, but vLLM doesn't support transformers v5 yet and doesn't return them. Populate here from `image_token_id` when images are present and the field is missing. Remove once vLLM ships transformers v5 support.

Test plan

- `bash examples/train/geometry3k/run_geometry3k.sh` trains end-to-end on 1×8×H100; reward curve matches `docs/public/images/examples/geometry3k_reward.png`.
- `bash examples/train/visgym/run_visgym_from_instruct.sh` trains end-to-end on 1×8×H100; reward curve matches `docs/public/images/examples/visgym_maze2d_reward.png`.
- `MODEL_PATH=/path/to/sft_ckpt bash examples/train/visgym/run_visgym_from_sft.sh` trains end-to-end.
- Existing non-VLM runs are unaffected (`vision_language_generator: false` is the default).
- Docs build cleanly: `cd docs && npm run build`.
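The `mm_token_type_ids` shim described in the summary could look roughly like this (a hypothetical sketch; the function name, dict layout, and `has_images` flag are assumptions, not the actual `model_wrapper.py` code):

```python
def ensure_mm_token_type_ids(model_inputs: dict, image_token_id: int) -> dict:
    # transformers v5 expects mm_token_type_ids (1 = multimodal token,
    # 0 = text token) to be populated at tokenization time. vLLM does not
    # return them yet, so derive them from the image token id when images
    # are present and the field is missing.
    if "mm_token_type_ids" not in model_inputs and model_inputs.get("has_images"):
        model_inputs["mm_token_type_ids"] = [
            1 if token_id == image_token_id else 0
            for token_id in model_inputs["input_ids"]
        ]
    return model_inputs
```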