I implemented the Qwen3-VL model and ran grpo_classification.py. However, my forward rewards are always 0, even though I did not modify the format_reward function.