[Roadmap] RL-Kernel Roadmap Q3 - Q4

# RL-Kernel Roadmap (2026)

We have organized RL-Kernel's goals for the second half of 2026 into several focus areas. Discussion happens on our Discord.

Repo: https://github.com/RL-Align/RL-Kernel

Discord: https://discord.com/invite/5HfkFjmPD

Hardware: NVIDIA CUDA | AMD ROCm | domestic accelerator

RL-Kernel is a micro-level RL operator library. We do not own RL scheduling frameworks; verl and slime handle the macro dataflow. Our job is to replace inefficient PyTorch paths around loss, KL, log-prob, variable-length packing, and rollout/training consistency with aggressively fused kernels, and to keep those kernels numerically aligned across rollout engines such as vLLM/sglang and training engines such as Megatron/DeepSpeed/FSDP.

## Roadmap Rules

Priorities are intentionally uneven:

1. P0: benchmark/CI/docs foundation, the three P0 showcase kernels, and train-inference consistency tooling.
2. P1: close one end-to-end loop and jointly validate rollout vs training consistency on top of the P0 tooling and kernels.
3. P2: large-scale kernel development across ROCm, CUDA, and Triton, including the previous fused-operator backlog.
4. P3: domestic-accelerator backend and remaining research/expansion work.

An item is considered done only when it has:

1. a code path in the repository;
2. focused tests or a reproducible smoke test;
3. a benchmark, validation report, or documented usage path when the item is user-facing;
4. clear fallback behavior for unsupported hardware or missing optional dependencies.

## Current Foundation

These pieces are already landed and form the base for the next phase:

- [x] TMA-accelerated online-softmax fused LogP for SM90+ (#26)
- [x] CUDA fused logp variants for GRPO training (#5, #33)
- [x] selected-logprob reference path for current/old/ref policy testing (#28)
- [x] minimal RL batch schema and synthetic GRPO fixtures for kernel validation (#27)
- [x] Prefix-Shared Fused Attention and same-prompt multi-output KV reuse (#65, #71)
- [x] vLLM shared-prefix rollout sampler (#34)
- [x] fused sampling op (#3)
- [x] hardware-aware context for ROCm and CUDA, plus operator HAL wrappers (#1, #36)
- [x] pure Triton FlashAttention dense fallback
- [x] Ray actor management, DeepSpeed training worker, and zero-copy weight sync bridge (#17, #51, #60)
- [x] production GRPO objective inputs and reductions (#67)
- [x] RL-shaped fused_logp benchmark suite and minimal loss-step tests (#29, #30)
- [x] automated profiling suite and VRAM baseline with TRL comparison (#19, #79)
- [x] minimal reproducible single-GPU GRPO training script (#76, #80)
- [x] MkDocs documentation skeleton and GitHub Actions lint/test workflow (#22, #31)

## P0: Foundation, Showcase Kernels, and Consistency Tooling

Goal: lock the things everything else depends on: reproducible benchmark/CI/doc infrastructure, three high-value P0 kernels across CUDA/ROCm/Triton, and train-inference consistency tooling.

### P0.1 Benchmark, CI, and Docs

- [ ] H100 SXM5 benchmark results alongside A100 baseline (#75)
- [ ] hardware benchmark dashboard and markdown metrics (#20)
- [ ] end-to-end observability with NVTX tracing first, Prometheus metrics second (#72)
- [ ] train-inference consistency regression CI with a per-PR KL/logprob-drift assertion
- [ ] migration guide from TRL / DeepSpeed-Chat to RL-Kernel (#73)
- [ ] onboarding tutorials for NVIDIA and AMD users and source-build contributors (#23)
- [ ] FAQ covering common installation, CUDA, and hardware issues (#74)
- [ ] cross-platform Docker matrix and CI workflows for hardware regression and automated tests

### P0.2 Typical Fused Operators on CUDA / ROCm / Triton

The core fused operators that must exist and be numerically correct on all three backends before scale-out work begins.

- [ ] Fused CE LogProb without materializing logits + fused backward. cc @KJLdefeated 
      Avoid materializing large `[B, S, V]` logits for large-vocabulary models, so the saved memory can be used for larger batches or longer CoT responses. The forward path streams vocab blocks with online softmax; the backward path recomputes tiles to trade compute for memory instead of storing full logits/probabilities.
      - CUDA: use SM90 TMA/WGMMA-oriented streaming where available, with SM80 cp.async/mma.sync fallback; keep online-softmax state in registers and shared memory.
      - ROCm: use wavefront=64-aware reductions and LDS layouts, replacing TMA with manual double buffering to cover CDNA limitations.
      - Triton: provide a portable semantic baseline and tolerance test target for CUDA/ROCm native implementations.

- [ ] Fused FlashAttention with causal mask, varlen packing, and exported attention LSE.
      Target long-context RL workloads with packed variable-length batches. Export attention softmax LSE for backward, diagnostics, and rollout/training attention alignment. The exported LSE is attention-domain LSE, not vocab-logprob LSE.
      - CUDA: SM90 WGMMA + TMA path, SM80 mma.sync fallback, with varlen and LSE export.
      - ROCm: MFMA-based kernel with 16x16x16-style tiling, CK comparison, and RL-Kernel-specific LSE/varlen semantics.
      - Triton: extend the existing dense fallback with LSE export and varlen support as the cross-platform semantic baseline.

- [ ] Batch-Invariant Deterministic LogProb.
      Lock selected-logprob reduction order so the same sequence does not drift with batch size, chunked prefill, prefix-cache mode, or packing layout. This is the core P0 operator for preventing KL drift caused by rollout/training logprob inconsistency. 
      - CUDA: avoid atomicAdd, fix the tree-reduction topology and block partition, and keep per-row reduction independent of surrounding batch shape. (#96)
      - ROCm: use deterministic wavefront=64 reductions and connect the result to the ROCm-to-CUDA parity suite.
      - Triton: disable autotune, lock configs, and validate that reduction behavior is stable for the supported BLOCK_SIZE set.
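The streaming forward pass described for the Fused CE LogProb item can be illustrated in plain Python. This is a minimal sketch of the online-softmax idea only, not the CUDA/ROCm/Triton kernel: `hidden`, `lm_head`, `block_size`, and both function names are hypothetical, and the per-block logits are computed on the fly to mimic never holding the full vocabulary row.

```python
import math

def dot(a, b):
    """Plain sequential dot product (stands in for one logit of the LM head)."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc

def reference_selected_logprob(hidden, lm_head, target):
    """Baseline: materialize every logit, then take log-softmax at `target`."""
    logits = [dot(hidden, row) for row in lm_head]
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[target] - lse

def streaming_selected_logprob(hidden, lm_head, target, block_size=2):
    """Online softmax: visit vocab blocks once, keeping only a running max,
    a running rescaled sum, and the target logit."""
    running_max = float("-inf")
    running_sum = 0.0
    target_logit = None
    for start in range(0, len(lm_head), block_size):
        # Only this block's logits exist at any point in time.
        block = [dot(hidden, row) for row in lm_head[start:start + block_size]]
        if start <= target < start + len(block):
            target_logit = block[target - start]
        new_max = max(running_max, max(block))
        # Rescale the previous sum to the new max before accumulating.
        running_sum *= math.exp(running_max - new_max)
        for x in block:
            running_sum += math.exp(x - new_max)
        running_max = new_max
    return target_logit - (running_max + math.log(running_sum))

hidden = [0.5, -1.0, 2.0]
lm_head = [[0.1, 0.2, 0.3], [1.0, -0.5, 0.25], [-0.7, 0.4, 0.9],
           [0.0, 0.0, 1.5], [2.0, 1.0, -1.0]]
ref = reference_selected_logprob(hidden, lm_head, target=3)
stream = streaming_selected_logprob(hidden, lm_head, target=3, block_size=2)
print(abs(ref - stream))
```

The two results agree to floating-point tolerance rather than bitwise, because the rescaling changes the accumulation order. That gap is the reason the Triton baseline is described as a tolerance test target.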

### P0.3 Train-Inference Consistency Tooling

Goal: make "same model, same sequence, same policy state" produce aligned log-probs across rollout and training engines. This is the main technical identity of RL-Kernel.

- [ ] [RFC] batch-invariant RL kernel suite: same sequence must produce the same output across batch size, chunked prefill, prefix cache on/off, and padding layout (#101)
- [ ] end-to-end log-prob cross-benchmark tool comparing rollout engines with training engines (#106)
- [ ] layer-wise hidden-state alignment probe to locate the first operator where drift appears
- [ ] TP-invariant reductions for FSDP(TP=1) vs TP>1 rollout/training parity (#102)
- [ ] FP16 rollout/training numerical path with explicit rounding and tolerance policy
- [ ] audit GRPO loss reduction semantics and validate at production scale (#64)
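Why batch invariance needs a locked reduction order can be shown with ordinary floating-point addition. The sketch below is illustrative only: `shape_dependent_sum` models a hypothetical heuristic that changes its split factor with batch size, and `batch_invariant_sum` models a fixed split. Neither is an RL-Kernel API.

```python
def chunked_sum(values, chunk):
    """Sum each chunk sequentially, then combine the partials in order.
    Explicit loops keep the accumulation order exactly as written."""
    partials = []
    for start in range(0, len(values), chunk):
        acc = 0.0
        for v in values[start:start + chunk]:
            acc += v
        partials.append(acc)
    total = 0.0
    for p in partials:
        total += p
    return total

def shape_dependent_sum(row, batch_size):
    """Hypothetical heuristic: the split factor changes with batch size."""
    chunk = len(row) if batch_size == 1 else 2
    return chunked_sum(row, chunk)

def batch_invariant_sum(row, batch_size, chunk=2):
    """The split is fixed, so the row's result ignores the batch around it."""
    return chunked_sum(row, chunk)

row = [1.0, 1e16, -1e16, 1.0]
print(shape_dependent_sum(row, 1), shape_dependent_sum(row, 8))  # 1.0 0.0
print(batch_invariant_sum(row, 1), batch_invariant_sum(row, 8))  # 0.0 0.0
```

The fixed split is not more accurate here. It is only stable, which is the property the batch-invariant suite asserts.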

Note: The consistency tooling listed above is refined and rescheduled under the P0.3+ Sprint workstreams. Specifically: the cross-benchmark tool (#106) → WS3; the layer-wise drift probe → WS4; TP-invariant reductions (#102) → WS2 (#109); the dtype/tolerance policy → WS1 (#154) + WS2 (#116). P0.3 lists them by tool type; the Sprint section lists them by workstream and execution order.

### Exit Criteria

1. Benchmarks publish hardware, model, dtype, shape, baseline, and command lines, and CI protects the agreed numerical contract.
2. The three P0 showcase kernels have documented CUDA/ROCm/Triton semantics, runnable reference paths, and tolerance tests.
3. A contributor can run one command that compares rollout and training log-probs on a fixed model, prompt set, and seed; failures report the first divergent layer/operator.

## P0.3+ Sprint: Operator-Level Train-Inference Consistency (Demo-grade → Usable-grade)
Final goal (the full vision): On a standard Transformer architecture, under real multi-GPU training, with the real vLLM rollout engine and the real Megatron/FSDP training engine, prove that the same sequence produces aligned (bitwise / tight-tolerance) logprobs across both, and that the alignment holds across batch size, parallelism config, and padding. In parallel, integrate RL-Kernel operators into vime as the operator-level consistency layer beneath its framework-level alignment.
This month's scope: WS1 + WS2 + WS5. WS3 (real-engine alignment) and WS4 (diagnostics & reproducibility) are deferred to the next phase. 

This month proves operator-level consistency on single-GPU and multi-GPU (our op chain is self-consistent) and scopes a vime integration path. It does not prove real vLLM == real Megatron alignment, which requires WS3.
Out of scope this month: MoE, linear attention, FP8, ROCm / domestic chips.

Definition of "usable-grade" (the full bar):
1. A full batch-invariant forward chain (RMSNorm + matmul + attention + logprob), not isolated ops.
2. Consistency holds under TP>1 and across mismatched rollout/training parallelism.
3. The cross-benchmark tool (#106) compares real vLLM output vs real Megatron output (not a simulated forward) and reports drift reduced to zero.
4. A layer-wise probe locates the first divergent operator when drift appears.

This month targets items 1–2 (WS1 + WS2). Items 3–4 require WS3/WS4 and are the next-phase target.

Task granularity: Each Workstream below is an epic (one tracking issue). Each - [ ] item under a Workstream is a direction that becomes its own GitHub issue, not a single PR. Once an issue is picked up, the owner breaks it into 1 or N PRs and lists the planned PRs in the issue description. Rule of thumb: one independently reviewable, independently mergeable change completable in 1–3 days = one PR; otherwise split further. Granularity is uneven by design: items like Per-PR CI or Positioning doc may be a single PR, while vLLM rollout integration, matmul, and #106 are large issues that split into 5–6 PRs each.

### Workstream 1 — Full Batch-Invariant Forward Chain
The foundation. Ops are mutually independent and can be built in parallel; each needs its own ground-truth and batch-config sweep.
- [ ] Ground-truth harness + numerical contract (highest priority; all other ops depend on it): deterministic reference for every op on a fixed standard-Transformer model (e.g. Llama-3-8B / Qwen3-8B dense) + fixed prompt set + seed; define the tolerance policy (bitwise where feasible, tight-tolerance otherwise); produce a concrete per-dtype pass/fail threshold table (e.g. max abs logprob diff, KL bound) used as the single source of truth for "aligned" across WS2 (and WS3/WS4 in the next phase). #108  cc @maxiaosong1124 @a-kaa 
- [ ]  Batch-invariant RMSNorm (forward + backward): confirm/rebuild #38 as batch-invariant; lock reduction order. #145 cc @EthanZero2Hero 
- [ ]  Batch-invariant matmul / GEMM: cuBLAS cannot guarantee batch-invariance and split-k breaks it; either write a deterministic GEMM or integrate a DeepGEMM-style deterministic matmul. Highest technical risk in WS1. #146 cc @Flink-ddd 
- [ ]  Batch-invariant attention (standard softmax): avoid split-KV; single-SM whole-sequence or dual-kernel with identical accumulation order. #147 cc @maxiaosong1124 @EthanZero2Hero 
- [ ]  Batch-invariant logprob (build on #96): selected-logprob with locked reduction. #148 cc @Flink-ddd 
- [ ]  Batch-invariant elementwise / RoPE pass-through audit: confirm pointwise ops and RoPE do not reintroduce drift. #149 cc @a-kaa 
- [ ] Chain integration test (WS1 exit gate): assemble all ops into one end-to-end forward and backward pass; assert the full-chain output and gradients are invariant across batch=1/N, chunked-prefill on/off, and padding layouts. #150 cc @maxiaosong1124 @EthanZero2Hero @Flink-ddd 
- [ ] Batch-invariant embedding + LM head projection: confirm the input embedding lookup and the final vocab projection (a large matmul, directly upstream of logprob) also run on the batch-invariant path; drift here propagates straight into logprob. #151 cc @inaniloquentee 
- [ ] KV-cache path consistency: ensure prefill and decode produce the same reductions; cover the decode-stage path explicitly, not only chunked-prefill (a classic rollout↔training drift source). #152 cc @zhangj1an
- [ ] Backward-pass consistency (not only forward): each op's batch-invariant validation must cover backward as well — drifting gradient reductions break training even when forward is aligned. #153 cc @maxiaosong1124 @EthanZero2Hero @Flink-ddd 
- [ ] dtype coverage in the numerical contract: pin which dtypes are tested (BF16 mandatory for RL training; FP32 reference); FP8 explicitly out of scope this month. All ops validate against the same pinned dtype set. #154 cc @maxiaosong1124 @EthanZero2Hero @Flink-ddd 
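The per-dtype pass/fail threshold table from the ground-truth harness item could be consumed roughly as follows. This is a sketch under stated assumptions: the numbers in `THRESHOLDS` are placeholders chosen for illustration (the real table is the deliverable of #108), and `check_alignment` and its report keys are hypothetical names.

```python
# Placeholder thresholds for illustration only; the real table comes from #108.
THRESHOLDS = {
    "fp32": {"max_abs_logprob_diff": 1e-5, "kl_bound": 1e-6},
    "bf16": {"max_abs_logprob_diff": 1e-2, "kl_bound": 1e-3},
}

def check_alignment(rollout_logprobs, train_logprobs, dtype):
    """Compare two selected-logprob streams scored on the same token sequence."""
    if len(rollout_logprobs) != len(train_logprobs):
        raise ValueError("both streams must score the same tokens")
    limits = THRESHOLDS[dtype]
    max_abs = 0.0
    kl_est = 0.0
    for r, t in zip(rollout_logprobs, train_logprobs):
        max_abs = max(max_abs, abs(r - t))
        # Sampled-token KL estimate: mean of (rollout logprob - training logprob).
        kl_est += r - t
    kl_est /= len(rollout_logprobs)
    return {
        "dtype": dtype,
        "max_abs_logprob_diff": max_abs,
        "kl_estimate": kl_est,
        "bitwise_equal": max_abs == 0.0,
        "passed": (max_abs <= limits["max_abs_logprob_diff"]
                   and abs(kl_est) <= limits["kl_bound"]),
    }

rollout = [-0.5, -1.25, -2.0]
train = [-0.5, -1.2504, -2.0]
print(check_alignment(rollout, train, "bf16")["passed"])
print(check_alignment(rollout, train, "fp32")["passed"])
```

Keeping the table in one place lets every op and every WS2 test judge "aligned" against the same numbers instead of each test choosing its own tolerance.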

### Workstream 2 — Distributed / Parallelism Invariance
Real training is multi-GPU; without this layer, WS1 does not hold in real scenarios.
- [ ] TP-invariant reductions ([[WS2] TP-invariant reductions (single-GPU vs multi-GPU accumulation order) #109](https://github.com/RL-Align/RL-Kernel/issues/109), refines [#102](https://github.com/RL-Align/RL-Kernel/issues/102)): make TP=1 and TP>1 produce identical reduction order for the WS1 ops. cc @a-kaa 
- [ ]  SP-aware logprob / loss reductions ([[WS2] SP-aware logprob / loss reductions (long-sequence sharding) #110](https://github.com/RL-Align/RL-Kernel/issues/110), related to [#49](https://github.com/RL-Align/RL-Kernel/issues/49)): sequence-parallel consistency, including SP-aware attention. cc @a-kaa 
- [ ]  Cross-config alignment ([[WS2] Cross-config alignment (mismatched rollout/training parallelism) #111](https://github.com/RL-Align/RL-Kernel/issues/111)): rollout often runs one parallelism (e.g. TP=2) while training runs another (e.g. FSDP). Prove logprob aligns across mismatched parallelism configs, not only same-config. The most difficult and schedule-critical item in WS2. cc @Flink-ddd 
- [ ] Deterministic NCCL all-reduce, incl. DP gradient all-reduce ([[WS2] Deterministic NCCL all-reduce (incl. DP gradient all-reduce) #112](https://github.com/RL-Align/RL-Kernel/issues/112)): audit all-reduce ordering; use NVLink SHARP in-switch deterministic reductions on Hopper (CUDA 12.8+) where available; use a deterministic fallback where not; make cross-DP-rank gradient all-reduce order deterministic as a distinct scenario. Independent of WS1 ops, so it can start in W1. cc @inaniloquentee
- [ ]  Distributed chain test ([[WS2] Distributed chain test #113](https://github.com/RL-Align/RL-Kernel/issues/113)): re-run the WS1 chain test under TP>1 and mismatched configs, covering forward + backward; add to CI. cc @Flink-ddd 
- [ ]  RNG / randomness-source alignment ([[WS2] RNG / randomness-source alignment #114](https://github.com/RL-Align/RL-Kernel/issues/114)): unify and verify RNG state (dropout masks, sampling seeds) across rollout and training so apparent "operator drift" is not actually an unaligned random source; wire as a precondition into the distributed chain test (#113). cc @Flink-ddd 
- [ ]  Tolerance contract alignment & WS2 drift report format ([[WS2] Tolerance contract alignment & WS2 drift report format #116](https://github.com/RL-Align/RL-Kernel/issues/116), depends on [#108](https://github.com/RL-Align/RL-Kernel/issues/108)): make every WS2 test judge "aligned" against the same #108 threshold table and emit drift reports in one shared format. cc @z1ying 
- [ ]  PP consistency — deferred this month ([[WS2] PP consistency — scope boundary (deferred this month) #115](https://github.com/RL-Align/RL-Kernel/issues/115)): this month covers TP/SP only; PP is deferred to the next phase. Tracking placeholder, not assigned. 
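The TP-invariant reduction item can be pictured with a single-process simulation. The sketch below only models accumulation order and makes several assumptions: ranks are simulated in a loop, `tp_invariant_dot` gathers per-block partials and combines them by global block index (a real collective would not necessarily work this way), and all names are hypothetical.

```python
def block_partials(x, w, block, start_block, end_block):
    """Partial dot products for blocks [start_block, end_block) of the K dimension."""
    out = []
    for b in range(start_block, end_block):
        acc = 0.0
        for k in range(b * block, min((b + 1) * block, len(x))):
            acc += x[k] * w[k]
        out.append(acc)
    return out

def tp_invariant_dot(x, w, tp, block=2):
    """Each simulated TP rank owns a contiguous range of fixed-size blocks.
    Partials are combined in global block order, so the result ignores `tp`."""
    num_blocks = (len(x) + block - 1) // block
    assert num_blocks % tp == 0, "sketch assumes blocks divide evenly across ranks"
    per_rank = num_blocks // tp
    gathered = []
    for rank in range(tp):
        gathered.extend(
            block_partials(x, w, block, rank * per_rank, (rank + 1) * per_rank))
    total = 0.0
    for p in gathered:  # fixed combine order: global block index
        total += p
    return total

def naive_tp_dot(x, w, tp):
    """Contrast: each rank reduces its whole shard, then shard sums are added."""
    shard = len(x) // tp
    total = 0.0
    for rank in range(tp):
        acc = 0.0
        for k in range(rank * shard, (rank + 1) * shard):
            acc += x[k] * w[k]
        total += acc
    return total

x = [1.0, 1e16, -1e16, 1.0, 3.0, 4.0, 5.0, 6.0]
w = [1.0] * 8
print([naive_tp_dot(x, w, tp) for tp in (1, 2, 4)])      # [19.0, 19.0, 18.0]
print([tp_invariant_dot(x, w, tp) for tp in (1, 2, 4)])  # [18.0, 18.0, 18.0]
```

The invariant version pins the block boundaries and the combine order independently of the TP degree, so TP=1 runs the same arithmetic as TP>1.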

### Workstream 3 — Real-Engine Alignment (critical path)
Moving from our own assembled chain to real vLLM matching real Megatron. This is the step most likely to consume the schedule buffer. All vLLM integration is non-intrusive and lives in the RL-Kernel repo using vLLM's extension points (custom op / backend plugin); we do not modify vLLM source or file PRs against the vLLM repo this month. W1 first confirms whether vLLM's extension points are sufficient for the logprob / attention hooks we need.
- [ ] Environment & infra setup (W1, blocking WS2 and WS3): stand up the shared multi-GPU environment with matched vLLM + Megatron/FSDP versions, the fixed model loadable on both engines, and TP/FSDP launchable. First infra task of the sprint (W1) — both WS2's distributed tests and WS3's engine alignment depend on it. #127 
- [ ] Input parity precheck: confirm identical tokenization, special tokens, and padding side (left/right) between vLLM and Megatron before any comparison — mismatched inputs invalidate all downstream alignment. #128 
- [ ] vLLM rollout integration: wire WS1 batch-invariant ops into vLLM's real rollout path (logprob / attention), behind a non-intrusive custom-op / optional-backend flag so pure inference is unaffected. (Shared investigation with WS5 — same probing of vLLM's rollout path.) #129 
- [ ]  Megatron / FSDP training integration: wire the same ops into the real training forward. #130 
- [ ]  End-to-end logprob cross-benchmark tool (#106): one command runs the same fixed model + prompt set + seed through real vLLM and real Megatron, dumps both logprob streams, and computes drift. This is the headline deliverable. #131 
- [ ]  Drift-to-zero validation: with batch-invariant ops enabled, real vLLM logprob matches real Megatron logprob (within tolerance); with them disabled, drift is clearly visible (control group). #132 
- [ ]  Weight-sync correctness: ensure the training/rollout weight-sync bridge does not itself introduce numerical mismatch (a known source of silent drift). #133 
- [ ] Weight / version parity precheck (hidden prerequisite for #106): confirm vLLM and Megatron load the exact same checkpoint and weight conversion before any alignment comparison — alignment is meaningless otherwise. #134 
- [ ] Sampling vs logprob decoupling: rollout samples (temperature / top-p) while training computes teacher-forcing logprob; align "logprob given the same token sequence," keeping sampling and logprob computation explicitly separated so the comparison measures the same quantity. #135 
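The sampling vs logprob decoupling item is easier to reason about with a toy example. The sketch below assumes a rollout path that reports logprobs of the temperature-scaled sampling distribution, which is one way the two sides can end up measuring different quantities. The logits, tokens, and function names are all hypothetical.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Log-probabilities of softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - lse for x in scaled]

def teacher_forced_logprobs(step_logits, tokens, temperature=1.0):
    """Score a fixed token sequence; no sampling happens here."""
    return [log_softmax(logits, temperature)[tok]
            for logits, tok in zip(step_logits, tokens)]

# Toy per-step logits and the tokens the rollout happened to sample.
step_logits = [[2.0, 1.0, 0.0], [0.0, 0.5, 3.0]]
sampled_tokens = [0, 2]

# Mismatched definitions: sampling-temperature logprobs vs temperature-1 scoring.
rollout_reported = teacher_forced_logprobs(step_logits, sampled_tokens, temperature=0.7)
training_scored = teacher_forced_logprobs(step_logits, sampled_tokens, temperature=1.0)

# Aligned comparison: both sides score the same tokens with the same definition.
rollout_rescored = teacher_forced_logprobs(step_logits, sampled_tokens, temperature=1.0)

print(abs(rollout_reported[0] - training_scored[0]))
print(rollout_rescored == training_scored)
```

In this toy case the mismatched comparison shows a large gap that has nothing to do with operator drift, which is why the comparison should fix the token sequence and the scoring definition before measuring anything.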

### Workstream 4 — Diagnostics & Reproducibility
Parallel, supports all other workstreams.
- [ ]  Layer-wise hidden-state alignment probe: when drift appears, automatically locate the first divergent layer / operator instead of only reporting final KL. #136 
- [ ]  Per-PR consistency regression CI: KL / logprob-drift assertion so the numerical contract does not silently regress. #137 
- [ ]  Reproducible benchmark report: hardware, model, dtype, parallelism config, command lines, drift numbers (enabled vs disabled), and performance overhead. #138 
- [ ]  Positioning doc: framework-level (vime) vs operator-level (ours) vs inference-side (Thinking Machines / DeepSeek-V4); make the differentiation explicit; frame it as "first reproducible end-to-end RL train-inference consistency," NOT "first batch-invariant kernels" (those already exist — Thinking Machines / sglang / DeepSeek-V4); overclaiming will be challenged. #139 
- [ ] Drift visualization tool: render the layer-wise probe output as a simple "where drift starts" chart, for both debugging and demos — not just raw numbers. #140 
- [ ] Demo deliverable: a one-command reproducible script plus the drift chart (enabled vs disabled), packaged for live or recorded demonstration — the artifact leadership presents externally. #141 
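The core of the layer-wise alignment probe is a walk over captured activations in forward order. This is a minimal sketch assuming both engines have already dumped per-layer activations as flat lists under matching names; the layer names, the dict layout, and `first_divergent_layer` are hypothetical.

```python
def first_divergent_layer(rollout_acts, train_acts, layer_order, tol=0.0):
    """Return (layer_name, max_abs_diff) for the first layer exceeding `tol`,
    or None when every layer is within tolerance."""
    for name in layer_order:
        a, b = rollout_acts[name], train_acts[name]
        if len(a) != len(b):
            return name, float("inf")  # shape mismatch counts as divergence
        worst = max(abs(x - y) for x, y in zip(a, b))
        if worst > tol:
            return name, worst
    return None

layer_order = ["embed", "layer0.rmsnorm", "layer0.attn", "layer0.mlp"]
rollout_acts = {
    "embed": [0.5, 1.0],
    "layer0.rmsnorm": [0.25, 0.5],
    "layer0.attn": [0.25, 1.0],
    "layer0.mlp": [2.0, 3.0],
}
train_acts = {
    "embed": [0.5, 1.0],
    "layer0.rmsnorm": [0.25, 0.5],
    "layer0.attn": [0.75, 1.0],   # drift starts here
    "layer0.mlp": [2.5, 3.0],     # and propagates downstream
}
print(first_divergent_layer(rollout_acts, train_acts, layer_order))
# ('layer0.attn', 0.5)
```

Reporting the first divergent layer rather than the largest difference matters because downstream layers usually amplify an upstream mismatch.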

### Workstream 5 - RL-Kernel to vime Integration Exploration
Independent and decoupled from the consistency critical path; driven by a small dedicated effort. Overlaps with WS3's vLLM integration probing: the same investigation of vLLM's rollout path feeds both, so coordinate to avoid duplicate work.
- [ ] Map vime's architecture ([[WS5] Map vime architecture and RL-Kernel hook points #118](https://github.com/RL-Align/RL-Kernel/issues/118)): confirm it is built on slime's training stack with vLLM (+ vllm-router) as the rollout backend; identify where logprob / loss / sampling operators are invoked on both the rollout (vLLM) and training (Megatron) sides, and locate the hook points where RL-Kernel operators can be injected. cc @inaniloquentee 
- [ ] Produce an "RL-Kernel to vime integration" design doc ([[WS5] RL-Kernel to vime integration design doc #119](https://github.com/RL-Align/RL-Kernel/issues/119)): insertion points, change surface, non-intrusive integration path (custom op / optional backend), and an explicit positioning statement: operator-level consistency (ours) as the layer beneath vime's framework-level alignment, complementary rather than competing. cc @inaniloquentee 
- [ ] Minimal PoC ([[WS5] Minimal vime PoC with one RL-Kernel operator #120](https://github.com/RL-Align/RL-Kernel/issues/120)): wire one existing RL-Kernel operator (start with rollout-side fused_logp) into a single vime path and run a minimal example confirming the operator is actually invoked by vime. cc @inaniloquentee 
- [ ] Document vime's current interface limitations ([[WS5] vime interface limitations and upstream change list #121](https://github.com/RL-Align/RL-Kernel/issues/121)): record the upstream changes we would need to push; this feeds the P1 full-integration effort. cc @inaniloquentee 
- [ ] Run vime's native minimal RL example ([[WS5] vime native minimal RL baseline #117](https://github.com/RL-Align/RL-Kernel/issues/117)): run without our operators as a baseline first, so a failed PoC can be attributed to our integration rather than an unconfigured vime. cc @inaniloquentee 
- [ ] RL-Kernel to vime integration and benchmark plan (#158): integrate RL-Kernel into vime as an optional GRPO/logprob acceleration layer, starting with `fused logp`, `ratio_kl`, and `grpo_loss`; benchmark vime vs vime + RL-Kernel on dense workloads, and validate MoE only as `vime + R3` vs `vime + R3 + RL-Kernel` without claiming MoE-kernel acceleration or replacing R3. cc @inaniloquentee
### Critical Path & Scheduling
This month's scope: WS1 + WS2 + WS5. WS3 (real-engine alignment) and WS4 (diagnostics & reproducibility) are deferred to the next phase. The goal this month is operator-level consistency proven on single-GPU and multi-GPU (self-consistency), plus a vime integration path scoped out — not full real-engine train-inference alignment, which requires WS3.
Serial critical path (determines whether the month succeeds):

**WS1 ground-truth harness → WS1 ops → WS1 chain test (single-GPU) → WS2 distributed chain test (multi-GPU) → cross-config alignment**

WS2 merges in once WS1 ops reach a working draft; WS5 runs in parallel throughout and is largely independent (its PoC uses the existing fused_logp, not this month's new ops). Environment & infra setup (listed under WS3) must still be done this month because WS2's multi-GPU tests depend on it — pull it forward to W1.

Where adding people helps (parallel): the WS1 ops, the WS2 sub-items, and WS5.

Where adding people does not help (serial bottlenecks): WS1 chain integration and WS2 cross-config alignment; these depend on time and experience, not headcount.

Weekly milestones (goal: operator-level consistency with alignment data on single + multi-GPU, not production-grade real-engine alignment):

- W1: ground-truth harness + numerical contract finalized; environment & infra setup done (matched multi-GPU env, fixed model loadable, TP/FSDP launchable; blocks WS2); WS1 ops started in parallel (matmul gets a dedicated owner; harness owner goes first since all ops depend on it); WS5 maps vime architecture and runs vime's native example as a baseline.
- W2: each WS1 op passes its single-op batch sweep; WS2 begins merging in (TP / SP / deterministic collectives) as WS1 ops reach working draft; WS5 integration design doc drafted.
- W3: WS1 chain test passes (full-chain single-GPU consistency, forward + backward); WS2 TP>1 consistency working; WS5 minimal PoC (one operator invoked by vime).
- W4: WS2 distributed chain test + cross-config alignment (the hardest WS2 item) finalized; reproducible benchmark capturing the operator-level consistency results (single + multi-GPU, enabled vs disabled drift, overhead). Remaining time for debugging.

### Risks
This month's highest-uncertainty items are batch-invariant matmul (WS1) and cross-config alignment (WS2): matmul because cuBLAS does not provide batch-invariance and split-k breaks it; cross-config because aligning logprob across mismatched parallelism (e.g. rollout TP=2 vs training FSDP) has intricate reduction-order logic. Assign a dedicated engineer to each, and treat cross-config as the serial bottleneck that decides whether W4 lands.

WS1's ground-truth harness is a hidden dependency for everyone — if it slips, every op loses its verification target. It must be the first thing finished in W1.

Scope reminder for external communication: completing WS1 + WS2 proves operator-level consistency (our op chain is self-consistent across batch size, padding, and parallelism) — it does NOT prove real vLLM == real Megatron, which is WS3. Describe the result as "operator-level train-inference consistency, validated on single and multi-GPU," not "train-inference consistency closed-loop."

(Deferred to next phase, noted for forward planning: WS3 real-engine alignment is the highest-risk item overall — real vLLM vs real Megatron commonly shows each engine correct on its own while the comparison still differs by ~1e-3, and tracing it to an overlooked layout or collective can take significant time. When WS3 starts, assign experienced engineers and begin probing engines immediately.)


## P0.4: Multimodal RL Operator Development Plan

Goal: support image-text and omni-modal RL workloads without expanding RL-Kernel into a multimodal scheduling framework. verl / EasyR1 / OpenRLHF / TRL / vLLM-style stacks already expose VLM GRPO/PPO/RLOO paths with image-text prompts, multi-image or video inputs, processor caching, VLM reward models, and framework-level rollout orchestration. RL-Kernel should own the operator layer where modality expansion, packing, masking, caching, and rollout/training logprob consistency currently become slow or fragile.

Scope boundary:

1. Image-text VLM RL first: Qwen2.5-VL / Qwen3-VL-style workloads with text generation conditioned on images.
2. Multi-image and short-video inputs second: frame/patch packing and media-prefix cache behavior are in scope; full video-generation training is not a P0.4 deliverable.
3. Audio, diffusion, and continuous-latent RL remain RFC-first until there is a reproducible minimal case and a partner willing to validate correctness and reward impact.
4. Visual grounding and GUI/computer-use agents are explicit validation scenarios after the base VLM path, because they add coordinate/action spans while still depending on the same selected-logprob, masking, and reward-scatter contracts.
5. Frameworks keep dataset loading, chat templates, media decoding, reward I/O, rollout orchestration, and model-specific processors. RL-Kernel provides reusable operators, numerical contracts, and validation harnesses.

### Common Multimodal RL Problems

- Processor / token-layout drift: rollout engines and training engines may insert image tokens, padding, special tokens, or chat-template spans differently. A small mismatch invalidates selected logprob, KL, and GRPO/PPO loss comparisons.
- Variable visual-token explosion: image resolution, number of images, and video frame count create highly ragged visual-prefix lengths. Naive padding wastes memory and changes reduction shapes, which can reintroduce batch-size-dependent drift.
- Modality-aware loss masking: RL losses should usually score assistant text tokens while ignoring prompt, image placeholder, vision-prefix, and tool-observation spans. Current framework paths often rebuild these masks in Python and then materialize large tensors before logprob/loss.
- Media-prefix and KV-cache reuse: group rollouts in GRPO repeatedly reuse the same prompt and media. Without stable media IDs, packed offsets, and deterministic cache gather/scatter, the rollout path recomputes vision encoders or compares a cached rollout path with an uncached training path.
- Vision tower / projector hotspots: VLM workloads add vision encoder outputs, projector GEMMs, patch merge/unpad, 2D or temporal position handling, and final LM-head logprob. These ops sit upstream of the same RL logprob path and can be both memory-heavy and numerically inconsistent across batch layouts.
- Multimodal reward latency and reward scatter: VLM-as-judge, OCR, rule-based geometry, and external verifier rewards often return sequence-level or region-level scores. The training path still needs token-level reward/advantage scatter, group normalization, and masking without CPU-GPU synchronization.
- Spatial grounding and coordinate/action tokens: grounding, GUI, and robotics-style tasks output boxes, points, clicks, swipes, or structured action JSON. Their rewards depend on decoded coordinates or action fields, so tokenization, coordinate normalization, and action-span boundaries must be reproducible across rollout and training.
- Multi-turn visual-agent trajectories: GUI/browser/mobile tasks alternate screenshots, tool observations, model actions, and environment feedback. Only the model action spans should receive policy loss, while observations and tool results must stay in context without corrupting masks or KL.
- Multimodal reward/cost separation: safety and preference alignment often produce both helpfulness reward and safety cost from multimodal reward models. These signals need separate normalization, clipping, and scatter paths so constrained objectives do not silently mix reward and cost semantics.
- Continuous-modality trajectory mismatch: omni-modal or diffusion-style RL rollouts may be denoising or latent trajectories rather than discrete token sequences. The operator contract must separate discrete-token RL kernels from continuous-latent KL/reward accumulation before implementation begins.
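The reward-scatter problem above can be made concrete with a small reference implementation. This is a sketch under assumptions: it uses population standard deviation and a fixed `eps` for group normalization (implementations differ on both), spans are half-open, and the function names are hypothetical.

```python
import math

def group_normalized_advantages(rewards, eps=1e-6):
    """Normalize sequence-level rewards within one prompt group."""
    n = len(rewards)
    mean = 0.0
    for r in rewards:            # fixed left-to-right accumulation order
        mean += r
    mean /= n
    var = 0.0
    for r in rewards:
        var += (r - mean) ** 2
    std = math.sqrt(var / n)     # population std (assumption)
    return [(r - mean) / (std + eps) for r in rewards]

def scatter_to_tokens(advantage, seq_len, loss_spans):
    """Broadcast one sequence-level advantage onto tokens inside `loss_spans`.
    Every other position (prompt, image placeholders, observations) stays 0."""
    out = [0.0] * seq_len
    for start, end in loss_spans:    # half-open [start, end)
        for i in range(start, end):
            out[i] = advantage
    return out

rewards = [1.0, 0.0, 1.0, 0.0]       # one group of four rollouts
advs = group_normalized_advantages(rewards)
tokens = scatter_to_tokens(advs[0], seq_len=8, loss_spans=[(5, 8)])
print(advs)
print(tokens)
```

A fused operator would do the same work on-device, but a plain reference like this gives the tolerance tests a target with an explicit reduction order.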

### Operator Workstreams

- [ ] Multimodal batch schema and trace fixtures: extend the minimal RL batch schema with `media_ids`, `modality_spans`, `loss_spans`, `vision_token_offsets`, `media_cache_keys`, and `processor_fingerprint`; ship synthetic image-text fixtures plus one real Geo3K/Qwen-VL-style smoke fixture.
- [ ] Processor-layout parity checker: one command compares rollout vs training token IDs, image-token insertion, position IDs, attention masks, media spans, and loss masks before running any logprob comparison; failures point to the first mismatched span.
- [ ] Modality-aware pack-and-pad / unpad operator: pack ragged text, image, and video patch spans into RL-shaped batches with stable offsets; support deterministic unpack for diagnostics and fallback to PyTorch when the backend lacks a native kernel.
- [ ] Modality-mask selected-logprob operator: extend selected-logprob, ratio/KL, and GRPO/PPO loss kernels so they consume compact `loss_spans` / `modality_spans` directly and never score visual-prefix or prompt tokens by accident.
- [ ] Batch-invariant VLM prefill chain: validate RMSNorm, attention, projector, LM head, and selected-logprob on image-text prompts across batch=1/N, multi-image count, padding side, processor-cache on/off, and prefix-cache on/off.
- [ ] Visual projector + LM-head fused path: optimize the projector-to-text bridge and final vocab projection for large visual prefixes; keep deterministic accumulation order and reuse the P0.3+ dtype/tolerance contract.
- [ ] Media-prefix cache index and deterministic gather/scatter: define stable media cache keys and packed KV offsets so group rollouts can reuse identical media prefixes while training can replay the same logical sequence for logprob validation.
- [ ] Video frame / patch packing helper: add a small operator contract for frame-major vs token-major packing, temporal position IDs, and per-frame masks; first target short-video understanding, not video generation.
- [ ] Multimodal reward-to-token scatter: fuse sequence-level, region-level, OCR/verifier, and length-penalty rewards into token-level advantages with group normalization, leave-one-out statistics, and deterministic reduction order.
- [ ] Grounding / coordinate-action schema: add `region_spans`, `bbox_targets`, `point_targets`, `action_spans`, `coordinate_system`, and `screen_size` fields; validate normalized boxes, click points, and structured action tokens before reward or logprob comparison.
- [ ] GUI / computer-use trajectory packer: pack repeated screenshot-observation-action turns with deterministic action loss masks, per-step rewards, and tool/environment observation spans; cover browser/mobile screenshots and action JSON as the first target.
- [ ] Reward-cost scatter for multimodal safety RL: keep helpfulness reward, safety cost, and optional verifier reward as separate tensors through normalization and objective aggregation; support constrained PPO/GRPO-style objectives without mixing signs or masks.
- [ ] Continuous-latent RL RFC: define the operator boundary for diffusion/audio/omni-modal trajectories, including latent-step KL, denoising-step reward accumulation, and trajectory masks; no kernel implementation until the RFC has a minimal correctness harness.
  - Audio/codec sub-case (source: sglang-omni#774): the MOSS-TTS reference encoder produces batch-shape-dependent discrete codec tokens (single-vs-batch mismatch of ~4–9%). BF16 GEMM-shape drift crosses the residual quantizer (RLFQ) boundaries and flips token IDs at the continuous-to-discrete encode step, before the tokens enter the LM, which breaks rollout-vs-training input alignment. This is an input-tokenization / discrete-token case rather than a continuous-latent trajectory, so it anchors where this RFC must draw the discrete-vs-continuous boundary. It is the first reproducible minimal case: partner-confirmed but not yet staffed.

- [ ] Framework compatibility harness: validate the same fixture through vLLM/sglang rollout and verl/vime/OpenRLHF/TRL-style training adapters, proving that RL-Kernel operators observe the same media spans, masks, and selected logprob targets.
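The processor-layout parity item above asks that failures point to the first mismatched span. A minimal sketch of that check follows, assuming a hypothetical `Layout` record whose field names are illustrative only (they are not an existing RL-Kernel schema) and spans stored as half-open `(start, end)` token ranges:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # half-open [start, end) token range (assumption)


@dataclass
class Layout:
    """Hypothetical per-sequence layout record; not an RL-Kernel type."""
    token_ids: List[int]
    position_ids: List[int]
    attention_mask: List[int]
    modality_spans: List[Span]
    loss_spans: List[Span]


# Fields are compared in this order so token-level drift is reported
# before span-level drift.
FIELDS = ("token_ids", "position_ids", "attention_mask",
          "modality_spans", "loss_spans")


def first_mismatch(rollout: Layout, training: Layout) -> Optional[str]:
    """Describe the first differing entry, or return None if layouts agree."""
    for name in FIELDS:
        a, b = getattr(rollout, name), getattr(training, name)
        for i, (x, y) in enumerate(zip(a, b)):
            if x != y:
                return f"{name}[{i}]: rollout={x} training={y}"
        if len(a) != len(b):
            return f"{name}: length rollout={len(a)} training={len(b)}"
    return None


rollout = Layout([1, 9, 9, 5, 6], [0, 1, 2, 3, 4], [1] * 5, [(1, 3)], [(3, 5)])
training = Layout([1, 9, 9, 5, 6], [0, 1, 2, 3, 4], [1] * 5, [(1, 3)], [(2, 5)])
print(first_mismatch(rollout, training))
```

Running the layout check before any logprob comparison keeps a processor or masking bug from being misread as numerical drift in a kernel.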

### Exit Criteria

1. A VLM contributor can run one image-text GRPO smoke test that verifies processor layout parity, modality-aware selected logprob, and loss masking before training.
2. The same image-text fixture produces aligned logprob/loss across batch size, padding layout, media-cache mode, and prefix-cache mode within the P0.3+ tolerance policy.
3. At least one operator path shows a concrete memory or latency win over the PyTorch/framework baseline without changing reward or loss semantics.
4. Unsupported modalities have explicit RFC status, fallback behavior, and validation requirements instead of half-supported kernels.
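As a reference for what "modality-aware selected logprob and loss masking" could mean at the smoke-test level, the sketch below scores only positions inside `loss_spans` and leaves visual-prefix and prompt positions at zero. It is a pure-Python reference under assumed conventions (half-open spans, one logits row per position), not the fused kernel; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # half-open [start, end) token range (assumption)


def selected_logprob(logits_row: Sequence[float], target: int) -> float:
    """log softmax(logits_row)[target], using the max-shift for stability."""
    m = max(logits_row)
    lse = m + math.log(sum(math.exp(x - m) for x in logits_row))
    return logits_row[target] - lse


def span_masked_logprobs(
    logits: Sequence[Sequence[float]],
    targets: Sequence[int],
    loss_spans: Sequence[Span],
) -> List[float]:
    """Score only positions covered by loss_spans; all others stay 0.0."""
    out = [0.0] * len(targets)
    for start, end in loss_spans:
        for t in range(start, end):
            out[t] = selected_logprob(logits[t], targets[t])
    return out


# Positions 0-1 stand in for visual-prefix / prompt tokens; 2-3 are response.
logits = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 0.0], [1.0, 2.0, 3.0]]
targets = [0, 1, 2, 2]
logps = span_masked_logprobs(logits, targets, loss_spans=[(2, 4)])
```

Iterating over spans instead of a dense mask means positions outside the spans are never touched, which is the property the exit criterion is meant to verify.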


## P1: End-to-End Loop and Joint Validation

Goal: on top of the P0 tooling and kernels, ship one boring, credible end-to-end path and jointly validate rollout-vs-training consistency on real workloads. The preferred path is vLLM rollout plus RL-Kernel logprob/loss kernels plus a verl/slime integration flag.

- [ ] [RFC] integrate RL-Kernel as a vLLM rollout backend through a non-intrusive custom-op or optional-backend path
- [ ] wire prefix_shared_attention into vLLM PagedAttention KV management
- [ ] verl / slime backend flag, for example `use_rl_kernel_backend=True`
- [ ] overlapping rollout and training pipeline (#18, #61)
- [ ] production GRPO objective in the overlap pipeline (#68)
- [ ] GRPO logic with group relative reward computation (#16)
- [ ] generation-engine paged-KV scoring baseline for stateless Reference/Reward executor (#69)
- [ ] basic vLLM worker for offline rollout (#10)
- [ ] advanced vLLM sampler with shared prefix caching (#11)
- [ ] stateless (Zero-KV-Cache) forward engine for Reference and Reward models (#47)
- [ ] standardized policy and reference model wrappers (#14)
- [ ] use real policy logprobs and reward providers for production GRPO validation (#63)
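For readers new to the "group relative reward computation" item above, here is a minimal reference sketch of GRPO-style advantages: each response's reward is normalized against the other responses to the same prompt. The flat, prompt-major reward layout, the use of the population standard deviation, and the `eps` value are assumptions made for illustration.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], group_size: int, eps: float = 1e-6
) -> List[float]:
    """Normalize each reward by its group's mean and (population) std.

    `rewards` is assumed to be laid out prompt-major: the first
    `group_size` entries belong to prompt 0, the next to prompt 1, etc.
    """
    if group_size <= 0 or len(rewards) % group_size != 0:
        raise ValueError("len(rewards) must be a multiple of group_size")
    out: List[float] = []
    for g in range(0, len(rewards), group_size):
        group = rewards[g:g + group_size]
        mu = fmean(group)
        sigma = pstdev(group)
        # eps keeps a group with identical rewards from dividing by zero.
        out.extend((r - mu) / (sigma + eps) for r in group)
    return out


adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0, 2.0, 2.0, 2.0, 2.0], 4)
```

A group whose rewards are all identical yields zero advantages, so it contributes no gradient; a fused kernel has to preserve that behavior.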

### Exit Criteria

1. A user can run a documented single-node GRPO loop with vLLM rollout, RL-Kernel logprob/loss, and a real reward provider.
2. Pure inference workloads remain unaffected when the RL-Kernel backend is disabled.
3. The integration reports memory, latency, and logprob-consistency metrics measured against the P0 cross-benchmark tooling.

## P2: Large-Scale Kernel Development across ROCm / CUDA / Triton

Goal: take the P0 kernel set and the broader fused-operator backlog to scale across the three first-class backends: dispatch infrastructure, hardware-specific fast paths, cross-backend parity, packaging, and the distributed executors that large runs depend on.

### Cross-Backend Dispatch and Hardware Fast Paths

- [ ] dual-backend kernel dispatching infrastructure (#4)
- [ ] Triton-based cross-platform logp kernel (#6)
- [ ] FlashAttention with dynamic hardware routing (#8)
- [ ] FlashInfer operator for NVIDIA-specific optimization (#7)
- [ ] Composable Kernel / ROCm FlashAttention for AMD parity (#39)
- [ ] abstraction layer for AMD ROCm AITER integration (#9)
- [ ] persistent kernel auto-tuning in the Hardware Abstraction Layer (#50)
- [ ] ROCm-to-CUDA numerical parity suite, checking matching outputs rather than only successful execution
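One way to picture the dispatch-with-fallback behavior these items describe is a small registry keyed by operator and backend, with a reference implementation as the last resort. This is an illustrative sketch only: `register`, `dispatch`, and the backend names are hypothetical and do not describe the existing RL-Kernel dispatcher.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

# op name -> backend name -> implementation (hypothetical registry)
_REGISTRY: Dict[str, Dict[str, Callable]] = {}


def register(op: str, backend: str) -> Callable[[Callable], Callable]:
    """Decorator that records `fn` as the `backend` implementation of `op`."""
    def deco(fn: Callable) -> Callable:
        _REGISTRY.setdefault(op, {})[backend] = fn
        return fn
    return deco


def dispatch(op: str, preferred: Sequence[str]) -> Tuple[str, Callable]:
    """Pick the first available backend, falling back to "reference"."""
    impls = _REGISTRY.get(op, {})
    for backend in (*preferred, "reference"):
        if backend in impls:
            return backend, impls[backend]
    raise KeyError(f"no implementation registered for op {op!r}")


@register("logsumexp", "reference")
def _logsumexp_reference(xs: Sequence[float]) -> float:
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


# No "rocm" or "cuda" implementation is registered here, so the
# reference path is selected instead of failing.
backend, fn = dispatch("logsumexp", preferred=["rocm", "cuda"])
```

Returning the chosen backend name alongside the callable lets a parity suite log which path actually ran, so a silent fallback is not mistaken for a passing native kernel.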

Moved from the previous P0 operator backlog:

- [ ] fused GRPO loss with in-place group reward normalization (#46)
- [ ] fused Policy Ratio and KL Penalty kernel (Triton / CUDA / ROCm) (#40)
- [ ] fused RMSNorm (forward + backward) for policy training (#38)
- [ ] fused masking and variable-length sequence packing (pack-and-pad) (#42)
- [ ] fused entropy / entropy-bonus kernel
- [ ] fused PPO clip loss for memory-stable PPO training (fused_ppo_clip_loss)
- [ ] parallel GAE scan to avoid CPU-GPU sync in advantage estimation (parallel_gae_scan)
- [ ] DPO loss function with kernel binding (#15)
- [ ] long-sequence online-softmax logp for long-CoT responses (8k-32k+)
- [ ] proactive workspace sizing and memory-safety boundary checks for long-context online-softmax
- [ ] chunked full-vocab KL / distillation KL fused with lm_head, full-distribution forward / reverse KL beyond the sampled-token path (#40)
- [ ] standalone masked advantage whitening / baseline normalization with group, global-batch, and leave-one-out statistics; micro-batch-invariant reduction order
- [ ] generalized ratio-clip-aggregate policy-gradient loss primitive: PPO / GRPO first, GSPO / DAPO / CISPO / Dr.GRPO later as configs
- [ ] clipped / Huber value loss kernel for critic training (PPO / VAPO), closing the critic path alongside parallel_gae_scan and fused_ppo_clip_loss
- [ ] fused reward shaping + reward-to-token scatter for length penalty, overlong punishment, KL-in-reward, last-token scatter, or broadcast, eliminating CPU-GPU round trips
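To make the `parallel_gae_scan` item concrete: GAE follows the backward recurrence A_t = δ_t + γλ·A_{t+1} with δ_t = r_t + γ·V_{t+1} − V_t, and each step is an affine map x ↦ a·x + b. Affine maps compose associatively, so the recurrence can be evaluated as a suffix scan in O(log T) passes. The sketch below is pure Python, handles a single trajectory without episode-boundary flags (an assumption), and only illustrates the structure; it is not the planned kernel.

```python
from typing import List, Sequence, Tuple

Affine = Tuple[float, float]  # (a, b) represents x -> a * x + b


def gae_sequential(rewards: Sequence[float], values: Sequence[float],
                   gamma: float, lam: float) -> List[float]:
    """Reference backward loop. `values` has len(rewards) + 1 entries."""
    adv = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv


def _compose(f: Affine, g: Affine) -> Affine:
    """Return f after g: x -> f.a * (g.a * x + g.b) + f.b."""
    return (f[0] * g[0], f[0] * g[1] + f[1])


def gae_scan(rewards: Sequence[float], values: Sequence[float],
             gamma: float, lam: float) -> List[float]:
    """Same result via a Hillis-Steele style suffix scan over affine maps."""
    T = len(rewards)
    maps: List[Affine] = [
        (gamma * lam, rewards[t] + gamma * values[t + 1] - values[t])
        for t in range(T)
    ]
    stride = 1
    while stride < T:
        # Every element of one pass is independent, which is what a GPU
        # kernel would parallelize.
        maps = [
            _compose(maps[t], maps[t + stride]) if t + stride < T else maps[t]
            for t in range(T)
        ]
        stride *= 2
    # Applying each composed map to the terminal advantage 0 leaves b.
    return [b for _, b in maps]
```

The scan reorders floating-point operations relative to the sequential loop, so the two agree within tolerance rather than bit-for-bit; any kernel built this way needs a fixed combination order to remain deterministic.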

### Packaging and Build

- [ ] PyTorch C++ Extension / Ninja JIT compilation pipeline hardening
- [ ] CUDA AOT wheel matrix, starting with CUDA 12.1 and 12.4, then CUDA 11.8 if demand is clear (#81)
- [ ] pre-compiled wheel matrix for AMD ROCm (5.7, 6.0+)
- [ ] Triton AOT compilation support for deployment environments without NVCC
- [ ] automated manylinux release pipeline through GitHub Actions
- [ ] PyTorch `torch.compile` / Dynamo compatibility for all custom ops

### Distributed and Executors

- [ ] Sequence-Parallel aware LogProb and Loss reductions (#49)
- [ ] TP-Aware fast forward pass for Reward / Critic models (#44)
- [ ] zero-copy weight synchronization bridge - advanced (#13)
- [ ] DeepSpeed-based training worker framework - hardening (#12)
- [ ] staleness-aware weight sync for asynchronous RL rollout
- [ ] fused_ema_weight_sync: in-place EMA weight synchronization kernel to eliminate intermediate copy overhead
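A common pitfall behind the sequence-parallel reduction item: when a sequence is sharded across ranks and each shard has a different number of unmasked tokens, averaging per-rank means gives the wrong loss. The sketch below reduces `(sum, count)` partials instead. The list of partials stands in for an all-reduce, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Partial = Tuple[float, int]  # (masked loss sum, unmasked token count)


def local_partial(token_losses: Sequence[float],
                  mask: Sequence[int]) -> Partial:
    """Per-rank partial: sum of unmasked losses and how many there were."""
    total = sum(l for l, m in zip(token_losses, mask) if m)
    count = sum(1 for m in mask if m)
    return total, count


def combine_partials(partials: Sequence[Partial]) -> float:
    """Global masked mean. Partials are combined in rank order so the
    floating-point reduction order is fixed from run to run."""
    total = 0.0
    count = 0
    for s, n in partials:  # rank 0, rank 1, ...
        total += s
        count += n
    return total / count if count else 0.0


# Two sequence shards with unequal numbers of unmasked tokens.
shards: List[Partial] = [
    local_partial([1.0, 3.0], [1, 1]),      # rank 0: mean 2.0 over 2 tokens
    local_partial([5.0, 100.0], [1, 0]),    # rank 1: mean 5.0 over 1 token
]
global_mean = combine_partials(shards)          # 9.0 / 3 = 3.0
mean_of_means = sum(s / n for s, n in shards) / len(shards)  # 3.5, biased
```

The same pattern carries over to selected-logprob sums and KL terms: reduce numerators and denominators separately, then divide once.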

### Exit Criteria

1. The P0 kernels and the P2 operator backlog run on CUDA, ROCm, and Triton with matching outputs, not merely without errors.
2. Fresh CUDA/ROCm users can install or build RL-Kernel without reading source code.
3. Large multi-GPU runs have stable distributed reductions and weight-sync paths.

## P3: Domestic Accelerator and Research Expansion

Goal: expand after the three-backend path has correctness, integration, and reproducibility locked.

### Domestic Accelerator

- [ ] Domestic chip backend PoC (Ascend / Cambricon / Metax): open a scoping issue first that defines the target hardware, target operators, expected fallback behavior, and validation hardware

### Architecture-Specific Optimizations

- [ ] MLA (Multi-head Latent Attention) aware prefix-shared rollout kernel for DeepSeek-style models
- [ ] memory-efficient MoE routing and expert-level reward assignment ops
- [ ] MoE routing consistency replay: reuse inference-time expert routing during training (e.g. Qwen3-30B-A3B MoE)
- [ ] fused RoPE optimizations for Llama-3 / DeepSeek long-context RL
- [ ] FP8 dynamic scaling alignment for identical scale calculation and truncation logic on H100/SM90+
- [ ] fp8 quantization support for the logprob kernel on H100 (#77)

### Objective Research Kernels

- [ ] GSPO sequence-level importance-ratio kernel
- [ ] DAPO asymmetric clip plus truncated/masked IS fused kernel
- [ ] Dr. GRPO unbiased advantage variant flag on the fused GRPO loss (#46)
- [ ] REINFORCE++ global-baseline advantage path (batch-mean / EMA baseline)
- [ ] fused Gumbel-Softmax sampler for differentiable rollouts (#45)
- [ ] composable policy-optimization primitives shared by GRPO, GSPO, DAPO, Dr. GRPO, PPO, and future objectives (after GSPO/DAPO land)
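For orientation on the GSPO item, the sequence-level importance ratio, as we understand the GSPO formulation, is the length-normalized likelihood ratio of the whole response: s = exp((1/|y|) · Σ_t (log π_new(y_t) − log π_old(y_t))). The sketch below is a pure-Python reference with a hypothetical function name and a 0/1 response mask; it is not a fused kernel.

```python
import math
from typing import Sequence


def sequence_importance_ratio(
    new_logps: Sequence[float],
    old_logps: Sequence[float],
    mask: Sequence[int],
) -> float:
    """Length-normalized sequence-level ratio over unmasked tokens."""
    n = sum(1 for m in mask if m)
    if n == 0:
        return 1.0  # assumption: an empty response carries a neutral ratio
    log_ratio = sum(
        (a - b) for a, b, m in zip(new_logps, old_logps, mask) if m
    )
    return math.exp(log_ratio / n)


new_logps = [-1.0, -2.0, -3.0]
old_logps = [-1.5, -2.5, -9.0]
mask = [1, 1, 0]  # the last position is padding and must not contribute
ratio = sequence_importance_ratio(new_logps, old_logps, mask)
```

Unlike the per-token ratio used in PPO / GRPO, this yields one scalar per response, so clipping and aggregation operate at sequence granularity. That difference is why the roadmap lists it as a separate kernel.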

### Additional Engine Integrations

- [ ] sglang rollout operator integration
- [ ] OpenAI-compatible local serving entry point (#21), treated as an integration reference rather than a core serving product

### Exit Criteria

1. Each new hardware backend has an explicit correctness suite, fallback story, and benchmark command.
2. Research kernels land behind stable primitives instead of creating one-off APIs.
3. Engine integrations are optional and do not expand RL-Kernel into a macro scheduling framework.

### Deferred Unless Partner-Driven

These are useful, but they should not block the core roadmap unless a maintainer or external partner commits hardware, benchmarks, and validation time:

- [ ] full domestic accelerator backend support
- [ ] multi-node/RDMA rollout-training weight synchronization as a first-class product
- [ ] broad OpenAI-compatible serving beyond local testing
- [ ] complete parity across every combination of vLLM, sglang, Megatron, DeepSpeed, FSDP, verl, and slime
- [ ] multimodal continuous-to-discrete input tokenization consistency (audio codec GEMM batch-variance alignment); awaiting a reproducible minimal case and partner validation to assess the impact on RL training loss

This roadmap is a living document. If you are interested in any item, especially an RFC or a P0/P1 task, comment on it or open an issue, and join the Discord discussion. Contributions across CUDA, ROCm, and domestic silicon are all welcome.

