You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Hardware: NVIDIA CUDA | AMD ROCm | domestic accelerator
RL-Kernel is a micro-level RL operator library. We do not own RL scheduling frameworks; verl and slime handle the macro dataflow. Our job is to replace inefficient PyTorch paths around loss, KL, log-prob, variable-length packing, and rollout/training consistency with extreme fused kernels, and to keep those kernels numerically aligned across rollout engines such as vLLM/sglang and training engines such as Megatron/DeepSpeed/FSDP.
Roadmap Rules
Priorities are intentionally uneven:
P0: benchmark/CI/docs foundation, the three P0 showcase kernels, and train-inference consistency tooling.
P1: close one end-to-end loop and jointly validate rollout vs training consistency on top of the P0 tooling and kernels.
P2: large-scale kernel development across ROCm, CUDA, and Triton, including the previous fused-operator backlog.
P3: domestic-accelerator backend and remaining research/expansion work.
An item is considered done only when it has:
a code path in the repository;
focused tests or a reproducible smoke test;
a benchmark, validation report, or documented usage path when the item is user-facing;
clear fallback behavior for unsupported hardware or missing optional dependencies.
Current Foundation
These pieces are already landed and form the base for the next phase:
P0: Foundation, Showcase Kernels, and Consistency Tooling
Goal: lock the things everything else depends on: reproducible benchmark/CI/doc infrastructure, three high-value P0 kernels across CUDA/ROCm/Triton, and train-inference consistency tooling.
cross-platform Docker matrix and CI workflows for hardware regression and automated tests
P0.2 Typical Fused Operators on CUDA / ROCm / Triton
The core fused operators that must exist and be numerically correct on all three backends before scale-out work begins.
Fused CE LogProb without materializing logits + fused backward. cc @KJLdefeated
Avoid landing large [B, S, V] logits for large-vocabulary models, so the saved memory can be used for larger batches or longer CoT responses. The forward path streams vocab blocks with online softmax; the backward path recomputes tiles to trade compute for memory instead of storing full logits/probabilities.
- CUDA: use SM90 TMA/WGMMA-oriented streaming where available, with SM80 cp.async/mma.sync fallback; keep online-softmax state in registers and shared memory.
- ROCm: use wavefront=64-aware reductions and LDS layouts, replacing TMA with manual double buffering to cover CDNA limitations.
- Triton: provide a portable semantic baseline and tolerance test target for CUDA/ROCm native implementations.
Fused FlashAttention with causal mask, varlen packing, and exported attention LSE.
Target long-context RL workloads with packed variable-length batches. Export attention softmax LSE for backward, diagnostics, and rollout/training attention alignment. The exported LSE is attention-domain LSE, not vocab-logprob LSE.
- CUDA: SM90 WGMMA + TMA path, SM80 mma.sync fallback, with varlen and LSE export.
- ROCm: MFMA-based kernel with 16x16x16-style tiling, CK comparison, and RL-Kernel-specific LSE/varlen semantics.
- Triton: extend the existing dense fallback with LSE export and varlen support as the cross-platform semantic baseline.
Batch-Invariant Deterministic LogProb.
Lock selected-logprob reduction order so the same sequence does not drift with batch size, chunked prefill, prefix-cache mode, or packing layout. This is the core P0 operator for preventing KL drift caused by rollout/training logprob inconsistency.
- CUDA: avoid atomicAdd, fix the tree-reduction topology and block partition, and keep per-row reduction independent of surrounding batch shape. (Implement Batch-Invariant Deterministic LogProb CUDA Kernel to Eliminate Batch-Size Drift #96)
- ROCm: use deterministic wavefront=64 reductions and connect the result to the ROCm-to-CUDA parity suite.
- Triton: disable autotune, lock configs, and validate that reduction behavior is stable for the supported BLOCK_SIZE set.
P0.3 Train-Inference Consistency Tooling
Goal: make "same model, same sequence, same policy state" produce aligned log-probs across rollout and training engines. This is the main technical identity of RL-Kernel.
Note: The consistency tooling below is refined and re-scheduled under the P0.3+ Sprint workstreams. Specifically: the cross-benchmark tool (#106) → WS3; the layer-wise drift probe → WS4; TP-invariant reductions (#102) → WS2 (#109); the dtype/tolerance policy → WS1 (#154) + WS2 (#116). P0.3 lists them by tool type; the Sprint section lists them by workstream and execution order.
Exit Criteria
Benchmarks publish hardware, model, dtype, shape, baseline, and command lines, and CI protects the agreed numerical contract.
The three P0 showcase kernels have documented CUDA/ROCm/Triton semantics, runnable reference paths, and tolerance tests.
A contributor can run one command that compares rollout and training log-probs on a fixed model, prompt set, and seed; failures report the first divergent layer/operator.
Final goal (the full vision): On standard Transformer architecture, under real multi-GPU training, with the real vLLM rollout engine and the real Megatron/FSDP training engine, prove that the same sequence produces aligned (bitwise / tight-tolerance) logprobs across both, and that the alignment holds across batch size, parallelism config, and padding. In parallel, integrate RL-Kernel operators into vime as the operator-level consistency layer beneath its framework-level alignment.
This month's scope: WS1 + WS2 + WS5. WS3 (real-engine alignment) and WS4 (diagnostics & reproducibility) are deferred to the next phase.
This month proves operator-level consistency on single-GPU and multi-GPU (our op chain is self-consistent), plus a scoped-out vime integration path — not real vLLM == real Megatron alignment, which requires WS3.
Out of scope this month: MoE, linear attention, FP8, ROCm / domestic chips.
Definition of "usable-grade" (the full bar):
A full batch-invariant forward chain (RMSNorm + matmul + attention + logprob), not isolated ops.
Consistency holds under TP>1 and across mismatched rollout/training parallelism.
A layer-wise probe locates the first divergent operator when drift appears.
This month targets items 1–2 (WS1 + WS2). Items 3–4 require WS3/WS4 and are the next-phase target.
Task granularity: Each Workstream below is an epic (one tracking issue). Each - [ ] item under a Workstream is a direction that becomes its own GitHub issue, not a single PR. Once an issue is picked up, the owner breaks it into 1 or N PRs and lists the planned PRs in the issue description. Rule of thumb: one independently reviewable, independently mergeable change completable in 1–3 days = one PR; otherwise split further. Granularity is uneven by design: items like Per-PR CI or Positioning doc may be a single PR, while vLLM rollout integration, matmul, and #106 are large issues that split into 5–6 PRs each.
Workstream 1 — Full Batch-Invariant Forward Chain
The foundation. Ops are mutually independent and can be built in parallel; each needs its own ground-truth and batch-config sweep.
Ground-truth harness + numerical contract (highest priority; all other ops depend on it): deterministic reference for every op on a fixed standard-Transformer model (e.g. Llama-3-8B / Qwen3-8B dense) + fixed prompt set + seed; define the tolerance policy (bitwise where feasible, tight-tolerance otherwise); produce a concrete per-dtype pass/fail threshold table (e.g. max abs logprob diff, KL bound) used as the single source of truth for "aligned" across WS2 (and WS3/WS4 in the next phase). [WS1] Ground-truth harness + numerical contract for batch-invariant ops #108 cc @maxiaosong1124@a-kaa
Batch-invariant embedding + LM head projection: confirm the input embedding lookup and the final vocab projection (a large matmul, directly upstream of logprob) also run on the batch-invariant path; drift here propagates straight into logprob. [WS1][kernels] Batch-invariant embedding + LM head projection #151 cc @inaniloquentee
Deterministic NCCL all-reduce, incl. DP gradient all-reduce ([WS2] Deterministic NCCL all-reduce (incl. DP gradient all-reduce) #112): audit all-reduce ordering; use NVLink-Sharp in-switch deterministic reductions on Hopper (CUDA 12.8+) where available; deterministic fallback where not; make cross-DP-rank gradient all-reduce order deterministic as a distinct scenario. Independent of WS1 ops — can start in W1. cc @inaniloquentee
Distributed chain test ([WS2] Distributed chain test #113): re-run the WS1 chain test under TP>1 and mismatched configs, covering forward + backward; add to CI. cc @Flink-ddd
Moving from our own assembled chain to real vLLM matching real Megatron. This is the step most likely to consume the schedule buffer. All vLLM integration is non-intrusive and lives in the RL-Kernel repo using vLLM's extension points (custom op / backend plugin); we do not modify vLLM source or file PRs against the vLLM repo this month. W1 first confirms whether vLLM's extension points are sufficient for the logprob / attention hooks we need.
Environment & infra setup (W1, blocking WS2 and WS3): stand up the shared multi-GPU environment with matched vLLM + Megatron/FSDP versions, the fixed model loadable on both engines, and TP/FSDP launchable. First infra task of the sprint (W1) — both WS2's distributed tests and WS3's engine alignment depend on it. [WS3] Environment & infra setup (shared multi-GPU + dual-engine) #127
Input parity precheck: confirm identical tokenization, special tokens, and padding side (left/right) between vLLM and Megatron before any comparison — mismatched inputs invalidate all downstream alignment. [WS3] Input parity precheck (tokenization / padding side) #128
vLLM rollout integration: wire WS1 batch-invariant ops into vLLM's real rollout path (logprob / attention), behind a non-intrusive custom-op / optional-backend flag so pure inference is unaffected. (Shared investigation with WS5 — same probing of vLLM's rollout path.) [WS3] vLLM rollout integration #129
Drift-to-zero validation: with batch-invariant ops enabled, real vLLM logprob matches real Megatron logprob (within tolerance); with them disabled, drift is clearly visible (control group). [WS3] Drift-to-zero validation #132
Weight-sync correctness: ensure the training/rollout weight-sync bridge does not itself introduce numerical mismatch (a known source of silent drift). [WS3] Weight-sync correctness #133
Sampling vs logprob decoupling: rollout samples (temperature / top-p) while training computes teacher-forcing logprob; align "logprob given the same token sequence," keeping sampling and logprob computation explicitly separated so the comparison measures the same quantity. [WS3] Sampling vs logprob decoupling #135
Workstream 4 —Diagnostics & Reproducibility
Parallel, supports all other workstreams.
Layer-wise hidden-state alignment probe: when drift appears, automatically locate the first divergent layer / operator instead of only reporting final KL. [WS4] Layer-wise hidden-state alignment probe #136
Positioning doc: framework-level (vime) vs operator-level (ours) vs inference-side (Thinking Machines / DeepSeek-V4); make the differentiation explicit; frame it as "first reproducible end-to-end RL train-inference consistency," NOT "first batch-invariant kernels" (those already exist — Thinking Machines / sglang / DeepSeek-V4); overclaiming will be challenged. [WS4] Positioning doc (canonical external version) #139
Drift visualization tool: render the layer-wise probe output as a simple "where drift starts" chart, for both debugging and demos — not just raw numbers. [WS4] Drift visualization tool #140
Demo deliverable: a one-command reproducible script plus the drift chart (enabled vs disabled), packaged for live or recorded demonstration — the artifact leadership presents externally. [WS4] Demo deliverable #141
Workstream 5 - RL-Kernel to vime Integration Exploration
Independent and decoupled from the consistency critical path; driven by a small dedicated effort. Overlaps with WS3's vLLM integration probing, the same investigation of vLLM's rollout path feeds both, so coordinate to avoid duplicate work.
Map vime's architecture ([WS5] Map vime architecture and RL-Kernel hook points #118): confirm it is built on slime's training stack with vLLM (+ vllm-router) as the rollout backend; identify where logprob / loss / sampling operators are invoked on both the rollout (vLLM) and training (Megatron) sides, and locate the hook points where RL-Kernel operators can be injected. cc @inaniloquentee
Produce an "RL-Kernel to vime integration" design doc ([WS5] RL-Kernel to vime integration design doc #119): insertion points, change surface, non-intrusive integration path (custom op / optional backend), and an explicit positioning statement: operator-level consistency (ours) as the layer beneath vime's framework-level alignment, complementary rather than competing. cc @inaniloquentee
Run vime's native minimal RL example ([WS5] vime native minimal RL baseline #117): run without our operators as a baseline first, so a failed PoC can be attributed to our integration rather than an unconfigured vime. cc @inaniloquentee
RL-Kernel to vime integration and benchmark plan ([WS5] RL-Kernel to vime integration and benchmark plan #158): integrate RL-Kernel into vime as an optional GRPO/logprob acceleration layer, starting with fused logp, ratio_kl, and grpo_loss; benchmark vime vs vime + RL-Kernel on dense workloads, and validate MoE only as vime + R3 vs vime + R3 + RL-Kernel without claiming MoE-kernel acceleration or replacing R3. cc @inaniloquentee
Critical Path & Scheduling
This month's scope: WS1 + WS2 + WS5. WS3 (real-engine alignment) and WS4 (diagnostics & reproducibility) are deferred to the next phase. The goal this month is operator-level consistency proven on single-GPU and multi-GPU (self-consistency), plus a vime integration path scoped out — not full real-engine train-inference alignment, which requires WS3.
Serial critical path (determines whether the month succeeds):
WS1 ground-truth harness → WS1 ops → WS1 chain test (single-GPU) → WS2 distributed chain test (multi-GPU) → cross-config alignment
WS2 merges in once WS1 ops reach a working draft; WS5 runs in parallel throughout and is largely independent (its PoC uses the existing fused_logp, not this month's new ops). Environment & infra setup (listed under WS3) must still be done this month because WS2's multi-GPU tests depend on it — pull it forward to W1.
Where adding people helps (parallel): the WS1 ops, the WS2 sub-items, and WS5.
Where adding people does not help (serial bottlenecks): WS1 chain integration and WS2 cross-config alignment; these depend on time and experience, not headcount.
Weekly milestones (goal: operator-level consistency with alignment data on single + multi-GPU, not production-grade real-engine alignment):
W1: ground-truth harness + numerical contract finalized; environment & infra setup done (matched multi-GPU env, fixed model loadable, TP/FSDP launchable — blocking WS2); WS1 ops started in parallel (matmul gets a dedicated owner; harness owner goes first since all ops depend on it); WS5 maps vime architecture and runs vime's native example as a baseline.
W2: each WS1 op passes its single-op batch sweep; WS2 begins merging in (TP / SP / deterministic collectives) as WS1 ops reach working draft; WS5 integration design doc drafted.
W3: WS1 chain test passes (full-chain single-GPU consistency, forward + backward); WS2 TP>1 consistency working; WS5 minimal PoC (one operator invoked by vime).
W4: WS2 distributed chain test + cross-config alignment (the hardest WS2 item) finalized; reproducible benchmark capturing the operator-level consistency results (single + multi-GPU, enabled vs disabled drift, overhead). Remaining time for debugging.
Risks
This month's highest-uncertainty items are batch-invariant matmul (WS1) and cross-config alignment (WS2): matmul because cuBLAS does not provide batch-invariance and split-k breaks it; cross-config because aligning logprob across mismatched parallelism (e.g. rollout TP=2 vs training FSDP) has intricate reduction-order logic. Assign a dedicated engineer to each, and treat cross-config as the serial bottleneck that decides whether W4 lands.
WS1's ground-truth harness is a hidden dependency for everyone — if it slips, every op loses its verification target. It must be the first thing finished in W1.
Scope reminder for external communication: completing WS1 + WS2 proves operator-level consistency (our op chain is self-consistent across batch size, padding, and parallelism) — it does NOT prove real vLLM == real Megatron, which is WS3. Describe the result as "operator-level train-inference consistency, validated on single and multi-GPU," not "train-inference consistency closed-loop."
(Deferred to next phase, noted for forward planning: WS3 real-engine alignment is the highest-risk item overall — real vLLM vs real Megatron commonly shows each engine correct on its own while the comparison still differs by ~1e-3, and tracing it to an overlooked layout or collective can take significant time. When WS3 starts, assign experienced engineers and begin probing engines immediately.)
P0.4: Multimodal RL Operator Development Plan
Goal: support image-text and omni-modal RL workloads without expanding RL-Kernel into a multimodal scheduling framework. verl / EasyR1 / OpenRLHF / TRL / vLLM-style stacks already expose VLM GRPO/PPO/RLOO paths with image-text prompts, multi-image or video inputs, processor caching, VLM reward models, and framework-level rollout orchestration. RL-Kernel should own the operator layer where modality expansion, packing, masking, caching, and rollout/training logprob consistency currently become slow or fragile.
Scope boundary:
Image-text VLM RL first: Qwen2.5-VL / Qwen3-VL-style workloads with text generation conditioned on images.
Multi-image and short-video inputs second: frame/patch packing and media-prefix cache behavior are in scope; full video-generation training is not a P0.4 deliverable.
Audio, diffusion, and continuous-latent RL remain RFC-first until there is a reproducible minimal case and a partner willing to validate correctness and reward impact.
Visual grounding and GUI/computer-use agents are explicit validation scenarios after the base VLM path, because they add coordinate/action spans while still depending on the same selected-logprob, masking, and reward-scatter contracts.
Frameworks keep dataset loading, chat templates, media decoding, reward I/O, rollout orchestration, and model-specific processors. RL-Kernel provides reusable operators, numerical contracts, and validation harnesses.
Common Multimodal RL Problems
Processor / token-layout drift: rollout engines and training engines may insert image tokens, padding, special tokens, or chat-template spans differently. A small mismatch invalidates selected logprob, KL, and GRPO/PPO loss comparisons.
Variable visual-token explosion: image resolution, number of images, and video frame count create highly ragged visual-prefix lengths. Naive padding wastes memory and changes reduction shapes, which can reintroduce batch-size-dependent drift.
Modality-aware loss masking: RL losses should usually score assistant text tokens while ignoring prompt, image placeholder, vision-prefix, and tool-observation spans. Current framework paths often rebuild these masks in Python and then materialize large tensors before logprob/loss.
Media-prefix and KV-cache reuse: group rollouts in GRPO repeatedly reuse the same prompt and media. Without stable media IDs, packed offsets, and deterministic cache gather/scatter, the rollout path recomputes vision encoders or compares a cached rollout path with an uncached training path.
Vision tower / projector hotspots: VLM workloads add vision encoder outputs, projector GEMMs, patch merge/unpad, 2D or temporal position handling, and final LM-head logprob. These ops sit upstream of the same RL logprob path and can be both memory-heavy and numerically inconsistent across batch layouts.
Multimodal reward latency and reward scatter: VLM-as-judge, OCR, rule-based geometry, and external verifier rewards often return sequence-level or region-level scores. The training path still needs token-level reward/advantage scatter, group normalization, and masking without CPU-GPU synchronization.
Spatial grounding and coordinate/action tokens: grounding, GUI, and robotics-style tasks output boxes, points, clicks, swipes, or structured action JSON. Their rewards depend on decoded coordinates or action fields, so tokenization, coordinate normalization, and action-span boundaries must be reproducible across rollout and training.
Multi-turn visual-agent trajectories: GUI/browser/mobile tasks alternate screenshots, tool observations, model actions, and environment feedback. Only the model action spans should receive policy loss, while observations and tool results must stay in context without corrupting masks or KL.
Multimodal reward/cost separation: safety and preference alignment often produce both helpfulness reward and safety cost from multimodal reward models. These signals need separate normalization, clipping, and scatter paths so constrained objectives do not silently mix reward and cost semantics.
Continuous-modality trajectory mismatch: omni-modal or diffusion-style RL rollouts may be denoising or latent trajectories rather than discrete token sequences. The operator contract must separate discrete-token RL kernels from continuous-latent KL/reward accumulation before implementation begins.
Operator Workstreams
Multimodal batch schema and trace fixtures: extend the minimal RL batch schema with media_ids, modality_spans, loss_spans, vision_token_offsets, media_cache_keys, and processor_fingerprint; ship synthetic image-text fixtures plus one real Geo3K/Qwen-VL-style smoke fixture.
Processor-layout parity checker: one command compares rollout vs training token IDs, image-token insertion, position IDs, attention masks, media spans, and loss masks before running any logprob comparison; failures point to the first mismatched span.
Modality-aware pack-and-pad / unpad operator: pack ragged text, image, and video patch spans into RL-shaped batches with stable offsets; support deterministic unpack for diagnostics and fallback to PyTorch when the backend lacks a native kernel.
Modality-mask selected-logprob operator: extend selected-logprob, ratio/KL, and GRPO/PPO loss kernels so they consume compact loss_spans / modality_spans directly and never score visual-prefix or prompt tokens by accident.
Batch-invariant VLM prefill chain: validate RMSNorm, attention, projector, LM head, and selected-logprob on image-text prompts across batch=1/N, multi-image count, padding side, processor-cache on/off, and prefix-cache on/off.
Visual projector + LM-head fused path: optimize the projector-to-text bridge and final vocab projection for large visual prefixes; keep deterministic accumulation order and reuse the P0.3+ dtype/tolerance contract.
Media-prefix cache index and deterministic gather/scatter: define stable media cache keys and packed KV offsets so group rollouts can reuse identical media prefixes while training can replay the same logical sequence for logprob validation.
Video frame / patch packing helper: add a small operator contract for frame-major vs token-major packing, temporal position IDs, and per-frame masks; first target short-video understanding, not video generation.
Multimodal reward-to-token scatter: fuse sequence-level, region-level, OCR/verifier, and length-penalty rewards into token-level advantages with group normalization, leave-one-out statistics, and deterministic reduction order.
Grounding / coordinate-action schema: add region_spans, bbox_targets, point_targets, action_spans, coordinate_system, and screen_size fields; validate normalized boxes, click points, and structured action tokens before reward or logprob comparison.
GUI / computer-use trajectory packer: pack repeated screenshot-observation-action turns with deterministic action loss masks, per-step rewards, and tool/environment observation spans; cover browser/mobile screenshots and action JSON as the first target.
Reward-cost scatter for multimodal safety RL: keep helpfulness reward, safety cost, and optional verifier reward as separate tensors through normalization and objective aggregation; support constrained PPO/GRPO-style objectives without mixing signs or masks.
Continuous-latent RL RFC: define the operator boundary for diffusion/audio/omni-modal trajectories, including latent-step KL, denoising-step reward accumulation, and trajectory masks; no kernel implementation until the RFC has a minimal correctness harness.
Audio/codec sub-case (source: sglang-omni#774): MOSS-TTS reference encoder produces batch-shape-dependent discrete codec tokens (single-vs-batch mismatch ~4–9%) — BF16 GEMM-shape drift crosses the residual quantizer (RLFQ) boundaries and flips token IDs at the continuous→discrete encode step, before they enter the LM, breaking rollout↔training input alignment. This is an input-tokenization / discrete-token case rather than a continuous-latent trajectory, and anchors where this RFC must draw the discrete-vs-continuous boundary. First reproducible minimal case, partner-confirmed but not yet staffed.
Framework compatibility harness: validate the same fixture through vLLM/sglang rollout and verl/vime/OpenRLHF/TRL-style training adapters, proving that RL-Kernel operators observe the same media spans, masks, and selected logprob targets.
Exit Criteria
A VLM contributor can run one image-text GRPO smoke test that verifies processor layout parity, modality-aware selected logprob, and loss masking before training.
The same image-text fixture produces aligned logprob/loss across batch size, padding layout, media-cache mode, and prefix-cache mode within the P0.3+ tolerance policy.
At least one operator path shows a concrete memory or latency win over the PyTorch/framework baseline without changing reward or loss semantics.
Unsupported modalities have explicit RFC status, fallback behavior, and validation requirements instead of half-supported kernels.
P1: End-to-End Loop and Joint Validation
Goal: on top of the P0 tooling and kernels, ship one boring, credible end-to-end path and jointly validate rollout-vs-training consistency on real workloads. The preferred path is vLLM rollout plus RL-Kernel logprob/loss kernels plus a verl/slime integration flag.
[RFC] integrate RL-Kernel as a vLLM rollout backend through a non-intrusive custom-op or optional-backend path
wire prefix_shared_attention into vLLM PagedAttention KV management
veRL / slime backend flag, for example use_rl_kernel_backend=True
A user can run a documented single-node GRPO loop with vLLM rollout, RL-Kernel logprob/loss, and a real reward provider.
Pure inference workloads remain unaffected when the RL-Kernel backend is disabled.
The integration reports memory, latency, and logprob-consistency metrics measured against the P0 cross-benchmark tooling.
P2: Large-Scale Kernel Development across ROCm / CUDA / Triton
Goal: take the P0 kernel set and the broader fused-operator backlog to scale across the three first-class backends: dispatch infrastructure, hardware-specific fast paths, cross-backend parity, packaging, and the distributed executors that large runs depend on.
Each new hardware backend has an explicit correctness suite, fallback story, and benchmark command.
Research kernels land behind stable primitives instead of creating one-off APIs.
Engine integrations are optional and do not expand RL-Kernel into a macro scheduling framework.
Deferred Unless Partner-Driven
These are useful, but they should not block the core roadmap unless a maintainer or external partner commits hardware, benchmarks, and validation time:
full domestic accelerator backend support
multi-node/RDMA rollout-training weight synchronization as a first-class product
broad OpenAI-compatible serving beyond local testing
complete parity across every combination of vLLM, sglang, Megatron, DeepSpeed, FSDP, verl, and slime
Multi-modal continuous-to-discrete input tokenization consistency (audio codec GEMM batch-variance alignment). Awaiting reproducible minimal case and partner validation to assess impact on RL training loss.
This roadmap is a living document. If you are interested in any item, especially an RFC or a P0/P1 task, comment on it or open an issue, and join the Discord discussion. Contributions across CUDA, ROCm, and domestic silicon are all welcome.
RL-Kernel Roadmap (2026)
We have organized RL-Kernel's goals for the second half of 2026 into several focus areas. Discussion happens on our Discord.
Repo: https://github.com/RL-Align/RL-Kernel
Discord: https://discord.com/invite/5HfkFjmPD
Hardware: NVIDIA CUDA | AMD ROCm | domestic accelerator
RL-Kernel is a micro-level RL operator library. We do not own RL scheduling frameworks; verl and slime handle the macro dataflow. Our job is to replace inefficient PyTorch paths around loss, KL, log-prob, variable-length packing, and rollout/training consistency with extreme fused kernels, and to keep those kernels numerically aligned across rollout engines such as vLLM/sglang and training engines such as Megatron/DeepSpeed/FSDP.
Roadmap Rules
Priorities are intentionally uneven:
An item is considered done only when it has:
Current Foundation
These pieces are already landed and form the base for the next phase:
P0: Foundation, Showcase Kernels, and Consistency Tooling
Goal: lock the things everything else depends on: reproducible benchmark/CI/doc infrastructure, three high-value P0 kernels across CUDA/ROCm/Triton, and train-inference consistency tooling.
P0.1 Benchmark, CI, and Docs
P0.2 Typical Fused Operators on CUDA / ROCm / Triton
The core fused operators that must exist and be numerically correct on all three backends before scale-out work begins.
Fused CE LogProb without materializing logits + fused backward. cc @KJLdefeated
Avoid landing large
[B, S, V]logits for large-vocabulary models, so the saved memory can be used for larger batches or longer CoT responses. The forward path streams vocab blocks with online softmax; the backward path recomputes tiles to trade compute for memory instead of storing full logits/probabilities.- CUDA: use SM90 TMA/WGMMA-oriented streaming where available, with SM80 cp.async/mma.sync fallback; keep online-softmax state in registers and shared memory.
- ROCm: use wavefront=64-aware reductions and LDS layouts, replacing TMA with manual double buffering to cover CDNA limitations.
- Triton: provide a portable semantic baseline and tolerance test target for CUDA/ROCm native implementations.
Fused FlashAttention with causal mask, varlen packing, and exported attention LSE.
Target long-context RL workloads with packed variable-length batches. Export attention softmax LSE for backward, diagnostics, and rollout/training attention alignment. The exported LSE is attention-domain LSE, not vocab-logprob LSE.
- CUDA: SM90 WGMMA + TMA path, SM80 mma.sync fallback, with varlen and LSE export.
- ROCm: MFMA-based kernel with 16x16x16-style tiling, CK comparison, and RL-Kernel-specific LSE/varlen semantics.
- Triton: extend the existing dense fallback with LSE export and varlen support as the cross-platform semantic baseline.
Batch-Invariant Deterministic LogProb.
Lock selected-logprob reduction order so the same sequence does not drift with batch size, chunked prefill, prefix-cache mode, or packing layout. This is the core P0 operator for preventing KL drift caused by rollout/training logprob inconsistency.
- CUDA: avoid atomicAdd, fix the tree-reduction topology and block partition, and keep per-row reduction independent of surrounding batch shape. (Implement Batch-Invariant Deterministic LogProb CUDA Kernel to Eliminate Batch-Size Drift #96)
- ROCm: use deterministic wavefront=64 reductions and connect the result to the ROCm-to-CUDA parity suite.
- Triton: disable autotune, lock configs, and validate that reduction behavior is stable for the supported BLOCK_SIZE set.
P0.3 Train-Inference Consistency Tooling
Goal: make "same model, same sequence, same policy state" produce aligned log-probs across rollout and training engines. This is the main technical identity of RL-Kernel.
Note: The consistency tooling below is refined and re-scheduled under the P0.3+ Sprint workstreams. Specifically: the cross-benchmark tool (#106) → WS3; the layer-wise drift probe → WS4; TP-invariant reductions (#102) → WS2 (#109); the dtype/tolerance policy → WS1 (#154) + WS2 (#116). P0.3 lists them by tool type; the Sprint section lists them by workstream and execution order.
Exit Criteria
P0.3+ Sprint: Operator-Level Train-Inference Consistency (Demo-grade → Usable-grade)
Final goal (the full vision): On standard Transformer architecture, under real multi-GPU training, with the real vLLM rollout engine and the real Megatron/FSDP training engine, prove that the same sequence produces aligned (bitwise / tight-tolerance) logprobs across both, and that the alignment holds across batch size, parallelism config, and padding. In parallel, integrate RL-Kernel operators into vime as the operator-level consistency layer beneath its framework-level alignment.
This month's scope: WS1 + WS2 + WS5. WS3 (real-engine alignment) and WS4 (diagnostics & reproducibility) are deferred to the next phase.
This month proves operator-level consistency on single-GPU and multi-GPU (our op chain is self-consistent), plus a scoped-out vime integration path — not real vLLM == real Megatron alignment, which requires WS3.
Out of scope this month: MoE, linear attention, FP8, ROCm / domestic chips.
Definition of "usable-grade" (the full bar):
This month targets items 1–2 (WS1 + WS2). Items 3–4 require WS3/WS4 and are the next-phase target.
Task granularity: Each Workstream below is an epic (one tracking issue). Each - [ ] item under a Workstream is a direction that becomes its own GitHub issue, not a single PR. Once an issue is picked up, the owner breaks it into 1 or N PRs and lists the planned PRs in the issue description. Rule of thumb: one independently reviewable, independently mergeable change completable in 1–3 days = one PR; otherwise split further. Granularity is uneven by design: items like Per-PR CI or Positioning doc may be a single PR, while vLLM rollout integration, matmul, and #106 are large issues that split into 5–6 PRs each.
Workstream 1 — Full Batch-Invariant Forward Chain
The foundation. Ops are mutually independent and can be built in parallel; each needs its own ground-truth and batch-config sweep.
Workstream 2 — Distributed / Parallelism Invariance
Real training is multi-GPU; without this layer, WS1 does not hold in real scenarios.
Workstream 3 — Real-Engine Alignment (critical path)
Moving from our own assembled chain to real vLLM matching real Megatron. This is the step most likely to consume the schedule buffer. All vLLM integration is non-intrusive and lives in the RL-Kernel repo using vLLM's extension points (custom op / backend plugin); we do not modify vLLM source or file PRs against the vLLM repo this month. W1 first confirms whether vLLM's extension points are sufficient for the logprob / attention hooks we need.
Workstream 4 —Diagnostics & Reproducibility
Parallel, supports all other workstreams.
Workstream 5 - RL-Kernel to vime Integration Exploration
Independent and decoupled from the consistency critical path; driven by a small dedicated effort. Overlaps with WS3's vLLM integration probing, the same investigation of vLLM's rollout path feeds both, so coordinate to avoid duplicate work.
fused logp,ratio_kl, andgrpo_loss; benchmark vime vs vime + RL-Kernel on dense workloads, and validate MoE only asvime + R3vsvime + R3 + RL-Kernelwithout claiming MoE-kernel acceleration or replacing R3. cc @inaniloquenteeCritical Path & Scheduling
This month's scope: WS1 + WS2 + WS5. WS3 (real-engine alignment) and WS4 (diagnostics & reproducibility) are deferred to the next phase. The goal this month is operator-level consistency proven on single-GPU and multi-GPU (self-consistency), plus a vime integration path scoped out — not full real-engine train-inference alignment, which requires WS3.
Serial critical path (determines whether the month succeeds):
WS1 ground-truth harness → WS1 ops → WS1 chain test (single-GPU) → WS2 distributed chain test (multi-GPU) → cross-config alignment
WS2 merges in once WS1 ops reach a working draft; WS5 runs in parallel throughout and is largely independent (its PoC uses the existing fused_logp, not this month's new ops). Environment & infra setup (listed under WS3) must still be done this month because WS2's multi-GPU tests depend on it — pull it forward to W1.
Where adding people helps (parallel): the WS1 ops, the WS2 sub-items, and WS5.
Where adding people does not help (serial bottlenecks): WS1 chain integration and WS2 cross-config alignment; these depend on time and experience, not headcount.
Weekly milestones (goal: operator-level consistency with alignment data on single + multi-GPU, not production-grade real-engine alignment):
W1: ground-truth harness + numerical contract finalized; environment & infra setup done (matched multi-GPU env, fixed model loadable, TP/FSDP launchable — blocking WS2); WS1 ops started in parallel (matmul gets a dedicated owner; harness owner goes first since all ops depend on it); WS5 maps vime architecture and runs vime's native example as a baseline.
W2: each WS1 op passes its single-op batch sweep; WS2 begins merging in (TP / SP / deterministic collectives) as WS1 ops reach working draft; WS5 integration design doc drafted.
W3: WS1 chain test passes (full-chain single-GPU consistency, forward + backward); WS2 TP>1 consistency working; WS5 minimal PoC (one operator invoked by vime).
W4: WS2 distributed chain test + cross-config alignment (the hardest WS2 item) finalized; reproducible benchmark capturing the operator-level consistency results (single + multi-GPU, enabled vs disabled drift, overhead). Remaining time for debugging.
Risks
This month's highest-uncertainty items are batch-invariant matmul (WS1) and cross-config alignment (WS2): matmul because cuBLAS does not provide batch-invariance and split-k breaks it; cross-config because aligning logprob across mismatched parallelism (e.g. rollout TP=2 vs training FSDP) has intricate reduction-order logic. Assign a dedicated engineer to each, and treat cross-config as the serial bottleneck that decides whether W4 lands.
WS1's ground-truth harness is a hidden dependency for everyone — if it slips, every op loses its verification target. It must be the first thing finished in W1.
Scope reminder for external communication: completing WS1 + WS2 proves operator-level consistency (our op chain is self-consistent across batch size, padding, and parallelism) — it does NOT prove real vLLM == real Megatron, which is WS3. Describe the result as "operator-level train-inference consistency, validated on single and multi-GPU," not "train-inference consistency closed-loop."
(Deferred to next phase, noted for forward planning: WS3 real-engine alignment is the highest-risk item overall — real vLLM vs real Megatron commonly shows each engine correct on its own while the comparison still differs by ~1e-3, and tracing it to an overlooked layout or collective can take significant time. When WS3 starts, assign experienced engineers and begin probing engines immediately.)
P0.4: Multimodal RL Operator Development Plan
Goal: support image-text and omni-modal RL workloads without expanding RL-Kernel into a multimodal scheduling framework. verl / EasyR1 / OpenRLHF / TRL / vLLM-style stacks already expose VLM GRPO/PPO/RLOO paths with image-text prompts, multi-image or video inputs, processor caching, VLM reward models, and framework-level rollout orchestration. RL-Kernel should own the operator layer where modality expansion, packing, masking, caching, and rollout/training logprob consistency currently become slow or fragile.
Scope boundary:
Common Multimodal RL Problems
Operator Workstreams
Multimodal batch schema and trace fixtures: extend the minimal RL batch schema with
media_ids,modality_spans,loss_spans,vision_token_offsets,media_cache_keys, andprocessor_fingerprint; ship synthetic image-text fixtures plus one real Geo3K/Qwen-VL-style smoke fixture.Processor-layout parity checker: one command compares rollout vs training token IDs, image-token insertion, position IDs, attention masks, media spans, and loss masks before running any logprob comparison; failures point to the first mismatched span.
Modality-aware pack-and-pad / unpad operator: pack ragged text, image, and video patch spans into RL-shaped batches with stable offsets; support deterministic unpack for diagnostics and fallback to PyTorch when the backend lacks a native kernel.
Modality-mask selected-logprob operator: extend selected-logprob, ratio/KL, and GRPO/PPO loss kernels so they consume compact
loss_spans/modality_spansdirectly and never score visual-prefix or prompt tokens by accident.Batch-invariant VLM prefill chain: validate RMSNorm, attention, projector, LM head, and selected-logprob on image-text prompts across batch=1/N, multi-image count, padding side, processor-cache on/off, and prefix-cache on/off.
Visual projector + LM-head fused path: optimize the projector-to-text bridge and final vocab projection for large visual prefixes; keep deterministic accumulation order and reuse the P0.3+ dtype/tolerance contract.
Media-prefix cache index and deterministic gather/scatter: define stable media cache keys and packed KV offsets so group rollouts can reuse identical media prefixes while training can replay the same logical sequence for logprob validation.
Video frame / patch packing helper: add a small operator contract for frame-major vs token-major packing, temporal position IDs, and per-frame masks; first target short-video understanding, not video generation.
Multimodal reward-to-token scatter: fuse sequence-level, region-level, OCR/verifier, and length-penalty rewards into token-level advantages with group normalization, leave-one-out statistics, and deterministic reduction order.
Grounding / coordinate-action schema: add
region_spans,bbox_targets,point_targets,action_spans,coordinate_system, andscreen_sizefields; validate normalized boxes, click points, and structured action tokens before reward or logprob comparison.GUI / computer-use trajectory packer: pack repeated screenshot-observation-action turns with deterministic action loss masks, per-step rewards, and tool/environment observation spans; cover browser/mobile screenshots and action JSON as the first target.
Reward-cost scatter for multimodal safety RL: keep helpfulness reward, safety cost, and optional verifier reward as separate tensors through normalization and objective aggregation; support constrained PPO/GRPO-style objectives without mixing signs or masks.
Continuous-latent RL RFC: define the operator boundary for diffusion/audio/omni-modal trajectories, including latent-step KL, denoising-step reward accumulation, and trajectory masks; no kernel implementation until the RFC has a minimal correctness harness.
Audio/codec sub-case (source: sglang-omni#774): MOSS-TTS reference encoder produces batch-shape-dependent discrete codec tokens (single-vs-batch mismatch ~4–9%) — BF16 GEMM-shape drift crosses the residual quantizer (RLFQ) boundaries and flips token IDs at the continuous→discrete encode step, before they enter the LM, breaking rollout↔training input alignment. This is an input-tokenization / discrete-token case rather than a continuous-latent trajectory, and anchors where this RFC must draw the discrete-vs-continuous boundary. First reproducible minimal case, partner-confirmed but not yet staffed.
Framework compatibility harness: validate the same fixture through vLLM/sglang rollout and verl/vime/OpenRLHF/TRL-style training adapters, proving that RL-Kernel operators observe the same media spans, masks, and selected logprob targets.
Exit Criteria
P1: End-to-End Loop and Joint Validation
Goal: on top of the P0 tooling and kernels, ship one boring, credible end-to-end path and jointly validate rollout-vs-training consistency on real workloads. The preferred path is vLLM rollout plus RL-Kernel logprob/loss kernels plus a verl/slime integration flag.
use_rl_kernel_backend=TrueExit Criteria
P2: Large-Scale Kernel Development across ROCm / CUDA / Triton
Goal: take the P0 kernel set and the broader fused-operator backlog to scale across the three first-class backends: dispatch infrastructure, hardware-specific fast paths, cross-backend parity, packaging, and the distributed executors that large runs depend on.
Cross-Backend Dispatch and Hardware Fast Paths
Moved from the previous P0 operator backlog:
Packaging and Build
torch.compile/ Dynamo compatibility for all custom opsDistributed and Executors
Exit Criteria
P3: Domestic Accelerator and Research Expansion
Goal: expand after the three-backend path has correctness, integration, and reproducibility locked.
Domestic Accelerator
Architecture-Specific Optimizations
Objective Research Kernels
Additional Engine Integrations
Exit Criteria
Deferred Unless Partner-Driven
These are useful, but they should not block the core roadmap unless a maintainer or external partner commits hardware, benchmarks, and validation time:
This roadmap is a living document. If you are interested in any item, especially an RFC or a P0/P1 task, comment on it or open an issue, and join the Discord discussion. Contributions across CUDA, ROCm, and domestic silicon are all welcome.