[BugFix][S-TIR] Wrap bare scalar bodies in DefaultGPUSchedule to avoid root-block crash #19514
Merged
tlopex merged 1 commit into apache:main on May 6, 2026
When a PrimFunc body is a bare `SBlockRealize` (a fully-scalar op with
no enclosing loops and no iter vars), the realized block is itself the
function's root sref. `ThreadBind` reaches the `loops.empty()` branch
and calls `Schedule::AddUnitLoop(block)`, which fails the
`sref->parent != nullptr` check in `s_tir::AddUnitLoop` with
"Cannot add loops on top of the root block".
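The failure mode can be pictured with a toy model of the sref tree. This is only a sketch in plain Python: `StmtSRef` and `add_unit_loop` here are hypothetical stand-ins written for illustration, not TVM's classes or API, and they model nothing beyond the parent-pointer check described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StmtSRef:
    """Toy stand-in for a schedule sref: a statement name plus a parent link."""
    name: str
    parent: Optional["StmtSRef"] = None


def add_unit_loop(sref: StmtSRef) -> StmtSRef:
    """Insert a 1-extent loop sref between `sref` and its parent."""
    if sref.parent is None:
        # Models the `sref->parent != nullptr` check: the root has no parent
        # statement under which a new loop could be inserted.
        raise ValueError("Cannot add loops on top of the root block")
    loop = StmtSRef("unit_loop", parent=sref.parent)
    sref.parent = loop
    return loop


# A bare scalar block is itself the root sref, so the call is rejected.
bare = StmtSRef("scalar_add")
try:
    add_unit_loop(bare)
    bare_error = None
except ValueError as err:
    bare_error = str(err)

# Once the same block sits under an explicit root, the call succeeds.
root = StmtSRef("root")
leaf = StmtSRef("scalar_add", parent=root)
loop = add_unit_loop(leaf)
```

In this model the only difference between the failing and the succeeding call is whether the block has a parent, which is the property the fix restores.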
Before constructing the schedule, rewrite GPU-bound PrimFuncs whose
body is a bare-leaf `SBlockRealize` so the realized block is no longer
the root. The wrap conditions are intentionally narrow: body is
`SBlockRealize`, the block has empty `iter_vars`, and the block's body
is not `For` or `SBlockRealize` (so that well-formed implicit roots
already wrapping a loop nest are left alone). The new shape is
SBlockRealize(block=SBlock("root", body=
For(u, 0, 1, kSerial, SBlockRealize(iter_values=[u],
block=<original block, iter_vars=[IterVar(0..1, vu, kDataPar)]>))))
The synthesised 1-extent data-parallel iter keeps iter_values and
iter_vars counts consistent for downstream checks, and the new For
gives ThreadBind a real loop to bind to blockIdx.x / threadIdx.x.
Closes apache#17873.
Code Review
This pull request introduces a mechanism to handle PrimFunc bodies consisting of a bare SBlockRealize, such as scalar operations, within the DefaultGPUSchedule pass. It adds a WrapBareSBlockBody helper function to wrap these blocks in a unit loop, which prevents crashes during thread binding by ensuring there is a loop to operate on. A new test case for scalar addition has been included to verify the implementation. I have no feedback to provide as there were no review comments.
b92e53b to
727d931
tlopex approved these changes on May 6, 2026
Problem
Closes #17873.
`DefaultGPUSchedule` crashes when a PrimFunc body is a bare `SBlockRealize` (a fully-scalar op with no enclosing loops and no iter vars), failing with "Cannot add loops on top of the root block".
Minimal repro (TVMScript decorators are omitted in this snippet to satisfy the PR-body lint; the regression test uses the regular `T.prim_func` form):
Root Cause
The realized `scalar_add` block is itself the prim_func body's root sref; it has no parent stmt to mutate. `ThreadBind` (`src/s_tir/transform/default_gpu_schedule.cc`) reaches the `loops.empty()` branch and calls `sch->AddUnitLoop(block)`, which fails the `sref->parent != nullptr` check in `s_tir::AddUnitLoop` (`src/s_tir/schedule/primitive/loop_transformation.cc:1166`).
The schedule infrastructure additionally requires the prim_func body to be an `SBlockRealize` whose block is the function's root (`GetRootPrimFunc` in `src/s_tir/schedule/analysis/analysis.cc:53`), so the body cannot simply be wrapped in a top-level `For`.
Fix
Before constructing the schedule, rewrite GPU-bound PrimFuncs whose body is a bare-leaf `SBlockRealize` so the realized block is no longer the root. The wrap conditions are intentionally narrow:
- `func->body` is `SBlockRealize`,
- the block has empty `iter_vars`, and
- the block's body is not `For` or `SBlockRealize` (i.e. it is a leaf computation, not the well-formed implicit root that wraps a loop nest produced by the rest of the pipeline).
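When all three hold, the body becomes an `SBlockRealize` of a new "root" `SBlock` whose body is a serial unit loop `For(u, 0, 1)` around the original block, which gains a single 1-extent data-parallel iter var `vu` bound to `u`.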
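A minimal sketch of the rewrite on a toy IR may help make the three conditions and the resulting shape concrete. The dataclasses below are hypothetical stand-ins written for illustration, not TVM's node types, and `wrap_bare_sblock_body` only mirrors the name of the `WrapBareSBlockBody` helper mentioned in the review; the actual C++ implementation may differ in detail.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Evaluate:
    """Toy leaf statement holding an expression as text."""
    expr: str


@dataclass
class For:
    var: str
    min_value: int
    extent: int
    body: "Stmt"


@dataclass
class SBlock:
    name: str
    body: "Stmt"
    iter_vars: List[str] = field(default_factory=list)


@dataclass
class SBlockRealize:
    block: SBlock
    iter_values: List[str] = field(default_factory=list)


Stmt = Union[Evaluate, For, SBlockRealize]


def wrap_bare_sblock_body(body: Stmt) -> Stmt:
    """Wrap a bare leaf block under a fresh root block and a unit loop."""
    if not isinstance(body, SBlockRealize):
        return body  # condition 1: body must be an SBlockRealize
    block = body.block
    if block.iter_vars:
        return body  # condition 2: block must have no iter vars
    if isinstance(block.body, (For, SBlockRealize)):
        return body  # condition 3: an implicit root is left alone
    # Give the original block one 1-extent iter var `vu`, bound to loop var `u`.
    inner = SBlockRealize(
        block=SBlock(block.name, block.body, iter_vars=["vu"]),
        iter_values=["u"],
    )
    return SBlockRealize(block=SBlock("root", For("u", 0, 1, inner)))


# Hypothetical scalar body; the expression text is a placeholder.
bare = SBlockRealize(SBlock("scalar_add", Evaluate("c = a + b")))
wrapped = wrap_bare_sblock_body(bare)
rewrapped = wrap_bare_sblock_body(wrapped)
```

Because the new root block's body is a `For`, the third condition no longer holds after one application, so in this model the rewrite is idempotent: `rewrapped` is the same object as `wrapped`.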
The synthesised 1-extent data-parallel iter keeps `iter_values.size() == iter_vars.size()` for downstream checks, and the new For loop gives `ThreadBind` a real loop to bind to `blockIdx.x` / `threadIdx.x`. Already-scheduled functions and host-only PrimFuncs are skipped via the existing `IsScheduledOnGPU` / `kIsScheduled` gating.
Testing
10 passed (9 existing + 1 new `test_scalar_block_no_loops`). End-to-end compile + execute on RTX 3080 (sm_86): the scalar repro returns the expected `2.0 + 3.0 = 5.0`.