Validate EmbedLayerNormalization indices and use 64-bit offsets in CUDA BERT LayerNorm/SkipLayerNorm by titaiwangms · Pull Request #29257 · microsoft/onnxruntime

titaiwangms · 2026-06-25T01:41:11Z

Description

This change has two coherent parts in the CUDA BERT embedding/layer-norm path:

EmbedLayerNormalization index validation (CUDA). The CUDA kernel now validates the word, position, and segment ids against their embedding-table row counts before using them to index the tables, mirroring the existing CPU path. The check runs on the device and returns early on an out-of-range id; ids are not silently clamped. CheckInputs additionally requires position_embedding to have at least sequence_length rows when position_ids is not supplied. The host readback of the device error flag is skipped while a CUDA graph is being captured (detected via cudaStreamIsCapturing), so CUDA-graph capture remains supported.
64-bit write-element offsets in CUDA BERT LayerNorm / SkipLayerNorm / EmbedLayerNorm. The global output write-element offsets in the LayerNorm, SimplifiedLayerNorm, LayerNormSmall, and SimplifiedLayerNormSmall device helpers (and their EmbedLayerNorm / SkipLayerNorm call sites) are widened to 64-bit so that large tensors (batch * seq * hidden > 2^31) index the output correctly. The gamma/beta indices remain 32-bit (bounded by the hidden dimension).
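To illustrate why the write offset has to be widened before the multiply, here is a minimal host-side C++ sketch. The helper names WriteOffset32 and WriteOffset64 are hypothetical and are not the actual onnxruntime device helpers; the 32-bit version uses unsigned arithmetic only so that the wraparound is well-defined C++ (with signed int the overflow would be undefined behavior).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for illustration; not the onnxruntime device code.

// Offset computed entirely in 32 bits. Unsigned arithmetic is used so the
// wraparound (modulo 2^32) is well-defined rather than undefined behavior.
inline std::uint32_t WriteOffset32(std::uint32_t row, std::uint32_t hidden, std::uint32_t col) {
  return row * hidden + col;
}

// Offset computed in 64 bits: the row index is widened BEFORE the multiply,
// so the product row * hidden is never truncated.
inline std::int64_t WriteOffset64(int row, int hidden, int col) {
  // col indexes within one row, so it is bounded by hidden and fits in 32 bits
  // (the same reason the gamma/beta indices can stay 32-bit).
  assert(hidden > 0 && col >= 0 && col < hidden);
  return static_cast<std::int64_t>(row) * hidden + col;
}
```

For example, with row = 4194304 (2^22), hidden = 1024, and col = 5, the true offset is 2^32 + 5 = 4294967301. WriteOffset64 returns that value, while WriteOffset32 wraps around to 5 and would address the wrong output element.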

Changes

Validate input_ids, position_ids, and segment_ids values against the corresponding embedding-table row counts in the CUDA kernel.
Require position_embedding rows >= sequence_length in CheckInputs when position_ids is absent.
Skip the device error-flag readback during CUDA-graph capture.
Widen the CUDA BERT LayerNorm/SkipLayerNorm/EmbedLayerNorm global write-element offsets to 64-bit.
Add EmbedLayerNorm expect-failure tests on the CPU and CUDA execution providers; existing LayerNorm/SkipLayerNorm numeric tests are unchanged.
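The validation items above can be sketched in plain C++ as follows. This is only an illustration under stated assumptions: the names IdCheck, InRange, ValidateIds, and PositionTableCoversSequence are hypothetical, and the sketch assumes a valid id satisfies 0 <= id < row count. The real kernel performs the id check on the device and reports failure through an error flag, which is not modeled here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for illustration; not the onnxruntime implementation.
enum class IdCheck { kOk, kWordIdOutOfRange, kPositionIdOutOfRange, kSegmentIdOutOfRange };

// Assumption: an id is valid when 0 <= id < rows (negative ids are rejected).
inline bool InRange(std::int64_t id, std::int64_t rows) {
  assert(rows >= 0);
  return id >= 0 && id < rows;
}

// Check each id against its own table's row count and report the first
// failure instead of clamping the id into range.
inline IdCheck ValidateIds(std::int64_t word_id, std::int64_t position_id, std::int64_t segment_id,
                           std::int64_t word_rows, std::int64_t position_rows,
                           std::int64_t segment_rows) {
  if (!InRange(word_id, word_rows)) return IdCheck::kWordIdOutOfRange;
  if (!InRange(position_id, position_rows)) return IdCheck::kPositionIdOutOfRange;
  if (!InRange(segment_id, segment_rows)) return IdCheck::kSegmentIdOutOfRange;
  return IdCheck::kOk;
}

// Shape check for the case where position_ids is absent: positions
// 0..sequence_length-1 are then implied, so the table must cover all of them.
inline bool PositionTableCoversSequence(std::int64_t position_rows, std::int64_t sequence_length) {
  return position_rows >= sequence_length;
}
```

With a 30522-row word table, a 512-row position table, and a 2-row segment table, ValidateIds(30522, 0, 0, 30522, 512, 2) reports a word-id failure because the largest valid row is 30521, and PositionTableCoversSequence(512, 513) is false.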

Motivation

Improves input validation and error diagnostics for malformed EmbedLayerNormalization inputs (aligning CUDA with CPU), and ensures correct indexing for large tensors. No behavior change for valid, in-range inputs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

titaiwangms · 2026-06-25T01:45:30Z

Superseded: per updated plan this is split into validation-only #29256 (kept standalone) and a separate 64-bit write-offset PR stacked on #29256.

…DA BERT LayerNorm/SkipLayerNorm

Two related changes to the CUDA BERT normalization path:

(1) EmbedLayerNormalization input validation. The CUDA kernel validates the word/position/segment ids against their embedding-table row counts on the device and returns a clear error instead of indexing past the tables (no silent clamp), mirroring the CPU implementation. CheckInputs requires the position_embedding table to have at least sequence_length rows when position_ids is not provided. The host error-flag readback is skipped while a CUDA graph is being captured so graph capture remains supported. Read-side offset arithmetic uses int64.

(2) 64-bit write-element offsets. The global write-element offsets in the CUDA BERT LayerNorm/SkipLayerNorm/EmbedLayerNorm write path are widened to 64-bit so tensors where batch*seq*hidden exceeds 2^31 index correctly; the gamma/beta indices (bounded by hidden_size) stay 32-bit. Comments document why each widened site must remain 64-bit.

No behavior change at normal sizes; existing LayerNorm/SkipLayerNorm numeric tests are unchanged. Adds EmbedLayerNorm expect-failure tests for the validated cases.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

titaiwangms force-pushed the validate-embedlayernorm-and-int64-layernorm-offsets branch from 917db75 to eaa86f8 June 25, 2026 01:43

titaiwangms closed this Jun 25, 2026

titaiwangms deleted the validate-embedlayernorm-and-int64-layernorm-offsets branch June 25, 2026 01:45

titaiwangms restored the validate-embedlayernorm-and-int64-layernorm-offsets branch June 25, 2026 01:46

titaiwangms reopened this Jun 25, 2026

titaiwangms mentioned this pull request Jun 25, 2026

Validate EmbedLayerNormalization embedding indices (CUDA) #29256

Closed

titaiwangms force-pushed the validate-embedlayernorm-and-int64-layernorm-offsets branch from eaa86f8 to ad564ff June 25, 2026 01:52

titaiwangms wants to merge 1 commit into microsoft:main from titaiwangms:validate-embedlayernorm-and-int64-layernorm-offsets
