Validate EmbedLayerNormalization embedding indices (CUDA) by titaiwangms · Pull Request #29256 · microsoft/onnxruntime

titaiwangms · 2026-06-25T01:37:27Z

Description

The CUDA EmbedLayerNormalization kernel now validates the word, position, and segment ids against their embedding-table row counts before using them to index the tables, mirroring the existing CPU path. Invalid ids are rejected via an early return (device error flag) instead of being used to index out of range or silently clamped.
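For illustration only, the validate-then-index behavior described above can be modeled on the host in plain C++. This is a minimal sketch, not the kernel from this PR: `IdInRange`, `EmbedOneToken`, and the `error_flag` pointer are hypothetical names, and the tables are assumed to be row-major with `hidden_size` floats per row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: an id is valid only if it names an existing table row.
inline bool IdInRange(int id, int row_count) {
  return id >= 0 && id < row_count;
}

// Hypothetical host-side model of one token's lookup. `error_flag` stands in
// for the device error flag. Any out-of-range id sets the flag and returns
// early, so no table is read and no id is clamped.
bool EmbedOneToken(int word_id, int position_id, int segment_id,
                   const std::vector<float>& word_table, int word_rows,
                   const std::vector<float>& position_table, int position_rows,
                   const std::vector<float>& segment_table, int segment_rows,
                   int hidden_size, std::vector<float>& out, int* error_flag) {
  assert(hidden_size > 0 && error_flag != nullptr);
  if (!IdInRange(word_id, word_rows) ||
      !IdInRange(position_id, position_rows) ||
      !IdInRange(segment_id, segment_rows)) {
    *error_flag = 1;  // record the failure for the caller to report
    return false;     // early return: skip every embedding read
  }
  // Ids are known to be non-negative here, so the widening cast is safe.
  const auto row_start = [hidden_size](int id) {
    return static_cast<std::size_t>(id) * static_cast<std::size_t>(hidden_size);
  };
  out.assign(static_cast<std::size_t>(hidden_size), 0.0f);
  for (int i = 0; i < hidden_size; ++i) {
    out[i] = word_table[row_start(word_id) + i] +
             position_table[row_start(position_id) + i] +
             segment_table[row_start(segment_id) + i];
  }
  return true;
}
```

A caller would check the flag after the lookup and turn a set flag into an error status, rather than trusting whatever a clamped id would have produced.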

Changes

CUDA kernel: validate input_ids, position_ids, and segment_ids values against the corresponding embedding-table row counts (device-side, early return, no silent clamp), matching the CPU behavior.
CheckInputs: require position_embedding to have at least sequence_length rows when position_ids is not supplied.
Use 64-bit arithmetic for the read-side word/position/segment embedding offsets to avoid 32-bit overflow when computing table row offsets.
Skip the device error-flag host readback while a CUDA graph is being captured (cudaStreamIsCapturing) so CUDA-graph capture remains supported.
Add expect-failure tests on the CPU and CUDA execution providers, plus minor test/readability cleanups.
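To make the 64-bit offset item concrete, here is a small sketch of why the row offset is widened before the multiply. `RowOffset` and `FitsInInt32` are hypothetical names used only for this example, and the row count and hidden size are made-up values chosen so the product exceeds the 32-bit range.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Element offset of the first value in row `row_id` of a row-major table.
// Casting one operand to int64_t before multiplying keeps the whole product
// in 64 bits; multiplying two ints first could overflow.
inline int64_t RowOffset(int row_id, int hidden_size) {
  assert(row_id >= 0 && hidden_size > 0);
  return static_cast<int64_t>(row_id) * hidden_size;
}

// True when an offset could still be represented by a 32-bit signed int.
inline bool FitsInInt32(int64_t offset) {
  return offset >= std::numeric_limits<int32_t>::min() &&
         offset <= std::numeric_limits<int32_t>::max();
}
```

For example, `RowOffset(5000000, 1024)` is 5,120,000,000 elements, which `FitsInInt32` rejects, while `RowOffset(10, 768)` is 7,680 and fits comfortably.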

Notes

This change keeps the LayerNorm device helper signature unchanged; the output write-index 64-bit widening is handled separately.

Motivation

Improves input validation and error diagnostics for malformed EmbedLayerNormalization inputs and aligns the CUDA path with the CPU path. No behavior change for valid inputs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Harden EmbedLayerNormalization input handling so malformed inputs are rejected with a clear INVALID_ARGUMENT instead of reading past the embedding tables.

- Shared CheckInputs helper: when position_ids are not supplied, require position_embedding to have at least sequence_length rows.
- CUDA kernel: validate word/position/segment ids against their embedding table sizes, surfacing an out-of-range id via a device error flag instead of silently clamping; widen the embedding offset arithmetic to int64 to avoid overflow.
- CUDA op: read back the device error flag and return INVALID_ARGUMENT.
- Tests: enable EmbedLayerNormNegativePositionIds on CUDA and add EmbedLayerNormWordIdOutOfRange and EmbedLayerNormPositionEmbeddingTooFewRows expect-failure cases covering CPU and CUDA.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
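The shape rule added to CheckInputs can be sketched as follows. This assumes that, without position_ids, each token's position is its index in the sequence, so rows 0 through sequence_length - 1 must exist. `CheckResult` and `CheckPositionEmbeddingRows` are hypothetical stand-ins, not the helper's real signature or its status type.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for a status value carrying an error message.
struct CheckResult {
  bool ok;
  std::string message;
};

// When position ids are implicit, the table must cover every sequence index.
// When position ids are supplied, this shape check does not apply and each id
// is validated individually at lookup time instead.
CheckResult CheckPositionEmbeddingRows(bool has_position_ids,
                                       int64_t position_rows,
                                       int64_t sequence_length) {
  assert(position_rows >= 0 && sequence_length >= 0);
  if (!has_position_ids && position_rows < sequence_length) {
    return {false, "position_embedding has " + std::to_string(position_rows) +
                       " rows but sequence_length is " +
                       std::to_string(sequence_length)};
  }
  return {true, ""};
}
```

With these assumptions, a table of 4 rows and a sequence length of 8 fails the check only when position_ids is absent.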

…h capture

The device kernel already validates word/position/segment ids and skips the embedding reads for any out-of-range id, so input handling stays in range regardless of the host-side readback. The readback only upgrades a skipped row into a clean INVALID_ARGUMENT status.

cudaStreamSynchronize and device-to-host copies are not allowed on a stream that is capturing a CUDA graph, so make the readback conditional: query cudaStreamIsCapturing and perform the copy + synchronize + status check only when the stream is not capturing. This restores CUDA graph capture support for static-shape models (where EmbedLayerNormalization is typically the first node) while keeping ids in range device-side; error surfacing resumes on normal non-capturing runs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
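The control flow of the conditional readback can be modeled without the CUDA runtime. In this sketch, `stream_is_capturing` stands in for the result of querying cudaStreamIsCapturing and `read_error_flag` stands in for the copy + synchronize step. `RunStatus` and `FinishLaunch` are hypothetical names for illustration, not identifiers from the PR.

```cpp
#include <cassert>
#include <functional>

// Hypothetical status values for this sketch.
enum class RunStatus { kOk, kInvalidArgument };

// Decides whether to read the error flag back after the kernel launch.
// While the stream is capturing, the readback is skipped entirely because
// synchronizing or copying to the host is not allowed during capture.
RunStatus FinishLaunch(bool stream_is_capturing,
                       const std::function<int()>& read_error_flag) {
  assert(static_cast<bool>(read_error_flag));
  if (stream_is_capturing) {
    return RunStatus::kOk;  // ids were still validated by the kernel itself
  }
  // Normal run: a non-zero flag becomes an INVALID_ARGUMENT-style status.
  return read_error_flag() != 0 ? RunStatus::kInvalidArgument : RunStatus::kOk;
}
```

The point of the design is that correctness does not depend on the readback: skipping it during capture only loses the error report, not the range check.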

Source-only follow-up to the EmbedLayerNormalization input-validation changes:

- Widen the output offset to int64 to match the word/position/segment offsets; narrow explicitly at the LayerNorm call, which takes an int offset.
- Reword the kernel and test comments to neutral "out of range of the embedding table" phrasing.
- Trim the segment_embedding_length parameter comment to fit the column limit.
- Document that the CUDA NHWC EP is intentionally left enabled for the negative-position-id test (it shares the same validated kernel and is skipped automatically if not registered).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

titaiwangms · 2026-06-25T01:40:52Z

Superseded: EmbedLayerNormalization index validation is being combined with the CUDA BERT LayerNorm/SkipLayerNorm 64-bit offset change into a single PR (they share onnxruntime/contrib_ops/cuda/bert/embed_layer_norm_impl.cu and are not file-disjoint). Replaced by the combined PR.

titaiwangms · 2026-06-25T01:46:33Z

Superseded by the combined PR #29257 (EmbedLayerNormalization index validation + 64-bit CUDA BERT LayerNorm/SkipLayerNorm write offsets). Closing per final packaging.

titaiwangms and others added 3 commits June 25, 2026 01:36

titaiwangms closed this Jun 25, 2026

titaiwangms deleted the validate-embedlayernorm-embedding-indices branch June 25, 2026 01:40

titaiwangms mentioned this pull request Jun 25, 2026

Validate EmbedLayerNormalization indices and use 64-bit offsets in CUDA BERT LayerNorm/SkipLayerNorm #29257

Open

titaiwangms restored the validate-embedlayernorm-embedding-indices branch June 25, 2026 01:45

titaiwangms reopened this Jun 25, 2026

titaiwangms closed this Jun 25, 2026

titaiwangms deleted the validate-embedlayernorm-embedding-indices branch June 25, 2026 01:46

titaiwangms wants to merge 3 commits into microsoft:main from titaiwangms:validate-embedlayernorm-embedding-indices
