Validate EmbedLayerNormalization embedding indices (CUDA) #29256

Closed
titaiwangms wants to merge 3 commits into microsoft:main from titaiwangms:validate-embedlayernorm-embedding-indices

@titaiwangms

Contributor

Description

The CUDA EmbedLayerNormalization kernel now validates the word, position, and segment ids against their embedding-table row counts before using them to index the tables, mirroring the existing CPU path. Invalid ids are rejected via an early return (device error flag) instead of being used to index out of range or silently clamped.

Changes

  • CUDA kernel: validate input_ids, position_ids, and segment_ids values against the corresponding embedding-table row counts (device-side, early return, no silent clamp), matching the CPU behavior.
  • CheckInputs: require position_embedding to have at least sequence_length rows when position_ids is not supplied.
  • Use 64-bit arithmetic for the read-side word/position/segment embedding offsets to avoid 32-bit overflow when computing table row offsets.
  • Skip the device error-flag host readback while a CUDA graph is being captured (cudaStreamIsCapturing) so CUDA-graph capture remains supported.
  • Add expect-failure tests on the CPU and CUDA execution providers, plus minor test/readability cleanups.
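
The per-token validation and 64-bit offset arithmetic listed above can be modeled on the host as follows. This is a minimal sketch, not the kernel from this PR: `EmbedTables`, `InRange`, `RowOffset`, and `ResolveToken` are hypothetical names, and the real device code sets its error flag from inside a CUDA kernel rather than through a plain pointer.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of the three embedding tables (illustration only).
struct EmbedTables {
  int64_t word_rows;      // rows in the word embedding table
  int64_t position_rows;  // rows in the position embedding table
  int64_t segment_rows;   // rows in the segment embedding table
  int64_t hidden_size;    // elements per embedding row
};

// True when `id` can index a table that has `rows` rows.
inline bool InRange(int64_t id, int64_t rows) { return id >= 0 && id < rows; }

// Element offset of row `id`, computed in 64 bits so that
// id * hidden_size cannot wrap around a 32-bit int.
inline int64_t RowOffset(int64_t id, int64_t hidden_size) {
  assert(id >= 0 && hidden_size > 0);
  return id * hidden_size;
}

// Resolves the three read offsets for one token. If any id is out of range,
// the error flag is set and the function returns early without computing
// offsets; the id is not clamped to a valid row.
inline bool ResolveToken(const EmbedTables& t, int32_t word_id,
                         int32_t position_id, int32_t segment_id,
                         int* error_flag, int64_t offsets[3]) {
  if (!InRange(word_id, t.word_rows) ||
      !InRange(position_id, t.position_rows) ||
      !InRange(segment_id, t.segment_rows)) {
    *error_flag = 1;
    return false;
  }
  offsets[0] = RowOffset(word_id, t.hidden_size);
  offsets[1] = RowOffset(position_id, t.hidden_size);
  offsets[2] = RowOffset(segment_id, t.hidden_size);
  return true;
}
```

With a table of 4,000,000 rows and a hidden size of 1024, row 3,000,000 starts at element 3,072,000,000, which exceeds the largest 32-bit signed value (2,147,483,647). That is the kind of case the 64-bit widening is meant to cover.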

Notes

This change keeps the LayerNorm device helper signature unchanged; the output write-index 64-bit widening is handled separately.

Motivation

Improves input validation and error diagnostics for malformed EmbedLayerNormalization inputs and aligns the CUDA path with the CPU path. No behavior change for valid inputs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

titaiwangms and others added 3 commits June 25, 2026 01:36
Harden EmbedLayerNormalization input handling so malformed inputs are
rejected with a clear INVALID_ARGUMENT instead of reading past the
embedding tables.

- Shared CheckInputs helper: when position_ids are not supplied, require
  position_embedding to have at least sequence_length rows.
- CUDA kernel: validate word/position/segment ids against their embedding
  table sizes, surfacing an out-of-range id via a device error flag
  instead of silently clamping; widen the embedding offset arithmetic to
  int64 to avoid overflow.
- CUDA op: read back the device error flag and return INVALID_ARGUMENT.
- Tests: enable EmbedLayerNormNegativePositionIds on CUDA and add
  EmbedLayerNormWordIdOutOfRange and EmbedLayerNormPositionEmbeddingTooFewRows
  expect-failure cases covering CPU and CUDA.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
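
The CheckInputs rule from the commit above can be sketched as a standalone predicate. The sketch assumes that, when position_ids is absent, each token's position defaults to its index within the sequence, so positions 0 through sequence_length - 1 must all be valid rows. `CheckResult` and `CheckPositionEmbeddingRows` are hypothetical names, not the onnxruntime helper or its Status type.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical result type for illustration; onnxruntime uses its own Status class.
struct CheckResult {
  bool ok;
  std::string message;
};

// Rejects a position embedding table that is too short for implicit positions.
// When position_ids is supplied, the row count is not checked here, because the
// supplied ids are validated individually elsewhere.
inline CheckResult CheckPositionEmbeddingRows(bool has_position_ids,
                                              int64_t position_embedding_rows,
                                              int64_t sequence_length) {
  assert(sequence_length >= 0);
  if (!has_position_ids && position_embedding_rows < sequence_length) {
    return {false, "position_embedding has " +
                       std::to_string(position_embedding_rows) +
                       " rows but sequence_length is " +
                       std::to_string(sequence_length)};
  }
  return {true, ""};
}
```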
…h capture

The device kernel already validates word/position/segment ids and skips the
embedding reads for any out-of-range id, so input handling stays in range
regardless of the host-side readback. The readback only upgrades a skipped
row into a clean INVALID_ARGUMENT status.

cudaStreamSynchronize and device-to-host copies are not allowed on a stream
that is capturing a CUDA graph, so make the readback conditional: query
cudaStreamIsCapturing and perform the copy + synchronize + status check only
when the stream is not capturing. This restores CUDA graph capture support
for static-shape models (where EmbedLayerNormalization is typically the first
node) while keeping ids in range device-side; error surfacing resumes on
normal non-capturing runs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
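
The control flow of the conditional readback described in the commit above can be modeled without the CUDA runtime. In this host-only sketch, `is_capturing` stands in for the status reported by cudaStreamIsCapturing, and `read_flag` stands in for the device-to-host copy plus stream synchronize. `ReadbackResult` and `CheckErrorFlag` are hypothetical names.

```cpp
#include <cassert>
#include <functional>

// Hypothetical outcome of the host-side error-flag check.
enum class ReadbackResult { kSkippedWhileCapturing, kOk, kInvalidArgument };

// Reads the error flag only when the stream is not capturing a CUDA graph.
// During capture the callable is never invoked, so no copy or synchronize
// is issued on the capturing stream.
inline ReadbackResult CheckErrorFlag(bool is_capturing,
                                     const std::function<int()>& read_flag) {
  assert(static_cast<bool>(read_flag));  // a readback callable must be provided
  if (is_capturing) {
    return ReadbackResult::kSkippedWhileCapturing;
  }
  return read_flag() != 0 ? ReadbackResult::kInvalidArgument
                          : ReadbackResult::kOk;
}
```

Skipping the readback only affects error reporting: as the commit message notes, the kernel already skips the embedding reads for any out-of-range id.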
Source-only follow-up to the EmbedLayerNormalization input-validation
changes:

- Widen the output offset to int64 to match the word/position/segment
  offsets; narrow explicitly at the LayerNorm call, which takes an int
  offset.
- Reword the kernel and test comments to neutral "out of range of the
  embedding table" phrasing.
- Trim the segment_embedding_length parameter comment to fit the column
  limit.
- Document that the CUDA NHWC EP is intentionally left enabled for the
  negative-position-id test (it shares the same validated kernel and is
  skipped automatically if not registered).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
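
The widen-then-narrow pattern from the first bullet of the commit above might look like the following. This is a sketch under assumptions: `OutputOffset`, `FitsInInt`, and `NarrowOffset` are hypothetical names, and the range assertion is an illustrative addition rather than something the commit message states.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Output offset computed in 64 bits, matching the read-side offsets.
inline int64_t OutputOffset(int64_t row, int64_t hidden_size) {
  return row * hidden_size;
}

// True when a 64-bit offset is representable as a non-negative int.
inline bool FitsInInt(int64_t offset) {
  return offset >= 0 && offset <= std::numeric_limits<int>::max();
}

// Explicit narrowing at the boundary of a callee that takes an int offset.
// The cast is visible at the call site instead of happening implicitly.
inline int NarrowOffset(int64_t offset) {
  assert(FitsInInt(offset));  // illustrative guard, an assumption of this sketch
  return static_cast<int>(offset);
}
```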
@titaiwangms

Contributor Author

Superseded: EmbedLayerNormalization index validation is being combined with the CUDA BERT LayerNorm/SkipLayerNorm 64-bit offset change into a single PR (they share onnxruntime/contrib_ops/cuda/bert/embed_layer_norm_impl.cu and are not file-disjoint). Replaced by the combined PR.

@titaiwangms titaiwangms deleted the validate-embedlayernorm-embedding-indices branch June 25, 2026 01:40
@titaiwangms titaiwangms restored the validate-embedlayernorm-embedding-indices branch June 25, 2026 01:45
@titaiwangms titaiwangms reopened this Jun 25, 2026
@titaiwangms

Contributor Author

Superseded by the combined PR #29257 (EmbedLayerNormalization index validation + 64-bit CUDA BERT LayerNorm/SkipLayerNorm write offsets). Closing per final packaging.

@titaiwangms titaiwangms deleted the validate-embedlayernorm-embedding-indices branch June 25, 2026 01:46