Validate EmbedLayerNormalization embedding indices (CUDA) #29256
Closed
titaiwangms wants to merge 3 commits into
Harden EmbedLayerNormalization input handling so malformed inputs are rejected with a clear INVALID_ARGUMENT instead of reading past the embedding tables.

- Shared CheckInputs helper: when position_ids are not supplied, require position_embedding to have at least sequence_length rows.
- CUDA kernel: validate word/position/segment ids against their embedding table sizes, surfacing an out-of-range id via a device error flag instead of silently clamping; widen the embedding offset arithmetic to int64 to avoid overflow.
- CUDA op: read back the device error flag and return INVALID_ARGUMENT.
- Tests: enable EmbedLayerNormNegativePositionIds on CUDA and add EmbedLayerNormWordIdOutOfRange and EmbedLayerNormPositionEmbeddingTooFewRows expect-failure cases covering CPU and CUDA.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
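The checks described in this commit can be sketched on the host as follows. This is a minimal illustration, not the actual kernel or the real CheckInputs helper: the function names (PositionTableLargeEnough, IdInRange, RowOffset, ValidateTokens) are hypothetical, and the loop stands in for per-token device threads.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Shape rule: without explicit position_ids, positions 0..sequence_length-1
// are used implicitly, so the position table needs at least that many rows.
bool PositionTableLargeEnough(int64_t position_rows, int64_t sequence_length,
                              bool has_position_ids) {
  return has_position_ids || position_rows >= sequence_length;
}

// True when `id` is a valid row index into a table with `rows` rows.
bool IdInRange(int64_t id, int64_t rows) { return id >= 0 && id < rows; }

// Element offset of row `id` in a row-major [rows, hidden_size] table.
// Both operands are widened first, so the product is computed in 64 bits
// and cannot wrap a 32-bit int.
int64_t RowOffset(int32_t id, int32_t hidden_size) {
  return static_cast<int64_t>(id) * static_cast<int64_t>(hidden_size);
}

// Emulates the device error flag: returns 0 when every token's ids are in
// range and 1 otherwise. An out-of-range token is skipped, never clamped.
// Assumes the three id vectors have the same length.
int ValidateTokens(const std::vector<int32_t>& word_ids,
                   const std::vector<int32_t>& position_ids,
                   const std::vector<int32_t>& segment_ids,
                   int64_t word_rows, int64_t position_rows,
                   int64_t segment_rows) {
  int error_flag = 0;
  for (std::size_t i = 0; i < word_ids.size(); ++i) {
    const bool ok = IdInRange(word_ids[i], word_rows) &&
                    IdInRange(position_ids[i], position_rows) &&
                    IdInRange(segment_ids[i], segment_rows);
    if (!ok) {
      error_flag = 1;  // Record the failure and skip the embedding reads.
      continue;
    }
    // In-range ids: the embedding reads would use RowOffset(...) here.
  }
  return error_flag;
}
```

With a hidden size of 1024, a row index of 3,000,000 gives an offset of 3,072,000,000 elements, which exceeds the 32-bit signed maximum; that is the kind of product the int64 widening is meant to keep exact.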
…h capture

The device kernel already validates word/position/segment ids and skips the embedding reads for any out-of-range id, so input handling stays in range regardless of the host-side readback. The readback only upgrades a skipped row into a clean INVALID_ARGUMENT status.

cudaStreamSynchronize and device-to-host copies are not allowed on a stream that is capturing a CUDA graph, so make the readback conditional: query cudaStreamIsCapturing and perform the copy + synchronize + status check only when the stream is not capturing.

This restores CUDA graph capture support for static-shape models (where EmbedLayerNormalization is typically the first node) while keeping ids in range device-side; error surfacing resumes on normal non-capturing runs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
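The control flow of the conditional readback can be sketched without the CUDA runtime. This is a host-only mock under stated assumptions: CaptureState stands in for the status that the real code obtains from cudaStreamIsCapturing, read_error_flag stands in for the device-to-host copy plus synchronize, and FinishLaunch, CaptureState, and Status are hypothetical names.

```cpp
#include <functional>

// Hypothetical stand-in for the capture status of the stream.
enum class CaptureState { kNone, kActive };

// Hypothetical stand-in for the status returned by the op.
enum class Status { kOk, kInvalidArgument };

// Decides whether to read the device error flag after the kernel launch.
// `read_error_flag` models the copy + synchronize and returns the flag value.
Status FinishLaunch(CaptureState capture,
                    const std::function<int()>& read_error_flag) {
  if (capture != CaptureState::kNone) {
    // Synchronizing or copying to the host is not allowed while capturing,
    // so skip the readback. The kernel has already skipped any
    // out-of-range rows, so no out-of-range read happens either way.
    return Status::kOk;
  }
  // Normal run: a non-zero flag becomes an INVALID_ARGUMENT-style status.
  return read_error_flag() != 0 ? Status::kInvalidArgument : Status::kOk;
}
```

The trade-off in this design is that a capturing run cannot report a bad id; the error only surfaces on ordinary, non-capturing runs.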
Source-only follow-up to the EmbedLayerNormalization input-validation changes:

- Widen the output offset to int64 to match the word/position/segment offsets; narrow explicitly at the LayerNorm call, which takes an int offset.
- Reword the kernel and test comments to neutral "out of range of the embedding table" phrasing.
- Trim the segment_embedding_length parameter comment to fit the column limit.
- Document that the CUDA NHWC EP is intentionally left enabled for the negative-position-id test (it shares the same validated kernel and is skipped automatically if not registered).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
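The widen-then-narrow pattern from the first item might look like the sketch below. The function names are hypothetical, and the FitsInInt range check is an assumption added for illustration; the commit message only says the offset is narrowed explicitly at the call.

```cpp
#include <cstdint>
#include <limits>

// Output offset of a token's row, computed in 64 bits like the
// word/position/segment offsets.
int64_t OutputOffset(int32_t token_index, int32_t hidden_size) {
  return static_cast<int64_t>(token_index) * static_cast<int64_t>(hidden_size);
}

// Assumption for illustration: reports whether a 64-bit offset survives
// narrowing to int unchanged.
bool FitsInInt(int64_t value) {
  return value >= std::numeric_limits<int>::min() &&
         value <= std::numeric_limits<int>::max();
}

// Explicit narrowing at the boundary of a callee that takes an int offset.
// The cast makes the conversion visible instead of relying on an implicit one.
int NarrowForLayerNorm(int64_t offset) { return static_cast<int>(offset); }
```

A caller would typically check FitsInInt before narrowing; an offset that does not fit cannot be represented by an int parameter at all, which is consistent with the note in the description that the output write-index widening is handled separately.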
Superseded: EmbedLayerNormalization index validation is being combined with the CUDA BERT LayerNorm/SkipLayerNorm 64-bit offset change into a single PR (they share onnxruntime/contrib_ops/cuda/bert/embed_layer_norm_impl.cu and are not file-disjoint). Replaced by the combined PR.
Superseded by the combined PR #29257 (EmbedLayerNormalization index validation + 64-bit CUDA BERT LayerNorm/SkipLayerNorm write offsets). Closing per final packaging.
Description
The CUDA EmbedLayerNormalization kernel now validates the word, position, and segment ids against their embedding-table row counts before using them to index the tables, mirroring the existing CPU path. Invalid ids are rejected via an early return (device error flag) instead of being used to index out of range or silently clamped.

Changes
- Validate input_ids, position_ids, and segment_ids values against the corresponding embedding-table row counts (device-side, early return, no silent clamp), matching the CPU behavior.
- CheckInputs: require position_embedding to have at least sequence_length rows when position_ids is not supplied.
- Skip the host-side error-flag readback while the stream is capturing (cudaStreamIsCapturing) so CUDA-graph capture remains supported.

Notes
This change keeps the LayerNorm device helper signature unchanged; the output write-index 64-bit widening is handled separately.

Motivation
Improves input validation and error diagnostics for malformed EmbedLayerNormalization inputs and aligns the CUDA path with the CPU path. No behavior change for valid inputs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>