Validate EmbedLayerNormalization indices and use 64-bit offsets in CUDA BERT LayerNorm/SkipLayerNorm #29257

Open
titaiwangms wants to merge 1 commit into microsoft:main from titaiwangms:validate-embedlayernorm-and-int64-layernorm-offsets

Conversation

@titaiwangms

Contributor

Description

This change has two related parts in the CUDA BERT embedding/layer-norm path:

  1. EmbedLayerNormalization index validation (CUDA). The CUDA kernel now validates the word, position, and segment ids against their embedding-table row counts before using them to index the tables, mirroring the existing CPU path (device-side, early return, no silent clamp). CheckInputs additionally requires position_embedding to have at least sequence_length rows when position_ids is not supplied. The device error-flag host readback is skipped while a CUDA graph is being captured (cudaStreamIsCapturing) so CUDA-graph capture remains supported.

  2. 64-bit write-element offsets in CUDA BERT LayerNorm / SkipLayerNorm / EmbedLayerNorm. The global output write-element offsets in the LayerNorm, SimplifiedLayerNorm, LayerNormSmall, and SimplifiedLayerNormSmall device helpers (and their EmbedLayerNorm / SkipLayerNorm call sites) are widened to 64-bit so that large tensors (batch * seq * hidden > 2^31) index the output correctly. The gamma/beta indices remain 32-bit (bounded by the hidden dimension).
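The reason for widening the write offset can be sketched on the host in plain C++. This is a minimal illustration, not the actual kernel code: `WriteOffset64` and `FitsInInt32` are hypothetical names, and the `[rows, hidden]` layout is an assumption made for the example. The point is that the row/hidden product is formed in 64-bit, while the per-row column index (the one used for gamma/beta) stays a plain `int` because it is bounded by the hidden dimension.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not the onnxruntime implementation): flat output
// offset of element `col` in row `row` of a [rows, hidden] tensor.
// `row` is widened before the multiply, so the product is computed in
// 64-bit and cannot wrap for large tensors.
inline int64_t WriteOffset64(int row, int hidden, int col) {
  // `col` doubles as the gamma/beta index; it is always < hidden,
  // which is why it can safely remain 32-bit.
  assert(row >= 0 && hidden > 0 && col >= 0 && col < hidden);
  return static_cast<int64_t>(row) * hidden + col;
}

// True when an offset would still be representable as a signed 32-bit int.
inline bool FitsInInt32(int64_t offset) {
  return offset >= 0 && offset <= std::numeric_limits<int32_t>::max();
}
```

For example, with `hidden = 1024`, row 3,000,000 starts at offset 3,072,000,000, which `FitsInInt32` rejects, whereas row 1,000 is still well inside the 32-bit range.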

Changes

  • Validate input_ids, position_ids, and segment_ids values against the corresponding embedding-table row counts in the CUDA kernel.
  • Require position_embedding rows >= sequence_length in CheckInputs when position_ids is absent.
  • Skip the device error-flag readback during CUDA-graph capture.
  • Widen the CUDA BERT LayerNorm/SkipLayerNorm/EmbedLayerNorm global write-element offsets to 64-bit.
  • Add EmbedLayerNorm expect-failure tests on the CPU and CUDA execution providers; existing LayerNorm/SkipLayerNorm numeric tests are unchanged.
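The validation rules listed above can be modeled on the host in a few lines of C++. This is only a sketch under stated assumptions: the function names are hypothetical, ids are assumed to be 32-bit integers, and the real check runs per token inside the CUDA kernel and reports through a device error flag rather than a return value.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical host-side model of the per-id check: an id is valid only if
// it addresses an existing row of its embedding table.
inline bool IdInRange(int64_t id, int64_t table_rows) {
  return id >= 0 && id < table_rows;
}

// Returns false as soon as one id is out of range (early return).
// The id is never clamped into range, so bad input is reported, not hidden.
inline bool ValidateIds(const std::vector<int32_t>& ids, int64_t table_rows) {
  for (int32_t id : ids) {
    if (!IdInRange(id, table_rows)) {
      return false;
    }
  }
  return true;
}

// When position_ids is absent, positions are implicitly 0..sequence_length-1,
// so the position table needs at least sequence_length rows.
inline bool PositionTableCoversSequence(int64_t position_rows,
                                        int64_t sequence_length) {
  return position_rows >= sequence_length;
}
```

The same `ValidateIds` shape would apply to each of the word, position, and segment inputs, each checked against the row count of its own table.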

Motivation

Improves input validation and error diagnostics for malformed EmbedLayerNormalization inputs (aligning CUDA with CPU), and ensures correct indexing for large tensors. No behavior change for valid, in-range inputs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@titaiwangms titaiwangms force-pushed the validate-embedlayernorm-and-int64-layernorm-offsets branch from 917db75 to eaa86f8 on June 25, 2026 01:43
@titaiwangms

Contributor Author

Superseded: per the updated plan, this is split into a validation-only PR, #29256 (kept standalone), and a separate 64-bit write-offset PR stacked on #29256.

@titaiwangms titaiwangms deleted the validate-embedlayernorm-and-int64-layernorm-offsets branch June 25, 2026 01:45
@titaiwangms titaiwangms restored the validate-embedlayernorm-and-int64-layernorm-offsets branch June 25, 2026 01:46
@titaiwangms titaiwangms reopened this Jun 25, 2026
…DA BERT LayerNorm/SkipLayerNorm

Two related changes to the CUDA BERT normalization path:

(1) EmbedLayerNormalization input validation. The CUDA kernel validates the
word/position/segment ids against their embedding-table row counts on the
device and returns a clear error instead of indexing past the tables (no
silent clamp), mirroring the CPU implementation. CheckInputs requires the
position_embedding table to have at least sequence_length rows when
position_ids is not provided. The host error-flag readback is skipped while
a CUDA graph is being captured so graph capture remains supported. Read-side
offset arithmetic uses int64.

(2) 64-bit write-element offsets. The global write-element offsets in the
CUDA BERT LayerNorm/SkipLayerNorm/EmbedLayerNorm write path are widened to
64-bit so tensors where batch*seq*hidden exceeds 2^31 index correctly; the
gamma/beta indices (bounded by hidden_size) stay 32-bit. Comments document
why each widened site must remain 64-bit. No behavior change at normal sizes;
existing LayerNorm/SkipLayerNorm numeric tests are unchanged.

Adds EmbedLayerNorm expect-failure tests for the validated cases.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@titaiwangms titaiwangms force-pushed the validate-embedlayernorm-and-int64-layernorm-offsets branch from eaa86f8 to ad564ff on June 25, 2026 01:52