[WS1][kernels] Batch-invariant matmul / GEMM #146

Part of WS1 — Full Batch-Invariant Forward Chain (epic: #<WS1 tracking issue>)

## Why

This is the highest-technical-risk op in WS1. cuBLAS selects kernels by heuristic based on problem shape, and split-K decompositions change the reduction order of the K dimension — both break batch-invariance the moment batch size or sequence length shifts the chosen kernel. Matmul is also the most frequent op in the network (QKV, MLP, LM head), so drift here dominates everything downstream.
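
The failure mode can be reproduced without a GPU. The sketch below is a pure-Python toy, not cuBLAS: `pick_splits` is a hypothetical stand-in for a shape-based heuristic, and the input values are chosen so that floating-point rounding makes the two reduction orders disagree.

```python
def dot_split_k(a_row, b_col, splits):
    """Dot product whose K dimension is cut into `splits` equal chunks.

    Each chunk is accumulated left to right, then the partial sums are
    added in chunk order, which is roughly what a split-K kernel does.
    """
    k = len(a_row)
    chunk = k // splits  # assumes `splits` divides K evenly
    partials = []
    for s in range(splits):
        acc = 0.0
        for i in range(s * chunk, (s + 1) * chunk):
            acc += a_row[i] * b_col[i]
        partials.append(acc)
    total = 0.0
    for p in partials:
        total += p
    return total


def pick_splits(m):
    """Hypothetical heuristic: small M gets split-K, larger M does not."""
    return 2 if m < 4 else 1


def gemv_rows(a, b_col):
    """Multiply every row of `a` by `b_col` using the heuristic above."""
    splits = pick_splits(len(a))
    return [dot_split_k(row, b_col, splits) for row in a]


row = [1e16, 1.0, -1e16, 1.0]
ones = [1.0, 1.0, 1.0, 1.0]

alone = gemv_rows([row], ones)[0]                 # M = 1, so split-K = 2
batched = gemv_rows([row] + [ones] * 3, ones)[0]  # M = 4, so split-K = 1
print(alone, batched)
```

The same input row produces 0.0 when computed alone and 1.0 when three other rows share the batch, purely because the heuristic changed the K-reduction order.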

## Scope

Provide a deterministic, batch-invariant GEMM the forward chain can route through.

- Either implement a deterministic GEMM (fixed tiling, no split-K, fixed K-accumulation order) or integrate a DeepGEMM-style deterministic matmul.
- Guarantee the K-dimension reduction order is fixed and independent of M (the batch / token dimension), so a row's output does not change when other rows are added or removed.
- FP32 accumulation for BF16 inputs; TF32 behavior must be explicitly pinned.
- Initial target shapes (from the standard-Transformer model): QKV projection, MLP up/gate/down, and LM-head projection (or a representative reduced-vocab CI config).
- Validate against the #108 harness across the standard batch-config sweep.
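
As a reference for the numerics listed above, here is a minimal CPU sketch of BF16 inputs with FP32 accumulation and a fixed, ascending K order. It is an illustration only, not the planned kernel: `to_bf16`, `to_f32`, `gemm_bf16_fp32acc`, and the `k_tile` parameter are hypothetical names, and the rounding helpers assume finite inputs within FP32 range.

```python
import struct


def to_f32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round to BF16 (round-to-nearest-even), returned as a Python float.

    Assumes a finite input that fits in FP32; NaN/Inf are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even on low 16 bits
    bits &= 0xFFFF0000                   # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def gemm_bf16_fp32acc(a, b, k_tile=2):
    """C = A @ B with BF16 inputs and an FP32 accumulator.

    K is walked in fixed tiles of `k_tile` in ascending order, and nothing
    in the loop nest depends on M = len(a).
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for k0 in range(0, k, k_tile):
                for kk in range(k0, min(k0 + k_tile, k)):
                    # A product of two BF16 values fits exactly in FP32
                    # (barring exponent overflow/underflow).
                    prod = to_bf16(a[i][kk]) * to_bf16(b[kk][j])
                    acc = to_f32(acc + prod)  # one FP32 rounding per add
            c[i][j] = acc
    return c


a = [[1.5, -2.25, 3.0, 0.1], [0.3, 0.7, -1.1, 2.0], [4.0, 5.0, 6.0, 7.0]]
b = [[0.5, 1.0], [2.0, -1.0], [0.25, 3.0], [1.0, 1.0]]

full = gemm_bf16_fp32acc(a, b)       # M = 3
solo = gemm_bf16_fp32acc([a[0]], b)  # M = 1, same first row
print(full[0] == solo[0])
```

Because the accumulation order for each output element is a function of K and `k_tile` only, a row's result cannot change when other rows are added or removed. A real kernel has to preserve that property while tiling over M and N for performance.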

Possible implementation routes:
- A deterministic baseline GEMM for the selected WS1 shapes.
- CUTLASS with fixed tile shape, fixed epilogue, split-K disabled.
- A DeepGEMM-style deterministic matmul integration.

## Out of scope

- Squeezing peak TFLOPs / full perf tuning — correctness and invariance first; a perf pass can follow.
- FP8 GEMM (deferred; not this month).
- Full cuBLAS replacement / all possible matrix shapes.
- Distributed / tensor-parallel GEMM (WS2).

## Acceptance criteria

- For a fixed input row and fixed (N, K), the output row is bitwise-identical (or within #108 tolerance) regardless of M (the batch / token dimension) and regardless of chunked-prefill splitting.
- No split-K or heuristic kernel selection that varies with batch shape; the chosen path is pinned and documented.
- Passes the #108 shared test helper across batch=1 vs. batch=N, chunked-prefill on/off, and padding layouts.
- The GEMM backward paths (`dX` and `dW`) pass the shared gradient-invariance check from the WS1 backward-consistency issue.
- A short note documents the chosen approach (custom kernel vs DeepGEMM) and its measured overhead vs the cuBLAS baseline.
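
The bitwise criterion above could be checked along these lines. This is a sketch only and does not reflect the actual #108 helper, whose interface is not described here: `check_batch_invariance`, `rows_bitwise_equal`, and `naive_matmul` are hypothetical names, and `naive_matmul` stands in for the kernel under test.

```python
import struct


def f32_bits(x):
    """Raw 32-bit pattern of x after rounding to FP32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def rows_bitwise_equal(row_a, row_b):
    """True only if both rows have identical FP32 bit patterns.

    Stricter than ==: for example, it distinguishes +0.0 from -0.0.
    """
    return len(row_a) == len(row_b) and all(
        f32_bits(x) == f32_bits(y) for x, y in zip(row_a, row_b)
    )


def check_batch_invariance(matmul, a, b, chunk_sizes=(1, 2)):
    """Compare each row of matmul(a, b) with the same row computed alone
    (batch=1) and inside chunked-prefill style splits of the M dimension."""
    reference = matmul(a, b)
    for i, row in enumerate(a):
        if not rows_bitwise_equal(matmul([row], b)[0], reference[i]):
            return False
    for size in chunk_sizes:
        out = []
        for start in range(0, len(a), size):
            out.extend(matmul(a[start:start + size], b))
        if not all(rows_bitwise_equal(x, y) for x, y in zip(out, reference)):
            return False
    return True


def naive_matmul(a, b):
    """Stand-in for the kernel under test: fixed left-to-right K order."""
    result = []
    for row in a:
        out_row = []
        for col in zip(*b):
            acc = 0.0
            for x, y in zip(row, col):
                acc += x * y
            out_row.append(acc)
        result.append(out_row)
    return result


a = [[1.5, -2.25, 3.0], [0.3, 0.7, -1.1], [4.0, 5.0, 6.0]]
b = [[0.5, 1.0], [2.0, -1.0], [0.25, 3.0]]
print(check_batch_invariance(naive_matmul, a, b))
```

Comparing bit patterns rather than using `==` or a tolerance keeps the check honest about the "bitwise-identical" wording; a tolerance-based variant would be a separate, weaker mode.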

## Notes

- Depends on #108.
- Highest risk in WS1 — start early and surface the approach decision (custom kernel vs DeepGEMM) for review before heavy implementation.
- A slower deterministic baseline is acceptable as the first milestone.
- Shared by the LM-head projection issue (the final vocab matmul routes through here).

## Planned PRs

- [ ] Design note: deterministic GEMM approach (custom fixed-tile vs DeepGEMM integration)
- [ ] Implement / integrate deterministic GEMM (no split-K, fixed K-accumulation)
- [ ] Tests for QKV / MLP / LM-head projection shapes
- [ ] Wire one real projection (e.g. LM head) through the deterministic path
- [ ] Benchmark vs cuBLAS baseline; document overhead + supported shapes
