
[Common][PyTorch] Add a new score func sqrtsoftplus to the fused router #2633

Open
yaox12 wants to merge 7 commits into NVIDIA:main from yaox12:xiny/add_score_func

Conversation


@yaox12 (Member) commented Jan 29, 2026

Description

  • Added a new score function, sqrtsoftplus
  • Added tests
  • All tests pass

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Added apply_sqrtsoftplus_on_float and apply_sqrtsoftplus_bwd_on_float device functions in transformer_engine/common/fused_router/utils.h
  • Integrated sqrtsoftplus into the fused_topk_with_score_function and fused_score_for_moe_aux_loss kernels, the C++ bindings, and the Python interface
  • Extended expert_bias support to sqrtsoftplus (previously sigmoid only)
  • Added tests against a PyTorch reference implementation and updated the API documentation

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

yaox12 and others added 5 commits January 29, 2026 10:28
@yaox12 yaox12 self-assigned this Feb 6, 2026
@yaox12 yaox12 marked this pull request as ready for review February 6, 2026 06:41
@yaox12 yaox12 added the MoE label Feb 6, 2026

greptile-apps bot commented Feb 6, 2026

Greptile Overview

Greptile Summary

This PR adds a new score function sqrtsoftplus to the fused router, implementing sqrt(softplus(x)) = sqrt(log(1 + exp(x))) as an alternative to sigmoid and softmax.
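
For orientation, here is a minimal PyTorch sketch of the score function itself (a reference expression of sqrt(softplus(x)), not the fused kernel code):

```python
import torch
import torch.nn.functional as F

def sqrtsoftplus_ref(logits: torch.Tensor) -> torch.Tensor:
    """Reference sqrtsoftplus: sqrt(softplus(x)) = sqrt(log(1 + exp(x)))."""
    return torch.sqrt(F.softplus(logits))

# Sanity check: softplus(0) = log(2), so sqrtsoftplus(0) = sqrt(log(2)) ~= 0.833
print(sqrtsoftplus_ref(torch.zeros(3)))
```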

Key changes:

  • Added CUDA device functions apply_sqrtsoftplus_on_float and apply_sqrtsoftplus_bwd_on_float in utils.h with a numerically stable implementation matching PyTorch's Softplus(beta=1.0, threshold=20.0) (a sketch of the stability trick follows this list)
  • Integrated sqrtsoftplus into both forward and backward passes of fused_topk_with_score_function and fused_score_for_moe_aux_loss kernels
  • Updated C++ bindings to accept "sqrtsoftplus" as a valid score function (mapped to value 2)
  • Extended expert_bias support to work with sqrtsoftplus (previously only sigmoid)
  • Added comprehensive test coverage with PyTorch reference implementation
  • Updated API documentation across all layers
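
The sketch below illustrates the stability trick mentioned above, mirroring torch.nn.Softplus(beta=1.0, threshold=20.0), which falls back to the identity once the input exceeds the threshold. This is an illustrative PyTorch reimplementation under that assumption, not the kernel code:

```python
import torch

def sqrtsoftplus_stable(x: torch.Tensor, threshold: float = 20.0) -> torch.Tensor:
    """Numerically stable sqrt(softplus(x)).

    For x > threshold, softplus(x) ~= x, so we skip the exp and avoid overflow;
    otherwise use log1p(exp(x)), clamping the input so the unselected branch of
    torch.where cannot overflow either.
    """
    softplus = torch.where(x > threshold, x, torch.log1p(torch.exp(x.clamp(max=threshold))))
    return torch.sqrt(softplus)
```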

Implementation details:

  • Forward: Stores original logits in intermediate_output (needed for backward sigmoid computation)
  • Backward: Computes the gradient as sigmoid(x) / (2*y), where y is the sqrtsoftplus output (see the sketch after this list)
  • Properly handles the normalization backward pass when topk > 1
  • Maintains consistency with existing sigmoid/softmax patterns
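
The backward formula follows from the chain rule: d/dx sqrt(softplus(x)) = softplus'(x) / (2*sqrt(softplus(x))) = sigmoid(x) / (2*y). A small, illustrative PyTorch cross-check of that formula against autograd:

```python
import torch
import torch.nn.functional as F

def sqrtsoftplus_bwd_ref(grad_out: torch.Tensor, logits: torch.Tensor) -> torch.Tensor:
    """Reference backward: grad_in = grad_out * sigmoid(x) / (2 * sqrt(softplus(x)))."""
    y = torch.sqrt(F.softplus(logits))
    return grad_out * torch.sigmoid(logits) / (2.0 * y)

x = torch.randn(16, dtype=torch.float64, requires_grad=True)
grad_out = torch.randn(16, dtype=torch.float64)
torch.sqrt(F.softplus(x)).backward(grad_out)
assert torch.allclose(x.grad, sqrtsoftplus_bwd_ref(grad_out, x.detach()), atol=1e-9)
```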

The implementation is mathematically sound, properly tested, and maintains backward compatibility.

Confidence Score: 5/5

  • This PR is safe to merge with no blocking issues
  • The implementation is mathematically sound, well-tested, and follows established patterns. The sqrtsoftplus forward and backward passes are correctly implemented with proper numerical stability. All integration points (CUDA kernels, C++ bindings, Python interface) are properly updated. Comprehensive test coverage validates correctness against PyTorch reference implementation.
  • No files require special attention

Important Files Changed

  • transformer_engine/common/fused_router/utils.h: Added sqrtsoftplus forward and backward device functions with proper numerical stability
  • transformer_engine/common/fused_router/fused_topk_with_score_function.cu: Integrated sqrtsoftplus score function with proper forward/backward passes and expert bias support
  • transformer_engine/common/fused_router/fused_score_for_moe_aux_loss.cu: Added sqrtsoftplus to aux loss computation with normalization backward (check gradient computation)
  • tests/pytorch/test_fused_router.py: Added comprehensive test coverage for sqrtsoftplus with PyTorch reference implementation

Sequence Diagram

sequenceDiagram
    participant User
    participant PyTorch as PyTorch Layer
    participant Router as router.py
    participant CPP as router.cpp
    participant CUDA as CUDA Kernels
    
    User->>PyTorch: Forward pass with logits
    PyTorch->>Router: fused_topk_with_score_function(logits, score_function="sqrtsoftplus")
    Router->>CPP: fused_topk_with_score_function_fwd(logits, score_function="sqrtsoftplus")
    CPP->>CPP: Validate score_function in {softmax, sigmoid, sqrtsoftplus}
    CPP->>CPP: Map "sqrtsoftplus" -> score_function_value=2
    CPP->>CUDA: nvte_fused_topk_with_score_function_forward(score_function=2)
    CUDA->>CUDA: Load logits to shared memory
    CUDA->>CUDA: apply_sqrtsoftplus_on_float: y = sqrt(log(1 + exp(x)))
    CUDA->>CUDA: Add expert_bias if provided
    CUDA->>CUDA: Perform topk selection
    CUDA->>CUDA: Revert expert_bias from topk scores
    CUDA->>CUDA: Normalize: probs = scores / sum(scores) if topk > 1
    CUDA-->>CPP: Return probs, routing_map, intermediate_output
    CPP-->>Router: Return tensors
    Router-->>PyTorch: Return probs, routing_map
    
    User->>PyTorch: Backward pass with grad_probs
    PyTorch->>Router: backward(grad_probs)
    Router->>CPP: fused_topk_with_score_function_bwd(grad_probs, score_function="sqrtsoftplus")
    CPP->>CUDA: nvte_fused_topk_with_score_function_backward(score_function=2)
    CUDA->>CUDA: Load grad_probs and intermediate_output
    CUDA->>CUDA: Backward through normalization (if topk > 1)
    CUDA->>CUDA: Compute sqrtsoftplus output from saved logits
    CUDA->>CUDA: apply_sqrtsoftplus_bwd_on_float: grad *= sigmoid(x) / (2*y)
    CUDA-->>CPP: Return grad_logits
    CPP-->>Router: Return grad tensor
    Router-->>PyTorch: Return grad_logits
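
The forward path in the diagram can be emulated with plain PyTorch ops. The helper below is a hypothetical reference that follows the diagram's steps (bias applied only for top-k selection, raw sqrtsoftplus scores normalized afterwards); it is not the fused CUDA implementation:

```python
import torch
import torch.nn.functional as F

def topk_sqrtsoftplus_ref(logits, topk, expert_bias=None):
    """Hypothetical reference for the forward flow shown in the diagram."""
    scores = torch.sqrt(F.softplus(logits))              # apply_sqrtsoftplus_on_float
    biased = scores + expert_bias if expert_bias is not None else scores
    _, idx = torch.topk(biased, k=topk, dim=-1)          # top-k selection on biased scores
    routing_map = torch.zeros_like(scores, dtype=torch.bool).scatter_(-1, idx, True)
    probs = torch.where(routing_map, scores, torch.zeros_like(scores))  # bias reverted
    if topk > 1:
        probs = probs / probs.sum(dim=-1, keepdim=True)  # normalize over selected experts
    return probs, routing_map

# Example: 2 tokens routed to 2 of 8 experts
probs, routing_map = topk_sqrtsoftplus_ref(torch.randn(2, 8), topk=2)
```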


@greptile-apps bot left a comment

7 files reviewed, 2 comments



greptile-apps bot commented Feb 6, 2026

Additional Comments (1)

transformer_engine/common/include/transformer_engine/fused_router.h
Expert bias API mismatch

The header still documents expert_bias as "Only used at the sigmoid case" (line 27), but this PR enables expert_bias for sqrtsoftplus end-to-end (kernels handle score_function==2, and router.cpp allows it). Please update the API docs here (and any other public docs) to reflect that expert_bias is supported for sqrtsoftplus too, otherwise external callers will get contradictory guidance.

yaox12 and others added 2 commits February 6, 2026 06:55

@greptile-apps bot left a comment


4 files reviewed, 1 comment

