Add Laguna-XS.2 contrib model by jimburtoft · Pull Request #158 · aws-neuron/neuronx-distributed-inference

jimburtoft · 2026-05-07T14:48:28Z

Note: The below template includes items meant for model contributions only. For other contributions such as bug fixes, features, etc., only fill out the relevant portions of the form.

Description

NxDI implementation of poolside/Laguna-XS.2, a 33B-parameter / 3B-active Mixture-of-Experts decoder model designed for agentic coding tasks. Runs on trn2.3xlarge with TP=4 (SDK 2.29).

Key architectural novelties implemented:

- Softplus attention gating: per-head output gating via F.softplus(g_proj(hidden_states))
- Variable GQA heads: 48 Q-heads (full-attention) vs 64 Q-heads (SWA), with KV=8 constant
- Mixed SWA/full attention: 10 full-attention + 30 sliding-window layers
- Dual RoPE: YaRN (factor=32, 131K context) for full-attention layers, default RoPE for SWA layers
- Sigmoid MoE routing with e_score_correction_bias and L1 normalization
- MoE fused TKG NKI kernel (sigmoid routing natively supported)
- CTE NKI flash attention kernel enabled
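As a rough illustration of the softplus attention gating item, the sketch below applies `softplus(g_proj(hidden_states))` as a per-head scale on the attention output. It is plain Python rather than the NxDI code in this PR; the shape of the gate projection (one logit per head) and the point where the gate is applied (on each head's attention output) are assumptions, and `gate_attention_output` is a hypothetical helper name.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable softplus: log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def gate_attention_output(hidden, g_proj_weight, attn_out):
    """Scale each head's attention output by softplus(g_proj(hidden)).

    hidden:        [d_model] hidden state of one token
    g_proj_weight: [num_heads][d_model], one gate logit per head (assumed shape)
    attn_out:      [num_heads][head_dim] per-head attention output
    """
    gated = []
    for head_w, head_out in zip(g_proj_weight, attn_out):
        # Gate logit for this head: dot product of its g_proj row with hidden.
        gate_logit = sum(w * h for w, h in zip(head_w, hidden))
        gate = softplus(gate_logit)  # always > 0, unbounded above
        gated.append([gate * v for v in head_out])
    return gated


# Toy example: 2 heads, d_model = 2, head_dim = 2.
hidden = [0.5, -1.0]
g_proj_weight = [[0.0, 0.0], [1.0, -1.0]]
attn_out = [[1.0, 2.0], [3.0, 4.0]]
gated = gate_attention_output(hidden, g_proj_weight, attn_out)
```

Unlike a sigmoid gate, softplus is not capped at 1, so a head can be amplified as well as attenuated.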

Model Information

Model Name: Laguna-XS.2

Model Architecture: Decoder-only transformer (MoE, 256 routed experts + 1 shared expert, top-8 routing)

Purpose: Code generation and agentic coding tasks
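The top-8 sigmoid routing with `e_score_correction_bias` and L1 normalization mentioned in the Description can be sketched as follows. This is a minimal pure-Python illustration, not the code in `modeling_laguna.py`. It assumes the correction bias only influences which experts are selected, while the final weights come from the unbiased sigmoid scores; the PR does not state this, so treat it as an assumption. `route_token` is a hypothetical name.

```python
import math


def route_token(router_logits, correction_bias, top_k):
    """Select top_k experts for one token and return (expert_ids, weights)."""
    # Sigmoid scores: each expert is scored independently (no softmax coupling).
    scores = [1.0 / (1.0 + math.exp(-z)) for z in router_logits]
    # Assumption: the bias is added only for expert selection.
    biased = [s + b for s, b in zip(scores, correction_bias)]
    expert_ids = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]
    # L1 normalization: selected (unbiased) scores are rescaled to sum to 1.
    total = sum(scores[i] for i in expert_ids)
    weights = [scores[i] / total for i in expert_ids]
    return expert_ids, weights


# Toy example with 4 experts and top-2 routing (the model uses 256 and top-8).
logits = [2.0, 0.0, -1.0, 1.0]
bias = [0.0, 0.5, 0.0, 0.0]
ids, weights = route_token(logits, bias, top_k=2)
```

In the toy example the bias promotes expert 1 into the selected set, but its weight is still derived from its unbiased score, so expert 0 keeps the larger share.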

Checklist

Required Components

- Accuracy Test (test/integration/test_logit_validation.py)
  - Uses NxDI logit_validation() framework
  - Validates CTE (context encoding) and TKG (token generation) modes
  - Compares against pre-computed CPU reference logits
- README.md with the following sections:
  - Usage Example: Complete code showing model loading, compilation, and generation
  - Compatibility Matrix: Tested on trn2.3xlarge with SDK 2.29
  - Example Checkpoints: Link to poolside/Laguna-XS.2 on HuggingFace
  - Testing Instructions: Full commands for running tests
- Source Code (src/)
  - src/__init__.py exports NeuronLagunaForCausalLM and LagunaInferenceConfig
  - src/modeling_laguna.py: full NxDI implementation (~1266 lines)

Optional Components

Unit Tests (CPU or Neuron-based)

Folder Structure

Confirm your contribution follows this structure:

/contrib/models/Laguna-XS.2/
  README.md
  /src
    __init__.py
    modeling_laguna.py
  /test
    __init__.py
    /integration
      __init__.py
      test_laguna.py
      test_logit_validation.py
      benchmark_batch.py
      benchmark_workloads.py

Testing

How did you test this change?

Tested on trn2.3xlarge (LNC=2, 4 NeuronCores) with Neuron SDK 2.29. Tests include:

- test_laguna.py: end-to-end compilation + generation
- test_logit_validation.py: logit validation against CPU reference (CTE + TKG modes)
- benchmark_batch.py / benchmark_workloads.py: throughput benchmarks across batch sizes

Test Results:

| Test | Result |
| --- | --- |
| `test_laguna.py` (compile + generate) | PASS |
| `test_logit_validation.py --cte-only` | PASS (top-5 tolerance 0.01) |
| `test_logit_validation.py` (full) | PASS (32 tokens, CTE + TKG) |

Benchmark Results:

| Batch Size | Sequence Length | Throughput (tok/s) | TPOT (ms) |
| --- | --- | --- | --- |
| 1 | 8192 | 91 | 11.0 |
| 4 | 4096 | 223 | 4.5 |
| 8 | 2048 | 310 | 3.2 |
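The TPOT column appears to be the reciprocal of the aggregate throughput (1000 / tok/s), that is, time per output token across the whole batch rather than per sequence. That reading is an inference from the numbers, not something the PR states. The short check below reproduces the reported values under that assumption and also derives the per-sequence inter-token time it would imply.

```python
# (batch_size, seq_len, throughput_tok_s, reported_tpot_ms) from the benchmark table
benchmarks = [
    (1, 8192, 91, 11.0),
    (4, 4096, 223, 4.5),
    (8, 2048, 310, 3.2),
]


def tpot_ms(throughput_tok_s: float) -> float:
    """Aggregate time per output token in milliseconds."""
    return 1000.0 / throughput_tok_s


rows = []
for batch_size, seq_len, tok_s, reported in benchmarks:
    derived = tpot_ms(tok_s)
    # Under this reading, each sequence receives a token every batch_size * TPOT ms.
    per_sequence_ms = derived * batch_size
    rows.append((batch_size, round(derived, 1), reported, round(per_sequence_ms, 1)))
    print(f"BS={batch_size}: derived {derived:.1f} ms, reported {reported} ms, "
          f"per-sequence {per_sequence_ms:.1f} ms")
```

If TPOT was instead measured per sequence, the larger-batch rows would need to be interpreted differently, so it is worth confirming against the benchmark scripts.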

Compatibility

Tested with:

Neuron SDK Version(s): 2.29 (neuronx-cc 2.24, NxDI 0.9.17334)
Instance Type(s): trn2.3xlarge (TP=4, LNC=2)
PyTorch Version: 2.9.0
Python Version: 3.12

Additional Information

- The model weights require ~66.6 GB in BF16. This fits on trn2.3xlarge with TP=4 (96 GB HBM), leaving headroom for the KV cache.
- The largest single CTE bucket is 8192 tokens; buckets of 16K and above hit a compiler instruction limit.
- Softplus attention gating is not fused into the TKG mega-kernel; it runs as a separate operation per layer.
- Reference logit generation requires transformers >= 5.7.0 because the model uses custom modeling code (trust_remote_code=True).
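A back-of-the-envelope check of the memory figures above, as a sketch only. It assumes 1 GB = 1e9 bytes, that weights shard evenly across the 4 tensor-parallel ranks, and that the 96 GB of HBM is split evenly between them; real per-rank usage will differ because of replicated tensors, compiler scratch space, and the KV cache.

```python
weights_gb = 66.6        # BF16 weight footprint reported in the PR
hbm_total_gb = 96.0      # trn2.3xlarge HBM reported in the PR
tp_degree = 4            # tensor-parallel degree used in the PR
bytes_per_param = 2      # BF16

# Implied parameter count in billions (consistent with the "33B" headline).
params_billion = weights_gb / bytes_per_param

# Even-sharding assumption: each rank holds 1/TP of the weights and of the HBM.
weights_per_rank_gb = weights_gb / tp_degree
hbm_per_rank_gb = hbm_total_gb / tp_degree
headroom_per_rank_gb = hbm_per_rank_gb - weights_per_rank_gb

print(f"~{params_billion:.1f}B params, {weights_per_rank_gb:.2f} GB weights per rank, "
      f"{headroom_per_rank_gb:.2f} GB headroom per rank")
```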

Related Issues

None.

vLLM Integration

This model/feature is intended for use with vLLM
Documentation includes vLLM registration instructions

By submitting this PR, I confirm that:

I have read and followed the contributing guidelines
This is a community contribution and may have limited testing compared to officially-supported models
The code follows best practices and is well-documented
All required components listed above are included

NxDI implementation of PerceptronAI/Isaac-0.2-2B-Preview VLM:
- Qwen3 text backbone with SigLIP2 vision encoder
- 2-layer MLP projector with pixel shuffle (64 vision tokens/image)
- Supports TP=1/2/4, seq_len up to 8192
- 110.7 tok/s text-only, 108.7 tok/s image+text on trn2.3xlarge
- 9.0 ms TPOT at seq_len=1024
- BF16, CTE flash attention enabled
- Validated: cosine 0.9999+ vs CPU reference across all configs

- vLLM-neuron integration with 3-file patch (text-only working, ~78 tok/s)
- GPU comparative benchmark: L40S at 52 tok/s vs trn2 at 111 tok/s (2.13x speedup)
- modular_isaac.py perceptron import fix (nuke_perceptron_import.py)
- execute_model override for logits-to-token-ID conversion
- Known limitation: image+text via vLLM not yet supported (pixel_values format mismatch)

The previous benchmark used enforce_eager=True, which handicapped the GPU to 52 tok/s. With CUDA graphs + torch.compile + FlashAttention v2, the L40S achieves 174 tok/s. The GPU is about 1.5x faster than a single NeuronCore, but trn2 with DP=4 is about 2.5x faster at the device level.

NxDI implementation of poolside/Laguna-XS.2 (33B total / 3B active MoE) for agentic coding on trn2.3xlarge (TP=4, SDK 2.29).

Architecture features:
- Softplus attention gating (per-head output gating)
- Variable GQA heads (48 for full-attn, 64 for SWA)
- Mixed SWA/full attention with per-layer dispatch
- Dual RoPE (YaRN for full-attn, default for SWA)
- Sigmoid MoE routing with e_score_correction_bias
- MoE fused TKG NKI kernel (sigmoid routing)
- CTE NKI flash attention kernel

Performance: BS=1/8K=91 tok/s, BS=4/4K=223 tok/s, BS=8/2K=310 tok/s

Accuracy: logit_validation() passes (CTE + TKG modes)

- Add vLLM serving support (serve_laguna.py, start-vllm-server.sh)
- Fix TKG attention mask padding for vLLM continuous-batching mode
- Requires pre-sharded weights for trn2.3xlarge (128GB host RAM)

Implement batch-folding approach to work around the attention_block_tkg kernel's kv_heads=1 limitation. Folds kv_heads into the batch dimension so the kernel sees (B*kv_heads, q_heads_per_kv) instead of (B, q_heads).

Changes:
- Add attention_block_tokengen_nki_kernel override with mask reshaping from (S_ctx, B, q_heads, S_tkg) to (S_ctx, B*kv_heads, q_per_kv, S_tkg)
- Unfold batch-folded output back to standard shape after kernel returns
- Add test_mega_kernel.py integration test (compile, accuracy, perf)
- Requires companion nki-library patch (multi-KV-head GQA in attention_block_tkg.py)

Performance: 89.9 tok/s at BS=1/2K (comparable to non-mega-kernel path). The mega-kernel fuses RMSNorm+QKV+RoPE+Attention, eliminating HBM round-trips between these ops.
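The batch-folding idea in this commit message can be illustrated with plain index arithmetic. The sketch below is not the kernel override from the PR. It assumes Q heads are laid out contiguously per KV head (so Q head `q` belongs to KV head `q // q_per_kv`), and the helper names and the toy sizes are hypothetical.

```python
def fold_index(b, q_head, kv_heads, q_per_kv):
    """Map (batch, q_head) to (folded_batch, q_within_kv_group)."""
    kv_head = q_head // q_per_kv  # assumes contiguous grouping of Q heads per KV head
    return b * kv_heads + kv_head, q_head % q_per_kv


def unfold_index(folded_b, q_in_group, kv_heads, q_per_kv):
    """Inverse of fold_index: recover (batch, q_head)."""
    b, kv_head = divmod(folded_b, kv_heads)
    return b, kv_head * q_per_kv + q_in_group


def fold_mask_shape(shape, kv_heads):
    """(S_ctx, B, q_heads, S_tkg) -> (S_ctx, B*kv_heads, q_per_kv, S_tkg)."""
    s_ctx, batch, q_heads, s_tkg = shape
    if q_heads % kv_heads != 0:
        raise ValueError("q_heads must be divisible by kv_heads")
    return (s_ctx, batch * kv_heads, q_heads // kv_heads, s_tkg)


# Illustrative sizes only: 48 Q heads and 8 KV heads, i.e. 6 Q heads per KV head.
folded_shape = fold_mask_shape((2048, 1, 48, 1), kv_heads=8)
folded = fold_index(b=0, q_head=13, kv_heads=8, q_per_kv=6)
restored = unfold_index(*folded, kv_heads=8, q_per_kv=6)
```

With this layout the fold is equivalent to viewing the head axis as (kv_heads, q_per_kv) and merging kv_heads into the batch axis, so each folded batch entry looks like a single-KV-head problem to the kernel.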

tejasamx-aws

The unit test dir is missing; should we add this eventually? Also, please squash commits for a cleaner history in future.

tejasamx-aws · 2026-05-10T20:54:03Z

+        #
+        # Default tol_map: K5=0.01, K50=0.02, K1000=0.03, All=0.05
+        # Laguna requires ~20x relaxation for K1000/All due to MoE routing.
+        moe_tol_map = {


The 0.40-0.50 tolerances do make sense for a 256-expert sigmoid MoE, but they're 8-50x more relaxed than for typical models. It could be worth adding a supplementary assertion: assert top1_match_rate >= 0.85
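One possible shape for the suggested supplementary check, as a sketch: compute the fraction of positions where the reference and device logits agree on the argmax token, then assert on it. The function names and the toy data are hypothetical, and the inputs are plain nested lists rather than the tensors the real test would use.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)


def top1_match_rate(reference_logits, test_logits):
    """Fraction of positions whose top-1 token matches between two logit sets."""
    if len(reference_logits) != len(test_logits) or not reference_logits:
        raise ValueError("logit sequences must be non-empty and equally long")
    matches = sum(argmax(r) == argmax(t) for r, t in zip(reference_logits, test_logits))
    return matches / len(reference_logits)


# Toy data: 4 positions over a 3-token vocabulary; the last position disagrees.
reference = [[0.1, 2.0, 0.3], [1.5, 0.2, 0.1], [0.0, 0.1, 3.0], [0.9, 0.8, 0.1]]
device = [[0.2, 1.8, 0.4], [1.4, 0.3, 0.2], [0.1, 0.0, 2.7], [0.7, 0.9, 0.1]]
rate = top1_match_rate(reference, device)
passes_threshold = rate >= 0.85  # threshold suggested in the review comment
```

In a real test, the last line would become `assert top1_match_rate(...) >= 0.85` alongside the relaxed tolerance map, so a routing regression that flips many top-1 tokens still fails even when the loose tolerances pass.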

tejasamx-aws · 2026-05-10T20:58:56Z

+def load_reference_logits():
+    """Load pre-generated CPU reference logits."""
+    print(f"  Loading reference logits from {REFERENCE_PATH}...")
+    data = torch.load(REFERENCE_PATH, weights_only=False)


nit: why is weights_only false here?

jimburtoft added 6 commits April 30, 2026 17:31

Add vLLM integration and TKG mask fix for Laguna-XS.2

9e254d9

- Add vLLM serving support (serve_laguna.py, start-vllm-server.sh)
- Fix TKG attention mask padding for vLLM continuous-batching mode
- Requires pre-sharded weights for trn2.3xlarge (128GB host RAM)

tejasamx-aws approved these changes May 10, 2026

Add Laguna-XS.2 contrib model #158
jimburtoft wants to merge 6 commits into aws-neuron:main from jimburtoft:contrib/laguna-xs2

jimburtoft commented May 7, 2026 (edited by petesraj-aws)
