Add Laguna-XS.2 contrib model#158

Open: jimburtoft wants to merge 6 commits into aws-neuron:main from jimburtoft:contrib/laguna-xs2

@jimburtoft (Contributor) commented May 7, 2026

Note: The below template includes items meant for model contributions only. For other contributions such as bug fixes, features, etc., only fill out the relevant portions of the form.

Description

NxDI implementation of poolside/Laguna-XS.2, a 33B-parameter / 3B-active Mixture-of-Experts decoder model designed for agentic coding tasks. Runs on trn2.3xlarge with TP=4 (SDK 2.29).

Key architectural novelties implemented:

  • Softplus attention gating — per-head output gating via F.softplus(g_proj(hidden_states))
  • Variable GQA heads — 48 Q-heads (full-attention) vs 64 Q-heads (SWA), KV=8 constant
  • Mixed SWA/full attention — 10 full-attention + 30 sliding-window layers
  • Dual RoPE — YaRN (factor=32, 131K context) for full-attn, default for SWA
  • Sigmoid MoE routing with e_score_correction_bias and L1 normalization
  • MoE fused TKG NKI kernel (sigmoid routing natively supported)
  • CTE NKI flash attention kernel enabled
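
The sigmoid routing item in the list above can be illustrated with a minimal pure-Python sketch. It is not taken from `modeling_laguna.py`: the function name `route_token` is hypothetical, and it assumes that `e_score_correction_bias` influences only which experts are selected, while the mixing weights come from the unbiased sigmoid scores, L1-normalized over the selected experts. The toy sizes (6 experts, top-2) stand in for the model's 256 routed experts and top-8 routing.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def route_token(router_logits, correction_bias, top_k):
    """Pick top_k experts for one token and return (expert_ids, weights).

    Assumption (not confirmed by the PR): the correction bias is added
    only for expert selection; weights use the unbiased sigmoid scores.
    """
    scores = [sigmoid(z) for z in router_logits]
    biased = [s + b for s, b in zip(scores, correction_bias)]
    # Select the top_k experts by biased score.
    expert_ids = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]
    # L1 normalization: divide by the sum of the selected (non-negative) scores.
    total = sum(scores[i] for i in expert_ids)
    weights = [scores[i] / total for i in expert_ids]
    return expert_ids, weights


# Toy example: 6 experts, top-2. Expert 4 is selected only because of its bias.
logits = [0.0, 2.0, -1.0, 1.0, 0.5, -2.0]
bias = [0.0, 0.0, 0.0, 0.0, 0.5, 0.0]
ids, weights = route_token(logits, bias, top_k=2)
print(ids, [round(w, 4) for w in weights])
```

Because sigmoid scores are independent per expert (unlike softmax), the explicit L1 step is what makes the selected weights sum to 1.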

Model Information

Model Name: Laguna-XS.2

Model Architecture: Decoder-only transformer (MoE, 256 routed experts + 1 shared expert, top-8 routing)

Purpose: Code generation and agentic coding tasks

Checklist

Required Components

  • Accuracy Test (test/integration/test_logit_validation.py)

    • Uses NxDI logit_validation() framework
    • Validates CTE (context encoding) and TKG (token generation) modes
    • Compares against pre-computed CPU reference logits
  • README.md with the following sections:

    • Usage Example: Complete code showing model loading, compilation, and generation
    • Compatibility Matrix: Tested on trn2.3xlarge with SDK 2.29
    • Example Checkpoints: Link to poolside/Laguna-XS.2 on HuggingFace
    • Testing Instructions: Full commands for running tests
  • Source Code (src/)

    • src/__init__.py exports NeuronLagunaForCausalLM and LagunaInferenceConfig
    • src/modeling_laguna.py — full NxDI implementation (~1266 lines)

Optional Components

  • Unit Tests (CPU or Neuron-based)

Folder Structure

Confirm your contribution follows this structure:

/contrib/models/Laguna-XS.2/
  README.md
  /src
    __init__.py
    modeling_laguna.py
  /test
    __init__.py
    /integration
      __init__.py
      test_laguna.py
      test_logit_validation.py
      benchmark_batch.py
      benchmark_workloads.py

Testing

How did you test this change?

Tested on trn2.3xlarge (LNC=2, 4 NeuronCores) with Neuron SDK 2.29. Tests include:

  1. test_laguna.py — end-to-end compilation + generation
  2. test_logit_validation.py — logit validation against CPU reference (CTE + TKG modes)
  3. benchmark_batch.py / benchmark_workloads.py — throughput benchmarks across batch sizes

Test Results:

| Test | Result |
|---|---|
| test_laguna.py (compile + generate) | PASS |
| test_logit_validation.py --cte-only | PASS (top-5 tolerance 0.01) |
| test_logit_validation.py (full) | PASS (32 tokens, CTE + TKG) |

Benchmark Results:

| Batch Size | Sequence Length | Throughput (tok/s) | TPOT (ms) |
|---|---|---|---|
| 1 | 8192 | 91 | 11.0 |
| 4 | 4096 | 223 | 4.5 |
| 8 | 2048 | 310 | 3.2 |
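
A quick arithmetic check on the benchmark numbers: the reported TPOT values appear to equal 1000 / throughput, which would make them aggregate milliseconds per generated token across the whole batch, not per-sequence step latency. This is an inference from the figures, not something the PR states; the per-sequence estimate below additionally assumes every sequence in the batch emits one token per decode step.

```python
# (batch size, sequence length, throughput in tok/s, reported TPOT in ms)
rows = [(1, 8192, 91, 11.0), (4, 4096, 223, 4.5), (8, 2048, 310, 3.2)]


def aggregate_ms_per_token(throughput_tok_s: float) -> float:
    """Milliseconds per generated token, counted across the whole batch."""
    return 1000.0 / throughput_tok_s


derived = [round(aggregate_ms_per_token(tput), 1) for _, _, tput, _ in rows]
reported = [tpot for _, _, _, tpot in rows]

# Assumed per-sequence step time: each decode step emits one token per sequence.
per_sequence_ms = [round(bs * aggregate_ms_per_token(tput), 1) for bs, _, tput, _ in rows]

print(derived, reported, per_sequence_ms)
```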

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.29 (neuronx-cc 2.24, NxDI 0.9.17334)
  • Instance Type(s): trn2.3xlarge (TP=4, LNC=2)
  • PyTorch Version: 2.9.0
  • Python Version: 3.12

Additional Information

  • Model requires ~66.6 GB in BF16. Fits trn2.3xlarge TP=4 (96 GB HBM) with headroom for KV cache.
  • Maximum single-bucket CTE is 8192 tokens (the compiler hits its instruction limit at 16K and above).
  • Softplus attention gating is not fused into the TKG mega-kernel; it runs as a separate operation per layer.
  • Reference logit generation requires transformers >= 5.7.0 due to custom modeling code (trust_remote_code=True).
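
The memory note above can be sanity-checked with a few lines of arithmetic. This is only a sketch: it assumes GB means 10^9 bytes, that weights shard evenly across the four tensor-parallel ranks, and that HBM is split evenly per rank. None of these is stated in the PR, and replicated tensors would shift the per-rank numbers.

```python
BYTES_PER_BF16 = 2

model_gb = 66.6   # BF16 footprint quoted in the PR
hbm_gb = 96.0     # trn2.3xlarge HBM quoted in the PR
tp_degree = 4

# Parameter count implied by the BF16 footprint (in billions).
implied_params_billion = model_gb / BYTES_PER_BF16

# Device-level headroom left for KV cache, activations, and runtime buffers.
headroom_gb = hbm_gb - model_gb

# Per-rank view, assuming perfectly even sharding (an idealization).
weights_per_rank_gb = model_gb / tp_degree
headroom_per_rank_gb = hbm_gb / tp_degree - weights_per_rank_gb

print(implied_params_billion, headroom_gb, weights_per_rank_gb, headroom_per_rank_gb)
```

Under these assumptions the footprint corresponds to roughly 33.3B parameters, consistent with the 33B figure in the description.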

Related Issues

None.

vLLM Integration

  • This model/feature is intended for use with vLLM
  • Documentation includes vLLM registration instructions

By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

NxDI implementation of PerceptronAI/Isaac-0.2-2B-Preview VLM:
- Qwen3 text backbone with SigLIP2 vision encoder
- 2-layer MLP projector with pixel shuffle (64 vision tokens/image)
- Supports TP=1/2/4, seq_len up to 8192
- 110.7 tok/s text-only, 108.7 tok/s image+text on trn2.3xlarge
- 9.0ms TPOT at seq_len=1024
- BF16, CTE flash attention enabled
- Validated: cosine 0.9999+ vs CPU reference across all configs

- vLLM-neuron integration with 3-file patch (text-only working, ~78 tok/s)
- GPU comparative benchmark: L40S at 52 tok/s vs trn2 at 111 tok/s (2.13x speedup)
- modular_isaac.py perceptron import fix (nuke_perceptron_import.py)
- execute_model override for logits-to-token-ID conversion
- Known limitation: image+text via vLLM not yet supported (pixel_values format mismatch)

Previous benchmark used enforce_eager=True which handicapped GPU to 52 tok/s.
With CUDA graphs + torch.compile + FlashAttention v2, L40S achieves 174 tok/s.
GPU is 1.5x faster per-core than single NeuronCore, but trn2 DP=4 is 2.5x faster at device level.

NxDI implementation of poolside/Laguna-XS.2 (33B total / 3B active MoE)
for agentic coding on trn2.3xlarge (TP=4, SDK 2.29).

Architecture features:
- Softplus attention gating (per-head output gating)
- Variable GQA heads (48 for full-attn, 64 for SWA)
- Mixed SWA/full attention with per-layer dispatch
- Dual RoPE (YaRN for full-attn, default for SWA)
- Sigmoid MoE routing with e_score_correction_bias
- MoE fused TKG NKI kernel (sigmoid routing)
- CTE NKI flash attention kernel
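
As an illustration of the softplus attention gating listed above, here is a minimal pure-Python sketch for a single token. The PR describes the gate as `F.softplus(g_proj(hidden_states))` applied per head; the helper name `gate_heads`, the bias-free projection, and the choice to multiply the gate into each head's attention output before the output projection are assumptions made for this example.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def gate_heads(hidden, g_proj_weight, head_outputs):
    """Scale each head's attention output by a softplus gate.

    hidden:        [hidden_size] activations for one token
    g_proj_weight: [num_heads][hidden_size] (hypothetical, no bias)
    head_outputs:  [num_heads][head_dim] attention output per head
    """
    gates = [softplus(sum(w * h for w, h in zip(row, hidden))) for row in g_proj_weight]
    return [[g * v for v in head] for g, head in zip(gates, head_outputs)]


hidden = [1.0, -2.0]
g_proj_weight = [[0.0, 0.0], [1.0, 0.0]]   # pre-activations: 0.0 and 1.0
head_outputs = [[1.0, 2.0], [3.0, 4.0]]
gated = gate_heads(hidden, g_proj_weight, head_outputs)
print(gated)
```

Unlike a sigmoid gate, softplus is unbounded above, so a head can be amplified as well as attenuated.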

Performance: BS=1/8K=91 tok/s, BS=4/4K=223 tok/s, BS=8/2K=310 tok/s
Accuracy: logit_validation() passes (CTE + TKG modes)

- Add vLLM serving support (serve_laguna.py, start-vllm-server.sh)
- Fix TKG attention mask padding for vLLM continuous-batching mode
- Requires pre-sharded weights for trn2.3xlarge (128GB host RAM)

Implement batch-folding approach to work around the attention_block_tkg
kernel's kv_heads=1 limitation. Folds kv_heads into the batch dimension
so the kernel sees (B*kv_heads, q_heads_per_kv) instead of (B, q_heads).

Changes:
- Add attention_block_tokengen_nki_kernel override with mask reshaping
  from (S_ctx, B, q_heads, S_tkg) to (S_ctx, B*kv_heads, q_per_kv, S_tkg)
- Unfold batch-folded output back to standard shape after kernel returns
- Add test_mega_kernel.py integration test (compile, accuracy, perf)
- Requires companion nki-library patch (multi-KV-head GQA in
  attention_block_tkg.py)
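
The batch-folding idea described in this commit can be sketched on plain nested lists. This shows only the index bookkeeping and is not the kernel override itself: the helper names are hypothetical, and the sketch assumes the Q heads that share a KV head are contiguous along the head axis.

```python
def fold_index(b, q_head, kv_heads, q_per_kv):
    """Map (batch, q_head) to (folded_batch, q_head_within_group)."""
    kv = q_head // q_per_kv
    return b * kv_heads + kv, q_head % q_per_kv


def fold(x, kv_heads):
    """[B][q_heads] -> [B * kv_heads][q_per_kv]."""
    q_per_kv = len(x[0]) // kv_heads
    return [row[kv * q_per_kv:(kv + 1) * q_per_kv] for row in x for kv in range(kv_heads)]


def unfold(y, kv_heads):
    """[B * kv_heads][q_per_kv] -> [B][q_heads], the inverse of fold."""
    batch = len(y) // kv_heads
    return [[v for kv in range(kv_heads) for v in y[b * kv_heads + kv]] for b in range(batch)]


# Toy sizes: B=2, 6 Q heads, 2 KV heads, so q_per_kv=3.
B, Q_HEADS, KV_HEADS = 2, 6, 2
x = [[(b, h) for h in range(Q_HEADS)] for b in range(B)]
folded = fold(x, KV_HEADS)

# Each folded row now holds the Q heads of exactly one KV head,
# which is what a kernel limited to kv_heads=1 would expect to see.
fb, qi = fold_index(1, 4, KV_HEADS, Q_HEADS // KV_HEADS)
print(len(folded), len(folded[0]), folded[fb][qi])
```

The same mapping would apply to the head axis of the attention mask, with the sequence axes left untouched.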

Performance: 89.9 tok/s at BS=1/2K (comparable to non-mega-kernel path).
The mega-kernel fuses RMSNorm+QKV+RoPE+Attention, eliminating HBM
round-trips between these ops.

@tejasamx-aws left a comment

The unit test directory is missing; should we add one eventually? Also, please squash commits for a cleaner history in future.

#
# Default tol_map: K5=0.01, K50=0.02, K1000=0.03, All=0.05
# Laguna requires ~20x relaxation for K1000/All due to MoE routing.
moe_tol_map = {

The 0.40-0.50 tolerances do make sense for a 256-expert sigmoid MoE, but they're 8-50x more relaxed than for typical models. It could be worth adding a supplementary assertion: assert top1_match_rate >= 0.85
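
A minimal sketch of the supplementary check suggested in this comment, written in pure Python so it runs without the test harness. Only the name `top1_match_rate` and the 0.85 threshold come from the comment; the helper functions and the toy logits are hypothetical, and a real test would pull the Neuron and CPU reference logits from the existing validation flow.

```python
MIN_TOP1_MATCH_RATE = 0.85  # threshold proposed in the review comment


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def compute_top1_match_rate(neuron_logits, reference_logits):
    """Fraction of positions where both logit rows pick the same token."""
    if not neuron_logits or len(neuron_logits) != len(reference_logits):
        raise ValueError("logit sequences must be non-empty and equally long")
    matches = sum(argmax(a) == argmax(b) for a, b in zip(neuron_logits, reference_logits))
    return matches / len(neuron_logits)


reference = [[0.1, 2.0, 0.3], [1.5, 0.2, 0.1], [0.0, 0.1, 3.0], [0.9, 0.8, 0.1]]
close = [[0.2, 1.9, 0.3], [1.4, 0.3, 0.1], [0.1, 0.0, 2.8], [0.8, 0.7, 0.1]]
drifted = [[0.2, 1.9, 0.3], [1.4, 0.3, 0.1], [0.1, 0.0, 2.8], [0.7, 0.8, 0.1]]

rate_ok = compute_top1_match_rate(close, reference)      # all four positions agree
rate_bad = compute_top1_match_rate(drifted, reference)   # last position flips

top1_match_rate = rate_ok
assert top1_match_rate >= MIN_TOP1_MATCH_RATE
print(rate_ok, rate_bad)
```

A check like this complements the relaxed tolerances: they bound how far the logit values drift, while the top-1 rate bounds how often the greedy token actually changes.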

def load_reference_logits():
    """Load pre-generated CPU reference logits."""
    print(f" Loading reference logits from {REFERENCE_PATH}...")
    data = torch.load(REFERENCE_PATH, weights_only=False)

nit: why is weights_only false here?
