Add Laguna-XS.2 contrib model #158
Open
jimburtoft wants to merge 6 commits into
NxDI implementation of PerceptronAI/Isaac-0.2-2B-Preview VLM:
- Qwen3 text backbone with SigLIP2 vision encoder
- 2-layer MLP projector with pixel shuffle (64 vision tokens/image)
- Supports TP=1/2/4, seq_len up to 8192
- 110.7 tok/s text-only, 108.7 tok/s image+text on trn2.3xlarge
- 9.0ms TPOT at seq_len=1024
- BF16, CTE flash attention enabled
- Validated: cosine 0.9999+ vs CPU reference across all configs
- vLLM-neuron integration with 3-file patch (text-only working, ~78 tok/s)
- GPU comparative benchmark: L40S at 52 tok/s vs trn2 at 111 tok/s (2.13x speedup)
- modular_isaac.py perceptron import fix (nuke_perceptron_import.py)
- execute_model override for logits-to-token-ID conversion
- Known limitation: image+text via vLLM not yet supported (pixel_values format mismatch)
The previous benchmark used enforce_eager=True, which handicapped the GPU to 52 tok/s. With CUDA graphs + torch.compile + FlashAttention v2, the L40S achieves 174 tok/s. The GPU is 1.5x faster per core than a single NeuronCore, but trn2 with DP=4 is 2.5x faster at the device level.
NxDI implementation of poolside/Laguna-XS.2 (33B total / 3B active MoE) for agentic coding on trn2.3xlarge (TP=4, SDK 2.29).
Architecture features:
- Softplus attention gating (per-head output gating)
- Variable GQA heads (48 for full-attn, 64 for SWA)
- Mixed SWA/full attention with per-layer dispatch
- Dual RoPE (YaRN for full-attn, default for SWA)
- Sigmoid MoE routing with e_score_correction_bias
- MoE fused TKG NKI kernel (sigmoid routing)
- CTE NKI flash attention kernel
Performance: BS=1/8K=91 tok/s, BS=4/4K=223 tok/s, BS=8/2K=310 tok/s
Accuracy: logit_validation() passes (CTE + TKG modes)
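The gating and routing features listed above can be illustrated with a small, framework-free sketch. It is not taken from the PR's modeling code: the function names are hypothetical, and it assumes that the correction bias only affects which experts are selected while the mixing weights come from the unbiased sigmoid scores, and that the softplus gate is one scalar per head per token.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x: float) -> float:
    # Numerically stable form of log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def gate_head_output(head_output, gate_logit):
    """Scale one attention head's output by softplus(gate_logit).

    Assumption: the gate is a single scalar per head per token.
    """
    g = softplus(gate_logit)
    return [g * v for v in head_output]


def route_token(router_logits, correction_bias, top_k):
    """Select top_k experts for one token; return (expert_ids, weights).

    Assumption: the bias is added only for expert selection, and the
    mixing weights are the unbiased sigmoid scores, L1-normalized.
    """
    scores = [sigmoid(z) for z in router_logits]
    biased = [s + b for s, b in zip(scores, correction_bias)]
    expert_ids = sorted(range(len(biased)), key=lambda i: biased[i], reverse=True)[:top_k]
    selected = [scores[i] for i in expert_ids]
    total = sum(selected)  # L1 norm; sigmoid outputs are strictly positive
    weights = [s / total for s in selected]
    return expert_ids, weights
```

With a large bias on one expert, that expert is selected even if its raw score is low, but its weight in the mixture still reflects the raw score.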
- Add vLLM serving support (serve_laguna.py, start-vllm-server.sh)
- Fix TKG attention mask padding for vLLM continuous-batching mode
- Requires pre-sharded weights for trn2.3xlarge (128GB host RAM)
Implement batch-folding approach to work around the attention_block_tkg kernel's kv_heads=1 limitation. Folds kv_heads into the batch dimension so the kernel sees (B*kv_heads, q_heads_per_kv) instead of (B, q_heads).
Changes:
- Add attention_block_tokengen_nki_kernel override with mask reshaping from (S_ctx, B, q_heads, S_tkg) to (S_ctx, B*kv_heads, q_per_kv, S_tkg)
- Unfold batch-folded output back to standard shape after kernel returns
- Add test_mega_kernel.py integration test (compile, accuracy, perf)
- Requires companion nki-library patch (multi-KV-head GQA in attention_block_tkg.py)
Performance: 89.9 tok/s at BS=1/2K (comparable to non-mega-kernel path). The mega-kernel fuses RMSNorm+QKV+RoPE+Attention, eliminating HBM round-trips between these ops.
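The index arithmetic behind the batch-folding described above can be sketched in plain Python. This is an illustration only, not the PR's kernel override: the helper names are hypothetical, and it assumes that query heads sharing a KV head are contiguous and that the batch index is the outer index of the folded dimension.

```python
def fold_mask_shape(mask_shape, kv_heads):
    """Map (S_ctx, B, q_heads, S_tkg) to (S_ctx, B*kv_heads, q_per_kv, S_tkg)."""
    s_ctx, batch, q_heads, s_tkg = mask_shape
    if q_heads % kv_heads != 0:
        raise ValueError("q_heads must be divisible by kv_heads")
    return (s_ctx, batch * kv_heads, q_heads // kv_heads, s_tkg)


def fold_head_index(b, q_head, kv_heads, q_heads):
    """Map (batch, query head) to (folded batch, query head within its KV group).

    Assumption: query heads [kv * q_per_kv, (kv + 1) * q_per_kv) share KV head kv.
    """
    q_per_kv = q_heads // kv_heads
    kv, q_in_group = divmod(q_head, q_per_kv)
    return (b * kv_heads + kv, q_in_group)


def unfold_head_index(folded_b, q_in_group, kv_heads, q_heads):
    """Inverse of fold_head_index, as applied to the kernel output."""
    q_per_kv = q_heads // kv_heads
    b, kv = divmod(folded_b, kv_heads)
    return (b, kv * q_per_kv + q_in_group)
```

After folding, each entry of the folded batch contains exactly one KV head, which is the case the kernel already handles; unfolding restores the original (B, q_heads) layout.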
tejasamx-aws
approved these changes
May 10, 2026
tejasamx-aws
left a comment
The unit test directory is missing; should we add it eventually? Also, please squash commits for a cleaner history in future.
    #
    # Default tol_map: K5=0.01, K50=0.02, K1000=0.03, All=0.05
    # Laguna requires ~20x relaxation for K1000/All due to MoE routing.
    moe_tol_map = {
The 0.40-0.50 tolerances do make sense for a 256-expert sigmoid MoE, but they're 8-50x more relaxed than for typical models. It could be worth adding a supplementary assertion: assert top1_match_rate >= 0.85
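One way the suggested supplementary check could be computed is sketched below. The helper name and the list-of-lists input format are assumptions for illustration; the actual test would operate on the logit tensors produced by the validation run.

```python
def top1_match_rate(neuron_logits, reference_logits):
    """Fraction of positions where both logit rows have the same argmax.

    Each argument is a sequence of per-position logit rows (hypothetical
    format; one row of vocabulary scores per generated position).
    """
    if not neuron_logits or len(neuron_logits) != len(reference_logits):
        raise ValueError("inputs must be non-empty and of equal length")

    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)

    matches = sum(
        argmax(a) == argmax(b) for a, b in zip(neuron_logits, reference_logits)
    )
    return matches / len(neuron_logits)
```

In the test, the result would then be guarded with `assert top1_match_rate(neuron_logits, reference_logits) >= 0.85`, using the threshold proposed in the review comment.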
    def load_reference_logits():
        """Load pre-generated CPU reference logits."""
        print(f" Loading reference logits from {REFERENCE_PATH}...")
        data = torch.load(REFERENCE_PATH, weights_only=False)
nit: why is weights_only=False here?
Note: The below template includes items meant for model contributions only. For other contributions such as bug fixes, features, etc., only fill out the relevant portions of the form.
Description
NxDI implementation of poolside/Laguna-XS.2, a 33B-parameter / 3B-active Mixture-of-Experts decoder model designed for agentic coding tasks. Runs on trn2.3xlarge with TP=4 (SDK 2.29).
Key architectural novelties implemented:
- Softplus attention gating: F.softplus(g_proj(hidden_states))
- Sigmoid MoE routing with e_score_correction_bias and L1 normalization
Model Information
Model Name: Laguna-XS.2
Model Architecture: Decoder-only transformer (MoE, 256 routed experts + 1 shared expert, top-8 routing)
Purpose: Code generation and agentic coding tasks
Checklist
Required Components
Accuracy Test (test/integration/test_logit_validation.py)
- logit_validation() framework
README.md with the following sections:
- poolside/Laguna-XS.2 on HuggingFace
Source Code (src/)
- src/__init__.py exports NeuronLagunaForCausalLM and LagunaInferenceConfig
- src/modeling_laguna.py: full NxDI implementation (~1266 lines)
Optional Components
Folder Structure
Confirm your contribution follows this structure:
Testing
How did you test this change?
Tested on trn2.3xlarge (LNC=2, 4 NeuronCores) with Neuron SDK 2.29. Tests include:
- test_laguna.py: end-to-end compilation + generation
- test_logit_validation.py: logit validation against CPU reference (CTE + TKG modes)
- benchmark_batch.py / benchmark_workloads.py: throughput benchmarks across batch sizes
Test Results:
- test_laguna.py (compile + generate)
- test_logit_validation.py --cte-only
- test_logit_validation.py (full)
Benchmark Results:
Compatibility
Tested with:
Additional Information
- transformers >= 5.7.0 due to custom modeling code (trust_remote_code=True).
Related Issues
None.
vLLM Integration
By submitting this PR, I confirm that: