Add GLM-4.7-Flash contrib model (30B-A3B MoE, MLA attention) by jimburtoft · Pull Request #168 · aws-neuron/neuronx-distributed-inference

jimburtoft · 2026-05-23T15:42:57Z

Note: The below template includes items meant for model contributions only. For other contributions such as bug fixes, features, etc., only fill out the relevant portions of the form.

Description

Adds NxD Inference support for GLM-4.7-Flash, a 30B total / 3B active MoE model using DeepSeek-V3-style Multi-head Latent Attention (MLA) with a compressed KV cache, 64 routed experts with sigmoid routing, and a shared expert.

Key implementation highlights:

MLA attention with weight absorption trick (94% KV cache reduction)
FP8 E4M3 quantization for MoE expert weights (attention/embeddings remain BF16)
NKI bwmm_shard_on_block CTE kernel for optimized prefill (PING_PONG strategy)
Custom Glm4MoeLiteRouter with sigmoid + e_score_correction_bias + L1 normalization
Glm4MoeLiteGenerationAdapter for transformers >= 5.0 compatibility (position_ids fix)
vLLM 0.16.0 + vllm-neuron 0.5.0 serving support
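
The routing steps named in the highlights (sigmoid scores, e_score_correction_bias, top-4 selection, L1 normalization) can be sketched in plain Python. This is a minimal illustration, not the Glm4MoeLiteRouter code: `route_token` is a hypothetical name, the logits are made up, and it assumes the DeepSeek-V3 convention that the correction bias affects only which experts are selected, not the weights applied to their outputs.

```python
import math


def route_token(router_logits, e_score_correction_bias, top_k=4):
    """Pick top_k experts for one token; return (expert_ids, weights)."""
    # Sigmoid gives each expert an independent score in (0, 1).
    scores = [1.0 / (1.0 + math.exp(-x)) for x in router_logits]
    # Assumption: the bias is added only for expert selection.
    biased = [s + b for s, b in zip(scores, e_score_correction_bias)]
    expert_ids = sorted(
        range(len(scores)), key=biased.__getitem__, reverse=True
    )[:top_k]
    # L1 normalization: the selected (positive) scores are rescaled to sum to 1.
    total = sum(scores[i] for i in expert_ids)
    weights = [scores[i] / total for i in expert_ids]
    return expert_ids, weights


# 64 routed experts with hypothetical, strictly increasing logits.
logits = [0.1 * i for i in range(64)]
expert_ids, weights = route_token(logits, [0.0] * 64)

# A large bias on expert 0 pulls it into the selection.
boosted_bias = [10.0] + [0.0] * 63
boosted_ids, boosted_weights = route_token(logits, boosted_bias)
```

With zero bias the four highest-logit experts are selected. With the boosted bias, expert 0 is selected even though its unbiased score is the lowest, and its weight is still computed from that unbiased score.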

Model Information

Model Name: GLM-4.7-Flash

Model Architecture: Decoder-only MoE transformer with Multi-head Latent Attention (MLA), 47 layers, 64 routed experts with top-4 routing, and a shared expert

Purpose: Text generation (chat/instruction following with chain-of-thought reasoning)

Checklist

Please ensure your PR includes the following items. Refer to the contrib/CONTRIBUTING.md for detailed guidelines.

Required Components

Accuracy Test (ex. test/integration/test_model.py)
- At least one integration test that validates model accuracy
- Uses exact token ID matching against CPU FP32 reference (3 prompts, all EXACT MATCH)
- Test can compile and run the model on Neuron
README.md with the following sections:
- Usage Example: Clear code example showing how to use the model
- Compatibility Matrix: Table showing tested Neuron SDK versions and instance types (Trn1/Trn2/Inf2)
- Example Checkpoints: Links to compatible model checkpoints (e.g., HuggingFace Hub)
- Testing Instructions: Command to run the test suite for the model
Source Code (src/)
- Modeling code following NxD Inference patterns
- Properly structured in the contrib folder hierarchy

Optional Components

Unit Tests (CPU or Neuron-based)
- Tests for individual modeling components
- Located in test/unit/ directory
- 4 test files: config, rope, router, weight conversion (41 tests total)

Folder Structure

Confirm your contribution follows this structure:

/contrib/models/GLM-4.7-Flash/
  README.md
  /src
    __init__.py
    modeling_glm4_moe_lite.py
    compat.py
    rope_util.py
  /test
    __init__.py
    /unit
      __init__.py
      test_config.py
      test_rope.py
      test_router.py
      test_weight_conversion.py
    /integration
      __init__.py
      test_model.py
      compile_fp8.py

Testing

How did you test this change?

Tested on trn2.3xlarge (TP=4, LNC=2) with SDK 2.29.1. The full model (30B params) was compiled, loaded, and validated against CPU FP32 reference outputs. All 3 accuracy prompts produce exact token ID matches. Multi-token generation produces coherent, non-repetitive text, and output is deterministic across runs. Batch consistency was validated (all sequences in a batch produce identical output for the same prompt).

Test Results:

Unit tests: 41/41 pass (CPU only, no device required)
Integration tests: 5/5 pass on trn2.3xlarge
- test_model_loads: Model loads successfully
- test_first_token_accuracy: 3/3 exact token ID matches vs CPU reference
- test_coherent_generation: "Paris" in output, length > 10
- test_deterministic_outputs: Two runs produce identical output
- test_batch_consistency: All batch sequences identical
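
For readers unfamiliar with exact token ID matching, the comparison behind a check like test_first_token_accuracy can be sketched as below. This is an illustrative helper only: `first_mismatch` and the token IDs are hypothetical and are not taken from test_model.py.

```python
def first_mismatch(reference_ids, candidate_ids):
    """Return the index of the first differing token ID, or None on an exact match."""
    for i, (ref, cand) in enumerate(zip(reference_ids, candidate_ids)):
        if ref != cand:
            return i
    # One sequence is a strict prefix of the other: mismatch at the shorter length.
    if len(reference_ids) != len(candidate_ids):
        return min(len(reference_ids), len(candidate_ids))
    return None


# Hypothetical token IDs standing in for CPU FP32 reference and Neuron output.
cpu_reference = [101, 2054, 2003]
neuron_output = [101, 2054, 2003]
assert first_mismatch(cpu_reference, neuron_output) is None
```

Returning the mismatch position rather than a boolean makes a failing accuracy test easier to debug, since it shows where the two outputs diverge.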

Compatibility

Tested with:

Neuron SDK Version(s): 2.29.1 (neuronx-cc 2.24.8799)
Instance Type(s): trn2.3xlarge (TP=4, LNC=2)
PyTorch Version: 2.9.0
Python Version: 3.12

Additional Information

Forked from DeepSeek-V3 NxDI contrib with adaptations for GLM-4.7-Flash's different dimensions (qk_nope_head_dim=192, v_head_dim=256), simplified routing (n_group=1), and standard RoPE (no YaRN)
Critical dimension fix: out_absorb = wkv_b[:, qk_nope_head_dim:, :] (split at qk_nope_head_dim=192, not at v_head_dim=256). The DeepSeek code worked only by coincidence, because both dimensions were 128 there
BS > 8 requires SDK 2.29.1 (DGE OOB fix in neuronx-cc 2.24.8799)
glm4_moe_lite model_type requires manual AutoConfig registration (not in transformers 4.57.6)
An NKI MLA attention kernel was evaluated but measured 1.84x slower than the XLA path due to graph fusion barriers, so the XLA path is used
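
The dimension fix above can be illustrated with nested lists standing in for tensors. This is a sketch under stated assumptions: it assumes wkv_b is laid out per head as qk_nope_head_dim key rows followed by v_head_dim value rows (which is what the slice in the fix implies), and NUM_HEADS and KV_LORA_RANK are small illustrative values, not the model's. `split_wkv_b` is a hypothetical helper.

```python
QK_NOPE_HEAD_DIM = 192  # from the PR description
V_HEAD_DIM = 256        # from the PR description
NUM_HEADS = 2           # illustrative, not the model's head count
KV_LORA_RANK = 8        # illustrative, not the model's latent rank


def split_wkv_b(wkv_b, split_at):
    """List equivalent of wkv_b[:, :split_at, :] and wkv_b[:, split_at:, :]."""
    q_absorb = [head[:split_at] for head in wkv_b]
    out_absorb = [head[split_at:] for head in wkv_b]
    return q_absorb, out_absorb


# Shape [NUM_HEADS, QK_NOPE_HEAD_DIM + V_HEAD_DIM, KV_LORA_RANK], filled with zeros.
rows_per_head = QK_NOPE_HEAD_DIM + V_HEAD_DIM
wkv_b = [
    [[0.0] * KV_LORA_RANK for _ in range(rows_per_head)]
    for _ in range(NUM_HEADS)
]

q_absorb, out_absorb = split_wkv_b(wkv_b, QK_NOPE_HEAD_DIM)  # correct split
_, out_absorb_wrong = split_wkv_b(wkv_b, V_HEAD_DIM)         # split at the wrong index

print(len(out_absorb[0]), len(out_absorb_wrong[0]))  # 256 192
```

Splitting at v_head_dim leaves only 192 rows for the value projection instead of 256. When both dimensions are equal, as in the DeepSeek case described above, the two split points coincide and the problem is invisible.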

Related Issues

N/A

vLLM Integration

This model/feature is intended for use with vLLM
Documentation includes vLLM registration instructions

vLLM serving validated with vLLM 0.16.0 + vllm-neuron 0.5.0. Requires:

Changing config.json model_type to glm4_moe (HF AutoConfig compatibility)
Registering glm4moelite in NxDI constants.py MODEL_TYPES
Pre-compiled artifacts via NEURON_COMPILED_ARTIFACTS env var
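
The first requirement, changing model_type in config.json, could be scripted roughly as follows. This is a sketch using only the standard library: `set_model_type` is a hypothetical helper, and the demo runs on a throwaway file rather than a real checkpoint directory.

```python
import json
import tempfile
from pathlib import Path


def set_model_type(config_path, new_model_type):
    """Rewrite model_type in a config.json file and return the previous value."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    previous = config.get("model_type")
    config["model_type"] = new_model_type
    path.write_text(json.dumps(config, indent=2) + "\n")
    return previous


# Demo on a temporary file; a real checkpoint's config.json has many more keys.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "config.json"
    demo_path.write_text(json.dumps({"model_type": "glm4_moe_lite"}))
    previous = set_model_type(demo_path, "glm4_moe")
    updated = json.loads(demo_path.read_text())["model_type"]
```

Working on a copy of the checkpoint's config.json keeps the original available for the non-vLLM path, which uses the glm4_moe_lite model type.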

By submitting this PR, I confirm that:

I have read and followed the contributing guidelines
This is a community contribution and may have limited testing compared to officially-supported models
The code follows best practices and is well-documented
All required components listed above are included

GLM-4.7-Flash (zai-org/GLM-4.7-Flash) is a 30B-A3B MoE model using DeepSeek-V3-style Multi-head Latent Attention (MLA) with a compressed KV cache, 64 routed experts with sigmoid routing, and a shared expert.

Key features:
- MLA attention with 94% KV cache reduction (576 dims vs 10,240)
- FP8 E4M3 quantization for MoE expert weights
- NKI bwmm_shard_on_block CTE kernel for optimized prefill
- 16K context support on trn2.3xlarge TP=4 LNC=2
- vLLM 0.16.0 serving support (48 tok/s at concurrency=4)
- 51.7 tok/s throughput at BS=4, 99.8 tok/s at BS=16/SEQ=4096

Validated on trn2.3xlarge with SDK 2.29.1 (neuronx-cc 2.24.8799). Includes 4 unit test files (41 tests, CPU) and 5 integration tests (Neuron device).
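
The 94% figure follows directly from the two cache widths quoted above. The short calculation below reproduces it. The per-token byte counts additionally assume a BF16 cache (2 bytes per element) across all 47 layers, which is an assumption for illustration rather than something stated in the PR.

```python
MLA_CACHE_DIMS = 576      # compressed MLA cache width per token, from the PR
FULL_KV_DIMS = 10_240     # uncompressed K+V width per token, from the PR
NUM_LAYERS = 47           # from the model description
BYTES_PER_ELEMENT = 2     # assumption: BF16 cache

reduction = 1 - MLA_CACHE_DIMS / FULL_KV_DIMS
print(f"KV cache reduction: {reduction:.0%}")  # KV cache reduction: 94%

# Per-token cache footprint summed over all layers, under the BF16 assumption.
mla_bytes_per_token = MLA_CACHE_DIMS * BYTES_PER_ELEMENT * NUM_LAYERS
full_bytes_per_token = FULL_KV_DIMS * BYTES_PER_ELEMENT * NUM_LAYERS
print(mla_bytes_per_token, full_bytes_per_token)  # 54144 962560
```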

jimburtoft wants to merge 1 commit into aws-neuron:main from jimburtoft:contrib/glm4-moe-lite
