Add GLM-4.7-Flash contrib model (30B-A3B MoE, MLA attention) #168

Open
jimburtoft wants to merge 1 commit into aws-neuron:main from jimburtoft:contrib/glm4-moe-lite

Note: The below template includes items meant for model contributions only. For other contributions such as bug fixes, features, etc., only fill out the relevant portions of the form.

Description

Adds NxD Inference support for GLM-4.7-Flash, a 30B total / 3B active MoE model using DeepSeek-V3-style Multi-head Latent Attention (MLA) with a compressed KV cache, 64 routed experts with sigmoid routing, and a shared expert.

Key implementation highlights:

  • MLA attention with weight absorption trick (94% KV cache reduction)
  • FP8 E4M3 quantization for MoE expert weights (attention/embeddings remain BF16)
  • NKI bwmm_shard_on_block CTE kernel for optimized prefill (PING_PONG strategy)
  • Custom Glm4MoeLiteRouter with sigmoid + e_score_correction_bias + L1 normalization
  • Glm4MoeLiteGenerationAdapter for transformers >= 5.0 compatibility (position_ids fix)
  • vLLM 0.16.0 + vllm-neuron 0.5.0 serving support
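
The routing named above (sigmoid scores, e_score_correction_bias, L1 normalization) can be sketched in plain Python. This is a minimal illustration, not the Glm4MoeLiteRouter code from this PR. It assumes the DeepSeek-V3 convention in which the bias only influences which experts are selected, while the unbiased sigmoid scores are normalized to form the weights. The function name `route_token` and the toy sizes are hypothetical.

```python
import math


def route_token(logits, e_score_correction_bias, top_k):
    """Pick top_k experts for one token and return (indices, weights).

    Assumption (not confirmed by the PR): the correction bias is added
    only for expert selection; the returned weights are the unbiased
    sigmoid scores, L1-normalized over the selected experts.
    """
    # Sigmoid gives an independent score per expert (no softmax coupling).
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # Biased scores are used only to decide which experts win.
    biased = [s + b for s, b in zip(scores, e_score_correction_bias)]
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]
    # L1 normalization: the selected weights sum to 1.
    total = sum(scores[i] for i in chosen)
    weights = [scores[i] / total for i in chosen]
    return chosen, weights


# Toy example with 4 experts and top-2 routing (the model uses 64 experts, top-4).
indices, weights = route_token(
    logits=[2.0, 0.0, -1.0, 1.0],
    e_score_correction_bias=[0.0, 0.6, 0.0, 0.0],
    top_k=2,
)
```

In the toy call, the bias lifts expert 1 above expert 3 for selection, but expert 1 still receives the smaller of the two weights because the weights come from the unbiased scores.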

Model Information

Model Name: GLM-4.7-Flash

Model Architecture: Decoder-only MoE transformer with Multi-head Latent Attention (MLA), 47 layers, 64 experts top-4, shared expert

Purpose: Text generation (chat/instruction following with chain-of-thought reasoning)

Checklist

Please ensure your PR includes the following items. Refer to the contrib/CONTRIBUTING.md for detailed guidelines.

Required Components

  • Accuracy Test (ex. test/integration/test_model.py)

    • At least one integration test that validates model accuracy
    • Uses exact token ID matching against CPU FP32 reference (3 prompts, all EXACT MATCH)
    • Test can compile and run the model on Neuron
  • README.md with the following sections:

    • Usage Example: Clear code example showing how to use the model
    • Compatibility Matrix: Table showing tested Neuron SDK versions and instance types (Trn1/Trn2/Inf2)
    • Example Checkpoints: Links to compatible model checkpoints (e.g., HuggingFace Hub)
    • Testing Instructions: Command to run the test suite for the model
  • Source Code (src/)

    • Modeling code following NxD Inference patterns
    • Properly structured in the contrib folder hierarchy

Optional Components

  • Unit Tests (CPU or Neuron-based)
    • Tests for individual modeling components
    • Located in test/unit/ directory
    • 4 test files: config, rope, router, weight conversion (41 tests total)

Folder Structure

Confirm your contribution follows this structure:

/contrib/models/GLM-4.7-Flash/
  README.md
  /src
    __init__.py
    modeling_glm4_moe_lite.py
    compat.py
    rope_util.py
  /test
    __init__.py
    /unit
      __init__.py
      test_config.py
      test_rope.py
      test_router.py
      test_weight_conversion.py
    /integration
      __init__.py
      test_model.py
      compile_fp8.py

Testing

How did you test this change?

Tested on trn2.3xlarge (TP=4, LNC=2) with SDK 2.29.1. The full model (30B params) was compiled, loaded, and validated against CPU FP32 reference outputs. All 3 accuracy prompts produce exact token ID matches. Multi-token generation produces coherent, non-repetitive text, and outputs are deterministic across runs. Batch consistency was validated (all sequences in a batch produce identical output for the same prompt).

Test Results:

  • Unit tests: 41/41 pass (CPU only, no device required)
  • Integration tests: 5/5 pass on trn2.3xlarge
    • test_model_loads: Model loads successfully
    • test_first_token_accuracy: 3/3 exact token ID matches vs CPU reference
    • test_coherent_generation: "Paris" in output, length > 10
    • test_deterministic_outputs: Two runs produce identical output
    • test_batch_consistency: All batch sequences identical
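
The exact-match and batch-consistency checks listed above reduce to simple list comparisons on token IDs. The helpers below are a hypothetical sketch of that logic, not the code in test/integration/test_model.py, and the token IDs are placeholders rather than real model output.

```python
def first_mismatch(generated_ids, reference_ids):
    """Return the index of the first differing token, or None if identical."""
    for i, (got, want) in enumerate(zip(generated_ids, reference_ids)):
        if got != want:
            return i
    # Same prefix but different lengths: mismatch starts where the shorter ends.
    if len(generated_ids) != len(reference_ids):
        return min(len(generated_ids), len(reference_ids))
    return None


def batch_is_consistent(batch_ids):
    """True when every sequence in the batch equals the first one."""
    return all(seq == batch_ids[0] for seq in batch_ids[1:])


# Placeholder token IDs, not real model output.
reference = [101, 7592, 2088]
exact = first_mismatch([101, 7592, 2088], reference)
diverged = first_mismatch([101, 7592, 9999], reference)
consistent = batch_is_consistent([reference, list(reference)])
```

Reporting the first mismatching position, rather than only pass or fail, makes it easier to tell a numerical drift late in the sequence from a wrong first token.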

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.29.1 (neuronx-cc 2.24.8799)
  • Instance Type(s): trn2.3xlarge (TP=4, LNC=2)
  • PyTorch Version: 2.9.0
  • Python Version: 3.12

Additional Information

  • Forked from DeepSeek-V3 NxDI contrib with adaptations for GLM-4.7-Flash's different dimensions (qk_nope_head_dim=192, v_head_dim=256), simplified routing (n_group=1), and standard RoPE (no YaRN)
  • Critical dimension fix: out_absorb = wkv_b[:, qk_nope_head_dim:, :] (split at 192, not at v_head_dim=256). The DeepSeek code worked only by coincidence, since both dims were 128 there
  • BS > 8 requires SDK 2.29.1 (DGE OOB fix in neuronx-cc 2.24.8799)
  • glm4_moe_lite model_type requires manual AutoConfig registration (not in transformers 4.57.6)
  • An NKI MLA attention kernel was investigated but measured 1.84x slower than XLA due to graph fusion barriers, so the XLA path is used
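
The dimension fix above can be illustrated by counting rows. The sketch assumes each head of wkv_b stores qk_nope_head_dim key rows followed by v_head_dim value rows, which is what the slicing in the fix implies. The helper name `split_kv_b_rows` is hypothetical and the code tracks only row counts, not real weights.

```python
def split_kv_b_rows(qk_nope_head_dim, v_head_dim, split_at):
    """Split one head's kv_b rows into (key-side, value-side) row counts.

    Assumes each head stores qk_nope_head_dim key rows followed by
    v_head_dim value rows, as in the slicing shown above.
    """
    rows = list(range(qk_nope_head_dim + v_head_dim))
    q_absorb_rows = rows[:split_at]    # rows absorbed into the query side
    out_absorb_rows = rows[split_at:]  # rows absorbed into the output side
    return len(q_absorb_rows), len(out_absorb_rows)


# GLM-4.7-Flash dims from this PR: qk_nope_head_dim=192, v_head_dim=256.
correct = split_kv_b_rows(192, 256, split_at=192)  # (192, 256)
wrong = split_kv_b_rows(192, 256, split_at=256)    # (256, 192): value side is short

# With equal dims (128 and 128) both split points coincide, hiding the bug.
same = split_kv_b_rows(128, 128, split_at=128)     # (128, 128)
```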

Related Issues

N/A

vLLM Integration

  • This model/feature is intended for use with vLLM
  • Documentation includes vLLM registration instructions

vLLM serving validated with vLLM 0.16.0 + vllm-neuron 0.5.0. Requires:

  1. Changing config.json model_type to glm4_moe (HF AutoConfig compatibility)
  2. Registering glm4moelite in NxDI constants.py MODEL_TYPES
  3. Pre-compiled artifacts via NEURON_COMPILED_ARTIFACTS env var
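
Step 1 is a one-field edit to the checkpoint's config.json. A minimal sketch using only the standard library follows. The helper name `set_model_type` is hypothetical, and the demo writes to a temporary file rather than a real checkpoint.

```python
import json
import tempfile
from pathlib import Path


def set_model_type(config_path, model_type):
    """Rewrite the model_type field of a config.json in place.

    Returns the previous value so the change can be reverted later.
    """
    path = Path(config_path)
    config = json.loads(path.read_text())
    previous = config.get("model_type")
    config["model_type"] = model_type
    path.write_text(json.dumps(config, indent=2) + "\n")
    return previous


# Demo against a throwaway file; a real config.json has many more fields.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "config.json"
    demo_path.write_text(json.dumps({"model_type": "glm4_moe_lite"}))
    previous = set_model_type(demo_path, "glm4_moe")
    updated = json.loads(demo_path.read_text())["model_type"]
```

Keeping the previous value is useful because the same checkpoint directory may still need the original model_type when used outside vLLM.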

By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

GLM-4.7-Flash (zai-org/GLM-4.7-Flash) is a 30B-A3B MoE model using
DeepSeek-V3-style Multi-head Latent Attention (MLA) with compressed KV
cache, 64 routed experts with sigmoid routing, and a shared expert.

Key features:
- MLA attention with 94% KV cache reduction (576 dims vs 10,240)
- FP8 E4M3 quantization for MoE expert weights
- NKI bwmm_shard_on_block CTE kernel for optimized prefill
- 16K context support on trn2.3xlarge TP=4 LNC=2
- vLLM 0.16.0 serving support (48 tok/s at concurrency=4)
- 51.7 tok/s throughput at BS=4, 99.8 tok/s at BS=16/SEQ=4096

Validated on trn2.3xlarge with SDK 2.29.1 (neuronx-cc 2.24.8799).
Includes 4 unit test files (41 tests, CPU) and 5 integration tests (Neuron device).
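
The "576 dims vs 10,240" figure in the feature list can be reproduced with a short calculation. qk_nope_head_dim=192 and v_head_dim=256 come from this PR. The values num_heads=20, qk_rope_head_dim=64, and kv_lora_rank=512 are assumptions chosen because they reproduce the stated numbers; they are not confirmed anywhere on this page.

```python
def kv_dims_per_token(num_heads, qk_nope_head_dim, qk_rope_head_dim,
                      v_head_dim, kv_lora_rank):
    """Return (uncompressed, MLA-compressed) cached dims per token per layer."""
    # Uncompressed cache: full K and V vectors for every head.
    full = num_heads * ((qk_nope_head_dim + qk_rope_head_dim) + v_head_dim)
    # MLA cache: one shared latent vector plus one shared RoPE key.
    compressed = kv_lora_rank + qk_rope_head_dim
    return full, compressed


# num_heads, qk_rope_head_dim and kv_lora_rank are assumed values (see above).
full, compressed = kv_dims_per_token(
    num_heads=20,
    qk_nope_head_dim=192,
    qk_rope_head_dim=64,
    v_head_dim=256,
    kv_lora_rank=512,
)
reduction = 1.0 - compressed / full
```

Under these assumptions the result is 10,240 versus 576 dims, a reduction of about 94.4%, which matches the rounded 94% quoted above.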