Add GLM-4.7-Flash contrib model (30B-A3B MoE, MLA attention)#168
Open
jimburtoft wants to merge 1 commit into
GLM-4.7-Flash (zai-org/GLM-4.7-Flash) is a 30B-A3B MoE model using DeepSeek-V3-style Multi-head Latent Attention (MLA) with compressed KV cache, 64 routed experts with sigmoid routing, and shared expert.
Key features:
- MLA attention with 94% KV cache reduction (576 dims vs 10,240)
- FP8 E4M3 quantization for MoE expert weights
- NKI bwmm_shard_on_block CTE kernel for optimized prefill
- 16K context support on trn2.3xlarge TP=4 LNC=2
- vLLM 0.16.0 serving support (48 tok/s at concurrency=4)
- 51.7 tok/s throughput at BS=4, 99.8 tok/s at BS=16/SEQ=4096
Validated on trn2.3xlarge with SDK 2.29.1 (neuronx-cc 2.24.8799). Includes 4 unit tests (CPU) and 5 integration tests (Neuron device).
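The 94% figure follows directly from the two cache widths quoted in the commit message. The sketch below reproduces that arithmetic. The decomposition of each total into per-head dimensions is an assumption made for illustration (only `qk_nope_head_dim=192` and `v_head_dim=256` appear elsewhere in this PR; `KV_LORA_RANK`, `QK_ROPE_HEAD_DIM`, and `NUM_HEADS` are assumed values that happen to match the quoted totals).

```python
# Per-token, per-layer KV cache width, using the two figures quoted in the PR.
MLA_CACHE_DIMS = 576      # compressed MLA cache entry (from the PR text)
MHA_CACHE_DIMS = 10_240   # uncompressed K and V across all heads (from the PR text)


def cache_reduction(compressed: int, full: int) -> float:
    """Fraction of cache width saved by storing `compressed` instead of `full`."""
    return 1.0 - compressed / full


# ASSUMPTION: one decomposition consistent with the quoted totals.
KV_LORA_RANK = 512        # assumed latent width
QK_ROPE_HEAD_DIM = 64     # assumed shared RoPE key width
NUM_HEADS = 20            # assumed attention head count
QK_NOPE_HEAD_DIM = 192    # from the PR's Additional Information
V_HEAD_DIM = 256          # from the PR's Additional Information

# MLA caches one latent vector plus one RoPE key per token.
mla_dims = KV_LORA_RANK + QK_ROPE_HEAD_DIM
# Standard attention caches a full key and a full value per head.
mha_dims = NUM_HEADS * ((QK_NOPE_HEAD_DIM + QK_ROPE_HEAD_DIM) + V_HEAD_DIM)

reduction = cache_reduction(MLA_CACHE_DIMS, MHA_CACHE_DIMS)
print(f"{mla_dims} vs {mha_dims}: {reduction:.1%} smaller")
```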
Note: The below template includes items meant for model contributions only. For other contributions such as bug fixes, features, etc., only fill out the relevant portions of the form.
Description
Adds NxD Inference support for GLM-4.7-Flash, a 30B total / 3B active MoE model using DeepSeek-V3-style Multi-head Latent Attention (MLA) with compressed KV cache, 64 routed experts with sigmoid routing, and shared expert.
Key implementation highlights:
- `bwmm_shard_on_block` CTE kernel for optimized prefill (PING_PONG strategy)
- `Glm4MoeLiteRouter` with sigmoid + e_score_correction_bias + L1 normalization
- `Glm4MoeLiteGenerationAdapter` for transformers >= 5.0 compatibility (position_ids fix)
Model Information
Model Name: GLM-4.7-Flash
Model Architecture: Decoder-only MoE transformer with Multi-head Latent Attention (MLA), 47 layers, 64 experts top-4, shared expert
Purpose: Text generation (chat/instruction following with chain-of-thought reasoning)
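The routing scheme named in the description (sigmoid scores, `e_score_correction_bias`, L1 normalization, top-4 of 64 experts) can be sketched for a single token as follows. This is a minimal pure-Python illustration, not the `Glm4MoeLiteRouter` code from this PR. It assumes the DeepSeek-V3 convention that the correction bias affects only which experts are selected while the mixing weights come from the unbiased scores, and it leaves out any expert grouping or scaling factor. The function name `route` is hypothetical.

```python
import math


def route(logits, correction_bias, top_k=4):
    """Pick `top_k` experts for one token and return (indices, weights)."""
    # Sigmoid scores each expert independently (no softmax across experts).
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # ASSUMPTION: the bias is added only for expert selection.
    biased = [s + b for s, b in zip(scores, correction_bias)]
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]
    # L1 normalization: selected unbiased scores are rescaled to sum to 1.
    picked = [scores[i] for i in chosen]
    total = sum(picked)
    weights = [p / total for p in picked]
    return chosen, weights


num_experts = 64
logits = [0.0] * num_experts
logits[3], logits[10], logits[20], logits[40] = 2.0, 1.0, 0.5, 0.25
bias = [0.0] * num_experts
bias[50] = 0.5  # lifts an otherwise average expert into the top-4

chosen, weights = route(logits, bias)
print(chosen, [round(w, 3) for w in weights])
```

In this toy input, expert 50 is selected because of its bias, but its mixing weight is still computed from its unbiased score.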
Checklist
Please ensure your PR includes the following items. Refer to the contrib/CONTRIBUTING.md for detailed guidelines.
Required Components
- Accuracy Test (ex. `test/integration/test_model.py`)
- README.md with the following sections:
- Source Code (`src/`)
Optional Components
- `test/unit/` directory
Folder Structure
Confirm your contribution follows this structure:
Testing
How did you test this change?
Tested on trn2.3xlarge (TP=4, LNC=2) with SDK 2.29.1. Full model (30B params) compiled, loaded, and validated against CPU FP32 reference outputs. All 3 accuracy prompts produce exact token ID matches. Multi-token generation produces coherent, non-repetitive text. Deterministic across runs. Batch consistency validated (all sequences in batch produce identical output for same prompt).
Test Results:
- `test_model_loads`: Model loads successfully
- `test_first_token_accuracy`: 3/3 exact token ID matches vs CPU reference
- `test_coherent_generation`: "Paris" in output, length > 10
- `test_deterministic_outputs`: Two runs produce identical output
- `test_batch_consistency`: All batch sequences identical
Compatibility
Tested with:
Additional Information
- `out_absorb = wkv_b[:, qk_nope_head_dim:, :]` (split at 192, not v_head_dim=256). DeepSeek worked by coincidence since both dims were 128
- `glm4_moe_lite` model_type requires manual AutoConfig registration (not in transformers 4.57.6)
Related Issues
N/A
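The `out_absorb` note under Additional Information can be checked with row counts alone. The sketch below assumes that each head of `wkv_b` stores its `qk_nope_head_dim` rows first and its `v_head_dim` rows after them, as the slice in the note implies. It also assumes the earlier behaviour was a split at `v_head_dim`, which is one reading of the note rather than something the PR shows. The helpers `split_wkv_b` and `check` are hypothetical names.

```python
def split_wkv_b(rows_per_head, split_at):
    """Split one head's rows of wkv_b into (q_absorb, out_absorb) at `split_at`."""
    return rows_per_head[:split_at], rows_per_head[split_at:]


def check(qk_nope_head_dim, v_head_dim):
    """Return the out_absorb row count for the correct and the assumed-wrong split."""
    # Stand-in for one head of wkv_b: only the number of rows matters here.
    rows = list(range(qk_nope_head_dim + v_head_dim))
    _, out_correct = split_wkv_b(rows, qk_nope_head_dim)
    _, out_wrong = split_wkv_b(rows, v_head_dim)  # ASSUMED earlier behaviour
    return len(out_correct), len(out_wrong)


# GLM-4.7-Flash dims from the PR: the two splits disagree.
glm_correct, glm_wrong = check(qk_nope_head_dim=192, v_head_dim=256)
# DeepSeek dims from the PR (both 128): the two splits coincide.
ds_correct, ds_wrong = check(qk_nope_head_dim=128, v_head_dim=128)

print(glm_correct, glm_wrong, ds_correct, ds_wrong)
```

Only the split at `qk_nope_head_dim` leaves `out_absorb` with `v_head_dim` rows when the two dimensions differ.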
vLLM Integration
vLLM serving validated with vLLM 0.16.0 + vllm-neuron 0.5.0. Requires:
- `glm4_moe` (HF AutoConfig compatibility)
- `glm4moelite` in NxDI constants.py MODEL_TYPES
By submitting this PR, I confirm that: