Qwen3-Omni-30B-A3B-Instruct by qingzwang · Pull Request #167 · aws-neuron/neuronx-distributed-inference

qingzwang · 2026-05-19T06:21:40Z

Description

Add Qwen3-Omni-30B-A3B-Instruct to the contrib models folder. This is a full multimodal (audio-in + audio-out) implementation running all
five neural network modules on Neuron with TP=8: Thinker MoE (48 layers), Vision encoder, Audio encoder, Talker MoE (20 layers), and
Unified Code Predictor (15-step unrolled). Includes progressive TTFB optimizations (pipelined thinker↔talker, streaming code2wav) that
bring audio-output TTFB from 2.7s down to 1.76s on real conversations.

Model Information

Model Name: Qwen3-Omni-30B-A3B-Instruct

Model Architecture: Mixture-of-Experts multimodal transformer (thinker 48L MoE + talker 20L MoE + vision encoder + audio encoder +
code predictor)

Purpose: Multimodal inference — ASR (speech→text) and TTS (text→speech / audio-in→audio-out)

Checklist

Required Components

Accuracy Test (test/integration/test_model.py)
- Integration test validates thinker text model compilation and inference on Neuron
- Uses logit validation via get_generate_outputs
- Test compiles and runs on Neuron with TP=32
README.md with the following sections:
- Usage Example: Multiple usage examples (ASR, TTS, streaming)
- Compatibility Matrix: Tested on trn2.48xlarge with Neuron SDK 2.24
- Example Checkpoints: Qwen/Qwen3-Omni-30B-A3B-Instruct on HuggingFace Hub
- Testing Instructions: Commands to run ASR and audio-output pipelines
Source Code (src/)
- Complete modeling code for all five modules
- Properly structured in contrib folder hierarchy

Optional Components

Unit Tests (CPU-based)
- test/unit/test_config_and_state_dict.py: config loading and HF→Neuron state dict conversion tests
- Runs on CPU without Neuron devices

Folder Structure

/contrib/models/Qwen3-Omni-30B-A3B-Instruct/
README.md
BENCHMARK_OMNI2_TTFB.md
/src
modeling_qwen3_omni.py
modeling_qwen3_omni_text.py
modeling_qwen3_omni_audio.py
modeling_qwen3_omni_talker.py
modeling_qwen3_omni_code_predictor.py
modeling_qwen3_omni_vision.py
modeling_qwen3_omni_moe.py
_upstream_compat.py
_model_path.py
/test
/unit
test_config_and_state_dict.py
/integration
test_model.py

Testing

How did you test this change?

Tested on trn2.48xlarge (8 Neuron cores, TP=8). Ran:

ASR on LibriSpeech test-clean (100 samples): 18.5% WER, RTF 0.12x
Full audio-output pipeline: 3.8s end-to-end for 7.9s audio (RTF 0.48x)
Streaming TTFB benchmark on 100 real conversations: mean 1759ms, p95 1822ms
Unit tests on CPU for config/state-dict validation

Test Results:

Pipeline	Metric	Value
ASR (100 samples)	WER	18.5%
ASR (100 samples)	RTF	0.12x
TTS (short prompt)	Total latency	3.8s for 7.9s audio
TTS (short prompt)	RTF	0.48x
Streaming TTFB (100 convs)	Mean	1759 ms
Streaming TTFB (100 convs)	p95	1822 ms
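
For readers unfamiliar with the two metrics in the table, here is a minimal sketch of how WER and RTF are commonly computed. It is not taken from the PR's benchmark scripts: the function names are hypothetical, and the PR does not state whether WER was aggregated per utterance or over the whole corpus.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_s / audio_s


# TTS row from the table: 3.8 s of processing for 7.9 s of audio.
print(round(real_time_factor(3.8, 7.9), 2))  # 0.48

# Toy example (not benchmark data): one deleted word out of six.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```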

Compatibility

Tested with:

Neuron SDK Version(s): 2.24
Instance Type(s): trn2.48xlarge
PyTorch Version: 2.9.1
Python Version: 3.10

Additional Information

All five modules (thinker, vision, audio encoder, talker, code predictor) run on Neuron
Code2Wav (codec→waveform) stays on CPU (~1s) as the overhead of Neuron dispatch would negate the win for this small model
First compilation takes ~45 min total; subsequent runs reuse cached NEFFs
Requires ~60 GB CPU RAM for HF model (projections still on CPU)
Pipelined thinker↔talker optimization overlaps text generation with talker startup

Related Issues

N/A

vLLM Integration

This model/feature is intended for use with vLLM
Documentation includes vLLM registration instructions

By submitting this PR, I confirm that:

I have read and followed the contributing guidelines
This is a community contribution and may have limited testing compared to officially-supported models
The code follows best practices and is well-documented
All required components listed above are included

Full end-to-end support for Qwen3-Omni multimodal (text/audio/vision in, text/audio out) on AWS Trainium/Inferentia via NxDI.

All five NN modules run on Neuron with TP=8:
- Thinker MoE text decoder (48 layers, 128 experts)
- Vision encoder (Qwen3-VL ViT)
- Audio encoder (32-layer transformer)
- Talker MoE (20 layers, 128 experts) + shared_expert
- Unified Code Predictor (5-layer dense, 15-step unrolled)

Code2Wav stays on CPU. Streaming code2wav path drops TTFB by up to 66%.

End-to-end audio output for 7.9s wav: ~3.8s on Neuron vs 91s CPU baseline (24x speedup).
ASR on LibriSpeech test-clean 100 samples: 66s for 670s audio (RTF 0.12x).

See contrib/models/Qwen3-Omni-30B-A3B-Instruct/README.md for details.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Benchmark: 100 multi-turn conversations from omni2 dataset (audio user input, prompt 1164-1494 tokens). TTFB measured request-start → first audio chunk.

Fixes applied:
1. Patch TensorRegistry.clear() to preserve modules_to_capture across bucket traces. Upstream wiped it, leaving non-first-bucket captures as empty (1,) fallbacks.
2. Recompile talker with TensorCaptureConfig(["norm"]); shim now uses the real post-RMSNorm hidden instead of re-embedding argmax'd tokens. Without this, greedy decoding loops on [318, 318, ...] and never emits codec_eos.
3. Match HF reference talker params (do_sample=True, top_k=50, top_p=0.8, suppress_tokens=[non-codec range]). Cuts hit-max rate from 100/100 to 15/100.
4. CHUNK_SIZE=25/LEFT_CTX=5 (was 50/10): TTFB −487 ms.
5. Compile code2wav on Neuron (bit-exact vs CPU, 3× faster per chunk).

New files:
- BENCHMARK_OMNI2_TTFB.md: full progression table and fix writeups
- test_thinker_ttft_bench.py: thinker-only TTFT/ITL/tokens-per-s
- test_ttfb_rtf_bench.py: full streaming TTFB/RTF bench (--neuron-c2w flag)
- compile_talker.py: talker with norm capture
- compile_code2wav.py: code2wav per-bucket NEFF compile
- code2wav_neuron.py: runtime shim that dispatches code2wav by chunk size

Final TTFB: mean 2000 ms / p50 1915 / p90 2389. 100/100 samples succeed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
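
To illustrate why a smaller CHUNK_SIZE lowers TTFB, here is a rough sketch of chunked decoding with left context. It assumes CHUNK_SIZE is the number of new codec frames per code2wav call and LEFT_CTX is the number of already-decoded frames re-fed for continuity; the PR does not spell out these semantics, and `chunk_windows` is a hypothetical helper, not a function from the contribution.

```python
from typing import Iterator, Tuple


def chunk_windows(
    n_frames: int, chunk_size: int = 25, left_ctx: int = 5
) -> Iterator[Tuple[int, int, int]]:
    """Yield (ctx_start, start, end) for each chunk.

    Frames [ctx_start, end) would be fed to the vocoder, but only the audio
    for frames [start, end) is new; the left context is re-decoded purely
    to smooth the chunk boundary (assumed behaviour).
    """
    if chunk_size <= 0 or left_ctx < 0:
        raise ValueError("chunk_size must be > 0 and left_ctx must be >= 0")
    for start in range(0, n_frames, chunk_size):
        end = min(start + chunk_size, n_frames)
        ctx_start = max(0, start - left_ctx)
        yield ctx_start, start, end


# With 25-frame chunks the first audio can be produced after 25 codec
# frames instead of 50, which is where a TTFB saving would come from.
print(list(chunk_windows(60, chunk_size=25, left_ctx=5)))
# [(0, 0, 25), (20, 25, 50), (45, 50, 60)]
print(list(chunk_windows(60, chunk_size=50, left_ctx=10)))
# [(0, 0, 50), (40, 50, 60)]
```

The trade-off in this model is more code2wav calls per utterance and a larger share of re-decoded context frames per chunk.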

Reader asked: if TTFT is 668 ms and ITL is 10 ms, why does the TTFB breakdown list thinker at 1346 ms?

Answer: the talker cannot start until the full thinker output is available, since _build_talker_inputs needs the entire token sequence and layer-23 hidden tensor to assemble the talker's prompt. So the "thinker" row covers prefill (668 ms) + all 68 decode steps (~680 ms) ≈ 1346 ms.

Add a section with the step-by-step timeline, the formula, and six concrete options to push TTFB below 2 s, led by thinker/talker pipelining (~600 ms headroom, no device contention since the two models live on disjoint NEFFs).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
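
The arithmetic behind the "thinker" row can be written out as a small sketch. The function name is hypothetical, and the ITL is taken as the rounded 10 ms quoted above, so the result lands within a couple of milliseconds of the 1346 ms in the breakdown rather than matching it exactly.

```python
def thinker_latency_ms(ttft_ms: float, itl_ms: float, decode_steps: int) -> float:
    """Time until the full thinker output exists: prefill plus every decode step."""
    return ttft_ms + itl_ms * decode_steps


# Prefill (TTFT) of 668 ms plus 68 decode steps at roughly 10 ms each.
print(thinker_latency_ms(668, 10, 68))  # 1348
```

Because the decode term grows linearly with the number of thinker tokens, a non-pipelined TTFB grows with response length.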

…45%)

The talker no longer waits for the full thinker output before starting. Thinker runs in a background thread and streams tokens through a custom StoppingCriteria into a condition-variable buffer; the talker's prepare_inputs_for_generation is wrapped to read trailing_text_hidden[k] from the buffer on demand at each decode step, blocking only if the (k+4)-th thinker token isn't out yet.

Implementation in test_ttfb_pipelined_bench.py:
- ThinkerStream: NxDI's _sample ignores the streamer kwarg, but it does call stopping_criteria(input_ids, ...) on every decode step. Hijack that callback to push tokens to PipelineState; tensor_capture_hook captures layer-23 hidden in the same path.
- StreamingTalkerInputs.build_prefill: assembles only the prefill slice (user parts + first 4 assistant tokens + codec specials), then unblocks talker.generate.
- install_pipelined_prepare_inputs: layers a wrapper on top of the existing streaming-c2w wrapper so each decode step gets its trailing_text_hidden row pulled from the live buffer.

Results across 100 omni2 conversations (CHUNK_SIZE=25, --neuron-c2w):
- TTFB mean: 2000 → 1759 ms (−12 %)
- TTFB p50: 1915 → 1778 ms
- TTFB p90: 2389 → 1811 ms
- TTFB p95: 3316 → 1822 ms (−45 %)

The biggest win is on the tail: TTFB no longer scales with thinker output length, since the talker hides the thinker's decode cost behind its own. The mean improvement is more modest than the naive estimate (~600 ms) for two reasons documented in BENCHMARK_OMNI2_TTFB.md:
- thinker and talker share TP=8 cores 0-7, so the Neuron driver serializes their forwards instead of running them in true parallel
- ~100 ms of GIL/cv/single-token text_projection overhead per request

A real ~600 ms win would require running the talker on a disjoint Neuron core group; the doc lists that and six other follow-up options.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
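
The producer/consumer handoff described above can be sketched with only the standard library. This is not the PR's PipelineState: `TokenBuffer` and `producer` are hypothetical names, plain integers stand in for hidden-state rows, and the interpretation of the lookahead as a zero-based index offset of 4 is an assumption.

```python
import threading
from typing import List, Optional


class TokenBuffer:
    """Append-only buffer guarded by a condition variable."""

    def __init__(self) -> None:
        self._items: List[int] = []
        self._done = False
        self._cv = threading.Condition()

    def push(self, item: int) -> None:
        # Called by the producer (the thinker thread in the PR's design).
        with self._cv:
            self._items.append(item)
            self._cv.notify_all()

    def close(self) -> None:
        # Signals that no more items will arrive, so waiters can give up.
        with self._cv:
            self._done = True
            self._cv.notify_all()

    def get(self, index: int, timeout: Optional[float] = None) -> Optional[int]:
        """Block until item `index` exists; return None if the producer finished first."""
        with self._cv:
            ready = self._cv.wait_for(
                lambda: len(self._items) > index or self._done, timeout
            )
            if not ready:
                raise TimeoutError(f"item {index} not produced within {timeout}s")
            return self._items[index] if index < len(self._items) else None


def producer(buf: TokenBuffer, tokens: List[int]) -> None:
    for tok in tokens:
        buf.push(tok)
    buf.close()


buf = TokenBuffer()
worker = threading.Thread(target=producer, args=(buf, [11, 12, 13, 14, 15, 16]))
worker.start()

LOOKAHEAD = 4  # consumer step k reads item k + 4 (assumed offset)
consumed = [buf.get(k + LOOKAHEAD, timeout=5.0) for k in range(4)]
worker.join()
print(consumed)  # [15, 16, None, None]
```

The consumer only blocks when it gets ahead of the producer, which is why the benefit of this pattern depends on the two sides actually running concurrently.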

These test result JSONs and compiler logs are not needed for the model contribution and would bloat the PR. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Ubuntu and others added 5 commits April 29, 2026 09:27

Remove large data/log files from contrib submission

548d0dd

qingzwang wants to merge 5 commits into aws-neuron:main from qingzwang:Qwen3-Omni-30B-A3B-Instruct
