Qwen3-Omni-30B-A3B-Instruct#167
Open
qingzwang wants to merge 5 commits into
Full end-to-end support for Qwen3-Omni multimodal (text/audio/vision in, text/audio out) on AWS Trainium/Inferentia via NxDI.

All five NN modules run on Neuron with TP=8:
- Thinker MoE text decoder (48 layers, 128 experts)
- Vision encoder (Qwen3-VL ViT)
- Audio encoder (32-layer transformer)
- Talker MoE (20 layers, 128 experts) + shared_expert
- Unified Code Predictor (5-layer dense, 15-step unrolled)

Code2Wav stays on CPU. The streaming code2wav path drops TTFB by up to 66%.

End-to-end audio output for a 7.9s wav: ~3.8s on Neuron vs 91s CPU baseline (24x speedup).
ASR on LibriSpeech test-clean, 100 samples: 66s for 670s audio (RTF 0.12x).

See contrib/models/Qwen3-Omni-30B-A3B-Instruct/README.md for details.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Benchmark: 100 multi-turn conversations from omni2 dataset (audio user input, prompt 1164-1494 tokens). TTFB measured request-start → first audio chunk.

Fixes applied:
1. Patch TensorRegistry.clear() to preserve modules_to_capture across bucket traces. Upstream wiped it, leaving non-first-bucket captures as empty (1,) fallbacks.
2. Recompile talker with TensorCaptureConfig(["norm"]); shim now uses the real post-RMSNorm hidden instead of re-embedding argmax'd tokens. Without this, greedy decoding loops on [318, 318, ...] and never emits codec_eos.
3. Match HF reference talker params (do_sample=True, top_k=50, top_p=0.8, suppress_tokens=[non-codec range]). Cuts hit-max rate from 100/100 to 15/100.
4. CHUNK_SIZE=25/LEFT_CTX=5 (was 50/10): TTFB −487 ms.
5. Compile code2wav on Neuron (bit-exact vs CPU, 3× faster per chunk).

New files:
- BENCHMARK_OMNI2_TTFB.md: full progression table and fix writeups
- test_thinker_ttft_bench.py: thinker-only TTFT/ITL/tokens-per-s
- test_ttfb_rtf_bench.py: full streaming TTFB/RTF bench (--neuron-c2w flag)
- compile_talker.py: talker with norm capture
- compile_code2wav.py: code2wav per-bucket NEFF compile
- code2wav_neuron.py: runtime shim that dispatches code2wav by chunk size

Final TTFB: mean 2000 ms / p50 1915 / p90 2389. 100/100 samples succeed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
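To illustrate why a smaller CHUNK_SIZE lowers TTFB (fix 4), here is a minimal sketch of chunked decoding with left context. It is not the PR's code: `chunk_windows` is a hypothetical helper, and it assumes CHUNK_SIZE and LEFT_CTX are counted in codec frames and that the left-context frames are re-fed to code2wav and then discarded from the output.

```python
from typing import Iterator, Tuple

CHUNK_SIZE = 25  # value from the commit message; unit (codec frames) is an assumption
LEFT_CTX = 5     # value from the commit message; unit (codec frames) is an assumption


def chunk_windows(
    num_frames: int, chunk_size: int = CHUNK_SIZE, left_ctx: int = LEFT_CTX
) -> Iterator[Tuple[int, int, int]]:
    """Yield (ctx_start, start, end) per chunk.

    The decoder would be fed frames[ctx_start:end], but only the audio for
    frames[start:end] would be kept; frames[ctx_start:start] are context only.
    """
    for start in range(0, num_frames, chunk_size):
        end = min(start + chunk_size, num_frames)
        ctx_start = max(0, start - left_ctx)
        yield ctx_start, start, end


if __name__ == "__main__":
    for window in chunk_windows(60):
        print(window)
```

Under these assumptions the first audio chunk can be emitted once 25 frames exist instead of 50, which is consistent with the direction of the reported TTFB change; the sketch does not reproduce the measured −487 ms.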
Reader asked: if TTFT is 668 ms and ITL is 10 ms, why does the TTFB breakdown list thinker at 1346 ms?

Answer: the talker cannot start until the full thinker output is available, since _build_talker_inputs needs the entire token sequence and layer-23 hidden tensor to assemble the talker's prompt. So the "thinker" row covers prefill (668 ms) + all 68 decode steps (~680 ms) ≈ 1346 ms.

Add a section with the step-by-step timeline, the formula, and six concrete options to push TTFB below 2 s, led by thinker/talker pipelining (~600 ms headroom, no device contention since the two models live on disjoint NEFFs).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
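The arithmetic in the answer above can be checked with a short sketch. The function name `thinker_wall_ms` is hypothetical; the inputs are the rounded figures quoted in the commit message, so the result lands within a couple of milliseconds of the reported 1346 ms rather than matching it exactly.

```python
def thinker_wall_ms(ttft_ms: int, itl_ms: int, decode_steps: int) -> int:
    """Time until the full thinker output exists: prefill plus every decode step."""
    return ttft_ms + decode_steps * itl_ms


# Figures quoted in the commit message: TTFT 668 ms, ITL 10 ms, 68 decode steps.
estimate = thinker_wall_ms(ttft_ms=668, itl_ms=10, decode_steps=68)
print(estimate)  # 668 + 68 * 10 = 1348
```

Because the non-pipelined talker waits for this whole interval, the thinker's contribution to TTFB grows linearly with the number of decode steps.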
…45%)

The talker no longer waits for the full thinker output before starting. Thinker runs in a background thread and streams tokens through a custom StoppingCriteria into a condition-variable buffer; the talker's prepare_inputs_for_generation is wrapped to read trailing_text_hidden[k] from the buffer on demand at each decode step, blocking only if the (k+4)-th thinker token isn't out yet.

Implementation in test_ttfb_pipelined_bench.py:
- ThinkerStream: NxDI's _sample ignores the streamer kwarg, but it does call stopping_criteria(input_ids, ...) on every decode step. Hijack that callback to push tokens to PipelineState; tensor_capture_hook captures layer-23 hidden in the same path.
- StreamingTalkerInputs.build_prefill: assembles only the prefill slice (user parts + first 4 assistant tokens + codec specials), then unblocks talker.generate.
- install_pipelined_prepare_inputs: layers a wrapper on top of the existing streaming-c2w wrapper so each decode step gets its trailing_text_hidden row pulled from the live buffer.

Results across 100 omni2 conversations (CHUNK_SIZE=25, --neuron-c2w):
- TTFB mean: 2000 → 1759 ms (−12 %)
- TTFB p50: 1915 → 1778 ms
- TTFB p90: 2389 → 1811 ms
- TTFB p95: 3316 → 1822 ms (−45 %)

The biggest win is on the tail: TTFB no longer scales with thinker output length, since the talker hides the thinker's decode cost behind its own.

The mean improvement is more modest than the naive estimate (~600 ms) for two reasons documented in BENCHMARK_OMNI2_TTFB.md:
- thinker and talker share TP=8 cores 0-7, so the Neuron driver serializes their forwards instead of running them in true parallel
- ~100 ms of GIL/cv/single-token text_projection overhead per request

A real ~600 ms win would require running the talker on a disjoint Neuron core group; the doc lists that and six other follow-up options.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
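The condition-variable handoff described above can be sketched with the standard library alone. This is an illustration under assumptions, not the code in test_ttfb_pipelined_bench.py: `TokenBuffer`, `run_demo`, and the zero-based reading of "the (k+4)-th token" are hypothetical, and plain integers stand in for the hidden-state rows.

```python
import threading
from typing import List, Optional

LOOKAHEAD = 4  # from the commit message: step k needs the (k+4)-th thinker token


class TokenBuffer:
    """Producer/consumer buffer guarded by a condition variable (hypothetical name)."""

    def __init__(self) -> None:
        self._cv = threading.Condition()
        self._items: List[int] = []
        self._done = False

    def push(self, item: int) -> None:
        with self._cv:
            self._items.append(item)
            self._cv.notify_all()

    def finish(self) -> None:
        with self._cv:
            self._done = True
            self._cv.notify_all()

    def get(self, index: int, timeout: Optional[float] = None) -> Optional[int]:
        """Block until items[index] exists or the producer has finished."""
        with self._cv:
            ready = self._cv.wait_for(
                lambda: len(self._items) > index or self._done, timeout=timeout
            )
            if not ready:
                raise TimeoutError(f"item {index} not produced in time")
            # None signals that the producer ended before reaching this index.
            return self._items[index] if index < len(self._items) else None


def run_demo() -> List[Optional[int]]:
    buf = TokenBuffer()

    def producer() -> None:  # stands in for the background thinker thread
        for step in range(10):
            buf.push(step * 100)
        buf.finish()

    worker = threading.Thread(target=producer)
    worker.start()
    # Consumer (stands in for the talker): step k reads item k + LOOKAHEAD.
    rows = [buf.get(k + LOOKAHEAD, timeout=5.0) for k in range(8)]
    worker.join()
    return rows


if __name__ == "__main__":
    print(run_demo())
```

The consumer only blocks when it gets ahead of the producer, which is the property the commit relies on to hide thinker decode time behind talker decode time. Returning `None` after the producer finishes is one possible end-of-stream convention chosen for this sketch.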
These test result JSONs and compiler logs are not needed for the model contribution and would bloat the PR. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Note: The below template includes items meant for model contributions only. For other contributions such as bug fixes, features, etc., only fill out the relevant portions of the form.
Description
Add Qwen3-Omni-30B-A3B-Instruct to the contrib models folder. This is a full multimodal (audio-in + audio-out) implementation running all
five neural network modules on Neuron with TP=8: Thinker MoE (48 layers), Vision encoder, Audio encoder, Talker MoE (20 layers), and
Unified Code Predictor (15-step unrolled). Includes progressive TTFB optimizations (pipelined thinker↔talker, streaming code2wav) that
bring audio-output TTFB from 2.7s down to 1.76s on real conversations.
Model Information
Model Name: Qwen3-Omni-30B-A3B-Instruct
Model Architecture: Mixture-of-Experts multimodal transformer (thinker 48L MoE + talker 20L MoE + vision encoder + audio encoder +
code predictor)
Purpose: Multimodal inference — ASR (speech→text) and TTS (text→speech / audio-in→audio-out)
Checklist
Required Components
Accuracy Test (test/integration/test_model.py)
get_generate_outputs
README.md with the following sections:
Source Code (src/)
Optional Components
test/unit/test_config_and_state_dict.py: config loading and HF→Neuron state dict conversion tests
Folder Structure
/contrib/models/Qwen3-Omni-30B-A3B-Instruct/
  README.md
  BENCHMARK_OMNI2_TTFB.md
  /src
    modeling_qwen3_omni.py
    modeling_qwen3_omni_text.py
    modeling_qwen3_omni_audio.py
    modeling_qwen3_omni_talker.py
    modeling_qwen3_omni_code_predictor.py
    modeling_qwen3_omni_vision.py
    modeling_qwen3_omni_moe.py
    _upstream_compat.py
    _model_path.py
  /test
    /unit
      test_config_and_state_dict.py
    /integration
      test_model.py
Testing
How did you test this change?
Tested on trn2.48xlarge (8 Neuron cores, TP=8). Ran:
Test Results:
Compatibility
Tested with:
Additional Information
Related Issues
N/A
vLLM Integration
By submitting this PR, I confirm that: