Qwen3-Omni-30B-A3B-Instruct #167

Open
qingzwang wants to merge 5 commits into aws-neuron:main from qingzwang:Qwen3-Omni-30B-A3B-Instruct

@qingzwang

Note: The below template includes items meant for model contributions only. For other contributions such as bug fixes, features, etc., only fill out the relevant portions of the form.

Description

Add Qwen3-Omni-30B-A3B-Instruct to the contrib models folder. This is a full multimodal (audio-in + audio-out) implementation running all
five neural network modules on Neuron with TP=8: Thinker MoE (48 layers), Vision encoder, Audio encoder, Talker MoE (20 layers), and
Unified Code Predictor (15-step unrolled). Includes progressive TTFB optimizations (pipelined thinker↔talker, streaming code2wav) that
bring audio-output TTFB from 2.7s down to 1.76s on real conversations.

Model Information

Model Name: Qwen3-Omni-30B-A3B-Instruct

Model Architecture: Mixture-of-Experts multimodal transformer (thinker 48L MoE + talker 20L MoE + vision encoder + audio encoder +
code predictor)

Purpose: Multimodal inference — ASR (speech→text) and TTS (text→speech / audio-in→audio-out)

Checklist

Required Components

  • Accuracy Test (test/integration/test_model.py)

    • Integration test validates thinker text model compilation and inference on Neuron
    • Uses logit validation via get_generate_outputs
    • Test compiles and runs on Neuron with TP=32
  • README.md with the following sections:

    • Usage Example: Multiple usage examples (ASR, TTS, streaming)
    • Compatibility Matrix: Tested on trn2.48xlarge with Neuron SDK 2.24
    • Example Checkpoints: Qwen/Qwen3-Omni-30B-A3B-Instruct on HuggingFace Hub
    • Testing Instructions: Commands to run ASR and audio-output pipelines
  • Source Code (src/)

    • Complete modeling code for all five modules
    • Properly structured in contrib folder hierarchy

Optional Components

  • Unit Tests (CPU-based)
    • test/unit/test_config_and_state_dict.py: config loading and HF→Neuron state dict conversion tests
    • Runs on CPU without Neuron devices

Folder Structure

/contrib/models/Qwen3-Omni-30B-A3B-Instruct/
  README.md
  BENCHMARK_OMNI2_TTFB.md
  /src
    modeling_qwen3_omni.py
    modeling_qwen3_omni_text.py
    modeling_qwen3_omni_audio.py
    modeling_qwen3_omni_talker.py
    modeling_qwen3_omni_code_predictor.py
    modeling_qwen3_omni_vision.py
    modeling_qwen3_omni_moe.py
    _upstream_compat.py
    _model_path.py
  /test
    /unit
      test_config_and_state_dict.py
    /integration
      test_model.py

Testing

How did you test this change?

Tested on trn2.48xlarge (8 Neuron cores, TP=8). Ran:

  • ASR on LibriSpeech test-clean (100 samples): 18.5% WER, RTF 0.12x
  • Full audio-output pipeline: 3.8s end-to-end for 7.9s audio (RTF 0.48x)
  • Streaming TTFB benchmark on 100 real conversations: mean 1759ms, p95 1822ms
  • Unit tests on CPU for config/state-dict validation
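
For readers unfamiliar with the WER figure above, the metric is conventionally the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below shows that standard definition for a single utterance. It is not taken from this PR: the function name is illustrative, and the text normalization and the aggregation over the 100 samples used for the reported number are not stated here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

A corpus-level score is usually computed by summing the edit distances and reference lengths across utterances before dividing, rather than averaging per-utterance rates.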

Test Results:

| Pipeline | Metric | Value |
|---|---|---|
| ASR (100 samples) | WER | 18.5% |
| ASR (100 samples) | RTF | 0.12x |
| TTS (short prompt) | Total latency | 3.8s for 7.9s audio |
| TTS (short prompt) | RTF | 0.48x |
| Streaming TTFB (100 convs) | Mean | 1759 ms |
| Streaming TTFB (100 convs) | p95 | 1822 ms |
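
The RTF values are read here as processing time divided by audio duration, which is the usual definition and agrees with the TTS row. A minimal sketch of that arithmetic, with an illustrative function name that does not come from the PR:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time per second of audio; values below 1.0 are faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# TTS (short prompt): 3.8 s of compute for 7.9 s of generated audio.
tts_rtf = real_time_factor(3.8, 7.9)
print(f"TTS RTF: {tts_rtf:.2f}x")  # prints "TTS RTF: 0.48x"
```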

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.24
  • Instance Type(s): trn2.48xlarge
  • PyTorch Version: 2.9.1
  • Python Version: 3.10

Additional Information

  • All five modules (thinker, vision, audio encoder, talker, code predictor) run on Neuron
  • Code2Wav (codec→waveform) stays on CPU (~1s), since Neuron dispatch overhead would cancel out the gain for a model this small
  • First compilation takes ~45 min total; subsequent runs reuse cached NEFFs
  • Requires ~60 GB CPU RAM for HF model (projections still on CPU)
  • Pipelined thinker↔talker optimization overlaps text generation with talker startup

Related Issues

N/A

vLLM Integration

  • This model/feature is intended for use with vLLM
  • Documentation includes vLLM registration instructions

By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

Ubuntu and others added 5 commits April 29, 2026 09:27
Full end-to-end support for Qwen3-Omni multimodal (text/audio/vision in,
text/audio out) on AWS Trainium/Inferentia via NxDI. All five NN modules
run on Neuron with TP=8:
  - Thinker MoE text decoder (48 layers, 128 experts)
  - Vision encoder (Qwen3-VL ViT)
  - Audio encoder (32-layer transformer)
  - Talker MoE (20 layers, 128 experts) + shared_expert
  - Unified Code Predictor (5-layer dense, 15-step unrolled)

Code2Wav stays on CPU. Streaming code2wav path drops TTFB by up to 66%.

End-to-end audio output for 7.9s wav: ~3.8s on Neuron vs 91s CPU baseline
(24x speedup). ASR on LibriSpeech test-clean 100 samples: 66s for 670s
audio (RTF 0.12x).

See contrib/models/Qwen3-Omni-30B-A3B-Instruct/README.md for details.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Benchmark: 100 multi-turn conversations from omni2 dataset (audio user input,
prompt 1164-1494 tokens). TTFB measured request-start → first audio chunk.

Fixes applied:
1. Patch TensorRegistry.clear() to preserve modules_to_capture across bucket
   traces — upstream wiped it, leaving non-first-bucket captures as empty
   (1,) fallbacks.
2. Recompile talker with TensorCaptureConfig(["norm"]); shim now uses the
   real post-RMSNorm hidden instead of re-embedding argmax'd tokens. Without
   this, greedy decoding loops on [318, 318, ...] and never emits codec_eos.
3. Match HF reference talker params (do_sample=True, top_k=50, top_p=0.8,
   suppress_tokens=[non-codec range]). Cuts hit-max rate from 100/100 to
   15/100.
4. CHUNK_SIZE=25/LEFT_CTX=5 (was 50/10) — TTFB −487 ms.
5. Compile code2wav on Neuron (bit-exact vs CPU, 3× faster per chunk).
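
To illustrate fix 4, the sketch below shows one plausible way a codec-frame stream can be cut into chunks with left context. It assumes that each call decodes CHUNK_SIZE new frames together with up to LEFT_CTX preceding frames, and that the waveform produced for the context frames is discarded. The function name and the list-based input are illustrative and not taken from the benchmark scripts.

```python
from typing import Iterator, List, Tuple

CHUNK_SIZE = 25  # new codec frames per chunk (value from fix 4)
LEFT_CTX = 5     # preceding frames re-fed as context (value from fix 4)


def iter_chunks(
    frames: List[int],
    chunk_size: int = CHUNK_SIZE,
    left_ctx: int = LEFT_CTX,
) -> Iterator[Tuple[List[int], int]]:
    """Yield (window, n_context) pairs.

    `window` is the slice handed to the decoder; the output that corresponds
    to its first `n_context` frames is assumed to be dropped by the caller.
    """
    for start in range(0, len(frames), chunk_size):
        ctx_start = max(0, start - left_ctx)
        window = frames[ctx_start:start + chunk_size]
        yield window, start - ctx_start


# 60 frames give three windows: 25 frames, then 5+25, then 5+10.
chunks = list(iter_chunks(list(range(60))))
```

Under this assumption a smaller chunk means the first window is complete after fewer talker steps, which is consistent with the reported TTFB reduction, at the cost of more decoder calls per second of audio.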

New files:
- BENCHMARK_OMNI2_TTFB.md — full progression table and fix writeups
- test_thinker_ttft_bench.py — thinker-only TTFT/ITL/tokens-per-s
- test_ttfb_rtf_bench.py — full streaming TTFB/RTF bench (--neuron-c2w flag)
- compile_talker.py — talker with norm capture
- compile_code2wav.py — code2wav per-bucket NEFF compile
- code2wav_neuron.py — runtime shim that dispatches code2wav by chunk size
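
The last entry describes a shim that dispatches by chunk size over per-bucket NEFFs. A common pattern for that kind of dispatch is to pick the smallest compiled bucket that fits the input and pad up to it. The sketch below shows that pattern only; the bucket sizes, function names, and padding value are hypothetical and do not come from code2wav_neuron.py.

```python
import bisect
from typing import List, Sequence, Tuple

# Hypothetical compiled sequence lengths, kept sorted in ascending order.
BUCKETS: Tuple[int, ...] = (30, 55, 105)


def pick_bucket(num_frames: int, buckets: Sequence[int] = BUCKETS) -> int:
    """Return the smallest bucket that can hold `num_frames` frames."""
    idx = bisect.bisect_left(buckets, num_frames)
    if idx == len(buckets):
        raise ValueError(f"{num_frames} frames exceed the largest bucket {buckets[-1]}")
    return buckets[idx]


def pad_to_bucket(
    frames: List[int],
    pad_value: int = 0,
    buckets: Sequence[int] = BUCKETS,
) -> Tuple[List[int], int]:
    """Pad `frames` to its bucket length and return (padded, valid_length)."""
    bucket = pick_bucket(len(frames), buckets)
    padded = frames + [pad_value] * (bucket - len(frames))
    return padded, len(frames)


padded, valid = pad_to_bucket(list(range(1, 32)))  # 31 frames go to the 55 bucket
```

The caller would run the graph compiled for `len(padded)` and keep only the output that corresponds to the first `valid` frames.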

Final TTFB: mean 2000 ms / p50 1915 / p90 2389. 100/100 samples succeed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A reader asked: if TTFT is 668 ms and ITL is 10 ms, why does the TTFB breakdown
list thinker at 1346 ms?

Answer: the talker cannot start until the full thinker output is available,
since _build_talker_inputs needs the entire token sequence and layer-23
hidden tensor to assemble the talker's prompt. So the "thinker" row covers
prefill (668 ms) + all 68 decode steps (~680 ms) ≈ 1346 ms.
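
The arithmetic in that answer can be written out as a small helper. This is only a restatement of the numbers quoted above; the function name is illustrative.

```python
def thinker_row_ms(ttft_ms: float, itl_ms: float, decode_steps: int) -> float:
    """Time until the full thinker output exists: prefill plus every decode step."""
    return ttft_ms + decode_steps * itl_ms


# 668 ms prefill + 68 steps at ~10 ms each. The result is within a couple of
# milliseconds of the reported 1346 ms because the 10 ms ITL is a rounded value.
estimate = thinker_row_ms(668, 10, 68)
print(estimate)  # prints 1348
```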

Add a section with the step-by-step timeline, the formula, and six concrete
options to push TTFB below 2 s — led by thinker/talker pipelining (~600 ms
headroom, no device contention since the two models live on disjoint NEFFs).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…45%)

The talker no longer waits for the full thinker output before starting.
Thinker runs in a background thread and streams tokens through a custom
StoppingCriteria into a condition-variable buffer; the talker's
prepare_inputs_for_generation is wrapped to read trailing_text_hidden[k]
from the buffer on demand at each decode step, blocking only if the
(k+4)-th thinker token isn't out yet.

Implementation in test_ttfb_pipelined_bench.py:
- ThinkerStream: NxDI's _sample ignores the streamer kwarg, but it does
  call stopping_criteria(input_ids, ...) on every decode step. Hijack that
  callback to push tokens to PipelineState; tensor_capture_hook captures
  layer-23 hidden in the same path.
- StreamingTalkerInputs.build_prefill: assembles only the prefill slice
  (user parts + first 4 assistant tokens + codec specials), then unblocks
  talker.generate.
- install_pipelined_prepare_inputs: layers a wrapper on top of the
  existing streaming-c2w wrapper so each decode step gets its
  trailing_text_hidden row pulled from the live buffer.
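
The core of this design is a condition-variable buffer fed from the stopping-criteria callback. The sketch below shows that pattern with the standard library only. `TokenBuffer` and `PushTokensCriteria` are hypothetical stand-ins for the roles the commit message gives to PipelineState and ThinkerStream, and a plain list replaces the `input_ids` tensor a real generate loop would pass.

```python
import threading
from typing import List, Optional


class TokenBuffer:
    """Producer appends tokens; consumers block until index k is available."""

    def __init__(self) -> None:
        self._cv = threading.Condition()
        self._tokens: List[int] = []
        self._done = False

    def push(self, token: int) -> None:
        with self._cv:
            self._tokens.append(token)
            self._cv.notify_all()

    def finish(self) -> None:
        with self._cv:
            self._done = True
            self._cv.notify_all()

    def get(self, k: int, timeout: Optional[float] = None) -> int:
        with self._cv:
            ready = self._cv.wait_for(
                lambda: len(self._tokens) > k or self._done, timeout
            )
            if not ready:
                raise TimeoutError(f"token {k} not produced within {timeout} s")
            if len(self._tokens) <= k:
                raise IndexError(f"producer finished before token {k}")
            return self._tokens[k]


class PushTokensCriteria:
    """Callable shaped like a stopping criterion; it records tokens and never stops."""

    def __init__(self, buffer: TokenBuffer) -> None:
        self.buffer = buffer

    def __call__(self, input_ids: List[int], scores=None, **kwargs) -> bool:
        self.buffer.push(input_ids[-1])  # newest token from this decode step
        return False


def fake_thinker(buffer: TokenBuffer, n_tokens: int) -> None:
    """Stand-in for the background generate call; invokes the callback once per step."""
    criteria = PushTokensCriteria(buffer)
    ids: List[int] = []
    for step in range(n_tokens):
        ids.append(100 + step)
        criteria(ids)
    buffer.finish()


state = TokenBuffer()
worker = threading.Thread(target=fake_thinker, args=(state, 8))
worker.start()
# Consumer step k needs producer token k + 4, mirroring the lookahead described above.
lookahead = [state.get(k + 4, timeout=5.0) for k in range(4)]
worker.join()
```

The consumer only blocks when it outruns the producer, so in the common case each `get` returns immediately. Raising `IndexError` once the producer has finished gives the consumer a clear signal that no further tokens will arrive.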

Results across 100 omni2 conversations (CHUNK_SIZE=25, --neuron-c2w):
- TTFB mean:  2000 → 1759 ms (−12 %)
- TTFB p50:   1915 → 1778 ms
- TTFB p90:   2389 → 1811 ms
- TTFB p95:   3316 → 1822 ms (−45 %)

The biggest win is on the tail: TTFB no longer scales with thinker output
length, since the talker hides the thinker's decode cost behind its own.

The mean improvement is more modest than the naive estimate (~600 ms) for two
reasons documented in BENCHMARK_OMNI2_TTFB.md:
- thinker and talker share TP=8 cores 0-7 — the Neuron driver serializes
  their forwards instead of running them in true parallel
- ~100 ms of GIL/cv/single-token text_projection overhead per request

A real ~600 ms win would require running the talker on a disjoint Neuron
core group; the doc lists that and six other follow-up options.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
These test result JSONs and compiler logs are not needed for the
model contribution and would bloat the PR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>