
feat: precision-driven quantization (FP16, RTN int4, static QDQ) via --precision flag #872

Open
DingmaomaoBJTU wants to merge 28 commits into main from dingmaomaobjtu/feat-fp16-conversion


@DingmaomaoBJTU DingmaomaoBJTU commented Jun 11, 2026

Collaborator

Summary

Adds a unified --precision flag that auto-selects the quantization algorithm based on the target precision. This replaces the need to manually configure --weight-type, --activation-type, and algorithm-specific flags.

Closes #867

Precision → Algorithm Mapping

| --precision | Algorithm | Description |
| --- | --- | --- |
| fp16 | FP16 conversion | Weights + activations → FP16 (I/O stays FP32) |
| int4 / w4a32 | RTN (weight-only) | 4-bit weight via MatMulNBits, activation stays FP32 |
| w4a16 | RTN + FP16 | 4-bit weight via MatMulNBits + FP16 post-processing |
| int8 | Static QDQ | Calibrated QDQ (uint8 weight + uint8 activation) |
| int16 / w16a16 | Static QDQ | Calibrated QDQ (int16 weight + uint16 activation) |
| w8a16 | Static QDQ | Calibrated QDQ (uint8 weight + uint16 activation) |
| w8a8 | Static QDQ | Calibrated QDQ (uint8 weight + uint8 activation) |
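
For reviewers, the mapping above can be read as a plain lookup table. The following is a minimal sketch only: `PrecisionPlan`, `PRECISION_TABLE`, `select_plan`, and the string labels are hypothetical names chosen for illustration, not identifiers taken from this PR.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class PrecisionPlan:
    """Illustrative record; not a type from the PR."""
    algorithm: str                  # "fp16", "rtn", "rtn+fp16", or "static_qdq"
    weight_type: Optional[str]      # None when weights are not integer-quantized
    activation_type: Optional[str]  # None when activations are left as floats


# Rows copied from the mapping table above; the labels themselves are assumptions.
PRECISION_TABLE: Dict[str, PrecisionPlan] = {
    "fp16": PrecisionPlan("fp16", None, None),
    "int4": PrecisionPlan("rtn", "int4", None),
    "w4a32": PrecisionPlan("rtn", "int4", None),
    "w4a16": PrecisionPlan("rtn+fp16", "int4", None),
    "int8": PrecisionPlan("static_qdq", "uint8", "uint8"),
    "w8a8": PrecisionPlan("static_qdq", "uint8", "uint8"),
    "w8a16": PrecisionPlan("static_qdq", "uint8", "uint16"),
    "int16": PrecisionPlan("static_qdq", "int16", "uint16"),
    "w16a16": PrecisionPlan("static_qdq", "int16", "uint16"),
}


def select_plan(precision: str) -> PrecisionPlan:
    """Look up the plan for a --precision value (case-insensitive)."""
    key = precision.strip().lower()
    if key not in PRECISION_TABLE:
        raise ValueError(f"unsupported precision: {precision!r}")
    return PRECISION_TABLE[key]
```

A table-driven lookup keeps aliases such as int4 / w4a32 on one shared definition, which is the property the summary relies on.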

Multi-Pass vs Single-Pass Execution

quantize_onnx(precision=...) handles pass decomposition internally. The caller only needs one call.

| --precision | Passes | Execution |
| --- | --- | --- |
| fp16 | 1 (single) | FP16 conversion only |
| int4 / w4a32 | 1 (single) | RTN 4-bit quantization only |
| int8 / int16 / w8a8 / w8a16 / w16a16 | 1 (single) | QDQ calibrated quantization only |
| w4a16 | 2 (multi) | Pass 1: RTN int4 → Pass 2: FP16 conversion |

Architecture: expand_precision("w4a16") → ["int4", "fp16"]. The quant layer (quantize_onnx) orchestrates intermediate files and cleanup. The command layer (build, quantize) calls quantize_onnx only once with precision=, with no if-else routing or manual multi-pass loops.
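
The pass expansion can be sketched as follows. Only the function name and the w4a16 result come from this PR; the body, the normalization step, and the choice to return every other precision unchanged are assumptions made for illustration.

```python
from typing import Dict, List

# Assumed table: w4a16 is the only precision the PR describes as multi-pass.
_MULTI_PASS: Dict[str, List[str]] = {"w4a16": ["int4", "fp16"]}


def expand_precision(precision: str) -> List[str]:
    """Return the ordered list of single-pass precisions to run."""
    key = precision.strip().lower()
    # Copy the list so callers cannot mutate the shared table.
    return list(_MULTI_PASS.get(key, [key]))
```

Under this sketch, `expand_precision("w8a16")` stays a single QDQ pass, matching the design decision that w8a16 is not "int8 then fp16".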

Key Design Decisions

  • int4 is equivalent to w4a32 — both produce RTN 4-bit weight-only quantization with activations unchanged (FP32)
  • w4a16 is the ONLY precision that expands to multi-pass; w8a16 is a single QDQ pass (uint8 weight + uint16 activation), NOT "int8 then fp16"
  • a32 (e.g., w4a32) means "activation stays FP32 (no quantization)" and is only valid with weight-only (4-bit) precisions
  • RTN and FP16 paths skip calibration entirely — warnings shown if calibration flags are provided
  • QDQ precisions (int8, int16, w8a16, etc.) still require calibration data

Supported Commands

| Command | --precision support | Notes |
| --- | --- | --- |
| winml build | | Full pipeline: export → optimize → quantize → compile |
| winml quantize | | Standalone quantization on existing ONNX |
| winml config | | Config generation respects precision |
| winml perf | | Performance testing with precision-aware builds |
| winml eval | | Evaluation with precision-aware builds |

E2E Test Results (convnext-tiny-224)

| Command | Result | Notes |
| --- | --- | --- |
| winml quantize --precision fp16 | | 109→54.6MB, 4.7s |
| winml quantize --precision int4 | | 109→23.7MB, 3.7s (RTN 4-bit) |
| winml quantize --precision w4a32 | | 109→23.7MB (same as int4) |
| winml quantize --precision w4a16 | | 109→18.1MB (RTN + FP16) |
| winml quantize --precision int8 | | 109→28.0MB, 46s (QDQ) |
| winml build --precision fp16 | | Full pipeline, 87s |
| winml build --precision int4 | | Full pipeline with RTN |
| winml build --precision w4a16 | | Full pipeline, RTN + FP16, 208s |
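
To compare the size reductions at a glance, the reported numbers can be turned into compression ratios. This is a small arithmetic sketch using only the sizes listed in the `winml quantize` rows above; the variable and function names are illustrative.

```python
# Sizes in MB, copied from the `winml quantize` rows above.
ORIGINAL_MB = 109.0
QUANTIZED_MB = {
    "fp16": 54.6,
    "int4": 23.7,
    "w4a32": 23.7,
    "w4a16": 18.1,
    "int8": 28.0,
}


def compression_ratio(original_mb: float, quantized_mb: float) -> float:
    """How many times smaller the quantized model is than the original."""
    return original_mb / quantized_mb


ratios = {p: compression_ratio(ORIGINAL_MB, mb) for p, mb in QUANTIZED_MB.items()}
for precision, ratio in ratios.items():
    print(f"{precision:>6}: {ratio:.2f}x smaller")
```

This gives roughly 2.0x for fp16, 4.6x for int4 / w4a32, 6.0x for w4a16, and 3.9x for int8.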

TODO (follow-up PRs)

  • Mixed precision: QDQ on top of FP16 (e.g., --precision int8 --fp16)
  • RTN tuning flags: expose --block-size, --symmetric, --accuracy-level
  • Dynamic quantization: wire algorithm="dynamic" path (no calibration, quantize at runtime)

@DingmaomaoBJTU DingmaomaoBJTU requested a review from a team as a code owner June 11, 2026 03:04
Comment thread tests/unit/optim/pipes/test_pipe_fp16.py Fixed
@DingmaomaoBJTU DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch from 8f5a1d2 to 9e7d8fd Compare June 11, 2026 04:15
@DingmaomaoBJTU changed the title from "feat: add --enable-fp16-conversion to winml optimize" to "feat: add --precision fp16 to optimize, build, and export commands" Jun 11, 2026
Comment thread tests/unit/optim/test_fp16.py Fixed
@DingmaomaoBJTU DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch from 9e7d8fd to 7d7a0ae Compare June 11, 2026 04:22
Comment thread src/winml/modelkit/optim/fp16.py Fixed
@DingmaomaoBJTU DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch from 7d7a0ae to 328b5ab Compare June 11, 2026 04:32

@timenick timenick left a comment

Collaborator


Three findings on PR #872.

🤖 Generated with GitHub Copilot CLI

Comment thread src/winml/modelkit/commands/build.py Outdated
Comment thread src/winml/modelkit/commands/build.py Outdated
Comment thread tests/unit/optim/test_fp16.py Outdated
@DingmaomaoBJTU DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch from 328b5ab to b859627 Compare June 11, 2026 05:26
Comment thread tests/unit/optim/test_fp16.py Fixed
@DingmaomaoBJTU DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch 2 times, most recently from 837330d to fede96c Compare June 11, 2026 07:43
@DingmaomaoBJTU changed the title from "feat: add --precision fp16 to optimize, build, and export commands" to "feat: FP16 precision support via quantize stage + extended build --precision" Jun 23, 2026
DingmaomaoBJTU and others added 7 commits June 23, 2026 15:16
Add FP16 precision conversion support across all model pipeline commands:

- Create optim/fp16.py with convert_to_fp16() utility (wraps ORT float16)
- optimize: --precision fp16 with --fp16-keep-io-types and --fp16-op-block-list
- build: --precision fp16 stage between optimize and quantize
- export: --precision fp16 as post-export conversion
- Add shared precision_option() CLI decorator in utils/cli.py

Design: FP16 is a precision transformation (not a graph optimization), so it
lives as a command-layer utility rather than an optimizer pipe. All three
commands share the same convert_to_fp16() function.

Fixes #867
- Add algorithm, fp16, fp16_only, fp16_keep_io_types, fp16_op_block_list,
  and RTN fields to WinMLQuantizationConfig
- quantize_onnx now supports pure-FP16 fast path (fp16_only=True skips QDQ)
  and FP16 post-processing after QDQ (fp16=True, fp16_only=False)
- resolve_quant_compile_config returns fp16_only quant config for precision=fp16
- Remove _run_fp16_stage and skip-quantize hack from build.py pipelines
- Build pipeline unified: Export -> Optimize -> Quantize Stage -> Compile
  where Quantize Stage handles both QDQ and FP16 conversion
- Update tests to reflect new behavior (fp16 produces quant config, not None)
- Remove --precision flag and FP16 conversion from export command
- Remove --precision, --fp16-keep-io-types, --fp16-op-block-list from
  optimize command and all FP16 conversion logic
- Add --precision fp16 support to quantize command (creates fp16_only
  config, uses quantize_onnx FP16 fast path)
- FP16 precision is now only available through:
  - winml quantize --precision fp16 (standalone)
  - winml build --precision fp16 (E2E pipeline)
  - winml perf/eval --precision fp16 (E2E commands)
Expand build's --precision from fp32/fp16 only to the full precision
range: auto, fp32, fp16, int8, int16, and w{x}a{y} format (e.g., w8a8,
w8a16). This unifies the build and quantize CLI experience.

Changes:
- Update precision_option() to accept free-form string instead of
  click.Choice restricted to fp32/fp16
- Pass precision to generate_build_config() for proper quant config
  resolution at config generation time
- Pass precision to resolve_quant_compile_config() in _patch_device
  for config-file builds with --precision override
- Propagate fp16/fp16_only fields when patching existing quant config
- Add early validation using _is_valid_precision() for clear error
  messages
- Add precision examples to build command help text
Replace 'import onnx' + 'from onnx import ...' dual-import pattern
with consistent 'from onnx import ...' style to satisfy CodeQL's
'Module is imported with import and import from' check.
- Remove duplicate old precision_option (main already has expanded version)
- Update test_precision_fp16_clears_quant to expect fp16_only quant
  config instead of quant=None (matches our FP16-in-quantize design)
- Remove duplicate --precision fp16 build example (main already has one)
@DingmaomaoBJTU DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch from 82c92cb to 75be8d3 Compare June 23, 2026 07:37
Comment thread src/winml/modelkit/commands/build.py Fixed
github-actions Bot added 6 commits June 23, 2026 15:43
When --precision fp16 is used, calibration-related flags (--samples,
--method, --weight-type, --activation-type) have no effect. Add
explicit warnings in both the CLI layer (quantize command) and the
API layer (quantize_onnx) so users are not silently surprised.
FP16-only quantization configs do not perform calibration, so they
do not need task or model_name fields. The validation now treats
fp16_only the same as ONNX builds and submodule builds.
Only static QDQ quantization requires calibration data (and thus
task/model_name). RTN (weight-only) and dynamic quantization do not
need calibration, so they should not require these fields.
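
The validation rule described in this commit can be sketched as below. The function name, the `CALIBRATED_MODES` set, and the mode strings are assumptions for illustration; the PR does not show the actual validator.

```python
from typing import Optional

# Assumption: only the static (calibrated QDQ) mode loads calibration data.
CALIBRATED_MODES = {"static"}


def validate_calibration_fields(
    mode: str, task: Optional[str], model_name: Optional[str]
) -> None:
    """Raise ValueError when a calibrated mode lacks task or model_name."""
    if mode not in CALIBRATED_MODES:
        return  # RTN, FP16, and dynamic modes need no calibration data
    missing = [
        name
        for name, value in (("task", task), ("model_name", model_name))
        if not value
    ]
    if missing:
        raise ValueError(f"mode={mode!r} requires: {', '.join(missing)}")
```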
- Add int4 to named precisions, support w4a{8,16} as weight-only RTN
- Add is_weight_only_precision() and extract_weight_bits() helpers
- resolve_quant_compile_config creates RTN config for weight-only
- quantize command: add RTN fast path between FP16 and QDQ paths
- quantize_onnx: implement RTN path using ORT MatMulNBitsQuantizer
- Update tests for new valid precision values (int4, w4a16)
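
For readers unfamiliar with RTN (round-to-nearest), the idea behind weight-only 4-bit quantization can be shown on one block of weights in plain Python. This illustrates the general technique only. It is not the ORT MatMulNBitsQuantizer implementation, and the asymmetric scheme and function names are assumptions.

```python
from typing import List, Tuple


def rtn_quantize_block(values: List[float], bits: int = 4) -> Tuple[List[int], float, int]:
    """Asymmetric round-to-nearest quantization of one block of weights."""
    qmax = (1 << bits) - 1  # 15 for 4-bit codes
    # Widen the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)
    # Python's round() resolves exact ties to the nearest even integer.
    codes = [min(qmax, max(0, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def rtn_dequantize_block(codes: List[int], scale: float, zero_point: int) -> List[float]:
    """Map integer codes back to approximate float weights."""
    return [(q - zero_point) * scale for q in codes]


weights = [-1.0, -0.5, 0.0, 0.25, 0.5, 1.0]
codes, scale, zero_point = rtn_quantize_block(weights)
restored = rtn_dequantize_block(codes, scale, zero_point)
```

No calibration data is involved: the scale and zero point depend only on the weights themselves, which is why the RTN path can skip calibration.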
…tion

- _patch_device now propagates algorithm/rtn_bits to existing quant config
- _run_quantize_stage: add RTN path with proper StageLive output
- quantizer: extract .model (ModelProto) from ONNXModel wrapper
@DingmaomaoBJTU changed the title from "feat: FP16 precision support via quantize stage + extended build --precision" to "feat: precision-driven quantization (FP16, RTN int4, static QDQ) via --precision flag" Jun 23, 2026
github-actions Bot added 5 commits June 23, 2026 18:12
- Add type annotation to fp16.py convert result (no-any-return)
- Add assert for precision not None in quantize.py (union-attr)
- Remove duplicate imports in build.py _run_quantize_stage
- Add RTN branch in generate_hf_build_config (int4/w4a16 was silently skipped)
- Pass use_external_data to save_onnx in FP16 and RTN paths (quantizer.py)
- Extract _warn_ignored_calibration_options helper to remove duplication
- QDQ FP16 post-processing: apply convert_to_fp16 in-memory instead of
  save-reload-save round-trip (matches RTN pattern)
- Pass use_external_data consistently to all save_onnx calls
- extract_weight_bits: validate bit-widths against supported sets
- Add test for unsupported bit-width combinations (w4a4, w3a8, etc.)
- Clarify dynamic algorithm as planned-not-yet-wired in config comment
…t-processing

- Add 32 to _VALID_ACTIVATION_BITS (a32 = activation stays FP32)
- w4a32 is treated as weight-only RTN, equivalent to int4
- w4a16 now correctly sets fp16=True for FP16 post-processing after RTN
- Add extract_activation_bits() helper to derive activation bit-width
- Validate a32 only valid with weight-only (4-bit) — w8a32 is rejected
- Add tests for w4a32, extract_activation_bits, and a32 validation
- Remove fp16_only field from WinMLQuantizationConfig
- Rename fp16 field to fp16_postprocess (post-quant FP16 conversion)
- Use algorithm='fp16' for pure FP16 mode (replaces fp16_only=True)
- Update all references in commands, config builders, and tests
- Backward compat: from_dict still reads old 'fp16' key
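
The w{x}a{y} validation rules from the commits above (a32 allowed only with 4-bit weights; w4a4, w3a8, and w8a32 rejected) can be sketched as one parser. `parse_wxay` and the weight-bit set are hypothetical; the PR names `extract_weight_bits()`, `extract_activation_bits()`, and `_VALID_ACTIVATION_BITS` but does not show their bodies.

```python
import re
from typing import Tuple

_WXAY = re.compile(r"^w(\d+)a(\d+)$")
_VALID_WEIGHT_BITS = {4, 8, 16}       # assumed set
_VALID_ACTIVATION_BITS = {8, 16, 32}  # 32 means "activation stays FP32"


def parse_wxay(precision: str) -> Tuple[int, int]:
    """Return (weight_bits, activation_bits) or raise ValueError."""
    match = _WXAY.match(precision.strip().lower())
    if match is None:
        raise ValueError(f"not a w{{x}}a{{y}} precision: {precision!r}")
    weight_bits, activation_bits = int(match.group(1)), int(match.group(2))
    if weight_bits not in _VALID_WEIGHT_BITS:
        raise ValueError(f"unsupported weight bit-width: {weight_bits}")
    if activation_bits not in _VALID_ACTIVATION_BITS:
        raise ValueError(f"unsupported activation bit-width: {activation_bits}")
    if activation_bits == 32 and weight_bits != 4:
        raise ValueError("a32 is only valid with weight-only (4-bit) precisions")
    return weight_bits, activation_bits
```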
Comment thread src/winml/modelkit/quant/config.py Outdated
github-actions Bot added 6 commits June 24, 2026 14:56
- Remove fp16_postprocess from WinMLQuantizationConfig
- Add expand_precision() to decompose w4a16 into [int4, fp16] passes
- Refactor _run_quantize_stage into multi-pass loop with helper functions
- Each quantize_onnx call now does exactly one operation (single responsibility)
- Update standalone quantize command for two-pass w4a16 flow
- Add precision field to WinMLBuildConfig for pass expansion
- Add expand_precision tests
- Add 'precision' parameter to quantize_onnx() that handles multi-pass
  expansion internally (e.g., w4a16 → [int4, fp16])
- Simplify _run_quantize_stage in build.py to a single quantize_onnx()
  call — no more _make_step_config or _run_single_quantize_pass helpers
- Simplify commands/quantize.py RTN path — remove manual expand_precision
  loop and intermediate file management
- Delete unused _should_run_quantization() dead code from quantizer.py
- All multi-pass orchestration (intermediate files, cleanup, pass config
  construction) now lives in the quant layer where it belongs
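
The multi-pass orchestration described in this commit (intermediate files, cleanup) might look roughly like the sketch below. `run_passes` and the `run_one` callback are hypothetical stand-ins; the real logic lives inside quantize_onnx and is not shown in this PR page.

```python
import os
import shutil
import tempfile
from typing import Callable, List


def run_passes(
    src: str,
    dst: str,
    passes: List[str],
    run_one: Callable[[str, str, str], None],
) -> None:
    """Chain single-pass steps through temporary files, then clean up.

    run_one(step, input_path, output_path) performs exactly one operation.
    """
    tmp_dir = tempfile.mkdtemp(prefix="quant_passes_")
    try:
        current = src
        for index, step in enumerate(passes):
            is_last = index == len(passes) - 1
            # Only the final pass writes to the caller's destination.
            out = dst if is_last else os.path.join(tmp_dir, f"pass{index}_{step}.onnx")
            run_one(step, current, out)
            current = out
    finally:
        # Intermediate files are removed even if a pass fails.
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

Keeping this loop in one place is what lets the command layer make a single call regardless of how many passes a precision expands to.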
Move calibration warning logic from commands/quantize.py into
utils/cli.py as warn_ignored_calibration_options() so any command
that needs the check can reuse it without duplicating the logic.
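
A shared warning helper of this kind could be written as below. The function name comes from this commit message, but the signature, the `_NO_CALIBRATION` set, and the message wording are assumptions; the flag names are the ones listed in the earlier calibration-warning commit.

```python
import warnings
from typing import Dict, List, Optional

# Assumption: precisions served by the FP16 and RTN paths skip calibration.
_NO_CALIBRATION = {"fp16", "int4", "w4a32", "w4a16"}


def warn_ignored_calibration_options(
    precision: str, provided: Dict[str, Optional[object]]
) -> List[str]:
    """Warn about calibration flags that were set but have no effect.

    `provided` maps flag names to the values the user passed (None if unset).
    Returns the names of the ignored flags.
    """
    if precision.strip().lower() not in _NO_CALIBRATION:
        return []
    ignored = [flag for flag, value in provided.items() if value is not None]
    for flag in ignored:
        warnings.warn(
            f"{flag} has no effect with --precision {precision}",
            UserWarning,
            stacklevel=2,
        )
    return ignored
```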
FP16 conversion is exclusively used by the quantizer's algorithm='fp16'
path. It's not an optimizer pipe — move it to quant/fp16.py where it
logically belongs. Remove optim/fp16.py entirely.
Comment thread src/winml/modelkit/quant/config.py Outdated
Address reviewer comment: mode and algorithm are redundant.
algorithm is the active routing field; mode is kept only for
serialization backward-compatibility and marked deprecated.
Comment thread src/winml/modelkit/commands/quantize.py Outdated
Comment thread src/winml/modelkit/commands/quantize.py Outdated
Comment thread src/winml/modelkit/quant/quantizer.py
Comment thread src/winml/modelkit/quant/quantizer.py Outdated
Comment thread src/winml/modelkit/quant/quantizer.py Outdated
Comment thread src/winml/modelkit/quant/quantizer.py
Comment thread src/winml/modelkit/commands/build.py Outdated
Comment thread src/winml/modelkit/commands/quantize.py Outdated
Remove redundant 'algorithm' field. Expand 'mode' to cover all
quantization modes: static, dynamic, rtn, fp16. The old 'qdq'
value is mapped to 'static' for backward compatibility.

from_dict() prefers the old 'algorithm' key over 'mode' when both
are present (old to_dict emitted both), preventing silent data loss
when deserializing configs with algorithm='rtn' or 'fp16'.
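
The backward-compatible key resolution described here can be sketched as a small helper. `resolve_mode`, the alias table, and the default value are hypothetical; only the key names, the mode values, and the 'qdq' → 'static' mapping come from the commit message.

```python
from typing import Any, Dict

_VALID_MODES = {"static", "dynamic", "rtn", "fp16"}
_LEGACY_ALIASES = {"qdq": "static"}


def resolve_mode(data: Dict[str, Any], default: str = "static") -> str:
    """Pick the effective mode from a serialized config dict.

    The legacy 'algorithm' key wins over 'mode' when both are present,
    because old serializers wrote both and 'algorithm' carried the real
    routing value (for example 'rtn' or 'fp16').
    """
    raw = data.get("algorithm") or data.get("mode") or default
    mode = _LEGACY_ALIASES.get(raw, raw)
    if mode not in _VALID_MODES:
        raise ValueError(f"unknown quantization mode: {raw!r}")
    return mode
```

With this ordering, an old config such as `{"mode": "qdq", "algorithm": "rtn"}` resolves to rtn rather than silently falling back to static.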
Comment thread src/winml/modelkit/quant/config.py
github-actions Bot added 2 commits June 24, 2026 17:21
…ize command paths

- Split _quantize_single_pass into 3 focused methods: _quantize_fp16,
  _quantize_rtn, _quantize_qdq with a dispatch dict
- Consolidate 3 separate FP16/RTN/QDQ paths in commands/quantize.py into
  a single if/elif/else that builds config then shares execution logic
- Remove duplicated try/except, console output, and output path logic
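
The dispatch-dict refactor in this last commit follows a common pattern, sketched here with toy handlers. The three method names come from the commit message; the class, the dictionary keys, and the handler bodies are placeholders rather than code from this PR.

```python
from typing import Callable, Dict


class Quantizer:
    """Toy stand-in; each handler only reports what it would do."""

    def __init__(self) -> None:
        # One entry per mode replaces a long if/elif chain.
        self._dispatch: Dict[str, Callable[[str, str], str]] = {
            "fp16": self._quantize_fp16,
            "rtn": self._quantize_rtn,
            "static": self._quantize_qdq,
        }

    def quantize_single_pass(self, mode: str, src: str, dst: str) -> str:
        """Route one pass to the handler registered for `mode`."""
        if mode not in self._dispatch:
            raise ValueError(f"unsupported mode: {mode!r}")
        return self._dispatch[mode](src, dst)

    def _quantize_fp16(self, src: str, dst: str) -> str:
        return f"fp16: {src} -> {dst}"

    def _quantize_rtn(self, src: str, dst: str) -> str:
        return f"rtn: {src} -> {dst}"

    def _quantize_qdq(self, src: str, dst: str) -> str:
        return f"qdq: {src} -> {dst}"
```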


Development

Successfully merging this pull request may close these issues.

feat: add --enable-fp16-conversion to winml optimize and --precision to winml build/export

5 participants