Skip to content

Add MGP-STR (alibaba-damo/mgp-str-base) image-to-text task support#952

Draft
ssss141414 wants to merge 1 commit into
mainfrom
shzhen/add-mgp-str-base
Draft

Add MGP-STR (alibaba-damo/mgp-str-base) image-to-text task support#952
ssss141414 wants to merge 1 commit into
mainfrom
shzhen/add-mgp-str-base

Conversation

@ssss141414

Copy link
Copy Markdown
Contributor

Summary

Adds Effort-L1-light registration so MGP-STR scene-text-recognition models resolve under the user-facing image-to-text task label. The vendor MgpstrOnnxConfig (Optimum) already exposes the 3-head outputs (char_logits, bpe_logits, wp_logits) correctly, but is registered ONLY under feature-extraction. This PR adds a task-label alias + MODEL_CLASS_MAPPING binding to MgpstrForSceneTextRecognition (the head-bearing class — MGP-STR is NOT a generic Vision2Seq).

Files changed (5)

  • src/winml/modelkit/models/hf/mgp_str.py (NEW, 58 lines) — MgpstrImage2TextOnnxConfig(MgpstrOnnxConfig) subclass
  • src/winml/modelkit/models/hf/__init__.py — 3-line wiring
  • examples/recipes/alibaba-damo_mgp-str-base/image-to-text_config.json (NEW, 49 lines) — recipe
  • examples/recipes/README.md — catalog row
  • research/adding-model-support/model_knowledge/mgp_str.jsonmgp_str-004 post-mortem finding

Goal-ladder verdict

alibaba-damo/mgp-str-base @ image-to-text @ fp32 @ cpu

Tier Verdict Evidence
L0 build PASS 83.7s, 374 nodes, 564.5 MB optimized; autoconf converged in 2 iters
L1 perf PASS avg=100.76ms, P90=123.26ms, 9.92 samples/sec (20 iters CPU)
L2 numerical PASS cosine vs PT: char=0.99999999999992, bpe=0.99999999999974, wp=0.99999999999860; max-abs 5.7e-05 / 2.4e-04 / 2.1e-04
L3 eval CLI-BLOCKED image-to-text task has no default dataset (same as vit-gpt2)

Step 1b verification — real engineering vs catalog-only

  • Gate 1 (auto-config-diff): identical to winml config --task image-to-text (recipe is autoconf-faithful)
  • Gate 2 (baseline build on main): FAILS with mgp-str doesn't support task image-to-text for the onnx backend. → real engineering delta, NOT catalog-only.

Known gotchas

  • HF model card declares legacy architectures: ['MGPSTRModel'] but current transformers exports MgpstrModel (CamelCase rename). Without --task image-to-text explicit, winml inspect/config/build fail with Cannot import MGPSTRModel from transformers. CLI robustness gap separate from this PR.
  • 3 Einsum ops in a3_module heads are non-fatal on CPU.

Verification

uv run winml build -c examples/recipes/alibaba-damo_mgp-str-base/image-to-text_config.json -m alibaba-damo/mgp-str-base -o temp/mgp_build --ep cpu --device cpu --rebuild
uv run winml perf -m temp/mgp_build/model.onnx --ep cpu --device cpu --iterations 20

Adds Effort-L1-light registration so MGP-STR scene-text-recognition models
resolve under the user-facing 'image-to-text' task label. The vendor
MgpstrOnnxConfig (Optimum) already exposes the 3-head outputs (char_logits,
bpe_logits, wp_logits) correctly but is registered only under feature-extraction.
This PR adds a task-label alias plus MODEL_CLASS_MAPPING binding to
MgpstrForSceneTextRecognition.

Files:
- src/winml/modelkit/models/hf/mgp_str.py: MgpstrImage2TextOnnxConfig subclass (58 lines)
- src/winml/modelkit/models/hf/__init__.py: 3-line wiring
- examples/recipes/alibaba-damo_mgp-str-base/image-to-text_config.json: recipe (49 lines)
- examples/recipes/README.md: catalog row
- research/adding-model-support/model_knowledge/mgp_str.json: mgp_str-004 finding

Goal-ladder (alibaba-damo/mgp-str-base @ image-to-text @ fp32 @ cpu):
- L0 PASS: build 83.7s, 374 nodes, 564.5 MB optimized
- L1 PASS: avg=100.76ms, P90=123.26ms, 9.92 samples/sec (20 iters)
- L2 PASS: cosine vs PyTorch reference all 3 heads >=0.999999 (max-abs <3e-4)
- L3 CLI-BLOCKED: image-to-text task has no default dataset (same as
  nlpconnect/vit-gpt2-image-captioning per known limitation)

Step 1b verification: baseline 'winml build' on main fails with
'mgp-str doesn't support task image-to-text' (real engineering delta, not
catalog-only).
@ssss141414

Copy link
Copy Markdown
Contributor Author

Reviewer verification: OV cpu / gpu / npu — branch \shzhen/add-mgp-str-base\

Commands

\\powershell

config

uv run winml config -m alibaba-damo/mgp-str-base --task image-to-text -o temp/verify_pr952_mgpstr_config.json

build (OV CPU, fp32, using recipe)

uv run winml build -c examples/recipes/alibaba-damo_mgp-str-base/image-to-text_config.json -m alibaba-damo/mgp-str-base -o temp/verify_pr952_mgpstr_build --ep openvino --device cpu --precision fp32 --no-quant --no-compile --rebuild

perf — cpu / gpu / npu (from built ONNX, 5 iters + 2 warmup)

uv run winml perf -m temp/verify_pr952_mgpstr_build/model.onnx --ep openvino --device cpu --iterations 5 --warmup 2 --skip-build -f json
uv run winml perf -m temp/verify_pr952_mgpstr_build/model.onnx --ep openvino --device gpu --iterations 5 --warmup 2 --skip-build -f json
uv run winml perf -m temp/verify_pr952_mgpstr_build/model.onnx --ep openvino --device npu --iterations 5 --warmup 2 --skip-build -f json

eval

uv run winml eval -m alibaba-damo/mgp-str-base --task image-to-text --device cpu --ep openvino --samples 1
\\

Results

Command cpu gpu npu
config ✅ PASS
build ✅ PASS (79s, 564.5 MB, autoconf converged in 2 iters)
perf mean ✅ 305 ms/iter ✅ 9.1 ms/iter ✅ 22 ms/iter
perf throughput 3.27 samples/s 109.38 samples/s 45.48 samples/s
eval ❌ CLI-BLOCKED ❌ CLI-BLOCKED ❌ CLI-BLOCKED

Notes:

  • \config\ / \�uild\ / \perf\ pass on all three OV devices. OV sessions created successfully for cpu, gpu, and npu.
  • Build emits 3 \Einsum\ op warnings (\OpUnsupportedError: Node Einsum is not supported\ for \char_a3_module, \�pe_a3_module, \wp_a3_module) — consistent with the 'non-fatal on CPU' note in the PR. OV EP handles these via fallback.
  • \�val\ returns \No dataset provided and no default for task 'image-to-text'. Use --dataset.\ — same CLI-BLOCKED verdict as described in the PR (same as vit-gpt2). Not an OV EP limitation.
  • ONNX artifact: 374 nodes (post-optimize), opset 17, fp32, input: \pixel_values[1,3,32,128], outputs: \char_logits[1,27,38], \�pe_logits[1,27,50257], \wp_logits[1,27,30522].

@ssss141414

Copy link
Copy Markdown
Contributor Author

Validation results (2026-06-25) for PR #952 on this Windows ARM64 host.

Scope

  • Compare main vs PR branch behavior
  • Verify winml config on QNN NPU/GPU

Main branch baseline (before PR)

  • Command: uv run winml config -m alibaba-damo/mgp-str-base --task image-to-text --ep cpu --device cpu
  • Result: FAIL
  • Error: mgp-str doesn't support task image-to-text for the onnx backend. Supported tasks are: feature-extraction.

PR #952 branch

  • CPU config: PASS
    • uv run winml config -m alibaba-damo/mgp-str-base --task image-to-text --ep cpu --device cpu
    • Resolved to Device=CPU, EP=CPUExecutionProvider
  • QNN NPU config: PASS
    • uv run winml config -m alibaba-damo/mgp-str-base --task image-to-text --ep qnn --device npu
    • Resolved to Device=NPU, EP=QNNExecutionProvider
  • QNN GPU config: PASS
    • uv run winml config -m alibaba-damo/mgp-str-base --task image-to-text --ep qnn --device gpu
    • Resolved to Device=GPU, EP=QNNExecutionProvider

Conclusion

  • Confirmed: this PR adds real image-to-text task support for mgp-str (main fails, PR passes), including QNN NPU/GPU configuration resolution.

@ssss141414

Copy link
Copy Markdown
Contributor Author

ADDENDUM: main branch baseline (NO support)

On current \main\ @ HEAD:
\\powershell
uv run winml config -m alibaba-damo/mgp-str-base --task image-to-text
\
Returns:
\
Error: mgp-str doesn't support task image-to-text for the onnx backend. Supported tasks are: feature-extraction.
\\

Conclusion: This PR adds \image-to-text\ task support (via \MgpstrImage2TextOnnxConfig\ alias + \MODEL_CLASS_MAPPING\ binding). Without this PR, mgp-str only works under \ eature-extraction. The engineering delta is real (not catalog-only). All OV devices now pass config/build/perf validation.

@xieofxie

Copy link
Copy Markdown
Contributor

the exported model are same as the current supported task?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants