feat(keypoint-detection): add COCO OKS-AP evaluation by jeon185 · Pull Request #949 · microsoft/winml-cli

jeon185 · 2026-06-23T18:37:28Z

Adds the eval stage for keypoint-detection (ViTPose), so the COCO-keypoint models from #284 now go through config -> build -> perf -> eval. Stacked on #905 (the config/build/perf enablement) - that one should go in first.

What's here:

- metrics/keypoint.py - KeypointAPMetric. Computes the COCO keypoint score (OKS-based AP over 0.50:0.95) with pycocotools COCOeval, the same way object-detection already reuses the COCO mAP protocol.
- keypoint_detection_evaluator.py - top-down evaluator. transformers has no keypoint-detection pipeline, so it runs the image processor and ONNX model directly: for each ground-truth person box it does preprocess -> model -> post_process_pose_estimation and scores against the GT keypoints. ViTPose is exported with a static batch of 1, so each person crop runs separately and the heatmaps are stacked back together for post-processing. It uses the GT person boxes (standard COCO top-down protocol - keeps the score about pose accuracy, not detection).
- scripts/build_coco_keypoints.py - builds a local COCO val keypoints dataset. COCO has no script-free HF mirror for person keypoints, so this downloads the annotations once and fetches images individually, which means a small subset doesn't need the full image zip.
- Schema, evaluator registry, default dataset, and unit tests for the metric and evaluator.
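
The per-crop loop described for the evaluator can be sketched roughly as follows. This is only an illustration under stated assumptions, not the PR's code: `run_top_down` and `fake_model` are hypothetical names, the stub stands in for the ONNX session, and the 256x192 input with a 1/4-resolution heatmap is an assumed shape.

```python
import numpy as np


def run_top_down(crops, run_model):
    """Run a batch-1 model once per person crop and stack the heatmaps.

    `run_model` is a hypothetical callable that accepts an array of shape
    (1, C, H, W) and returns heatmaps of shape (1, K, h, w).
    """
    heatmaps = []
    for crop in crops:
        batch = crop[np.newaxis, ...]  # add the static batch dimension of 1
        heatmaps.append(run_model(batch))
    # Stack back to (num_people, K, h, w) so post-processing sees one batch.
    return np.concatenate(heatmaps, axis=0)


def fake_model(pixel_values):
    """Stand-in for the exported model; rejects any batch size other than 1."""
    if pixel_values.shape[0] != 1:
        raise ValueError("model was exported with a static batch of 1")
    _, _, height, width = pixel_values.shape
    return np.zeros((1, 17, height // 4, width // 4), dtype=np.float32)


# Three person crops from one image (assumed preprocessed to 3x256x192).
crops = [np.zeros((3, 256, 192), dtype=np.float32) for _ in range(3)]
stacked = run_top_down(crops, fake_model)
```

Stacking before post-processing means the downstream step handles all people of an image in a single call, even though inference itself ran one crop at a time.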

Verified on the five COCO 17-keypoint models (vitpose-base-simple and vitpose-plus-{small,base,large,huge}): config -> build -> perf -> eval all pass and return COCO AP/AR. AP rises with model size as you'd expect. Absolute numbers are on the low side right now because the build quantizes with random calibration data, but relative comparison holds.

synthpose-vitpose-huge-hf - not covered yet

This is the one model from #284 that this PR does not evaluate. It predicts 52 anatomical keypoints instead of COCO's 17, so it can't be scored against COCO ground truth - the keypoint sets don't line up and OKS is only defined when they do.
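
To make the "OKS is only defined when they do" point concrete, here is a minimal sketch of per-instance OKS, following the formula pycocotools uses for keypoints. The function name `oks` is hypothetical, and the case with no labeled keypoints is simplified to return 0.0 (pycocotools falls back to a box-based distance there). Each keypoint needs its own sigma, which is why a 52-keypoint prediction cannot be scored with the 17 COCO sigmas.

```python
import numpy as np

# Per-keypoint sigmas for the 17 COCO person keypoints (pycocotools defaults).
COCO_SIGMAS = np.array(
    [.26, .25, .25, .35, .35, .79, .79, .72, .72,
     .62, .62, 1.07, 1.07, .87, .87, .89, .89]
) / 10.0


def oks(pred_xy, gt_xy, gt_vis, area, sigmas=COCO_SIGMAS):
    """Object Keypoint Similarity between one prediction and one GT person.

    pred_xy, gt_xy: (K, 2) arrays of pixel coordinates.
    gt_vis: (K,) visibility flags; only keypoints with flag > 0 are scored.
    area: ground-truth object area in pixels, used as the scale term.
    """
    d2 = np.sum((pred_xy - gt_xy) ** 2, axis=1)      # squared distances
    k2 = (2.0 * sigmas) ** 2                          # per-keypoint falloff
    e = d2 / (2.0 * k2 * (area + np.spacing(1)))
    labeled = gt_vis > 0
    if not labeled.any():
        return 0.0  # simplification; see lead-in
    return float(np.mean(np.exp(-e[labeled])))


rng = np.random.default_rng(0)
gt = rng.uniform(0, 100, size=(17, 2))
vis = np.full(17, 2)
perfect = oks(gt, gt, vis, area=10_000.0)
shifted = oks(gt + np.array([5.0, 0.0]), gt, vis, area=10_000.0)
```

A perfect prediction scores 1.0 and the score decays with distance, faster for keypoints with small sigmas (eyes, nose) than for large ones (hips).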

How it's handled for now: the metric checks the keypoint count up front and raises a clear, actionable error instead of failing with a numpy broadcast error deep inside pycocotools.
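
A guard of this kind could look roughly like the sketch below. The function name, argument names, and message wording are hypothetical and only illustrate the idea of failing early with an actionable message; the PR's actual check may differ.

```python
import numpy as np


def check_keypoint_count(pred_keypoints, keypoint_names):
    """Fail early if the model's keypoint count does not match the dataset.

    pred_keypoints: array whose second-to-last axis is the keypoint axis,
    e.g. (num_people, K, 2).
    keypoint_names: the dataset's keypoint names, one per ground-truth point.
    """
    expected = len(keypoint_names)
    got = pred_keypoints.shape[-2]
    if got != expected:
        raise ValueError(
            f"Model predicts {got} keypoints but the dataset defines "
            f"{expected}. OKS needs matching keypoint sets; provide a dataset "
            f"with {got}-keypoint ground truth and matching sigmas."
        )


names_17 = [f"kp_{i}" for i in range(17)]  # placeholder names for the demo
check_keypoint_count(np.zeros((4, 17, 2)), names_17)  # passes silently

try:
    check_keypoint_count(np.zeros((4, 52, 2)), names_17)
    message = ""
except ValueError as exc:
    message = str(exc)
```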

Idea for finishing it: KeypointAPMetric already takes sigmas and keypoint_names as arguments, so the main missing piece is a dataset with SynthPose's 52-keypoint ground truth plus the matching OKS sigmas. I'd rather agree on the dataset and sigmas in review before adding that - happy to land it in this PR or as a follow-up, whichever you prefer.

Refs #284.

Adds the eval stage for keypoint-detection (ViTPose), completing config -> build -> perf -> eval for the COCO-keypoint models in #284.

- metrics/keypoint.py: KeypointAPMetric computes the COCO keypoint score (OKS-based AP over 0.50:0.95) via pycocotools COCOeval, the same way object-detection reuses the COCO mAP protocol.
- keypoint_detection_evaluator.py: top-down evaluator. transformers has no keypoint-detection pipeline, so it drives the image processor and ONNX model directly - per ground-truth person box it runs preprocess -> model -> post_process_pose_estimation and scores against GT keypoints. ViTPose exports a static batch of 1, so each person crop runs separately and the heatmaps are stacked for post-processing. Uses GT person boxes (standard COCO top-down, isolates pose accuracy from detection).
- scripts/build_coco_keypoints.py: builds a local COCO val keypoints dataset; downloads annotations once and fetches images individually so a subset does not need the full image zip.
- Schema, evaluator registry, default dataset, unit tests.

Verified on the five COCO 17-keypoint models (vitpose-base-simple, vitpose-plus-{small,base,large,huge}): config -> build -> perf -> eval all pass and return COCO AP/AR.

synthpose-vitpose-huge-hf is not covered yet. It predicts 52 anatomical keypoints rather than COCO's 17, so it can't be scored against COCO ground truth - the keypoint sets don't line up, and OKS is only defined when they do. Right now the metric detects this mismatch and raises a clear error instead of failing deep inside pycocotools. KeypointAPMetric already takes sigmas and keypoint_names as arguments, so supporting SynthPose mainly needs a dataset with its 52-keypoint ground truth plus the matching OKS sigmas; I'd rather confirm the dataset/sigmas choice in review before adding that. Open to suggestions on whether to land it here or as a follow-up.

Refs #284.

zhenchaoni · 2026-06-24T08:12:48Z

+logger = logging.getLogger(__name__)
+
+
+class WinMLKeypointDetectionEvaluator(WinMLEvaluator):


Is synthpose-vitpose-huge-hf unsupported mainly because we haven't found a good dataset yet? I ask because we need WinMLKeypointDetectionEvaluator to be extensible to various keypoint-detection models. Otherwise, we would have to implement a separate evaluator for each of them.

zhenchaoni · 2026-06-24T08:15:50Z

+DEFAULT_CACHE = Path.home() / ".cache" / "winml" / "coco_build"
+
+
+def _download(url: str, dest: Path) -> None:


Is this dataset very large? If so, consider streaming the first 5000 samples, shuffling them, and taking 1000. We did this for other large datasets in build_ai4privacy.py.

A full download is fine if the dataset is not huge, say below 5GB.
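
The stream-then-subsample idea from this comment can be sketched with the standard library alone. This is an illustration, not the code in build_ai4privacy.py: `sample_from_stream` is a hypothetical helper, and the fixed seed is an assumption made so the subset is reproducible between runs.

```python
import random
from itertools import islice


def sample_from_stream(stream, pool_size=5000, sample_size=1000, seed=0):
    """Read only the first `pool_size` items, shuffle them, keep `sample_size`.

    `stream` can be any iterator (for example a streaming dataset), so the
    rest of the data is never downloaded or held in memory.
    """
    pool = list(islice(stream, pool_size))
    random.Random(seed).shuffle(pool)  # seeded for a reproducible subset
    return pool[:sample_size]


# Demo with a large lazy iterator standing in for a streamed dataset.
subset = sample_from_stream(iter(range(1_000_000)))
```

Only the first 5000 items are ever consumed from the iterator, which is what keeps the cost bounded for very large datasets.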

zhenchaoni · 2026-06-24T08:17:04Z

+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+"""Build a local COCO keypoints dataset for ``winml eval`` keypoint-detection.


For the models you verified and that can be evaluated, could you update C:\Users\zhenni\repos\wmk\scripts\e2e_eval\testsets\models_with_acc.json and post the evaluation results from your local runs?

zhenchaoni · 2026-06-24T08:23:39Z

+        # Built locally by scripts/build_coco_keypoints.py (COCO has no
+        # script-free HF mirror for person keypoints). Run that script first,
+        # or pass --dataset-path to point at your own build.
+        "path": "~/.cache/winml/datasets/coco_keypoints_val2017",


If no existing HF dataset can be used directly, it is OK not to document a default dataset.

zhenchaoni · 2026-06-24T08:30:41Z

+
+        stats = coco_eval.stats
+        return {
+            "AP": float(stats[0]),


Do you think we can report metrics consistent with mean_average_precision.py? It implements the COCO-based object-detection metrics. The main difference is that it uses predicted boxes. But I think we can also report mAP for keypoints.
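
One way to approach this, sketched under assumptions: for `iouType="keypoints"`, pycocotools' `COCOeval.stats` holds 10 values (AP, AP50, AP75, AP medium, AP large, then the same five for AR), so they can be named explicitly and aliased to mAP-style keys. The key names `map`, `map_50`, and `map_75` below are hypothetical placeholders; the keys actually used by mean_average_precision.py are not shown in this thread and would need to be matched.

```python
# Order of COCOeval.stats when iouType == "keypoints" (10 entries).
KEYPOINT_STAT_NAMES = (
    "AP", "AP50", "AP75", "AP_medium", "AP_large",
    "AR", "AR50", "AR75", "AR_medium", "AR_large",
)

# Hypothetical aliases toward an object-detection style naming scheme.
MAP_ALIASES = {"map": "AP", "map_50": "AP50", "map_75": "AP75"}


def name_keypoint_stats(stats):
    """Turn the positional COCOeval keypoint stats into a named dict."""
    if len(stats) != len(KEYPOINT_STAT_NAMES):
        raise ValueError(
            f"expected {len(KEYPOINT_STAT_NAMES)} keypoint stats, got {len(stats)}"
        )
    named = {name: float(value) for name, value in zip(KEYPOINT_STAT_NAMES, stats)}
    for alias, source in MAP_ALIASES.items():
        named[alias] = named[source]
    return named


# Placeholder values only, not real evaluation results.
example = name_keypoint_stats([0.1 * i for i in range(10)])
```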

jeon185 requested a review from a team as a code owner June 23, 2026 18:37

zhenchaoni reviewed Jun 24, 2026

jeon185 wants to merge 1 commit into feat/keypoint-detection-enablement from feat/keypoint-detection-eval
