feat(keypoint-detection): add COCO OKS-AP evaluation #949
Adds the eval stage for keypoint-detection (ViTPose), completing config -> build -> perf -> eval for the COCO-keypoint models in #284.

- metrics/keypoint.py: KeypointAPMetric computes the COCO keypoint score (OKS-based AP over 0.50:0.95) via pycocotools COCOeval, the same way object-detection reuses the COCO mAP protocol.
- keypoint_detection_evaluator.py: top-down evaluator. transformers has no keypoint-detection pipeline, so it drives the image processor and ONNX model directly - per ground-truth person box it runs preprocess -> model -> post_process_pose_estimation and scores against GT keypoints. ViTPose exports a static batch of 1, so each person crop runs separately and the heatmaps are stacked for post-processing. Uses GT person boxes (standard COCO top-down, isolates pose accuracy from detection).
- scripts/build_coco_keypoints.py: builds a local COCO val keypoints dataset; downloads annotations once and fetches images individually so a subset does not need the full image zip.
- Schema, evaluator registry, default dataset, unit tests.

Verified on the five COCO 17-keypoint models (vitpose-base-simple, vitpose-plus-{small,base,large,huge}): config -> build -> perf -> eval all pass and return COCO AP/AR.

synthpose-vitpose-huge-hf is not covered yet. It predicts 52 anatomical keypoints rather than COCO's 17, so it can't be scored against COCO ground truth - the keypoint sets don't line up, and OKS is only defined when they do. Right now the metric detects this mismatch and raises a clear error instead of failing deep inside pycocotools. KeypointAPMetric already takes sigmas and keypoint_names as arguments, so supporting SynthPose mainly needs a dataset with its 52-keypoint ground truth plus the matching OKS sigmas; I'd rather confirm the dataset/sigmas choice in review before adding that. Open to suggestions on whether to land it here or as a follow-up.

Refs #284.
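For reference, the OKS score that the AP above is built on can be sketched in a few lines. This is an illustration of the COCO formula, not the code in metrics/keypoint.py: the function name `oks` and its argument layout are made up for the example, and the fallback for a person with no labeled keypoints is simplified to 0.0 (pycocotools handles that case differently). The sigmas are the standard COCO 17-keypoint values.

```python
import math
import sys
from typing import Sequence, Tuple

# Standard COCO per-keypoint sigmas (nose, eyes, ears, shoulders, elbows,
# wrists, hips, knees, ankles).
COCO_SIGMAS = [0.026, 0.025, 0.025, 0.035, 0.035, 0.079, 0.079, 0.072, 0.072,
               0.062, 0.062, 0.107, 0.107, 0.087, 0.087, 0.089, 0.089]


def oks(
    gt: Sequence[Tuple[float, float, int]],   # (x, y, visibility) per keypoint
    pred: Sequence[Tuple[float, float]],      # (x, y) per keypoint
    area: float,                              # GT object area in pixels
    sigmas: Sequence[float] = COCO_SIGMAS,
) -> float:
    """Object Keypoint Similarity between one GT person and one prediction."""
    if not (len(gt) == len(pred) == len(sigmas)):
        # OKS needs exactly one sigma and one GT point per predicted keypoint.
        raise ValueError("gt, pred and sigmas must have the same length")

    total, labeled = 0.0, 0
    for (gx, gy, vis), (px, py), sigma in zip(gt, pred, sigmas):
        if vis <= 0:
            continue  # unlabeled keypoints do not contribute
        dist_sq = (gx - px) ** 2 + (gy - py) ** 2
        variance = (2.0 * sigma) ** 2
        total += math.exp(-dist_sq / (2.0 * variance * (area + sys.float_info.epsilon)))
        labeled += 1

    # Simplification for this sketch: no labeled keypoints scores 0.0.
    return total / labeled if labeled else 0.0
```

A perfect prediction scores 1.0 and the score decays toward 0 as keypoints drift, with the tolerance scaled by object area and the per-keypoint sigma. This is also why a 52-keypoint model cannot be scored against 17-keypoint ground truth: there is no sigma or GT point to pair with the extra outputs.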
logger = logging.getLogger(__name__)

class WinMLKeypointDetectionEvaluator(WinMLEvaluator):
Is synthpose-vitpose-huge-hf unsupported mainly because we haven't found a good dataset yet? I ask because WinMLKeypointDetectionEvaluator needs to be extensible to other keypoint-detection models. Otherwise, we would have to implement a separate evaluator for each of them.
DEFAULT_CACHE = Path.home() / ".cache" / "winml" / "coco_build"

def _download(url: str, dest: Path) -> None:
Is this dataset very large? If so, consider streaming 5000 samples, shuffling them, and taking 1000. We did this for other large datasets in build_ai4privacy.py.
The current approach is fine if the dataset is not huge, for example below 5 GB.
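A rough sketch of the stream, shuffle, and take idea, using only the standard library. This is not the code from build_ai4privacy.py; `sample_subset` and its parameters are hypothetical names for illustration, and the input is any lazy iterable of samples.

```python
import random
from itertools import islice
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def sample_subset(
    stream: Iterable[T],
    pool_size: int = 5000,
    take: int = 1000,
    seed: int = 0,
) -> List[T]:
    """Read only the first `pool_size` items of a stream, shuffle, keep `take`."""
    # islice stops pulling from the stream after pool_size items, so the rest
    # of a large dataset is never downloaded or materialized.
    pool = list(islice(stream, pool_size))
    # A seeded RNG keeps the subset reproducible across builds.
    random.Random(seed).shuffle(pool)
    return pool[:take]
```

With a Hugging Face streaming dataset the same effect is usually obtained with the iterable dataset's own `shuffle` and `take` methods; whether that applies here depends on how the COCO images end up being fetched.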
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
# --------------------------------------------------------------------------
"""Build a local COCO keypoints dataset for ``winml eval`` keypoint-detection.
For the models you verified and can evaluate, can you update C:\Users\zhenni\repos\wmk\scripts\e2e_eval\testsets\models_with_acc.json
and post the evaluation results from your local run?
# Built locally by scripts/build_coco_keypoints.py (COCO has no
# script-free HF mirror for person keypoints). Run that script first,
# or pass --dataset-path to point at your own build.
"path": "~/.cache/winml/datasets/coco_keypoints_val2017",
If no existing HF dataset can be used directly, it is OK not to document a default dataset.
stats = coco_eval.stats
return {
    "AP": float(stats[0]),
Do you think we can report metrics consistent with mean_average_precision.py? It implements the COCO-based object-detection metrics. The main difference is that it uses predicted boxes. But I think we can also report mAP for keypoints.
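One possible shape for that, as a sketch only. For iouType="keypoints", pycocotools fills `coco_eval.stats` with ten values in the order below (there is no small-object bucket, unlike the twelve box-detection values). The key names and the helper `summarize_keypoint_stats` are placeholders; they would need to be aligned with whatever mean_average_precision.py actually returns.

```python
from typing import Dict, Sequence

# Order of COCOeval.stats for iouType="keypoints".
# The key names are placeholders, not an existing convention in this repo.
KEYPOINT_STAT_KEYS = (
    "AP",         # AP averaged over OKS 0.50:0.95
    "AP50",       # AP at OKS 0.50
    "AP75",       # AP at OKS 0.75
    "AP_medium",  # AP for medium objects
    "AP_large",   # AP for large objects
    "AR",         # AR averaged over OKS 0.50:0.95
    "AR50",       # AR at OKS 0.50
    "AR75",       # AR at OKS 0.75
    "AR_medium",  # AR for medium objects
    "AR_large",   # AR for large objects
)


def summarize_keypoint_stats(stats: Sequence[float]) -> Dict[str, float]:
    """Map the positional COCOeval keypoint stats onto named metrics."""
    if len(stats) != len(KEYPOINT_STAT_KEYS):
        raise ValueError(
            f"expected {len(KEYPOINT_STAT_KEYS)} keypoint stats, got {len(stats)}"
        )
    return {key: float(value) for key, value in zip(KEYPOINT_STAT_KEYS, stats)}
```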
Adds the eval stage for keypoint-detection (ViTPose), so the COCO-keypoint models from #284 now go through config -> build -> perf -> eval. Stacked on #905 (the config/build/perf enablement) - that one should go in first.

What's here:
- metrics/keypoint.py - KeypointAPMetric. Computes the COCO keypoint score (OKS-based AP over 0.50:0.95) with pycocotools COCOeval, the same way object-detection already reuses the COCO mAP protocol.
- keypoint_detection_evaluator.py - top-down evaluator. transformers has no keypoint-detection pipeline, so it runs the image processor and ONNX model directly: for each ground-truth person box it does preprocess -> model -> post_process_pose_estimation and scores against the GT keypoints. ViTPose is exported with a static batch of 1, so each person crop runs separately and the heatmaps are stacked back together for post-processing. It uses the GT person boxes (standard COCO top-down protocol - keeps the score about pose accuracy, not detection).
- scripts/build_coco_keypoints.py - builds a local COCO val keypoints dataset. COCO has no script-free HF mirror for person keypoints, so this downloads the annotations once and fetches images individually, which means a small subset doesn't need the full image zip.

Verified on the five COCO 17-keypoint models (vitpose-base-simple and vitpose-plus-{small,base,large,huge}): config -> build -> perf -> eval all pass and return COCO AP/AR. AP rises with model size as you'd expect. Absolute numbers are on the low side right now because the build quantizes with random calibration data, but relative comparison holds.
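The per-crop loop for the static batch-1 export looks roughly like the sketch below. It is a simplified illustration, not the evaluator code: `run_per_person` and the `run_model` callable are hypothetical names, and the shapes in the comments (channels-first crops, one heatmap per keypoint) are assumptions.

```python
from typing import Callable, Sequence

import numpy as np


def run_per_person(
    crops: Sequence[np.ndarray],
    run_model: Callable[[np.ndarray], np.ndarray],
) -> np.ndarray:
    """Run a batch-1 model once per person crop and stack the heatmaps.

    crops:     preprocessed person crops, each assumed to be (C, H, W).
    run_model: takes a (1, C, H, W) batch, returns (1, K, h, w) heatmaps.
    """
    if not crops:
        # The caller is expected to skip images without person boxes.
        raise ValueError("no person crops to run")

    heatmaps = []
    for crop in crops:
        batch = crop[np.newaxis, ...]  # add the fixed batch dimension of 1
        out = run_model(batch)
        if out.shape[0] != 1:
            raise ValueError(f"expected batch of 1, got {out.shape[0]}")
        heatmaps.append(out)

    # Stack along the batch axis so post-processing sees all persons at once.
    return np.concatenate(heatmaps, axis=0)  # (num_persons, K, h, w)
```

In the real evaluator `run_model` would wrap the ONNX session call, and the stacked array is what gets handed to post_process_pose_estimation together with the person boxes.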
synthpose-vitpose-huge-hf - not covered yet
This is the one model from #284 that this PR does not evaluate. It predicts 52 anatomical keypoints instead of COCO's 17, so it can't be scored against COCO ground truth - the keypoint sets don't line up and OKS is only defined when they do.
How it's handled for now: the metric checks the keypoint count up front and raises a clear, actionable error instead of failing with a numpy broadcast error deep inside pycocotools.
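A minimal sketch of that kind of up-front check, to show the intent. It is not the code in metrics/keypoint.py; `check_keypoint_count` and the error wording are illustrative only.

```python
from typing import Sequence


def check_keypoint_count(
    num_predicted: int,
    keypoint_names: Sequence[str],
    sigmas: Sequence[float],
) -> None:
    """Fail early, with a readable message, if the keypoint layouts differ."""
    if len(keypoint_names) != len(sigmas):
        raise ValueError(
            f"keypoint_names has {len(keypoint_names)} entries but sigmas has "
            f"{len(sigmas)}; OKS needs exactly one sigma per keypoint."
        )
    if num_predicted != len(keypoint_names):
        raise ValueError(
            f"The model predicts {num_predicted} keypoints but the ground truth "
            f"defines {len(keypoint_names)}. OKS is only defined when the two "
            f"sets match; use a dataset and sigmas that follow the model's "
            f"keypoint layout."
        )
```

Calling it with 52 predicted keypoints against the 17 COCO names raises immediately, before anything is passed to pycocotools.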
Idea for finishing it: KeypointAPMetric already takes sigmas and keypoint_names as arguments, so the main missing piece is a dataset with SynthPose's 52-keypoint ground truth plus the matching OKS sigmas. I'd rather agree on the dataset and sigmas in review before adding that - happy to land it in this PR or as a follow-up, whichever you prefer.

Refs #284.