Eval metrics refactor #1412
Code Review
This pull request updates the basic LTX2 audio evaluation script to conditionally generate the video only if it does not already exist, and improves the handling of both per-sample and corpus-level metrics. It also updates the documentation for the audio.desync metric. A critical issue was identified where evaluating corpus-level metrics like audio.frechet_distance with a single-sample call raises a ValueError, and using the list form causes per-sample metrics to be incorrectly reported as missing. A suggestion was provided to use the list-based evaluation format and check the first sample's results.
    results = evaluator.evaluate(audio=output_path, text_prompt=PROMPT)

    print("\n=== Audio scores ===")
    for name in METRICS:
        r = results[name]
        r = None

        # per-sample metric
        if hasattr(results, "__contains__") and name in results:
            r = results[name]

        # corpus-level metric (e.g. audio.frechet_distance)
        elif hasattr(results, "corpus") and name in results.corpus:
            r = results.corpus[name]
The current implementation has two issues when supporting both per-sample and corpus-level metrics (like audio.frechet_distance):
- ValueError on evaluation: If audio.frechet_distance is added to METRICS, calling evaluator.evaluate(audio=...) (single-sample form) raises a ValueError because set-vs-set metrics require the list/samples form.
- Per-sample metrics reported as MISSING: If the list form samples=[...] is used, results becomes an EvalResults (list subclass). The check name in results will evaluate to False because name is a string and the list contains dictionaries, causing all per-sample metrics to be reported as MISSING.
Using the list form in evaluate and checking results[0] for per-sample metrics resolves both issues.
    - results = evaluator.evaluate(audio=output_path, text_prompt=PROMPT)
    - print("\n=== Audio scores ===")
    - for name in METRICS:
    -     r = results[name]
    -     r = None
    -     # per-sample metric
    -     if hasattr(results, "__contains__") and name in results:
    -         r = results[name]
    -     # corpus-level metric (e.g. audio.frechet_distance)
    -     elif hasattr(results, "corpus") and name in results.corpus:
    -         r = results.corpus[name]
    + results = evaluator.evaluate(samples=[{"audio": output_path, "text_prompt": PROMPT}])
    + print("\n=== Audio scores ===")
    + for name in METRICS:
    +     r = None
    +     # corpus-level metric (e.g. audio.frechet_distance)
    +     if hasattr(results, "corpus") and name in results.corpus:
    +         r = results.corpus[name]
    +     # per-sample metric
    +     elif len(results) > 0 and name in results[0]:
    +         r = results[0][name]
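The lookup order in the suggestion can be tried without the real evaluator. Below is a minimal sketch, assuming results behaves as described in this review: a list subclass holding one dict per sample, with corpus-level scores on a corpus attribute. FakeEvalResults and lookup are hypothetical stand-ins rather than library code, the scores are made up, and treating audio.desync as a per-sample metric is an assumption.

```python
class FakeEvalResults(list):
    """Hypothetical stand-in for EvalResults: a list of per-sample dicts
    plus a `corpus` dict for set-vs-set metrics."""

    def __init__(self, samples, corpus=None):
        super().__init__(samples)
        self.corpus = dict(corpus or {})


def lookup(results, name):
    """Return the score stored under `name`, or None if it is missing."""
    # Corpus-level metric (e.g. audio.frechet_distance).
    if hasattr(results, "corpus") and name in results.corpus:
        return results.corpus[name]
    # Per-sample metric: look inside the first sample's dict.
    if len(results) > 0 and name in results[0]:
        return results[0][name]
    return None


results = FakeEvalResults(
    samples=[{"audio.desync": 0.31}],         # made-up score
    corpus={"audio.frechet_distance": 12.5},  # made-up score
)

# Membership on the list compares the string against each dict, so this is
# False even though the first sample contains the metric.
print("audio.desync" in results)
print(lookup(results, "audio.desync"))
print(lookup(results, "audio.frechet_distance"))
print(lookup(results, "audio.not_a_metric"))
```

Checking the corpus dict first keeps a single code path for both kinds of metric once the list form is used.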
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🔴 PR merge requirements
Waiting for
This rule is failing.
This change fixes build_eval_kwargs to match the evaluator's expected
The helper previously introduced an extra batch dimension on the video
Wan2.1-T2V-1.3B VBench Baseline
Single-video sanity evaluation using the built-in VBench metrics on a
Evaluation Setup
Model:
Metrics:
Results
Observations
Conclusion
The Wan2.1-T2V-1.3B baseline demonstrates strong temporal stability,
1a7d4b3 to fdc45b1
@SolitaryThinker @shaoxiongduan Hi Will and Shao, I've addressed the previous comments and pushed the latest updates. Could you please take a look when you have a chance? Thanks!
Hi @klhhhhh, automated review from Gob, one of @SolitaryThinker's AI reviewers. Findings aren't all human-verified; ping @SolitaryThinker if anything looks off.
TL;DR
The
Verdict: ship-with-fixes
Findings (formatted for upload)
[S1]
fdc45b1 to b31f3ba
@SolitaryThinker @shaoxiongduan, all done. I deleted the Wan eval script and added some references in the LTX2 docstring.
Summary
This PR updates the eval input helper build_eval_kwargs to return video tensors as (T, C, H, W) instead of adding a batch dimension.
This matches the one-sample input contract used by Evaluator.evaluate(**sample).
It also updates the LTX2 eval example to use the new frame-axis convention
and adds a new Wan2.1-T2V-1.3B VBench evaluation example.
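To make the shape contract concrete, here is a minimal sketch that works on shape tuples only, so it runs without a tensor library. to_frame_first is a hypothetical helper, not build_eval_kwargs itself, and treating a leading axis of size 1 as the batch dimension is an assumption.

```python
def to_frame_first(shape):
    """Drop a leading batch axis of size 1 so the shape is (T, C, H, W)."""
    shape = tuple(shape)
    if len(shape) == 5:
        if shape[0] != 1:
            raise ValueError(f"expected one sample, got batch size {shape[0]}")
        shape = shape[1:]
    if len(shape) != 4:
        raise ValueError(f"expected (T, C, H, W), got {len(shape)} dims")
    return shape


# A batched (1, T, C, H, W) shape is reduced to the one-sample layout.
print(to_frame_first((1, 16, 3, 256, 256)))
# A shape that already follows the contract passes through unchanged.
print(to_frame_first((16, 3, 256, 256)))
```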
Documentation updates
This PR clarifies several audio metric input contracts and API behaviors
that were confusing during real-world smoke testing:
1. audio.frechet_distance is corpus-level only
Unlike most audio.* metrics, audio.frechet_distance is a set-vs-set metric and must be evaluated through the list form, evaluator.evaluate(samples=[...]), instead of the single-sample form, evaluator.evaluate(audio=...).
The README is updated to clarify that the result lives under results.corpus
and to provide a complete usage example.
2. audio.desync requires fps
The current docs only mention video and audio, but the metric also requires fps (or src_fps) to run correctly.
The README now explicitly documents this requirement and adds an example.
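As a rough illustration of the input contract for audio.desync, the sketch below checks a sample dict before it is handed to the evaluator. missing_desync_inputs is a hypothetical helper, not part of the library, and the file name and frame rate are placeholders.

```python
def missing_desync_inputs(sample):
    """Return the inputs that audio.desync needs but `sample` lacks."""
    missing = [key for key in ("video", "audio") if key not in sample]
    # Either `fps` or `src_fps` satisfies the frame-rate requirement.
    if "fps" not in sample and "src_fps" not in sample:
        missing.append("fps (or src_fps)")
    return missing


sample = {"video": "out.mp4", "audio": "out.mp4"}  # placeholder paths
print(missing_desync_inputs(sample))

sample["fps"] = 24  # placeholder frame rate
print(missing_desync_inputs(sample))
```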
Example improvements
examples/inference/eval/basic_ltx2_audio_eval.py is updated to:
audio.frechet_distance
These changes make the example usable as a more reliable smoke-test
template for validating audio metrics on generated LTX2 outputs.
3. Example usability improvement
examples/inference/eval/basic_ltx2_audio_eval.py now checks whether the target output video already exists before launching generation.
If the output mp4 is already present, the script skips the expensive LTX2
generation step and directly reuses the existing video for evaluation.
This makes the example significantly more convenient for iterative audio
metric debugging and smoke-testing workflows, where metrics are frequently
re-run on the same generated sample.
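
The skip-if-present behaviour can be sketched as follows. This is not the script's actual code: ensure_video and fake_generate are hypothetical names, and the generator callback stands in for the LTX2 generation call.

```python
import tempfile
from pathlib import Path


def ensure_video(output_path, generate):
    """Call `generate(path)` only if the file is absent; return True if it ran."""
    path = Path(output_path)
    if path.exists():
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    generate(path)
    return True


calls = []


def fake_generate(path):
    # Stand-in for the expensive generation step: record the call and
    # write an empty placeholder file instead of a real mp4.
    calls.append(path)
    path.write_bytes(b"")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "outputs" / "sample.mp4"
    first = ensure_video(target, fake_generate)   # file absent: generates
    second = ensure_video(target, fake_generate)  # file present: skipped

print(first, second, len(calls))
```

Returning a flag lets the caller log whether the evaluation ran on a fresh or a reused video.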