Eval metrics refactor by klhhhhh · Pull Request #1412 · hao-ai-lab/FastVideo

klhhhhh · 2026-05-28T09:19:20Z

Summary

This PR updates the eval input helper build_eval_kwargs to return
video tensors as (T, C, H, W) instead of adding a batch dimension.
This matches the one-sample input contract used by Evaluator.evaluate(**sample).

It also updates the LTX2 eval example to use the new frame-axis convention
and adds a new Wan2.1-T2V-1.3B VBench evaluation example.

Documentation updates

This PR clarifies several audio metric input contracts and API behaviors
that were confusing during real-world smoke testing:

1. `audio.frechet_distance` is corpus-level only

Unlike most audio.* metrics, audio.frechet_distance is a
set-vs-set metric and must be evaluated through:

ev.evaluate(samples=[...])

instead of:

ev.evaluate(audio=...)

The README is updated to clarify that the result lives under:

results.corpus["audio.frechet_distance"]

and to provide a complete usage example.

2. `audio.desync` requires `fps`

The current docs only mention video and audio, but the metric also
requires fps (or src_fps) to run correctly.

The README now explicitly documents this requirement and adds an example.

Example improvements

examples/inference/eval/basic_ltx2_audio_eval.py is updated to:

support both per-sample and corpus-level metrics
avoid crashes when evaluating audio.frechet_distance
provide more general result-printing logic
better reflect the actual input contracts of the audio metric suite

These changes make the example usable as a more reliable smoke-test
template for validating audio metrics on generated LTX2 outputs.

3. Example usability improvement

examples/inference/eval/basic_ltx2_audio_eval.py now checks whether the
target output video already exists before launching generation.

If the output mp4 is already present, the script skips the expensive LTX2
generation step and directly reuses the existing video for evaluation.

This makes the example significantly more convenient for iterative audio
metric debugging and smoke-testing workflows, where metrics are frequently
re-run on the same generated sample.

gemini-code-assist

Code Review

This pull request updates the basic LTX2 audio evaluation script to conditionally generate the video only if it does not already exist, and improves the handling of both per-sample and corpus-level metrics. It also updates the documentation for the audio.desync metric. A critical issue was identified where evaluating corpus-level metrics like audio.frechet_distance with a single-sample call raises a ValueError, and using the list form causes per-sample metrics to be incorrectly reported as missing. A suggestion was provided to use the list-based evaluation format and check the first sample's results.

gemini-code-assist · 2026-05-28T09:20:18Z

    results = evaluator.evaluate(audio=output_path, text_prompt=PROMPT)

    print("\n=== Audio scores ===")
    for name in METRICS:
-        r = results[name]
+        r = None
+
+        # per-sample metric
+        if hasattr(results, "__contains__") and name in results:
+            r = results[name]
+
+        # corpus-level metric (e.g. audio.frechet_distance)
+        elif hasattr(results, "corpus") and name in results.corpus:
+            r = results.corpus[name]


The current implementation has two issues when supporting both per-sample and corpus-level metrics (like audio.frechet_distance):

ValueError on evaluation: If audio.frechet_distance is added to METRICS, calling evaluator.evaluate(audio=...) (single-sample form) raises a ValueError because set-vs-set metrics require the list/samples form.

Per-sample metrics reported as MISSING: If the list form samples=[...] is used, results becomes an EvalResults (list subclass). The check name in results will evaluate to False because name is a string and the list contains dictionaries, causing all per-sample metrics to be reported as MISSING.

Using the list form in evaluate and checking results[0] for per-sample metrics resolves both issues.

Suggested change

results = evaluator.evaluate(audio=output_path, text_prompt=PROMPT)

print("\n=== Audio scores ===")

for name in METRICS:

r = results[name]

r = None

# per-sample metric

if hasattr(results, "__contains__") and name in results:

r = results[name]

# corpus-level metric (e.g. audio.frechet_distance)

elif hasattr(results, "corpus") and name in results.corpus:

r = results.corpus[name]

results = evaluator.evaluate(samples=[{"audio": output_path, "text_prompt": PROMPT}])

print("\n=== Audio scores ===")

for name in METRICS:

r = None

# corpus-level metric (e.g. audio.frechet_distance)

if hasattr(results, "corpus") and name in results.corpus:

r = results.corpus[name]

# per-sample metric

elif len(results) > 0 and name in results[0]:

r = results[0][name]

mergify · 2026-05-28T09:20:20Z

⚠️ PR title format required

Your PR title must start with a type tag in brackets. Examples:

[feat] Add new model support
[bugfix] Fix VAE tiling corruption
[refactor] Restructure training pipeline
[perf] Optimize attention kernel
[ci] Update test infrastructure
[infra] Add activation trace hooks
[docs] Add inference guide
[misc] Clean up configs
[new-model] Port Flux2 to FastVideo
[skill] Add add-model agent skill

Valid tags: feat, feature, bugfix, fix, refactor, perf, ci, infra, doc, docs, misc, chore, kernel, new-model, skill, skills

Please update your PR title and the merge protection check will pass automatically.

mergify · 2026-05-28T09:20:24Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 PR merge requirements

Waiting for

#approved-reviews-by>=1
check-success=full-suite-passed
title~=(?i)^\[(feat|feature|bugfix|fix|refactor|perf|ci|doc|docs|misc|chore|kernel|new.?model|skill|skills|infra)\]

This rule is failing.

#approved-reviews-by>=1
check-success=full-suite-passed
title~=(?i)^\[(feat|feature|bugfix|fix|refactor|perf|ci|doc|docs|misc|chore|kernel|new.?model|skill|skills|infra)\]
check-success=fastcheck-passed
check-success~=pre-commit

klhhhhh · 2026-06-04T07:16:17Z

This change fixes build_eval_kwargs to match the evaluator's expected
one-sample input contract.

The helper previously introduced an extra batch dimension on the video
tensor, producing a shape that was inconsistent with what video metrics
expect. The function now forwards videos in their native
(T, C, H, W) format and keeps prompt/auxiliary_info as scalar values,
making the generated kwargs directly compatible with
Evaluator.evaluate(**sample).

klhhhhh · 2026-06-04T07:18:42Z

Wan2.1-T2V-1.3B VBench Baseline

Single-video sanity evaluation using the built-in VBench metrics on a
video generated by Wan-AI/Wan2.1-T2V-1.3B-Diffusers.

Evaluation Setup

Model:
Wan-AI/Wan2.1-T2V-1.3B-Diffusers

Metrics:

vbench.aesthetic_quality
vbench.subject_consistency
vbench.background_consistency
vbench.imaging_quality
vbench.temporal_flickering
vbench.motion_smoothness
vbench.dynamic_degree
vbench.overall_consistency

Results

Metric	Score
aesthetic_quality	0.6042
subject_consistency	0.9390
background_consistency	0.9550
imaging_quality	0.5915
temporal_flickering	0.9537
motion_smoothness	0.9729
dynamic_degree	1.0000
overall_consistency	0.2861

Observations

Subject consistency (0.9390) is strong, indicating that the main
object remains stable throughout the generated sequence.
Background consistency (0.9550) is similarly high, suggesting
minimal scene drift and good spatial coherence.
Temporal flickering (0.9537) and motion smoothness (0.9729)
indicate stable frame-to-frame generation with few visible temporal
artifacts.
Dynamic degree (1.0000) confirms that the generated clip contains
substantial motion rather than static imagery.
Aesthetic quality (0.6042) and imaging quality (0.5915) are
moderate, suggesting that visual fidelity remains an area for
improvement compared with stronger proprietary or larger-scale video
generation models.
Overall consistency (0.2861) remains the lowest score among the
evaluated metrics. Since this metric combines multiple aspects of
temporal and semantic coherence, it often exposes residual generation
artifacts that are not captured by individual stability metrics alone.

Conclusion

The Wan2.1-T2V-1.3B baseline demonstrates strong temporal stability,
motion continuity, and scene consistency. The primary limitations are
visual quality and aggregate consistency metrics, which provides a
useful baseline for future FastVideo evaluation experiments and model
comparisons.

klhhhhh · 2026-06-04T07:44:28Z

@SolitaryThinker @shaoxiongduan Hi Will and Shao, I've addressed the previous comments and pushed the latest updates. Could you please take a look when you have a chance? Thanks!

SolitaryThinker · 2026-06-06T19:47:04Z

Hi @klhhhhh — automated review from Gob, one of @SolitaryThinker's AI reviewers. Findings aren't all human-verified; ping @SolitaryThinker if anything looks off.

TL;DR

The fastvideo/eval/io/paths.py contract change (drop unsqueeze(0), emit (T, C, H, W)) is correct and matches what every metric in fastvideo/eval/metrics/* already consumes — the pool layer's auto-squeeze (pool.py:_decode) had been hiding the mismatch. The one practical bug is in the new basic_wan_eval.py: line 107 reads .shape[1] after the contract change, so it will print C=3 as the frame count. The companion basic_ltx2_eval.py was correctly updated to .shape[0] in the same PR, which makes the wan-side asymmetry a clear oversight. Separately, the PR body undersells the diff: it advertises audio-metric docs + an example refactor and says "I will add wan test later," but the PR already ships the wan example and a library contract change.

Verdict: ship-with-fixes

S0 (blockers): 0
S1 (must-fix): 1
S2 (should-fix; persistent): 2
S3 (discussion): not shown here; see local review.md

Findings (formatted for upload)

[S1] `basic_wan_eval.py:107` reads `.shape[1]` after the 4-D contract; prints C (=3) as frame count

What: With this PR, build_eval_kwargs returns kwargs["video"] as (T, C, H, W) (paths.py:55 — "video": video replaces "video": video.unsqueeze(0)). The new examples/inference/eval/basic_wan_eval.py:107 does:

print(f"[eval] running ({sample['video'].shape[1]} frames @ {FPS:g} fps)...")

so .shape[1] is now the channel count (typically 3), not the frame count. The print line will read "running (3 frames @ 16 fps)..." regardless of num_frames=81. Note that basic_ltx2_eval.py:94 was correctly updated to .shape[0] in the same diff, which makes this an asymmetric oversight within the PR.

Why it matters: Cosmetic-only — the metric run itself is unaffected — but basic_wan_eval.py is a brand-new example file users will copy as a template, and the print silently lies about workload size during debugging.

Suggested fix:

print(f"[eval] running ({sample['video'].shape[0]} frames @ {FPS:g} fps)...")

Evidence: examples/inference/eval/basic_wan_eval.py:107, contrasted with examples/inference/eval/basic_ltx2_eval.py:94 and fastvideo/eval/io/paths.py:55.

[S2-persistent] PR body misrepresents scope; mixes a library contract change with example scaffolding

What: The body summary describes "audio metric documentation and updates the basic_ltx2_audio_eval.py example" and closes with "I will do another refactor and add wan test later." The diff, however, already ships (a) the paths.py library contract change to build_eval_kwargs, (b) a new +114-line basic_wan_eval.py, and (c) a shape-axis fix in basic_ltx2_eval.py that only makes sense as part of the paths.py migration. None of these are mentioned in the body.

Why it matters:

build_eval_kwargs is exported via fastvideo.eval.io.__all__; dropping unsqueeze(0) is a public-API contract change and warrants either a CHANGELOG note, a README callout, or its own atomic PR. A reviewer scanning the body would not notice it.
Bundled scope makes the PR harder to bisect later: git log --follow fastvideo/eval/io/paths.py will land on a commit titled around audio metrics.
The "I will add wan test later" footer contradicts the diff and reads as stale text the body wasn't updated to reflect.

Suggested fix (in order of preference):

Split fastvideo/eval/io/paths.py (plus the basic_ltx2_eval.py:94 shape-axis fix it implies) into its own PR titled e.g. [fix]: eval.io.build_eval_kwargs returns (T, C, H, W) and merge first.
Otherwise, rewrite the PR body to (a) explicitly call out the build_eval_kwargs contract change, (b) note the wan example is already included here, and (c) note the basic_ltx2_eval.py shape-axis update is the migration's other half.

Evidence: PR body vs. fastvideo/eval/io/paths.py:44-67, examples/inference/eval/basic_wan_eval.py (new file), examples/inference/eval/basic_ltx2_eval.py:94.

[S2-persistent] `build_eval_kwargs` contract change is undocumented; the `(T, C, H, W)` invariant is invisible to future readers

What: fastvideo/eval/io/paths.py:55 drops .unsqueeze(0) so build_eval_kwargs now returns kwargs["video"] as (T, C, H, W). The docstring (paths.py:47-50) describes the new shape but does not flag this as a change from the previous behavior. fastvideo/eval/README.md's "Public API" section (untouched here) does not mention the shape convention. The runtime auto-squeeze in fastvideo/eval/pool.py:_decode (pool.py:147-164) handles legacy 5-D callers, so this is not a silent runtime break — but the underlying invariant ("metrics see (T, C, H, W)") is now nowhere documented.

Why it matters: A future reader who notices build_eval_kwargs doesn't add a batch dim may "fix" it back to unsqueeze(0), re-introducing the inconsistency. There is no test in fastvideo/tests/ that asserts the shape (rg "build_eval_kwargs" fastvideo/tests/ → no matches), so the only thing pinning the contract is the example caller.

Suggested fix: Add a short note. Either:

A one-liner in fastvideo/eval/README.md's "Public API" section: "build_eval_kwargs returns video as (T, C, H, W) in [0, 1]."
Or a docstring note in paths.py: "As of this PR, the returned video is (T, C, H, W); the pool layer auto-squeezes legacy (1, T, C, H, W) callers for back-compat (see fastvideo/eval/pool.py:_decode)."

Optionally also add a 5-line unit test asserting build_eval_kwargs(...)["video"].dim() == 4.

Evidence: fastvideo/eval/io/paths.py:44-67, fastvideo/eval/io/__init__.py:1,10, fastvideo/eval/pool.py:147-164, fastvideo/eval/README.md "Public API" section.

Triage of @gemini-code-assist's earlier review (at SHA `bd59a68`, now stale)

Their concern was that audio.frechet_distance raises ValueError on the single-sample ev.evaluate(audio=...) call form. At HEAD (fdc45b1c), the METRICS list in basic_ltx2_audio_eval.py no longer includes audio.frechet_distance — only audio.clap_score and audio.audiobox_aesthetics remain (both per-sample). The ValueError path in fastvideo/eval/evaluator.py:128 is therefore unreachable for the current METRICS, and the new dual-mode result lookup (if name in results: ... elif hasattr(results, "corpus") and name in results.corpus: ...) is defensive scaffolding rather than a literal fix to the original complaint. Net: gemini's concern is no longer a live bug; the dual-lookup is harmless but would only matter if a corpus metric and the list form (evaluator.evaluate(samples=[...])) are reintroduced later.

— Gob (@SolitaryThinker's AI reviewer). Full review (including S3 items) is archived locally.

klhhhhh · 2026-06-11T07:09:00Z

@SolitaryThinker @shaoxiongduan , all done, I delete the wan eval script and add some reference in ltx2 doc string.

klhhhhh changed the title ~~Audio metrics in eval refactor~~ Eval metrics refactor May 28, 2026

gemini-code-assist Bot reviewed May 28, 2026

View reviewed changes

mergify Bot added the scope: inference Inference pipeline, serving, CLI label May 28, 2026

klhhhhh force-pushed the kun_lin/audio-metrics branch from 1a7d4b3 to fdc45b1 Compare June 4, 2026 07:38

klhhhhh added 9 commits June 11, 2026 06:52

update audio.desync required keys

b02cb68

refactor audio eval for frechet_distance

a98e384

detect existing video path

1d89829

[feat] first version of wan eval

c90b351

[fix]change input video shape

6dda4ed

[refactor] detect whether output video exists

8d0a534

[fix] delete some info print

5c3697f

[fix] add docstring of build_eval_kwargs

f1ac7af

[fix] adjust print shape

b31f3ba

klhhhhh force-pushed the kun_lin/audio-metrics branch from fdc45b1 to b31f3ba Compare June 11, 2026 06:57

[fix] remove wan model test script, add ref in ltx2 script

cf0ca0f

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Eval metrics refactor#1412

Eval metrics refactor#1412
klhhhhh wants to merge 10 commits into
hao-ai-lab:mainfrom
klhhhhh:kun_lin/audio-metrics

klhhhhh commented May 28, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 28, 2026

Uh oh!

mergify Bot commented May 28, 2026

Uh oh!

mergify Bot commented May 28, 2026 •

edited

Loading

Uh oh!

klhhhhh commented Jun 4, 2026

Uh oh!

klhhhhh commented Jun 4, 2026 •

edited

Loading

Uh oh!

klhhhhh commented Jun 4, 2026

Uh oh!

SolitaryThinker commented Jun 6, 2026

Uh oh!

klhhhhh commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

klhhhhh commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Documentation updates

1. audio.frechet_distance is corpus-level only

2. audio.desync requires fps

Example improvements

3. Example usability improvement

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

mergify Bot commented May 28, 2026

⚠️ PR title format required

Uh oh!

mergify Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merge Protections

🔴 PR merge requirements

Uh oh!

klhhhhh commented Jun 4, 2026

Uh oh!

klhhhhh commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Wan2.1-T2V-1.3B VBench Baseline

Evaluation Setup

Results

Observations

Conclusion

Uh oh!

klhhhhh commented Jun 4, 2026

Uh oh!

SolitaryThinker commented Jun 6, 2026

TL;DR

Verdict: ship-with-fixes

Findings (formatted for upload)

[S1] basic_wan_eval.py:107 reads .shape[1] after the 4-D contract; prints C (=3) as frame count

[S2-persistent] PR body misrepresents scope; mixes a library contract change with example scaffolding

[S2-persistent] build_eval_kwargs contract change is undocumented; the (T, C, H, W) invariant is invisible to future readers

Triage of @gemini-code-assist's earlier review (at SHA bd59a68, now stale)

Uh oh!

klhhhhh commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

klhhhhh commented May 28, 2026 •

edited

Loading

1. `audio.frechet_distance` is corpus-level only

2. `audio.desync` requires `fps`

mergify Bot commented May 28, 2026 •

edited

Loading

klhhhhh commented Jun 4, 2026 •

edited

Loading

[S1] `basic_wan_eval.py:107` reads `.shape[1]` after the 4-D contract; prints C (=3) as frame count

[S2-persistent] `build_eval_kwargs` contract change is undocumented; the `(T, C, H, W)` invariant is invisible to future readers

Triage of @gemini-code-assist's earlier review (at SHA `bd59a68`, now stale)