Skip to content

Eval metrics refactor#1412

Open
klhhhhh wants to merge 10 commits into
hao-ai-lab:mainfrom
klhhhhh:kun_lin/audio-metrics
Open

Eval metrics refactor#1412
klhhhhh wants to merge 10 commits into
hao-ai-lab:mainfrom
klhhhhh:kun_lin/audio-metrics

Conversation

@klhhhhh

@klhhhhh klhhhhh commented May 28, 2026

Copy link
Copy Markdown
Contributor

Summary

This PR updates the eval input helper build_eval_kwargs to return
video tensors as (T, C, H, W) instead of adding a batch dimension.
This matches the one-sample input contract used by Evaluator.evaluate(**sample).

It also updates the LTX2 eval example to use the new frame-axis convention
and adds a new Wan2.1-T2V-1.3B VBench evaluation example.

Documentation updates

This PR clarifies several audio metric input contracts and API behaviors
that were confusing during real-world smoke testing:

1. audio.frechet_distance is corpus-level only

Unlike most audio.* metrics, audio.frechet_distance is a
set-vs-set metric and must be evaluated through:

ev.evaluate(samples=[...])

instead of:

ev.evaluate(audio=...)

The README is updated to clarify that the result lives under:

results.corpus["audio.frechet_distance"]

and to provide a complete usage example.

2. audio.desync requires fps

The current docs only mention video and audio, but the metric also
requires fps (or src_fps) to run correctly.

The README now explicitly documents this requirement and adds an example.

Example improvements

examples/inference/eval/basic_ltx2_audio_eval.py is updated to:

  • support both per-sample and corpus-level metrics
  • avoid crashes when evaluating audio.frechet_distance
  • provide more general result-printing logic
  • better reflect the actual input contracts of the audio metric suite

These changes make the example usable as a more reliable smoke-test
template for validating audio metrics on generated LTX2 outputs.

3. Example usability improvement

examples/inference/eval/basic_ltx2_audio_eval.py now checks whether the
target output video already exists before launching generation.

If the output mp4 is already present, the script skips the expensive LTX2
generation step and directly reuses the existing video for evaluation.

This makes the example significantly more convenient for iterative audio
metric debugging and smoke-testing workflows, where metrics are frequently
re-run on the same generated sample.

@klhhhhh klhhhhh changed the title Audio metrics in eval refactor Eval metrics refactor May 28, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the basic LTX2 audio evaluation script to conditionally generate the video only if it does not already exist, and improves the handling of both per-sample and corpus-level metrics. It also updates the documentation for the audio.desync metric. A critical issue was identified where evaluating corpus-level metrics like audio.frechet_distance with a single-sample call raises a ValueError, and using the list form causes per-sample metrics to be incorrectly reported as missing. A suggestion was provided to use the list-based evaluation format and check the first sample's results.

Comment on lines 69 to +81
results = evaluator.evaluate(audio=output_path, text_prompt=PROMPT)

print("\n=== Audio scores ===")
for name in METRICS:
r = results[name]
r = None

# per-sample metric
if hasattr(results, "__contains__") and name in results:
r = results[name]

# corpus-level metric (e.g. audio.frechet_distance)
elif hasattr(results, "corpus") and name in results.corpus:
r = results.corpus[name]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current implementation has two issues when supporting both per-sample and corpus-level metrics (like audio.frechet_distance):

  1. ValueError on evaluation: If audio.frechet_distance is added to METRICS, calling evaluator.evaluate(audio=...) (single-sample form) raises a ValueError because set-vs-set metrics require the list/samples form.
  2. Per-sample metrics reported as MISSING: If the list form samples=[...] is used, results becomes an EvalResults (list subclass). The check name in results will evaluate to False because name is a string and the list contains dictionaries, causing all per-sample metrics to be reported as MISSING.

Using the list form in evaluate and checking results[0] for per-sample metrics resolves both issues.

Suggested change
results = evaluator.evaluate(audio=output_path, text_prompt=PROMPT)
print("\n=== Audio scores ===")
for name in METRICS:
r = results[name]
r = None
# per-sample metric
if hasattr(results, "__contains__") and name in results:
r = results[name]
# corpus-level metric (e.g. audio.frechet_distance)
elif hasattr(results, "corpus") and name in results.corpus:
r = results.corpus[name]
results = evaluator.evaluate(samples=[{"audio": output_path, "text_prompt": PROMPT}])
print("\n=== Audio scores ===")
for name in METRICS:
r = None
# corpus-level metric (e.g. audio.frechet_distance)
if hasattr(results, "corpus") and name in results.corpus:
r = results.corpus[name]
# per-sample metric
elif len(results) > 0 and name in results[0]:
r = results[0][name]

@mergify mergify Bot added the scope: inference Inference pipeline, serving, CLI label May 28, 2026
@mergify

mergify Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

⚠️ PR title format required

Your PR title must start with a type tag in brackets. Examples:

  • [feat] Add new model support
  • [bugfix] Fix VAE tiling corruption
  • [refactor] Restructure training pipeline
  • [perf] Optimize attention kernel
  • [ci] Update test infrastructure
  • [infra] Add activation trace hooks
  • [docs] Add inference guide
  • [misc] Clean up configs
  • [new-model] Port Flux2 to FastVideo
  • [skill] Add add-model agent skill

Valid tags: feat, feature, bugfix, fix, refactor, perf, ci, infra, doc, docs, misc, chore, kernel, new-model, skill, skills

Please update your PR title and the merge protection check will pass automatically.

@mergify

mergify Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 PR merge requirements

Waiting for

  • #approved-reviews-by>=1
  • check-success=full-suite-passed
  • title~=(?i)^\[(feat|feature|bugfix|fix|refactor|perf|ci|doc|docs|misc|chore|kernel|new.?model|skill|skills|infra)\]
This rule is failing.
  • #approved-reviews-by>=1
  • check-success=full-suite-passed
  • title~=(?i)^\[(feat|feature|bugfix|fix|refactor|perf|ci|doc|docs|misc|chore|kernel|new.?model|skill|skills|infra)\]
  • check-success=fastcheck-passed
  • check-success~=pre-commit

@klhhhhh

klhhhhh commented Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

This change fixes build_eval_kwargs to match the evaluator's expected
one-sample input contract.

The helper previously introduced an extra batch dimension on the video
tensor, producing a shape that was inconsistent with what video metrics
expect. The function now forwards videos in their native
(T, C, H, W) format and keeps prompt/auxiliary_info as scalar values,
making the generated kwargs directly compatible with
Evaluator.evaluate(**sample).

@klhhhhh

klhhhhh commented Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

Wan2.1-T2V-1.3B VBench Baseline

Single-video sanity evaluation using the built-in VBench metrics on a
video generated by Wan-AI/Wan2.1-T2V-1.3B-Diffusers.

Evaluation Setup

Model:
Wan-AI/Wan2.1-T2V-1.3B-Diffusers

Metrics:

  • vbench.aesthetic_quality

  • vbench.subject_consistency

  • vbench.background_consistency

  • vbench.imaging_quality

  • vbench.temporal_flickering

  • vbench.motion_smoothness

  • vbench.dynamic_degree

  • vbench.overall_consistency

Results

Metric Score
aesthetic_quality 0.6042
subject_consistency 0.9390
background_consistency 0.9550
imaging_quality 0.5915
temporal_flickering 0.9537
motion_smoothness 0.9729
dynamic_degree 1.0000
overall_consistency 0.2861

Observations

  • Subject consistency (0.9390) is strong, indicating that the main
    object remains stable throughout the generated sequence.

  • Background consistency (0.9550) is similarly high, suggesting
    minimal scene drift and good spatial coherence.

  • Temporal flickering (0.9537) and motion smoothness (0.9729)
    indicate stable frame-to-frame generation with few visible temporal
    artifacts.

  • Dynamic degree (1.0000) confirms that the generated clip contains
    substantial motion rather than static imagery.

  • Aesthetic quality (0.6042) and imaging quality (0.5915) are
    moderate, suggesting that visual fidelity remains an area for
    improvement compared with stronger proprietary or larger-scale video
    generation models.

  • Overall consistency (0.2861) remains the lowest score among the
    evaluated metrics. Since this metric combines multiple aspects of
    temporal and semantic coherence, it often exposes residual generation
    artifacts that are not captured by individual stability metrics alone.

Conclusion

The Wan2.1-T2V-1.3B baseline demonstrates strong temporal stability,
motion continuity, and scene consistency. The primary limitations are
visual quality and aggregate consistency metrics, which provides a
useful baseline for future FastVideo evaluation experiments and model
comparisons.

@klhhhhh klhhhhh force-pushed the kun_lin/audio-metrics branch from 1a7d4b3 to fdc45b1 Compare June 4, 2026 07:38
@klhhhhh

klhhhhh commented Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

@SolitaryThinker @shaoxiongduan Hi Will and Shao, I've addressed the previous comments and pushed the latest updates. Could you please take a look when you have a chance? Thanks!

@SolitaryThinker

Copy link
Copy Markdown
Collaborator

Hi @klhhhhh — automated review from Gob, one of @SolitaryThinker's AI reviewers. Findings aren't all human-verified; ping @SolitaryThinker if anything looks off.

TL;DR

The fastvideo/eval/io/paths.py contract change (drop unsqueeze(0), emit (T, C, H, W)) is correct and matches what every metric in fastvideo/eval/metrics/* already consumes — the pool layer's auto-squeeze (pool.py:_decode) had been hiding the mismatch. The one practical bug is in the new basic_wan_eval.py: line 107 reads .shape[1] after the contract change, so it will print C=3 as the frame count. The companion basic_ltx2_eval.py was correctly updated to .shape[0] in the same PR, which makes the wan-side asymmetry a clear oversight. Separately, the PR body undersells the diff: it advertises audio-metric docs + an example refactor and says "I will add wan test later," but the PR already ships the wan example and a library contract change.

Verdict: ship-with-fixes

  • S0 (blockers): 0
  • S1 (must-fix): 1
  • S2 (should-fix; persistent): 2
  • S3 (discussion): not shown here; see local review.md

Findings (formatted for upload)

[S1] basic_wan_eval.py:107 reads .shape[1] after the 4-D contract; prints C (=3) as frame count

What: With this PR, build_eval_kwargs returns kwargs["video"] as (T, C, H, W) (paths.py:55 — "video": video replaces "video": video.unsqueeze(0)). The new examples/inference/eval/basic_wan_eval.py:107 does:

print(f"[eval] running ({sample['video'].shape[1]} frames @ {FPS:g} fps)...")

so .shape[1] is now the channel count (typically 3), not the frame count. The print line will read "running (3 frames @ 16 fps)..." regardless of num_frames=81. Note that basic_ltx2_eval.py:94 was correctly updated to .shape[0] in the same diff, which makes this an asymmetric oversight within the PR.

Why it matters: Cosmetic-only — the metric run itself is unaffected — but basic_wan_eval.py is a brand-new example file users will copy as a template, and the print silently lies about workload size during debugging.

Suggested fix:

print(f"[eval] running ({sample['video'].shape[0]} frames @ {FPS:g} fps)...")

Evidence: examples/inference/eval/basic_wan_eval.py:107, contrasted with examples/inference/eval/basic_ltx2_eval.py:94 and fastvideo/eval/io/paths.py:55.

[S2-persistent] PR body misrepresents scope; mixes a library contract change with example scaffolding

What: The body summary describes "audio metric documentation and updates the basic_ltx2_audio_eval.py example" and closes with "I will do another refactor and add wan test later." The diff, however, already ships (a) the paths.py library contract change to build_eval_kwargs, (b) a new +114-line basic_wan_eval.py, and (c) a shape-axis fix in basic_ltx2_eval.py that only makes sense as part of the paths.py migration. None of these are mentioned in the body.

Why it matters:

  • build_eval_kwargs is exported via fastvideo.eval.io.__all__; dropping unsqueeze(0) is a public-API contract change and warrants either a CHANGELOG note, a README callout, or its own atomic PR. A reviewer scanning the body would not notice it.
  • Bundled scope makes the PR harder to bisect later: git log --follow fastvideo/eval/io/paths.py will land on a commit titled around audio metrics.
  • The "I will add wan test later" footer contradicts the diff and reads as stale text the body wasn't updated to reflect.

Suggested fix (in order of preference):

  1. Split fastvideo/eval/io/paths.py (plus the basic_ltx2_eval.py:94 shape-axis fix it implies) into its own PR titled e.g. [fix]: eval.io.build_eval_kwargs returns (T, C, H, W) and merge first.
  2. Otherwise, rewrite the PR body to (a) explicitly call out the build_eval_kwargs contract change, (b) note the wan example is already included here, and (c) note the basic_ltx2_eval.py shape-axis update is the migration's other half.

Evidence: PR body vs. fastvideo/eval/io/paths.py:44-67, examples/inference/eval/basic_wan_eval.py (new file), examples/inference/eval/basic_ltx2_eval.py:94.

[S2-persistent] build_eval_kwargs contract change is undocumented; the (T, C, H, W) invariant is invisible to future readers

What: fastvideo/eval/io/paths.py:55 drops .unsqueeze(0) so build_eval_kwargs now returns kwargs["video"] as (T, C, H, W). The docstring (paths.py:47-50) describes the new shape but does not flag this as a change from the previous behavior. fastvideo/eval/README.md's "Public API" section (untouched here) does not mention the shape convention. The runtime auto-squeeze in fastvideo/eval/pool.py:_decode (pool.py:147-164) handles legacy 5-D callers, so this is not a silent runtime break — but the underlying invariant ("metrics see (T, C, H, W)") is now nowhere documented.

Why it matters: A future reader who notices build_eval_kwargs doesn't add a batch dim may "fix" it back to unsqueeze(0), re-introducing the inconsistency. There is no test in fastvideo/tests/ that asserts the shape (rg "build_eval_kwargs" fastvideo/tests/ → no matches), so the only thing pinning the contract is the example caller.

Suggested fix: Add a short note. Either:

  • A one-liner in fastvideo/eval/README.md's "Public API" section: "build_eval_kwargs returns video as (T, C, H, W) in [0, 1]."
  • Or a docstring note in paths.py: "As of this PR, the returned video is (T, C, H, W); the pool layer auto-squeezes legacy (1, T, C, H, W) callers for back-compat (see fastvideo/eval/pool.py:_decode)."

Optionally also add a 5-line unit test asserting build_eval_kwargs(...)["video"].dim() == 4.

Evidence: fastvideo/eval/io/paths.py:44-67, fastvideo/eval/io/__init__.py:1,10, fastvideo/eval/pool.py:147-164, fastvideo/eval/README.md "Public API" section.


Triage of @gemini-code-assist's earlier review (at SHA bd59a68, now stale)

Their concern was that audio.frechet_distance raises ValueError on the single-sample ev.evaluate(audio=...) call form. At HEAD (fdc45b1c), the METRICS list in basic_ltx2_audio_eval.py no longer includes audio.frechet_distance — only audio.clap_score and audio.audiobox_aesthetics remain (both per-sample). The ValueError path in fastvideo/eval/evaluator.py:128 is therefore unreachable for the current METRICS, and the new dual-mode result lookup (if name in results: ... elif hasattr(results, "corpus") and name in results.corpus: ...) is defensive scaffolding rather than a literal fix to the original complaint. Net: gemini's concern is no longer a live bug; the dual-lookup is harmless but would only matter if a corpus metric and the list form (evaluator.evaluate(samples=[...])) are reintroduced later.


— Gob (@SolitaryThinker's AI reviewer). Full review (including S3 items) is archived locally.

@klhhhhh klhhhhh force-pushed the kun_lin/audio-metrics branch from fdc45b1 to b31f3ba Compare June 11, 2026 06:57
@klhhhhh

klhhhhh commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

@SolitaryThinker @shaoxiongduan , all done, I delete the wan eval script and add some reference in ltx2 doc string.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

scope: inference Inference pipeline, serving, CLI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants