[feat] Add Kandinsky-5 pipeline support #1471

Open
aryan5v wants to merge 2 commits into hao-ai-lab:main from aryan5v:aryan/kandinsky5-draft-pr

aryan5v commented Jun 19, 2026

Summary

Adds first-class Kandinsky-5 Lite T2V support through the normal FastVideo model-support path:

  • Kandinsky-5 pipeline config and default preset
  • basic/kandinsky5 composed pipeline wiring
  • Kandinsky-specific latent prep, denoising, and latent decode stages
  • registry/default-preset wiring for kandinskylab/Kandinsky-5.0-T2V-Lite-sft-5s-Diffusers
  • loader fixes needed by Kandinsky's CLIP text encoder layout and text-encoder CPU offload
  • transformer parity fixes in the existing Kandinsky5 DiT implementation

Validation

  • pre-commit run --files <12 implementation files>: passed locally
  • uv run pytest tests/local_tests/kandinsky5/test_kandinsky5_lite_transformer_parity.py -q -s -rs: passed on B200 GPU, 1 passed, 14 warnings
  • CUDA_VISIBLE_DEVICES=0 uv run python tests/local_tests/kandinsky5/run_kandinsky5_lite_pipeline_smoke.py: passed, one-step latent smoke generated successfully
  • High-quality generation validation on B200 GPU 4: W&B run fuczhqid, 768x512, 121 frames, 80 inference steps, guidance scale 5.0, output outputs/kandinsky5_validation/kandinsky5_red_motorcycle_best_quality_512x768_121f_80s.mp4

aryan5v and others added 2 commits June 18, 2026 15:38
Fixed 2 file(s) based on 2 unresolved review comments.

Co-authored-by: CodeRabbit <noreply@coderabbit.ai>
mergify Bot added labels Jun 19, 2026: type: feat (New feature or capability), scope: inference (Inference pipeline, serving, CLI), scope: model (Model architecture: DiTs, encoders, VAEs)

mergify Bot commented Jun 19, 2026

This PR has merge conflicts with the base branch. Please rebase:

    git fetch origin main
    git rebase origin/main
    # Resolve any conflicts, then:
    git push --force-with-lease

mergify Bot added the needs-rebase (PR has merge conflicts) label Jun 19, 2026

mergify Bot commented Jun 19, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 PR merge requirements

Waiting for

  • #approved-reviews-by>=1
  • check-success=full-suite-passed
  • check-success~=pre-commit
This rule is failing.
  • #approved-reviews-by>=1
  • check-success=full-suite-passed
  • check-success~=pre-commit
  • check-success=fastcheck-passed
  • title~=(?i)^\[(feat|feature|bugfix|fix|refactor|perf|ci|doc|docs|misc|chore|kernel|new.?model|skill|skills|infra)\]
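
The title condition can be tried locally before pushing. This is a minimal sketch that assumes Mergify's `~=` operator performs a regular-expression search and that Python's `re` module interprets the pattern the same way; `title_is_valid` is a hypothetical helper for illustration, not part of Mergify or FastVideo.

```python
import re

# Pattern copied from the title condition listed above.
TITLE_PATTERN = re.compile(
    r"(?i)^\[(feat|feature|bugfix|fix|refactor|perf|ci|doc|docs|misc|chore"
    r"|kernel|new.?model|skill|skills|infra)\]"
)


def title_is_valid(title: str) -> bool:
    """Hypothetical helper: True when the title starts with an allowed [tag]."""
    return TITLE_PATTERN.search(title) is not None


if __name__ == "__main__":
    for candidate in ("[feat] Add Kandinsky-5 pipeline support", "Add Kandinsky-5"):
        print(candidate, "->", title_is_valid(candidate))
```

Under those assumptions, this PR's title satisfies the condition because it starts with `[feat]`.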

gemini-code-assist Bot left a comment

Code Review

This pull request adds support for the Kandinsky-5.0 Lite text-to-video pipeline, introducing the necessary configurations, presets, pipeline stages (latent preparation, denoising, and decoding), and updates to the text encoding and model loading components. Feedback focuses on improving robustness and compatibility: preserving the original tensor dtype in _apply_rotary instead of hardcoding bfloat16, breaking early from the denoising loop on interruption, validating the shape of custom latents, verifying the lengths of text encoder precisions and max lengths, supporting asymmetric patch sizes for divisibility checks, and ensuring prompt_embeds contains at least two elements during input verification.

Comment on lines 265 to +268

     def _apply_rotary(x: torch.Tensor, rope: torch.Tensor) -> torch.Tensor:
    -    orig_dtype = x.dtype
         x_ = x.reshape(*x.shape[:-1], -1, 1, 2).to(torch.float32)
         x_out = (rope * x_).sum(dim=-1)
    -    return x_out.reshape(*x.shape).to(orig_dtype)
    +    return x_out.reshape(*x.shape).to(torch.bfloat16)

Severity: high

Hardcoding torch.bfloat16 in _apply_rotary breaks compatibility when the model runs in other precisions (such as float16 or float32). In addition, because the result of _apply_rotary is immediately passed through .type_as(query) in Kandinsky5Attention.forward, the hardcoded cast adds a redundant conversion and loses precision. Preserving the input tensor's original dtype is more general and correct.

Suggested change

     def _apply_rotary(x: torch.Tensor, rope: torch.Tensor) -> torch.Tensor:
    +    orig_dtype = x.dtype
         x_ = x.reshape(*x.shape[:-1], -1, 1, 2).to(torch.float32)
         x_out = (rope * x_).sum(dim=-1)
    -    return x_out.reshape(*x.shape).to(torch.bfloat16)
    +    return x_out.reshape(*x.shape).to(orig_dtype)

Comment on lines +185 to +186

    if hasattr(self, "interrupt") and self.interrupt:
        continue

Severity: medium

Using continue when self.interrupt is set skips the body of each iteration but still walks through every remaining timestep, wasting CPU cycles. Replacing it with break terminates the denoising loop immediately, which is the expected behavior for an interruption.

Suggested change

     if hasattr(self, "interrupt") and self.interrupt:
    -    continue
    +    break
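
The difference between the two keywords can be seen in a toy loop. This is a minimal sketch, not the stage's actual code: `run_loop` and its parameters are hypothetical stand-ins for the denoising loop and the interrupt flag.

```python
def run_loop(num_timesteps: int, interrupt_at: int, stop_with_break: bool) -> int:
    """Hypothetical stand-in for a denoising loop; returns iterations visited."""
    interrupted = False
    visited = 0
    for step in range(num_timesteps):
        visited += 1
        if step == interrupt_at:
            interrupted = True  # simulate an external interrupt flag being set
        if interrupted:
            if stop_with_break:
                break  # leave the loop immediately
            continue  # skip the body but keep iterating
        # The model forward pass and scheduler step would run here.
    return visited


if __name__ == "__main__":
    print(run_loop(80, 10, stop_with_break=True))   # iterations visited with break
    print(run_loop(80, 10, stop_with_break=False))  # iterations visited with continue
```

With an interrupt raised at step index 10 of 80, the `break` variant visits 11 iterations while the `continue` variant still visits all 80.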

Comment on lines +81 to +82

    else:
        latents = batch.latents.to(device=device, dtype=dtype)

Severity: medium

When custom or pre-computed latents are provided via batch.latents, it is important to validate that their shape matches the expected latent shape to prevent runtime shape mismatch errors later in the pipeline.

Suggested change

     else:
    +    if list(batch.latents.shape) != list(shape):
    +        raise ValueError(f"Provided latents shape {list(batch.latents.shape)} does not match expected shape {list(shape)}.")
         latents = batch.latents.to(device=device, dtype=dtype)
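
The same check can be exercised without a GPU or tensors. The sketch below assumes shapes are plain integer sequences (a `torch.Size` is a tuple subclass, so it would also work); `check_latents_shape` is a hypothetical helper and the shapes are illustrative, not Kandinsky-5's real latent dimensions.

```python
from typing import Sequence


def check_latents_shape(provided: Sequence[int], expected: Sequence[int]) -> None:
    """Hypothetical helper: fail early when user-supplied latents have the wrong shape."""
    if tuple(provided) != tuple(expected):
        raise ValueError(
            f"Provided latents shape {list(provided)} does not match "
            f"expected shape {list(expected)}."
        )


if __name__ == "__main__":
    expected = (1, 16, 8, 32, 48)  # illustrative 5-D shape only
    check_latents_shape([1, 16, 8, 32, 48], expected)  # passes silently
    try:
        check_latents_shape((1, 16, 8, 32, 64), expected)
    except ValueError as err:
        print(err)
```

Failing at latent preparation gives a message that names both shapes, instead of a less direct shape error deeper in the transformer.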

Comment on lines +78 to +83

    def __post_init__(self) -> None:
        if len(self.text_encoder_configs) != 2:
            raise ValueError(
                f"Kandinsky5 pipeline requires exactly 2 text encoders (qwen and clip), "
                f"but got {len(self.text_encoder_configs)} encoder(s)."
            )

Severity: medium

In addition to validating the number of text encoder configs, we should also validate that text_encoder_precisions and text_encoder_max_lengths have exactly 2 elements to prevent potential out-of-bounds index errors during loading or encoding.

    def __post_init__(self) -> None:
        if len(self.text_encoder_configs) != 2:
            raise ValueError(
                f"Kandinsky5 pipeline requires exactly 2 text encoders (qwen and clip), "
                f"but got {len(self.text_encoder_configs)} encoder(s)."
            )
        if len(self.text_encoder_precisions) != 2:
            raise ValueError(
                f"Kandinsky5 pipeline requires exactly 2 text encoder precisions, "
                f"but got {len(self.text_encoder_precisions)}."
            )
        if len(self.text_encoder_max_lengths) != 2:
            raise ValueError(
                f"Kandinsky5 pipeline requires exactly 2 text encoder max lengths, "
                f"but got {len(self.text_encoder_max_lengths)}."
            )
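
The three checks share one shape, so they could also be written as a loop over field names. This is a sketch under stated assumptions: `TwoEncoderConfigSketch` is a hypothetical dataclass, and its default values are placeholders rather than the PR's real encoder configs, precisions, or max lengths.

```python
from dataclasses import dataclass


@dataclass
class TwoEncoderConfigSketch:
    """Hypothetical config; defaults are placeholders, not the PR's values."""

    text_encoder_configs: tuple = ("qwen", "clip")
    text_encoder_precisions: tuple = ("bf16", "fp32")
    text_encoder_max_lengths: tuple = (256, 77)

    def __post_init__(self) -> None:
        # Every per-encoder field must have one entry per encoder (qwen and clip).
        for name in (
            "text_encoder_configs",
            "text_encoder_precisions",
            "text_encoder_max_lengths",
        ):
            count = len(getattr(self, name))
            if count != 2:
                raise ValueError(f"{name} must have exactly 2 entries, but got {count}.")


if __name__ == "__main__":
    TwoEncoderConfigSketch()  # valid: every field has two entries
    try:
        TwoEncoderConfigSketch(text_encoder_precisions=("bf16",))
    except ValueError as err:
        print(err)
```

The loop keeps the error messages uniform; the explicit per-field version in the review comment allows more specific wording for each field.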

Comment on lines +45 to +48

    required_divisor = spatial_ratio * patch_size[1]
    if height % required_divisor != 0 or width % required_divisor != 0:
        raise ValueError(f"Kandinsky5 height/width must be divisible by {required_divisor}; "
                         f"got height={height}, width={width}.")

Severity: medium

Checking only patch_size[1] for both height and width divisibility assumes that height and width patch sizes are always identical. To support potential asymmetric patch sizes correctly, patch_size[1] should be used for height and patch_size[2] for width.

Suggested change

    -required_divisor = spatial_ratio * patch_size[1]
    -if height % required_divisor != 0 or width % required_divisor != 0:
    -    raise ValueError(f"Kandinsky5 height/width must be divisible by {required_divisor}; "
    -                     f"got height={height}, width={width}.")
    +required_divisor_h = spatial_ratio * patch_size[1]
    +required_divisor_w = spatial_ratio * patch_size[2]
    +if height % required_divisor_h != 0 or width % required_divisor_w != 0:
    +    raise ValueError(f"Kandinsky5 height must be divisible by {required_divisor_h} and width by {required_divisor_w}; "
    +                     f"got height={height}, width={width}.")
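
A small worked case shows where the single-divisor check would let an invalid width through. The sketch assumes `patch_size` is ordered (time, height, width), as the comment implies; `check_spatial_divisibility` is a hypothetical helper, and the spatial ratio of 8 and the patch sizes are illustrative values, not confirmed Kandinsky-5 settings.

```python
from typing import Tuple


def check_spatial_divisibility(
    height: int, width: int, spatial_ratio: int, patch_size: Tuple[int, int, int]
) -> Tuple[int, int]:
    """Hypothetical helper: validate height and width against their own divisors."""
    _, patch_h, patch_w = patch_size  # assumed order: (time, height, width)
    divisor_h = spatial_ratio * patch_h
    divisor_w = spatial_ratio * patch_w
    if height % divisor_h != 0 or width % divisor_w != 0:
        raise ValueError(
            f"height must be divisible by {divisor_h} and width by {divisor_w}; "
            f"got height={height}, width={width}."
        )
    return divisor_h, divisor_w


if __name__ == "__main__":
    # Symmetric patches: both divisors are 8 * 2 = 16.
    print(check_spatial_divisibility(512, 768, 8, (1, 2, 2)))
    # Asymmetric patches: width 784 is divisible by 16 but not by 8 * 4 = 32.
    try:
        check_spatial_divisibility(512, 784, 8, (1, 2, 4))
    except ValueError as err:
        print(err)
```

In the asymmetric case, a check that used only `patch_size[1]` would accept width 784 (784 = 49 × 16), while the per-axis check rejects it.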

Comment on lines +232 to +236

    def verify_input(self, batch: ForwardBatch, fastvideo_args: FastVideoArgs) -> VerificationResult:
        result = VerificationResult()
        result.add_check("latents", batch.latents, [V.is_tensor, V.with_dims(5)])
        result.add_check("prompt_embeds", batch.prompt_embeds, V.list_not_empty)
        return result

Severity: medium

Since Kandinsky5DenoisingStage.forward accesses batch.prompt_embeds[0] and batch.prompt_embeds[1], we should validate that batch.prompt_embeds contains at least 2 elements in verify_input to prevent unhandled IndexError crashes.

Suggested change

     def verify_input(self, batch: ForwardBatch, fastvideo_args: FastVideoArgs) -> VerificationResult:
         result = VerificationResult()
         result.add_check("latents", batch.latents, [V.is_tensor, V.with_dims(5)])
    -    result.add_check("prompt_embeds", batch.prompt_embeds, V.list_not_empty)
    +    result.add_check("prompt_embeds", batch.prompt_embeds, [V.is_list, lambda x: len(x) >= 2])
         return result
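
A named validator can make the intent of the lambda easier to read and also tolerates non-list inputs on its own. This is a sketch only: `list_with_min_length` is a hypothetical factory, and whether FastVideo's validator helpers accept arbitrary callables in this way is an assumption taken from the suggestion above.

```python
from typing import Any, Callable


def list_with_min_length(min_length: int) -> Callable[[Any], bool]:
    """Hypothetical validator factory: value must be a list with enough entries."""

    def _check(value: Any) -> bool:
        # The isinstance test runs first, so None or a tensor never reaches len().
        return isinstance(value, list) and len(value) >= min_length

    return _check


if __name__ == "__main__":
    needs_two = list_with_min_length(2)
    print(needs_two(["qwen_embeds", "clip_embeds"]))  # two entries
    print(needs_two(["qwen_embeds"]))                 # one entry
    print(needs_two(None))                            # not a list
```

Because the type check and the length check live in one callable, the result does not depend on whether the surrounding framework short-circuits a list of validators.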
