Claude/pensive hoover 06d545 #360

Open
lin285170 wants to merge 39 commits into Wan-Video:main from lin285170:claude/pensive-hoover-06d545

@lin285170

No description provided.

lin285170 and others added 30 commits May 8, 2026 17:04
… serving docs.

- Refactor generate.py: build_parser, parse_args, args_from_job_dict for programmatic jobs
- Add generate_job.py for torchrun entrypoint
- Add serve/ package (FastAPI, Redis queue, multi-node launcher, worker)
- Add run_api_server.py, requirements_serve.txt, pyproject optional serve deps
- Add docker-compose.yml, docker/Dockerfiles, compose env example, .dockerignore
- Document deployment in DEPLOY_SERVE.md; extend .gitignore for placeholder ckpt
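
A minimal sketch of how the three entry points named above could fit together, so that a job dictionary and the command line share one parser. Only `--src_root_path` is confirmed by this PR's commit messages; the other flags and the function bodies are assumptions for illustration, not the actual generate.py code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Illustrative subset of flags; the real parser defines many more.
    parser = argparse.ArgumentParser(description="sketch of a generate.py-style parser")
    parser.add_argument("--task", type=str, default=None)           # assumed flag
    parser.add_argument("--prompt", type=str, default=None)         # assumed flag
    parser.add_argument("--size", type=str, default=None)           # assumed flag
    parser.add_argument("--src_root_path", type=str, default=None)  # named in this PR
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    # CLI path: parse sys.argv (or an explicit list) with the shared parser.
    return build_parser().parse_args(argv)


def args_from_job_dict(job: dict) -> argparse.Namespace:
    # Programmatic path: start from parser defaults, then overlay job fields.
    args = build_parser().parse_args([])
    known = vars(args)
    for key, value in job.items():
        if key not in known:
            # Reject fields the parser does not define instead of ignoring them.
            raise ValueError(f"Unknown job field: {key}")
        setattr(args, key, value)
    return args
```

Starting from the parser defaults keeps the CLI and the job path in sync: a flag added to `build_parser` becomes a valid job field without further changes.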

Co-authored-by: Cursor <cursoragent@cursor.com>
…nfig flexibility

- Extract shared pipeline logic (T5 encoding, VAE init, scheduler creation,
  model config, dual-expert switching, seed handling, distributed helpers)
  into WanPipelineBase in pipeline_base.py
- Refactor WanT2V, WanI2V, WanTI2V, WanS2V, WanAnimate to inherit from
  WanPipelineBase, reducing significant code duplication
- Remove hardcoded /home/HPCBase paths from serve/config.py and
  serve/launcher.py; conda_env and conda_exe now default to empty,
  LD_LIBRARY_PATH and OMP_NUM_THREADS are configurable via environment
  variables WAN_REMOTE_LD_LIBRARY_PATH and WAN_REMOTE_OMP_NUM_THREADS
- Add conda_env/conda_exe to Settings.from_env() for environment-based
  configuration
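
The environment-based configuration described above might look roughly like the sketch below. The two `WAN_REMOTE_*` variable names come from the commit message; the field names and the `WAN_CONDA_ENV` / `WAN_CONDA_EXE` variable names are hypothetical, since the commit does not state them.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    # Empty defaults replace the previously hardcoded site-specific paths.
    conda_env: str = ""
    conda_exe: str = ""
    remote_ld_library_path: str = ""
    remote_omp_num_threads: str = ""

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "Settings":
        # Accept an explicit mapping so the loader can be tested without
        # touching the real process environment.
        env = os.environ if env is None else env
        return cls(
            conda_env=env.get("WAN_CONDA_ENV", ""),  # hypothetical variable name
            conda_exe=env.get("WAN_CONDA_EXE", ""),  # hypothetical variable name
            remote_ld_library_path=env.get("WAN_REMOTE_LD_LIBRARY_PATH", ""),
            remote_omp_num_threads=env.get("WAN_REMOTE_OMP_NUM_THREADS", ""),
        )
```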
…f SSH

Two-node architecture: master (redis+api+worker0) and worker (worker1).
Each node runs torchrun locally; NCCL/TCP rendezvous connects them.
Master signals worker1 via Redis pub/sub to coordinate job dispatch.
- DEPLOY.md: step-by-step dual-node deployment instructions
- tests/test_api.py: integration tests for all API endpoints (auth, task CRUD, file download)
- tests/test_serve.py: unit tests for config, job_build, worker routing, store signals, launcher, schemas
… s2v)

- schemas: add ModelEnum for valid model IDs, video/mask fields for animate
- api: per-model input validation (i2v requires image, animate requires video, s2v requires audio or enable_tts)
- job_build: fill default size per model when not provided
- tests: 37 unit tests covering all model types and validation rules
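
The per-model rules listed above can be sketched as a plain function. This is an illustration only: the short model kinds (`"i2v"`, `"animate"`, `"s2v"`), the function name, and the error strings are assumptions, and the real API validates typed request schemas rather than bare keyword arguments.

```python
def validate_inputs(kind, image=None, video=None, audio=None, enable_tts=False):
    """Return a list of validation errors for one request (empty if valid)."""
    errors = []
    if kind == "i2v" and not image:
        errors.append("i2v requires image")
    if kind == "animate" and not video:
        errors.append("animate requires video")
    # s2v accepts either an audio input or text-to-speech being enabled.
    if kind == "s2v" and not (audio or enable_tts):
        errors.append("s2v requires audio or enable_tts")
    return errors
```

Returning a list rather than raising on the first failure lets the API report every missing input in a single response.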
…ployment

job_build now auto-appends model-specific subdirectory to global ckpt_dir:
  wan2.2-t2v-a14b → /ckpt/Wan2.2-T2V-A14B
  wan2.2-i2v-a14b → /ckpt/Wan2.2-I2V-A14B
  etc.

parameters.ckpt_dir still overrides for custom paths. Just specify the
model in the request and the system finds the right weights automatically.
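
A minimal sketch of the lookup described above, assuming a simple dictionary from model ID to subdirectory. Only the two mappings shown in the commit message are included; the function and constant names are hypothetical.

```python
import posixpath

# Model ID -> checkpoint subdirectory. Only the two entries listed in the
# commit message are reproduced here; the others are left out on purpose.
MODEL_SUBDIRS = {
    "wan2.2-t2v-a14b": "Wan2.2-T2V-A14B",
    "wan2.2-i2v-a14b": "Wan2.2-I2V-A14B",
}


def resolve_ckpt_dir(model, global_ckpt_dir, parameters=None):
    # An explicit parameters.ckpt_dir always wins over the derived path.
    override = (parameters or {}).get("ckpt_dir")
    if override:
        return override
    if model not in MODEL_SUBDIRS:
        raise ValueError(f"No checkpoint subdirectory known for model: {model}")
    return posixpath.join(global_ckpt_dir, MODEL_SUBDIRS[model])
```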

Updated docker-compose, .env example, DEPLOY.md for multi-model layout.

Critical fix: generate.py has --src_root_path but no --video arg.
VideoInput.video is now mapped to src_root_path in job_build.py,
preventing ValueError("Unknown job field: video") in generate_job.py.
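
The renaming step can be sketched as an alias table applied before the job dictionary reaches generate_job.py. The `video` to `src_root_path` pair is taken from the commit message; the helper name and the rule of dropping `None` values are assumptions.

```python
# API field name -> generate.py argument name. Fields without an entry
# pass through unchanged.
FIELD_ALIASES = {"video": "src_root_path"}


def to_job_fields(request_fields: dict) -> dict:
    job = {}
    for name, value in request_fields.items():
        if value is None:
            continue  # assumed: unset optional inputs are not forwarded
        job[FIELD_ALIASES.get(name, name)] = value
    return job
```

With this in place, a request carrying `video` produces a job whose keys all correspond to arguments that generate.py defines.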

Also fixes test_api.py:
- s2v test now includes required image field
- added per-model validation tests (i2v requires image, etc.)
- added animate and ti2v creation tests
- removed stale test_empty_prompt (now returns 400)
- fixed size format (use * instead of x)
…t issue

Docker bridge network isolates containers, preventing torchrun from
binding and exposing port 29500 for NCCL rendezvous. This causes the
"client socket timed out" error seen in production.

worker0 and worker1 now use network_mode: host (PyTorch's recommended
approach for distributed training). Redis and API stay on bridge network.

worker0 connects to Redis via localhost (WAN_REDIS_URL_LOCAL).
WAN_MASTER_ADDR must use the host's real IP (not 0.0.0.0).
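
The `WAN_MASTER_ADDR` constraint could be enforced at startup with a check along these lines. This is a sketch under the assumption that the value is either an IP literal or a hostname; the function name is hypothetical, and it rejects only the unspecified address, as the commit message requires.

```python
import ipaddress


def check_master_addr(addr: str) -> str:
    """Return addr unchanged, or raise if it is the unspecified address."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        # Not an IP literal; treat it as a hostname and accept it as-is.
        return addr
    if ip.is_unspecified:  # 0.0.0.0 for IPv4, :: for IPv6
        raise ValueError(
            "WAN_MASTER_ADDR must be the host's real IP, not " + addr
        )
    return addr
```

A typical call site would be `check_master_addr(os.environ["WAN_MASTER_ADDR"])` before launching torchrun, so a misconfigured address fails fast instead of surfacing later as a rendezvous timeout.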
…del_S2V)

The class was renamed in an earlier commit, but __init__.py still imported
the old name, causing an ImportError when running generate_job.py.

The class in audio_encoder.py is named AudioEncoder, not Wav2Vec2Encoder.
This ImportError prevented the s2v pipeline from loading.

Generator exists in motion_encoder.py, but animate.py imports MotionEncoder.
Upstream has the same mismatch: the MotionEncoder wrapper class was
missing. The added wrapper loads the Generator from a checkpoint and
exposes get_motion().