Commit e259c49

Doc/qwen image (#133)
* update qwen image doc
* fix tutorial_zh resolutions
* fix typo

Co-authored-by: zhuguoxuan.zgx <[email protected]>
1 parent 6c62194 commit e259c49

2 files changed: +131 -40 lines changed

2 files changed

+131
-40
lines changed

docs/tutorial.md

Lines changed: 67 additions & 21 deletions
@@ -88,6 +88,52 @@ We will continuously update DiffSynth-Engine to support more models. (Wan2.2 LoR
 
After the model is downloaded, load the model with the corresponding pipeline and perform inference.

+
+### Image Generation (Qwen-Image)
+
+The following code calls `QwenImagePipeline` to load the [Qwen-Image](https://www.modelscope.cn/models/Qwen/Qwen-Image) model and generate an image. Recommended resolutions are 928×1664, 1104×1472, 1328×1328, 1472×1104, and 1664×928, with a suggested `cfg_scale` of 4. If no `negative_prompt` is provided, it defaults to a single space character (not an empty string). For multi-GPU parallelism, only CFG parallelism (`parallelism=2`) is currently supported; other optimizations are underway.
+
+```python
+from diffsynth_engine import fetch_model, QwenImagePipeline, QwenImagePipelineConfig
+
+config = QwenImagePipelineConfig.basic_config(
+    model_path=fetch_model("MusePublic/Qwen-image", revision="v1", path="transformer/*.safetensors"),
+    encoder_path=fetch_model("MusePublic/Qwen-image", revision="v1", path="text_encoder/*.safetensors"),
+    vae_path=fetch_model("MusePublic/Qwen-image", revision="v1", path="vae/*.safetensors"),
+    parallelism=2,
+)
+pipe = QwenImagePipeline.from_pretrained(config)
+
+prompt = """
+一副典雅庄重的对联悬挂于厅堂之中,房间是个安静古典的中式布置,桌子上放着一些青花瓷,对联上左书“思涌如泉万类灵感皆可触”,右书“智启于问千机代码自天成”,横批“AI脑洞力”,字体飘逸灵动,兼具传统笔意与未来感。中间挂着一幅中国风的画作,内容是岳阳楼,云雾缭绕间似有数据流光隐现,古今交融,意境深远。
+"""
+negative_prompt = " "
+image = pipe(
+    prompt=prompt,
+    negative_prompt=negative_prompt,
+    cfg_scale=4.0,
+    width=1104,
+    height=1472,
+    num_inference_steps=30,
+    seed=42,
+)
+image.save("image.png")
+```
+
+Please note that if some necessary modules, like text encoders, are missing from a model repository, the pipeline will automatically download the required files.
+
+#### Detailed Parameters (Qwen-Image)
+
+In the image generation pipeline `pipe`, we can use the following parameters for fine-grained control:
+
+* `prompt`: The prompt, used to describe the content of the generated image. It supports multiple languages (Chinese, English, Japanese, etc.), e.g., "一只猫" (Chinese), "a cat" (English), or "庭を走る猫" (Japanese).
+* `negative_prompt`: The negative prompt, used to describe content you do not want in the image, e.g., "ugly". It defaults to a single space character (not an empty string).
+* `cfg_scale`: The guidance scale for [Classifier-Free Guidance](https://arxiv.org/abs/2207.12598). A larger value usually results in stronger correlation between the text and the image but reduces the diversity of the generated content.
+* `height`: Image height.
+* `width`: Image width.
+* `num_inference_steps`: The number of inference steps. Generally, more steps lead to longer computation time but higher image quality.
+* `seed`: The random seed. A fixed seed ensures reproducible results.
+
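The example above uses two GPUs for CFG parallelism. On a single GPU, the same pipeline can presumably be built by simply omitting `parallelism`; the sketch below assumes that argument is optional in `basic_config`, which this page does not confirm.

```python
# Minimal single-GPU sketch. Assumption: `parallelism` can simply be
# omitted from basic_config; only the two-GPU form above is confirmed.
from diffsynth_engine import fetch_model, QwenImagePipeline, QwenImagePipelineConfig

config = QwenImagePipelineConfig.basic_config(
    model_path=fetch_model("MusePublic/Qwen-image", revision="v1", path="transformer/*.safetensors"),
    encoder_path=fetch_model("MusePublic/Qwen-image", revision="v1", path="text_encoder/*.safetensors"),
    vae_path=fetch_model("MusePublic/Qwen-image", revision="v1", path="vae/*.safetensors"),
)
pipe = QwenImagePipeline.from_pretrained(config)

# 1328×1328 is one of the recommended square resolutions; cfg_scale=4 as suggested.
image = pipe(prompt="a cat", cfg_scale=4.0, width=1328, height=1328, num_inference_steps=30, seed=42)
image.save("cat.png")
```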
### Image Generation

The following code calls `FluxImagePipeline` to load the [MajicFlus](https://www.modelscope.cn/models/MAILAND/majicflus_v1/summary?version=v1.0) model and generate an image. To load other types of models, replace `FluxImagePipeline` and `FluxPipelineConfig` in the code with the corresponding pipeline and config.
@@ -109,16 +155,16 @@ Please note that if some necessary modules, like text encoders, are missing from
 
In the image generation pipeline `pipe`, we can use the following parameters for fine-grained control:

-* `prompt`: The prompt, used to describe the content of the generated image, e.g., "a cat".
-* `negative_prompt`: The negative prompt, used to describe content you do not want in the image, e.g., "ugly".
-* `cfg_scale`: The guidance scale for [Classifier-Free Guidance](https://arxiv.org/abs/2207.12598). A larger value usually results in stronger correlation between the text and the image but reduces the diversity of the generated content.
-* `clip_skip`: The number of layers to skip in the [CLIP](https://arxiv.org/abs/2103.00020) text encoder. The more layers skipped, the lower the text-image correlation, but this can lead to interesting variations in the generated content.
-* `input_image`: Input image, used for image-to-image generation.
-* `denoising_strength`: The denoising strength. When set to 1, a full generation process is performed. When set to a value between 0 and 1, some information from the input image is preserved.
-* `height`: Image height.
-* `width`: Image width.
-* `num_inference_steps`: The number of inference steps. Generally, more steps lead to longer computation time but higher image quality.
-* `seed`: The random seed. A fixed seed ensures reproducible results.
+* `prompt`: The prompt, used to describe the content of the generated image, e.g., "a cat".
+* `negative_prompt`: The negative prompt, used to describe content you do not want in the image, e.g., "ugly".
+* `cfg_scale`: The guidance scale for [Classifier-Free Guidance](https://arxiv.org/abs/2207.12598). A larger value usually results in stronger correlation between the text and the image but reduces the diversity of the generated content.
+* `clip_skip`: The number of layers to skip in the [CLIP](https://arxiv.org/abs/2103.00020) text encoder. The more layers skipped, the lower the text-image correlation, but this can lead to interesting variations in the generated content.
+* `input_image`: Input image, used for image-to-image generation.
+* `denoising_strength`: The denoising strength. When set to 1, a full generation process is performed. When set to a value between 0 and 1, some information from the input image is preserved.
+* `height`: Image height.
+* `width`: Image width.
+* `num_inference_steps`: The number of inference steps. Generally, more steps lead to longer computation time but higher image quality.
+* `seed`: The random seed. A fixed seed ensures reproducible results.
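The `input_image` and `denoising_strength` parameters enable image-to-image generation, which is not demonstrated above. The following is a minimal sketch, assuming `pipe` is the `FluxImagePipeline` built in the text-to-image example, that `input_image` accepts a `PIL.Image`, and with an illustrative file name and resolution.

```python
from PIL import Image

# Hypothetical image-to-image call: `pipe` is assumed to be the
# FluxImagePipeline built earlier; parameter names follow the list above.
init_image = Image.open("image.png").resize((1024, 1024))
image = pipe(
    prompt="a cat, oil painting style",
    negative_prompt="ugly",
    input_image=init_image,
    denoising_strength=0.6,  # between 0 and 1: keeps some structure of the input image
    height=1024,
    width=1024,
    num_inference_steps=30,
    seed=42,
)
image.save("image2image.png")
```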

#### Loading LoRA

@@ -177,17 +223,17 @@ save_video(video, "video.mp4")
 
In the video generation pipeline `pipe`, we can use the following parameters for fine-grained control:

-* `prompt`: The prompt, used to describe the content of the generated video, e.g., "a cat".
-* `negative_prompt`: The negative prompt, used to describe content you do not want in the video, e.g., "ugly".
-* `cfg_scale`: The guidance scale for [Classifier-Free Guidance](https://arxiv.org/abs/2207.12598). A larger value usually results in stronger correlation between the text and the video but reduces the diversity of the generated content.
-* `input_image`: Input image, only effective in image-to-video models, such as [Wan-AI/Wan2.1-I2V-14B-720P](https://modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-720P).
-* `input_video`: Input video, used for video-to-video generation.
-* `denoising_strength`: The denoising strength. When set to 1, a full generation process is performed. When set to a value between 0 and 1, some information from the input video is preserved.
-* `height`: Video frame height.
-* `width`: Video frame width.
-* `num_frames`: Number of video frames.
-* `num_inference_steps`: The number of inference steps. Generally, more steps lead to longer computation time but higher video quality.
-* `seed`: The random seed. A fixed seed ensures reproducible results.
+* `prompt`: The prompt, used to describe the content of the generated video, e.g., "a cat".
+* `negative_prompt`: The negative prompt, used to describe content you do not want in the video, e.g., "ugly".
+* `cfg_scale`: The guidance scale for [Classifier-Free Guidance](https://arxiv.org/abs/2207.12598). A larger value usually results in stronger correlation between the text and the video but reduces the diversity of the generated content.
+* `input_image`: Input image, only effective in image-to-video models, such as [Wan-AI/Wan2.1-I2V-14B-720P](https://modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-720P).
+* `input_video`: Input video, used for video-to-video generation.
+* `denoising_strength`: The denoising strength. When set to 1, a full generation process is performed. When set to a value between 0 and 1, some information from the input video is preserved.
+* `height`: Video frame height.
+* `width`: Video frame width.
+* `num_frames`: Number of video frames.
+* `num_inference_steps`: The number of inference steps. Generally, more steps lead to longer computation time but higher video quality.
+* `seed`: The random seed. A fixed seed ensures reproducible results.
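Combining `input_image` with `num_frames` gives image-to-video generation. This sketch assumes `pipe` is a video pipeline loaded from an image-to-video model such as [Wan-AI/Wan2.1-I2V-14B-720P](https://modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-720P), that `save_video` is importable from `diffsynth_engine` (the helper used in the `save_video(video, "video.mp4")` example above), and that the resolution and frame count are merely illustrative.

```python
from PIL import Image
from diffsynth_engine import save_video  # assumed import path for the helper used above

# Hypothetical image-to-video call: `pipe` is assumed to be a pipeline
# loaded from an image-to-video model such as Wan-AI/Wan2.1-I2V-14B-720P.
first_frame = Image.open("image.png")
video = pipe(
    prompt="a cat running in a garden",
    input_image=first_frame,  # only effective for image-to-video models
    height=720,
    width=1280,
    num_frames=81,  # illustrative frame count
    num_inference_steps=30,
    seed=42,
)
save_video(video, "video.mp4")
```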

#### Loading LoRA

docs/tutorial_zh.md

Lines changed: 64 additions & 19 deletions
@@ -88,6 +88,51 @@ Diffusion models come in many different architectures, and each is loaded for inference by its corresponding pipeline
 
After the model is downloaded, we can choose the pipeline matching the model type, load the model, and run inference.

+### Image Generation (Qwen-Image)
+
+The following code calls `QwenImagePipeline` to load the [Qwen-Image](https://www.modelscope.cn/models/Qwen/Qwen-Image) model and generate an image. Recommended resolutions are 928×1664, 1104×1472, 1328×1328, 1472×1104, and 1664×928, with a recommended `cfg_scale` of 4. If no `negative_prompt` is given, it defaults to a single space rather than an empty string. For multi-GPU parallelism, CFG parallelism is currently supported (`parallelism=2`); other optimization work is in progress.
+
+```python
+from diffsynth_engine import fetch_model, QwenImagePipeline, QwenImagePipelineConfig
+
+config = QwenImagePipelineConfig.basic_config(
+    model_path=fetch_model("MusePublic/Qwen-image", revision="v1", path="transformer/*.safetensors"),
+    encoder_path=fetch_model("MusePublic/Qwen-image", revision="v1", path="text_encoder/*.safetensors"),
+    vae_path=fetch_model("MusePublic/Qwen-image", revision="v1", path="vae/*.safetensors"),
+    parallelism=2,
+)
+pipe = QwenImagePipeline.from_pretrained(config)
+
+prompt = """
+一副典雅庄重的对联悬挂于厅堂之中,房间是个安静古典的中式布置,桌子上放着一些青花瓷,对联上左书“思涌如泉万类灵感皆可触”,右书“智启于问千机代码自天成”,横批“AI脑洞力”,字体飘逸灵动,兼具传统笔意与未来感。中间挂着一幅中国风的画作,内容是岳阳楼,云雾缭绕间似有数据流光隐现,古今交融,意境深远。
+"""
+negative_prompt = " "
+image = pipe(
+    prompt=prompt,
+    negative_prompt=negative_prompt,
+    cfg_scale=4.0,
+    width=1104,
+    height=1472,
+    num_inference_steps=30,
+    seed=42,
+)
+image.save("image.png")
+```
+
+Please note that some model repositories lack necessary modules such as text encoders; our code will automatically download the required model files.
+
+#### Detailed Parameters (Qwen-Image)
+
+In the image generation pipeline `pipe`, we can use the following parameters for fine-grained control:
+
+* `prompt`: The prompt, used to describe the content of the generated image. Multiple languages are supported (Chinese/English/Japanese, etc.), e.g., "一只猫" / "a cat" / "庭を走る猫".
+* `negative_prompt`: The negative prompt, used to describe content you do not want in the image, e.g., "ugly". It defaults to a single space (" "), not an empty string.
+* `cfg_scale`: The guidance scale for [Classifier-free guidance](https://arxiv.org/abs/2207.12598). A larger value usually gives stronger text-image correlation but reduces the diversity of the generated content; the recommended value is 4.
+* `height`: Image height.
+* `width`: Image width.
+* `num_inference_steps`: The number of inference steps. More steps generally mean longer computation time and higher image quality.
+* `seed`: The random seed. A fixed seed makes the generated content reproducible.
+
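A practical consequence of the `negative_prompt` default: omitting the argument should behave exactly like passing a single space. A small sanity check under that reading, assuming generation is deterministic for a fixed seed:

```python
# Both calls should yield identical images for the same seed, since the
# documented default negative_prompt is " " (a single space), not "".
image_a = pipe(prompt="一只猫", cfg_scale=4.0, width=1328, height=1328, num_inference_steps=30, seed=7)
image_b = pipe(prompt="一只猫", negative_prompt=" ", cfg_scale=4.0, width=1328, height=1328, num_inference_steps=30, seed=7)
image_a.save("a.png")
image_b.save("b.png")
```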
### Image Generation

The following code calls `FluxImagePipeline` to load the [MajicFlus](https://www.modelscope.cn/models/MAILAND/majicflus_v1/summary?version=v1.0) model and generate an image. To load models with other architectures, replace `FluxImagePipeline` and `FluxPipelineConfig` in the code with the corresponding pipeline and config.
@@ -110,15 +155,15 @@ image.save("image.png")
In the image generation pipeline `pipe`, we can use the following parameters for fine-grained control:

* `prompt`: The prompt, used to describe the content of the generated image, e.g., "a cat".
-* `negative_prompt` The negative prompt, used to describe content you do not want in the image, e.g., "ugly".
-* `cfg_scale` The guidance scale for [Classifier-free guidance](https://arxiv.org/abs/2207.12598). A larger value usually gives stronger text-image correlation but reduces the diversity of the generated content.
-* `clip_skip` The number of layers to skip in the [CLIP](https://arxiv.org/abs/2103.00020) text encoder. The more layers skipped, the lower the text-image correlation, but the generated content may show interesting variations.
-* `input_image` The input image, used for image-to-image generation.
-* `denoising_strength` The denoising strength. When set to 1, a full generation process is performed; when set between 0 and 1, some information from the input image is preserved.
-* `height` Image height.
-* `width` Image width.
-* `num_inference_steps` The number of inference steps. More steps generally mean longer computation time and higher image quality.
-* `seed` The random seed. A fixed seed makes the generated content reproducible.
+* `negative_prompt`: The negative prompt, used to describe content you do not want in the image, e.g., "ugly".
+* `cfg_scale`: The guidance scale for [Classifier-free guidance](https://arxiv.org/abs/2207.12598). A larger value usually gives stronger text-image correlation but reduces the diversity of the generated content.
+* `clip_skip`: The number of layers to skip in the [CLIP](https://arxiv.org/abs/2103.00020) text encoder. The more layers skipped, the lower the text-image correlation, but the generated content may show interesting variations.
+* `input_image`: The input image, used for image-to-image generation.
+* `denoising_strength`: The denoising strength. When set to 1, a full generation process is performed; when set between 0 and 1, some information from the input image is preserved.
+* `height`: Image height.
+* `width`: Image width.
+* `num_inference_steps`: The number of inference steps. More steps generally mean longer computation time and higher image quality.
+* `seed`: The random seed. A fixed seed makes the generated content reproducible.

#### Loading LoRA

@@ -175,16 +220,16 @@ save_video(video, "video.mp4")
In the video generation pipeline `pipe`, we can use the following parameters for fine-grained control:

* `prompt`: The prompt, used to describe the content of the generated video, e.g., "a cat".
-* `negative_prompt` The negative prompt, used to describe content you do not want in the video, e.g., "ugly".
-* `cfg_scale` The guidance scale for [Classifier-free guidance](https://arxiv.org/abs/2207.12598). A larger value usually gives stronger text-video correlation but reduces the diversity of the generated content.
-* `input_image` The input image, only effective in image-to-video models, e.g., [Wan-AI/Wan2.1-I2V-14B-720P](https://modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-720P).
-* `input_video` The input video, used for video-to-video generation.
-* `denoising_strength` The denoising strength. When set to 1, a full generation process is performed; when set between 0 and 1, some information from the input video is preserved.
-* `height` Video frame height.
-* `width` Video frame width.
-* `num_frames` Number of video frames.
-* `num_inference_steps` The number of inference steps. More steps generally mean longer computation time and higher video quality.
-* `seed` The random seed. A fixed seed makes the generated content reproducible.
+* `negative_prompt`: The negative prompt, used to describe content you do not want in the video, e.g., "ugly".
+* `cfg_scale`: The guidance scale for [Classifier-free guidance](https://arxiv.org/abs/2207.12598). A larger value usually gives stronger text-video correlation but reduces the diversity of the generated content.
+* `input_image`: The input image, only effective in image-to-video models, e.g., [Wan-AI/Wan2.1-I2V-14B-720P](https://modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-720P).
+* `input_video`: The input video, used for video-to-video generation.
+* `denoising_strength`: The denoising strength. When set to 1, a full generation process is performed; when set between 0 and 1, some information from the input video is preserved.
+* `height`: Video frame height.
+* `width`: Video frame width.
+* `num_frames`: Number of video frames.
+* `num_inference_steps`: The number of inference steps. More steps generally mean longer computation time and higher video quality.
+* `seed`: The random seed. A fixed seed makes the generated content reproducible.

#### Loading LoRA
