We will continuously update DiffSynth-Engine to support more models.

After the model is downloaded, load the model with the corresponding pipeline and perform inference.
### Image Generation (Qwen-Image)
The following code calls `QwenImagePipeline` to load the [Qwen-Image](https://www.modelscope.cn/models/Qwen/Qwen-Image) model and generate an image. Recommended resolutions are 928×1664, 1104×1472, 1328×1328, 1472×1104, and 1664×928, with a suggested `cfg_scale` of 4. If no `negative_prompt` is provided, it defaults to a single space character (not an empty string). For multi-GPU parallelism, only CFG parallelism is currently supported (`parallelism=2`); further parallel optimizations are under way.
```python
from diffsynth_engine import fetch_model, QwenImagePipeline, QwenImagePipelineConfig

# The rest of this example is a sketch: the download path pattern and config
# fields are illustrative and may differ between versions.
model_path = fetch_model("Qwen/Qwen-Image", path="transformer/*.safetensors")
config = QwenImagePipelineConfig(model_path=model_path, device="cuda")
pipe = QwenImagePipeline.from_pretrained(config)

image = pipe(prompt="a cat", cfg_scale=4, width=1328, height=1328, seed=42)
image.save("image.png")
```
Please note that if some necessary modules, like text encoders, are missing from a model repository, the pipeline will automatically download the required files.
#### Detailed Parameters (Qwen-Image)
In the image generation pipeline `pipe`, we can use the following parameters for fine-grained control (a combined usage sketch follows the list):

* `prompt`: The prompt, used to describe the content of the generated image. It supports multiple languages (Chinese, English, Japanese, etc.), e.g., "一只猫" (Chinese), "a cat" (English), or "庭を走る猫" (Japanese).
* `negative_prompt`: The negative prompt, used to describe content you do not want in the image, e.g., "ugly". It defaults to a single space character (not an empty string).
* `cfg_scale`: The guidance scale for [Classifier-Free Guidance](https://arxiv.org/abs/2207.12598). A larger value usually results in stronger correlation between the text and the image but reduces the diversity of the generated content.
* `height`: Image height.
* `width`: Image width.
* `num_inference_steps`: The number of inference steps. Generally, more steps lead to longer computation time but higher image quality.
* `seed`: The random seed. A fixed seed ensures reproducible results.
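For instance, a single call combining these parameters might look like the sketch below. It assumes the `pipe` object created in the loading example above; all values are illustrative.

```python
# Illustrative combination of the parameters documented above, applied to
# the Qwen-Image pipeline `pipe` from the loading example.
image = pipe(
    prompt="一只猫",              # multilingual prompts are supported
    negative_prompt="ugly",       # omit to fall back to the single-space default
    cfg_scale=4,                  # suggested guidance scale for Qwen-Image
    height=1472,
    width=1104,                   # one of the recommended resolutions
    num_inference_steps=30,       # more steps: slower, usually higher quality
    seed=42,                      # fix the seed for reproducible results
)
image.save("qwen_image.png")
```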
### Image Generation
The following code calls `FluxImagePipeline` to load the [MajicFlus](https://www.modelscope.cn/models/MAILAND/majicflus_v1/summary?version=v1.0) model and generate an image. To load other types of models, replace `FluxImagePipeline` and `FluxPipelineConfig` in the code with the corresponding pipeline and config.
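A minimal loading-and-inference sketch is shown below; the weight filename and config fields are assumptions, so consult the model page for the actual file layout.

```python
from diffsynth_engine import fetch_model, FluxImagePipeline, FluxPipelineConfig

# Assumed weight filename; verify it against the MAILAND/majicflus_v1 model page.
model_path = fetch_model("MAILAND/majicflus_v1", path="majicflus_v134.safetensors")

# Config construction is a sketch; field names may differ between versions.
config = FluxPipelineConfig(model_path=model_path, device="cuda")
pipe = FluxImagePipeline.from_pretrained(config)

image = pipe(prompt="a cat")
image.save("image.png")
```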
Please note that if some necessary modules, like text encoders, are missing from a model repository, the pipeline will automatically download the required files.

#### Detailed Parameters

In the image generation pipeline `pipe`, we can use the following parameters for fine-grained control (a combined usage sketch follows the list):
* `prompt`: The prompt, used to describe the content of the generated image, e.g., "a cat".
* `negative_prompt`: The negative prompt, used to describe content you do not want in the image, e.g., "ugly".
* `cfg_scale`: The guidance scale for [Classifier-Free Guidance](https://arxiv.org/abs/2207.12598). A larger value usually results in stronger correlation between the text and the image but reduces the diversity of the generated content.
* `clip_skip`: The number of layers to skip in the [CLIP](https://arxiv.org/abs/2103.00020) text encoder. The more layers skipped, the lower the text-image correlation, but this can lead to interesting variations in the generated content.
* `input_image`: Input image, used for image-to-image generation.
* `denoising_strength`: The denoising strength. When set to 1, a full generation process is performed. When set to a value between 0 and 1, some information from the input image is preserved.
* `height`: Image height.
* `width`: Image width.
* `num_inference_steps`: The number of inference steps. Generally, more steps lead to longer computation time but higher image quality.
* `seed`: The random seed. A fixed seed ensures reproducible results.
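As an illustration, an image-to-image call might combine these parameters as follows. The `pipe` object is assumed from the MajicFlus loading sketch above, and all values are examples.

```python
from PIL import Image

# Illustrative image-to-image call on the FluxImagePipeline `pipe` above.
source = Image.open("input.png")  # assumed local source image
image = pipe(
    prompt="a cat",
    negative_prompt="ugly",
    cfg_scale=3.5,              # example guidance strength
    clip_skip=1,                # skip one CLIP text-encoder layer
    input_image=source,         # switches to image-to-image generation
    denoising_strength=0.6,     # preserve some structure of the input image
    height=1024,
    width=1024,
    num_inference_steps=30,
    seed=42,
)
image.save("image2image.png")
```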
### Video Generation

In the video generation pipeline `pipe`, we can use the following parameters for fine-grained control (a combined usage sketch follows the list):
* `prompt`: The prompt, used to describe the content of the generated video, e.g., "a cat".
* `negative_prompt`: The negative prompt, used to describe content you do not want in the video, e.g., "ugly".
* `cfg_scale`: The guidance scale for [Classifier-Free Guidance](https://arxiv.org/abs/2207.12598). A larger value usually results in stronger correlation between the text and the video but reduces the diversity of the generated content.
* `input_image`: Input image, only effective in image-to-video models, such as [Wan-AI/Wan2.1-I2V-14B-720P](https://modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-720P).
* `input_video`: Input video, used for video-to-video generation.
* `denoising_strength`: The denoising strength. When set to 1, a full generation process is performed. When set to a value between 0 and 1, some information from the input video is preserved.
* `height`: Video frame height.
* `width`: Video frame width.
* `num_frames`: Number of video frames.
* `num_inference_steps`: The number of inference steps. Generally, more steps lead to longer computation time but higher video quality.
* `seed`: The random seed. A fixed seed ensures reproducible results.
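For example, a hypothetical image-to-video call could combine these parameters as follows, assuming `pipe` is a video pipeline loaded from an image-to-video model such as [Wan-AI/Wan2.1-I2V-14B-720P](https://modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-720P); all values are illustrative.

```python
from PIL import Image

# Illustrative image-to-video call; `pipe` is assumed to be a video pipeline
# built from an image-to-video model such as Wan2.1-I2V-14B-720P.
first_frame = Image.open("cat.png")  # assumed local image
video = pipe(
    prompt="a cat running in a garden",
    negative_prompt="ugly",
    cfg_scale=5,                # example guidance strength
    input_image=first_frame,    # only effective for image-to-video models
    height=720,
    width=1280,                 # 720p frame size
    num_frames=81,              # number of generated frames
    num_inference_steps=30,
    seed=42,
)
```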