UPSTREAM PR #1357: Inpaint improvements #90

Open
loci-dev wants to merge 2 commits into main from
loci/pr-1357-inpaint-imporvements

@loci-dev

Note

Source pull request: leejet/stable-diffusion.cpp#1357

  • For all models in inpaint mode: improve mask downsampling to latent size by taking the maximum over each 8x8 pixel patch instead of sampling a single pixel in the corner of the patch.

  • For inpaint models: use masked diffusion for inpaint models too, to reduce color shift. It was previously disabled for these models because of artifacts near the edges of the mask; this is fixed by inflating the mask by 1 latent pixel for diffusion.
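
To illustrate the first point, here is a minimal sketch of max-based mask downsampling. It is not the code from this PR: `downsample_mask_max` is a hypothetical helper, the mask is assumed to be a row-major float buffer in [0, 1] where larger values mark the area to repaint, and the image size is assumed to be a multiple of the downsampling factor.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the PR): shrink a row-major mask by `factor`
// in each dimension, keeping the maximum value found in each factor x factor patch.
std::vector<float> downsample_mask_max(const std::vector<float>& mask,
                                       std::size_t width,
                                       std::size_t height,
                                       std::size_t factor = 8) {
    assert(factor > 0 && mask.size() == width * height);
    assert(width % factor == 0 && height % factor == 0);  // assumption of this sketch

    const std::size_t lw = width / factor;   // latent width
    const std::size_t lh = height / factor;  // latent height
    std::vector<float> latent(lw * lh, 0.0f);

    for (std::size_t ly = 0; ly < lh; ++ly) {
        for (std::size_t lx = 0; lx < lw; ++lx) {
            float m = 0.0f;
            // Scan the whole patch rather than reading only its corner pixel.
            for (std::size_t dy = 0; dy < factor; ++dy) {
                for (std::size_t dx = 0; dx < factor; ++dx) {
                    const std::size_t y = ly * factor + dy;
                    const std::size_t x = lx * factor + dx;
                    m = std::max(m, mask[y * width + x]);
                }
            }
            latent[ly * lw + lx] = m;
        }
    }
    return latent;
}
```

With corner sampling, a masked pixel that does not sit exactly on a patch corner would be dropped from the latent mask; taking the maximum keeps any patch that contains at least one masked pixel.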

Example:

.\build\bin\sd-cli.exe --model ..\ComfyUI\models\checkpoints\sdxl\dreamshaperXL_lightningInpaint.safetensors -p "a dog sitting on a bench" --color --steps 16 --cfg-scale 1 --sampling-method euler_a --preview proj --preview-noisy --img-cfg-scale 1 -i .\bench.png --mask .\bench_mask.png --strength 1

[Images: original (bench) | mask (bench_mask)]
[Images: master | PR]
[Image: PR just enabling masked diffusion without inflating the mask (PR no-inflate)]

(The difference is not very noticeable in this specific example, but there are some noticeable "floaters" around the masked areas in the image on the right if you look closely.)

@loci-dev loci-dev temporarily deployed to stable-diffusion-cpp-prod March 21, 2026 04:56 — with GitHub Actions Inactive
@loci-review

loci-review bot commented Mar 21, 2026

Flame Graph: build.bin.sd-server::_Z23generate_image_internalP8sd_ctx_tP12ggml_contextP11ggml_tensorNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESA_i20sd_guidance_params_tfiii15sample_method_tRKSt6vectorIfSaIfEEli10sd_image_tf14sd_pm_params_tSD_IPSI_SaISK_EESD_IS4_SaIS4_EEbS4_S4_PK17sd_cache_params_t

Target version:

Flame Graph: build.bin.sd-server::_Z23generate_image_internalP8sd_ctx_tP12ggml_contextP11ggml_tensorNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESA_i20sd_guidance_params_tfiii15sample_method_tRKSt6vectorIfSaIfEEli10sd_image_tf14sd_pm_params_tSD_IPSI_SaISK_EESD_IS4_SaIS4_EEbS4_S4_PK17sd_cache_params_t

The target version adds ggml_ext_dup_and_cpy_tensor (7.6% of execution time) for mask processing. Core operations (get_pmid_conditon 45.5%, apply 21.2%, sample 18.8%) remain unchanged, confirming that the performance impact is isolated to the new mask preprocessing functionality.

Additional Findings

GPU/ML Operations: All changes are CPU-side mask preprocessing; GPU inference pipeline (text encoding, denoising, VAE operations) completely unaffected. The 70 µs CPU overhead does not impact GPU utilization or inference performance.

Commits: Two commits implement inpainting quality improvements: mask inflation via 3×3 max-pooling (50974ff) and max-pooling downsampling to prevent single-pixel sampling artifacts (f2fb03b). Performance cost is justified by significant visual quality improvements at mask boundaries.
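
As a rough illustration of what inflating a mask by one latent pixel with 3×3 max-pooling means, consider the sketch below. It is not taken from commit 50974ff: `inflate_mask_3x3` is a hypothetical name, the mask is assumed to be a row-major float buffer at latent resolution, and border pixels are assumed to use only the neighbours that exist.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the PR): each output pixel becomes the maximum
// of its 3x3 neighbourhood (stride 1), which grows the masked region by one pixel.
std::vector<float> inflate_mask_3x3(const std::vector<float>& mask,
                                    std::size_t width,
                                    std::size_t height) {
    assert(mask.size() == width * height);
    std::vector<float> out(mask.size(), 0.0f);

    for (std::size_t y = 0; y < height; ++y) {
        // Clamp the window to the image so border pixels stay in range.
        const std::size_t y0 = (y > 0) ? y - 1 : 0;
        const std::size_t y1 = std::min(y + 1, height - 1);
        for (std::size_t x = 0; x < width; ++x) {
            const std::size_t x0 = (x > 0) ? x - 1 : 0;
            const std::size_t x1 = std::min(x + 1, width - 1);
            float m = 0.0f;
            for (std::size_t yy = y0; yy <= y1; ++yy) {
                for (std::size_t xx = x0; xx <= x1; ++xx) {
                    m = std::max(m, mask[yy * width + xx]);
                }
            }
            out[y * width + x] = m;
        }
    }
    return out;
}
```

A single masked latent pixel therefore becomes a 3×3 masked block, so the masked diffusion step also covers the one-pixel ring around the original mask edge.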

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev
