
[Bug] fp8 safetensors diffusion model is loaded as f16 instead of fp8, causing excessive VRAM usage #1347

@gaowanliang

Description


Environment

  • OS: Windows 11
  • GPU: NVIDIA GeForce RTX 4060 Laptop GPU (8GB VRAM)
  • CUDA: 12.x
  • Commit: d6dd6d7

Models

Expected behavior

The fp8 safetensors model should be loaded in fp8 precision, consuming roughly 5.8GB of VRAM.

Actual behavior

Instead, the fp8 model is silently converted to f16 on load, consuming ~11.7GB of VRAM.
On an 8GB GPU this causes an OOM and the output images come out black.

From the log:

[DEBUG] main.cpp:515  - version: stable-diffusion.cpp version unknown, commit d6dd6d7
[DEBUG] main.cpp:516  - System Info:
    SSE3 = 1 |     AVX = 1 |     AVX2 = 1 |     AVX512 = 1 |     AVX512_VBMI = 0 |     AVX512_VNNI = 0 |     FMA = 1 |     NEON = 0 |     ARM_FMA = 0 |     F16C = 1 |     FP16_VA = 0 |     WASM_SIMD = 0 |     VSX = 0 |
[DEBUG] main.cpp:517  - SDCliParams {
  mode: img_gen,
  output_path: "output.png",
  verbose: true,
  color: false,
  canny_preprocess: false,
  convert_name: false,
  preview_method: none,
  preview_interval: 1,
  preview_path: "preview.png",
  preview_fps: 16,
  taesd_preview: false,
  preview_noisy: false
}
[DEBUG] main.cpp:518  - SDContextParams {
  n_threads: 8,
  model_path: "",
  clip_l_path: "",
  clip_g_path: "",
  clip_vision_path: "",
  t5xxl_path: "",
  llm_path: ".\models\text-encoder\Qwen3-4B-Instruct-2507-Q4_K_M.gguf",
  llm_vision_path: "",
  diffusion_model_path: ".\models\stable-diffusion\moodyRealMix_zitV4DPOFP8.safetensors",
  high_noise_diffusion_model_path: "",
  vae_path: ".\models\vae\flux1-vae.safetensors",
  taesd_path: "",
  esrgan_path: "",
  control_net_path: "",
  embedding_dir: "",
  embeddings: {
  }
  wtype: NONE,
  tensor_type_rules: "",
  lora_model_dir: ".",
  photo_maker_path: "",
  rng_type: cuda,
  sampler_rng_type: NONE,
  offload_params_to_cpu: true,
  enable_mmap: false,
  control_net_cpu: false,
  clip_on_cpu: false,
  vae_on_cpu: false,
  flash_attn: false,
  diffusion_flash_attn: false,
  diffusion_conv_direct: false,
  vae_conv_direct: false,
  circular: false,
  circular_x: false,
  circular_y: false,
  chroma_use_dit_mask: true,
  qwen_image_zero_cond_t: false,
  chroma_use_t5_mask: false,
  chroma_t5_mask_pad: 1,
  prediction: NONE,
  lora_apply_mode: auto,
  vae_tiling_params: { 0, 0, 0, 0.5, 0, 0 },
  force_sdxl_vae_conv_scale: false
}
[DEBUG] main.cpp:519  - SDGenerationParams {
  loras: "{
  }",
  high_noise_loras: "{
  }",
  prompt: "A cinematic, melancholic photograph of a solitary hooded figure walking through a sprawling, rain-slicked metropolis at night. The city lights are a chaotic blur of neon orange and cool blue, reflecting on the wet asphalt. The scene evokes a sense of being a single component in a vast machine. Superimposed over the image in a sleek, modern, slightly glitched font is the philosophical quote: 'THE CITY IS A CIRCUIT BOARD, AND I AM A BROKEN TRANSISTOR.' -- moody, atmospheric, profound, dark academic",
  negative_prompt: "",
  clip_skip: -1,
  width: 960,
  height: 640,
  batch_count: 1,
  init_image_path: "",
  end_image_path: "",
  mask_image_path: "",
  control_image_path: "",
  ref_image_paths: [],
  control_video_path: "",
  auto_resize_ref_image: true,
  increase_ref_index: false,
  pm_id_images_dir: "",
  pm_id_embed_path: "",
  pm_style_strength: 20,
  skip_layers: [7, 8, 9],
  sample_params: (txt_cfg: 1.00, img_cfg: 1.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: dpm++2m, sample_steps: 8, eta: 0.00, shifted_timestep: 0, flow_shift: 3.00),
  high_noise_skip_layers: [7, 8, 9],
  high_noise_sample_params: (txt_cfg: 7.00, img_cfg: 7.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: NONE, sample_steps: 20, eta: 0.00, shifted_timestep: 0, flow_shift: inf),
  custom_sigmas: [],
  cache_mode: "",
  cache_option: "",
  cache: disabled (threshold=1, start=0.15, end=0.95),
  moe_boundary: 0.875,
  video_frames: 1,
  fps: 16,
  vace_strength: 1,
  strength: 0.75,
  control_strength: 0.9,
  seed: 42,
  upscale_repeats: 1,
  upscale_tile_size: 128,
}
[DEBUG] stable-diffusion.cpp:173  - Using CUDA backend
[INFO ] ggml_extend.hpp:78   - ggml_cuda_init: found 1 CUDA devices:
[INFO ] ggml_extend.hpp:78   -   Device 0: NVIDIA GeForce RTX 4060 Laptop GPU, compute capability 8.9, VMM: yes
[INFO ] stable-diffusion.cpp:267  - loading diffusion model from '.\models\stable-diffusion\moodyRealMix_zitV4DPOFP8.safetensors'
[INFO ] model.cpp:369  - load .\models\stable-diffusion\moodyRealMix_zitV4DPOFP8.safetensors using safetensors format
[DEBUG] model.cpp:503  - init from '.\models\stable-diffusion\moodyRealMix_zitV4DPOFP8.safetensors', prefix = 'model.diffusion_model.'
[INFO ] stable-diffusion.cpp:314  - loading llm from '.\models\text-encoder\Qwen3-4B-Instruct-2507-Q4_K_M.gguf'
[INFO ] model.cpp:366  - load .\models\text-encoder\Qwen3-4B-Instruct-2507-Q4_K_M.gguf using gguf format
[DEBUG] model.cpp:412  - init from '.\models\text-encoder\Qwen3-4B-Instruct-2507-Q4_K_M.gguf'
[INFO ] stable-diffusion.cpp:328  - loading vae from '.\models\vae\flux1-vae.safetensors'
[INFO ] model.cpp:369  - load .\models\vae\flux1-vae.safetensors using safetensors format
[DEBUG] model.cpp:503  - init from '.\models\vae\flux1-vae.safetensors', prefix = 'vae.'
[INFO ] stable-diffusion.cpp:345  - Version: Z-Image
[INFO ] stable-diffusion.cpp:373  - Weight type stat:                      f32: 389  |     f16: 453  |    q4_K: 216  |    q6_K: 37
[INFO ] stable-diffusion.cpp:374  - Conditioner weight type stat:          f32: 145  |    q4_K: 216  |    q6_K: 37
[INFO ] stable-diffusion.cpp:375  - Diffusion model weight type stat:      f16: 453
[INFO ] stable-diffusion.cpp:376  - VAE weight type stat:                  f32: 244
[DEBUG] stable-diffusion.cpp:378  - ggml tensor size = 400 bytes
[DEBUG] llm.hpp:286  - merges size 151387
[DEBUG] llm.hpp:318  - vocab size: 151669
[DEBUG] llm.hpp:1140 - llm: num_layers = 36, vocab_size = 151936, hidden_size = 2560, intermediate_size = 9728
[DEBUG] ggml_extend.hpp:1994 - qwen3 params backend buffer size =  3555.38 MB(RAM) (398 tensors)
[DEBUG] ggml_extend.hpp:1994 - z_image params backend buffer size =  11743.02 MB(RAM) (453 tensors)
[DEBUG] ggml_extend.hpp:1994 - vae params backend buffer size =  94.57 MB(RAM) (138 tensors)
[DEBUG] stable-diffusion.cpp:781  - loading weights
[DEBUG] model.cpp:1350 - using 8 threads for model loading
[DEBUG] model.cpp:1372 - loading tensors from .\models\stable-diffusion\moodyRealMix_zitV4DPOFP8.safetensors
  |====================>                             | 453/1095 - 66.75it/s
[DEBUG] model.cpp:1372 - loading tensors from .\models\text-encoder\Qwen3-4B-Instruct-2507-Q4_K_M.gguf
  |======================================>           | 851/1095 - 95.90it/s
[DEBUG] model.cpp:1372 - loading tensors from .\models\vae\flux1-vae.safetensors
  |==================================================| 1095/1095 - 120.65it/s
[INFO ] model.cpp:1598 - loading tensors completed, taking 9.08s (process: 0.00s, read: 3.98s, memcpy: 0.00s, convert: 4.23s, copy_to_backend: 0.00s)
[DEBUG] stable-diffusion.cpp:816  - finished loaded file
[INFO ] stable-diffusion.cpp:889  - total params memory size = 15392.97MB (VRAM 15392.97MB, RAM 0.00MB): text_encoders 3555.38MB(VRAM), diffusion_model 11743.02MB(VRAM), vae 94.57MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:955  - running in FLOW mode
[DEBUG] stable-diffusion.cpp:3639 - generate_image 960x640
[INFO ] stable-diffusion.cpp:3675 - sampling using DPM++ (2M) method
[INFO ] denoiser.hpp:494  - get_sigmas with discrete scheduler
[INFO ] stable-diffusion.cpp:3802 - TXT2IMG
[DEBUG] conditioner.hpp:1864 - parse '<|im_start|>user
A cinematic, melancholic photograph of a solitary hooded figure walking through a sprawling, rain-slicked metropolis at night. The city lights are a chaotic blur of neon orange and cool blue, reflecting on the wet asphalt. The scene evokes a sense of being a single component in a vast machine. Superimposed over the image in a sleek, modern, slightly glitched font is the philosophical quote: 'THE CITY IS A CIRCUIT BOARD, AND I AM A BROKEN TRANSISTOR.' -- moody, atmospheric, profound, dark academic<|im_end|>
<|im_start|>assistant
' to [['<|im_start|>user
', 1], ['A cinematic, melancholic photograph of a solitary hooded figure walking through a sprawling, rain-slicked metropolis at night. The city lights are a chaotic blur of neon orange and cool blue, reflecting on the wet asphalt. The scene evokes a sense of being a single component in a vast machine. Superimposed over the image in a sleek, modern, slightly glitched font is the philosophical quote: 'THE CITY IS A CIRCUIT BOARD, AND I AM A BROKEN TRANSISTOR.' -- moody, atmospheric, profound, dark academic', 1], ['<|im_end|>
<|im_start|>assistant
', 1], ]
[DEBUG] llm.hpp:260  - split prompt "<|im_start|>user
" to tokens ["<|im_start|>", "user", "Ċ", ]
[DEBUG] llm.hpp:260  - split prompt "A cinematic, melancholic photograph of a solitary hooded figure walking through a sprawling, rain-slicked metropolis at night. The city lights are a chaotic blur of neon orange and cool blue, reflecting on the wet asphalt. The scene evokes a sense of being a single component in a vast machine. Superimposed over the image in a sleek, modern, slightly glitched font is the philosophical quote: 'THE CITY IS A CIRCUIT BOARD, AND I AM A BROKEN TRANSISTOR.' -- moody, atmospheric, profound, dark academic" to tokens ["A", "Ġcinematic", ",", "Ġmelanch", "olic", "Ġphotograph", "Ġof", "Ġa", "Ġsolitary", "Ġhood", "ed", "Ġfigure", "Ġwalking", "Ġthrough", "Ġa", "Ġsprawling", ",", "Ġrain", "-s", "lick", "ed", "Ġmet", "ropolis", "Ġat", "Ġnight", ".", "ĠThe", "Ġcity", "Ġlights", "Ġare", "Ġa", "Ġchaotic", "Ġblur", "Ġof", "Ġneon", "Ġorange", "Ġand", "Ġcool", "Ġblue", ",", "Ġreflecting", "Ġon", "Ġthe", "Ġwet", "Ġasphalt", ".", "ĠThe", "Ġscene", "Ġev", "okes", "Ġa", "Ġsense", "Ġof", "Ġbeing", "Ġa", "Ġsingle", "Ġcomponent", "Ġin", "Ġa", "Ġvast", "Ġmachine", ".", "ĠSuper", "im", "posed", "Ġover", "Ġthe", "Ġimage", "Ġin", "Ġa", "Ġsleek", ",", "Ġmodern", ",", "Ġslightly", "Ġglitch", "ed", "Ġfont", "Ġis", "Ġthe", "Ġphilosophical", "Ġquote", ":", "Ġ'", "THE", "ĠCITY", "ĠIS", "ĠA", "ĠC", "IR", "CU", "IT", "ĠBOARD", ",", "ĠAND", "ĠI", "ĠAM", "ĠA", "ĠBRO", "KEN", "ĠTRANS", "IST", "OR", ".'", "Ġ--", "Ġmo", "ody", ",", "Ġatmospheric", ",", "Ġprofound", ",", "Ġdark", "Ġacademic", ]
[DEBUG] llm.hpp:260  - split prompt "<|im_end|>
<|im_start|>assistant
" to tokens ["<|im_end|>", "Ċ", "<|im_start|>", "assistant", "Ċ", ]
[INFO ] ggml_extend.hpp:1906 - qwen3 offload params (3555.38 MB, 398 tensors) to runtime backend (CUDA0), taking 0.70s
[DEBUG] ggml_extend.hpp:1806 - qwen3 compute buffer size: 13.40 MB(VRAM)
[DEBUG] conditioner.hpp:2149 - computing condition graph completed, taking 992 ms
[INFO ] stable-diffusion.cpp:3381 - get_learned_condition completed, taking 992 ms
[INFO ] stable-diffusion.cpp:3492 - generating image: 1/1 - seed 42
[INFO ] ggml_extend.hpp:1906 - z_image offload params (11743.02 MB, 453 tensors) to runtime backend (CUDA0), taking 4.27s
[DEBUG] ggml_extend.hpp:1806 - z_image compute buffer size: 1031.78 MB(VRAM)

Command used

sd-cli.exe \
  --diffusion-model moodyRealMix_zitV4DPOFP8.safetensors \
  --vae flux1-vae.safetensors \
  --llm Qwen3-4B-Instruct-2507-Q4_K_M.gguf \
  -p "a cat" -W 640 -H 960 \
  --steps 8 --cfg-scale 1.0 \
  --flow-shift 3.0 -v

Possible cause

ggml may not have native compute support for the fp8 types (f8_e4m3 / f8_e5m2),
so the loader silently falls back to f16, doubling the model's memory footprint.
Would it be possible to add native fp8 inference support,
or at least emit a warning when this implicit conversion happens?
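For the warning half of the request, a rough sketch of what a one-time notice at load type resolution could look like. All names here (`DType`, `resolve_load_type`, the log format) are illustrative, not the actual stable-diffusion.cpp or ggml API:

```cpp
#include <cstdio>
#include <string>

// Hypothetical storage/compute type enum; stands in for ggml's type ids.
enum class DType { F32, F16, F8_E4M3, F8_E5M2 };

static bool is_fp8(DType t) {
    return t == DType::F8_E4M3 || t == DType::F8_E5M2;
}

// Illustrative hook: decide what type a tensor is loaded as, and warn the
// first time an fp8 tensor has to be widened because the backend cannot
// compute in fp8 natively.
DType resolve_load_type(DType stored, const std::string& name, bool* warned) {
    if (is_fp8(stored)) {
        if (!*warned) {
            fprintf(stderr,
                    "[WARN ] fp8 tensors (e.g. '%s') are not natively supported; "
                    "converting to f16 doubles their memory footprint\n",
                    name.c_str());
            *warned = true;  // warn once per load, not once per tensor
        }
        return DType::F16;
    }
    return stored;  // everything else keeps its stored type
}
```

A one-shot flag keeps the log readable, since a model like this one has 453 affected tensors and a per-tensor warning would drown the progress output.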
