
[Bug] fp8 safetensors diffusion model is loaded as f16 instead of fp8, causing excessive VRAM usage #1347

@gaowanliang

Description


Environment

  • OS: Windows 11
  • GPU: NVIDIA GeForce RTX 4060 Laptop GPU (8GB VRAM)
  • CUDA: 12.x
  • Commit: d6dd6d7

Models

Expected behavior

The fp8 safetensors model should be loaded in fp8 precision, consuming roughly 5.8GB of VRAM.

Actual behavior

Instead, the fp8 model is silently converted to f16 on load, consuming ~11.7GB of VRAM.
On an 8GB GPU this causes an OOM and the output images come out black.

From the log:

[DEBUG] main.cpp:515  - version: stable-diffusion.cpp version unknown, commit d6dd6d7
[DEBUG] main.cpp:516  - System Info:
    SSE3 = 1 |     AVX = 1 |     AVX2 = 1 |     AVX512 = 1 |     AVX512_VBMI = 0 |     AVX512_VNNI = 0 |     FMA = 1 |     NEON = 0 |     ARM_FMA = 0 |     F16C = 1 |     FP16_VA = 0 |     WASM_SIMD = 0 |     VSX = 0 |
[DEBUG] main.cpp:517  - SDCliParams {
  mode: img_gen,
  output_path: "output.png",
  verbose: true,
  color: false,
  canny_preprocess: false,
  convert_name: false,
  preview_method: none,
  preview_interval: 1,
  preview_path: "preview.png",
  preview_fps: 16,
  taesd_preview: false,
  preview_noisy: false
}
[DEBUG] main.cpp:518  - SDContextParams {
  n_threads: 8,
  model_path: "",
  clip_l_path: "",
  clip_g_path: "",
  clip_vision_path: "",
  t5xxl_path: "",
  llm_path: ".\models\text-encoder\Qwen3-4B-Instruct-2507-Q4_K_M.gguf",
  llm_vision_path: "",
  diffusion_model_path: ".\models\stable-diffusion\moodyRealMix_zitV4DPOFP8.safetensors",
  high_noise_diffusion_model_path: "",
  vae_path: ".\models\vae\flux1-vae.safetensors",
  taesd_path: "",
  esrgan_path: "",
  control_net_path: "",
  embedding_dir: "",
  embeddings: {
  }
  wtype: NONE,
  tensor_type_rules: "",
  lora_model_dir: ".",
  photo_maker_path: "",
  rng_type: cuda,
  sampler_rng_type: NONE,
  offload_params_to_cpu: true,
  enable_mmap: false,
  control_net_cpu: false,
  clip_on_cpu: false,
  vae_on_cpu: false,
  flash_attn: false,
  diffusion_flash_attn: false,
  diffusion_conv_direct: false,
  vae_conv_direct: false,
  circular: false,
  circular_x: false,
  circular_y: false,
  chroma_use_dit_mask: true,
  qwen_image_zero_cond_t: false,
  chroma_use_t5_mask: false,
  chroma_t5_mask_pad: 1,
  prediction: NONE,
  lora_apply_mode: auto,
  vae_tiling_params: { 0, 0, 0, 0.5, 0, 0 },
  force_sdxl_vae_conv_scale: false
}
[DEBUG] main.cpp:519  - SDGenerationParams {
  loras: "{
  }",
  high_noise_loras: "{
  }",
  prompt: "A cinematic, melancholic photograph of a solitary hooded figure walking through a sprawling, rain-slicked metropolis at night. The city lights are a chaotic blur of neon orange and cool blue, reflecting on the wet asphalt. The scene evokes a sense of being a single component in a vast machine. Superimposed over the image in a sleek, modern, slightly glitched font is the philosophical quote: 'THE CITY IS A CIRCUIT BOARD, AND I AM A BROKEN TRANSISTOR.' -- moody, atmospheric, profound, dark academic",
  negative_prompt: "",
  clip_skip: -1,
  width: 960,
  height: 640,
  batch_count: 1,
  init_image_path: "",
  end_image_path: "",
  mask_image_path: "",
  control_image_path: "",
  ref_image_paths: [],
  control_video_path: "",
  auto_resize_ref_image: true,
  increase_ref_index: false,
  pm_id_images_dir: "",
  pm_id_embed_path: "",
  pm_style_strength: 20,
  skip_layers: [7, 8, 9],
  sample_params: (txt_cfg: 1.00, img_cfg: 1.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: dpm++2m, sample_steps: 8, eta: 0.00, shifted_timestep: 0, flow_shift: 3.00),
  high_noise_skip_layers: [7, 8, 9],
  high_noise_sample_params: (txt_cfg: 7.00, img_cfg: 7.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: NONE, sample_steps: 20, eta: 0.00, shifted_timestep: 0, flow_shift: inf),
  custom_sigmas: [],
  cache_mode: "",
  cache_option: "",
  cache: disabled (threshold=1, start=0.15, end=0.95),
  moe_boundary: 0.875,
  video_frames: 1,
  fps: 16,
  vace_strength: 1,
  strength: 0.75,
  control_strength: 0.9,
  seed: 42,
  upscale_repeats: 1,
  upscale_tile_size: 128,
}
[DEBUG] stable-diffusion.cpp:173  - Using CUDA backend
[INFO ] ggml_extend.hpp:78   - ggml_cuda_init: found 1 CUDA devices:
[INFO ] ggml_extend.hpp:78   -   Device 0: NVIDIA GeForce RTX 4060 Laptop GPU, compute capability 8.9, VMM: yes
[INFO ] stable-diffusion.cpp:267  - loading diffusion model from '.\models\stable-diffusion\moodyRealMix_zitV4DPOFP8.safetensors'
[INFO ] model.cpp:369  - load .\models\stable-diffusion\moodyRealMix_zitV4DPOFP8.safetensors using safetensors format
[DEBUG] model.cpp:503  - init from '.\models\stable-diffusion\moodyRealMix_zitV4DPOFP8.safetensors', prefix = 'model.diffusion_model.'
[INFO ] stable-diffusion.cpp:314  - loading llm from '.\models\text-encoder\Qwen3-4B-Instruct-2507-Q4_K_M.gguf'
[INFO ] model.cpp:366  - load .\models\text-encoder\Qwen3-4B-Instruct-2507-Q4_K_M.gguf using gguf format
[DEBUG] model.cpp:412  - init from '.\models\text-encoder\Qwen3-4B-Instruct-2507-Q4_K_M.gguf'
[INFO ] stable-diffusion.cpp:328  - loading vae from '.\models\vae\flux1-vae.safetensors'
[INFO ] model.cpp:369  - load .\models\vae\flux1-vae.safetensors using safetensors format
[DEBUG] model.cpp:503  - init from '.\models\vae\flux1-vae.safetensors', prefix = 'vae.'
[INFO ] stable-diffusion.cpp:345  - Version: Z-Image
[INFO ] stable-diffusion.cpp:373  - Weight type stat:                      f32: 389  |     f16: 453  |    q4_K: 216  |    q6_K: 37
[INFO ] stable-diffusion.cpp:374  - Conditioner weight type stat:          f32: 145  |    q4_K: 216  |    q6_K: 37
[INFO ] stable-diffusion.cpp:375  - Diffusion model weight type stat:      f16: 453
[INFO ] stable-diffusion.cpp:376  - VAE weight type stat:                  f32: 244
[DEBUG] stable-diffusion.cpp:378  - ggml tensor size = 400 bytes
[DEBUG] llm.hpp:286  - merges size 151387
[DEBUG] llm.hpp:318  - vocab size: 151669
[DEBUG] llm.hpp:1140 - llm: num_layers = 36, vocab_size = 151936, hidden_size = 2560, intermediate_size = 9728
[DEBUG] ggml_extend.hpp:1994 - qwen3 params backend buffer size =  3555.38 MB(RAM) (398 tensors)
[DEBUG] ggml_extend.hpp:1994 - z_image params backend buffer size =  11743.02 MB(RAM) (453 tensors)
[DEBUG] ggml_extend.hpp:1994 - vae params backend buffer size =  94.57 MB(RAM) (138 tensors)
[DEBUG] stable-diffusion.cpp:781  - loading weights
[DEBUG] model.cpp:1350 - using 8 threads for model loading
[DEBUG] model.cpp:1372 - loading tensors from .\models\stable-diffusion\moodyRealMix_zitV4DPOFP8.safetensors
  |====================>                             | 453/1095 - 66.75it/s
[DEBUG] model.cpp:1372 - loading tensors from .\models\text-encoder\Qwen3-4B-Instruct-2507-Q4_K_M.gguf
  |======================================>           | 851/1095 - 95.90it/s
[DEBUG] model.cpp:1372 - loading tensors from .\models\vae\flux1-vae.safetensors
  |==================================================| 1095/1095 - 120.65it/s
[INFO ] model.cpp:1598 - loading tensors completed, taking 9.08s (process: 0.00s, read: 3.98s, memcpy: 0.00s, convert: 4.23s, copy_to_backend: 0.00s)
[DEBUG] stable-diffusion.cpp:816  - finished loaded file
[INFO ] stable-diffusion.cpp:889  - total params memory size = 15392.97MB (VRAM 15392.97MB, RAM 0.00MB): text_encoders 3555.38MB(VRAM), diffusion_model 11743.02MB(VRAM), vae 94.57MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:955  - running in FLOW mode
[DEBUG] stable-diffusion.cpp:3639 - generate_image 960x640
[INFO ] stable-diffusion.cpp:3675 - sampling using DPM++ (2M) method
[INFO ] denoiser.hpp:494  - get_sigmas with discrete scheduler
[INFO ] stable-diffusion.cpp:3802 - TXT2IMG
[DEBUG] conditioner.hpp:1864 - parse '<|im_start|>user
A cinematic, melancholic photograph of a solitary hooded figure walking through a sprawling, rain-slicked metropolis at night. The city lights are a chaotic blur of neon orange and cool blue, reflecting on the wet asphalt. The scene evokes a sense of being a single component in a vast machine. Superimposed over the image in a sleek, modern, slightly glitched font is the philosophical quote: 'THE CITY IS A CIRCUIT BOARD, AND I AM A BROKEN TRANSISTOR.' -- moody, atmospheric, profound, dark academic<|im_end|>
<|im_start|>assistant
' to [['<|im_start|>user
', 1], ['A cinematic, melancholic photograph of a solitary hooded figure walking through a sprawling, rain-slicked metropolis at night. The city lights are a chaotic blur of neon orange and cool blue, reflecting on the wet asphalt. The scene evokes a sense of being a single component in a vast machine. Superimposed over the image in a sleek, modern, slightly glitched font is the philosophical quote: 'THE CITY IS A CIRCUIT BOARD, AND I AM A BROKEN TRANSISTOR.' -- moody, atmospheric, profound, dark academic', 1], ['<|im_end|>
<|im_start|>assistant
', 1], ]
[DEBUG] llm.hpp:260  - split prompt "<|im_start|>user
" to tokens ["<|im_start|>", "user", "Ċ", ]
[DEBUG] llm.hpp:260  - split prompt "A cinematic, melancholic photograph of a solitary hooded figure walking through a sprawling, rain-slicked metropolis at night. The city lights are a chaotic blur of neon orange and cool blue, reflecting on the wet asphalt. The scene evokes a sense of being a single component in a vast machine. Superimposed over the image in a sleek, modern, slightly glitched font is the philosophical quote: 'THE CITY IS A CIRCUIT BOARD, AND I AM A BROKEN TRANSISTOR.' -- moody, atmospheric, profound, dark academic" to tokens ["A", "Ġcinematic", ",", "Ġmelanch", "olic", "Ġphotograph", "Ġof", "Ġa", "Ġsolitary", "Ġhood", "ed", "Ġfigure", "Ġwalking", "Ġthrough", "Ġa", "Ġsprawling", ",", "Ġrain", "-s", "lick", "ed", "Ġmet", "ropolis", "Ġat", "Ġnight", ".", "ĠThe", "Ġcity", "Ġlights", "Ġare", "Ġa", "Ġchaotic", "Ġblur", "Ġof", "Ġneon", "Ġorange", "Ġand", "Ġcool", "Ġblue", ",", "Ġreflecting", "Ġon", "Ġthe", "Ġwet", "Ġasphalt", ".", "ĠThe", "Ġscene", "Ġev", "okes", "Ġa", "Ġsense", "Ġof", "Ġbeing", "Ġa", "Ġsingle", "Ġcomponent", "Ġin", "Ġa", "Ġvast", "Ġmachine", ".", "ĠSuper", "im", "posed", "Ġover", "Ġthe", "Ġimage", "Ġin", "Ġa", "Ġsleek", ",", "Ġmodern", ",", "Ġslightly", "Ġglitch", "ed", "Ġfont", "Ġis", "Ġthe", "Ġphilosophical", "Ġquote", ":", "Ġ'", "THE", "ĠCITY", "ĠIS", "ĠA", "ĠC", "IR", "CU", "IT", "ĠBOARD", ",", "ĠAND", "ĠI", "ĠAM", "ĠA", "ĠBRO", "KEN", "ĠTRANS", "IST", "OR", ".'", "Ġ--", "Ġmo", "ody", ",", "Ġatmospheric", ",", "Ġprofound", ",", "Ġdark", "Ġacademic", ]
[DEBUG] llm.hpp:260  - split prompt "<|im_end|>
<|im_start|>assistant
" to tokens ["<|im_end|>", "Ċ", "<|im_start|>", "assistant", "Ċ", ]
[INFO ] ggml_extend.hpp:1906 - qwen3 offload params (3555.38 MB, 398 tensors) to runtime backend (CUDA0), taking 0.70s
[DEBUG] ggml_extend.hpp:1806 - qwen3 compute buffer size: 13.40 MB(VRAM)
[DEBUG] conditioner.hpp:2149 - computing condition graph completed, taking 992 ms
[INFO ] stable-diffusion.cpp:3381 - get_learned_condition completed, taking 992 ms
[INFO ] stable-diffusion.cpp:3492 - generating image: 1/1 - seed 42
[INFO ] ggml_extend.hpp:1906 - z_image offload params (11743.02 MB, 453 tensors) to runtime backend (CUDA0), taking 4.27s
[DEBUG] ggml_extend.hpp:1806 - z_image compute buffer size: 1031.78 MB(VRAM)

Command used

sd-cli.exe \
  --diffusion-model moodyRealMix_zitV4DPOFP8.safetensors \
  --vae flux1-vae.safetensors \
  --llm Qwen3-4B-Instruct-2507-Q4_K_M.gguf \
  -p "a cat" -W 640 -H 960 \
  --steps 8 --cfg-scale 1.0 \
  --flow-shift 3.0 -v

Possible cause

ggml may not have native compute support for the fp8 types (f8_e4m3 / f8_e5m2),
so the loader silently falls back to f16, doubling the model's memory footprint.
Would it be possible to add native fp8 inference support,
or at least emit a warning when this implicit conversion happens?
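For the warning half of the request, a rough sketch of what a one-time notice at load type resolution could look like. All names here (`DType`, `resolve_load_type`, the log format) are illustrative, not the actual stable-diffusion.cpp or ggml API:

```cpp
#include <cstdio>
#include <string>

// Hypothetical storage/compute type enum; stands in for ggml's type ids.
enum class DType { F32, F16, F8_E4M3, F8_E5M2 };

static bool is_fp8(DType t) {
    return t == DType::F8_E4M3 || t == DType::F8_E5M2;
}

// Illustrative hook: decide what type a tensor is loaded as, and warn the
// first time an fp8 tensor has to be widened because the backend cannot
// compute in fp8 natively.
DType resolve_load_type(DType stored, const std::string& name, bool* warned) {
    if (is_fp8(stored)) {
        if (!*warned) {
            fprintf(stderr,
                    "[WARN ] fp8 tensors (e.g. '%s') are not natively supported; "
                    "converting to f16 doubles their memory footprint\n",
                    name.c_str());
            *warned = true;  // warn once per load, not once per tensor
        }
        return DType::F16;
    }
    return stored;  // everything else keeps its stored type
}
```

A one-shot flag keeps the log readable, since a model like this one has 453 affected tensors and a per-tensor warning would drown the progress output.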
