UPSTREAM PR #1358: refactor: simplify f8_e5m2_to_f16 function a little bit#87
Conversation
Overview
Analysis of 49,619 functions (39 modified, 0 new, 0 removed) across two binaries reveals a major performance improvement from commit 7830a40.

Function Analysis
- f8_e5m2_to_f16 (sd-cli & sd-server): 52.6x speedup (368.98 ns → 7.81 ns, -97.9% response/throughput time). Simplified from a 40-line IEEE 754 converter with complex branching (16 blocks, 6 branches) to a single bit-shift operation.
- std::vector::end() (sd-cli): +223.9% response time (81.87 ns → 265.16 ns), +306.6% throughput time (59.78 ns → 243.07 ns). Standard library function showing a compiler-driven regression with added entry indirection. The absolute impact (+183 ns) is modest despite the alarming percentages.
- std::vector::back() (sd-cli): -42.0% response time (452.5 ns → 262.7 ns), -73.1% throughput time (259.7 ns → 69.9 ns). Compiler optimization consolidated blocks, improving cache locality.
- SelfAttention::post_attention() (sd-cli): +0.38% response time (8,398 ns → 8,430 ns), +19.9% throughput time (167 ns → 200 ns). Minor regression, negligible in context since GPU computation dominates attention operations.

Other analyzed functions showed compiler-driven variations in standard-library code (destructors, logging, constructors) with minimal practical impact.

Flame Graph Comparison
The flame graphs show complete elimination of execution complexity: the base version's multi-path logic is replaced by a single flat execution path in the target.

🔎 Full breakdown: Loci Inspector


Note
Source pull request: leejet/stable-diffusion.cpp#1358
This has been bothering me for a while. The f8_e5m2_to_f16() function is needlessly over-engineered.