UPSTREAM PR #1318: chore: replace rand and srand at the library level#76
UPSTREAM PR #1318: chore: replace rand and srand at the library level#76
Conversation
OverviewAnalysis of stable-diffusion.cpp compared 49,765 functions across two versions, identifying 107 modified functions, 18 new functions, and 0 removed functions. The changes stem from a single commit replacing C-style rand/srand with C++ random number generation for improved thread safety and reproducibility. Binaries Analyzed:
Overall performance impact is negligible, with power consumption changes under 0.2% indicating effective performance neutrality despite individual function variations. Function Analysisstd::vector::end() (build.bin.sd-cli): Throughput time increased 306.67% (59.77ns → 243.07ns, +183.30ns). Response time increased 223.91% (81.86ns → 265.16ns, +183.30ns). This STL function regression appears compiler-driven, likely from disabled inlining. While called frequently (411 uses), absolute impact remains modest. std::vector<sd_lora_t>::end() (build.bin.sd-server): Throughput time improved 75.41% (243.07ns → 59.78ns, -183.29ns). Response time improved 69.44% (263.94ns → 80.65ns, -183.29ns). Compiler optimizations improved this LoRA parameter iteration function. ggml_threadpool_params_default (build.bin.sd-cli): Throughput time improved 58.40% (217.48ns → 90.47ns, -127.01ns). Response time improved 45.46% (279.79ns → 152.59ns, -127.20ns). GGML submodule optimizations reduced threadpool initialization overhead. ggml_compute_forward_map_custom3 (build.bin.sd-server): Throughput time improved 35.05% (219.25ns → 142.41ns, -76.84ns). Response time improved 32.91% (233.99ns → 156.98ns, -77.01ns). Custom operation handling benefits from more efficient RNG implementation. apply_binary_op (build.bin.sd-cli): Throughput time improved 6.15% (1286.26ns → 1207.13ns, -79.13ns). Response time improved 4.26% (2362.80ns → 2262.11ns, -100.69ns). This frequently-called tensor addition operation shows modest but meaningful improvement. Other analyzed functions showed mixed compiler-driven optimizations in STL operations (string construction, regex handling, vector reallocation) with changes ranging from -50% to +113%, but absolute impacts remained under 100ns per call. Additional FindingsCore ML inference operations (matrix multiplication, convolution, attention) remain unchanged. Performance variations are predominantly compiler artifacts affecting peripheral functions (initialization, CLI parsing, memory management) rather than inference hot paths. The RNG replacement successfully achieves thread safety and reproducibility goals without compromising computational efficiency, as confirmed by near-zero net power consumption changes. 🔎 Full breakdown: Loci Inspector |
dd19ab8 to
98460a7
Compare
These functions have global state, so they could interfere with application behavior.
18d93ce to
1bfd831
Compare
OverviewAnalysis of 49,653 functions across two binaries reveals minimal performance impact from replacing legacy Power Consumption:
Net impact is negligible (<0.1% variation in both binaries). Function AnalysisMost Significant Changes: std::vector::begin() (TensorStorage) -
std::vector::back() -
std::shared_ptr::_M_destroy (FinalLayer) -
ggml_log_internal -
alloc_params_ctx -
Red-Black Tree Operations - Other analyzed functions (vector constructor, hashtable deallocation, initialization checks, regex operations) showed minor changes with mixed improvements and regressions, all under 160 ns absolute impact. Additional FindingsSource Code Context: Single commit replaced ML Infrastructure Impact: GGML infrastructure improvements (logging -25%, context allocation -5%) benefit inference monitoring and model initialization. TensorStorage::begin() regression (+289%) affects model loading but is one-time initialization cost, not hot inference path. Cross-Function Effects: Red-black tree regressions compound (~102 ns per map insertion). Vector operation improvements (back: -73%) offset initialization regressions. High-frequency function improvements (logging) outweigh low-frequency regressions (initialization), resulting in near-zero net power impact. 🔎 Full breakdown: Loci Inspector |
Note
Source pull request: leejet/stable-diffusion.cpp#1318
These functions have global state, so they could interfere with application behavior.
It would arguably be more correct to use
std::default_random_device, but that seemed a bit overkill for this.