simplex: experimental Go 1.27 SIMD axpy for pivot/duals (arm64: -18% via archsimd) #1
pivot's Binv row-update and duals' accumulate loop are ~76% of solver CPU
time (profiled via pprof on a 300x300 random LP). Both are the same
dense AXPY pattern (dst[k] += factor*src[k]), now factored into a shared
axpy() with two implementations selected by build tag:
- axpy_generic.go (default): the original scalar loop, unchanged behavior.
- axpy_simd.go (//go:build goexperiment.simd): vectorized via the new
portable `simd` package, gated behind GOEXPERIMENT=simd (Go 1.27rc1+).
Benchmarked scalar vs SIMD (benchstat, n=10, Apple M4 Pro/arm64 NEON,
128-bit = 2 float64 lanes):
ColdSolve_50x50 +5.7% slower (call/bounds-check overhead dominates)
ColdSolve_150x150 -10.6% faster
ColdSolve_300x300 -2.8% faster
geomean -2.8%
Mixed, modest result — not a clean win. Not recommending this for the
default build: it regresses at small problem sizes (likely the common
case), and the whole simd package is still experimental on an unreleased
toolchain (1.27rc1). Landing it behind the build tag costs nothing (zero
effect on default `go build`/`go test`) and gives a starting point to
revisit once Go 1.27 ships and/or wider (256/512-bit) SIMD is targeted.
Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
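As a rough illustration of the shared pattern this commit describes, the sketch below shows a scalar `axpy` and how both hot loops could reduce to it. The signature `axpy(dst, src, factor)`, the helpers `updateRow` and `accumulateDuals`, and the `[][]float64` row layout are assumptions made for the example, not the PR's actual code.

```go
package main

import "fmt"

// axpy performs dst[k] += factor*src[k] for every k in dst.
// Assumed signature; src must be at least as long as dst.
func axpy(dst, src []float64, factor float64) {
	for k := range dst {
		dst[k] += factor * src[k]
	}
}

// updateRow is a hypothetical stand-in for pivot's Binv row update:
// row i gets a multiple of the pivot row r added to it.
func updateRow(binv [][]float64, i, r int, factor float64) {
	axpy(binv[i], binv[r], factor)
}

// accumulateDuals is a hypothetical stand-in for the duals loop:
// y += cB[i] * Binv[i] for each basic row i.
func accumulateDuals(y, cB []float64, binv [][]float64) {
	for i, c := range cB {
		axpy(y, binv[i], c)
	}
}

func main() {
	binv := [][]float64{{1, 0}, {0, 1}}
	updateRow(binv, 1, 0, -2) // row 1 += -2 * row 0
	fmt.Println(binv)         // [[1 0] [-2 1]]

	y := make([]float64, 2)
	accumulateDuals(y, []float64{3, 4}, binv)
	fmt.Println(y) // 3*[1 0] + 4*[-2 1] = [-5 4]
}
```

Keeping the loop body in one small function is what makes it practical to swap implementations per build tag without touching either call site.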
Verified this isn't a no-op or overhead artifact: disassembled the compiled `axpy@simd128` (`go tool objdump`) and confirmed real NEON instructions are emitted — `VDUP` (broadcast), 128-bit `FMOVQ` load/store, and a genuine vector `VFMLA` (fused multiply-add across 2 float64 lanes). `simd.Emulated()` also reports `false` on this hardware. So the mixed benchmark result is real, not noise: NEON here is only 128-bit (2 float64 lanes), so the theoretical ceiling is 2x, and there's a runtime CPU-feature dispatch stub on every call (checked via a cached flag, branching to the vector body or a scalar fallback). That fixed per-call overhead outweighs the 2-lane gain at small |
…cost
The portable simd package's per-call CPU-feature dispatch stub was
eating most of the 2-lane NEON win at small m (previous PR update:
+5.7% regression at 50x50). Switching arm64 to simd/archsimd's
Float64x2 directly (same 128-bit/2-lane hardware, no dispatch stub)
turns this into a clean win across every size:
ColdSolve_50x50 -11.6%
ColdSolve_150x150 -22.8%
ColdSolve_300x300 -19.1%
geomean -18.0%
vs the prior portable-simd geomean of -2.8%. simd/archsimd is amd64/
arm64 only for now (per its doc), so axpy_simd.go (portable) stays as
the fallback for other GOEXPERIMENT=simd architectures.
Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
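The geomean rows in both benchmark tables can be sanity-checked from the per-size deltas. The sketch below is an assumption about how to reproduce them, not benchstat's own code: it converts each percent change to a time ratio, takes the geometric mean, and converts back. The function name `geomeanDelta` is hypothetical, and because it works from the rounded percentages quoted above, the last digit may differ slightly from what benchstat computes on raw timings.

```go
package main

import (
	"fmt"
	"math"
)

// geomeanDelta takes per-benchmark percent changes (e.g. -11.6 for 11.6%
// less time) and returns the geometric-mean percent change.
func geomeanDelta(pcts []float64) float64 {
	sumLog := 0.0
	for _, p := range pcts {
		sumLog += math.Log(1 + p/100) // percent change -> new/old time ratio
	}
	return (math.Exp(sumLog/float64(len(pcts))) - 1) * 100
}

func main() {
	// Deltas copied from the two commit messages above.
	fmt.Printf("portable simd: %.1f%%\n", geomeanDelta([]float64{5.7, -10.6, -2.8}))
	fmt.Printf("archsimd:      %.1f%%\n", geomeanDelta([]float64{-11.6, -22.8, -19.1}))
}
```

Both results land on the reported -2.8% and -18.0% to one decimal place.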
…ivot
Allocation profile (300x300 random LP) showed phaseCost, duals, and
alpha each allocating a fresh slice on every single pivot iteration:
6858 allocs/op, 21.2MB/op. None of that data needs to survive past the
iteration, so run() now allocates the three scratch buffers once and
clears+reuses them across pivots.
Result: 916 allocs/op, 927KB/op (7.5x fewer allocations, 23x fewer
bytes). Timing-wise this is a real but modest win, independent of the
SIMD work in this branch: it applies to every build, with no
GOEXPERIMENT flag needed (benchstat, n=10):
ColdSolve_50x50 -5.36% (p=0.000)
ColdSolve_150x150 -0.92% (p=0.007)
ColdSolve_300x300 -0.24% (p=0.002)
geomean -2.20%
Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
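A minimal sketch of the allocate-once, clear-and-reuse approach described in this commit. The `scratch` struct, its field sizes, and the `newScratch`/`reset` names are hypothetical; only the three buffer names come from the commit message. It relies on the `clear` builtin, available since Go 1.21.

```go
package main

import "fmt"

// scratch groups the three per-pivot work buffers named in the commit.
// The struct, its field sizes, and the method names are hypothetical.
type scratch struct {
	phaseCost []float64 // one entry per column (assumed)
	duals     []float64 // one entry per constraint row (assumed)
	alpha     []float64 // one entry per constraint row (assumed)
}

func newScratch(nCols, m int) *scratch {
	return &scratch{
		phaseCost: make([]float64, nCols),
		duals:     make([]float64, m),
		alpha:     make([]float64, m),
	}
}

// reset zeroes the buffers in place; clear does not reallocate.
func (s *scratch) reset() {
	clear(s.phaseCost)
	clear(s.duals)
	clear(s.alpha)
}

func main() {
	s := newScratch(5, 3) // allocated once, before the pivot loop
	for iter := 0; iter < 3; iter++ {
		s.reset()
		s.duals[iter] = float64(iter + 1) // stand-in for real per-pivot work
		fmt.Println(iter, s.duals)
	}
}
```

This is only safe because nothing holds a reference to the buffers past the iteration, which is the condition the commit message states.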
Split the allocation-reuse fix out into a standalone PR against |
@copilot resolve the merge conflicts in this pull request |
# Conflicts:
#	simplex/simplex.go
Done. Merged |
@copilot resolve the merge conflicts in this pull request |
Resolved conflicts in simplex/simplex.go: update axpy() calls to use flat binv slice notation (binv[i*m:i*m+m]) after main flattened State.binv from [][]float64 to a single m*m []float64 (#5).
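To make the flat-slice notation concrete, here is a small sketch of an `axpy` call on rows of an m*m matrix stored row-major in a single `[]float64`. The `row` helper and the `axpy` signature are assumptions for illustration; the slice expression `binv[i*m : i*m+m]` is the one quoted in the comment above.

```go
package main

import "fmt"

// axpy performs dst[k] += factor*src[k]; assumed signature.
func axpy(dst, src []float64, factor float64) {
	for k := range dst {
		dst[k] += factor * src[k]
	}
}

// row returns row i of an m x m matrix stored row-major in a flat slice.
// Hypothetical helper; the returned slice aliases binv, so writes through
// it modify the matrix in place.
func row(binv []float64, m, i int) []float64 {
	return binv[i*m : i*m+m]
}

func main() {
	m := 3
	binv := []float64{
		1, 0, 0,
		0, 1, 0,
		0, 0, 1,
	}
	// Row 2 += 0.5 * row 0, with no per-row slice headers to chase.
	axpy(row(binv, m, 2), row(binv, m, 0), 0.5)
	fmt.Println(binv) // [1 0 0 0 1 0 0.5 0 1]
}
```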
Summary
- Profiling (`pprof` on a 300x300 random LP, forced to actually pivot) showed 76% of solver CPU time in two dense O(m²) loops: `pivot`'s Binv row update (51%) and `duals`' accumulate (25%). Both are the same AXPY shape: `dst[k] += factor*src[k]`.
- Both loops now call a shared `axpy()`, with build-tag-selected implementations:
  - `axpy_generic.go` (default, all toolchains): original scalar loop, byte-for-byte same behavior.
  - `axpy_archsimd_arm64.go` / `axpy_archsimd_amd64.go` (`//go:build goexperiment.simd && arm64|amd64`): vectorized directly via `simd/archsimd`'s `Float64x2` (NEON/SSE2), no portable dispatch overhead.
  - `axpy_simd.go` (`//go:build goexperiment.simd && !arm64 && !amd64`): portable `simd` package fallback for other SIMD-capable architectures.
- The SIMD paths require `GOEXPERIMENT=simd`; Go 1.27 is not yet released.
Benchmark results
benchstat, n=10 runs each, Apple M4 Pro (arm64 NEON, 128-bit = 2 float64 lanes), this branch vs. current `main`:
Scalar only (Go 1.26.4, default build, no SIMD; isolates the effect of the `axpy()` extraction itself):
The extraction alone is a free win: a `for k := range dst` loop lets the compiler eliminate bounds checks that the original `for k := 0; k < m; k++` loop couldn't prove statically.
With Go 1.27rc1 + `GOEXPERIMENT=simd` (direct `archsimd` NEON path):
Verdict: worth landing. Switching from the portable `simd` dispatch to `archsimd` directly (dropping the dispatch-stub overhead) turned this from a marginal, mixed-size-dependent win into a consistent 30%+ speedup at realistic problem sizes, with the scalar path also getting a smaller unconditional improvement from the refactor itself. It still regresses (within noise) at very small `m`, where call/bounds-check overhead on 2-wide vectors outweighs the parallelism, but that is fully mitigated by keeping this behind the `GOEXPERIMENT=simd` build tag until Go 1.27 ships.
🤖 Generated with Claude Code