simplex: experimental Go 1.27 SIMD axpy for pivot/duals (arm64: -18% via archsimd) by andig · Pull Request #1 · evcc-io/solver

andig · 2026-07-03T11:01:27Z

Summary

Profiling (pprof on a 300x300 random LP, forced to actually pivot) showed 76% of solver CPU time in two dense O(m²) loops: pivot's Binv row update (51%) and duals' accumulate (25%). Both are the same AXPY shape: dst[k] += factor*src[k].
Extracted both into a shared axpy(), with build-tag-selected implementations:
- axpy_generic.go (default, all toolchains): original scalar loop, byte-for-byte same behavior.
- axpy_archsimd_arm64.go / axpy_archsimd_amd64.go (//go:build goexperiment.simd && arm64|amd64): vectorized directly via simd/archsimd's Float64x2 (NEON/SSE2), no portable dispatch overhead.
- axpy_simd.go (//go:build goexperiment.simd && !arm64 && !amd64): portable simd package fallback for other SIMD-capable architectures.
- All SIMD variants require Go 1.27rc1+ with GOEXPERIMENT=simd — 1.27 is not yet released.
Default build/test is completely unaffected — the SIMD files only compile when the experiment flag is set, so this costs nothing to land as-is.
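For readers who have not opened the diff, the shared kernel can be pictured roughly as follows. This is a minimal sketch, not the code from this PR: the parameter order and the up-front reslice of src are assumptions, only the name axpy() and the dst[k] += factor*src[k] shape come from the description above.

```go
package main

import "fmt"

// axpy computes dst[k] += factor*src[k] for every element of dst.
// Hypothetical signature: the PR text only names the function axpy().
// src must be at least as long as dst, otherwise the reslice panics.
func axpy(dst, src []float64, factor float64) {
	src = src[:len(dst)] // single length check before the loop
	for k := range dst {
		dst[k] += factor * src[k]
	}
}

func main() {
	dst := []float64{1, 2, 3}
	src := []float64{10, 20, 30}
	axpy(dst, src, 0.5)
	fmt.Println(dst) // [6 12 18]
}
```

Both hot loops (pivot's Binv row update and duals' accumulate) would then reduce to one axpy() call per row.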

Benchmark results

benchstat, n=10 runs each, Apple M4 Pro (arm64 NEON, 128-bit = 2 float64 lanes), this branch vs. current main:

Scalar only (Go 1.26.4, default build, no SIMD — isolates the effect of the axpy() extraction itself):

ColdSolve_50x50-12      ~ (noise, no significant difference)
ColdSolve_150x150-12   -9.53%  (p=0.000)  faster
ColdSolve_300x300-12  -11.17%  (p=0.000)  faster

The extraction alone is a free win: a for k := range dst loop lets the compiler eliminate bounds checks that the original for k := 0; k < m; k++ loop couldn't prove statically.
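The two loop shapes being compared look roughly like this. It is a sketch under assumptions: the function names are made up for illustration, and the reslice hint in the range version is a common idiom that may or may not match what the PR does.

```go
package main

import "fmt"

// axpyIndexed mirrors the original loop shape: the bound m is a separate
// value, so the compiler cannot relate it to len(dst) or len(src) and has
// to keep a bounds check on each access.
func axpyIndexed(dst, src []float64, factor float64, m int) {
	for k := 0; k < m; k++ {
		dst[k] += factor * src[k]
	}
}

// axpyRange ties the loop bound to len(dst) and reslices src once, which
// gives the compiler enough information to drop the per-element checks.
func axpyRange(dst, src []float64, factor float64) {
	src = src[:len(dst)]
	for k := range dst {
		dst[k] += factor * src[k]
	}
}

func main() {
	src := []float64{2, 4, 6, 8}
	a := []float64{1, 2, 3, 4}
	b := []float64{1, 2, 3, 4}
	axpyIndexed(a, src, 0.5, len(a))
	axpyRange(b, src, 0.5)
	fmt.Println(a, b) // [2 4 6 8] [2 4 6 8]
}
```

Building with `go build -gcflags="-d=ssa/check_bce/debug=1"` should list the bounds checks that remain, which is one way to confirm the difference on a given toolchain.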

With Go 1.27rc1 + GOEXPERIMENT=simd (direct archsimd NEON path):

ColdSolve_50x50-12      ~ (noise; per-call overhead dominates on 2-wide vectors)
ColdSolve_150x150-12  -35.49%  (p=0.000)  faster
ColdSolve_300x300-12  -31.05%  (p=0.000)  faster

Verdict: worth landing. Switching from the portable simd dispatch to archsimd directly (dropping the dispatch-stub overhead) turned this from a marginal, size-dependent result into a consistent 30%+ speedup over main at the 150x150 and 300x300 sizes, with the scalar path also getting a smaller unconditional improvement from the refactor itself. At very small m (50x50) the difference stays within noise, because call and bounds-check overhead on 2-wide vectors offsets the parallelism. Keeping this behind the GOEXPERIMENT=simd build tag until Go 1.27 ships means the default build carries none of that risk.

🤖 Generated with Claude Code

pivot's Binv row-update and duals' accumulate loop are ~76% of solver CPU time (profiled via pprof on a 300x300 random LP). Both are the same dense AXPY pattern (dst[k] += factor*src[k]), now factored into a shared axpy() with two implementations selected by build tag:
- axpy_generic.go (default): the original scalar loop, unchanged behavior.
- axpy_simd.go (//go:build goexperiment.simd): vectorized via the new portable `simd` package, gated behind GOEXPERIMENT=simd (Go 1.27rc1+).

Benchmarked scalar vs SIMD (benchstat, n=10, Apple M4 Pro/arm64 NEON, 128-bit = 2 float64 lanes):

ColdSolve_50x50     +5.7%  slower (call/bounds-check overhead dominates)
ColdSolve_150x150  -10.6%  faster
ColdSolve_300x300   -2.8%  faster
geomean             -2.8%

Mixed, modest result: not a clean win. Not recommending this for the default build: it regresses at small problem sizes (likely the common case), and the whole simd package is still experimental on an unreleased toolchain (1.27rc1). Landing it behind the build tag costs nothing (zero effect on default `go build`/`go test`) and gives a starting point to revisit once Go 1.27 ships and/or wider (256/512-bit) SIMD is targeted.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>

andig · 2026-07-03T11:28:19Z

Verified this isn't a no-op or overhead artifact: disassembled the compiled `axpy@simd128` (`go tool objdump`) and confirmed real NEON instructions are emitted — `VDUP` (broadcast), 128-bit `FMOVQ` load/store, and a genuine vector `VFMLA` (fused multiply-add across 2 float64 lanes). `simd.Emulated()` also reports `false` on this hardware.

So the mixed benchmark result is real, not noise: NEON here is only 128-bit (2 float64 lanes), so the theoretical ceiling is 2x, and there's a runtime CPU-feature dispatch stub on every call (checked via a cached flag, branching to the vector body or a scalar fallback). That fixed per-call overhead outweighs the 2-lane gain at small m (regression at 50x50) and only pays off once the loop is long enough to amortize it (wins at 150x150/300x300). Consistent with the numbers, not an artifact.
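To make the overhead argument concrete, here is a plain-Go model of the call structure described above. It is only an illustration: hasVector, axpyScalar and axpyWide2 are hypothetical names, and the "vector" body is emulated by handling two elements per iteration rather than by real NEON instructions.

```go
package main

import "fmt"

// hasVector stands in for the cached CPU-feature flag (hypothetical name).
var hasVector = true

// axpyScalar is the one-element-at-a-time fallback.
func axpyScalar(dst, src []float64, factor float64) {
	src = src[:len(dst)]
	for k := range dst {
		dst[k] += factor * src[k]
	}
}

// axpyWide2 emulates a 2-lane body: two elements per iteration, then a
// scalar tail for an odd trailing element.
func axpyWide2(dst, src []float64, factor float64) {
	src = src[:len(dst)]
	k := 0
	for ; k+2 <= len(dst); k += 2 {
		dst[k] += factor * src[k]
		dst[k+1] += factor * src[k+1]
	}
	for ; k < len(dst); k++ {
		dst[k] += factor * src[k]
	}
}

// axpy pays the flag test and the extra call on every invocation,
// regardless of how short dst is.
func axpy(dst, src []float64, factor float64) {
	if hasVector {
		axpyWide2(dst, src, factor)
		return
	}
	axpyScalar(dst, src, factor)
}

func main() {
	dst := []float64{1, 1, 1, 1, 1}
	axpy(dst, []float64{2, 4, 6, 8, 10}, 0.5)
	fmt.Println(dst) // [2 3 4 5 6]
}
```

The branch and extra call are a fixed cost per axpy() call, while the 2-lane body at best halves the per-element cost, so short rows see little or no net gain.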

…cost

The portable simd package's per-call CPU-feature dispatch stub was eating most of the 2-lane NEON win at small m (previous PR update: +5.7% regression at 50x50). Switching arm64 to simd/archsimd's Float64x2 directly (same 128-bit/2-lane hardware, no dispatch stub) turns this into a clean win across every size:

ColdSolve_50x50    -11.6%
ColdSolve_150x150  -22.8%
ColdSolve_300x300  -19.1%
geomean            -18.0%

vs the prior portable-simd geomean of -2.8%.

simd/archsimd is amd64/arm64 only for now (per its doc), so axpy_simd.go (portable) stays as the fallback for other GOEXPERIMENT=simd architectures.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
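The geomean figures can be cross-checked from the per-size deltas. A small sketch, assuming the geomean is the geometric mean of the new/old time ratios (which is how benchstat reports it); the helper name is made up:

```go
package main

import (
	"fmt"
	"math"
)

// geomean returns the geometric mean of time ratios (new time / old time).
func geomean(ratios []float64) float64 {
	sum := 0.0
	for _, r := range ratios {
		sum += math.Log(r)
	}
	return math.Exp(sum / float64(len(ratios)))
}

func main() {
	// Ratios derived from the archsimd deltas: -11.6%, -22.8%, -19.1%.
	archsimd := []float64{1 - 0.116, 1 - 0.228, 1 - 0.191}
	// Ratios derived from the portable-simd deltas: +5.7%, -10.6%, -2.8%.
	portable := []float64{1 + 0.057, 1 - 0.106, 1 - 0.028}

	fmt.Printf("archsimd geomean delta: %+.1f%%\n", (geomean(archsimd)-1)*100)
	fmt.Printf("portable geomean delta: %+.1f%%\n", (geomean(portable)-1)*100)
}
```

Both values should land close to the quoted -18.0% and -2.8%, up to rounding of the per-size inputs.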

…ivot

Allocation profile (300x300 random LP) showed phaseCost, duals, and alpha each allocating a fresh slice on every single pivot iteration: 6858 allocs/op, 21.2MB/op. None of that data needs to survive past the iteration, so run() now allocates the three scratch buffers once and clears+reuses them across pivots. Result: 916 allocs/op, 927KB/op (7.5x/23x less).

Timing-wise this is a real but modest win, independent of the SIMD work in this branch. It applies to every build, no GOEXPERIMENT flag needed (benchstat, n=10):

ColdSolve_50x50    -5.36%  (p=0.000)
ColdSolve_150x150  -0.92%  (p=0.007)
ColdSolve_300x300  -0.24%  (p=0.002)
geomean            -2.20%

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
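The allocate-once, clear-and-reuse pattern looks roughly like this. A sketch only: the scratch type, its constructor and the single size parameter are hypothetical, and only the buffer names phaseCost, duals and alpha come from the commit message. It uses the clear builtin, which needs Go 1.21 or newer.

```go
package main

import "fmt"

// scratch groups the per-pivot work buffers (hypothetical type; the real
// buffers may have different lengths).
type scratch struct {
	phaseCost, duals, alpha []float64
}

// newScratch allocates all three buffers once, before the pivot loop.
func newScratch(size int) *scratch {
	return &scratch{
		phaseCost: make([]float64, size),
		duals:     make([]float64, size),
		alpha:     make([]float64, size),
	}
}

// reset zeroes the buffers in place so each pivot starts from the same
// state a fresh make() would have given, without allocating.
func (s *scratch) reset() {
	clear(s.phaseCost)
	clear(s.duals)
	clear(s.alpha)
}

func main() {
	s := newScratch(4)
	for iter := 0; iter < 3; iter++ {
		s.reset()
		s.duals[0] += float64(iter) // stand-in for the real per-pivot work
	}
	fmt.Println(s.duals) // [2 0 0 0]
}
```

Because reset() runs at the top of every iteration, no value leaks from one pivot into the next, which is the property the original per-iteration make() calls provided.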

andig · 2026-07-03T11:41:11Z

Split the allocation-reuse fix out into a standalone PR against main, since it doesn't depend on the SIMD work: #2. Once that merges, I'll rebase this branch — the third commit here (55e511b) will become redundant/need dropping at that point.

andig · 2026-07-03T11:55:58Z

@copilot resolve the merge conflicts in this pull request

# Conflicts: # simplex/simplex.go

Copilot · 2026-07-03T11:57:42Z

@copilot resolve the merge conflicts in this pull request

Done. Merged origin/main into this branch — the only conflict was in simplex/simplex.go in the duals function (inline loop vs. axpy() call); resolved by keeping the axpy() call. All tests pass.

andig · 2026-07-04T09:47:50Z

@copilot resolve the merge conflicts in this pull request

Resolved conflicts in simplex/simplex.go: update axpy() calls to use flat binv slice notation (binv[i*m:i*m+m]) after main flattened State.binv from [][]float64 to a single m*m []float64 (#5).

Copilot · 2026-07-04T09:49:34Z

@copilot resolve the merge conflicts in this pull request

Done — merged origin/main (through #5's binv flattening). Both conflicts were axpy() call sites in duals and pivot; resolved by updating to st.binv[i*m:i*m+m] slice notation to match the new flat layout. All tests pass.
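For context, the flat-layout call sites look roughly like the sketch below. The axpy() signature and the row helper are assumptions made for illustration; only the binv[i*m:i*m+m] slice expression comes from the resolution note above.

```go
package main

import "fmt"

// axpy computes dst[k] += factor*src[k] (hypothetical signature).
func axpy(dst, src []float64, factor float64) {
	src = src[:len(dst)]
	for k := range dst {
		dst[k] += factor * src[k]
	}
}

// row returns row i of an m x m matrix stored row-major in one flat slice.
// Hypothetical helper wrapping the binv[i*m:i*m+m] expression.
func row(binv []float64, m, i int) []float64 {
	return binv[i*m : i*m+m]
}

func main() {
	m := 2
	binv := []float64{
		1, 0, // row 0
		0, 1, // row 1
	}
	// Row update: row 1 += 3 * row 0.
	axpy(row(binv, m, 1), row(binv, m, 0), 3)
	fmt.Println(binv) // [1 0 3 1]
}
```

Each row is a view into the same backing array, so the update writes straight into binv without copying.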

andig changed the title from ~~simplex: experimental Go 1.27 SIMD axpy for pivot/duals (mixed results)~~ to simplex: experimental Go 1.27 SIMD axpy for pivot/duals (arm64: -18% via archsimd) on Jul 3, 2026

Copilot started work on behalf of andig July 3, 2026 11:56

Merge remote-tracking branch 'origin/main' into simd-axpy-experiment

0abf325

# Conflicts: # simplex/simplex.go

Copilot finished work on behalf of andig July 3, 2026 11:57

wip

aa83889

Copilot started work on behalf of andig July 4, 2026 09:48

Merge remote-tracking branch 'origin/main' into simd-axpy-experiment

c05dbcd

Resolved conflicts in simplex/simplex.go: update axpy() calls to use flat binv slice notation (binv[i*m:i*m+m]) after main flattened State.binv from [][]float64 to a single m*m []float64 (#5).

Copilot finished work on behalf of andig July 4, 2026 09:49

andig merged commit 07f6c40 into main Jul 4, 2026

andig deleted the simd-axpy-experiment branch July 4, 2026 10:28

andig mentioned this pull request Jul 4, 2026

CBC-drawn performance stack: factorized basis, presolve, GMI cuts, heuristics #11

Draft

3 tasks

andig merged 6 commits into main from simd-axpy-experiment
