simplex: experimental Go 1.27 SIMD axpy for pivot/duals (arm64: -18% via archsimd)#1

Merged: andig merged 6 commits into main from simd-axpy-experiment on Jul 4, 2026

andig commented Jul 3, 2026

Member

Summary

  • Profiling (pprof on a 300x300 random LP, forced to actually pivot) showed 76% of solver CPU time in two dense O(m²) loops: pivot's Binv row update (51%) and duals' accumulate (25%). Both are the same AXPY shape: dst[k] += factor*src[k].
  • Extracted both into a shared axpy(), with build-tag-selected implementations:
    • axpy_generic.go (default, all toolchains): original scalar loop, byte-for-byte same behavior.
    • axpy_archsimd_arm64.go / axpy_archsimd_amd64.go (//go:build goexperiment.simd && arm64, or && amd64, respectively): vectorized directly via simd/archsimd's Float64x2 (NEON/SSE2), with no portable dispatch overhead.
    • axpy_simd.go (//go:build goexperiment.simd && !arm64 && !amd64): portable simd package fallback for other SIMD-capable architectures.
    • All SIMD variants require Go 1.27rc1+ with GOEXPERIMENT=simd — 1.27 is not yet released.
  • Default build/test is completely unaffected — the SIMD files only compile when the experiment flag is set, so this costs nothing to land as-is.
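
The shared helper described in the list above can be sketched roughly as follows. This is a minimal illustration of the scalar (default) variant only; the parameter order, the `src = src[:len(dst)]` hint, and the build constraint mentioned in the comment are assumptions, not taken from the PR's source.

```go
// Sketch of a default scalar file such as axpy_generic.go. The real file
// would presumably carry a constraint that excludes the SIMD builds, for
// example "!goexperiment.simd"; that exact constraint is an assumption.
package main

import "fmt"

// axpy computes dst[k] += factor*src[k] for every index k of dst.
// It expects len(src) >= len(dst) and panics on the reslice otherwise.
func axpy(dst, src []float64, factor float64) {
	src = src[:len(dst)] // assumed hint so both slices share one known length
	for k := range dst {
		dst[k] += factor * src[k]
	}
}

func main() {
	row := []float64{1, 2, 3}
	pivotRow := []float64{10, 20, 30}
	// Row update in the style of a pivot step: row += -0.5 * pivotRow.
	axpy(row, pivotRow, -0.5)
	fmt.Println(row) // [-4 -8 -12]
}
```

Because every variant would expose the same function, call sites in pivot and duals do not need to know which file the build tags selected.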

Benchmark results

benchstat, n=10 runs each, Apple M4 Pro (arm64 NEON, 128-bit = 2 float64 lanes), this branch vs. current main:

Scalar only (Go 1.26.4, default build, no SIMD — isolates the effect of the axpy() extraction itself):

ColdSolve_50x50-12      ~ (noise, no significant difference)
ColdSolve_150x150-12   -9.53%  (p=0.000)  faster
ColdSolve_300x300-12  -11.17%  (p=0.000)  faster

The extraction alone is a free win: a for k := range dst loop lets the compiler eliminate bounds checks that the original for k := 0; k < m; k++ loop couldn't prove statically.
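
To make the bounds-check point concrete, here is a hedged side-by-side of the two loop shapes. The function names are hypothetical, and the reslice in `axpyRange` is an assumed idiom for letting the compiler relate the two slice lengths; whether the PR does exactly this is not confirmed by the text.

```go
package main

import "fmt"

// axpyIndexed mirrors the original loop shape. m is an independent value,
// so the compiler generally cannot prove k < len(dst) or k < len(src) and
// keeps a bounds check on each access.
func axpyIndexed(dst, src []float64, factor float64, m int) {
	for k := 0; k < m; k++ {
		dst[k] += factor * src[k]
	}
}

// axpyRange ties the loop bound to len(dst). The reslice is checked once,
// before the loop, and tells the compiler that src is as long as dst.
func axpyRange(dst, src []float64, factor float64) {
	src = src[:len(dst)]
	for k := range dst {
		dst[k] += factor * src[k]
	}
}

func main() {
	a := []float64{1, 1, 1, 1}
	b := []float64{1, 1, 1, 1}
	src := []float64{1, 2, 3, 4}
	axpyIndexed(a, src, 2, len(a))
	axpyRange(b, src, 2)
	fmt.Println(a, b) // [3 5 7 9] [3 5 7 9]
}
```

One way to check which accesses still carry a bounds check is to build with `go build -gcflags="-d=ssa/check_bce/debug=1"`, which lists the remaining checks by source position.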

With Go 1.27rc1 + GOEXPERIMENT=simd (direct archsimd NEON path):

ColdSolve_50x50-12      ~ (noise; per-call overhead dominates on 2-wide vectors)
ColdSolve_150x150-12  -35.49%  (p=0.000)  faster
ColdSolve_300x300-12  -31.05%  (p=0.000)  faster

Verdict: worth landing. Switching from the portable simd dispatch to archsimd directly (dropping the dispatch-stub overhead) turned this from a marginal, size-dependent result into a consistent 30%+ speedup at realistic problem sizes, and the scalar path also gets a smaller unconditional improvement from the refactor itself. At very small m (50x50) there is still no measurable gain: the result is within noise, because call and bounds-check overhead on 2-wide vectors offsets the parallelism. That costs nothing in practice, since the SIMD path stays behind the GOEXPERIMENT=simd build tag until Go 1.27 ships.

🤖 Generated with Claude Code

pivot's Binv row-update and duals' accumulate loop are ~76% of solver CPU
time (profiled via pprof on a 300x300 random LP). Both are the same
dense AXPY pattern (dst[k] += factor*src[k]), now factored into a shared
axpy() with two implementations selected by build tag:

- axpy_generic.go (default): the original scalar loop, unchanged behavior.
- axpy_simd.go (//go:build goexperiment.simd): vectorized via the new
  portable `simd` package, gated behind GOEXPERIMENT=simd (Go 1.27rc1+).

Benchmarked scalar vs SIMD (benchstat, n=10, Apple M4 Pro/arm64 NEON,
128-bit = 2 float64 lanes):

    ColdSolve_50x50    +5.7% slower (call/bounds-check overhead dominates)
    ColdSolve_150x150  -10.6% faster
    ColdSolve_300x300  -2.8% faster
    geomean            -2.8%

Mixed, modest result — not a clean win. Not recommending this for the
default build: it regresses at small problem sizes (likely the common
case), and the whole simd package is still experimental on an unreleased
toolchain (1.27rc1). Landing it behind the build tag costs nothing (zero
effect on default `go build`/`go test`) and gives a starting point to
revisit once Go 1.27 ships and/or wider (256/512-bit) SIMD is targeted.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>

andig commented Jul 3, 2026

Member Author

Verified this isn't a no-op or overhead artifact: disassembled the compiled `axpy@simd128` (`go tool objdump`) and confirmed real NEON instructions are emitted — `VDUP` (broadcast), 128-bit `FMOVQ` load/store, and a genuine vector `VFMLA` (fused multiply-add across 2 float64 lanes). `simd.Emulated()` also reports `false` on this hardware.

So the mixed benchmark result is real, not noise: NEON here is only 128-bit (2 float64 lanes), so the theoretical ceiling is 2x, and there's a runtime CPU-feature dispatch stub on every call (checked via a cached flag, branching to the vector body or a scalar fallback). That fixed per-call overhead outweighs the 2-lane gain at small m (regression at 50x50) and only pays off once the loop is long enough to amortize it (wins at 150x150/300x300). Consistent with the numbers, not an artifact.
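
The shape of that per-call cost can be sketched in plain Go. This is an illustration only, not the simd package's actual dispatch code: `hasVector` is a hypothetical cached flag, and `axpyPairs` merely stands in for a 2-lane vector body by handling two elements per step.

```go
package main

import "fmt"

// hasVector stands in for a CPU-feature flag cached at startup (hypothetical).
var hasVector = true

// axpyScalar is the plain fallback loop.
func axpyScalar(dst, src []float64, f float64) {
	src = src[:len(dst)]
	for k := range dst {
		dst[k] += f * src[k]
	}
}

// axpyPairs imitates a 2-lane body: two elements per step, then a scalar
// tail for an odd trailing element.
func axpyPairs(dst, src []float64, f float64) {
	src = src[:len(dst)]
	k := 0
	for ; k+2 <= len(dst); k += 2 {
		dst[k] += f * src[k]
		dst[k+1] += f * src[k+1]
	}
	for ; k < len(dst); k++ {
		dst[k] += f * src[k]
	}
}

// axpyDispatch pays for the flag check and one extra call on every
// invocation, regardless of how short dst is.
func axpyDispatch(dst, src []float64, f float64) {
	if hasVector {
		axpyPairs(dst, src, f)
		return
	}
	axpyScalar(dst, src, f)
}

func main() {
	dst := []float64{1, 2, 3}
	axpyDispatch(dst, []float64{1, 1, 1}, 2)
	fmt.Println(dst) // [3 4 5]
}
```

With at most a 2x gain inside the loop, the fixed cost of `axpyDispatch` matters most when `len(dst)` is small, which is consistent with the 50x50 regression reported above.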

…cost

The portable simd package's per-call CPU-feature dispatch stub was
eating most of the 2-lane NEON win at small m (previous PR update:
+5.7% regression at 50x50). Switching arm64 to simd/archsimd's
Float64x2 directly (same 128-bit/2-lane hardware, no dispatch stub)
turns this into a clean win across every size:

    ColdSolve_50x50    -11.6%
    ColdSolve_150x150  -22.8%
    ColdSolve_300x300  -19.1%
    geomean            -18.0%

vs the prior portable-simd geomean of -2.8%. simd/archsimd is amd64/arm64
only for now (per its doc), so axpy_simd.go (portable) stays as
the fallback for other GOEXPERIMENT=simd architectures.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
@andig andig changed the title simplex: experimental Go 1.27 SIMD axpy for pivot/duals (mixed results) simplex: experimental Go 1.27 SIMD axpy for pivot/duals (arm64: -18% via archsimd) Jul 3, 2026

…ivot

Allocation profile (300x300 random LP) showed phaseCost, duals, and
alpha each allocating a fresh slice on every single pivot iteration:
6858 allocs/op, 21.2MB/op. None of that data needs to survive past the
iteration, so run() now allocates the three scratch buffers once and
clears+reuses them across pivots.

Result: 916 allocs/op, 927KB/op (7.5x/23x less). Timing-wise this is a
real but modest win, independent of the SIMD work in this branch —
applies to every build, no GOEXPERIMENT flag needed (benchstat, n=10):

    ColdSolve_50x50    -5.36% (p=0.000)
    ColdSolve_150x150  -0.92% (p=0.007)
    ColdSolve_300x300  -0.24% (p=0.002)
    geomean            -2.20%
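
The allocate-once, clear-and-reuse pattern described in this commit message might look roughly like the sketch below. The buffer names follow the message; the `scratch` struct, its methods, and the single shared length are hypothetical simplifications, and the `clear` builtin requires Go 1.21 or later.

```go
package main

import "fmt"

// scratch groups the per-iteration work buffers. The struct is hypothetical;
// only the three buffer names come from the commit message.
type scratch struct {
	phaseCost, duals, alpha []float64
}

// newScratch allocates all buffers once. Using one length n for all three
// is a simplification for this sketch.
func newScratch(n int) *scratch {
	return &scratch{
		phaseCost: make([]float64, n),
		duals:     make([]float64, n),
		alpha:     make([]float64, n),
	}
}

// reset zeroes every buffer in place without allocating.
func (s *scratch) reset() {
	clear(s.phaseCost)
	clear(s.duals)
	clear(s.alpha)
}

func main() {
	s := newScratch(4)
	for iter := 0; iter < 3; iter++ {
		s.reset() // start each pivot iteration from zeroed buffers
		s.duals[iter] = float64(iter + 1)
	}
	fmt.Println(s.duals) // [0 0 3 0]
}
```

Only the value written in the last iteration survives, which shows that each reset discards the previous iteration's data while the backing arrays are reused.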

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>

andig commented Jul 3, 2026

Member Author

Split the allocation-reuse fix out into a standalone PR against main, since it doesn't depend on the SIMD work: #2. Once that merges, I'll rebase this branch — the third commit here (55e511b) will become redundant/need dropping at that point.

andig commented Jul 3, 2026

Member Author

@copilot resolve the merge conflicts in this pull request

Copilot AI commented Jul 3, 2026

Contributor

@copilot resolve the merge conflicts in this pull request

Done. Merged origin/main into this branch — the only conflict was in simplex/simplex.go in the duals function (inline loop vs. axpy() call); resolved by keeping the axpy() call. All tests pass.

Copilot finished work on behalf of andig July 3, 2026 11:57

andig commented Jul 4, 2026

Member Author

@copilot resolve the merge conflicts in this pull request

Resolved conflicts in simplex/simplex.go: update axpy() calls to use
flat binv slice notation (binv[i*m:i*m+m]) after main flattened State.binv
from [][]float64 to a single m*m []float64 (#5).

Copilot AI commented Jul 4, 2026

Contributor

@copilot resolve the merge conflicts in this pull request

Done — merged origin/main (through #5's binv flattening). Both conflicts were axpy() call sites in duals and pivot; resolved by updating to st.binv[i*m:i*m+m] slice notation to match the new flat layout. All tests pass.
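
For readers unfamiliar with the flat layout, the slice notation above can be sketched as follows. The `row` helper and the `axpy` signature are hypothetical; only the `binv[i*m:i*m+m]` expression and the row-major m*m layout come from the conversation.

```go
package main

import "fmt"

// axpy computes dst[k] += factor*src[k]; the signature is assumed.
func axpy(dst, src []float64, factor float64) {
	src = src[:len(dst)]
	for k := range dst {
		dst[k] += factor * src[k]
	}
}

// row returns a view (not a copy) of row i of an m x m matrix stored
// row-major in a single slice.
func row(binv []float64, m, i int) []float64 {
	return binv[i*m : i*m+m]
}

func main() {
	m := 2
	binv := []float64{
		1, 2, // row 0
		3, 4, // row 1
	}
	// Row 1 += -3 * row 0, written straight into the flat storage.
	axpy(row(binv, m, 1), row(binv, m, 0), -3)
	fmt.Println(binv) // [1 2 0 -2]
}
```

Since the row slices alias the flat backing array, the update needs no copying, and distinct rows never overlap.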

Copilot finished work on behalf of andig July 4, 2026 09:49
@andig andig merged commit 07f6c40 into main Jul 4, 2026
@andig andig deleted the simd-axpy-experiment branch July 4, 2026 10:28