Skip to content

[RVV] complete rvv reduce kernels#9692

Merged
copybara-service[bot] merged 4 commits intogoogle:masterfrom
ken-unger:reduce-rvv
Mar 24, 2026
Merged

[RVV] complete rvv reduce kernels#9692
copybara-service[bot] merged 4 commits intogoogle:masterfrom
ken-unger:reduce-rvv

Conversation

@ken-unger
Copy link
Copy Markdown
Contributor

Complete all of the rvv reduce kernels used by reduce-config.

  • f16-f32acc-rdsum
  • f16-f32acc-rdsum2
  • f16-f32acc-rsum
  • f16-f32acc-rsum2
  • f16-rminmax
  • f16-rdminmax
  • f32-rdminmax
  • f32-rdsum2
  • f32-rsum2
  • s8-rdminmax
  • s8-rminmax
  • u8-rdminmax
  • u8-rminmax

I limited the commit of generated kernels to the upper LMUL values. Performance of these simple kernels is generally memory bandwidth constrained.

Tested on BPI-F3.

Lots of files in this commit, but changes to src/config/reduce_config.c are perhaps the most important to review. But of course anything is fair game.

…16-f32acc-rsum2, f16-rminmax, f16-rdminmax, f32-rdsum, f32-rdsum2, f32-rsum2, s8-rdminmax, s8-rminmax, u8-rdminmax, u8-rminmax, f32-rdminmax
Copy link
Copy Markdown
Collaborator

@fbarchard fbarchard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when doing tail in main loop, is there a performance advantage to tail 'a' agnostic instead of 'u' for undisturbed?
if so the main loop could use 'a' and the remainder use 'u'

consider vdot for 8 bit.

note that rmax and rminmax come up for softmax

@ken-unger
Copy link
Copy Markdown
Contributor Author

when doing tail in main loop, is there a performance advantage to tail 'a' agnostic instead of 'u' for undisturbed? if so the main loop could use 'a' and the remainder use 'u'

I haven't seen a difference, although I'll take a quick look. I see what likely triggered your question e.g "__riscv_vfmax_vv_f16m8_tu(". That was in the original f32 version, while I've been trending towards using the overloaded intrinsics e.g "__riscv_vfmax(" which I think makes reading cleaner ... and maybe for the future. But I'll take a quick look and update here and elsewhere in this PR to the overloaded versions if no difference. Will also update the copyright date here and other.

consider vdot for 8 bit.

I very much wish we had that. I see that this is now Zvdot4a8i but not yet standard. https://github.com/riscv/riscv-isa-manual/pull/2576/changes. I prototyped using https://github.com/nibrunie/rvv-intrinsic-emulation, and it slides in easily (for c4 gemm), but we'll need to wait for the real thing for performance.

note that rmax and rminmax come up for softmax

Yes. I've got f16-raddstoreexpminusmax but held out from this PR since it required touching the test tolerance. I'll post that once a few of these other PRs are merged.

@ken-unger
Copy link
Copy Markdown
Contributor Author

Pushed an update

  • copyright date updated for src/f32-rdsum2/rvv.c.in and regenerated
  • replaced the explicit vfmin/vfmax intrinsics with the overloaded versions in src/f16-rminmax/rvv.c.in and regenerated.

@fbarchard I didn't see a performance delta in my test (below), although as I mentioned the slow-ish ddr performance on this platform dominates for these sorts of tests. Tests pass, as expected, as tu isn't required here.

`
previous (tu) --
reduce/xnn_f16_rmax_ukernel__rvvfp16arith_u4v/channels:32768/rows:1/real_time 8227 ns 8226 ns
reduce/xnn_f16_rmax_ukernel__rvvfp16arith_u8v/channels:32768/rows:1/real_time 8225 ns 8225 ns
reduce/xnn_f16_rmin_ukernel__rvvfp16arith_u4v/channels:32768/rows:1/real_time 8223 ns 8223 ns
reduce/xnn_f16_rmin_ukernel__rvvfp16arith_u8v/channels:32768/rows:1/real_time 8220 ns 8220 ns
reduce/xnn_f16_rminmax_ukernel__rvvfp16arith_u4v/channels:32768/rows:1/real_time 8560 ns 8560 ns
reduce/xnn_f16_rminmax_ukernel__rvvfp16arith_u8v/channels:32768/rows:1/real_time 8464 ns 8464 ns

latest (overloaded) --
reduce/xnn_f16_rmax_ukernel__rvvfp16arith_u4v/channels:32768/rows:1/real_time 8187 ns 8187 ns
reduce/xnn_f16_rmax_ukernel__rvvfp16arith_u8v/channels:32768/rows:1/real_time 8219 ns 8218 ns
reduce/xnn_f16_rmin_ukernel__rvvfp16arith_u4v/channels:32768/rows:1/real_time 8185 ns 8185 ns
reduce/xnn_f16_rmin_ukernel__rvvfp16arith_u8v/channels:32768/rows:1/real_time 8196 ns 8196 ns
reduce/xnn_f16_rminmax_ukernel__rvvfp16arith_u4v/channels:32768/rows:1/real_time 8712 ns 8712 ns
reduce/xnn_f16_rminmax_ukernel__rvvfp16arith_u8v/channels:32768/rows:1/real_time 8429 ns 8429 ns
`

BTW, I see this benchmark runs a 'channels:1' set of tests. If that is a real use case, then having these kernels simply return 'output=input' is an obvious optimization, although I assume not since I don't see that being done anywhere.

@ken-unger
Copy link
Copy Markdown
Contributor Author

@dsharletg please review and merge when you are able. Thank you.

copybara-service bot pushed a commit that referenced this pull request Mar 24, 2026
--
83d6eeb by Ken Unger <ken.j.unger@gmail.com>:

add or update f16-f32acc-rdsum, f16-f32acc-rdsum2, f16-f32acc-rsum, f16-f32acc-rsum2, f16-rminmax, f16-rdminmax, f32-rdsum, f32-rdsum2, f32-rsum2, s8-rdminmax, s8-rminmax, u8-rdminmax, u8-rminmax, f32-rdminmax

--
c3ea126 by Ken Unger <ken.j.unger@gmail.com>:

cleanup

--
2ffb3d2 by Ken Unger <ken.j.unger@gmail.com>:

cleanup per review comments

FUTURE_COPYBARA_INTEGRATE_REVIEW=#9692 from ken-unger:reduce-rvv 80c9fa7
PiperOrigin-RevId: 888756891
@copybara-service copybara-service bot merged commit 55f3a63 into google:master Mar 24, 2026
21 checks passed
@ken-unger ken-unger deleted the reduce-rvv branch March 28, 2026 16:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants