[RVV] complete rvv reduce kernels by ken-unger · Pull Request #9692 · google/XNNPACK

ken-unger · 2026-03-16T05:35:38Z

Complete all of the rvv reduce kernels used by reduce-config.

f16-f32acc-rdsum
f16-f32acc-rdsum2
f16-f32acc-rsum
f16-f32acc-rsum2
f16-rminmax
f16-rdminmax
f32-rdminmax
f32-rdsum2
f32-rsum2
s8-rdminmax
s8-rminmax
u8-rdminmax
u8-rminmax

I limited the commit of generated kernels to the upper LMUL values. Performance of these simple kernels is generally memory bandwidth constrained.

Tested on BPI-F3.

Lots of files in this commit, but changes to src/config/reduce_config.c are perhaps the most important to review. But of course anything is fair game.

…16-f32acc-rsum2, f16-rminmax, f16-rdminmax, f32-rdsum, f32-rdsum2, f32-rsum2, s8-rdminmax, s8-rminmax, u8-rdminmax, u8-rminmax, f32-rdminmax

fbarchard

when doing tail in main loop, is there a performance advantage to tail 'a' agnostic instead of 'u' for undisturbed?
if so the main loop could use 'a' and the remainder use 'u'

consider vdot for 8 bit.

note that rmax and rminmax come up for softmax

ken-unger · 2026-03-17T00:28:12Z

when doing tail in main loop, is there a performance advantage to tail 'a' agnostic instead of 'u' for undisturbed? if so the main loop could use 'a' and the remainder use 'u'

I haven't seen a difference, although I'll take a quick look. I see what likely triggered your question e.g "__riscv_vfmax_vv_f16m8_tu(". That was in the original f32 version, while I've been trending towards using the overloaded intrinsics e.g "__riscv_vfmax(" which I think makes reading cleaner ... and maybe for the future. But I'll take a quick look and update here and elsewhere in this PR to the overloaded versions if no difference. Will also update the copyright date here and other.

consider vdot for 8 bit.

I very much wish we had that. I see that this is now Zvdot4a8i but not yet standard. https://github.com/riscv/riscv-isa-manual/pull/2576/changes. I prototyped using https://github.com/nibrunie/rvv-intrinsic-emulation, and it slides in easily (for c4 gemm), but we'll need to wait for the real thing for performance.

note that rmax and rminmax come up for softmax

Yes. I've got f16-raddstoreexpminusmax but held out from this PR since it required touching the test tolerance. I'll post that once a few of these other PRs are merged.

ken-unger · 2026-03-17T03:17:52Z

Pushed an update

copyright date updated for src/f32-rdsum2/rvv.c.in and regenerated
replaced the explicit vfmin/vfmax intrinsics with the overloaded versions in src/f16-rminmax/rvv.c.in and regenerated.

@fbarchard I didn't see a performance delta in my test (below), although as I mentioned the slow-ish ddr performance on this platform dominates for these sorts of tests. Tests pass, as expected, as tu isn't required here.

`
previous (tu) --
reduce/xnn_f16_rmax_ukernel__rvvfp16arith_u4v/channels:32768/rows:1/real_time 8227 ns 8226 ns
reduce/xnn_f16_rmax_ukernel__rvvfp16arith_u8v/channels:32768/rows:1/real_time 8225 ns 8225 ns
reduce/xnn_f16_rmin_ukernel__rvvfp16arith_u4v/channels:32768/rows:1/real_time 8223 ns 8223 ns
reduce/xnn_f16_rmin_ukernel__rvvfp16arith_u8v/channels:32768/rows:1/real_time 8220 ns 8220 ns
reduce/xnn_f16_rminmax_ukernel__rvvfp16arith_u4v/channels:32768/rows:1/real_time 8560 ns 8560 ns
reduce/xnn_f16_rminmax_ukernel__rvvfp16arith_u8v/channels:32768/rows:1/real_time 8464 ns 8464 ns

latest (overloaded) --
reduce/xnn_f16_rmax_ukernel__rvvfp16arith_u4v/channels:32768/rows:1/real_time 8187 ns 8187 ns
reduce/xnn_f16_rmax_ukernel__rvvfp16arith_u8v/channels:32768/rows:1/real_time 8219 ns 8218 ns
reduce/xnn_f16_rmin_ukernel__rvvfp16arith_u4v/channels:32768/rows:1/real_time 8185 ns 8185 ns
reduce/xnn_f16_rmin_ukernel__rvvfp16arith_u8v/channels:32768/rows:1/real_time 8196 ns 8196 ns
reduce/xnn_f16_rminmax_ukernel__rvvfp16arith_u4v/channels:32768/rows:1/real_time 8712 ns 8712 ns
reduce/xnn_f16_rminmax_ukernel__rvvfp16arith_u8v/channels:32768/rows:1/real_time 8429 ns 8429 ns
`

BTW, I see this benchmark runs a 'channels:1' set of tests. If that is a real use case, then having these kernels simply return 'output=input' is an obvious optimization, although I assume not since I don't see that being done anywhere.

ken-unger · 2026-03-24T15:53:43Z

@dsharletg please review and merge when you are able. Thank you.

-- 83d6eeb by Ken Unger <ken.j.unger@gmail.com>: add or update f16-f32acc-rdsum, f16-f32acc-rdsum2, f16-f32acc-rsum, f16-f32acc-rsum2, f16-rminmax, f16-rdminmax, f32-rdsum, f32-rdsum2, f32-rsum2, s8-rdminmax, s8-rminmax, u8-rdminmax, u8-rminmax, f32-rdminmax -- c3ea126 by Ken Unger <ken.j.unger@gmail.com>: cleanup -- 2ffb3d2 by Ken Unger <ken.j.unger@gmail.com>: cleanup per review comments FUTURE_COPYBARA_INTEGRATE_REVIEW=#9692 from ken-unger:reduce-rvv 80c9fa7 PiperOrigin-RevId: 888756891

ken-unger added 2 commits March 15, 2026 21:06

add or update f16-f32acc-rdsum, f16-f32acc-rdsum2, f16-f32acc-rsum, f…

83d6eeb

…16-f32acc-rsum2, f16-rminmax, f16-rdminmax, f32-rdsum, f32-rdsum2, f32-rsum2, s8-rdminmax, s8-rminmax, u8-rdminmax, u8-rminmax, f32-rdminmax

cleanup

c3ea126

fbarchard approved these changes Mar 16, 2026

View reviewed changes

cleanup per review comments

2ffb3d2

Merge branch 'google:master' into reduce-rvv

80c9fa7

dsharlet approved these changes Mar 24, 2026

View reviewed changes

copybara-service bot mentioned this pull request Mar 24, 2026

Copybara import of the project: #9766

Merged

copybara-service bot merged commit 55f3a63 into google:master Mar 24, 2026
21 checks passed

ken-unger deleted the reduce-rvv branch March 28, 2026 16:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RVV] complete rvv reduce kernels#9692

[RVV] complete rvv reduce kernels#9692
copybara-service[bot] merged 4 commits intogoogle:masterfrom
ken-unger:reduce-rvv

ken-unger commented Mar 16, 2026

Uh oh!

fbarchard left a comment

Uh oh!

ken-unger commented Mar 17, 2026

Uh oh!

ken-unger commented Mar 17, 2026

Uh oh!

ken-unger commented Mar 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ken-unger commented Mar 16, 2026

Uh oh!

fbarchard left a comment

Choose a reason for hiding this comment

Uh oh!

ken-unger commented Mar 17, 2026

Uh oh!

ken-unger commented Mar 17, 2026

Uh oh!

ken-unger commented Mar 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants