[RVV] complete rvv reduce kernels #9692
…16-f32acc-rsum2, f16-rminmax, f16-rdminmax, f32-rdsum, f32-rdsum2, f32-rsum2, s8-rdminmax, s8-rminmax, u8-rdminmax, u8-rminmax, f32-rdminmax
fbarchard
left a comment
When handling the tail in the main loop, is there a performance advantage to tail-agnostic ('a') instead of tail-undisturbed ('u')?
If so, the main loop could use 'a' and the remainder could use 'u'.
Consider vdot for 8-bit.
Note that rmax and rminmax come up for softmax.
I haven't seen a difference, although I'll take a quick look. I see what likely triggered your question, e.g. "__riscv_vfmax_vv_f16m8_tu(". That was in the original f32 version, while I've been trending towards the overloaded intrinsics, e.g. "__riscv_vfmax(", which I think makes for cleaner reading ... and maybe is better for the future. I'll take a quick look and, if there is no difference, update here and elsewhere in this PR to the overloaded versions. Will also update the copyright date here and in the other files.
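For readers unfamiliar with the two spellings discussed above, here is a minimal sketch of a max reduction that accumulates in the main loop with a tail-undisturbed policy. It is not code from this PR: `reduce_max_f32` is a hypothetical helper, it uses f32 rather than f16 so that Zvfh is not needed, and the scalar fallback exists only so the sketch builds on non-RISC-V hosts. In this particular loop structure `_tu` matters because the final reduction reads all `vlmax` lanes of the accumulator; kernels structured differently may not need it.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__riscv_vector) && defined(__riscv_v_intrinsic)
#include <riscv_vector.h>
#endif

// Hypothetical helper (not an XNNPACK kernel): returns the max of n floats.
// Requires n >= 1. NaN inputs are not considered in this sketch.
static float reduce_max_f32(const float* input, size_t n) {
  assert(n != 0);
#if defined(__riscv_vector) && defined(__riscv_v_intrinsic)
  const float first = input[0];
  const size_t vlmax = __riscv_vsetvlmax_e32m8();
  // Seed every lane with a real input value so lanes that are never
  // written still hold valid data for the final reduction.
  vfloat32m8_t vacc = __riscv_vfmv_v_f_f32m8(first, vlmax);
  while (n != 0) {
    const size_t vl = __riscv_vsetvl_e32m8(n);
    const vfloat32m8_t vin = __riscv_vle32_v_f32m8(input, vl);
    // Explicit spelling:   __riscv_vfmax_vv_f32m8_tu(vacc, vacc, vin, vl)
    // Overloaded spelling: __riscv_vfmax_tu(vacc, vacc, vin, vl)
    // Both keep lanes >= vl of vacc unchanged (tail undisturbed).
    vacc = __riscv_vfmax_tu(vacc, vacc, vin, vl);
    input += vl;
    n -= vl;
  }
  // Reduce all vlmax lanes to a scalar, starting from the first element.
  const vfloat32m1_t vseed = __riscv_vfmv_v_f_f32m1(first, 1);
  const vfloat32m1_t vmax =
      __riscv_vfredmax_vs_f32m8_f32m1(vacc, vseed, vlmax);
  return __riscv_vfmv_f_s_f32m1_f32(vmax);
#else
  // Portable fallback so the sketch compiles without the V extension.
  float vmax = input[0];
  for (size_t i = 1; i < n; i++) {
    if (input[i] > vmax) vmax = input[i];
  }
  return vmax;
#endif
}
```

Calling `reduce_max_f32(data, count)` on a small array is enough to compare the explicit and overloaded spellings; swapping one for the other should not change the result, since the overloaded name resolves to the same intrinsic from its argument types.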
I very much wish we had that. I see that this is now Zvdot4a8i but not yet standard. https://github.com/riscv/riscv-isa-manual/pull/2576/changes. I prototyped using https://github.com/nibrunie/rvv-intrinsic-emulation, and it slides in easily (for c4 gemm), but we'll need to wait for the real thing for performance.
Yes. I've got f16-raddstoreexpminusmax but held it back from this PR since it required touching the test tolerance. I'll post that once a few of these other PRs are merged.
Pushed an update
@fbarchard I didn't see a performance delta in my test (below), although as I mentioned, the slow-ish DDR performance on this platform dominates for these sorts of tests. Tests pass, as expected, as tu isn't required here.
latest (overloaded)
BTW, I see this benchmark runs a 'channels:1' set of tests. If that is a real use case, then having these kernels simply return 'output=input' is an obvious optimization, although I assume not since I don't see that being done anywhere.
@dsharletg please review and merge when you are able. Thank you.
-- 83d6eeb by Ken Unger <ken.j.unger@gmail.com>:
add or update f16-f32acc-rdsum, f16-f32acc-rdsum2, f16-f32acc-rsum, f16-f32acc-rsum2, f16-rminmax, f16-rdminmax, f32-rdsum, f32-rdsum2, f32-rsum2, s8-rdminmax, s8-rminmax, u8-rdminmax, u8-rminmax, f32-rdminmax
-- c3ea126 by Ken Unger <ken.j.unger@gmail.com>:
cleanup
-- 2ffb3d2 by Ken Unger <ken.j.unger@gmail.com>:
cleanup per review comments
FUTURE_COPYBARA_INTEGRATE_REVIEW=#9692 from ken-unger:reduce-rvv 80c9fa7
PiperOrigin-RevId: 888756891
Complete all of the rvv reduce kernels used by reduce-config.
I limited the commit of generated kernels to the upper LMUL values. Performance of these simple kernels is generally memory bandwidth constrained.
Tested on BPI-F3.
Lots of files in this commit, but changes to src/config/reduce_config.c are perhaps the most important to review. But of course anything is fair game.