x86_64: Eliminate caddq intrinsics #905

Open
willieyz wants to merge 2 commits into main from eliminate-caddq-intrinsics

willieyz commented Jan 22, 2026

This PR replaces the AVX2 intrinsics implementation of poly_caddq with an x86_64 assembly version.
The two tables below compare the performance of both implementations.
For keypair, sign, and verify in the opt build, the difference stays within about 1%, which is consistent with the no-opt case.

In the component-level benchmark for mld_poly_caddq (opt build), the assembly version is 18% to 34% slower than the AVX2 intrinsics. After unrolling the loop by a factor of 4, the slowdown shrinks to between 5% and 11%.
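
For readers unfamiliar with the routine: poly_caddq maps every coefficient of a polynomial to its non-negative representative by adding q to negative values. The sketch below models that behavior in Python rather than in assembly or intrinsics. It assumes the ML-DSA modulus q = 8380417, 256 coefficients per polynomial, and inputs in the open interval (-q, q); the Python functions are illustrative models, not the project's C or assembly code.

```python
Q = 8380417  # ML-DSA modulus q = 2^23 - 2^13 + 1
N = 256      # coefficients per polynomial


def caddq(a: int) -> int:
    """Return a + Q if a is negative, else a (a is assumed to lie in (-Q, Q))."""
    # For a in the signed 32-bit range, a >> 31 is -1 when a < 0 and 0 otherwise,
    # so the mask selects Q or 0, mirroring the branch-free idiom used in C.
    mask = a >> 31
    return a + (mask & Q)


def poly_caddq(coeffs: list[int]) -> list[int]:
    """Apply caddq to every coefficient of a polynomial."""
    assert len(coeffs) == N
    return [caddq(c) for c in coeffs]


if __name__ == "__main__":
    poly = [-1, 0, 5, -(Q - 1)] + [0] * (N - 4)
    print(poly_caddq(poly)[:4])
```

Each coefficient is handled independently, which is why the operation vectorizes easily and why loop unrolling is a natural tuning knob for the assembly version.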

  • bench components
    • Δ (%) = (asm − AVX2) / AVX2 × 100

| Component | Implementation | Build | ML-DSA-44 | ML-DSA-65 | ML-DSA-87 | Notes |
|---|---|---|---|---|---|---|
| mld_poly_caddq (avg) | AVX2 intrinsics | no-opt | 391 | 393 | 391 | |
| | x86_64 asm | no-opt | 390 | 393 | 392 | |
| | Δ (%) | no-opt | -0.26 | 0.00 | +0.26 | |
| mld_poly_caddq (avg) | AVX2 intrinsics | opt | 38 | 40 | 39 | |
| | x86_64 asm | opt | 51 | 50 | 46 | |
| | x86_64 asm (unroll) | opt | 42 | 42 | 42 | unroll by 4 |
| | Δ (%) | opt | +34.21 | +25.00 | +17.95 | |
| | Δ (%) (unroll) | opt | +10.53 | +5.00 | +7.69 | unroll by 4 |

  • bench
    • Δ (%) = (asm − AVX2) / AVX2 × 100

| Component | Implementation | Build | ML-DSA-44 | ML-DSA-65 | ML-DSA-87 | Notes |
|---|---|---|---|---|---|---|
| keypair cycles (avg) | AVX2 | no-opt | 134355 | 226117 | 377069 | baseline (main) |
| | x86_64 asm | no-opt | 133831 | 226345 | 374963 | |
| | Δ (%) | no-opt | -0.39 | +0.10 | -0.56 | |
| | AVX2 | opt | 60367 | 105019 | 166676 | baseline (main) |
| | x86_64 asm | opt | 60535 | 104479 | 165781 | |
| | x86_64 asm (unroll) | opt | 59921 | 104367 | 165795 | unroll by 4 |
| | Δ (%) | opt | +0.28 | -0.51 | -0.54 | |
| | Δ (%) (unroll) | opt | -0.74 | -0.62 | -0.53 | unroll by 4 |
| sign cycles (avg) | AVX2 | no-opt | 473892 | 779091 | 998026 | baseline (main) |
| | x86_64 asm | no-opt | 473262 | 779359 | 993245 | |
| | Δ (%) | no-opt | -0.13 | +0.03 | -0.48 | |
| | AVX2 | opt | 179804 | 301077 | 364509 | baseline (main) |
| | x86_64 asm | opt | 180253 | 298598 | 363742 | |
| | x86_64 asm (unroll) | opt | 178255 | 299153 | 363505 | unroll by 4 |
| | Δ (%) | opt | +0.25 | -0.82 | -0.21 | |
| | Δ (%) (unroll) | opt | -0.86 | -0.64 | -0.28 | unroll by 4 |
| verify cycles (avg) | AVX2 | no-opt | 140765 | 228322 | 379244 | baseline (main) |
| | x86_64 asm | no-opt | 140872 | 228255 | 377091 | |
| | Δ (%) | no-opt | +0.08 | -0.03 | -0.57 | |
| | AVX2 | opt | 63674 | 105734 | 164897 | baseline (main) |
| | x86_64 asm | opt | 63924 | 105192 | 164131 | |
| | x86_64 asm (unroll) | opt | 62955 | 105111 | 163861 | unroll by 4 |
| | Δ (%) | opt | +0.39 | -0.51 | -0.46 | |
| | Δ (%) (unroll) | opt | -1.13 | -0.59 | -0.63 | unroll by 4 |
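
The Δ (%) rows can be reproduced directly from the formula Δ (%) = (asm − AVX2) / AVX2 × 100. The short sketch below recomputes a few entries from the cycle counts listed above; `delta_percent` is a hypothetical helper name used only for illustration.

```python
def delta_percent(asm: float, avx2: float) -> float:
    """Relative difference of the asm measurement against the AVX2 baseline, in percent."""
    return (asm - avx2) / avx2 * 100


# (asm, AVX2 baseline) pairs copied from the ML-DSA-44 column of the tables above
samples = {
    "mld_poly_caddq opt": (51, 38),
    "mld_poly_caddq opt (unroll)": (42, 38),
    "keypair opt": (60535, 60367),
    "verify opt (unroll)": (62955, 63674),
}

if __name__ == "__main__":
    for label, (asm, avx2) in samples.items():
        print(f"{label}: {delta_percent(asm, avx2):+.2f}%")
```

A positive value means the assembly version needs more cycles than the AVX2 baseline; a negative value means it needs fewer.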

willieyz force-pushed the eliminate-caddq-intrinsics branch from 00b155f to 3819863 on January 23, 2026 06:52

oqs-bot commented Jan 23, 2026

CBMC Results (ML-DSA-87)

Full Results (177 proofs)
Proof Status Current Previous Change
**TOTAL** 2789s 2703s +3.2%
sign_verify_internal 348s 334s +4%
polyvecl_pointwise_acc_montgomery_c 297s 281s +6%
mld_attempt_signature_generation 250s 240s +4%
polyvec_matrix_expand 181s 175s +3%
poly_pointwise_montgomery_c 176s 158s +11%
rej_uniform_native 148s 146s +1%
mld_invntt_layer 99s 95s +4%
mld_ct_memcmp 81s 74s +9%
polyvec_matrix_expand_serial 81s 80s +1%
polyveck_decompose 61s 60s +2%
mld_ntt_layer 56s 56s +0%
sign_signature_internal 55s 55s +0%
polymat_permute_bitrev_to_custom 49s 45s +9%
keccak_squeezeblocks_x4 43s 42s +2%
mld_compute_t0_t1_tr_from_sk_components 24s 25s -4%
fqmul 23s 19s +21%
rej_uniform 23s 22s +5%
poly_chknorm_c 22s 19s +16%
polyeta_unpack 19s 17s +12%
poly_uniform_eta_4x 16s 17s -6%
polyt0_unpack 15s 15s +0%
poly_uniform_4x 14s 17s -18%
rej_uniform_c 14s 14s +0%
keccakf1600x4_permute_native 13s 13s +0%
mld_ntt_butterfly_block 13s 13s +0%
poly_add 13s 12s +8%
polyvec_matrix_pointwise_montgomery 12s 12s +0%
polyveck_use_hint 12s 13s -8%
mld_polyvecl_permute_bitrev_to_custom_native 11s 14s -21%
polyveck_caddq 11s 8s +38%
polyveck_reduce 11s 9s +22%
polyvecl_ntt 11s 11s +0%
keccak_absorb_once_x4 10s 11s -9%
poly_decompose_c 10s 7s +43%
polyveck_pointwise_poly_montgomery 10s 6s +67%
mld_compute_pack_z 9s 7s +29%
polyveck_add 9s 9s +0%
polyveck_ntt 9s 8s +12%
sign_pk_from_sk 9s 8s +12%
keccakf1600_permute 8s 9s -11%
mld_check_pct 8s 9s -11%
polyveck_power2round 8s 10s -20%
polyz_unpack_c 8s 11s -27%
sign_keypair_internal 8s 5s +60%
keccak_absorb 7s 8s -12%
keccakf1600_permute_native 7s 8s -12%
poly_invntt_tomont_c 7s 7s +0%
polyveck_invntt_tomont 7s 8s -12%
polyveck_shiftl 7s 7s +0%
polyveck_sub 7s 7s +0%
polyvecl_unpack_z 7s 1s +600%
sign 7s 8s -12%
unpack_hints 7s 5s +40%
mld_h 6s 4s +50%
mld_sample_s1_s2_serial 6s 9s -33%
poly_caddq_c 6s 7s -14%
poly_challenge 6s 4s +50%
shake256x4_absorb_once 6s 2s +200%
sign_signature_pre_hash_internal 6s 2s +200%
sign_verify_pre_hash_shake256 6s 4s +50%
keccak_squeeze 5s 6s -17%
keccakf1600_extract_bytes (big endian) 5s 1s +400%
mld_sample_s1_s2 5s 7s -29%
poly_uniform_gamma1 5s 4s +25%
polyveck_chknorm 5s 5s +0%
polyveck_make_hint 5s 8s -38%
polyvecl_uniform_gamma1 5s 3s +67%
reduce32 5s 4s +25%
rej_eta_c 5s 5s +0%
sign_signature_pre_hash_shake256 5s 3s +67%
sign_verify 5s 5s +0%
unpack_sk 5s 5s +0%
decompose 4s 4s +0%
keccak_finalize 4s 4s +0%
keccakf1600x4_permute 4s 3s +33%
mld_value_barrier_i64 4s 2s +100%
montgomery_reduce 4s 3s +33%
ntt_native_aarch64 4s 6s -33%
pack_sig_c_h 4s 3s +33%
pack_sig_z 4s 3s +33%
poly_caddq 4s 3s +33%
poly_caddq_native 4s 4s +0%
poly_chknorm 4s 1s +300%
poly_chknorm_native 4s 5s -20%
poly_chknorm_native_aarch64 4s 2s +100%
poly_invntt_tomont 4s 3s +33%
poly_make_hint 4s 2s +100%
poly_ntt 4s 4s +0%
poly_ntt_native 4s 4s +0%
poly_pointwise_montgomery_native 4s 3s +33%
poly_power2round 4s 5s -20%
poly_shiftl 4s 3s +33%
poly_sub 4s 3s +33%
poly_uniform_eta 4s 5s -20%
poly_uniform_gamma1_4x 4s 4s +0%
polyt1_unpack 4s 2s +100%
polyveck_pack_t0 4s 4s +0%
polyveck_pack_w1 4s 4s +0%
polyveck_unpack_t0 4s 4s +0%
polyvecl_chknorm 4s 6s -33%
polyvecl_pointwise_acc_montgomery_native 4s 2s +100%
polyvecl_unpack_eta 4s 4s +0%
rej_eta 4s 4s +0%
shake128_squeeze 4s 2s +100%
shake256_init 4s 1s +300%
sign_verify_extmu 4s 3s +33%
sign_verify_pre_hash_internal 4s 4s +0%
sys_check_capability 4s 2s +100%
caddq 3s 1s +200%
intt_native_x86_64 3s 3s +0%
keccak_init 3s 4s -25%
keccakf1600x4_extract_bytes 3s 2s +50%
make_hint 3s 3s +0%
mld_ct_abs_i32 3s 2s +50%
mld_ct_get_optblocker_u32 3s 4s -25%
mld_prepare_domain_separation_prefix 3s 7s -57%
pack_sk 3s 3s +0%
poly_caddq_native_aarch64 3s 2s +50%
poly_decompose 3s 4s -25%
poly_invntt_tomont_native 3s 3s +0%
poly_ntt_c 3s 3s +0%
poly_pointwise_montgomery 3s 3s +0%
poly_uniform 3s 4s -25%
poly_use_hint 3s 2s +50%
polyeta_pack 3s 3s +0%
polyt0_pack 3s 5s -40%
polyt1_pack 3s 3s +0%
polyveck_pack_eta 3s 3s +0%
polyveck_unpack_eta 3s 3s +0%
polyvecl_permute_bitrev_to_custom 3s 4s -25%
polyvecl_pointwise_acc_montgomery 3s 2s +50%
polyvecl_uniform_gamma1_serial 3s 5s -40%
polyz_unpack 3s 3s +0%
polyz_unpack_native 3s 4s -25%
shake128_release 3s 4s -25%
shake128x4_absorb_once 3s 2s +50%
shake128x4_squeezeblocks 3s 2s +50%
shake256_absorb 3s 2s +50%
shake256_finalize 3s 3s +0%
shake256_squeeze 3s 3s +0%
sign_keypair 3s 2s +50%
sign_open 3s 7s -57%
sign_signature_extmu 3s 5s -40%
unpack_sig 3s 3s +0%
use_hint 3s 3s +0%
fqscale 2s 4s -50%
keccakf1600_xor_bytes (big endian) 2s 2s +0%
keccakf1600x4_xor_bytes 2s 3s -33%
mld_ct_cmask_neg_i32 2s 3s -33%
mld_ct_cmask_nonzero_u32 2s 3s -33%
mld_ct_cmask_nonzero_u8 2s 3s -33%
mld_ct_get_optblocker_u8 2s 2s +0%
mld_ct_sel_int32 2s 2s +0%
mld_keccakf1600_extract_bytes 2s 4s -50%
mld_value_barrier_u32 2s 2s +0%
ntt_native_x86_64 2s 4s -50%
pack_pk 2s 5s -60%
poly_decompose_native 2s 4s -50%
poly_reduce 2s 4s -50%
poly_use_hint_c 2s 3s -33%
poly_use_hint_native 2s 4s -50%
polyvecl_pack_eta 2s 4s -50%
polyw1_pack 2s 4s -50%
polyz_pack 2s 2s +0%
power2round 2s 3s -33%
rej_eta_native 2s 4s -50%
shake128_absorb 2s 3s -33%
shake128_finalize 2s 2s +0%
shake128_init 2s 3s -33%
shake256 2s 3s -33%
shake256_release 2s 5s -60%
shake256x4_squeezeblocks 2s 3s -33%
sign_signature 2s 3s -33%
unpack_pk 2s 4s -50%
keccakf1600_xor_bytes 1s 1s +0%
mld_ct_get_optblocker_i64 1s 3s -67%
mld_value_barrier_u8 1s 4s -75%
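
The Change column appears consistent with (Current − Previous) / Previous, rounded. Because per-proof times are reported in whole seconds, a shift of a few seconds on a short proof produces a large percentage. The sketch below, using rows copied from the ML-DSA-87 table above, prints the absolute difference next to the relative one; `change_percent` and `describe` are hypothetical helpers, not part of the CI tooling.

```python
def change_percent(current_s: int, previous_s: int) -> float:
    """Relative change of a proof's runtime against the previous run, in percent."""
    return (current_s - previous_s) / previous_s * 100


def describe(name: str, current_s: int, previous_s: int) -> str:
    """Format one row with both the absolute and the relative change."""
    diff = current_s - previous_s
    return f"{name}: {diff:+d}s ({change_percent(current_s, previous_s):+.1f}%)"


# (proof, current seconds, previous seconds) copied from the table above
rows = [
    ("TOTAL", 2789, 2703),
    ("fqmul", 23, 19),
    ("polyvecl_unpack_z", 7, 1),
    ("poly_caddq_native", 4, 4),
]

if __name__ == "__main__":
    for row in rows:
        print(describe(*row))
```

Read this way, the total is the more stable indicator, while single-digit-second proofs swing widely in percentage terms.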

oqs-bot commented Jan 23, 2026

CBMC Results (ML-DSA-44)

Full Results (177 proofs)
Proof Status Current Previous Change
**TOTAL** 2162s 2066s +4.6%
polyvecl_pointwise_acc_montgomery_c 260s 233s +12%
mld_attempt_signature_generation 241s 234s +3%
poly_pointwise_montgomery_c 182s 162s +12%
rej_uniform_native 156s 145s +8%
sign_verify_internal 133s 127s +5%
mld_invntt_layer 94s 86s +9%
mld_ct_memcmp 85s 82s +4%
mld_ntt_layer 59s 57s +4%
keccak_squeezeblocks_x4 44s 42s +5%
sign_signature_internal 34s 33s +3%
polyvec_matrix_expand 30s 28s +7%
poly_chknorm_c 24s 23s +4%
rej_uniform 24s 22s +9%
fqmul 23s 21s +10%
poly_uniform_eta_4x 19s 18s +6%
polyeta_unpack 19s 18s +6%
polyt0_unpack 17s 17s +0%
mld_compute_t0_t1_tr_from_sk_components 16s 13s +23%
mld_ntt_butterfly_block 15s 14s +7%
polymat_permute_bitrev_to_custom 15s 16s -6%
rej_uniform_c 15s 18s -17%
keccakf1600x4_permute_native 14s 12s +17%
poly_uniform_4x 14s 14s +0%
polyveck_power2round 14s 14s +0%
poly_add 13s 13s +0%
polyvec_matrix_expand_serial 13s 11s +18%
polyvec_matrix_pointwise_montgomery 13s 15s -13%
polyz_unpack_c 13s 11s +18%
keccak_absorb_once_x4 11s 10s +10%
polyveck_use_hint 9s 5s +80%
keccakf1600_permute 8s 7s +14%
mld_compute_pack_z 8s 6s +33%
polyveck_add 8s 5s +60%
sign 8s 7s +14%
keccakf1600_permute_native 7s 7s +0%
mld_polyvecl_permute_bitrev_to_custom_native 7s 10s -30%
polyveck_ntt 7s 4s +75%
sign_pk_from_sk 7s 7s +0%
caddq 6s 5s +20%
keccak_absorb 6s 9s -33%
mld_check_pct 6s 8s -25%
poly_invntt_tomont_c 6s 7s -14%
poly_power2round 6s 6s +0%
poly_use_hint_c 6s 5s +20%
polyvecl_unpack_z 6s 3s +100%
unpack_hints 6s 6s +0%
mld_h 5s 6s -17%
mld_sample_s1_s2 5s 3s +67%
mld_sample_s1_s2_serial 5s 5s +0%
pack_sk 5s 2s +150%
poly_caddq_c 5s 6s -17%
poly_sub 5s 4s +25%
polyeta_pack 5s 4s +25%
polyveck_caddq 5s 7s -29%
polyveck_chknorm 5s 7s -29%
polyveck_decompose 5s 7s -29%
polyveck_invntt_tomont 5s 3s +67%
polyveck_pointwise_poly_montgomery 5s 4s +25%
polyveck_reduce 5s 5s +0%
polyveck_sub 5s 4s +25%
polyveck_unpack_t0 5s 5s +0%
polyvecl_chknorm 5s 3s +67%
polyvecl_permute_bitrev_to_custom 5s 3s +67%
rej_eta_native 5s 4s +25%
sign_signature_extmu 5s 3s +67%
sign_verify_extmu 5s 5s +0%
unpack_sig 5s 2s +150%
unpack_sk 5s 5s +0%
decompose 4s 4s +0%
keccak_init 4s 2s +100%
mld_ct_get_optblocker_i64 4s 3s +33%
mld_prepare_domain_separation_prefix 4s 6s -33%
mld_value_barrier_i64 4s 2s +100%
ntt_native_aarch64 4s 3s +33%
poly_invntt_tomont 4s 4s +0%
poly_invntt_tomont_native 4s 4s +0%
poly_ntt 4s 4s +0%
poly_ntt_native 4s 3s +33%
poly_pointwise_montgomery 4s 4s +0%
poly_pointwise_montgomery_native 4s 4s +0%
poly_uniform_eta 4s 5s -20%
poly_uniform_gamma1_4x 4s 4s +0%
poly_use_hint 4s 2s +100%
poly_use_hint_native 4s 2s +100%
polyt1_pack 4s 2s +100%
polyt1_unpack 4s 5s -20%
polyveck_make_hint 4s 2s +100%
polyvecl_pack_eta 4s 6s -33%
polyvecl_unpack_eta 4s 3s +33%
shake128_absorb 4s 2s +100%
shake256_init 4s 2s +100%
shake256_release 4s 2s +100%
shake256x4_absorb_once 4s 2s +100%
sign_keypair_internal 4s 5s -20%
sign_verify_pre_hash_shake256 4s 3s +33%
sys_check_capability 4s 3s +33%
unpack_pk 4s 2s +100%
intt_native_x86_64 3s 5s -40%
keccak_finalize 3s 2s +50%
keccak_squeeze 3s 4s -25%
keccakf1600_extract_bytes (big endian) 3s 4s -25%
keccakf1600x4_permute 3s 5s -40%
keccakf1600x4_xor_bytes 3s 3s +0%
mld_ct_abs_i32 3s 3s +0%
mld_ct_cmask_nonzero_u8 3s 3s +0%
mld_ct_get_optblocker_u32 3s 2s +50%
mld_keccakf1600_extract_bytes 3s 3s +0%
montgomery_reduce 3s 3s +0%
ntt_native_x86_64 3s 2s +50%
pack_pk 3s 7s -57%
pack_sig_c_h 3s 2s +50%
poly_caddq 3s 3s +0%
poly_caddq_native 3s 4s -25%
poly_caddq_native_aarch64 3s 3s +0%
poly_challenge 3s 5s -40%
poly_chknorm 3s 2s +50%
poly_chknorm_native 3s 3s +0%
poly_chknorm_native_aarch64 3s 3s +0%
poly_decompose_c 3s 3s +0%
poly_decompose_native 3s 3s +0%
poly_reduce 3s 4s -25%
poly_uniform 3s 5s -40%
poly_uniform_gamma1 3s 3s +0%
polyveck_pack_eta 3s 3s +0%
polyveck_pack_t0 3s 4s -25%
polyveck_pack_w1 3s 3s +0%
polyveck_shiftl 3s 6s -50%
polyveck_unpack_eta 3s 3s +0%
polyvecl_ntt 3s 3s +0%
polyvecl_pointwise_acc_montgomery 3s 3s +0%
polyvecl_pointwise_acc_montgomery_native 3s 7s -57%
polyvecl_uniform_gamma1_serial 3s 4s -25%
polyw1_pack 3s 3s +0%
polyz_pack 3s 3s +0%
power2round 3s 2s +50%
rej_eta 3s 2s +50%
rej_eta_c 3s 4s -25%
shake128_squeeze 3s 3s +0%
shake128x4_absorb_once 3s 2s +50%
shake256_finalize 3s 3s +0%
shake256_squeeze 3s 2s +50%
shake256x4_squeezeblocks 3s 1s +200%
sign_keypair 3s 3s +0%
sign_open 3s 5s -40%
sign_signature 3s 2s +50%
sign_signature_pre_hash_internal 3s 4s -25%
sign_signature_pre_hash_shake256 3s 2s +50%
sign_verify 3s 7s -57%
sign_verify_pre_hash_internal 3s 3s +0%
use_hint 3s 4s -25%
fqscale 2s 3s -33%
keccakf1600_xor_bytes 2s 4s -50%
keccakf1600_xor_bytes (big endian) 2s 2s +0%
keccakf1600x4_extract_bytes 2s 2s +0%
make_hint 2s 2s +0%
mld_ct_cmask_neg_i32 2s 3s -33%
mld_ct_cmask_nonzero_u32 2s 3s -33%
mld_ct_sel_int32 2s 2s +0%
mld_value_barrier_u32 2s 2s +0%
pack_sig_z 2s 3s -33%
poly_decompose 2s 3s -33%
poly_make_hint 2s 2s +0%
poly_ntt_c 2s 3s -33%
poly_shiftl 2s 4s -50%
polyt0_pack 2s 4s -50%
polyvecl_uniform_gamma1 2s 3s -33%
polyz_unpack 2s 6s -67%
polyz_unpack_native 2s 3s -33%
reduce32 2s 3s -33%
shake128_finalize 2s 2s +0%
shake128x4_squeezeblocks 2s 2s +0%
shake256 2s 2s +0%
shake256_absorb 2s 2s +0%
mld_ct_get_optblocker_u8 1s 2s -50%
mld_value_barrier_u8 1s 5s -80%
shake128_init 1s 2s -50%
shake128_release 1s 4s -75%

oqs-bot commented Jan 23, 2026

CBMC Results (ML-DSA-65)

Full Results (177 proofs)
Proof Status Current Previous Change
**TOTAL** 2571s 2488s +3.3%
sign_verify_internal 354s 341s +4%
mld_attempt_signature_generation 283s 278s +2%
polyvecl_pointwise_acc_montgomery_c 194s 190s +2%
poly_pointwise_montgomery_c 170s 160s +6%
rej_uniform_native 152s 145s +5%
polyvec_matrix_expand 130s 128s +2%
mld_invntt_layer 102s 96s +6%
mld_ct_memcmp 80s 77s +4%
polyvec_matrix_expand_serial 70s 70s +0%
mld_ntt_layer 57s 54s +6%
keccak_squeezeblocks_x4 44s 42s +5%
sign_signature_internal 41s 37s +11%
polymat_permute_bitrev_to_custom 32s 29s +10%
mld_compute_t0_t1_tr_from_sk_components 25s 27s -7%
rej_uniform 22s 23s -4%
poly_chknorm_c 21s 22s -5%
fqmul 19s 18s +6%
poly_uniform_eta_4x 19s 16s +19%
poly_uniform_4x 16s 15s +7%
polyt0_unpack 14s 14s +0%
polyveck_ntt 14s 11s +27%
polyvecl_chknorm 14s 13s +8%
rej_uniform_c 14s 17s -18%
keccakf1600x4_permute_native 13s 13s +0%
polyveck_decompose 13s 14s -7%
mld_ntt_butterfly_block 12s 13s -8%
poly_add 12s 11s +9%
polyvec_matrix_pointwise_montgomery 12s 11s +9%
keccak_absorb_once_x4 11s 9s +22%
polyveck_sub 11s 10s +10%
polyvecl_ntt 11s 8s +38%
keccakf1600_permute_native 10s 9s +11%
mld_compute_pack_z 10s 8s +25%
polyveck_power2round 10s 11s -9%
polyveck_add 9s 7s +29%
polyveck_invntt_tomont 9s 10s -10%
sign_pk_from_sk 9s 7s +29%
keccakf1600_permute 8s 10s -20%
mld_polyvecl_permute_bitrev_to_custom_native 8s 8s +0%
poly_invntt_tomont_c 8s 10s -20%
polyveck_shiftl 8s 8s +0%
mld_check_pct 7s 8s -12%
mld_sample_s1_s2 7s 6s +17%
poly_decompose_c 7s 7s +0%
poly_use_hint_c 7s 3s +133%
polyt0_pack 7s 5s +40%
polyveck_caddq 7s 7s +0%
polyveck_pointwise_poly_montgomery 7s 6s +17%
polyveck_reduce 7s 6s +17%
polyveck_use_hint 7s 8s -12%
rej_eta_native 7s 4s +75%
keccak_absorb 6s 7s -14%
mld_h 6s 5s +20%
mld_prepare_domain_separation_prefix 6s 7s -14%
poly_ntt_c 6s 3s +100%
poly_power2round 6s 4s +50%
poly_uniform_eta 6s 6s +0%
poly_use_hint_native 6s 3s +100%
polyveck_make_hint 6s 6s +0%
sign 6s 8s -25%
sign_keypair 6s 2s +200%
poly_caddq_native_aarch64 5s 4s +25%
poly_challenge 5s 5s +0%
poly_ntt_native 5s 2s +150%
polyvecl_pointwise_acc_montgomery 5s 4s +25%
polyvecl_pointwise_acc_montgomery_native 5s 3s +67%
polyvecl_uniform_gamma1_serial 5s 5s +0%
sign_keypair_internal 5s 6s -17%
sign_open 5s 5s +0%
sign_signature 5s 6s -17%
sign_signature_pre_hash_internal 5s 4s +25%
sign_verify_pre_hash_internal 5s 6s -17%
sign_verify_pre_hash_shake256 5s 5s +0%
unpack_hints 5s 6s -17%
unpack_sk 5s 6s -17%
fqscale 4s 3s +33%
keccak_squeeze 4s 4s +0%
ntt_native_aarch64 4s 3s +33%
pack_sig_z 4s 3s +33%
pack_sk 4s 5s -20%
poly_caddq 4s 3s +33%
poly_caddq_c 4s 4s +0%
poly_chknorm 4s 3s +33%
poly_pointwise_montgomery 4s 4s +0%
poly_pointwise_montgomery_native 4s 4s +0%
poly_sub 4s 4s +0%
poly_uniform 4s 2s +100%
poly_uniform_gamma1_4x 4s 3s +33%
poly_use_hint 4s 3s +33%
polyeta_unpack 4s 6s -33%
polyt1_pack 4s 3s +33%
polyveck_chknorm 4s 5s -20%
polyveck_pack_w1 4s 4s +0%
polyveck_unpack_eta 4s 5s -20%
polyveck_unpack_t0 4s 3s +33%
polyvecl_unpack_eta 4s 4s +0%
polyw1_pack 4s 4s +0%
polyz_unpack 4s 2s +100%
polyz_unpack_native 4s 2s +100%
reduce32 4s 2s +100%
rej_eta_c 4s 4s +0%
sign_verify 4s 4s +0%
sign_verify_extmu 4s 3s +33%
sys_check_capability 4s 1s +300%
decompose 3s 3s +0%
intt_native_x86_64 3s 3s +0%
keccak_init 3s 1s +200%
keccakf1600_extract_bytes (big endian) 3s 3s +0%
keccakf1600_xor_bytes (big endian) 3s 2s +50%
keccakf1600x4_extract_bytes 3s 3s +0%
make_hint 3s 2s +50%
mld_ct_cmask_nonzero_u32 3s 3s +0%
mld_ct_get_optblocker_i64 3s 2s +50%
mld_sample_s1_s2_serial 3s 6s -50%
mld_value_barrier_i64 3s 3s +0%
mld_value_barrier_u8 3s 3s +0%
montgomery_reduce 3s 3s +0%
ntt_native_x86_64 3s 4s -25%
pack_pk 3s 3s +0%
poly_caddq_native 3s 3s +0%
poly_chknorm_native 3s 3s +0%
poly_decompose_native 3s 5s -40%
poly_make_hint 3s 3s +0%
poly_ntt 3s 4s -25%
poly_reduce 3s 2s +50%
poly_shiftl 3s 3s +0%
polyveck_pack_eta 3s 4s -25%
polyveck_pack_t0 3s 4s -25%
polyvecl_pack_eta 3s 2s +50%
polyvecl_permute_bitrev_to_custom 3s 3s +0%
polyvecl_uniform_gamma1 3s 2s +50%
polyvecl_unpack_z 3s 3s +0%
polyz_pack 3s 3s +0%
polyz_unpack_c 3s 2s +50%
shake128_absorb 3s 2s +50%
shake128_init 3s 1s +200%
shake128_release 3s 1s +200%
shake128x4_absorb_once 3s 3s +0%
shake256 3s 2s +50%
shake256_absorb 3s 6s -50%
shake256x4_absorb_once 3s 4s -25%
shake256x4_squeezeblocks 3s 2s +50%
sign_signature_extmu 3s 4s -25%
sign_signature_pre_hash_shake256 3s 5s -40%
unpack_pk 3s 3s +0%
unpack_sig 3s 2s +50%
use_hint 3s 4s -25%
caddq 2s 3s -33%
keccakf1600_xor_bytes 2s 1s +100%
keccakf1600x4_permute 2s 4s -50%
keccakf1600x4_xor_bytes 2s 2s +0%
mld_ct_cmask_nonzero_u8 2s 3s -33%
mld_ct_get_optblocker_u32 2s 1s +100%
mld_ct_get_optblocker_u8 2s 2s +0%
mld_ct_sel_int32 2s 5s -60%
mld_keccakf1600_extract_bytes 2s 2s +0%
pack_sig_c_h 2s 2s +0%
poly_chknorm_native_aarch64 2s 4s -50%
poly_decompose 2s 3s -33%
poly_invntt_tomont 2s 2s +0%
poly_uniform_gamma1 2s 4s -50%
polyeta_pack 2s 2s +0%
polyt1_unpack 2s 4s -50%
power2round 2s 1s +100%
rej_eta 2s 2s +0%
shake128_finalize 2s 3s -33%
shake128_squeeze 2s 4s -50%
shake128x4_squeezeblocks 2s 2s +0%
shake256_finalize 2s 2s +0%
shake256_init 2s 3s -33%
shake256_release 2s 3s -33%
shake256_squeeze 2s 4s -50%
keccak_finalize 1s 3s -67%
mld_ct_abs_i32 1s 2s -50%
mld_ct_cmask_neg_i32 1s 1s +0%
mld_value_barrier_u32 1s 3s -67%
poly_invntt_tomont_native 1s 4s -75%

github-actions bot left a comment

Mac Mini (M1, 2020) benchmarks (opt)

Details
Benchmark suite Current: 14097e6 Previous: daf271b Ratio
ML-DSA-44 keypair 45681 cycles 45685 cycles 1.00
ML-DSA-44 sign 131153 cycles 131164 cycles 1.00
ML-DSA-44 verify 47527 cycles 47530 cycles 1.00
ML-DSA-65 keypair 80457 cycles 80479 cycles 1.00
ML-DSA-65 sign 215715 cycles 215740 cycles 1.00
ML-DSA-65 verify 79737 cycles 79735 cycles 1.00
ML-DSA-87 keypair 131177 cycles 131175 cycles 1.00
ML-DSA-87 sign 277048 cycles 277004 cycles 1.00
ML-DSA-87 verify 130004 cycles 129971 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

github-actions bot left a comment

Mac Mini (M1, 2020) benchmarks (no-opt)

Details
Benchmark suite Current: 14097e6 Previous: daf271b Ratio
ML-DSA-44 keypair 111983 cycles 111979 cycles 1.00
ML-DSA-44 sign 403592 cycles 403622 cycles 1.00
ML-DSA-44 verify 119886 cycles 119876 cycles 1.00
ML-DSA-65 keypair 192137 cycles 192166 cycles 1.00
ML-DSA-65 sign 657120 cycles 657078 cycles 1.00
ML-DSA-65 verify 193900 cycles 193891 cycles 1.00
ML-DSA-87 keypair 317930 cycles 318010 cycles 1.00
ML-DSA-87 sign 836905 cycles 836903 cycles 1.00
ML-DSA-87 verify 322922 cycles 322994 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

oqs-bot left a comment

Intel Xeon 4th gen (c7i)

Details
Benchmark suite Current: e663c5f Previous: db65535 Ratio
ML-DSA-44 keypair 34477 cycles 34381 cycles 1.00
ML-DSA-44 sign 120394 cycles 120118 cycles 1.00
ML-DSA-44 verify 38039 cycles 38106 cycles 1.00
ML-DSA-65 keypair 60486 cycles 61325 cycles 0.99
ML-DSA-65 sign 201395 cycles 201746 cycles 1.00
ML-DSA-65 verify 62527 cycles 62841 cycles 1.00
ML-DSA-87 keypair 94845 cycles 92915 cycles 1.02
ML-DSA-87 sign 239636 cycles 231813 cycles 1.03
ML-DSA-87 verify 95988 cycles 94836 cycles 1.01

This comment was automatically generated by workflow using github-action-benchmark.

github-actions bot left a comment

Arm Cortex-A72 (Raspberry Pi 4) benchmarks (opt)

Details
Benchmark suite Current: 14097e6 Previous: daf271b Ratio
ML-DSA-44 keypair 229063 cycles 232745 cycles 0.98
ML-DSA-44 sign 628858 cycles 629812 cycles 1.00
ML-DSA-44 verify 229339 cycles 229277 cycles 1.00
ML-DSA-65 keypair 378941 cycles 422090 cycles 0.90
ML-DSA-65 sign 1007370 cycles 1067756 cycles 0.94
ML-DSA-65 verify 376246 cycles 393848 cycles 0.96
ML-DSA-87 keypair 690237 cycles 673725 cycles 1.02
ML-DSA-87 sign 1396068 cycles 1405386 cycles 0.99
ML-DSA-87 verify 663094 cycles 657567 cycles 1.01

This comment was automatically generated by workflow using github-action-benchmark.

oqs-bot left a comment

Intel Xeon 4th gen (c7i) (no-opt)

Details
Benchmark suite Current: e663c5f Previous: db65535 Ratio
ML-DSA-44 keypair 93636 cycles 94097 cycles 1.00
ML-DSA-44 sign 332371 cycles 333264 cycles 1.00
ML-DSA-44 verify 99653 cycles 99803 cycles 1.00
ML-DSA-65 keypair 159756 cycles 160115 cycles 1.00
ML-DSA-65 sign 544298 cycles 544184 cycles 1.00
ML-DSA-65 verify 160693 cycles 160692 cycles 1.00
ML-DSA-87 keypair 266718 cycles 267433 cycles 1.00
ML-DSA-87 sign 705724 cycles 707379 cycles 1.00
ML-DSA-87 verify 269841 cycles 270279 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

oqs-bot left a comment

AMD EPYC 3rd gen (c6a)

Details
Benchmark suite Current: e663c5f Previous: db65535 Ratio
ML-DSA-44 keypair 69099 cycles 68974 cycles 1.00
ML-DSA-44 sign 187200 cycles 187318 cycles 1.00
ML-DSA-44 verify 68987 cycles 69050 cycles 1.00
ML-DSA-65 keypair 119190 cycles 119428 cycles 1.00
ML-DSA-65 sign 299797 cycles 300617 cycles 1.00
ML-DSA-65 verify 115518 cycles 115643 cycles 1.00
ML-DSA-87 keypair 203389 cycles 203571 cycles 1.00
ML-DSA-87 sign 394191 cycles 394649 cycles 1.00
ML-DSA-87 verify 195428 cycles 195659 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

oqs-bot left a comment

Intel Xeon 3rd gen (c6i)

Details
Benchmark suite Current: e663c5f Previous: db65535 Ratio
ML-DSA-44 keypair 56776 cycles 56817 cycles 1.00
ML-DSA-44 sign 180517 cycles 182410 cycles 0.99
ML-DSA-44 verify 60909 cycles 61615 cycles 0.99
ML-DSA-65 keypair 98542 cycles 98729 cycles 1.00
ML-DSA-65 sign 298159 cycles 298290 cycles 1.00
ML-DSA-65 verify 100252 cycles 100286 cycles 1.00
ML-DSA-87 keypair 153194 cycles 152586 cycles 1.00
ML-DSA-87 sign 355896 cycles 355720 cycles 1.00
ML-DSA-87 verify 153887 cycles 153499 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

oqs-bot left a comment

AMD EPYC 4th gen (c7a)

Details
Benchmark suite Current: e663c5f Previous: db65535 Ratio
ML-DSA-44 keypair 41111 cycles 42279 cycles 0.97
ML-DSA-44 sign 132463 cycles 132300 cycles 1.00
ML-DSA-44 verify 43376 cycles 43971 cycles 0.99
ML-DSA-65 keypair 71836 cycles 76769 cycles 0.94
ML-DSA-65 sign 214168 cycles 217452 cycles 0.98
ML-DSA-65 verify 72274 cycles 73895 cycles 0.98
ML-DSA-87 keypair 109272 cycles 108025 cycles 1.01
ML-DSA-87 sign 250093 cycles 252354 cycles 0.99
ML-DSA-87 verify 110204 cycles 109188 cycles 1.01

This comment was automatically generated by workflow using github-action-benchmark.

oqs-bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'AMD EPYC 4th gen (c7a)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite Current: 9146fd9 Previous: db65535 Ratio
ML-DSA-44 verify 46019 cycles 43971 cycles 1.05

This comment was automatically generated by workflow using github-action-benchmark.
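
The alert above compares the ratio of current to previous cycle counts against a threshold of 1.03. A minimal sketch of that check follows, assuming the ratio is current / previous and that the comparison is strict; both are assumptions about the workflow's behavior, not taken from its source, and the function names are hypothetical.

```python
ALERT_THRESHOLD = 1.03  # threshold quoted in the alert text


def ratio(current_cycles: int, previous_cycles: int) -> float:
    """Current cycle count relative to the previous one (lower is better)."""
    return current_cycles / previous_cycles


def exceeds_threshold(current_cycles: int, previous_cycles: int,
                      threshold: float = ALERT_THRESHOLD) -> bool:
    """True when the slowdown ratio is above the alert threshold (assumed strict)."""
    return ratio(current_cycles, previous_cycles) > threshold


if __name__ == "__main__":
    # ML-DSA-44 verify on AMD EPYC 4th gen (c7a), previous commit db65535
    for commit, current in (("9146fd9", 46019), ("e663c5f", 43376)):
        r = ratio(current, 43971)
        print(f"{commit}: ratio {r:.2f}, alert={exceeds_threshold(current, 43971)}")
```

The same benchmark measured at commit e663c5f against the same baseline gives a ratio of 0.99, so the flagged 1.05 may reflect run-to-run variation rather than a lasting regression.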

oqs-bot left a comment

AMD EPYC 3rd gen (c6a) (no-opt)

Details
Benchmark suite Current: e663c5f Previous: db65535 Ratio
ML-DSA-44 keypair 135196 cycles 134983 cycles 1.00
ML-DSA-44 sign 524574 cycles 524482 cycles 1.00
ML-DSA-44 verify 147495 cycles 147385 cycles 1.00
ML-DSA-65 keypair 228079 cycles 228309 cycles 1.00
ML-DSA-65 sign 865741 cycles 864340 cycles 1.00
ML-DSA-65 verify 236319 cycles 236413 cycles 1.00
ML-DSA-87 keypair 370971 cycles 370688 cycles 1.00
ML-DSA-87 sign 1080314 cycles 1079564 cycles 1.00
ML-DSA-87 verify 382962 cycles 383220 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

oqs-bot left a comment

Intel Xeon 3rd gen (c6i) (no-opt)

Details
Benchmark suite Current: e663c5f Previous: db65535 Ratio
ML-DSA-44 keypair 157458 cycles 157614 cycles 1.00
ML-DSA-44 sign 549359 cycles 551534 cycles 1.00
ML-DSA-44 verify 169292 cycles 169123 cycles 1.00
ML-DSA-65 keypair 268056 cycles 267907 cycles 1.00
ML-DSA-65 sign 904109 cycles 904333 cycles 1.00
ML-DSA-65 verify 275154 cycles 275011 cycles 1.00
ML-DSA-87 keypair 448822 cycles 448619 cycles 1.00
ML-DSA-87 sign 1157710 cycles 1157905 cycles 1.00
ML-DSA-87 verify 458420 cycles 458683 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

oqs-bot left a comment

Graviton4

Details
Benchmark suite Current: e663c5f Previous: db65535 Ratio
ML-DSA-44 keypair 68120 cycles 68090 cycles 1.00
ML-DSA-44 sign 202529 cycles 202380 cycles 1.00
ML-DSA-44 verify 70991 cycles 70623 cycles 1.01
ML-DSA-65 keypair 121071 cycles 121010 cycles 1.00
ML-DSA-65 sign 331858 cycles 332267 cycles 1.00
ML-DSA-65 verify 118015 cycles 117974 cycles 1.00
ML-DSA-87 keypair 198147 cycles 198259 cycles 1.00
ML-DSA-87 sign 427693 cycles 428218 cycles 1.00
ML-DSA-87 verify 194725 cycles 194635 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

oqs-bot left a comment

Graviton3

Details
Benchmark suite Current: e663c5f Previous: db65535 Ratio
ML-DSA-44 keypair 72367 cycles 72253 cycles 1.00
ML-DSA-44 sign 212483 cycles 212376 cycles 1.00
ML-DSA-44 verify 75753 cycles 75747 cycles 1.00
ML-DSA-65 keypair 127620 cycles 127630 cycles 1.00
ML-DSA-65 sign 351072 cycles 350882 cycles 1.00
ML-DSA-65 verify 125639 cycles 125712 cycles 1.00
ML-DSA-87 keypair 205890 cycles 208495 cycles 0.99
ML-DSA-87 sign 444711 cycles 450030 cycles 0.99
ML-DSA-87 verify 205611 cycles 205745 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

oqs-bot left a comment

AMD EPYC 4th gen (c7a) (no-opt)

Details
Benchmark suite Current: e663c5f Previous: db65535 Ratio
ML-DSA-44 keypair 120296 cycles 120340 cycles 1.00
ML-DSA-44 sign 447350 cycles 447581 cycles 1.00
ML-DSA-44 verify 130039 cycles 130373 cycles 1.00
ML-DSA-65 keypair 205264 cycles 204354 cycles 1.00
ML-DSA-65 sign 728856 cycles 728319 cycles 1.00
ML-DSA-65 verify 211012 cycles 209199 cycles 1.01
ML-DSA-87 keypair 338105 cycles 338993 cycles 1.00
ML-DSA-87 sign 924981 cycles 921541 cycles 1.00
ML-DSA-87 verify 347678 cycles 348601 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

oqs-bot left a comment

Graviton4 (no-opt)

Details
Benchmark suite Current: e663c5f Previous: db65535 Ratio
ML-DSA-44 keypair 128366 cycles 128240 cycles 1.00
ML-DSA-44 sign 447669 cycles 447597 cycles 1.00
ML-DSA-44 verify 138229 cycles 144662 cycles 0.96
ML-DSA-65 keypair 220626 cycles 220500 cycles 1.00
ML-DSA-65 sign 727046 cycles 727093 cycles 1.00
ML-DSA-65 verify 222599 cycles 223077 cycles 1.00
ML-DSA-87 keypair 364591 cycles 365045 cycles 1.00
ML-DSA-87 sign 925963 cycles 925847 cycles 1.00
ML-DSA-87 verify 372876 cycles 372789 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

oqs-bot left a comment

Graviton3 (no-opt)

Details
Benchmark suite Current: e663c5f Previous: db65535 Ratio
ML-DSA-44 keypair 138665 cycles 138463 cycles 1.00
ML-DSA-44 sign 483863 cycles 483929 cycles 1.00
ML-DSA-44 verify 148471 cycles 162291 cycles 0.91
ML-DSA-65 keypair 241346 cycles 241435 cycles 1.00
ML-DSA-65 sign 792690 cycles 792312 cycles 1.00
ML-DSA-65 verify 240750 cycles 241250 cycles 1.00
ML-DSA-87 keypair 395603 cycles 396566 cycles 1.00
ML-DSA-87 sign 1013151 cycles 1012538 cycles 1.00
ML-DSA-87 verify 402960 cycles 402623 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Copy link
Copy Markdown
Contributor

@oqs-bot oqs-bot left a comment

Graviton2

Details
Benchmark suite Current: e663c5f Previous: db65535 Ratio
ML-DSA-44 keypair 113469 cycles 113410 cycles 1.00
ML-DSA-44 sign 355767 cycles 355818 cycles 1.00
ML-DSA-44 verify 118208 cycles 118279 cycles 1.00
ML-DSA-65 keypair 197272 cycles 196486 cycles 1.00
ML-DSA-65 sign 590719 cycles 588672 cycles 1.00
ML-DSA-65 verify 195355 cycles 194830 cycles 1.00
ML-DSA-87 keypair 323052 cycles 323043 cycles 1.00
ML-DSA-87 sign 754236 cycles 753644 cycles 1.00
ML-DSA-87 verify 320544 cycles 320341 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

SpacemiT K1 8 (Banana Pi F3) benchmarks (no-opt)

Details
Benchmark suite Current: 14097e6 Previous: daf271b Ratio
ML-DSA-44 keypair 827476 cycles 828088 cycles 1.00
ML-DSA-44 sign 3238353 cycles 3233170 cycles 1.00
ML-DSA-44 verify 921919 cycles 920794 cycles 1.00
ML-DSA-65 keypair 1413613 cycles 1413452 cycles 1.00
ML-DSA-65 sign 5340696 cycles 5347688 cycles 1.00
ML-DSA-65 verify 1477470 cycles 1477937 cycles 1.00
ML-DSA-87 keypair 2311391 cycles 2312894 cycles 1.00
ML-DSA-87 sign 6659117 cycles 6665352 cycles 1.00
ML-DSA-87 verify 2409640 cycles 2411069 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Contributor

@oqs-bot oqs-bot left a comment

Graviton2 (no-opt)

Details
Benchmark suite Current: e663c5f Previous: db65535 Ratio
ML-DSA-44 keypair 213353 cycles 213406 cycles 1.00
ML-DSA-44 sign 760604 cycles 762744 cycles 1.00
ML-DSA-44 verify 241487 cycles 235007 cycles 1.03
ML-DSA-65 keypair 380880 cycles 380391 cycles 1.00
ML-DSA-65 sign 1252441 cycles 1253555 cycles 1.00
ML-DSA-65 verify 372539 cycles 371798 cycles 1.00
ML-DSA-87 keypair 606311 cycles 604988 cycles 1.00
ML-DSA-87 sign 1593094 cycles 1596422 cycles 1.00
ML-DSA-87 verify 618250 cycles 619153 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)

Details
Benchmark suite Current: 14097e6 Previous: daf271b Ratio
ML-DSA-44 keypair 311698 cycles 306606 cycles 1.02
ML-DSA-44 sign 1174058 cycles 1166146 cycles 1.01
ML-DSA-44 verify 333560 cycles 335430 cycles 0.99
ML-DSA-65 keypair 550737 cycles 562274 cycles 0.98
ML-DSA-65 sign 1894590 cycles 1916493 cycles 0.99
ML-DSA-65 verify 529438 cycles 533535 cycles 0.99
ML-DSA-87 keypair 872695 cycles 865006 cycles 1.01
ML-DSA-87 sign 2468410 cycles 2417913 cycles 1.02
ML-DSA-87 verify 900121 cycles 884966 cycles 1.02

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite Current: f9a6d30 Previous: 9258ea1 Ratio
ML-DSA-44 keypair 309195 cycles 299195 cycles 1.03

This comment was automatically generated by workflow using github-action-benchmark.
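
For readers checking the alert by hand, the following is a minimal sketch of the comparison, assuming the alert fires when current / previous exceeds the threshold (1.03 here). The function names `bench_ratio` and `exceeds_threshold` are hypothetical and not part of github-action-benchmark or this repository.

```c
#include <stdint.h>

/* Value shown in the "Ratio" column: current cycles / previous cycles. */
double bench_ratio(uint64_t current, uint64_t previous)
{
  return (double)current / (double)previous;
}

/* Assumed alert rule: flag when current / previous > threshold_percent / 100.
 * The comparison is done in integers so that rounding cannot change it. */
int exceeds_threshold(uint64_t current, uint64_t previous,
                      uint64_t threshold_percent)
{
  return current * 100u > previous * threshold_percent;
}
```

With the figures from the alert above, 309195 against 299195 cycles gives a ratio of about 1.033, which is above 1.03. By contrast, the Graviton2 (no-opt) ML-DSA-44 verify result of 241487 against 235007 cycles is about 1.028: it is also displayed as 1.03 after rounding, but under this assumption it stays below the threshold.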

@github-actions github-actions bot left a comment

Arm Cortex-A55 (Snapdragon 888) benchmarks (opt)

Details
Benchmark suite Current: 14097e6 Previous: daf271b Ratio
ML-DSA-44 keypair 277182 cycles 278160 cycles 1.00
ML-DSA-44 sign 816109 cycles 822535 cycles 0.99
ML-DSA-44 verify 280990 cycles 278070 cycles 1.01
ML-DSA-65 keypair 477648 cycles 476503 cycles 1.00
ML-DSA-65 sign 1398700 cycles 1347085 cycles 1.04
ML-DSA-65 verify 461181 cycles 456015 cycles 1.01
ML-DSA-87 keypair 825204 cycles 796551 cycles 1.04
ML-DSA-87 sign 1886968 cycles 1773335 cycles 1.06
ML-DSA-87 verify 803609 cycles 772360 cycles 1.04

This comment was automatically generated by workflow using github-action-benchmark.

@willieyz willieyz force-pushed the eliminate-caddq-intrinsics branch 2 times, most recently from 72bc3f8 to d186f5e Compare January 26, 2026 10:12
@willieyz willieyz force-pushed the eliminate-caddq-intrinsics branch from 7dc5f6f to 6761759 Compare February 11, 2026 03:12
@willieyz willieyz marked this pull request as ready for review February 11, 2026 09:40
@mkannwischer mkannwischer force-pushed the eliminate-caddq-intrinsics branch from 6761759 to 14097e6 Compare February 11, 2026 09:51

@github-actions github-actions bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Arm Cortex-A55 (Snapdragon 888) benchmarks (opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite Current: 14097e6 Previous: daf271b Ratio
ML-DSA-65 sign 1398700 cycles 1347085 cycles 1.04
ML-DSA-87 keypair 825204 cycles 796551 cycles 1.04
ML-DSA-87 sign 1886968 cycles 1773335 cycles 1.06
ML-DSA-87 verify 803609 cycles 772360 cycles 1.04

This comment was automatically generated by workflow using github-action-benchmark.

@jakemas jakemas self-requested a review February 13, 2026 03:49
@mkannwischer mkannwischer force-pushed the eliminate-caddq-intrinsics branch from 14097e6 to 16cbdfe Compare February 18, 2026 17:17
Contributor

@mkannwischer mkannwischer left a comment

Thanks @willieyz. I took the liberty of cleaning up the commit history, and this looks good to me now.

@jakemas, WDYT?

@mkannwischer
Contributor

@jakemas, could you please take a look at this PR so we can make some progress?

@willieyz, could you please rebase?

@willieyz willieyz force-pushed the eliminate-caddq-intrinsics branch from 16cbdfe to 8c17221 Compare March 24, 2026 07:55
@willieyz
Contributor Author

Hello @mkannwischer, @jakemas,

Thank you for helping with the review.
I have rebased this PR on top of main. Could you please take another look when you have time?
Thank you again for your time~

willieyz added 2 commits April 1, 2026 23:38
This commit adds mld_poly_caddq to the benchmark components to evaluate
the performance impact of replacing the caddq AVX2 intrinsics
with x86_64 assembly code.

Signed-off-by: willieyz <willie.zhao@chelpis.com>
This commit replaces the current caddq AVX2 intrinsics implementation with x86_64
assembly to enable formal verification using HOL-Light in a follow-up PR.

Signed-off-by: willieyz <willie.zhao@chelpis.com>
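
As background for the commits above, here is a scalar sketch of what caddq computes: it adds q to each negative coefficient and leaves non-negative ones unchanged, without branching. This is an illustration based on the standard ML-DSA parameters (q = 8380417, 256 coefficients per polynomial), not the code from this PR; the names `MLDSA_Q_SKETCH`, `caddq_ref` and `poly_caddq_ref` are hypothetical.

```c
#include <stdint.h>

#define MLDSA_Q_SKETCH 8380417 /* ML-DSA modulus q */
#define MLDSA_N_SKETCH 256     /* coefficients per polynomial */

/* Conditionally add q: returns a + q if a is negative, else a. */
int32_t caddq_ref(int32_t a)
{
  /* Build an all-ones mask from the sign bit. The unsigned shift avoids
   * relying on implementation-defined right shifts of negative values. */
  int32_t mask = -(int32_t)((uint32_t)a >> 31);
  return a + (mask & MLDSA_Q_SKETCH);
}

/* Apply caddq_ref to every coefficient in place. */
void poly_caddq_ref(int32_t coeffs[MLDSA_N_SKETCH])
{
  for (unsigned i = 0; i < MLDSA_N_SKETCH; i++)
  {
    coeffs[i] = caddq_ref(coeffs[i]);
  }
}
```

A vectorized version, whether written with intrinsics or in assembly, would apply the same sign-mask-and-add step to eight 32-bit coefficients per 256-bit register.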
@mkannwischer mkannwischer force-pushed the eliminate-caddq-intrinsics branch 2 times, most recently from e663c5f to 9146fd9 Compare April 1, 2026 15:49
Contributor

@oqs-bot oqs-bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Intel Xeon 4th gen (c7i)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite Current: 9146fd9 Previous: db65535 Ratio
ML-DSA-87 sign 241094 cycles 231813 cycles 1.04

This comment was automatically generated by workflow using github-action-benchmark.

@mkannwischer mkannwischer force-pushed the eliminate-caddq-intrinsics branch from 9146fd9 to e663c5f Compare April 1, 2026 16:43
Contributor

@oqs-bot oqs-bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Intel Xeon 4th gen (c7i)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite Current: e663c5f Previous: db65535 Ratio
ML-DSA-87 sign 239636 cycles 231813 cycles 1.03

This comment was automatically generated by workflow using github-action-benchmark.

@mkannwischer mkannwischer self-assigned this Apr 1, 2026
Contributor

@mkannwischer mkannwischer left a comment

Thanks @willieyz - this looks good.
Yesterday I tried not loading from \offset(%rdi) twice in the caddq macro, doing a single load up front instead, but that resulted in much worse performance.
I think the current code is fine.

@hanno-becker, could you take another look, please?

@mkannwischer mkannwischer changed the title Eliminate caddq intrinsics x86_64: Eliminate caddq intrinsics Apr 2, 2026
@jakemas
Contributor

jakemas commented Apr 2, 2026

I also played around with some implementations, but the performance I was able to get was nowhere near this. It did get me familiar with the function, setup, and plumbing. I have reviewed this and I am happy. Once merged, I'll write the HOL-Light proof!

Development

Successfully merging this pull request may close these issues.

AVX2: Replace intrinsics implementation of poly_caddq with assembly

5 participants