Skip to content

[BUG] matmul produces incorrect results for F-contiguous / non-row-major inputs #89

@SwayamInSync

Description

@SwayamInSync

np.matmul on QuadPrecision arrays returns incorrect results whenever either input is F-contiguous (or otherwise not row-major-contiguous in its last two axes).

Reproducer

click to expand
import numpy as np
from numpy_quaddtype import QuadPrecision, QuadPrecDType

dt = QuadPrecDType(backend='sleef')
def Q(v): return QuadPrecision(str(float(v)), backend='sleef')

A = np.asfortranarray(np.array([[Q(1), Q(2), Q(3)],
                                [Q(4), Q(5), Q(6)]], dtype=dt))
B = np.asfortranarray(np.array([[Q(1), Q(2)],
                                [Q(3), Q(4)],
                                [Q(5), Q(6)]], dtype=dt))

# A is F-contig: strides=(16, 32); B is F-contig: strides=(16, 48)
print("quad F@F:", [[float(v) for v in r] for r in (A @ B)])
# -> [[23.0, 27.0], [35.0, 32.0]]   <- WRONG

# Same call against native dtypes works fine — proving the bug is on our side:
A32 = np.asfortranarray(np.array([[1., 2., 3.], [4., 5., 6.]], dtype=np.float32))
B32 = np.asfortranarray(np.array([[1., 2.], [3., 4.], [5., 6.]], dtype=np.float32))
print("float32 F@F:", (A32 @ B32).tolist())
# -> [[22.0, 28.0], [49.0, 64.0]]   <- correct

# Workaround: force C-contig before matmul
print("quad (C@C):",
      [[float(v) for v in r] for r in (np.ascontiguousarray(A) @ np.ascontiguousarray(B))])
# -> [[22.0, 28.0], [49.0, 64.0]]   <- correct

Root cause

quad_matmul_strided_loop_aligned in src/csrc/umath/matmul.cpp unconditionally:

size_t lda = A_row_stride / sizeof(Sleef_quad);
// ...
qblas_gemm('R', 'N', 'N', m, p, n, &alpha, A_ptr, lda, B_ptr, ldb,
           &beta, C_ptr, ldc_numpy);

This silently assumes:

  • the input is row-major contiguous in its core dimensions and the right layout argument to QBLAS is always 'R'.

Possible fixes

  • Inspect each operand's last-two-axis strides: dispatch directly to qblas_gemm('R', ...) or ('C', ...) when both operands share a contiguous layout, otherwise memcpy the offending operand(s) into a row-major temp buffer (same trick the unaligned path already uses) and call qblas_gemm('R', ...).
  • Performance (needs QBLAS-side change too). For mixed-layout inputs (R @ C and C @ R) the copy above can be skipped by passing transa/transb='T' instead, which would require qblas_gemm to gain transpose support. See [FEAT] Support transpose flags transa, transb != 'N' in GEMM SwayamInSync/QBLAS#13

Related tests

On the #88 two tests already exercise this bug and are marked xfail:

  • tests/test_dot.py::TestMatmulBatched::test_batched_fortran_order
  • tests/test_dot.py::TestMatmulBatched::test_batched_transposed_view

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions