[BUG] `matmul` produces incorrect results for F-contiguous / non-row-major inputs

`np.matmul` on QuadPrecision arrays returns incorrect results whenever either input is F-contiguous (or otherwise not row-major-contiguous in its last two axes).

## Reproducer

<details>
<summary>click to expand</summary>

```python
import numpy as np
from numpy_quaddtype import QuadPrecision, QuadPrecDType

dt = QuadPrecDType(backend='sleef')
def Q(v): return QuadPrecision(str(float(v)), backend='sleef')

A = np.asfortranarray(np.array([[Q(1), Q(2), Q(3)],
                                [Q(4), Q(5), Q(6)]], dtype=dt))
B = np.asfortranarray(np.array([[Q(1), Q(2)],
                                [Q(3), Q(4)],
                                [Q(5), Q(6)]], dtype=dt))

# A is F-contig: strides=(16, 32); B is F-contig: strides=(16, 48)
print("quad F@F:", [[float(v) for v in r] for r in (A @ B)])
# -> [[23.0, 27.0], [35.0, 32.0]]   <- WRONG

# Same call against native dtypes works fine — proving the bug is on our side:
A32 = np.asfortranarray(np.array([[1., 2., 3.], [4., 5., 6.]], dtype=np.float32))
B32 = np.asfortranarray(np.array([[1., 2.], [3., 4.], [5., 6.]], dtype=np.float32))
print("float32 F@F:", (A32 @ B32).tolist())
# -> [[22.0, 28.0], [49.0, 64.0]]   <- correct

# Workaround: force C-contig before matmul
print("quad (C@C):",
      [[float(v) for v in r] for r in (np.ascontiguousarray(A) @ np.ascontiguousarray(B))])
# -> [[22.0, 28.0], [49.0, 64.0]]   <- correct
````

</details>

## Root cause

`quad_matmul_strided_loop_aligned` in `src/csrc/umath/matmul.cpp` unconditionally:

```cpp
size_t lda = A_row_stride / sizeof(Sleef_quad);
// ...
qblas_gemm('R', 'N', 'N', m, p, n, &alpha, A_ptr, lda, B_ptr, ldb,
           &beta, C_ptr, ldc_numpy);
```

This silently assumes:

- the input is **row-major contiguous** in its core dimensions and the right layout argument to QBLAS is always `'R'`.

## Possible fixes
- Inspect each operand's last-two-axis strides: dispatch directly to `qblas_gemm('R', ...)` or `('C', ...)` when both operands share a contiguous layout, otherwise memcpy the offending operand(s) into a row-major temp buffer (same trick the unaligned path already uses) and call `qblas_gemm('R', ...)`.
- **Performance (needs QBLAS-side change too).** For mixed-layout inputs (R @ C and C @ R) the copy above can be skipped by passing `transa`/`transb='T'` instead, which would require `qblas_gemm` to gain transpose support. See https://github.com/SwayamInSync/QBLAS/issues/13

## Related tests

On the #88 two tests already exercise this bug and are marked `xfail`:

- `tests/test_dot.py::TestMatmulBatched::test_batched_fortran_order`
- `tests/test_dot.py::TestMatmulBatched::test_batched_transposed_view`


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[BUG] `matmul` produces incorrect results for F-contiguous / non-row-major inputs #89

Reproducer

Root cause

Possible fixes

Related tests

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

[BUG] matmul produces incorrect results for F-contiguous / non-row-major inputs #89

Description

Reproducer

Root cause

Possible fixes

Related tests

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

[BUG] `matmul` produces incorrect results for F-contiguous / non-row-major inputs #89