
Costrbf no quadratic precompute #369

Open
jb-leger wants to merge 2 commits into deepcharles:master from jb-leger:costrbf_no_quadratic_precompute

Conversation

@jb-leger

CostRbf: O(n) memory mode and faster Gram matrix computation

Motivation

CostRbf worked well on small datasets with Pelt. Moving to larger time series, however, was painful: the process consumed all available RAM and was eventually killed by the OOM killer.

The root cause is that CostRbf precomputes and stores the full n × n Gram matrix right after fit(). For a signal of length n = 100 000 with Pelt, that is a 100 000 × 100 000 float64 matrix (roughly 75 GiB) allocated up front, regardless of how many entries are actually queried during segmentation. With Pelt and many small segments, the vast majority of that matrix is never read.
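A quick back-of-the-envelope check of that footprint:

```python
# Memory required by the precomputed n x n float64 Gram matrix.
n = 100_000
gram_bytes = n * n * 8          # 8 bytes per float64 entry
print(gram_bytes / 2**30)       # ~74.5 GiB (about 80 GB)
```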

This PR fixes the problem with three improvements and gives users explicit control over the memory vs. speed trade-off.


What this PR does

1. Faster explicit Gram matrix computation (quadratic_precompute=True)

The original implementation used scipy.spatial.distance.pdist + squareform, which allocates several large intermediate arrays.

The new implementation computes the squared-distance matrix in-place using BLAS-level matrix multiplication:

D[i,j] = ||x_i - x_j||^2 = ||x_i||^2 - 2 x_i·x_j + ||x_j||^2

written as:

mat = (-2 * centered_signal) @ centered_signal.T  # -2 x_i . x_j
mat += sqnorms[:, None]                           # + ||x_i||^2
mat += sqnorms[None, :]                           # + ||x_j||^2
np.maximum(mat, 0, out=mat)                       # clip round-off negatives
np.exp(-gamma * mat, out=mat)                     # in-place, no extra copy

Note that the signal is centered before the computation to limit numerical error.
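The identity can be checked end to end with a standalone sketch (not the PR's code) against a direct pairwise evaluation:

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.normal(size=(50, 3))
gamma = 0.5

# Center the signal, as in the PR, to limit numerical error.
centered = signal - signal.mean(axis=0)
sqnorms = np.einsum("ij,ij->i", centered, centered)

# BLAS-based squared distances, then the RBF kernel, computed in place.
mat = (-2.0 * centered) @ centered.T
mat += sqnorms[:, None]
mat += sqnorms[None, :]
np.maximum(mat, 0, out=mat)
np.exp(-gamma * mat, out=mat)

# Reference: direct pairwise evaluation via broadcasting.
diff = centered[:, None, :] - centered[None, :, :]
ref = np.exp(-gamma * (diff ** 2).sum(axis=-1))
print(np.abs(mat - ref).max())  # tiny, round-off only
```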

This is both faster (one matrix multiplication instead of looping over pairs) and more memory-efficient (fewer temporaries). The gram property is now a @functools.cached_property instead of a mutable attribute, which eliminates the manual None-guarding pattern.
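The caching pattern looks roughly like this (an illustrative sketch, not the PR's class):

```python
from functools import cached_property

import numpy as np

class GramHolder:
    """Illustrative only: cached_property replaces the usual None guard."""

    def __init__(self, signal):
        self.signal = np.asarray(signal, dtype=float)

    @cached_property
    def gram(self):
        # Computed once on first access, then stored on the instance;
        # no `if self._gram is None: ...` boilerplate needed.
        centered = self.signal - self.signal.mean(axis=0)
        sq = np.einsum("ij,ij->i", centered, centered)
        mat = (-2.0 * centered) @ centered.T
        mat += sq[:, None]
        mat += sq[None, :]
        np.maximum(mat, 0, out=mat)
        return np.exp(-mat)  # gamma = 1 for the sketch
```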

2. Subsampled median heuristic for large signals

When gamma is not specified, the original code estimated it via the median of all pairwise squared distances. For large n this requires materialising the full n×n distance matrix just for the heuristic — before even reaching the
kernel computation.

The new code limits the median computation to a regularly-spaced subsample of the distance matrix of size at most 4 096 × 4 096, regardless of n.
Empirical testing (test_cost_rbf_gamma_heuristic) shows that the subsampled estimate converges to the true value with a relative error below 0.1 % already at moderate n.
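The idea can be sketched as follows (a standalone approximation of the heuristic; the function name and the gamma = 1/median convention are assumptions, not the PR's exact code):

```python
import numpy as np

def median_heuristic_gamma(signal, max_side=4096):
    """Estimate gamma = 1 / median(pairwise squared distances),
    using at most max_side regularly spaced samples of the signal."""
    signal = np.asarray(signal, dtype=float)
    n = len(signal)
    idx = np.linspace(0, n - 1, min(n, max_side)).astype(int)
    sub = signal[idx]
    diff = sub[:, None, :] - sub[None, :, :]
    sq_dists = (diff ** 2).sum(axis=-1)
    # Median over distinct pairs only (upper triangle, no diagonal).
    med = np.median(sq_dists[np.triu_indices(len(sub), k=1)])
    return 1.0 / med if med > 0 else 1.0
```

The subsample size bounds the temporary distance matrix at max_side², so the heuristic's cost no longer grows quadratically with n.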

3. On-demand Gram matrix (quadratic_precompute=False)

The main new feature: instead of precomputing and storing the full Gram matrix, the kernel is evaluated lazily for each queried sub-block.

Two lightweight classes implement this:

  • _ImplicitSquareDistanceMatrix: computes D[rows, cols] on the fly via the same BLAS trick above, using @functools.cached_property for the centred signal and squared norms (both O(n)).
  • _ImplicitGramMatrix: wraps the above and applies the RBF kernel on the fly.

Memory complexity drops from O(n²) to O(n). At inference time, each error(start, end) call computes a block of size (end − start)², which remains cheap with Pelt and small segments.
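The lazy scheme can be sketched like this (class and method names are illustrative, mirroring the idea of _ImplicitGramMatrix rather than reproducing it):

```python
from functools import cached_property

import numpy as np

class LazyGram:
    """Sketch of on-demand RBF Gram blocks with O(n) stored state."""

    def __init__(self, signal, gamma):
        self.signal = np.asarray(signal, dtype=float)
        self.gamma = gamma

    @cached_property
    def centered(self):
        # O(n) memory, computed once.
        return self.signal - self.signal.mean(axis=0)

    @cached_property
    def sqnorms(self):
        # O(n) memory, computed once.
        return np.einsum("ij,ij->i", self.centered, self.centered)

    def block(self, start, end):
        """RBF kernel on the (end - start)^2 sub-block, computed on the fly."""
        sub = self.centered[start:end]
        sq = self.sqnorms[start:end]
        mat = (-2.0 * sub) @ sub.T
        mat += sq[:, None]
        mat += sq[None, :]
        np.maximum(mat, 0, out=mat)
        return np.exp(-self.gamma * mat)
```

Only the centered signal and the squared norms persist between calls; each queried block is materialised and discarded on demand.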

New parameter quadratic_precompute

CostRbf(quadratic_precompute=True)   # default — precompute, O(n²) memory
CostRbf(quadratic_precompute=False)  # lazy, O(n) memory
CostRbf(quadratic_precompute=None)   # auto: True if n≤4096, else False

The default is True, which preserves the existing behaviour.


Benchmark

Setup: Pelt with CostRbf, segments of size 100, dimension 100, varying signal length n. Measured wall-clock time and peak RSS memory.

Three variants are compared:

  • original_explicit — original master implementation
  • new_explicit — this PR with quadratic_precompute=True
  • new_implicit — this PR with quadratic_precompute=False

Linear scale

Benchmark linear scale

Variant              Largest n before OOM/timeout   Time at that n
original_explicit    ~35 000                        ~60 s
new_explicit         ~54 000                        ~65 s
new_implicit         > 130 000                      ~36 s

Memory at last measured point:

  • original_explicit and new_explicit: ~22–23 GiB peak (OOM boundary)
  • new_implicit: < 1 GiB peak at n = 130 000

Log-log scale

Benchmark log scale

In log-log space, original_explicit and new_explicit show a slope of ~2 (quadratic), confirming the O(n²) memory and time scaling. new_implicit shows a markedly shallower slope: time scales quasi-linearly with n for the range explored, and memory remains essentially flat.

Summary

  • new_explicit is strictly faster and more memory-efficient than original_explicit for the same precomputed mode, by virtue of the BLAS-based computation. It still hits the O(n²) wall, just later.
  • new_implicit (quadratic_precompute=False) breaks the O(n²) barrier entirely. It is slower than new_explicit for small n (the lazy evaluation has per-call overhead), but becomes advantageous beyond n ~ 1 000–10 000 depending on the number of segments, and scales to signal lengths that were previously impossible.

Tests

  • test_costrbf_explicit_implicit: verifies that both quadratic_precompute=True and quadratic_precompute=False produce numerically identical Gram matrix values for an arbitrary sub-block (checked against scipy).
  • test_cost_rbf_gamma_heuristic: verifies that the subsampled median heuristic converges to the correct value of gamma for signal lengths from 5 to 20 000 000, with at most 0.1 % relative error beyond the integer-rounding floor.

3 improvements:
 - rewrite the Gram matrix computation with in-place operations for
   speed, and matrix linear-algebra operations for the distance
   computation
 - for large matrices, subsample the distance matrix when computing the
   median in the gamma heuristic
 - add a mode that does not store the Gram matrix and computes only the
   needed parts on demand, giving a memory complexity of O(n)
Minor additions:
 - tests checking the computation of the Gram matrix in both explicit
   and implicit forms
 - tests checking the heuristic computation (especially the validity of
   the subsampling)
 - typing
- Added detailed explanations for the RBF kernel cost function and its parameters in the CostRbf class.
- Introduced a new section on memory usage, explaining the behavior of the quadratic_precompute parameter and its impact on memory complexity.
- Updated the example code to demonstrate the on-demand computation of the Gram matrix.

These changes improve clarity for users and provide guidance on optimizing memory usage for large signals.