
Costrbf no quadratic precompute #369

Open
jb-leger wants to merge 2 commits into deepcharles:master from jb-leger:costrbf_no_quadratic_precompute

Conversation

@jb-leger

CostRbf: O(n) memory mode and faster Gram matrix computation

Motivation

CostRbf worked well on small datasets with Pelt. Moving to larger time series, however, was painful: the process consumed all available RAM and was eventually killed by the OOM killer.

The root cause is that CostRbf precomputes and stores the full n × n Gram matrix right after fit(). For a signal of length n = 100 000 with Pelt, that is a 100 000 × 100 000 float64 matrix (roughly 75 GiB) allocated up front, regardless of how many entries are actually queried during segmentation. With Pelt and many small segments, the vast majority of that matrix is never read.
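A quick back-of-the-envelope check of that footprint:

```python
# Memory required by the precomputed n x n float64 Gram matrix.
n = 100_000
gram_bytes = n * n * 8          # 8 bytes per float64 entry
print(gram_bytes / 2**30)       # ~74.5 GiB (about 80 GB)
```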

This PR fixes the problem with three improvements and gives users explicit control over the memory vs. speed trade-off.


What this PR does

1. Faster explicit Gram matrix computation (quadratic_precompute=True)

The original implementation used scipy.spatial.distance.pdist + squareform, which allocates several large intermediate arrays.

The new implementation computes the squared-distance matrix in-place using BLAS-level matrix multiplication:

D[i,j] = ||x_i - x_j||^2 = ||x_i||^2 - 2 x_i·x_j + ||x_j||^2

written as:

mat = (-2 * centered_signal) @ centered_signal.T  # -2 x_i . x_j
mat += sqnorms[:, None]                           # + ||x_i||^2
mat += sqnorms[None, :]                           # + ||x_j||^2
np.maximum(mat, 0, out=mat)                       # clip round-off negatives
np.exp(-gamma * mat, out=mat)                     # in-place, no extra copy

Note that the signal is centered before the computation to limit numerical error.
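The identity can be checked end to end with a standalone sketch (not the PR's code) against a direct pairwise evaluation:

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.normal(size=(50, 3))
gamma = 0.5

# Center the signal, as in the PR, to limit numerical error.
centered = signal - signal.mean(axis=0)
sqnorms = np.einsum("ij,ij->i", centered, centered)

# BLAS-based squared distances, then the RBF kernel, computed in place.
mat = (-2.0 * centered) @ centered.T
mat += sqnorms[:, None]
mat += sqnorms[None, :]
np.maximum(mat, 0, out=mat)
np.exp(-gamma * mat, out=mat)

# Reference: direct pairwise evaluation via broadcasting.
diff = centered[:, None, :] - centered[None, :, :]
ref = np.exp(-gamma * (diff ** 2).sum(axis=-1))
print(np.abs(mat - ref).max())  # tiny, round-off only
```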

This is both faster (one matrix multiplication instead of looping over pairs) and more memory-efficient (fewer temporaries). The gram property is now a @functools.cached_property instead of a mutable attribute, which eliminates the manual None-guarding pattern.
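The caching pattern looks roughly like this (an illustrative sketch, not the PR's class):

```python
from functools import cached_property

import numpy as np

class GramHolder:
    """Illustrative only: cached_property replaces the usual None guard."""

    def __init__(self, signal):
        self.signal = np.asarray(signal, dtype=float)

    @cached_property
    def gram(self):
        # Computed once on first access, then stored on the instance;
        # no `if self._gram is None: ...` boilerplate needed.
        centered = self.signal - self.signal.mean(axis=0)
        sq = np.einsum("ij,ij->i", centered, centered)
        mat = (-2.0 * centered) @ centered.T
        mat += sq[:, None]
        mat += sq[None, :]
        np.maximum(mat, 0, out=mat)
        return np.exp(-mat)  # gamma = 1 for the sketch
```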

2. Subsampled median heuristic for large signals

When gamma is not specified, the original code estimated it via the median of all pairwise squared distances. For large n this requires materialising the full n×n distance matrix just for the heuristic — before even reaching the
kernel computation.

The new code limits the median computation to a regularly-spaced subsample of the distance matrix of size at most 4 096 × 4 096, regardless of n.
Empirical testing (test_cost_rbf_gamma_heuristic) shows that the subsampled estimate converges to the true value with a relative error below 0.1 % already at moderate n.
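The idea can be sketched as follows (a standalone approximation of the heuristic; the function name and the gamma = 1/median convention are assumptions, not the PR's exact code):

```python
import numpy as np

def median_heuristic_gamma(signal, max_side=4096):
    """Estimate gamma = 1 / median(pairwise squared distances),
    using at most max_side regularly spaced samples of the signal."""
    signal = np.asarray(signal, dtype=float)
    n = len(signal)
    idx = np.linspace(0, n - 1, min(n, max_side)).astype(int)
    sub = signal[idx]
    diff = sub[:, None, :] - sub[None, :, :]
    sq_dists = (diff ** 2).sum(axis=-1)
    # Median over distinct pairs only (upper triangle, no diagonal).
    med = np.median(sq_dists[np.triu_indices(len(sub), k=1)])
    return 1.0 / med if med > 0 else 1.0
```

The subsample size bounds the temporary distance matrix at max_side², so the heuristic's cost no longer grows quadratically with n.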

3. On-demand Gram matrix (quadratic_precompute=False)

The main new feature: instead of precomputing and storing the full Gram matrix, the kernel is evaluated lazily for each queried sub-block.

Two lightweight classes implement this:

  • _ImplicitSquareDistanceMatrix: computes D[rows, cols] on the fly via the same BLAS trick above, using @functools.cached_property for the centred signal and squared norms (both O(n)).
  • _ImplicitGramMatrix: wraps the above and applies the RBF kernel on the fly.

Memory complexity drops from O(n²) to O(n). At inference time, each error(start, end) call computes a block of size (end − start)², which remains cheap with Pelt and small segments.
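The lazy scheme can be sketched like this (class and method names are illustrative, mirroring the idea of _ImplicitGramMatrix rather than reproducing it):

```python
from functools import cached_property

import numpy as np

class LazyGram:
    """Sketch of on-demand RBF Gram blocks with O(n) stored state."""

    def __init__(self, signal, gamma):
        self.signal = np.asarray(signal, dtype=float)
        self.gamma = gamma

    @cached_property
    def centered(self):
        # O(n) memory, computed once.
        return self.signal - self.signal.mean(axis=0)

    @cached_property
    def sqnorms(self):
        # O(n) memory, computed once.
        return np.einsum("ij,ij->i", self.centered, self.centered)

    def block(self, start, end):
        """RBF kernel on the (end - start)^2 sub-block, computed on the fly."""
        sub = self.centered[start:end]
        sq = self.sqnorms[start:end]
        mat = (-2.0 * sub) @ sub.T
        mat += sq[:, None]
        mat += sq[None, :]
        np.maximum(mat, 0, out=mat)
        return np.exp(-self.gamma * mat)
```

Only the centered signal and the squared norms persist between calls; each queried block is materialised and discarded on demand.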

New parameter quadratic_precompute

CostRbf(quadratic_precompute=True)   # default — precompute, O(n²) memory
CostRbf(quadratic_precompute=False)  # lazy, O(n) memory
CostRbf(quadratic_precompute=None)   # auto: True if n≤4096, else False

The default is True, which preserves the existing behaviour.


Benchmark

Setup: Pelt with CostRbf, segments of size 100, dimension 100, varying signal length n. Measured wall-clock time and peak RSS memory.

Three variants are compared:

  • original_explicit — original master implementation
  • new_explicit — this PR with quadratic_precompute=True
  • new_implicit — this PR with quadratic_precompute=False

Linear scale

Benchmark linear scale

Variant              Largest n before OOM/timeout   Time at that n
original_explicit    ~35 000                        ~60 s
new_explicit         ~54 000                        ~65 s
new_implicit         > 130 000                      ~36 s

Memory at last measured point:

  • original_explicit and new_explicit: ~22–23 GiB peak (OOM boundary)
  • new_implicit: < 1 GiB peak at n = 130 000

Log-log scale

Benchmark log scale

In log-log space, original_explicit and new_explicit show a slope of ~2 (quadratic), confirming the O(n²) memory and time scaling. new_implicit shows a markedly shallower slope: time scales quasi-linearly with n for the range explored, and memory remains essentially flat.

Summary

  • new_explicit is strictly faster and more memory-efficient than original_explicit for the same precomputed mode, by virtue of the BLAS-based computation. It still hits the O(n²) wall, just later.
  • new_implicit (quadratic_precompute=False) breaks the O(n²) barrier entirely. It is slower than new_explicit for small n (the lazy evaluation has per-call overhead), but becomes advantageous beyond n ~ 1 000–10 000 depending on the number of segments, and scales to signal lengths that were previously impossible.

Tests

  • test_costrbf_explicit_implicit: verifies that both quadratic_precompute=True and quadratic_precompute=False produce numerically identical Gram matrix values for an arbitrary sub-block (checked against scipy).
  • test_cost_rbf_gamma_heuristic: verifies that the subsampled median heuristic converges to the correct value of gamma for signal lengths from 5 to 20 000 000, with at most 0.1 % relative error beyond the integer-rounding floor.

3 improvements:
 - rewrite the Gram matrix computation with in-place operations for
   speed, and matrix linear-algebra operations for the distance
   computation
 - for large matrices, subsample the distance matrix when computing the
   median in the gamma heuristic
 - add a mode that does not store the Gram matrix and computes only the
   needed parts on demand, giving a memory complexity of O(n)
Minor additions:
 - tests checking the computation of the Gram matrix in both explicit
   and implicit forms
 - tests checking the heuristic computation (especially the validity of
   the subsampling)
 - typing
- Added detailed explanations for the RBF kernel cost function and its parameters in the CostRbf class.
- Introduced a new section on memory usage, explaining the behavior of the quadratic_precompute parameter and its impact on memory complexity.
- Updated the example code to demonstrate the on-demand computation of the Gram matrix.

These changes improve clarity for users and provide guidance on optimizing memory usage for large signals.