Motivation
Arrangements backed by large batches can incur cache misses or page faults when cursors seek through cold data. With a custom allocator that supports prefetching (e.g., madvise(MADV_WILLNEED) or explicit page-in), we could hide this latency — if operators can express what data they'll need next.
This issue explores where prefetch hints could be injected and what API shapes might work.
Current access pattern
The join loop (`join.rs:367–396`) is a synchronized merge of two cursors:

```rust
while let (Some(batch_key), Some(trace_key), ..) = (..) {
    match trace_key.cmp(&batch_key) {
        Less    => trace.seek_key(trace_storage, batch_key),
        Greater => batch.seek_key(batch_storage, trace_key),
        Equal   => { /* load vals/times, compute, step both */ }
    }
}
```
Key observations:
- The batch cursor drives the access pattern. Its keys are known upfront (small, new batch). The trace cursor seeks to match.
- `CursorList::seek_key` fans out to every inner cursor (`cursor_list.rs:151–155`), each doing exponential+binary search in its `keys` container.
- After a key match, all vals and (time, diff) pairs are loaded — `ValueHistory::load` walks the full value/time structure for that key.
- No lookahead exists today. The join processes one key at a time with zero knowledge of the next.
- Reduce is better positioned: it maintains a `pending_keys` list and could prefetch the full batch of upcoming keys.
Where prefetch helps
- Key lookup across batches — when seeking `k` in the trace, prefetch the approximate landing position for the next batch key in each spine batch.
- Value/update loading — once we know `key_cursor`, prefetch the `vals` and `upds` ranges before starting the Cartesian product for the previous key.
- Parallel prefetch across `CursorList` — issue prefetches to all batches before performing the actual seeks.
Possible API shapes
A) Cursor-level hint (operator-facing, minimal)
```rust
trait Cursor {
    /// Hint that the cursor will soon seek to `key`. Default: no-op.
    fn prefetch_key(&self, _storage: &Self::Storage, _key: Self::Key<'_>) {}
}
```
The join peeks at the next batch key and calls `trace.prefetch_key(storage, next_key)` one iteration ahead. `CursorList` fans out to all inner cursors.
Pros: Simple, opt-in, no breaking changes.
Cons: Only one key ahead; the operator must drive it.
B) Container-level hint (storage-facing, transparent)
```rust
trait BatchContainer {
    /// Hint that indices in [start, end) will be accessed soon. Default: no-op.
    fn prefetch(&self, _start: usize, _end: usize) {}
}
```
`OrdValCursor::seek_key` could, after computing the destination index, prefetch the associated value/update offset and data ranges. Invisible to operators.
Pros: Transparent; also benefits merge cursors.
Cons: Container must know its memory layout to compute addresses.
C) Trace-level batch hint (maximum information)
```rust
trait TraceReader {
    /// Hint that these keys will be looked up soon. Default: no-op.
    fn prefetch_keys<I: IntoIterator<Item = ...>>(&self, keys: I) {}
}
```
The operator extracts the next N keys from the batch cursor and passes them all at once. The spine fans out to each batch; the allocator gets a batch of prefetch requests.
Pros: Maximum information for the allocator; amortizable.
Cons: More invasive; trace must expose inner structure.
Pieces not yet in place
- Allocator ↔ container connection. `BatchContainer` / `Layout` have no reference to an allocator. Containers would need either an allocator handle or a global/thread-local prefetch mechanism.
- Address computation. Translating "key index N" to a byte range is trivial for `Vec<T>` (`ptr + N * size_of::<T>()`) but encoding-dependent for columnar containers.
- Peek-ahead in cursors. The `Cursor` trait has no `peek_next_key()`. The join could read ahead in the batch directly (it owns the storage), or a small API addition could formalize this.
Suggested starting point
Combine A + B: add `prefetch_key` on `Cursor` (default no-op) and `prefetch` on `BatchContainer` (default no-op). In `OrdValCursor::prefetch_key`, estimate the key's landing position and prefetch the key/val/update byte ranges. In the join loop, peek at the next batch key and call the hint one iteration ahead. All zero-cost until a container opts in.
For reduce, prefetch the full `pending_keys` list upfront since it's known ahead of time.