Add safety limits and diagnostics for ORDER BY + LIMIT infinite loops #1173
base: main
Conversation
Add iteration safeguards to prevent infinite loops that can occur when using Electric with large datasets and ORDER BY/LIMIT queries:

1. `maybeRunGraph` while loop (collection-config-builder.ts):
   - Can loop infinitely when data loading triggers graph updates
   - Happens when WHERE filters out most data, causing dataNeeded() > 0
   - Loading more data triggers updates that get filtered out
   - Added 10,000 iteration limit with error logging

2. `requestLimitedSnapshot` while loop (subscription.ts):
   - Can loop if index iteration has issues
   - Added 10,000 iteration limit with error logging
   - Removed unused `insertedKeys` tracking

3. `D2.run()` while loop (d2.ts):
   - Can loop infinitely on circular data flow bugs
   - Added 100,000 iteration limit with error logging

The safeguards log errors to help debug the root cause while preventing the app from freezing.
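For illustration, a minimal sketch of the guard pattern these changes add; `MAX_ITERATIONS`, `hasPendingWork`, and `runOnce` are placeholders, not the actual internals of collection-config-builder.ts or d2.ts:

```ts
// Hypothetical sketch of the safeguard pattern described above.
const MAX_ITERATIONS = 10_000

function runGraphWithSafeguard(
  hasPendingWork: () => boolean,
  runOnce: () => void
): void {
  let iterations = 0
  while (hasPendingWork()) {
    if (++iterations > MAX_ITERATIONS) {
      // Log instead of throwing so the app keeps running with the data it has.
      console.error(
        `maybeRunGraph exceeded ${MAX_ITERATIONS} iterations; ` +
          `breaking out to avoid freezing the app`
      )
      break
    }
    runOnce()
  }
}
```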
🦋 Changeset detected

Latest commit: 32bd8e8

The changes in this PR will be included in the next version bump. This PR includes changesets to release 13 packages:
@tanstack/angular-db
@tanstack/db
@tanstack/db-ivm
@tanstack/electric-db-collection
@tanstack/offline-transactions
@tanstack/powersync-db-collection
@tanstack/query-db-collection
@tanstack/react-db
@tanstack/rxdb-db-collection
@tanstack/solid-db
@tanstack/svelte-db
@tanstack/trailbase-db-collection
@tanstack/vue-db
Size Change: +491 B (+0.54%) Total Size: 91.4 kB
Size Change: 0 B Total Size: 3.7 kB
The infinite loop occurred because `loadMoreIfNeeded` kept trying to load data even when the local index was exhausted. This happened when:

1. TopK had fewer items than limit (WHERE filtered out data)
2. loadMoreIfNeeded tried to load more → no local data found
3. Loop continued indefinitely since TopK still needed data

Root cause fix:
- Add `localIndexExhausted` flag to CollectionSubscriber
- Track when local index has no more data for current cursor
- Stop calling loadMoreIfNeeded when exhausted
- Reset flag when new data arrives from sync layer (inserts)
- requestLimitedSnapshot now returns boolean indicating if data was found

Error handling improvements (per review feedback):
- D2.run() now throws Error when iteration limit exceeded
- Caller catches and calls transitionToError() for proper error state
- requestLimitedSnapshot returns false when iteration limit hit
- Live query properly shows error state if safeguard limits are hit

This fixes the issue for both eager and progressive syncMode.
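A simplified sketch of the flag described in this commit; the class and method bodies below are stand-ins that mirror the description rather than the real CollectionSubscriber implementation:

```ts
// Illustrative sketch of the localIndexExhausted flag described above.
class CollectionSubscriberSketch {
  private localIndexExhausted = false

  loadMoreIfNeeded(dataNeeded: number): void {
    if (dataNeeded <= 0 || this.localIndexExhausted) return
    const foundData = this.requestLimitedSnapshot()
    if (!foundData) {
      // No more local data for the current cursor: stop retrying
      // until the sync layer delivers new inserts.
      this.localIndexExhausted = true
    }
  }

  onSyncChanges(hasInserts: boolean): void {
    // Genuine new rows may satisfy the WHERE clause, so allow loading again.
    if (hasInserts) this.localIndexExhausted = false
  }

  private requestLimitedSnapshot(): boolean {
    // Returns true if any matching data was found (simplified placeholder).
    return false
  }
}
```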
force-pushed the branch from f47fbd0 to 2a796ff
These tests verify the localIndexExhausted fix works correctly:

1. Does not infinite loop when WHERE filters out most data
   - Query wants 10 items, only 2 match
   - Verifies status !== 'error' (fix works, not just safeguard)
2. Resumes loading when new matching data arrives
   - Starts with 0 matching items
   - Insert new matching items
   - localIndexExhausted resets, loads new data
3. Handles updates that move items out of WHERE clause
   - Updates change values to no longer match WHERE
   - TopK correctly refills from remaining matching data
force-pushed the branch from 453e7c7 to e698eb1
Adds a new test that directly reproduces the infinite loop bug by creating a collection with a custom loadSubset that synchronously injects updates matching the WHERE clause. This simulates Electric's behavior of sending continuous updates during graph execution.

The test verifies:
- Without the localIndexExhausted fix, loadSubset is called 100+ times (infinite loop)
- With the fix, loadSubset is called < 10 times (loop terminates correctly)

Also adds additional tests for edge cases around the localIndexExhausted flag.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Co-Authored-By: Claude Opus 4.5 <[email protected]>
The removed test synchronously injected data in loadSubset, which artificially creates an infinite loop. Electric's loadSubset is async (uses await stream.fetchSnapshot), so it can't synchronously inject data during the maybeRunGraph loop.

The remaining tests verify the localIndexExhausted flag's behavior correctly:
- Prevents repeated load attempts when exhausted
- Resets when new inserts arrive

Co-Authored-By: Claude Opus 4.5 <[email protected]>
The localIndexExhausted flag was incorrectly resetting when updates arrived because splitUpdates converts updates to delete+insert pairs for D2, and the hasInserts check was seeing those fake inserts.

Fix: Check for ORIGINAL inserts before calling splitUpdates, and pass that information to sendChangesToPipelineWithTracking. Now the flag only resets for genuine new inserts from the sync layer.

Also adds a test that verifies requestLimitedSnapshot calls are limited after the index is exhausted - with the fix, only 0-4 calls happen after initial load even when 20 updates arrive. Without the fix, it would be ~19 calls.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
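A simplified sketch of the ordering fix described here; the `Change` type and `sendChangesToPipeline` helper are illustrative, not the actual `sendChangesToPipelineWithTracking` signature:

```ts
// Detect genuine inserts *before* splitUpdates turns updates into
// delete+insert pairs. Types and names are illustrative stand-ins.
type Change = {
  type: "insert" | "update" | "delete"
  key: string | number
  value: unknown
}

// Placeholder for the real transformation (update -> delete + insert).
function splitUpdates(changes: Array<Change>): Array<Change> {
  return changes.flatMap((c) =>
    c.type === "update"
      ? [
          { ...c, type: "delete" as const },
          { ...c, type: "insert" as const },
        ]
      : [c]
  )
}

function sendChangesToPipeline(changes: Array<Change>): boolean {
  // Check for inserts BEFORE splitting, so the fake inserts produced by
  // splitUpdates cannot reset the localIndexExhausted flag.
  const hasOriginalInserts = changes.some((c) => c.type === "insert")
  const d2Changes = splitUpdates(changes)
  void d2Changes // would be fed into the D2 pipeline here
  return hasOriginalInserts
}
```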
Documents the reasoning for external reviewers:

1. splitUpdates fake inserts don't reset the flag
2. Updates to existing rows don't add new rows to scan
3. The edge case "update makes row match WHERE" is handled by filterAndFlipChanges converting unseen-key updates to inserts

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Remove the localIndexExhausted optimization and keep only the safety limits with enhanced diagnostic error messages. This approach:

1. Prevents app freezes by capping iterations
2. Provides detailed diagnostic info when limits are hit:
   - Collection IDs and query structure
   - TopK size vs data needed
   - Cursor position and iteration counts
   - Which D2 operators have pending work

The diagnostic info will help identify the actual root cause when these errors occur in production.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
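As a rough illustration of the kind of diagnostic payload described above (the field names here are assumptions, not the PR's exact warning format):

```ts
// Hypothetical shape of the circuit-breaker diagnostics.
interface CircuitBreakerDiagnostics {
  collectionIds: Array<string>
  queryStructure: string
  topKSize: number
  dataNeeded: number
  cursorPosition: unknown
  iterationCount: number
  operatorsWithPendingWork: Array<string>
}

function warnWithDiagnostics(
  label: string,
  diag: CircuitBreakerDiagnostics
): void {
  // Logged once, only when a safety limit is exceeded.
  console.warn(`${label}: iteration limit exceeded`, diag)
}
```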
Instead of transitioning to error state when iteration limits are hit, now we:

1. Log a warning with diagnostic info
2. Break out of the loop
3. Continue with the data we have

This makes sense because by the time we hit the limit, we likely have all available data - we just couldn't fill the TopK completely because WHERE filtered out most items.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Shows where the loop spent its time (e.g., "iterations 1-3: X, 4-10000: Y") to help identify stuck patterns when the safety limits are hit. This makes it much clearer where the infinite loop got stuck in the state machine. Co-Authored-By: Claude Opus 4.5 <[email protected]>
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Reduces code duplication across three circuit breaker implementations by creating a shared createIterationTracker utility. This makes the tracking logic easier to maintain and the circuit breaker implementations cleaner.

- New: packages/db-ivm/src/iteration-tracker.ts
- Simplified: D2.run(), maybeRunGraph(), requestLimitedSnapshot()
- Net reduction: ~140 lines

Co-Authored-By: Claude Opus 4.5 <[email protected]>
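A plausible shape for the shared utility (the exported API of packages/db-ivm/src/iteration-tracker.ts may well differ; this is a sketch under that assumption):

```ts
// Assumed sketch of a shared iteration tracker.
export function createIterationTracker(label: string, maxIterations: number) {
  let iterations = 0
  return {
    // Returns true when the caller should break out of its loop.
    // `diagnostics` is a thunk so the message is only built if the limit is hit.
    trackAndCheckLimit(diagnostics: () => string): boolean {
      iterations++
      if (iterations <= maxIterations) return false
      console.warn(
        `${label} exceeded ${maxIterations} iterations: ${diagnostics()}`
      )
      return true
    },
  }
}
```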
Verifies the iteration breakdown output format and state tracking logic used by circuit breaker diagnostics. Co-Authored-By: Claude Opus 4.5 <[email protected]>
I would definitely make this opt-in or enabled only in debug mode, because I'm worried about two things:

Regarding the limits being hit in a healthy application: what I mean is that an application could legitimately hit these limits during normal operation. To avoid these kinds of situations I would make it opt-in. Only if you encounter this bug that we are currently unable to reproduce would you enable these limits, in an attempt to get tracking information that helps us identify the root cause.

I'm also worried about this debugging code remaining in the codebase forever. I'd rather ask the users that are currently facing this bug to use the preview package from this PR to help find the root cause, rather than merging this into our main codebase. If we do that, then there's no need to make this opt-in or enabled only in debug mode.
I've had a look through, and I agree with Kevin: opt-in would be useful, but I'm not wedded to it - I think ideally we have the confidence to completely remove this at some point.

The comments repeatedly make reference to the "local index is exhausted", but there is no index... I think it's referring to the collection and not finding any more matching data, but it makes it a little hard to follow. I would like these to be cleared up so they explain exactly what is happening.

The tracking in places is causing allocations or stringification even when not needed; if we intend to keep this in long term, the parsing of values into a serialised state for warning messages should be lazy.

Also a few other things in here seem like redundant changes - I suspect just churn as this PR has evolved a lot - it would be nice to make it a little cleaner.
packages/db-ivm/src/d2.ts
Outdated
```ts
const operatorsWithWork = this.#operators
  .filter((op) => op.hasPendingWork())
  .map((op) => op.constructor.name)
  .sort()
  .join(`,`)
```
I'm not too keen on this inside the while loop; it's stringifying a bunch of stuff on each iteration even if it's never shown. Ideally anything we pass to trackAndCheckLimit is lazily stringified so we are minimising the work.
At a minimum, lift it out of the loop.
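One way the suggestion could look in practice; `Operator` and `trackAndCheckLimit` are stand-ins for the d2.ts internals, and the expensive stringification only runs if the limit is actually exceeded:

```ts
// Sketch: pass a closure so operator names are only collected and
// stringified when the iteration limit is hit.
interface Operator {
  hasPendingWork: () => boolean
  run: () => void
}

function runOperators(
  operators: Array<Operator>,
  trackAndCheckLimit: (diagnostics: () => string) => boolean
): void {
  while (operators.some((op) => op.hasPendingWork())) {
    const shouldBreak = trackAndCheckLimit(() =>
      // Only evaluated if the iteration limit is exceeded.
      operators
        .filter((op) => op.hasPendingWork())
        .map((op) => op.constructor.name)
        .sort()
        .join(`,`)
    )
    if (shouldBreak) break
    for (const op of operators) {
      if (op.hasPendingWork()) op.run()
    }
  }
}
```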
```ts
const sendChangesInRange = (
  changes: Iterable<ChangeMessage<any, string | number>>,
) => {
  const changesArray = Array.isArray(changes) ? changes : [...changes]
```
I don't understand why it's making this change here - splitUpdates accepts an iterator? I think it's a redundant change that's doing a wasteful allocation.
```ts
}

const trackedChanges = this.trackSentValues(changes, orderByInfo.comparator)
const changesArray = Array.isArray(changes) ? changes : [...changes]
```
Also same as above, trackSentValues accepts an iterator, this is redundant.
```ts
private loadNextItems(
  n: number,
  subscription: CollectionSubscription,
): boolean {
```
This returning a boolean doesn't seem to be used anywhere, can we keep the same signature for now?
Normally the boolean indicates whether it's done loading more items or not. IIRC, if we hit the end of the data then we return true, indicating we're done loading items (even though the topK may not be completely filled). This is essential to know that we need to stop requesting more data.
I'm confused about the comment it added about loading this locally. This could be loaded from the backend as well.
```diff
   minValues,
   offset,
-}: RequestLimitedSnapshotOptions) {
+}: RequestLimitedSnapshotOptions): boolean {
```
This return value is passed through by loadNextItems but then not used anywhere.
```diff
-    if (minValue instanceof Date) {
-      const minValuePlus1ms = new Date(minValue.getTime() + 1)
+    if (cursorMinValue instanceof Date) {
+      const minValuePlus1ms = new Date(cursorMinValue.getTime() + 1)
       whereCurrentCursor = and(
-        gte(expression, new Value(minValue)),
+        gte(expression, new Value(cursorMinValue)),
         lt(expression, new Value(minValuePlus1ms)),
       )
     } else {
-      whereCurrentCursor = eq(expression, new Value(minValue))
+      whereCurrentCursor = eq(expression, new Value(cursorMinValue))
```
Renames for no reason
```diff
+  let hitIterationLimit = false
+
   while (valuesNeeded() > 0 && !collectionExhausted()) {
-    const insertedKeys = new Set<string | number>() // Track keys we add to `changes` in this iteration
```
It's interesting that it's spotted this is unused and removed it.
@kevin-dp insertedKeys was introduced in one of your PRs, can you remember what it was for?
I am wondering a little bit if this could be a bug - if it was intended to be used to track sent data, but we are not doing so?
I traced back the original PR that introduced this variable. Keys were added to this set but the set was never used. I think it must have been a remnant of a previous approach I had tried to fix the bug; I then switched to a different approach and this variable went unnoticed. (This is the PR that introduced it: #713)
- Simplify iteration tracker to lazy evaluation (only compute diagnostics when the limit is actually exceeded, not on every iteration)
- Revert unnecessary array conversions in collection-subscriber.ts
- Revert unused boolean return from loadNextItems
- Keep cursorMinValue rename (fixes ESLint shadowing warning)
- Confirmed insertedKeys was dead code (never used after being set)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
The iteration checker now tracks both total iterations AND iterations without state change. This catches:

- True infinite loops (same state repeating)
- Oscillation patterns (A→B→A→B)
- Slow progress that exceeds the total limit

Callers pass a lightweight state key (e.g., count of operators with work, number of changes collected) that resets the same-state counter when it changes. If making progress, the loop is allowed more total iterations; if stuck on the same state, the limit triggers sooner.

Runtime cost: one primitive comparison per iteration (~free).

Co-Authored-By: Claude Opus 4.5 <[email protected]>
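A sketch of the dual-counter idea under the assumption that the tracker exposes a single `trackAndCheckLimit` call per iteration; the real API may differ:

```ts
// Assumed sketch: a cheap state key resets the "same state" counter,
// so true loops trip earlier than slow-but-real progress.
export function createDualLimitTracker(
  label: string,
  maxTotal: number,
  maxSameState: number
) {
  let total = 0
  let sameState = 0
  let lastKey: string | number | undefined

  return {
    trackAndCheckLimit(
      stateKey: string | number,
      diagnostics: () => string
    ): boolean {
      total++
      sameState = stateKey === lastKey ? sameState + 1 : 0
      lastKey = stateKey
      if (total <= maxTotal && sameState <= maxSameState) return false
      console.warn(
        `${label} hit ${sameState > maxSameState ? `same-state` : `total`} ` +
          `iteration limit: ${diagnostics()}`
      )
      return true
    },
  }
}
```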
Remove the confusing "index exhausted" terminology and clarify that this function loads from the local collection via a BTree index and also triggers an async backend fetch.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Summary
Adds iteration safety limits to prevent app freezes from excessive iterations in ORDER BY + LIMIT queries. When limits are hit, warnings are logged with diagnostic info but queries continue normally with available data.
User impact: Apps no longer freeze. If iteration limits are hit, users get their data (possibly incomplete TopK) and a console warning with diagnostic info.
Root Cause
ORDER BY + LIMIT queries can enter excessive iteration cycles when:

- WHERE filters out most of the data, so fewer rows match than the `limit`
- The TopK therefore ends up with fewer items than the limit (e.g., 10)
- Loading more data triggers further updates that are also filtered out

The exact trigger conditions in production are still unclear - we observed the debugger pausing in WHERE clause evaluation (`and`, `gte` cases) during `requestLimitedSnapshot`.

Approach
Graceful circuit breakers: Add iteration limits that break out of loops and continue with available data:

- `D2.run()`
- `maybeRunGraph`
- `requestLimitedSnapshot`

Key insight: By the time we hit the limit, we likely have all available data - we just couldn't fill the TopK completely because WHERE filtered out most items. So we log a warning and continue rather than erroring.
Iteration breakdown tracking: When limits are hit, warnings show where the loop got stuck (e.g., "iterations 1-3: X, iterations 4-10000: Y").
This makes it clear that the loop got stuck requesting more data for TopK after filtering completed.
Additional diagnostic info logged:

- Collection IDs and query structure
- TopK size vs data needed
- Cursor position and iteration counts
- Which D2 operators have pending work
Non-goals
Trade-offs
Verification