[MongoDB] Adaptive batchSize to improve change stream timeout handling#609
Merged
Conversation
🦋 Changeset detected (latest commit: 8f60b65): the changes in this PR will be included in the next version bump. This PR includes changesets to release 12 packages.
For reference, this is what the logs look like now when manually triggering failures on getMore:
khawarizmus previously approved these changes on Apr 21, 2026.
khawarizmus approved these changes on Apr 21, 2026.
This changes batchSize values to improve timeout handling for `[PSYNC_S1345] Timeout while reading MongoDB ChangeStream` errors, helping the service recover automatically. For background on this issue, see #417. That PR significantly reduced the number of occurrences by improving pipeline performance on the source database, but we still see the issue come up occasionally. This PR tweaks the batchSize for the aggregate request to further improve the handling.
To understand how this helps, consider a case where we replicate from database A, which receives a slow stream of updates, while database B in the same cluster receives a very high rate of updates for a couple of minutes. If we fall behind with replication, we could end up with a backlog containing a handful of changes for A buried in a large volume of changes for B:
With our default batchSize of 6000, the change stream request would have to read through all the changes in B before accumulating 6000 changes for A, easily exceeding the request timeout. So the solution is to make the batch size smaller.
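To make the scanning cost concrete, here is a small back-of-the-envelope sketch with illustrative numbers (the 1-in-1000 ratio and helper name are not from the PR): if only a small fraction of backlog events are for database A, the work to fill a batch scales with the batch size divided by that fraction.

```typescript
// Illustrative estimate: how many total change stream events must be scanned
// to fill one batch of events that match database A.
// `eventsPerMatch` = average number of total events per matching event for A.
// Numbers and names are hypothetical, not taken from the PR.
function eventsScannedToFillBatch(batchSize: number, eventsPerMatch: number): number {
  return batchSize * eventsPerMatch;
}

// With 1 matching event per 1000 total events:
// a batch of 6000 needs ~6,000,000 scanned events,
// while a batch of 1 needs only ~1000.
```

Under these assumptions, a 6000-change batch requires scanning roughly 6000x more of the backlog than a single-change batch, which is why a large fixed batchSize can exceed the timeout while a small one makes progress.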
However, we can't universally make the batch size significantly smaller, since that can reduce throughput for the common case where this timeout isn't an issue. So for this, we use adaptive batch sizing:
- `batchSize: 1` for the `aggregate` command when opening a change stream. Since this is only run on startup or when retrying, the overhead of the tiny batch size should be small.
- `batchSize: 6000` for the `getMore` command by default.
- After a timeout, `batchSize: 1` for the first `getMore`, then double on every batch until it reaches the original batch size.

So if we do run into a timeout, the change stream restarts with a new aggregate command using `batchSize: 1`, followed by smaller `getMore` batch sizes, which allows us to make progress.

The original attempt only used `batchSize: 1` for the aggregate command, without changing the batchSize for the getMore command. There are, however, edge cases where that could result in glacial replication lag. For example, suppose we have this backlog:

With this backlog, we can expect a `batchSize: 6000` request to always fail. But with only `batchSize: 1` on the aggregate command, we'd only move one record further, then fail again on the next getMore, effectively replicating around one change per minute until we move past those original 5000 changes. By reducing the batchSize for getMore and doubling it again, we can make far more progress before running into the timeout. In the example above, we'd make requests with batchSizes of 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, and 2048 before the next request with 4096 fails, restarting at 1 again.
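The restart-small-and-double behavior described above can be sketched as follows. This is an illustrative sketch only; the class and constant names are hypothetical, not the actual implementation in this PR.

```typescript
// Illustrative sketch of adaptive batch sizing for getMore requests.
// DEFAULT_BATCH_SIZE matches the PR's default of 6000; the class itself
// is a hypothetical stand-in for the real implementation.
const DEFAULT_BATCH_SIZE = 6000;

class AdaptiveBatchSize {
  private current = DEFAULT_BATCH_SIZE;

  /** Batch size to use for the next getMore request. */
  nextBatchSize(): number {
    const size = this.current;
    // Double towards the default after each successful batch.
    this.current = Math.min(this.current * 2, DEFAULT_BATCH_SIZE);
    return size;
  }

  /** Called when a getMore request times out: restart small. */
  onTimeout(): void {
    this.current = 1;
  }
}
```

After a timeout, successive getMore requests would then use sizes 1, 2, 4, 8, and so on up to 6000, matching the sequence described above.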
Tests
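As context for the test changes below: with `enableTestCommands=1`, MongoDB exposes the `failCommand` failpoint, which can inject an error response for a specific command. A hedged sketch of building such a failpoint command follows; `failCommand` is a real MongoDB failpoint, but the helper name and the choice of error code are illustrative, not code from this PR.

```typescript
// Hypothetical helper: builds a `failCommand` failpoint document that makes
// the next getMore request fail with the given error code.
// (Requires a server started with --setParameter enableTestCommands=1.)
function failNextGetMore(errorCode: number) {
  return {
    configureFailPoint: "failCommand",
    mode: { times: 1 }, // only fail the next matching command
    data: { failCommands: ["getMore"], errorCode },
  };
}

// Usage sketch, assuming a connected MongoClient:
// await client.db("admin").command(failNextGetMore(50)); // 50 = MaxTimeMSExpired
```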
This configures the test setup to use `enableTestCommands=1` on the MongoDB server. That allows us to inject error responses for specific commands, making it easier to test these edge cases.

Flush
This now does a `flush()` and `setResumeToken()` on each change stream batch. This makes sure that we persist progress if we run into a fatal change stream error. This is only possible now that we're using raw change streams.

This change also breaks our logic for detecting a change in the database URI. The old behavior was not reliable here, since it depended on which type of resume token happened to be used. We can revisit this in the future by looking into different mechanisms to detect a database change (such as persisting and comparing the name).