Skip to content

Fix data race in connection pool by running QuicConnStart on the connection's worker thread#6114

Open
gaurav2699 wants to merge 7 commits into
mainfrom
fix-connection-pool-start-race
Open

Fix data race in connection pool by running QuicConnStart on the connection's worker thread#6114
gaurav2699 wants to merge 7 commits into
mainfrom
fix-connection-pool-start-race

Conversation

@gaurav2699

@gaurav2699 gaurav2699 commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Description

QuicConnPoolTryCreateConnection called QuicConnStart directly on the application thread instead of queuing it to the connection's worker.
Partway through, QuicConnStart calls QuicBindingAddSourceConnectionID, which publishes the connection's source CID into the binding lookup table.
From that point the receive datapath can match inbound packets to the connection and queue work to its worker. Because the start was running outside the per-connection operation queue, the OperQ->ActivelyProcessing gate was never set, so the first inbound packet caused the worker to begin draining the same connection in parallel with the app thread still inside QuicConnStart.
That breaks msquic's single-threaded-per-connection invariant: two threads then touch Connection->Crypto concurrently.
Fixes #6058

Testing

The spinquic stress test exercises this path

Documentation

NA

@gaurav2699 gaurav2699 requested a review from a team as a code owner June 22, 2026 13:13
@codecov

codecov Bot commented Jun 22, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 95.83333% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 84.86%. Comparing base (f39550f) to head (9185d46).
⚠️ Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
src/core/connection.c 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6114      +/-   ##
==========================================
- Coverage   86.13%   84.86%   -1.27%     
==========================================
  Files          60       60              
  Lines       18847    18878      +31     
==========================================
- Hits        16233    16021     -212     
- Misses       2614     2857     +243     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@guhetier guhetier left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks reasonable (although I have performance concerns if we need to queue every connection start this way, we will need to validate that).
The CI failures are likely pointing to some remaining issue though.

Comment thread src/core/operation.h Outdated
Comment thread src/core/connection_pool.c Outdated
Comment thread src/core/connection_pool.c Outdated
Comment thread src/core/connection.c
// internally owned so that closing it locally below drives shutdown
// completion.
//
Connection->State.ExternalOwner = FALSE;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a bad pattern to me.
The connection pool starts the connection, so it "externally owns it", as in it has a ref on the connection.
ExternalOwner = FALSE was used as a hack to avoid notifications, but that's is backfiring, and you are adding a dedicated flag now. Let's use that flag to prevent notifications.

Comment thread src/core/connection_pool.c Outdated
Comment thread src/core/connection_pool.c Outdated
if (QUIC_FAILED(Status)) {
*Connection = NULL;
}
QuicConnRelease(PoolConnection, QUIC_CONN_REF_HANDLE_OWNER);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is just QuicConnRelease enough to close the connection? Previously it was queueing the connection for closure.

@anrossi

anrossi commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

I agree with Guillaume's feedback. My general comments on the whole PR would be that the connection pools assumed they had ownership/control of connection start on the caller thread, so keep an eye out for corner cases built on that assumption.
Same goes for the ExternalOwner flag, but I think you are fixing that up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI - FAILURE]: stress-Debug-windows-windows-2022-x64-quictls-UseXdp (spinquic.exe)

3 participants