Fix data race in connection pool by running QuicConnStart on the connection's worker thread#6114
Fix data race in connection pool by running QuicConnStart on the connection's worker thread#6114gaurav2699 wants to merge 7 commits into
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #6114 +/- ##
==========================================
- Coverage 86.13% 84.86% -1.27%
==========================================
Files 60 60
Lines 18847 18878 +31
==========================================
- Hits 16233 16021 -212
- Misses 2614 2857 +243 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
guhetier
left a comment
There was a problem hiding this comment.
The change looks reasonable (although I have performance concerns if we need to queue every connection start this way, we will need to validate that).
The CI failures are likely pointing to some remaining issue though.
| // internally owned so that closing it locally below drives shutdown | ||
| // completion. | ||
| // | ||
| Connection->State.ExternalOwner = FALSE; |
There was a problem hiding this comment.
This seems like a bad pattern to me.
The connection pool starts the connection, so it "externally owns it", as in it has a ref on the connection.
ExternalOwner = FALSE was used as a hack to avoid notifications, but that's is backfiring, and you are adding a dedicated flag now. Let's use that flag to prevent notifications.
| if (QUIC_FAILED(Status)) { | ||
| *Connection = NULL; | ||
| } | ||
| QuicConnRelease(PoolConnection, QUIC_CONN_REF_HANDLE_OWNER); |
There was a problem hiding this comment.
Is just QuicConnRelease enough to close the connection? Previously it was queueing the connection for closure.
|
I agree with Guillaume's feedback. My general comments on the whole PR would be that the connection pools assumed they had ownership/control of connection start on the caller thread, so keep an eye out for corner cases built on that assumption. |
Description
QuicConnPoolTryCreateConnection called QuicConnStart directly on the application thread instead of queuing it to the connection's worker.
Partway through, QuicConnStart calls QuicBindingAddSourceConnectionID, which publishes the connection's source CID into the binding lookup table.
From that point the receive datapath can match inbound packets to the connection and queue work to its worker. Because the start was running outside the per-connection operation queue, the OperQ->ActivelyProcessing gate was never set, so the first inbound packet caused the worker to begin draining the same connection in parallel with the app thread still inside QuicConnStart.
That breaks msquic's single-threaded-per-connection invariant: two threads then touch Connection->Crypto concurrently.
Fixes #6058
Testing
The spinquic stress test exercises this path
Documentation
NA