Don't drop retransmitted packets that were ACKed past but not yet delivered#133
Don't drop retransmitted packets that were ACKed past but not yet delivered#133JohanG-LAS wants to merge 2 commits into
Conversation
|
Important Installation incomplete: to start using Gemini Code Assist, please ask the organization owner(s) to visit the Gemini Code Assist Admin Console and sign the Terms of Services. |
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #133 +/- ##
==========================================
+ Coverage 56.27% 56.60% +0.32%
==========================================
Files 22 23 +1
Lines 4174 4238 +64
==========================================
+ Hits 2349 2399 +50
- Misses 1509 1516 +7
- Partials 316 323 +7
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Thanks for your contribution. I need to take a closer look at that case. What is a bit "off" in this scenario, is that timestamps are lower in later packets than in earlier packets. SRT assumes that both the sequence number and the timestamps are increasing. That's why this late packet is dropped because the timestamp of packet 4 is after that of the retrasnmitted packet. |
|
Thanks for the careful look — you're right that the test's timestamp ordering doesn't match real SRT traffic. I should have flagged that. The reason I inverted the timestamps was to reach the buggy state (lastDelivered < seq_retrans < lastACK) in a single thread. In production the same state is reached through a race inside Tick() with monotonic timestamps: periodicACK releases r.lock before sendACK runs, and sendACK does network I/O via c.pop → c.onSend. While that's happening the network reader is free to call Push(): Tick N-1 ends with lastDelivered=759. If it would help, I can replace the test with one that keeps PktTsbpdTime strictly monotonic and injects the retransmit Push() from inside the OnSendACK callback — i.e. the same lock-free window that exists in production. Happy to push that if you'd like. |
|
Yes, please update the test accordingly. |
When periodicACK advances lastACKSequenceNumber past a gap (e.g. when packets after the gap have ripe PktTsbpdTime), a retransmission for a sequence inside that gap arrives at Push() with lastDeliveredSequenceNumber < seq < lastACKSequenceNumber and is dropped via the Lt(lastACKSequenceNumber) branch as "already acknowledged". This manifests in user reports as PktRecvRetrans approximately equal to PktRecvDrop. In production the state is reached through a race inside Tick(): periodicACK releases r.lock through its defer before sendACK is invoked, and sendACK performs network I/O (c.pop -> c.onSend). While sendACK runs the receiver lock is free, so the network reader goroutine is free to call Push(); a retransmit arriving there observes lastACK already advanced past the gap but lastDelivered still at its pre-tick value. This test reproduces that interleaving deterministically with strictly monotonic PktTsbpdTime on every packet by injecting the retransmit Push() from inside the OnSendACK callback -- i.e. the same lock-free window that exists in production. It fails on this commit with PktDrop=1; the follow-up commit fixes Push() so the retransmission is reinserted into packetList and delivered in order. Made-with: Cursor Co-authored-by: Cursor <cursoragent@cursor.com>
A retransmission for a sequence number that was ACKed past (because periodicACK crossed a gap, either in-row or via the PktTsbpdTime <= now branch) but not yet delivered was being dropped by the Lt(lastACKSequenceNumber) branch as "already acknowledged". Allow retransmissions through this branch; they fall into the out-of-order handling below where they either fill the gap or are detected as duplicates. TLPKTDROP semantics remain enforced by the earlier Lte(lastDeliveredSequenceNumber) check, which still drops anything already delivered or skipped past. Makes the regression test from the previous commit pass. Made-with: Cursor Co-authored-by: Cursor <cursoragent@cursor.com>
53fd6ce to
fcf4b1f
Compare
|
Restructured into two commits — test and receive. The test commit fails on its own (PktDrop=1) and the fix commit makes it pass. The test uses strictly monotonic PktTsbpdTime on every packet and reproduces the production interleaving by Push-ing the retransmit from inside the OnSendACK callback, i.e. the same lock-free window inside Tick() that the race exploits in production. |
Problem
receiver.Pushcould drop legitimate retransmissions, even though thepacket had not yet been delivered to the application. This was observed
in production via MediaMTX as a near-1:1 ratio between
packetsReceivedRetransandpacketsReceivedDrop— every retransmitthat the sender sent in response to a NAK was being received and then
silently discarded by the receiver. The same streams over libsrt
(
srt-live-transmit/TSDuck) recovered cleanly with zero drops, whichruled out the network and the sender.
This appears to be the same symptom reported in #91.
Smoking gun (from a debug-instrumented build)
GOSRT_RX rx=0xc00014e4e0 retrans_in seq=691158851 lastACK=691158858 lastDelivered=691158759 maxSeen=691158865 GOSRT_RX rx=0xc00014e4e0 drop reason=lt_lastACK seq=691158851 lastACK=691158858 lastDelivered=691158759 maxSeen=691158865 retrans=trueThe retransmission satisfied
lastDeliveredSequenceNumber < seq < lastACKSequenceNumberand wasdropped by the
Lt(r.lastACKSequenceNumber)branch as "alreadyacknowledged" — even though the gap had not yet been delivered or
skipped past.
Root cause
periodicACKand the delivery loop inreceiver.Tickare two separatecritical sections.
periodicACKcan advancelastACKSequenceNumberpast a gap when the packets after the gap have ripe
PktTsbpdTime,while the delivery loop breaks early on the first packet whose
PktTsbpdTimeis still in the future. A retransmission arriving inthis window for a sequence in the gap landed in the
Lt(lastACKSequenceNumber)drop branch.Fix
Allow retransmissions through the
Lt(lastACKSequenceNumber)branch.They fall into the existing out-of-order handling below, where they
either fill the gap (via
InsertBefore) or are detected as duplicatesby the existing linear-scan check. TLPKTDROP semantics are still
enforced by the earlier
Lte(lastDeliveredSequenceNumber)check, whichremains untouched and continues to drop anything already delivered or
skipped past.
Non-retransmit packets below
lastACK(which would be unusual butpossible under heavy reordering) are still dropped exactly as before.
Test
TestRecvRetransmitPastACKconstructs the bug state deterministically:PktTsbpdTimeso the delivery loopbreaks at the head of the list.
PktTsbpdTimesoperiodicACKadvances past the gap.Tick(50)— assertslastACK == 8andlastDeliveredis stilluninitialized.
RetransmittedPacketFlag = true.PktRetransincremented, and that seq 5 is now inpacketList.Tick(300)— asserts the full[0..8]is delivered in order.Verified that the test fails on
main(withPktDrop == 1exactly asthe production log showed) and passes with the fix applied.
Full
go test ./...suite remains green.Made-with: Cursor