Skip to content

Don't drop retransmitted packets that were ACKed past but not yet delivered#133

Open
JohanG-LAS wants to merge 2 commits into
datarhei:mainfrom
JohanG-LAS:fix/retransmit-drop-past-ack
Open

Don't drop retransmitted packets that were ACKed past but not yet delivered#133
JohanG-LAS wants to merge 2 commits into
datarhei:mainfrom
JohanG-LAS:fix/retransmit-drop-past-ack

Conversation

@JohanG-LAS
Copy link
Copy Markdown

Problem

receiver.Push could drop legitimate retransmissions, even though the
packet had not yet been delivered to the application. This was observed
in production via MediaMTX as a near-1:1 ratio between
packetsReceivedRetrans and packetsReceivedDrop — every retransmit
that the sender sent in response to a NAK was being received and then
silently discarded by the receiver. The same streams over libsrt
(srt-live-transmit/TSDuck) recovered cleanly with zero drops, which
ruled out the network and the sender.

This appears to be the same symptom reported in #91.

Smoking gun (from a debug-instrumented build)

GOSRT_RX rx=0xc00014e4e0 retrans_in seq=691158851 lastACK=691158858 lastDelivered=691158759 maxSeen=691158865 GOSRT_RX rx=0xc00014e4e0 drop reason=lt_lastACK seq=691158851 lastACK=691158858 lastDelivered=691158759 maxSeen=691158865 retrans=true

The retransmission satisfied
lastDeliveredSequenceNumber < seq < lastACKSequenceNumber and was
dropped by the Lt(r.lastACKSequenceNumber) branch as "already
acknowledged" — even though the gap had not yet been delivered or
skipped past.

Root cause

periodicACK and the delivery loop in receiver.Tick are two separate
critical sections. periodicACK can advance lastACKSequenceNumber
past a gap when the packets after the gap have ripe PktTsbpdTime,
while the delivery loop breaks early on the first packet whose
PktTsbpdTime is still in the future. A retransmission arriving in
this window for a sequence in the gap landed in the
Lt(lastACKSequenceNumber) drop branch.

Fix

Allow retransmissions through the Lt(lastACKSequenceNumber) branch.
They fall into the existing out-of-order handling below, where they
either fill the gap (via InsertBefore) or are detected as duplicates
by the existing linear-scan check. TLPKTDROP semantics are still
enforced by the earlier Lte(lastDeliveredSequenceNumber) check, which
remains untouched and continues to drop anything already delivered or
skipped past.

Non-retransmit packets below lastACK (which would be unusual but
possible under heavy reordering) are still dropped exactly as before.

Test

TestRecvRetransmitPastACK constructs the bug state deterministically:

  1. Push packets 0–4 with future PktTsbpdTime so the delivery loop
    breaks at the head of the list.
  2. Push packets 6, 7, 8 (gap at 5) with past PktTsbpdTime so
    periodicACK advances past the gap.
  3. Tick(50) — asserts lastACK == 8 and lastDelivered is still
    uninitialized.
  4. Push a retransmit for seq 5 with RetransmittedPacketFlag = true.
  5. Asserts the retransmit was not counted as a drop, that
    PktRetrans incremented, and that seq 5 is now in packetList.
  6. Tick(300) — asserts the full [0..8] is delivered in order.

Verified that the test fails on main (with PktDrop == 1 exactly as
the production log showed) and passes with the fix applied.

Full go test ./... suite remains green.
Made-with: Cursor

@gemini-code-assist
Copy link
Copy Markdown

Important

Installation incomplete: to start using Gemini Code Assist, please ask the organization owner(s) to visit the Gemini Code Assist Admin Console and sign the Terms of Services.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 11, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.60%. Comparing base (823194e) to head (fcf4b1f).
⚠️ Report is 11 commits behind head on main.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #133      +/-   ##
==========================================
+ Coverage   56.27%   56.60%   +0.32%     
==========================================
  Files          22       23       +1     
  Lines        4174     4238      +64     
==========================================
+ Hits         2349     2399      +50     
- Misses       1509     1516       +7     
- Partials      316      323       +7     
Flag Coverage Δ
unit-linux 56.60% <100.00%> (+0.32%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ioppermann
Copy link
Copy Markdown
Member

Thanks for your contribution. I need to take a closer look at that case. What is a bit "off" in this scenario, is that timestamps are lower in later packets than in earlier packets. SRT assumes that both the sequence number and the timestamps are increasing. That's why this late packet is dropped because the timestamp of packet 4 is after that of the retrasnmitted packet.

@JohanG-LAS
Copy link
Copy Markdown
Author

Thanks for the careful look — you're right that the test's timestamp ordering doesn't match real SRT traffic. I should have flagged that.

The reason I inverted the timestamps was to reach the buggy state (lastDelivered < seq_retrans < lastACK) in a single thread. In production the same state is reached through a race inside Tick() with monotonic timestamps:

func (r *receiver) Tick(now uint64) {
    if ok, sequenceNumber, lite := r.periodicACK(now); ok {
        r.sendACK(sequenceNumber, lite)   // r.lock is free here
    }
    if list := r.periodicNAK(now); len(list) != 0 {
        r.sendNAK(list)                    // r.lock is free here
    }
    r.lock.Lock()                          // delivery loop only acquires it now
    ...
}

periodicACK releases r.lock before sendACK runs, and sendACK does network I/O via c.pop → c.onSend. While that's happening the network reader is free to call Push():

Tick N-1 ends with lastDelivered=759.
Tick N: periodicACK advances lastACK to 858 (all timestamps monotonic and ripe), releases the lock.
sendACK runs; reader goroutine Pushes the retransmit for seq=851.
Inside Push, lastDelivered is still 759 from the previous tick, lastACK is 858, and Lt(lastACKSequenceNumber) drops it as "already acknowledged".
Delivery loop runs and catches up to 858 — retransmit is gone.
That matches the production trace exactly (lastACK=691158858, lastDelivered=691158759, retrans seq=691158851). The stale value is lastDelivered, not the timestamps.

If it would help, I can replace the test with one that keeps PktTsbpdTime strictly monotonic and injects the retransmit Push() from inside the OnSendACK callback — i.e. the same lock-free window that exists in production. Happy to push that if you'd like.

@ioppermann
Copy link
Copy Markdown
Member

Yes, please update the test accordingly.

JohanG-LAS and others added 2 commits May 12, 2026 10:24
When periodicACK advances lastACKSequenceNumber past a gap (e.g. when
packets after the gap have ripe PktTsbpdTime), a retransmission for
a sequence inside that gap arrives at Push() with
lastDeliveredSequenceNumber < seq < lastACKSequenceNumber and is
dropped via the Lt(lastACKSequenceNumber) branch as "already
acknowledged". This manifests in user reports as PktRecvRetrans
approximately equal to PktRecvDrop.

In production the state is reached through a race inside Tick():
periodicACK releases r.lock through its defer before sendACK is
invoked, and sendACK performs network I/O (c.pop -> c.onSend).
While sendACK runs the receiver lock is free, so the network reader
goroutine is free to call Push(); a retransmit arriving there
observes lastACK already advanced past the gap but lastDelivered
still at its pre-tick value.

This test reproduces that interleaving deterministically with strictly
monotonic PktTsbpdTime on every packet by injecting the retransmit
Push() from inside the OnSendACK callback -- i.e. the same lock-free
window that exists in production. It fails on this commit with
PktDrop=1; the follow-up commit fixes Push() so the retransmission
is reinserted into packetList and delivered in order.

Made-with: Cursor
Co-authored-by: Cursor <cursoragent@cursor.com>
A retransmission for a sequence number that was ACKed past (because
periodicACK crossed a gap, either in-row or via the
PktTsbpdTime <= now branch) but not yet delivered was being dropped
by the Lt(lastACKSequenceNumber) branch as "already acknowledged".

Allow retransmissions through this branch; they fall into the
out-of-order handling below where they either fill the gap or are
detected as duplicates. TLPKTDROP semantics remain enforced by the
earlier Lte(lastDeliveredSequenceNumber) check, which still drops
anything already delivered or skipped past.

Makes the regression test from the previous commit pass.

Made-with: Cursor
Co-authored-by: Cursor <cursoragent@cursor.com>
@JohanG-LAS JohanG-LAS force-pushed the fix/retransmit-drop-past-ack branch from 53fd6ce to fcf4b1f Compare May 12, 2026 08:27
@JohanG-LAS
Copy link
Copy Markdown
Author

Restructured into two commits — test and receive. The test commit fails on its own (PktDrop=1) and the fix commit makes it pass. The test uses strictly monotonic PktTsbpdTime on every packet and reproduces the production interleaving by Push-ing the retransmit from inside the OnSendACK callback, i.e. the same lock-free window inside Tick() that the race exploits in production.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants