
HTTP2 sink deadlocks #25217

@chrisleavoy

Description


A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Using Vector with a Kafka source and a GCP Stackdriver Logs sink at large scale, multiple TB/day. During peak load the sink gets stuck at a rate of about 1 pod every 10 minutes (cluster size ~50 pods), and only 1 or 2 pods are affected at once. All output stops, the source buffer fills, back pressure kicks in, and health checks keep reporting that everything is OK. Occasionally it recovers without any intervention, many hours later.

The sink API is fine. In-flight requests seem to complete OK, but otherwise all sink threads block at once and remain stuck for hours; there's no trickle-through or anything like that. No errors or timeouts are recorded. Our best theory is a rare protocol-level bug that breaks connection/SSL handling and leaves the sink blocked until the remote end closes the socket, but it's really hard to prove!

I've been able to reproduce this with ARC (adaptive request concurrency) disabled, at request.concurrency = 10, 50, and 100. I've tried lowering request.timeout_secs to 30, smaller batches, and larger buffers, with no material change. I've reproduced the bug on 0.50, 0.53, 0.54, and the April 14th nightly.

I built a custom build of Vector with a hook on SIGTERM that dumps various debugging output. Below is an analysis of the dump from Claude. It feels like enough to open an issue, but I'll likely need to gather more data. It seems directionally correct because there are no errors or indications of slow or failing HTTP requests: the pipeline just up and deadlocks, and none of the internal metrics or logs indicate why. Claude is almost certainly hallucinating on tid=15, though. I had Codex analyze this too; it agreed there is a bug somewhere, but said the blocked thread is likely the Kafka reader and not the root cause.
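
For reference, here is a rough sketch of the kind of SIGTERM hook I used (simplified and illustrative; the signal-hook wiring and the function names are not the actual patch). It walks /proc/self/task and prints each thread's comm, wchan, and syscall fields, which is where the per-thread state in the analysis below comes from:

```rust
// Illustrative sketch only, not the actual patch in the custom build.
// On SIGTERM, walk /proc/self/task and print each thread's name, wait
// channel, and current syscall so stuck threads can be spotted without
// attaching a debugger.
use std::fs;

fn dump_thread_states() {
    let Ok(tasks) = fs::read_dir("/proc/self/task") else {
        eprintln!("debug dump: cannot read /proc/self/task");
        return;
    };
    for entry in tasks.flatten() {
        let tid = entry.file_name().to_string_lossy().into_owned();
        let read = |name: &str| {
            fs::read_to_string(entry.path().join(name))
                .unwrap_or_default()
                .trim()
                .to_string()
        };
        // comm: thread name (e.g. vector-worker, rdk:broker)
        // wchan: kernel symbol the thread is sleeping in ("0" if runnable)
        // syscall: syscall number and arguments, or "running"
        eprintln!(
            "tid={tid} comm={} wchan={} syscall={}",
            read("comm"),
            read("wchan"),
            read("syscall")
        );
    }
}

fn main() {
    // The signal-hook crate is used here for brevity (an assumption, not how
    // Vector wires signals internally).
    let term = std::sync::Arc::new(std::sync::atomic::AtomicBool::new(false));
    signal_hook::flag::register(signal_hook::consts::SIGTERM, term.clone())
        .expect("register SIGTERM handler");
    while !term.load(std::sync::atomic::Ordering::Relaxed) {
        std::thread::sleep(std::time::Duration::from_millis(200));
    }
    dump_thread_states();
}
```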

Open to suggestions that would make this easier to isolate. At such a low reproduction rate and such a large volume, we're not financially able to turn on DEBUG logs or packet-capture everything. We'd need something a bit more targeted.

  SIGTERM Debug Dump Analysis: vector-c795c856d-xcb4w

  Timeline

  ┌─────────────┬───────────────────────────────────────────────────────────┐
  │ Time (UTC)  │                           Event                           │
  ├─────────────┼───────────────────────────────────────────────────────────┤
  │ 20:26:14    │ Pod started                                               │
  ├─────────────┼───────────────────────────────────────────────────────────┤
  │ 20:30:14    │ First "Source send cancelled" — deadlock in 3m59s         │
  ├─────────────┼───────────────────────────────────────────────────────────┤
  │ 20:30–20:44 │ Continuous drops, accelerating (12→25 suppressions/cycle) │
  ├─────────────┼───────────────────────────────────────────────────────────┤
  │ 20:45:04    │ SIGTERM → debug dump fires                                │
  ├─────────────┼───────────────────────────────────────────────────────────┤
  │ 20:46:04    │ Force-killed (60s grace exhausted)                        │
  └─────────────┴───────────────────────────────────────────────────────────┘

  3 minutes 59 seconds to deadlock — far faster than the previous pod (54 min). The pipeline locked up essentially at startup.

  ---
  Root Cause: Blocking read() on a Tokio Worker Thread

  The dump's smoking gun is tid=15 (vector-worker):

  State:   R (running)
  wchan:   0
  syscall: 0  0x77  0x7b00ad5f9490  0x20  ...
           ^   ^                     ^
           read fd=119(0x77)        count=32 bytes

  A Tokio async worker thread is stuck in a synchronous blocking read() syscall on socket fd=119. This is pathological — all async I/O in Tokio should go through epoll
  (non-blocking read after epoll signals readability). Every other worker thread correctly shows epoll_pwait:

  tid=8-18: wchan=futex_wait_queue, syscall=202 (epoll_pwait), timeout=0ms
  tid=15:   wchan=0,                syscall=0   (read),         BLOCKING
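
(Aside from me, not part of the model's output: for comparison, this is the readiness-based pattern the analysis expects. Tokio sockets are non-blocking and registered with epoll; a read is only attempted after a readiness event and returns WouldBlock rather than blocking. A minimal sketch using tokio's public readable()/try_read() API:)

```rust
// Aside/sketch only: the readiness-based, non-blocking read pattern that all
// Tokio-managed sockets should follow (the epoll registration itself is
// handled internally by the runtime's I/O driver).
use tokio::net::TcpStream;

async fn read_some(stream: &TcpStream) -> std::io::Result<Vec<u8>> {
    let mut buf = vec![0u8; 4096];
    loop {
        // Suspend the task until epoll reports the socket readable.
        stream.readable().await?;
        // Non-blocking read; retry on a spurious readiness event.
        match stream.try_read(&mut buf) {
            Ok(n) => {
                buf.truncate(n);
                return Ok(buf);
            }
            Err(e) if e.kind() == std::io::ErrorKind::WouldBlock => continue,
            Err(e) => return Err(e),
        }
    }
}

#[tokio::main]
async fn main() -> std::io::Result<()> {
    // Hypothetical endpoint, purely for illustration.
    let stream = TcpStream::connect("127.0.0.1:8080").await?;
    let data = read_some(&stream).await?;
    println!("read {} bytes", data.len());
    Ok(())
}
```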

  The count=32 bytes is a specific tell: this is consistent with reading an HTTP/2 frame header (9 bytes + padding) or a TLS record fragment during early connection setup — the
  GCP Stackdriver API connection was stuck waiting for a server response that never arrived.
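
(Another aside from me: the HTTP/2 frame header is a fixed 9 bytes, so a pending ~32-byte read really is in the right ballpark for an initial SETTINGS exchange. A throwaway sketch of the layout, not hyper's code:)

```rust
/// Throwaway sketch (not hyper's implementation) of the fixed 9-byte HTTP/2
/// frame header from RFC 9113 §4.1: 24-bit payload length, 8-bit type,
/// 8-bit flags, 31-bit stream identifier.
fn parse_frame_header(buf: &[u8; 9]) -> (u32, u8, u8, u32) {
    let length = u32::from_be_bytes([0, buf[0], buf[1], buf[2]]);
    let frame_type = buf[3]; // 0x04 = SETTINGS
    let flags = buf[4];
    let stream_id = u32::from_be_bytes([buf[5], buf[6], buf[7], buf[8]]) & 0x7fff_ffff;
    (length, frame_type, flags, stream_id)
}

fn main() {
    // A server SETTINGS frame carrying four 6-byte settings:
    // 9-byte header + 24-byte payload = 33 bytes on the wire, so a blocked
    // ~32-byte read during connection setup is plausible.
    let header: [u8; 9] = [0x00, 0x00, 0x18, 0x04, 0x00, 0x00, 0x00, 0x00, 0x00];
    let (len, ty, flags, stream) = parse_frame_header(&header);
    assert_eq!((len, ty, flags, stream), (24, 0x04, 0, 0));
    println!("SETTINGS frame, {len}-byte payload");
}
```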

  Why the thread never unblocks: Tokio's async timeout machinery (request.timeout_secs = 60) fires on another worker, drops the future, and attempts to close the connection — but
  the underlying OS socket was not in O_NONBLOCK mode (or was transiently acquired in blocking mode during TLS/HTTP2 handshake). The worker thread stays blocked at the kernel
  read() call until the remote end sends data or resets the TCP connection. With GCP's load balancers silently dropping the connection on the server side, no RST arrives, and the
  thread stays blocked indefinitely.
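
(If that theory is right, it should be checkable with something targeted, for example probing the stuck fd's flags from inside the process to see whether O_NONBLOCK is actually set. A generic sketch using the libc crate, not anything Vector exposes today:)

```rust
// Sketch only: report whether a file descriptor is in non-blocking mode.
// Every Tokio-managed socket should have O_NONBLOCK set; a sink socket that
// reports it clear would support the blocking-read theory for fd=119.
use std::os::fd::{AsRawFd, RawFd};

fn is_nonblocking(fd: RawFd) -> std::io::Result<bool> {
    // SAFETY: F_GETFL only reads the descriptor's status flags.
    let flags = unsafe { libc::fcntl(fd, libc::F_GETFL) };
    if flags < 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(flags & libc::O_NONBLOCK != 0)
}

fn main() -> std::io::Result<()> {
    // Demonstration on an arbitrary socket; in a debug hook this would run
    // over the fds listed in /proc/self/fd instead.
    let sock = std::net::TcpListener::bind("127.0.0.1:0")?;
    sock.set_nonblocking(true)?;
    println!("nonblocking = {}", is_nonblocking(sock.as_raw_fd())?);
    Ok(())
}
```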

  ---
  Memory: Evidence of How Long the Sink Was Stuck

  VmRSS:   5,171,216 kB  ≈ 4.93 GB
  RssAnon: 5,129,712 kB  ≈ 4.89 GB  (heap — rdkafka internal message buffers)
  RssFile:    41,504 kB  ≈   40 MB  (code + GeoIP mmdb)

  The 4.9 GB of anonymous heap is rdkafka's internal prefetch queue filling up while the downstream pipeline was blocked. The 48 rdk:broker threads continued fetching from Kafka
  independent of the application layer (37 of 48 actively polling at dump time). Since the Vector kafka source couldn't push events downstream, rdkafka's in-process queues
  absorbed everything, growing to ~5 GB.
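
(Last aside from me: the prefetch growth itself is ordinary librdkafka behaviour and can be capped. A sketch with the rdkafka crate; the property names are real librdkafka settings, the values are illustrative, and in Vector these would presumably be passed via the kafka source's librdkafka_options:)

```rust
// Sketch only (rdkafka crate, not Vector source code): cap how much data
// librdkafka's consumer queue is allowed to prefetch while downstream is
// blocked. Values are illustrative.
use rdkafka::config::ClientConfig;
use rdkafka::consumer::{BaseConsumer, Consumer};
use rdkafka::error::KafkaError;

fn main() -> Result<(), KafkaError> {
    let consumer: BaseConsumer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092")
        .set("group.id", "vector-demo")
        // Bound the local pre-fetch queue so a stalled pipeline cannot grow
        // the heap unbounded (real librdkafka properties; defaults are larger).
        .set("queued.max.messages.kbytes", "16384")
        .set("queued.min.messages", "10000")
        .create()?;
    consumer.subscribe(&["logs"])?;
    Ok(())
}
```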

  ---
  Thread Summary

  ┌─────────────────────────────┬───────┬─────────────────────────┬─────────────────────────────────┐
  │        Thread group         │ Count │          State          │              What               │
  ├─────────────────────────────┼───────┼─────────────────────────┼─────────────────────────────────┤
  │ vector-worker (tid 7-18,24) │ 13    │ epoll_pwait timeout=0ms │ Tokio workers — idle/healthy    │
  ├─────────────────────────────┼───────┼─────────────────────────┼─────────────────────────────────┤
  │ vector-worker tid=15        │ 1     │ blocking read()         │ GCP sink socket — stuck         │
  ├─────────────────────────────┼───────┼─────────────────────────┼─────────────────────────────────┤
  │ rdk:broker (do_sys_poll)    │ 37    │ actively polling        │ Kafka consumers — healthy       │
  ├─────────────────────────────┼───────┼─────────────────────────┼─────────────────────────────────┤
  │ rdk:broker (futex_wait)     │ 11    │ idle                    │ Unused broker slots             │
  ├─────────────────────────────┼───────┼─────────────────────────┼─────────────────────────────────┤
  │ rdk:main                    │ 1     │ futex_wait              │ librdkafka event loop — healthy │
  ├─────────────────────────────┼───────┼─────────────────────────┼─────────────────────────────────┤
  │ vector (main, pid=1)        │ 1     │ futex_wait              │ Main process — healthy          │
  └─────────────────────────────┴───────┴─────────────────────────┴─────────────────────────────────┘

  ---
  Why So Fast (4 min vs 54 min)?

  The 32-byte read and immediate-on-startup deadlock strongly suggest the blocking read() occurred during the initial HTTP/2 connection setup to the GCP API — either:
  1. TLS handshake stall — the GCP load balancer accepted the TCP connection but stalled sending the TLS ServerHello (possibly a GCP-side issue at that time)
  2. HTTP/2 SETTINGS frame — after TLS, hyper reads the server's initial SETTINGS frame (typically 9 bytes + settings = ~27–45 bytes). If the socket was not O_NONBLOCK at this
  point, this read blocks until data arrives

  Because the very first batch to GCP couldn't complete, the sink buffer filled immediately, backing up all 50+ transforms to the kafka source within ~4 minutes.

Configuration

gcp_stackdriver_logs sink
kafka source

Version

0.50 through 0.55

Debug Output


Example Data

No response

Additional Context

No response

References

No response
