perf: clickhouse read connection pool overhaul by amokan · Pull Request #3591 · Logflare/logflare

amokan · 2026-06-12T03:02:14Z

Why?

Our ClickHouse read services are fronted by a "least-connections" load balancer. With the current connection configuration, nothing ever retires a healthy read connection. Because the load balancer only acts on new connections, a newly added read replica can take a couple of hours to be fully used.

While digging into this situation, two adjacent gaps were surfaced (and are covered in this PR):

Backend config changes would never reach active read pools.
Deleting a backend leaked its read ConnectionManager on every node until restart.

What changed?

Read pools now recycle their connections on a jittered ~10 minute schedule.
Added cluster-wide operational levers to QueryConnectionSup, all scoped to a single backend. One key example is the ability to trigger a cluster-wide recycle of read connections (gracefully).
Added new optional Adaptor behaviour callbacks to allow for better handling of backend updates/deletion scenarios. These are implemented for ClickHouse.
Added tests around read connection pool handling.
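
For illustration, a jittered recycle schedule can be sketched roughly as below. This is not the code from this PR: the module name `RecycleScheduleSketch`, the 20% jitter ratio, and the `:recycle` message are assumptions, since the description only states "jittered ~10 minute schedule".

```elixir
defmodule RecycleScheduleSketch do
  @moduledoc false

  # Assumed values: the PR description only says "~10 minutes, jittered".
  @base_interval_ms :timer.minutes(10)
  @jitter_ratio 0.2

  # Returns a delay in milliseconds drawn uniformly from
  # base_ms - spread .. base_ms + spread, where spread = base_ms * jitter_ratio.
  @spec next_delay(pos_integer(), float()) :: pos_integer()
  def next_delay(base_ms \\ @base_interval_ms, jitter_ratio \\ @jitter_ratio) do
    spread = trunc(base_ms * jitter_ratio)
    # :rand.uniform(n) returns an integer in 1..n, so offset is in -spread..spread.
    offset = :rand.uniform(2 * spread + 1) - spread - 1
    base_ms + offset
  end

  # Schedules a hypothetical :recycle message to `pid`.
  # Returns the timer reference together with the chosen delay.
  def schedule(pid, base_ms \\ @base_interval_ms) do
    delay = next_delay(base_ms)
    {Process.send_after(pid, :recycle, delay), delay}
  end
end
```

The point of the jitter is that pools started at the same moment (for example after a deploy) do not all reconnect in the same instant.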

Expected Impact

With every connection replaced roughly every 10 minutes and the LB sending recycled connections to the emptiest replica, a newly added replica should show more connection usage within 1-2 minutes.
Reconnect cost is trivial: each connection pays one TCP+TLS handshake plus a SELECT 1 per interval.
For scaling events where we don't want to wait at all: QueryConnectionSup.recycle_backend(backend_id) from any node's REPL will rebalance that backend's connections cluster-wide. A later PR could wire this up to the backend management UX.
No longer orphaning active read connection resources when a backend is deleted 🫠
Happiness to the entire world (maybe not)
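
To make the rebalancing argument concrete, here is a toy model of a least-connections balancer. It is a sketch only: `LeastConnSketch` and the replica names are made up, and a real load balancer will break ties and count connections differently.

```elixir
defmodule LeastConnSketch do
  @moduledoc false
  # `counts` is a map of replica name => number of open connections.

  # Closes one connection on `from`, then reopens it on whichever
  # replica currently has the fewest connections.
  def recycle_one(counts, from) do
    counts = Map.update!(counts, from, &(&1 - 1))
    {target, _n} = Enum.min_by(counts, fn {_name, n} -> n end)
    Map.update!(counts, target, &(&1 + 1))
  end

  # Recycles, one at a time, every connection that was open at the start.
  def recycle_all(counts) do
    counts
    |> Enum.flat_map(fn {name, n} -> List.duplicate(name, n) end)
    |> Enum.reduce(counts, fn from, acc -> recycle_one(acc, from) end)
  end
end

# Two loaded replicas plus a freshly added, empty one.
before = %{"replica_a" => 10, "replica_b" => 10, "replica_new" => 0}
LeastConnSketch.recycle_all(before)
```

In this model the total number of connections is unchanged, and one full recycle pass leaves the three replicas close to evenly loaded. Without recycling, `replica_new` would stay at zero until something else opened a connection.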


djwhitt · 2026-06-12T13:43:15Z

Nice work! Couldn't find any issues.

Ziinc · 2026-06-12T13:33:34Z

+  No-op for adaptors that do not implement the optional callback.
+  """
+  @spec backend_deleted(Backend.t()) :: :ok
+  def backend_deleted(%Backend{} = backend) do


I think on_backend_deleted would be more self-documenting and clear

Ziinc · 2026-06-12T13:43:06Z

+        :ok
+    end
+  catch
+    :exit, _reason -> :ok


Should log this, just in case

Ziinc · 2026-06-12T14:04:12Z

+      assert :ok == ConnectionManager.recycle_pool(backend)
+
+      %ConnectionManager{pool_pid: same_pool_pid, next_recycle_at: rescheduled_at} =
+        :sys.get_state(manager_pid)


Hmm not a fan of testing internal state

Looks like we're testing internal implementation here

Ziinc · 2026-06-12T14:06:26Z

+      %ConnectionManager{pool_pid: pool_pid, next_recycle_at: scheduled_at} =
+        :sys.get_state(manager_pid)
+
+      Process.sleep(@timeout_interval)
+
+      %ConnectionManager{pool_pid: same_pool_pid, next_recycle_at: rescheduled_at} =
+        :sys.get_state(manager_pid)


Exposing an API to get the current pool id feels like a better way to test this

Ziinc · 2026-06-12T14:07:34Z

+      assert :ok == ConnectionManager.refresh_pool(backend)
+      refute ConnectionManager.pool_active?(backend)


Same assertions as :231, can remove the former

Ziinc · 2026-06-12T14:08:12Z

+    test "returns an error when no manager is running for the backend", %{backend: backend} do
+      assert {:error, :no_manager} == QueryConnectionSup.recycle_backend_local(backend.id)
+    end
+
+    test "returns an error when the manager has no active pool", %{backend: backend} do


Can be combined

Ziinc · 2026-06-12T14:09:05Z

+  end
+
+  describe "list_query_connection_managers/0" do
+    test "returns backend ids and pids for running managers", %{backend: backend} do


Seems redundant given the other tests

Ziinc · 2026-06-12T14:14:31Z

+    test "returns ok when no manager is running", %{backend: backend} do
+      assert :ok == QueryConnectionSup.refresh_backend_local(backend.id)
+    end
+
+    test "stops the backend's active pool", %{backend: backend} do
+      {:ok, _manager_pid} =


Can be combined

Ziinc · 2026-06-12T14:19:24Z

    end
  end

+  describe "Adaptor lifecycle notifications" do


Seems unnecessary to have these tests

Can combine if we need it for code cov


Rename `backend_config_changed`/`backend_deleted` to `on_backend_config_changed`/`on_backend_deleted` for clarity, making it self-documenting that these are invoked in reaction to a lifecycle event. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
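
As a rough illustration of the "no-op for adaptors that do not implement the optional callback" behaviour, a dispatcher could look like the sketch below. The module names (`AdaptorSketch`, `WithHook`, `WithoutHook`) and the `notify_deleted/2` helper are hypothetical, and a plain map stands in for `Backend.t()`. Only the callback name `on_backend_deleted` comes from this PR.

```elixir
defmodule AdaptorSketch do
  @moduledoc false

  # Optional lifecycle callback; adaptors may leave it undefined.
  @callback on_backend_deleted(backend :: map()) :: :ok
  @optional_callbacks on_backend_deleted: 1

  # Invokes the callback if the adaptor exports it, otherwise does nothing.
  def notify_deleted(adaptor, backend) do
    if Code.ensure_loaded?(adaptor) and function_exported?(adaptor, :on_backend_deleted, 1) do
      adaptor.on_backend_deleted(backend)
    else
      :ok
    end
  end
end

defmodule WithHook do
  @behaviour AdaptorSketch

  @impl true
  def on_backend_deleted(backend) do
    # Stand-in for real cleanup, e.g. stopping a read connection manager.
    send(self(), {:deleted, backend.id})
    :ok
  end
end

defmodule WithoutHook do
  # Implements the behaviour but skips the optional callback.
  @behaviour AdaptorSketch
end

:ok = AdaptorSketch.notify_deleted(WithHook, %{id: 1})
:ok = AdaptorSketch.notify_deleted(WithoutHook, %{id: 2})
```

`Code.ensure_loaded?/1` is checked first because `function_exported?/3` returns `false` for a module that has not been loaded yet.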


Add `ConnectionManager.get_pool_pid/1` and `get_next_recycle_at/1` so the recycling tests can assert on observable pool state through a public API rather than reaching into the GenServer struct with `:sys.get_state`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
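
A minimal sketch of the idea, assuming a GenServer that holds the pool pid and the next recycle time in its state. `PoolManagerSketch` is a hypothetical stand-in: the real `ConnectionManager` functions take a backend, while this version takes the server pid directly to stay self-contained.

```elixir
defmodule PoolManagerSketch do
  @moduledoc false
  use GenServer

  defstruct pool_pid: nil, next_recycle_at: nil

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts)

  # Public accessors: tests call these instead of :sys.get_state/1,
  # so the shape of the state struct can change without breaking them.
  def get_pool_pid(server), do: GenServer.call(server, :get_pool_pid)
  def get_next_recycle_at(server), do: GenServer.call(server, :get_next_recycle_at)

  @impl true
  def init(opts) do
    state = %__MODULE__{
      pool_pid: Keyword.get(opts, :pool_pid),
      next_recycle_at: Keyword.get(opts, :next_recycle_at)
    }

    {:ok, state}
  end

  @impl true
  def handle_call(:get_pool_pid, _from, state), do: {:reply, state.pool_pid, state}
  def handle_call(:get_next_recycle_at, _from, state), do: {:reply, state.next_recycle_at, state}
end
```

A recycling test can then compare `get_pool_pid/1` before and after the recycle, and check that `get_next_recycle_at/1` moved forward.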


overhaul of clickhouse read pools

4ce69b1

github-actions Bot assigned amokan Jun 12, 2026

amokan requested review from Baishan, Ziinc, chasers and djwhitt June 12, 2026 03:03

Merge branch 'main' into adammokan/o11y-1904-adjust-clickhouse-connec…

5316f8a

…tion-pooling-for-reads

djwhitt approved these changes Jun 12, 2026


Ziinc approved these changes Jun 12, 2026


amokan and others added 9 commits June 12, 2026 09:54

Merge branch 'main' into adammokan/o11y-1904-adjust-clickhouse-connec…

932f988

…tion-pooling-for-reads

feat: log caught exit when terminating read connection manager

a549870

The terminate_manager catch clause previously swallowed exits silently. Log a warning so an unexpected exit during teardown is observable. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
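
The shape of that change might look like the following sketch. It assumes the manager is a child of a `DynamicSupervisor`; the module name `TerminateSketch`, the function arguments, and the log message are illustrative and not taken from the diff.

```elixir
defmodule TerminateSketch do
  @moduledoc false
  require Logger

  # Stops `child_pid` under `supervisor`. An already-missing child counts as
  # success, and an exit raised while talking to the supervisor is logged
  # instead of being swallowed silently.
  def terminate_manager(supervisor, child_pid) do
    case DynamicSupervisor.terminate_child(supervisor, child_pid) do
      :ok -> :ok
      {:error, :not_found} -> :ok
    end
  catch
    :exit, reason ->
      Logger.warning("exit while terminating read connection manager: #{inspect(reason)}")
      :ok
  end
end
```

The function still returns `:ok` on every path, so callers are unaffected. The only difference is that an unexpected exit during teardown now leaves a trace in the logs.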

test: drop redundant no-pool refresh_pool assertion

f20a6f1

The "stops the pool so the next query restarts it" test already covers `refresh_pool/1` returning `:ok`, so the standalone no-pool case is redundant. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

test: drop redundant single-manager list test

01bf968

The "lists managers for multiple backends" test already asserts a started manager appears in `list_query_connection_managers/0`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

test: combine recycle_backend_local no-pool and active-pool cases

c7868eb

Both cases share a started ConnectionManager, so exercise the no-pool error and the successful recycle within a single test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

test: combine refresh_backend_local no-manager and active-pool cases

486cf1f

Exercise the no-manager `:ok` path and the active-pool stop within a single test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

test: combine adaptor lifecycle no-op notification tests

76f6f9b

Both no-op assertions exercise the same fall-through path, so keep them in a single test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Ziinc approved these changes Jun 12, 2026


amokan merged commit f5ce4a6 into main Jun 12, 2026
13 checks passed

amokan deleted the adammokan/o11y-1904-adjust-clickhouse-connection-pooling-for-reads branch June 12, 2026 17:26

amokan merged 11 commits into main from adammokan/o11y-1904-adjust-clickhouse-connection-pooling-for-reads
