Skip to content

perf: clickhouse read connection pool overhaul#3591

Merged
amokan merged 11 commits into
mainfrom
adammokan/o11y-1904-adjust-clickhouse-connection-pooling-for-reads
Jun 12, 2026
Merged

perf: clickhouse read connection pool overhaul#3591
amokan merged 11 commits into
mainfrom
adammokan/o11y-1904-adjust-clickhouse-connection-pooling-for-reads

Conversation

@amokan

@amokan amokan commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Why?

Our ClickHouse read services are fronted by a "least-connections" load balancer. The current connection configuration has generally created a situation where nothing ever retires a healthy read connection. Given that the load balancer only acts on new connections, a newly added read replica will sometimes take a couple hours to be fully leveraged.

While digging into this situation, two adjacent gaps were surfaced (and are covered in this PR):

  • Backend config changes would never reach active read pools
  • Deleting a backend leaked its read ConnectionManager on every node until restart.

What changed?

  • Read pools now recycle their connections on a jittered ~10 minute schedule.
  • Added cluster-wide operational levers to QueryConnectionSup, all scoped to a single backend. One key example is the ability to trigger a cluster-wide recycle of read connections (gracefully).
  • Added new optional Adaptor behaviour callbacks to allow for better handling of backend updates/deletion scenarios. These are implemented for ClickHouse.
  • Additional tests around read connection pool handling

Expected Impact

  • With every connection replaced nearly every ~10 minutes and the LB sending recycled connections to the emptiest replica, a newly added replica should show more connection usage within 1-2 minutes.
  • Reconnect cost is trivial: each connection pays one TCP+TLS handshake plus a SELECT 1 per interval.
  • For scaling events where we don't want to wait at all: QueryConnectionSup.recycle_backend(backend_id) from any node's REPL will rebalance that backend's connections cluster-wide. A later PR could wire this up to the backend management UX.
  • No longer orphaning active read connection resources when a backend is deleted 🫠
  • Happiness to the entire world (maybe not)

@djwhitt

djwhitt commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Nice work! Couldn't find any issues.

Comment thread lib/logflare/backends/adaptor.ex Outdated
No-op for adaptors that do not implement the optional callback.
"""
@spec backend_deleted(Backend.t()) :: :ok
def backend_deleted(%Backend{} = backend) do

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think on_backend_deleted would be more self documenting and clear

:ok
end
catch
:exit, _reason -> :ok

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should log this in case

assert :ok == ConnectionManager.recycle_pool(backend)

%ConnectionManager{pool_pid: same_pool_pid, next_recycle_at: rescheduled_at} =
:sys.get_state(manager_pid)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm not a fan of testing internal state

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we're testing internal implementation here

Comment on lines +210 to +216
%ConnectionManager{pool_pid: pool_pid, next_recycle_at: scheduled_at} =
:sys.get_state(manager_pid)

Process.sleep(@timeout_interval)

%ConnectionManager{pool_pid: same_pool_pid, next_recycle_at: rescheduled_at} =
:sys.get_state(manager_pid)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exposing an api to get the current pool id feels like a better way to test this

Comment on lines +240 to +241
assert :ok == ConnectionManager.refresh_pool(backend)
refute ConnectionManager.pool_active?(backend)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same assertions as :231, can remove the former

Comment on lines +53 to +57
test "returns an error when no manager is running for the backend", %{backend: backend} do
assert {:error, :no_manager} == QueryConnectionSup.recycle_backend_local(backend.id)
end

test "returns an error when the manager has no active pool", %{backend: backend} do

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be combined

end

describe "list_query_connection_managers/0" do
test "returns backend ids and pids for running managers", %{backend: backend} do

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems redundant given the other tests

Comment on lines +110 to +115
test "returns ok when no manager is running", %{backend: backend} do
assert :ok == QueryConnectionSup.refresh_backend_local(backend.id)
end

test "stops the backend's active pool", %{backend: backend} do
{:ok, _manager_pid} =

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be combined

end
end

describe "Adaptor lifecycle notifications" do

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems unnecessary to have these tests

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can combine if we need it for code cov

amokan and others added 9 commits June 12, 2026 09:54
Rename `backend_config_changed`/`backend_deleted` to
`on_backend_config_changed`/`on_backend_deleted` for clarity, making it
self-documenting that these are invoked in reaction to a lifecycle event.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The terminate_manager catch clause previously swallowed exits silently.
Log a warning so an unexpected exit during teardown is observable.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add `ConnectionManager.get_pool_pid/1` and `get_next_recycle_at/1` so the
recycling tests can assert on observable pool state through a public API
rather than reaching into the GenServer struct with `:sys.get_state`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The "stops the pool so the next query restarts it" test already covers
`refresh_pool/1` returning `:ok`, so the standalone no-pool case is redundant.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The "lists managers for multiple backends" test already asserts a started
manager appears in `list_query_connection_managers/0`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Both cases share a started ConnectionManager, so exercise the no-pool error
and the successful recycle within a single test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Exercise the no-manager `:ok` path and the active-pool stop within a single
test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Both no-op assertions exercise the same fall-through path, so keep them in a
single test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@amokan amokan merged commit f5ce4a6 into main Jun 12, 2026
13 checks passed
@amokan amokan deleted the adammokan/o11y-1904-adjust-clickhouse-connection-pooling-for-reads branch June 12, 2026 17:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants