perf: clickhouse read connection pool overhaul#3591
Conversation
…tion-pooling-for-reads
|
Nice work! Couldn't find any issues. |
| No-op for adaptors that do not implement the optional callback. | ||
| """ | ||
| @spec backend_deleted(Backend.t()) :: :ok | ||
| def backend_deleted(%Backend{} = backend) do |
There was a problem hiding this comment.
I think on_backend_deleted would be more self documenting and clear
| :ok | ||
| end | ||
| catch | ||
| :exit, _reason -> :ok |
| assert :ok == ConnectionManager.recycle_pool(backend) | ||
|
|
||
| %ConnectionManager{pool_pid: same_pool_pid, next_recycle_at: rescheduled_at} = | ||
| :sys.get_state(manager_pid) |
There was a problem hiding this comment.
Hmm not a fan of testing internal state
There was a problem hiding this comment.
Looks like we're testing internal implementation here
| %ConnectionManager{pool_pid: pool_pid, next_recycle_at: scheduled_at} = | ||
| :sys.get_state(manager_pid) | ||
|
|
||
| Process.sleep(@timeout_interval) | ||
|
|
||
| %ConnectionManager{pool_pid: same_pool_pid, next_recycle_at: rescheduled_at} = | ||
| :sys.get_state(manager_pid) |
There was a problem hiding this comment.
Exposing an api to get the current pool id feels like a better way to test this
| assert :ok == ConnectionManager.refresh_pool(backend) | ||
| refute ConnectionManager.pool_active?(backend) |
There was a problem hiding this comment.
Same assertions as :231, can remove the former
| test "returns an error when no manager is running for the backend", %{backend: backend} do | ||
| assert {:error, :no_manager} == QueryConnectionSup.recycle_backend_local(backend.id) | ||
| end | ||
|
|
||
| test "returns an error when the manager has no active pool", %{backend: backend} do |
| end | ||
|
|
||
| describe "list_query_connection_managers/0" do | ||
| test "returns backend ids and pids for running managers", %{backend: backend} do |
There was a problem hiding this comment.
Seems redundant given the other tests
| test "returns ok when no manager is running", %{backend: backend} do | ||
| assert :ok == QueryConnectionSup.refresh_backend_local(backend.id) | ||
| end | ||
|
|
||
| test "stops the backend's active pool", %{backend: backend} do | ||
| {:ok, _manager_pid} = |
| end | ||
| end | ||
|
|
||
| describe "Adaptor lifecycle notifications" do |
There was a problem hiding this comment.
Seems unnecessary to have these tests
There was a problem hiding this comment.
Can combine if we need it for code cov
…tion-pooling-for-reads
Rename `backend_config_changed`/`backend_deleted` to `on_backend_config_changed`/`on_backend_deleted` for clarity, making it self-documenting that these are invoked in reaction to a lifecycle event. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The terminate_manager catch clause previously swallowed exits silently. Log a warning so an unexpected exit during teardown is observable. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add `ConnectionManager.get_pool_pid/1` and `get_next_recycle_at/1` so the recycling tests can assert on observable pool state through a public API rather than reaching into the GenServer struct with `:sys.get_state`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The "stops the pool so the next query restarts it" test already covers `refresh_pool/1` returning `:ok`, so the standalone no-pool case is redundant. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The "lists managers for multiple backends" test already asserts a started manager appears in `list_query_connection_managers/0`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Both cases share a started ConnectionManager, so exercise the no-pool error and the successful recycle within a single test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Exercise the no-manager `:ok` path and the active-pool stop within a single test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Both no-op assertions exercise the same fall-through path, so keep them in a single test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Why?
Our ClickHouse read services are fronted by a "least-connections" load balancer. The current connection configuration has generally created a situation where nothing ever retires a healthy read connection. Given that the load balancer only acts on new connections, a newly added read replica will sometimes take a couple hours to be fully leveraged.
While digging into this situation, two adjacent gaps were surfaced (and are covered in this PR):
ConnectionManageron every node until restart.What changed?
QueryConnectionSup, all scoped to a single backend. One key example is the ability to trigger a cluster-wide recycle of read connections (gracefully).Adaptorbehaviour callbacks to allow for better handling of backend updates/deletion scenarios. These are implemented for ClickHouse.Expected Impact
SELECT 1per interval.QueryConnectionSup.recycle_backend(backend_id)from any node's REPL will rebalance that backend's connections cluster-wide. A later PR could wire this up to the backend management UX.