Add Valkey memory limits and analysis tooling #701
Conversation
52b8e14 to 6d652f8
Build succeeded. ✔️ pre-commit SUCCESS in 1m 34s
Problem: Celery workers create pidbox (control) reply queues for worker management commands (inspect, ping, stats, etc.). These queues accumulate when workers crash or restart improperly, leading to:
- 1,693+ orphaned *.reply.celery.pidbox keys in production
- Keys with no TTL (TTL = -1) that persist indefinitely

Root cause: Celery's Redis transport does not provide a native way to set a TTL on pidbox reply queues when they are created. These are internal implementation details of Celery's broadcast/control mechanism, and there is no configuration option to automatically expire them.

Solution: Heartbeat cleanup task

Since we cannot tell Celery to natively set a TTL on pidbox messages, we implement a periodic heartbeat task that:
- Runs nightly at 12:30 AM via Celery beat
- Scans for *.reply.celery.pidbox keys without a TTL
- Sets a 1-hour expiration on orphaned queues
- Tracks the total number of Redis keys via Prometheus for monitoring

Related to: packit/deployment#701
Should fix: packit#2983

Assisted-By: Claude Sonnet 4.5 <noreply@anthropic.com>
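The cleanup logic described above can be sketched as follows. The function name and the in-memory `FakeValkey` stand-in are illustrative only; the real beat task in packit-service would operate on an actual redis/valkey client and be registered in the Celery beat schedule:

```python
import fnmatch


class FakeValkey:
    """Minimal in-memory stand-in for a redis/valkey client, so the
    sketch runs without a server. A real task would use redis.Redis."""

    def __init__(self, keys_with_ttl):
        # key -> ttl in seconds; -1 means "key exists but has no expiry"
        self._ttl = dict(keys_with_ttl)

    def scan_iter(self, match="*"):
        # Note: real Redis glob matching differs slightly from fnmatch,
        # but the patterns used here behave the same.
        return (k for k in list(self._ttl) if fnmatch.fnmatch(k, match))

    def ttl(self, key):
        return self._ttl[key]

    def expire(self, key, seconds):
        self._ttl[key] = seconds


def cleanup_orphaned_pidbox_queues(client, expire_seconds=3600):
    """Set a TTL on pidbox reply keys that would otherwise persist forever.

    Uses SCAN (not KEYS) so a large production keyspace is walked
    incrementally; returns how many orphaned keys were given an expiry.
    """
    fixed = 0
    for key in client.scan_iter(match="*.reply.celery.pidbox"):
        if client.ttl(key) == -1:  # no expiry set: an orphaned reply queue
            client.expire(key, expire_seconds)
            fixed += 1
    return fixed
```

Scanning instead of deleting outright is the conservative choice: a queue that is still in use simply gets its TTL refreshed on the next run, while a truly orphaned one expires on its own.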
Problem: Valkey PVC filled up (1Gi -> 2Gi -> 4Gi) due to orphaned Celery pidbox reply queues accumulating without a TTL. When the disk filled, the Packit stack became stuck with "No space left on device" errors.

Root cause analysis:
- 1,693 *.reply.celery.pidbox keys with no expiry (TTL = -1)
- These are worker control queues that should be temporary
- Orphaned when workers crash/restart improperly
- No maxmemory limits, so memory/disk could grow unbounded

Changes:
1. Configure Valkey with memory limits (configmap-redis_like_config.yml):
   - maxmemory: 3670mb (~87.5% of the 4Gi pod limit)
   - maxmemory-policy: volatile-lru (safest: only evicts keys with a TTL)
   - Prevents unbounded memory/disk growth
2. Add a Valkey analysis script (scripts/analyze_valkey.sh):
   - Comprehensive data analysis tool
   - Identifies orphaned keys, disk usage, memory stats
   - Scans for Celery patterns and TTL distribution
   - Provides actionable recommendations
   - Safe to run on production (read-only operations)

Additional fix (separate PR in packit-service):
- Celery beat task to set a 24-hour TTL on orphaned pidbox keys
- Prometheus metric to track the total number of Redis keys over time

Assisted-By: Claude Sonnet 4.5 <noreply@anthropic.com>
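The two memory directives named above correspond to a Valkey configuration fragment like the following sketch. The values are taken from the commit message; the exact layout inside configmap-redis_like_config.yml may differ:

```conf
# Cap memory well below the 4Gi pod limit (~87.5%) so eviction kicks in
# predictably instead of the PVC filling up ("No space left on device").
maxmemory 3670mb

# volatile-lru evicts only keys that carry a TTL, so durable keys
# without an expiry are never dropped by the eviction policy.
maxmemory-policy volatile-lru
```

Note that volatile-lru only helps once keys actually have a TTL, which is why the separate packit-service PR that sets expirations on orphaned pidbox keys complements this limit.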
6d652f8 to 00f8c42
Build succeeded. ✔️ pre-commit SUCCESS in 1m 32s
I'm just curious, are you finding Sonnet 4.5 better than Opus 4.6? Or just trying out different options?
I don't remember changing it so far; my Claude configuration was using Sonnet from the beginning. In my mind, Opus is more expensive (I may be wrong), so I never chose that one. |
Add periodic cleanup for orphaned Celery pidbox queues

Problem: Celery workers create pidbox (control) reply queues for worker management commands (inspect, ping, stats, etc.). These queues accumulate when workers crash or restart improperly, leading to:
- 1,693+ orphaned *.reply.celery.pidbox keys in production
- Keys with no TTL (TTL = -1) that persist indefinitely

Root cause: Celery's Redis transport does not provide a native way to set a TTL on pidbox reply queues when they are created. These are internal implementation details of Celery's broadcast/control mechanism, and there is no configuration option to automatically expire them.

Solution: Heartbeat cleanup task

Since we cannot tell Celery to natively set a TTL on pidbox messages, we implement a periodic heartbeat task that:
- Runs nightly at 12:30 AM via Celery beat
- Scans for *.reply.celery.pidbox keys without a TTL
- Sets a 1-hour expiration on orphaned queues
- Tracks the total number of Redis keys via Prometheus for monitoring

Related to: packit/deployment#701
Should fix: #2983

Reviewed-by: gemini-code-assist[bot]
Reviewed-by: Maja Massarini
Reviewed-by: Matej Focko
Fix packit/packit-service#2983