
Add Valkey memory limits and analysis tooling#701

Merged
majamassarini merged 1 commit into packit:main from majamassarini:prevent-valkey-filling-up
Apr 8, 2026

Conversation

@majamassarini
Member

Problem: The Valkey PVC filled up repeatedly (1Gi -> 2Gi -> 4Gi) because orphaned Celery pidbox reply queues accumulated without a TTL. When the disk filled, the Packit stack got stuck with "No space left on device" errors.

Root cause analysis:

  • 1,693 *.reply.celery.pidbox keys with no expiry (TTL = -1)
  • These are worker control queues that should be temporary
  • Orphaned when workers crash/restart improperly
  • No maxmemory limits, so memory/disk could grow unbounded

Changes:

  1. Configure Valkey with memory limits (configmap-redis_like_config.yml):

    • maxmemory: 3670mb (~87.5% of 4Gi pod limit)
    • maxmemory-policy: volatile-lru (safest - only evicts keys with TTL)
    • Prevents unbounded memory/disk growth
  2. Add Valkey analysis script (scripts/analyze_valkey.sh):

    • Comprehensive data analysis tool
    • Identifies orphaned keys, disk usage, memory stats
    • Scans for Celery patterns and TTL distribution
    • Provides actionable recommendations
    • Safe to run on production (read-only operations)
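For reference, the two memory-limit directives in change 1 use standard redis.conf-style syntax, roughly as follows (the exact layout inside configmap-redis_like_config.yml may differ):

```
# Cap memory so the dataset cannot grow unbounded (~87.5% of the 4Gi pod limit)
maxmemory 3670mb
# Only evict keys that already carry a TTL; never touch persistent data
maxmemory-policy volatile-lru
```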

Assisted-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Fix packit/packit-service#2983
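To illustrate the kind of read-only survey the analysis tool performs, the TTL-distribution part could be sketched in Python against a redis-py-style client. The function and bucket names here are made up for this sketch; the actual tool is the shell script scripts/analyze_valkey.sh.

```python
# Illustrative sketch only, not the actual analyze_valkey.sh implementation.
from collections import Counter

def ttl_distribution(client, pattern="*"):
    """Bucket keys by TTL so orphans (TTL == -1, i.e. no expiry) stand out.

    `client` is any redis-py-style client exposing scan_iter() and ttl();
    both are read-only commands, so this is safe against production data.
    """
    buckets = Counter()
    for key in client.scan_iter(match=pattern):
        ttl = client.ttl(key)
        if ttl == -1:
            buckets["no-ttl"] += 1
        elif ttl < 3600:
            buckets["<1h"] += 1
        else:
            buckets[">=1h"] += 1
    return dict(buckets)
```

Run with pattern="*.reply.celery.pidbox", a survey like this is what surfaces the 1,693 no-TTL keys described above.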

@centosinfra-prod-github-app
Contributor

majamassarini added a commit to majamassarini/packit-service that referenced this pull request Apr 1, 2026
Problem:
Celery workers create pidbox (control) reply queues for worker management
commands (inspect, ping, stats, etc.). These queues accumulate when workers
crash or restart improperly, leading to:
- 1,693+ orphaned *.reply.celery.pidbox keys in production
- Keys with no TTL (TTL = -1) that persist indefinitely

Root cause:
Celery's Redis transport does not provide a native way to set TTL on pidbox
reply queues when they're created. These are internal implementation details
of Celery's broadcast/control mechanism, and there's no configuration option
to automatically expire them.

Solution: Heartbeat cleanup task
Since we cannot tell Celery to natively set TTL on pidbox messages, we
implement a periodic heartbeat task that:
- Runs nightly at 12:30 AM via Celery beat
- Scans for *.reply.celery.pidbox keys without TTL
- Sets 1-hour expiration on orphaned queues
- Tracks total Redis keys via Prometheus for monitoring

Related to: packit/deployment#701
Should fix: packit#2983

Assisted-By: Claude Sonnet 4.5 <noreply@anthropic.com>
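The heartbeat task described above could be sketched roughly as follows; the helper name and the bare-client shape are assumptions for illustration, not the actual packit-service code.

```python
# Sketch of the nightly pidbox cleanup (illustrative names, duck-typed client).
def expire_orphaned_pidbox_keys(client, ttl_seconds=3600):
    """Give a finite TTL to pidbox reply queues that would otherwise persist
    forever (Redis/Valkey reports TTL == -1 for keys with no expiry)."""
    fixed = 0
    for key in client.scan_iter(match="*.reply.celery.pidbox"):
        if client.ttl(key) == -1:  # no expiry set -> orphaned reply queue
            client.expire(key, ttl_seconds)
            fixed += 1
    return fixed
```

In the real task, a function like this would be registered as a Celery beat periodic task so it runs nightly.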
majamassarini added a commit to majamassarini/packit-service that referenced this pull request Apr 1, 2026
majamassarini added a commit to majamassarini/packit-service that referenced this pull request Apr 7, 2026
Additional fix (separate PR in packit-service):
- Celery beat task to set 24-hour TTL on orphaned pidbox keys
- Prometheus metric to track total Redis keys over time

Assisted-By: Claude Sonnet 4.5 <noreply@anthropic.com>
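The monitoring half of that fix could look roughly like this, sketched with duck-typed arguments: any redis-py-style client (dbsize()) and any prometheus_client-style gauge (set()) will do. Names are illustrative, not the actual implementation.

```python
# Illustrative sketch: export the total key count so growth is visible over time.
def report_total_keys(client, gauge):
    """Record the current number of keys as a gauge sample."""
    total = client.dbsize()  # DBSIZE: number of keys in the current database
    gauge.set(total)
    return total
```

Graphed over time, a steadily climbing gauge like this is exactly the signal that would have flagged the pidbox leak before the PVC filled.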
@majamassarini majamassarini force-pushed the prevent-valkey-filling-up branch from 6d652f8 to 00f8c42 Compare April 7, 2026 13:04
@centosinfra-prod-github-app
Contributor

@nforro
Member

nforro commented Apr 8, 2026

Assisted-By: Claude Sonnet 4.5 noreply@anthropic.com

I'm just curious, are you finding Sonnet 4.5 better than Opus 4.6? Or just trying out different options?

@majamassarini
Member Author

Assisted-By: Claude Sonnet 4.5 noreply@anthropic.com

I'm just curious, are you finding Sonnet 4.5 better than Opus 4.6? Or just trying out different options?

I don't remember changing it so far; my Claude configuration was using Sonnet from the beginning. In my mind, Opus is more expensive (I may be wrong), so I never chose that one.

majamassarini added a commit to majamassarini/packit-service that referenced this pull request Apr 8, 2026
@majamassarini majamassarini merged commit 969ac1f into packit:main Apr 8, 2026
4 checks passed
@github-project-automation github-project-automation bot moved this from New to Done in Packit pull requests Apr 8, 2026
centosinfra-prod-github-app bot added a commit to packit/packit-service that referenced this pull request Apr 8, 2026
Add periodic cleanup for orphaned Celery pidbox queues

Reviewed-by: gemini-code-assist[bot]
Reviewed-by: Maja Massarini
Reviewed-by: Matej Focko

Successfully merging this pull request may close these issues.

valkey-pvc requires periodic increases

4 participants