
Add Valkey memory limits and analysis tooling#701

Merged
majamassarini merged 1 commit into packit:main from majamassarini:prevent-valkey-filling-up
Apr 8, 2026

Conversation

@majamassarini
Member

Problem: The Valkey PVC filled up repeatedly (1Gi -> 2Gi -> 4Gi) because orphaned Celery pidbox reply queues accumulated without a TTL. When the disk filled, the Packit stack got stuck with "No space left on device" errors.

Root cause analysis:

  • 1,693 *.reply.celery.pidbox keys with no expiry (TTL = -1)
  • These are worker control queues that should be temporary
  • Orphaned when workers crash/restart improperly
  • No maxmemory limits, so memory/disk could grow unbounded

Changes:

  1. Configure Valkey with memory limits (configmap-redis_like_config.yml):

    • maxmemory: 3670mb (~87.5% of 4Gi pod limit)
    • maxmemory-policy: volatile-lru (safest - only evicts keys with TTL)
    • Prevents unbounded memory/disk growth
  2. Add Valkey analysis script (scripts/analyze_valkey.sh):

    • Comprehensive data analysis tool
    • Identifies orphaned keys, disk usage, memory stats
    • Scans for Celery patterns and TTL distribution
    • Provides actionable recommendations
    • Safe to run on production (read-only operations)
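For reference, the two memory-limit directives in change 1 use standard redis.conf-style syntax, roughly as follows (the exact layout inside configmap-redis_like_config.yml may differ):

```
# Cap memory so the dataset cannot grow unbounded (~87.5% of the 4Gi pod limit)
maxmemory 3670mb
# Only evict keys that already carry a TTL; never touch persistent data
maxmemory-policy volatile-lru
```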

Assisted-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Fix packit/packit-service#2983
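To illustrate the kind of read-only survey the analysis tool performs, the TTL-distribution part could be sketched in Python against a redis-py-style client. The function and bucket names here are made up for this sketch; the actual tool is the shell script scripts/analyze_valkey.sh.

```python
# Illustrative sketch only, not the actual analyze_valkey.sh implementation.
from collections import Counter

def ttl_distribution(client, pattern="*"):
    """Bucket keys by TTL so orphans (TTL == -1, i.e. no expiry) stand out.

    `client` is any redis-py-style client exposing scan_iter() and ttl();
    both are read-only commands, so this is safe against production data.
    """
    buckets = Counter()
    for key in client.scan_iter(match=pattern):
        ttl = client.ttl(key)
        if ttl == -1:
            buckets["no-ttl"] += 1
        elif ttl < 3600:
            buckets["<1h"] += 1
        else:
            buckets[">=1h"] += 1
    return dict(buckets)
```

Run with pattern="*.reply.celery.pidbox", a survey like this is what surfaces the 1,693 no-TTL keys described above.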

@centosinfra-prod-github-app
Contributor

majamassarini added a commit to majamassarini/packit-service that referenced this pull request Apr 1, 2026
Problem:
Celery workers create pidbox (control) reply queues for worker management
commands (inspect, ping, stats, etc.). These queues accumulate when workers
crash or restart improperly, leading to:
- 1,693+ orphaned *.reply.celery.pidbox keys in production
- Keys with no TTL (TTL = -1) that persist indefinitely

Root cause:
Celery's Redis transport does not provide a native way to set TTL on pidbox
reply queues when they're created. These are internal implementation details
of Celery's broadcast/control mechanism, and there's no configuration option
to automatically expire them.

Solution: Heartbeat cleanup task
Since we cannot tell Celery to natively set TTL on pidbox messages, we
implement a periodic heartbeat task that:
- Runs nightly at 12:30 AM via Celery beat
- Scans for *.reply.celery.pidbox keys without TTL
- Sets 1-hour expiration on orphaned queues
- Tracks total Redis keys via Prometheus for monitoring

Related to: packit/deployment#701
Should fix: packit#2983

Assisted-By: Claude Sonnet 4.5 <noreply@anthropic.com>
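The heartbeat task described above could be sketched roughly as follows; the helper name and the bare-client shape are assumptions for illustration, not the actual packit-service code.

```python
# Sketch of the nightly pidbox cleanup (illustrative names, duck-typed client).
def expire_orphaned_pidbox_keys(client, ttl_seconds=3600):
    """Give a finite TTL to pidbox reply queues that would otherwise persist
    forever (Redis/Valkey reports TTL == -1 for keys with no expiry)."""
    fixed = 0
    for key in client.scan_iter(match="*.reply.celery.pidbox"):
        if client.ttl(key) == -1:  # no expiry set -> orphaned reply queue
            client.expire(key, ttl_seconds)
            fixed += 1
    return fixed
```

In the real task, a function like this would be registered as a Celery beat periodic task so it runs nightly.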
majamassarini added a commit to majamassarini/packit-service that referenced this pull request Apr 1, 2026
majamassarini added a commit to majamassarini/packit-service that referenced this pull request Apr 7, 2026
Additional fix (separate PR in packit-service):
- Celery beat task to set 24-hour TTL on orphaned pidbox keys
- Prometheus metric to track total Redis keys over time

Assisted-By: Claude Sonnet 4.5 <noreply@anthropic.com>
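The monitoring half of that fix could look roughly like this, sketched with duck-typed arguments: any redis-py-style client (dbsize()) and any prometheus_client-style gauge (set()) will do. Names are illustrative, not the actual implementation.

```python
# Illustrative sketch: export the total key count so growth is visible over time.
def report_total_keys(client, gauge):
    """Record the current number of keys as a gauge sample."""
    total = client.dbsize()  # DBSIZE: number of keys in the current database
    gauge.set(total)
    return total
```

Graphed over time, a steadily climbing gauge like this is exactly the signal that would have flagged the pidbox leak before the PVC filled.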
@majamassarini majamassarini force-pushed the prevent-valkey-filling-up branch from 6d652f8 to 00f8c42 Compare April 7, 2026 13:04
@centosinfra-prod-github-app
Contributor

@nforro
Member

nforro commented Apr 8, 2026

Assisted-By: Claude Sonnet 4.5 noreply@anthropic.com

I'm just curious, are you finding Sonnet 4.5 better than Opus 4.6? Or just trying out different options?

@majamassarini
Member Author

Assisted-By: Claude Sonnet 4.5 noreply@anthropic.com

I'm just curious, are you finding Sonnet 4.5 better than Opus 4.6? Or just trying out different options?

I don't remember changing it so far; my Claude configuration was using Sonnet from the beginning. In my mind, Opus is more expensive (I may be wrong), so I never chose that one.

majamassarini added a commit to majamassarini/packit-service that referenced this pull request Apr 8, 2026
@majamassarini majamassarini merged commit 969ac1f into packit:main Apr 8, 2026
4 checks passed
@github-project-automation github-project-automation bot moved this from New to Done in Packit pull requests Apr 8, 2026
centosinfra-prod-github-app bot added a commit to packit/packit-service that referenced this pull request Apr 8, 2026
Add periodic cleanup for orphaned Celery pidbox queues

Reviewed-by: gemini-code-assist[bot]
Reviewed-by: Maja Massarini
Reviewed-by: Matej Focko

Successfully merging this pull request may close these issues.

valkey-pvc requires periodic increases

4 participants