COO-1597: Raise memory limit of observability-operator deploy#994

Open
muellerfabi wants to merge 1 commit into rhobs:main from muellerfabi:coo-1597

muellerfabi commented Feb 10, 2026

On a larger OCP 4.18 cluster with 66 nodes, observability-operator 1.3.0 gets OOMKilled right after startup.

Issue is reported in https://access.redhat.com/support/cases/#/case/04368491
Issue is tracked in https://issues.redhat.com/browse/COO-1597

Update, since information about the actual issue was missing:
We are aware that resource requests and limits can be set in the Subscription. We use the following as a workaround:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: cluster-observability-operator
  namespace: openshift-operators-redhat
spec:
  channel: stable
  config:
    resources:
      limits:
        memory: 3Gi
      requests:
        memory: 500Mi

The problem is that this config is applied equally to all COO components:

  • obo-prometheus-operator
  • obo-prometheus-operator-admission-webhook
  • observability-operator
  • perses-operator

Because perses uses far more memory than the other components, we were forced to set the memory limit to 3Gi.
Now all the components have a needlessly high memory limit:

$ oc get deploy -o custom-columns=NAME:.metadata.name,RESOURCES:.spec.template.spec.containers[0].resources
NAME                                        RESOURCES
logging                                     map[]
loki-operator-controller-manager            map[]
obo-prometheus-operator                     map[limits:map[memory:3Gi] requests:map[memory:500Mi]]
obo-prometheus-operator-admission-webhook   map[limits:map[memory:3Gi] requests:map[memory:500Mi]]
observability-operator                      map[limits:map[memory:3Gi] requests:map[memory:500Mi]]
perses-operator                             map[limits:map[memory:3Gi] requests:map[memory:500Mi]]


$ oc adm top po
NAME                                                        CPU(cores)   MEMORY(bytes)   
logging-7c8b5bfdf4-v5p67                                    1m           23Mi            
loki-operator-controller-manager-7c5b4ffbfb-wvt9b           49m          5622Mi          
obo-prometheus-operator-545cdc864f-wxmzh                    61m          317Mi           
obo-prometheus-operator-admission-webhook-57d54bf6d-4p6j6   1m           11Mi            
obo-prometheus-operator-admission-webhook-57d54bf6d-x5pfh   1m           12Mi            
observability-operator-595c984dfb-24lsc                     3m           536Mi           
perses-operator-5fc9687477-8g9jc                            11m          1701Mi          
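
To make the gap between the shared 3Gi limit and the observed usage concrete, here is a minimal sketch in shell. It assumes input in the `oc adm top po` column layout shown above, with every memory value reported in Mi. The function name `headroom` is illustrative only and not part of any tool.

```shell
# Print, for each pod, the observed memory and the unused part of a shared limit.
# $1: the shared memory limit in Mi (3Gi = 3072Mi).
# stdin: lines in `oc adm top po` layout (NAME CPU MEMORY), header first.
# Assumption: all memory values carry the Mi suffix.
headroom() {
  awk -v limit="$1" '
    NR > 1 {
      used = $3
      sub(/Mi$/, "", used)   # strip the unit so the value can be used as a number
      printf "%s used=%dMi unused=%dMi\n", $1, used, limit - used
    }'
}

# COO pods from the output above, checked against the 3Gi workaround limit.
headroom 3072 <<'EOF'
NAME                                                        CPU(cores)   MEMORY(bytes)
obo-prometheus-operator-545cdc864f-wxmzh                    61m          317Mi
obo-prometheus-operator-admission-webhook-57d54bf6d-4p6j6   1m           11Mi
obo-prometheus-operator-admission-webhook-57d54bf6d-x5pfh   1m           12Mi
observability-operator-595c984dfb-24lsc                     3m           536Mi
perses-operator-5fc9687477-8g9jc                            11m          1701Mi
EOF
```

With the figures from this report, the admission webhook pods leave about 3060Mi of the limit unused, while perses-operator leaves 1371Mi.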

openshift-ci Bot commented Feb 10, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: muellerfabi
Once this PR has been reviewed and has the lgtm label, please assign simonpasquier for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci Bot commented Feb 10, 2026
Hi @muellerfabi. Thanks for your PR.

I'm waiting for a rhobs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

simonpasquier commented Feb 16, 2026
Contributor

/retitle COO-1597: Raise memory limit of observability-operator deploy

While it might help the reported case, I think that we should also document how users can customize the out-of-the-box limits and what cluster sizes the current limits fit, because I'm sure that we'll get other reports in the future (https://github.com/operator-framework/operator-lifecycle-manager/blob/master/doc/design/subscription-config.md#resources).

openshift-ci Bot changed the title from "COO-1597 Raise memory limit of observability-operator deploy" to "COO-1597: Raise memory limit of observability-operator deploy" on Feb 16, 2026

openshift-ci-robot commented Feb 16, 2026
Collaborator

@muellerfabi: This pull request references COO-1597 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

jan--f commented Apr 24, 2026
Collaborator

@jgbernalp wdyt? It's a small adjustment, so I'm not sure how many users this will help out of the box.

@jgbernalp

Member

> @jgbernalp wdyt? It's a small adjustment, so I'm not sure how many users this will help out of the box.

Is there a way for users to configure these limits? That might apply to this case. @muellerfabi, do you have a pprof from this cluster that we can analyze?

@simonpasquier

Contributor

@jgbernalp Yes, it's possible (https://github.com/operator-framework/operator-lifecycle-manager/blob/master/doc/design/subscription-config.md#resources), but the original report complained that it "should" just work out of the box.

@duritong

@simonpasquier The main issue with https://github.com/operator-framework/operator-lifecycle-manager/blob/master/doc/design/subscription-config.md#resources is that it overrides the resource constraints globally. For example, if a deployment contains the actual operator plus a kube-rbac-proxy sidecar (a very typical setup), the large requirements also apply to the sidecar. See redhat-cop/patch-operator#76 for an example from another operator.
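
The sidecar concern can be illustrated with a small shell sketch that flags containers using only a small fraction of a shared limit. Everything here is an assumption made for illustration: the function name `overprovisioned`, the 10% threshold, and the sample pod and its figures are hypothetical and not taken from any cluster. The input layout follows `kubectl top pod --containers` (POD, NAME, CPU, MEMORY), with memory in Mi.

```shell
# Print "pod/container" for every container that uses less than 10% of the limit.
# $1: the shared memory limit in Mi.
# stdin: lines in `kubectl top pod --containers` layout, header first.
# Assumption: all memory values carry the Mi suffix; the 10% cut-off is arbitrary.
overprovisioned() {
  awk -v limit="$1" '
    NR > 1 {
      used = $4
      sub(/Mi$/, "", used)                 # strip the unit
      if (used * 10 < limit) print $1 "/" $2
    }'
}

# Hypothetical sample: one operator pod with a kube-rbac-proxy sidecar,
# both subject to the same 3Gi (3072Mi) limit.
overprovisioned 3072 <<'EOF'
POD                NAME              CPU(cores)   MEMORY(bytes)
example-operator   manager           40m          1700Mi
example-operator   kube-rbac-proxy   1m           15Mi
EOF
```

For this made-up sample the function prints only `example-operator/kube-rbac-proxy`, the container that inherits a limit far above what it uses.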

openshift-ci-robot commented Jun 19, 2026
Collaborator

@muellerfabi: This pull request references COO-1597 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set.

@muellerfabi

Author

I updated the initial comment with further information explaining why overriding resources in the Subscription is not ideal.

@jgbernalp
> Is there a way for users to configure these limits? That might apply to this case. @muellerfabi, do you have a pprof from this cluster that we can analyze?

A must-gather from the affected cluster could be uploaded to the support case, if that helps.

Raising the memory limit to 550Mi is sufficient until the cluster grows to 90 nodes or so.
Maybe it would be better to drop the limit entirely. Cluster operators usually do not have limits.
WDYT?

@simonpasquier

Contributor

> Raising the memory limit to 550Mi is sufficient until the cluster grows to 90 nodes or so.

I'm worried that while it works for this cluster, we'll hear about other clusters still hitting the limit.

> Maybe it would be better to drop the limit entirely. Cluster operators usually do not have limits.
> WDYT?

My initial assumption was that resource limits were a strong requirement, but apparently they are not: https://sdk.operatorframework.io/docs/best-practices/managing-resources/#general-guidelines
We'd need to discuss the downsides of removing the limits, but it might be the best course of action.

6 participants