Insufficient resources for cnrm-webhook-manager #252
Comments
Hi @yhrn, sorry you ran into this. Our short-term fix for you is to bump your CPU and memory limits. While I understand this is not an ideal solution and deployments should scale without issue, we are currently investigating performance improvements within Config Connector and will be gradually rolling out fixes. For this issue specifically, we are considering increasing the number of replicas defined in the ReplicaSet so that the workload of all your namespaces is distributed. I'll bring this up as an area for improvement!
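For reference, a minimal sketch of that short-term fix, assuming the webhook manager runs as a Deployment named `cnrm-webhook-manager` in the `cnrm-system` namespace (both names are assumptions, not confirmed in this thread), and that the container already declares resource limits:

```sh
# Bump the CPU/memory limits on the first container of the (assumed)
# cnrm-webhook-manager Deployment. The values are illustrative only.
kubectl patch deployment cnrm-webhook-manager -n cnrm-system --type=json -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/cpu",    "value": "500m"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "256Mi"}
]'
```

As the next comment points out, an operator that owns these objects may revert such a manual patch back to its declared desired state.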
Hi @caieo, thanks for the update. Regarding the suggested short-term fix: will that work? Won't the operator just revert all objects it manages back to what it believes is the desired state?
Hi @yhrn, can you please try release 1.19.0? I am confident it will have a positive effect on the KCC components overall (but nothing specific to webhook-manager). Once you have tried that, if the issue persists we can look deeper into the webhook manager.
Hey @spew, we're now on 1.19.1 and we are still having problems. The
Hi @yhrn, thanks so much for that log statement, that is helpful.
Hi @yhrn -- are you using Prometheus to get metrics?
We have "Cloud Operations for GKE" enabled so we get those metrics. Anything specific you are looking for?
We've seen issues in the past with Prometheus putting load on the metrics-recorder process; I just wanted to understand if there was some interaction there. I've created a test cluster with 40 namespaces (and ~70 resources in each namespace) and I am not seeing any problems with the webhook-manager pod, so there is still another variable I am missing to reproduce this.
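A sketch of how such a test setup might be scripted, using the standard Config Connector per-namespace annotation `cnrm.cloud.google.com/project-id`; the project ID is a placeholder, and applying the ~70 resources per namespace is elided since the resource kinds used in the test are not stated:

```sh
# Create 40 namespaces annotated for Config Connector.
PROJECT_ID=my-project   # assumption: replace with a real project ID
for i in $(seq 1 40); do
  kubectl create namespace "kcc-test-${i}"
  kubectl annotate namespace "kcc-test-${i}" \
    "cnrm.cloud.google.com/project-id=${PROJECT_ID}"
done
```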
Hey, we're happy to give you any additional data that might be helpful. One additional useful data point could be that it might be a memory leak of some kind (at least in 1.20): the process gets OOMKilled pretty regularly and the resources at death are pretty consistently total-vm:789316kB, anon-rss:55400kB, file-rss:42520kB, shmem-rss:0kB. 770MiB is definitely greater than the pod limit, which is 64MiB.
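One way to corroborate those kernel-log numbers from the Kubernetes side (a sketch, assuming the pods live in the `cnrm-system` namespace):

```sh
# Print each pod's last termination reason; "OOMKilled" for the
# webhook-manager pods confirms the kernel OOM kills described above.
kubectl get pods -n cnrm-system \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}' \
  | grep cnrm-webhook-manager
```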
We are making the following changes:
The changes are available in version
We are on 1.23.0 (at least that's what the annotations say in our GKE cluster with the Config Connector add-on enabled) and continue to see something similar to this problem: webhook pods are using 100% of their CPU quota and are scaled to 10 instances by the pod autoscaler.
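A quick way to observe that state (a sketch; assumes metrics-server is available for `kubectl top` and that the autoscaler is a HorizontalPodAutoscaler in `cnrm-system`, which is an assumption):

```sh
# Current CPU/memory usage of the webhook pods.
kubectl top pods -n cnrm-system | grep cnrm-webhook-manager

# Replica count and CPU target of the (assumed) HorizontalPodAutoscaler.
kubectl get hpa -n cnrm-system
```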
Hi @redbaron, can you email me at rleidle@google.com? I would be curious as to the number of Config Connector resources you have, and any information about the 'kinds' of resources would be good too. You can see all your config-connector resources with
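The command is cut off above; a likely form, assuming the conventional `gcp` category alias that Config Connector attaches to its CRDs (an assumption, since the original command is elided):

```sh
# List every Config Connector resource across all namespaces, assuming
# the CRDs carry the 'gcp' category.
kubectl get gcp --all-namespaces
```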
Describe the bug
At times when all/lots of `cnrm-controller-manager` pods are restarted, e.g. when upgrading KCC or replacing a node pool, the `cnrm-webhook-manager` does not have sufficient resources to cope with the load, resulting in reconciliation loop errors like this one:
When this is happening, the `cnrm-webhook-manager` is typically using >100% of its CPU limits and repeatedly gets OOMKilled. After a while the system tends to recover. However, we currently have Config Connector enabled for ~40 namespaces and not that many resources per namespace on average, but we expect those numbers to increase by multiple orders of magnitude, and then I'm guessing this problem is going to be much worse.
As a side note, I also observed `cnrm-resource-stats-recorder` using >100% of its CPU limits at the same time, but I also saw that it was discussed in #239.
ConfigConnector Version
1.17.0
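For anyone needing to confirm which Config Connector version is installed, one way to read it is sketched below; it assumes the version annotation `cnrm.cloud.google.com/version` is present on the `cnrm-system` namespace, which may vary by installation method:

```sh
# Read the Config Connector version annotation from the cnrm-system
# namespace (annotation name/location is an assumption).
kubectl get namespace cnrm-system \
  -o jsonpath='{.metadata.annotations.cnrm\.cloud\.google\.com/version}'
```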