
Insufficient resources for cnrm-webhook-manager #252

Closed
yhrn opened this issue Aug 12, 2020 · 13 comments
Labels
bug Something isn't working

Comments

yhrn commented Aug 12, 2020

Describe the bug
At times when all or many cnrm-controller-manager pods are restarted, e.g. when upgrading KCC or replacing a node pool, the cnrm-webhook-manager does not have sufficient resources to cope with the load, resulting in reconciliation loop errors like this one:

error with update call to API server: Internal error occurred: failed calling webhook "deny-unknown-fields.cnrm.cloud.google.com": Post https://cnrm-validating-webhook.cnrm-system.svc:443/deny-unknown-fields?timeout=30s: no endpoints available for service "cnrm-validating-webhook"

When this is happening, the cnrm-webhook-manager is typically using >100% of its CPU limit and repeatedly gets OOMKilled. After a while the system tends to recover. However, we currently have Config Connector enabled for ~40 namespaces with relatively few resources per namespace on average, but we expect those numbers to grow by multiple orders of magnitude, at which point I'm guessing this problem will be much worse.
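
For reference, the symptom can be checked with something like the following (assuming the default cnrm-system namespace that the error message points at):

    # Does the webhook Service currently have any backing endpoints?
    kubectl get endpoints cnrm-validating-webhook -n cnrm-system

    # Look for restarts/OOMKills on the webhook pods
    kubectl get pods -n cnrm-system | grep cnrm-webhook-manager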

As a side note, I also observed cnrm-resource-stats-recorder using >100% of its CPU limit at the same time, but I saw that this was already discussed in #239.

ConfigConnector Version
1.17.0

yhrn added the bug label Aug 12, 2020
caieo (Contributor) commented Aug 13, 2020

Hi @yhrn, sorry you ran into this. Our short-term fix for you is to bump your CPU and memory limits. While I understand this is not an ideal solution and deployments should scale without issue, we are currently investigating performance improvements within Config Connector and will be gradually rolling out fixes. For this issue specifically, we are considering increasing the number of replicas defined in the ReplicaSet so that the workload of all your namespaces is distributed. I'll bring this up as an area for improvement!
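
For anyone wanting to try that workaround, a minimal sketch (assuming the webhook runs as the cnrm-webhook-manager Deployment in cnrm-system; the limit values below are purely illustrative, not recommended settings):

    # Illustrative only: raise the webhook manager's resource limits
    kubectl set resources deployment cnrm-webhook-manager -n cnrm-system \
      --limits=cpu=500m,memory=256Mi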

yhrn (Author) commented Aug 14, 2020

Hi @caieo, thanks for the update. Regarding the suggested short-term fix: will that work? Won't the operator just revert all objects it manages back to what it believes is the desired state?

spew (Contributor) commented Aug 20, 2020

Hi @yhrn, can you please try release 1.19.0? I am confident it will have a positive effect on the KCC components overall (though nothing in it is specific to the webhook-manager). If the issue persists after that, we can look deeper into the webhook manager.

yhrn (Author) commented Sep 3, 2020

Hey @spew,

We're now on 1.19.1 and we are still having problems. The cnrm-webhook-manager keeps getting OOMKilled, and we get intermittent failures like this when trying to apply resources:

Internal error occurred: failed calling webhook "deny-unknown-fields.cnrm.cloud.google.com": Post https://cnrm-validating-webhook.cnrm-system.svc:443/deny-unknown-fields?timeout=29s

spew (Contributor) commented Sep 3, 2020

Hi @yhrn, thanks so much for that log statement, that is helpful.

spew (Contributor) commented Sep 3, 2020

Hi @yhrn -- are you using Prometheus to get metrics?

yhrn (Author) commented Sep 4, 2020

We have "Cloud Operations for GKE" enabled so we get those metrics. Anything in specific you are looking for?

spew (Contributor) commented Sep 5, 2020

We've seen issues in the past with Prometheus putting load on the metrics-recorder process; I just wanted to understand whether there was some interaction there.

I've created a test cluster with 40 namespaces (and ~70 resources in each namespace) and I am not seeing any problems with the webhook-manager pod, so there is still another variable I am missing to reproduce this.

dflemstr commented Sep 9, 2020

Hey, we're happy to give you any additional data that might be helpful. One additional useful data point: it might be a memory leak of some kind (at least in 1.20). The process gets OOMKilled pretty regularly, and the resources at death are pretty consistently total-vm:789316kB, anon-rss:55400kB, file-rss:42520kB, shmem-rss:0kB. 770MiB is definitely greater than the pod limit, which is 64MiB.
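
For what it's worth, the OOMKill reason can be confirmed from the containers' last termination state with something like (this dumps all cnrm-system pods and filters):

    # Expect "Reason: OOMKilled" in the webhook containers' last state
    kubectl describe pods -n cnrm-system | grep -A 5 "Last State"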

spew (Contributor) commented Sep 17, 2020

We are making the following changes:

  1. Changing the webhook's memory to a static, single value. For namespaced mode this value will be larger than before.
  2. The webhook will generate its certificate once, on initial startup, and save it in a Secret in the cnrm-system namespace. When a pod restarts or a new pod is created, the same value will be read from the Secret.
  3. We will enable horizontal autoscaling on the webhook pod. The scaling threshold will be at 60% CPU usage (a rough equivalent is sketched below the list).
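
Roughly speaking, item 3 is equivalent to something like the following (the min/max bounds here are illustrative rather than the exact shipped values):

    # Rough sketch of the planned autoscaling, not the actual shipped manifest
    kubectl autoscale deployment cnrm-webhook-manager -n cnrm-system \
      --cpu-percent=60 --min=2 --max=10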

spew (Contributor) commented Sep 21, 2020

The changes are available in version 1.21.1. Leaving this issue open for the time being until we can confirm they are enough to solve the problem.

redbaron commented

We are on 1.23.0 (at least that's what the annotations say in our GKE cluster with the Config Connector add-on enabled) and continue to see something similar to this problem: webhook pods are using 100% of their CPU quota and have been scaled to 10 instances by the pod autoscaler.
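
The autoscaler state and replica count can be seen with, e.g.:

    # Current HPA utilization and webhook replica count
    kubectl get hpa -n cnrm-system
    kubectl get pods -n cnrm-system | grep cnrm-webhook-manager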

spew (Contributor) commented Nov 23, 2020

Hi @redbaron, can you email me at rleidle@google.com? I would be curious about the number of Config Connector resources you have, and any information about the kinds of resources would be good too.

You can see all your Config Connector resources with kubectl get gcp --all-namespaces.
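
If it helps with sizing, a rough per-namespace count can be pulled with something like:

    # Approximate Config Connector resource count per namespace
    kubectl get gcp --all-namespaces --no-headers | awk 'NF {print $1}' | sort | uniq -c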
