
Insufficient resources for cnrm-webhook-manager #252

Closed
yhrn opened this issue Aug 12, 2020 · 13 comments
Labels
bug Something isn't working

Comments

yhrn commented Aug 12, 2020

Describe the bug
At times when all or many cnrm-controller-manager pods are restarted, e.g. when upgrading KCC or replacing a node pool, the cnrm-webhook-manager does not have sufficient resources to cope with the load, resulting in reconciliation loop errors like this one:

error with update call to API server: Internal error occurred: failed calling webhook "deny-unknown-fields.cnrm.cloud.google.com": Post https://cnrm-validating-webhook.cnrm-system.svc:443/deny-unknown-fields?timeout=30s: no endpoints available for service "cnrm-validating-webhook"

When this is happening, the cnrm-webhook-manager is typically using >100% of its CPU limit and repeatedly gets OOMKilled. After a while the system tends to recover. However, we currently have Config Connector enabled for ~40 namespaces with relatively few resources per namespace on average, but we expect those numbers to grow by multiple orders of magnitude, at which point I'm guessing this problem will be much worse.
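
For reference, the symptom can be checked with something like the following (assuming the default cnrm-system namespace that the error message points at):

    # Does the webhook Service currently have any backing endpoints?
    kubectl get endpoints cnrm-validating-webhook -n cnrm-system

    # Look for restarts/OOMKills on the webhook pods
    kubectl get pods -n cnrm-system | grep cnrm-webhook-manager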

As a side note, I also observed cnrm-resource-stats-recorder using >100% of its CPU limit at the same time, but I saw that this was already discussed in #239.

ConfigConnector Version
1.17.0

yhrn added the bug label Aug 12, 2020
caieo (Contributor) commented Aug 13, 2020

Hi @yhrn, sorry you ran into this. Our short-term fix for you is to bump your CPU and memory limits. While I understand this is not an ideal solution and deployments should scale without issue, we are currently investigating performance improvements within Config Connector and will be gradually rolling out fixes. For this issue specifically, we are considering increasing the number of replicas defined in the ReplicaSet so that the workload of all your namespaces is distributed. I'll bring this up as an area for improvement!
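
For anyone wanting to try that workaround, a minimal sketch (assuming the webhook runs as the cnrm-webhook-manager Deployment in cnrm-system; the limit values below are purely illustrative, not recommended settings):

    # Illustrative only: raise the webhook manager's resource limits
    kubectl set resources deployment cnrm-webhook-manager -n cnrm-system \
      --limits=cpu=500m,memory=256Mi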

yhrn (Author) commented Aug 14, 2020

Hi @caieo, thanks for the update. Regarding the suggested short-term fix: will that work? Won't the operator just revert all objects it manages back to what it believes is the desired state?

spew (Contributor) commented Aug 20, 2020

Hi @yhrn, can you please try release 1.19.0? I am confident it will have a positive effect on the KCC components overall (though nothing in it is specific to the webhook-manager). If the issue persists after that, we can look deeper into the webhook manager.

yhrn (Author) commented Sep 3, 2020

Hey @spew,

We're now on 1.19.1 and we are still having problems. The cnrm-webhook-manager keeps getting OOMKilled, and we get intermittent failures like this when trying to apply resources:

Internal error occurred: failed calling webhook "deny-unknown-fields.cnrm.cloud.google.com": Post https://cnrm-validating-webhook.cnrm-system.svc:443/deny-unknown-fields?timeout=29s

spew (Contributor) commented Sep 3, 2020

Hi @yhrn, thanks so much for that log statement, that is helpful.

spew (Contributor) commented Sep 3, 2020

Hi @yhrn -- are you using Prometheus to get metrics?

yhrn (Author) commented Sep 4, 2020

We have "Cloud Operations for GKE" enabled so we get those metrics. Anything in specific you are looking for?

spew (Contributor) commented Sep 5, 2020

We've seen issues in the past with Prometheus putting load on the metrics-recorder process; I just wanted to understand whether there was some interaction there.

I've created a test cluster with 40 namespaces (and ~70 resources in each namespace) and I am not seeing any problems with the webhook-manager pod, so there is still another variable I am missing to reproduce this.

dflemstr commented Sep 9, 2020

Hey, we're happy to give you any additional data that might be helpful. One additional useful data point: it might be a memory leak of some kind (at least in 1.20). The process gets OOMKilled pretty regularly, and the resources at death are pretty consistently total-vm:789316kB, anon-rss:55400kB, file-rss:42520kB, shmem-rss:0kB. 770MiB is definitely greater than the pod limit, which is 64MiB.
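
For what it's worth, the OOMKill reason can be confirmed from the containers' last termination state with something like (this dumps all cnrm-system pods and filters):

    # Expect "Reason: OOMKilled" in the webhook containers' last state
    kubectl describe pods -n cnrm-system | grep -A 5 "Last State"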

spew (Contributor) commented Sep 17, 2020

We are making the following changes:

  1. Changing the webhook's memory to a static, single value. For namespaced mode this value will be larger than before.
  2. The webhook will generate its certificate once, on initial startup, and save it in a Secret in the cnrm-system namespace. When a pod restarts or a new pod is created, the same value will be read from the Secret.
  3. We will enable horizontal autoscaling on the webhook pod. The scaling threshold will be at 60% CPU usage (a rough equivalent is sketched below the list).
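
Roughly speaking, item 3 is equivalent to something like the following (the min/max bounds here are illustrative rather than the exact shipped values):

    # Rough sketch of the planned autoscaling, not the actual shipped manifest
    kubectl autoscale deployment cnrm-webhook-manager -n cnrm-system \
      --cpu-percent=60 --min=2 --max=10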

spew (Contributor) commented Sep 21, 2020

The changes are available in version 1.21.1. Leaving this issue open for the time being until we can confirm they are enough to solve the problem.

redbaron commented

We are on 1.23.0 (at least that's what the annotations say in our GKE cluster with the Config Connector add-on enabled) and continue to see something similar to this problem: webhook pods are using 100% of their CPU quota and have been scaled to 10 instances by the pod autoscaler.
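
The autoscaler state and replica count can be seen with, e.g.:

    # Current HPA utilization and webhook replica count
    kubectl get hpa -n cnrm-system
    kubectl get pods -n cnrm-system | grep cnrm-webhook-manager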

spew (Contributor) commented Nov 23, 2020

Hi @redbaron, can you email me at rleidle@google.com? I would be curious about the number of Config Connector resources you have, and any information about the kinds of resources would be good too.

You can see all your Config Connector resources with kubectl get gcp --all-namespaces.
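
If it helps with sizing, a rough per-namespace count can be pulled with something like:

    # Approximate Config Connector resource count per namespace
    kubectl get gcp --all-namespaces --no-headers | awk 'NF {print $1}' | sort | uniq -c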
