
Agent injector CrashLoopBackOff when replicas>1 : TLS handshake error - no certificate available #388

Closed
masterphenix opened this issue Oct 6, 2022 · 15 comments
Labels
bug Something isn't working

Comments

@masterphenix

Describe the bug
I have deployed the vault-agent-injector using Helm, with auto-TLS and 2 replicas, and both pods go into CrashLoopBackOff. Logs show the following errors:

[...]
2022-10-06T08:12:19.903Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-10-06T08:12:19.903Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-10-06T08:12:23.176Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-10-06T08:12:25.029Z [ERROR] handler: http: TLS handshake error from 10.102.8.55:59704: no certificate available
2022-10-06T08:12:25.029Z [ERROR] handler: http: TLS handshake error from 10.102.8.55:59702: no certificate available
2022-10-06T08:12:27.029Z [ERROR] handler: http: TLS handshake error from 10.102.8.55:49730: no certificate available
2022-10-06T08:12:27.029Z [ERROR] handler: http: TLS handshake error from 10.102.8.55:49728: no certificate available
2022-10-06T08:12:27.031Z [ERROR] handler: http: TLS handshake error from 10.102.8.55:49732: no certificate available

A describe on the pods shows both the liveness and readiness probes failing:

  Warning  Unhealthy         46m (x5 over 47m)      kubelet            Liveness probe failed: Get "https://10.102.8.59:8080/health/ready": remote error: tls: internal error
  Warning  Unhealthy         46m (x9 over 47m)      kubelet            Readiness probe failed: Get "https://10.102.8.59:8080/health/ready": remote error: tls: internal error

Also, the vault-injector-certs secret is of type "Opaque" and shows no data, which doesn't seem right.
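For reference, the empty secret at that point looked roughly like this (reconstructed from the description above, not a verbatim kubectl dump; a healthy auto-TLS secret would carry the certificate data the handler is waiting for):

# Reconstructed sketch, not a verbatim dump
apiVersion: v1
kind: Secret
metadata:
  name: vault-injector-certs
  namespace: vault
type: Opaque
# data is empty, which matches the "no certificate available" handshake errors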

To Reproduce
Steps to reproduce the behavior:

  1. Deploy the vault agent injector using the vault chart version 0.22.0 and the following values:
injector:
  enabled: true
  replicas: 2
  leaderElector:
    enabled: true
  metrics:
    enabled: true
  image:
    repository: "hashicorp/vault-k8s"
    tag: "1.0.0"
    pullPolicy: IfNotPresent
  agentImage:
    repository: "hashicorp/vault"
    tag: "1.11.3"
  authPath: "auth/azure"
  certs:
    secretName: null
    caBundle: ""
    certName: tls.crt
    keyName: tls.key
  2. Logs in pods show the errors above

Application deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/instance: vault
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: vault-agent-injector
    component: webhook
    helm.toolkit.fluxcd.io/name: vault
    helm.toolkit.fluxcd.io/namespace: vault
  name: vault-agent-injector
  namespace: vault
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: vault
      app.kubernetes.io/name: vault-agent-injector
      component: webhook
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        aadpodidbinding: vault-binding
        app.kubernetes.io/instance: vault
        app.kubernetes.io/name: vault-agent-injector
        component: webhook
        maintainer: team-ops
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/instance: vault
                app.kubernetes.io/name: vault-agent-injector
                component: webhook
            topologyKey: kubernetes.io/hostname
      containers:
      - args:
        - agent-inject
        - 2>&1
        env:
        - name: AGENT_INJECT_LISTEN
          value: :8080
        - name: AGENT_INJECT_LOG_LEVEL
          value: info
        - name: AGENT_INJECT_VAULT_ADDR
          value: https://xxxxxx.hashicorp.cloud:8200/
        - name: AGENT_INJECT_VAULT_AUTH_PATH
          value: auth/azure
        - name: AGENT_INJECT_VAULT_IMAGE
          value: hashicorp/vault:1.11.3
        - name: AGENT_INJECT_TLS_AUTO
          value: vault-agent-injector-cfg
        - name: AGENT_INJECT_TLS_AUTO_HOSTS
          value: vault-agent-injector-svc,vault-agent-injector-svc.vault,vault-agent-injector-svc.vault.svc
        - name: AGENT_INJECT_LOG_FORMAT
          value: standard
        - name: AGENT_INJECT_REVOKE_ON_SHUTDOWN
          value: "false"
        - name: AGENT_INJECT_TELEMETRY_PATH
          value: /metrics
        - name: AGENT_INJECT_USE_LEADER_ELECTOR
          value: "true"
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: AGENT_INJECT_CPU_REQUEST
          value: 250m
        - name: AGENT_INJECT_CPU_LIMIT
          value: 500m
        - name: AGENT_INJECT_MEM_REQUEST
          value: 64Mi
        - name: AGENT_INJECT_MEM_LIMIT
          value: 128Mi
        - name: AGENT_INJECT_DEFAULT_TEMPLATE
          value: map
        - name: AGENT_INJECT_TEMPLATE_CONFIG_EXIT_ON_RETRY_FAILURE
          value: "true"
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: hashicorp/vault-k8s:1.0.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 2
          httpGet:
            path: /health/ready
            port: 8080
            scheme: HTTPS
          initialDelaySeconds: 5
          periodSeconds: 2
          successThreshold: 1
          timeoutSeconds: 5
        name: sidecar-injector
        readinessProbe:
          failureThreshold: 2
          httpGet:
            path: /health/ready
            port: 8080
            scheme: HTTPS
          initialDelaySeconds: 5
          periodSeconds: 2
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            cpu: 250m
            memory: 256Mi
          requests:
            cpu: 250m
            memory: 256Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1000
        runAsGroup: 1000
        runAsNonRoot: true
        runAsUser: 100
      serviceAccount: vault-agent-injector
      serviceAccountName: vault-agent-injector
      terminationGracePeriodSeconds: 30

Expected behavior
Pods should start the same way as they do when replicas=1

Environment

  • Kubernetes version:
    • Distribution or cloud vendor (OpenShift, EKS, GKE, AKS, etc.): AKS 1.24.3
    • Other configuration options or runtime services (istio, etc.): using fluxcd to deploy
  • vault-k8s version: 1.0.0

Additional context
This is what the vault-k8s-leader configMap looks like:

apiVersion: v1
kind: ConfigMap
metadata:
  name: vault-k8s-leader
  namespace: vault
  ownerReferences:
  - apiVersion: v1
    kind: Pod
    name: vault-agent-injector-cdc566446-4kb8d
    uid: 11338b98-c923-46f3-8181-5e66cea4c71c
@masterphenix masterphenix added the bug Something isn't working label Oct 6, 2022
@tvoran
Member

tvoran commented Oct 7, 2022

Hi @masterphenix, sorry to hear you're running into trouble with multiple replicas! Those chart values look good to me, though with them I haven't been able to reproduce your issue yet. I wonder if the liveness and readiness timeouts are just too low for your system? It looks like the chart doesn't have those as configurable options yet, so if you can adjust them manually that might shed some light on what the problem is. You may also want to set injector.logLevel: debug to get more info in the logs.
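For reference, a minimal sketch of that values change (injector.logLevel is a chart value; the probe timings themselves are not exposed by the chart at this point, so they would have to be patched on the Deployment directly):

injector:
  enabled: true
  replicas: 2
  logLevel: debug   # more detail around leader election and certificate generation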

@masterphenix
Author

Hello @tvoran, thank you for having a look at this. I tried changing the liveness/readiness probe settings and you were right, it started working. As a definitive solution, I have added a startupProbe to give the pods enough time to elect a leader and generate the certificates.

@tvoran
Member

tvoran commented Oct 10, 2022

That's great to hear! Would you mind sharing what you changed/added to get it working? That will better inform how to go about adding support for this to the chart.

@masterphenix
Author

Sure, here is the startupProbe I added:

    startupProbe:
      failureThreshold: 12
      httpGet:
        path: /health/ready
        port: 8080
        scheme: HTTPS
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 5

Since I am using Flux and deploying the injector with the vault Helm chart, I used a kustomize postRenderer in the HelmRelease to add the probe:

  postRenderers:
  # Instruct helm-controller to use built-in "kustomize" post renderer.
  - kustomize:
      patchesJson6902:
      - target:
          group: apps
          version: v1
          kind: Deployment
          name: vault-agent-injector
        patch:
        - op: add
          path: /spec/template/spec/containers/0/startupProbe
          value: { "failureThreshold": 12, "initialDelaySeconds": 5, "periodSeconds": 5, "timeoutSeconds": 5, "successThreshold": 1, "httpGet": {"path":"/health/ready", "port": 8080, "scheme": "HTTPS"} }

@kiich

kiich commented Oct 14, 2022

+1 to say we have just run into this as well, and changing replicas to 1 fixed it for us.
It would be great to work out what the root issue is!

@kiich

kiich commented Oct 14, 2022

The same config (with regard to probes etc.) works fine for us in another environment, so we are wondering if it is a race condition?

@tvoran
Member

tvoran commented Nov 8, 2022

Yeah, I suspect that on some systems the injector's leader election just takes a little too long to establish a leader and generate the certificates for communicating with the k8s API, so the pod is killed by the liveness probe.
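If that is what's happening, a more forgiving liveness probe on the injector container (alongside or instead of the startupProbe above) should give leader election enough headroom. A sketch with example numbers only, since the chart does not expose these settings yet:

livenessProbe:
  failureThreshold: 10      # tolerate ~50s of failed checks before restarting
  httpGet:
    path: /health/ready
    port: 8080
    scheme: HTTPS
  initialDelaySeconds: 10
  periodSeconds: 5
  successThreshold: 1
  timeoutSeconds: 5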

@ajiteb

ajiteb commented Nov 16, 2022

I get the errors below even with 1 replica. FYI, I'm using an external Vault address.

vault-agent-injector-7b6ffd4845-js52q sidecar-injector 2022-11-16T10:18:29.609Z [ERROR] handler: http: TLS handshake error from 172.33.16.217:47484: read tcp 172.33.0.213:8080->172.33.16.217:47484: read: connection reset by peer
vault-agent-injector-7b6ffd4845-js52q sidecar-injector 2022-11-16T10:41:37.807Z [ERROR] handler: http: TLS handshake error from 172.33.16.217:40358: read tcp 172.33.0.213:8080->172.33.16.217:40358: read: connection reset by peer
vault-agent-injector-7b6ffd4845-js52q sidecar-injector 2022-11-16T10:43:29.826Z [ERROR] handler: http: TLS handshake error from 172.33.16.217:42968: EOF
vault-agent-injector-7b6ffd4845-js52q sidecar-injector 2022-11-16T10:43:29.846Z [ERROR] handler: http: TLS handshake error from 172.33.16.217:42972: EOF
vault-agent-injector-7b6ffd4845-js52q sidecar-injector 2022-11-16T10:43:29.860Z [ERROR] handler: http: TLS handshake error from 172.33.16.217:42978: read tcp 172.33.0.213:8080->172.33.16.217:42978: read: connection reset by peer

@tvoran
Member

tvoran commented Nov 18, 2022

Hi @ajiteb, that looks like a different issue, so you may want to start by reviewing the connectivity requirements: https://developer.hashicorp.com/vault/docs/platform/k8s/injector/examples#before-using-the-vault-agent-injector

@ajiteb

ajiteb commented Nov 18, 2022

@tvoran thanks for your reply. Yes, at least the pod doesn't go into CrashLoopBackOff; it stays in the Running state. Still, I'm not able to understand the reason for these logs. I'm on EKS 1.23 with Vault version 1.12.1.

@siwyroot

I have exactly the same problem on a few OpenShift clusters (not all of them), where I can only deploy the Vault injector with 1 replica. The secret for the leader elector is always empty. Newest chart and newest images. OpenShift 4.8.35.

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  namespace: flux-system
  name: vault
spec:
  targetNamespace: ${namespace}
  values:
    fullnameOverride: vault
    global:
      openshift: true
    injector:
      authPath: auth/${cluster_id}
      image:
        repository: hashicorp/vault-k8s
        tag: "1.1"
      agentImage:
        repository: hashicorp/vault
        tag: "1.12.1"
      agentDefaults:
        cpuLimit: "200m"
        cpuRequest: "1m"
      externalVaultAddr: ${vault_url}
      failurePolicy: Fail
      namespaceSelector:
        matchLabels:
          vault-injector-enabled: 'true'
      replicas: 2
  interval: 5m
  chart:
    spec:
      chart: vault
      interval: 1m
      reconcileStrategy: ChartVersion
      sourceRef:
        name: hashicorp
        namespace: flux-system
        kind: HelmRepository
      version: '0.22.1'

Logs:

Using internal leader elector logic for webhook certificate management
Listening on ":8080"...
2022-11-28T17:17:42.609Z [INFO]  handler: Starting handler..
2022-11-28T17:17:42.682Z [ERROR] handler: http: TLS handshake error from 10.131.4.1:55786: no certificate available
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Updated certificate bundle received. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
I1128 17:17:43.643688       1 request.go:682] Waited for 1.042662433s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/apps/v1?timeout=32s
2022-11-28T17:17:48.079Z [ERROR] handler: http: TLS handshake error from 10.131.4.1:55810: no certificate available
2022-11-28T17:17:48.079Z [ERROR] handler: http: TLS handshake error from 10.131.4.1:55812: no certificate available
2022-11-28T17:17:50.078Z [ERROR] handler: http: TLS handshake error from 10.131.4.1:55824: no certificate available
2022-11-28T17:17:50.078Z [ERROR] handler: http: TLS handshake error from 10.131.4.1:55826: no certificate available

@tvoran
Member

tvoran commented Nov 29, 2022

Hi @siwyroot, I wonder if what you're seeing could be related to #378?

@siwyroot

Hi @tvoran, I think it's the same issue. Thanks!

@adjain131995

@tvoran we have faced the exact same issue, and the hot fix was setting the replica count to 1. However, this definitely indicates a bug, and I believe it should be fixed.

@tvoran
Member

tvoran commented Apr 11, 2023

Hi folks, I believe this has been addressed in #852, which was released in v0.24.0.
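For the Flux users in this thread, picking up the fix should be a matter of bumping the chart version in the HelmRelease (assuming v0.24.0 here refers to the vault Helm chart). A sketch based on the HelmRelease posted above:

spec:
  chart:
    spec:
      chart: vault
      version: '0.24.0'   # or newer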

@tvoran tvoran closed this as completed Apr 11, 2023