
Agent injector CrashLoopBackOff when replicas>1 : TLS handshake error - no certificate available #388

Closed
masterphenix opened this issue Oct 6, 2022 · 15 comments
Labels
bug Something isn't working

Comments

@masterphenix

Describe the bug
I have deployed the vault-agent-injector using Helm, with auto-TLS and 2 replicas, and both pods go into CrashLoopBackOff. Logs show the following errors:

[...]
2022-10-06T08:12:19.903Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-10-06T08:12:19.903Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-10-06T08:12:23.176Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-10-06T08:12:25.029Z [ERROR] handler: http: TLS handshake error from 10.102.8.55:59704: no certificate available
2022-10-06T08:12:25.029Z [ERROR] handler: http: TLS handshake error from 10.102.8.55:59702: no certificate available
2022-10-06T08:12:27.029Z [ERROR] handler: http: TLS handshake error from 10.102.8.55:49730: no certificate available
2022-10-06T08:12:27.029Z [ERROR] handler: http: TLS handshake error from 10.102.8.55:49728: no certificate available
2022-10-06T08:12:27.031Z [ERROR] handler: http: TLS handshake error from 10.102.8.55:49732: no certificate available

A describe on the pods shows both the liveness and readiness probes failing:

  Warning  Unhealthy         46m (x5 over 47m)      kubelet            Liveness probe failed: Get "https://10.102.8.59:8080/health/ready": remote error: tls: internal error
  Warning  Unhealthy         46m (x9 over 47m)      kubelet            Readiness probe failed: Get "https://10.102.8.59:8080/health/ready": remote error: tls: internal error

Also, the vault-injector-certs secret is of type "Opaque" and shows no data, which doesn't seem right.
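For reference, the empty secret at that point looked roughly like this (reconstructed from the description above, not a verbatim kubectl dump; a healthy auto-TLS secret would carry the certificate data the handler is waiting for):

# Reconstructed sketch, not a verbatim dump
apiVersion: v1
kind: Secret
metadata:
  name: vault-injector-certs
  namespace: vault
type: Opaque
# data is empty, which matches the "no certificate available" handshake errors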

To Reproduce
Steps to reproduce the behavior:

  1. Deploy the vault agent injector using the vault chart version 0.22.0 and the following values:
injector:
  enabled: true
  replicas: 2
  leaderElector:
    enabled: true
  metrics:
    enabled: true
  image:
    repository: "hashicorp/vault-k8s"
    tag: "1.0.0"
    pullPolicy: IfNotPresent
  agentImage:
    repository: "hashicorp/vault"
    tag: "1.11.3"
  authPath: "auth/azure"
  certs:
    secretName: null
    caBundle: ""
    certName: tls.crt
    keyName: tls.key
  2. Logs in pods show the errors above

Application deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/instance: vault
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: vault-agent-injector
    component: webhook
    helm.toolkit.fluxcd.io/name: vault
    helm.toolkit.fluxcd.io/namespace: vault
  name: vault-agent-injector
  namespace: vault
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: vault
      app.kubernetes.io/name: vault-agent-injector
      component: webhook
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        aadpodidbinding: vault-binding
        app.kubernetes.io/instance: vault
        app.kubernetes.io/name: vault-agent-injector
        component: webhook
        maintainer: team-ops
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/instance: vault
                app.kubernetes.io/name: vault-agent-injector
                component: webhook
            topologyKey: kubernetes.io/hostname
      containers:
      - args:
        - agent-inject
        - 2>&1
        env:
        - name: AGENT_INJECT_LISTEN
          value: :8080
        - name: AGENT_INJECT_LOG_LEVEL
          value: info
        - name: AGENT_INJECT_VAULT_ADDR
          value: https://xxxxxx.hashicorp.cloud:8200/
        - name: AGENT_INJECT_VAULT_AUTH_PATH
          value: auth/azure
        - name: AGENT_INJECT_VAULT_IMAGE
          value: hashicorp/vault:1.11.3
        - name: AGENT_INJECT_TLS_AUTO
          value: vault-agent-injector-cfg
        - name: AGENT_INJECT_TLS_AUTO_HOSTS
          value: vault-agent-injector-svc,vault-agent-injector-svc.vault,vault-agent-injector-svc.vault.svc
        - name: AGENT_INJECT_LOG_FORMAT
          value: standard
        - name: AGENT_INJECT_REVOKE_ON_SHUTDOWN
          value: "false"
        - name: AGENT_INJECT_TELEMETRY_PATH
          value: /metrics
        - name: AGENT_INJECT_USE_LEADER_ELECTOR
          value: "true"
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: AGENT_INJECT_CPU_REQUEST
          value: 250m
        - name: AGENT_INJECT_CPU_LIMIT
          value: 500m
        - name: AGENT_INJECT_MEM_REQUEST
          value: 64Mi
        - name: AGENT_INJECT_MEM_LIMIT
          value: 128Mi
        - name: AGENT_INJECT_DEFAULT_TEMPLATE
          value: map
        - name: AGENT_INJECT_TEMPLATE_CONFIG_EXIT_ON_RETRY_FAILURE
          value: "true"
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: hashicorp/vault-k8s:1.0.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 2
          httpGet:
            path: /health/ready
            port: 8080
            scheme: HTTPS
          initialDelaySeconds: 5
          periodSeconds: 2
          successThreshold: 1
          timeoutSeconds: 5
        name: sidecar-injector
        readinessProbe:
          failureThreshold: 2
          httpGet:
            path: /health/ready
            port: 8080
            scheme: HTTPS
          initialDelaySeconds: 5
          periodSeconds: 2
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            cpu: 250m
            memory: 256Mi
          requests:
            cpu: 250m
            memory: 256Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1000
        runAsGroup: 1000
        runAsNonRoot: true
        runAsUser: 100
      serviceAccount: vault-agent-injector
      serviceAccountName: vault-agent-injector
      terminationGracePeriodSeconds: 30

Expected behavior
Pods should start the same way as they do when replicas=1

Environment

  • Kubernetes version:
    • Distribution or cloud vendor (OpenShift, EKS, GKE, AKS, etc.): AKS 1.24.3
    • Other configuration options or runtime services (istio, etc.): using fluxcd to deploy
  • vault-k8s version: 1.0.0

Additional context
This is what the vault-k8s-leader configMap looks like:

apiVersion: v1
kind: ConfigMap
metadata:
  name: vault-k8s-leader
  namespace: vault
  ownerReferences:
  - apiVersion: v1
    kind: Pod
    name: vault-agent-injector-cdc566446-4kb8d
    uid: 11338b98-c923-46f3-8181-5e66cea4c71c
@masterphenix masterphenix added the bug Something isn't working label Oct 6, 2022
@tvoran
Member

tvoran commented Oct 7, 2022

Hi @masterphenix, sorry to hear you're running into trouble with multiple replicas! Those chart values look good to me, though with them I haven't been able to reproduce your issue yet. I wonder if the liveness and readiness timeouts are just too low for your system? It looks like the chart doesn't have those as configurable options yet, so if you can adjust them manually that might shed some light on what the problem is. You may also want to set injector.logLevel: debug to get more info in the logs.
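For reference, a minimal sketch of that values change (injector.logLevel is a chart value; the probe timings themselves are not exposed by the chart at this point, so they would have to be patched on the Deployment directly):

injector:
  enabled: true
  replicas: 2
  logLevel: debug   # more detail around leader election and certificate generation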

@masterphenix
Author

Hello @tvoran, thank you for having a look at this. I tried changing the liveness/readiness probe settings and you were right, it started working. As a definitive solution, I have added a startupProbe to give the pods enough time to elect a leader and generate the certificates.

@tvoran
Member

tvoran commented Oct 10, 2022

That's great to hear! Would you mind sharing what you changed/added to get it working? That will better inform how to go about adding support for this to the chart.

@masterphenix
Author

Sure, here is the startupProbe I added:

    startupProbe:
      failureThreshold: 12
      httpGet:
        path: /health/ready
        port: 8080
        scheme: HTTPS
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 5

Since I am using Flux and deploying the injector with the vault Helm chart, I used a kustomize postRenderer in the HelmRelease to add the probe:

  postRenderers:
  # Instruct helm-controller to use built-in "kustomize" post renderer.
  - kustomize:
      patchesJson6902:
      - target:
          group: apps
          version: v1
          kind: Deployment
          name: vault-agent-injector
        patch:
        - op: add
          path: /spec/template/spec/containers/0/startupProbe
          value: { "failureThreshold": 12, "initialDelaySeconds": 5, "periodSeconds": 5, "timeoutSeconds": 5, "successThreshold": 1, "httpGet": {"path":"/health/ready", "port": 8080, "scheme": "HTTPS"} }

@kiich

kiich commented Oct 14, 2022

+1 to say we have just run into this as well, and changing replicas to 1 fixed it for us.
It would be great to work out what the root issue is!

@kiich

kiich commented Oct 14, 2022

The same config (with regard to probes etc.) works fine for us in another environment, so we are wondering if it is a race condition?

@tvoran
Member

tvoran commented Nov 8, 2022

Yeah, I suspect that on some systems the injector's leader election just takes a little too long to establish a leader and generate the certificates for communicating with the k8s API, so the pod is killed by the liveness probe.
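If that is what's happening, a more forgiving liveness probe on the injector container (alongside or instead of the startupProbe above) should give leader election enough headroom. A sketch with example numbers only, since the chart does not expose these settings yet:

livenessProbe:
  failureThreshold: 10      # tolerate ~50s of failed checks before restarting
  httpGet:
    path: /health/ready
    port: 8080
    scheme: HTTPS
  initialDelaySeconds: 10
  periodSeconds: 5
  successThreshold: 1
  timeoutSeconds: 5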

@ajiteb

ajiteb commented Nov 16, 2022

I get the errors below even with 1 replica. FYI, I'm using an external Vault address.

vault-agent-injector-7b6ffd4845-js52q sidecar-injector 2022-11-16T10:18:29.609Z [ERROR] handler: http: TLS handshake error from 172.33.16.217:47484: read tcp 172.33.0.213:8080->172.33.16.217:47484: read: connection reset by peer
vault-agent-injector-7b6ffd4845-js52q sidecar-injector 2022-11-16T10:41:37.807Z [ERROR] handler: http: TLS handshake error from 172.33.16.217:40358: read tcp 172.33.0.213:8080->172.33.16.217:40358: read: connection reset by peer
vault-agent-injector-7b6ffd4845-js52q sidecar-injector 2022-11-16T10:43:29.826Z [ERROR] handler: http: TLS handshake error from 172.33.16.217:42968: EOF
vault-agent-injector-7b6ffd4845-js52q sidecar-injector 2022-11-16T10:43:29.846Z [ERROR] handler: http: TLS handshake error from 172.33.16.217:42972: EOF
vault-agent-injector-7b6ffd4845-js52q sidecar-injector 2022-11-16T10:43:29.860Z [ERROR] handler: http: TLS handshake error from 172.33.16.217:42978: read tcp 172.33.0.213:8080->172.33.16.217:42978: read: connection reset by peer

@tvoran
Member

tvoran commented Nov 18, 2022

Hi @ajiteb, that looks like a different issue, so you may want to start by reviewing the connectivity requirements: https://developer.hashicorp.com/vault/docs/platform/k8s/injector/examples#before-using-the-vault-agent-injector

@ajiteb

ajiteb commented Nov 18, 2022

@tvoran thanks for your reply. Yes, at least the pod doesn't go into CrashLoopBackOff; it stays in the Running state. Still, I'm not able to understand the reason for these logs. I'm on EKS 1.23 with Vault version 1.12.1.

@siwyroot

I have exactly the same problem on a few OpenShift clusters (not all of them), where I can only deploy the Vault injector with 1 replica. The secret for the leader elector is always empty. Newest chart and newest images. OpenShift 4.8.35.

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  namespace: flux-system
  name: vault
spec:
  targetNamespace: ${namespace}
  values:
    fullnameOverride: vault
    global:
      openshift: true
    injector:
      authPath: auth/${cluster_id}
      image:
        repository: hashicorp/vault-k8s
        tag: "1.1"
      agentImage:
        repository: hashicorp/vault
        tag: "1.12.1"
      agentDefaults:
        cpuLimit: "200m"
        cpuRequest: "1m"
      externalVaultAddr: ${vault_url}
      failurePolicy: Fail
      namespaceSelector:
        matchLabels:
          vault-injector-enabled: 'true'
      replicas: 2
  interval: 5m
  chart:
    spec:
      chart: vault
      interval: 1m
      reconcileStrategy: ChartVersion
      sourceRef:
        name: hashicorp
        namespace: flux-system
        kind: HelmRepository
      version: '0.22.1'

Logs:

Using internal leader elector logic for webhook certificate management
Listening on ":8080"...
2022-11-28T17:17:42.609Z [INFO]  handler: Starting handler..
2022-11-28T17:17:42.682Z [ERROR] handler: http: TLS handshake error from 10.131.4.1:55786: no certificate available
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Updated certificate bundle received. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
I1128 17:17:43.643688       1 request.go:682] Waited for 1.042662433s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/apps/v1?timeout=32s
2022-11-28T17:17:48.079Z [ERROR] handler: http: TLS handshake error from 10.131.4.1:55810: no certificate available
2022-11-28T17:17:48.079Z [ERROR] handler: http: TLS handshake error from 10.131.4.1:55812: no certificate available
2022-11-28T17:17:50.078Z [ERROR] handler: http: TLS handshake error from 10.131.4.1:55824: no certificate available
2022-11-28T17:17:50.078Z [ERROR] handler: http: TLS handshake error from 10.131.4.1:55826: no certificate available

@tvoran
Member

tvoran commented Nov 29, 2022

Hi @siwyroot, I wonder if what you're seeing could be related to #378?

@siwyroot

Hi @tvoran, I think it's the same issue. Thanks!

@adjain131995

@tvoran we have faced the exact same issue, and the hot fix was setting the replica count to 1. However, this definitely indicates a bug, and I believe it should be fixed.

@tvoran
Member

tvoran commented Apr 11, 2023

Hi folks, I believe this has been addressed in #852, which was released in v0.24.0.
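For the Flux users in this thread, picking up the fix should be a matter of bumping the chart version in the HelmRelease (assuming v0.24.0 here refers to the vault Helm chart). A sketch based on the HelmRelease posted above:

spec:
  chart:
    spec:
      chart: vault
      version: '0.24.0'   # or newer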

@tvoran tvoran closed this as completed Apr 11, 2023