operatorhub-catalog in crashloop backoff without clear error message #1087

Closed
mjpitz opened this issue Oct 25, 2019 · 22 comments
Labels
triaged Issue has been considered by a member of the OLM community

Comments

mjpitz commented Oct 25, 2019

Bug Report

I quite frequently find the operatorhub-catalog pod in CrashLoopBackOff with no useful information for debugging how it got there. It would be nice to get more diagnostic output explaining why the pod ends up in that state.

What did you do?

Deployed OLM using the provided Helm template and waited for some time.

olm                     operatorhubio-catalog-t66gt                              0/1     CrashLoopBackOff             3848       8d

What did you expect to see?

I would expect the process to run healthily unless there is an issue. When an issue does occur, I would expect the operatorhub-catalog log to communicate what is wrong with the pod instead of emitting a single gRPC-server startup message.
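
For reference, two generic kubectl checks that may pull a little more detail out of the crashing pod (nothing OLM-specific; the pod name below is the one from this report):

$ kubectl logs -n olm operatorhubio-catalog-t66gt --previous
$ kubectl get pod -n olm operatorhubio-catalog-t66gt \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'

The first prints the logs of the previously crashed container; the second prints the exit code and reason recorded for the last termination.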

Environment

  • operator-lifecycle-manager version:
  • Kubernetes version information:
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.0", GitCommit:"2bd9643cee5b3b3a5ecbd3af49d09018f0773c77", GitTreeState:"clean", BuildDate:"2019-09-18T14:36:53Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:05:50Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes cluster kind: (openstack, bare metal, and aws)

Possible Solution

Additional context

Name:         operatorhubio-catalog-5jtrd
Namespace:    olm
Priority:     0
Node:         admin-kcp-primary-0/192.168.200.9
Start Time:   Fri, 25 Oct 2019 09:51:57 -0500
Labels:       olm.catalogSource=operatorhubio-catalog
Annotations:  cni.projectcalico.org/podIP: 10.42.1.2/32
              kubernetes.io/psp: default-psp
Status:       Running
IP:           10.42.1.2
IPs:          <none>
Containers:
  registry-server:
    Container ID:   docker://37233b694ec50f4963d23cd9447fd458a19cb3f36013ca53521a500e1fceba4d
    Image:          quay.io/operator-framework/upstream-community-operators:latest
    Image ID:       docker-pullable://quay.io/operator-framework/upstream-community-operators@sha256:95a59849ea594e97742264d66b80dcc2a8ac3515ff22cf64538b21101f345111
    Port:           50051/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Fri, 25 Oct 2019 09:55:03 -0500
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Fri, 25 Oct 2019 09:54:23 -0500
      Finished:     Fri, 25 Oct 2019 09:55:01 -0500
    Ready:          True
    Restart Count:  4
    Requests:
      cpu:      10m
      memory:   50Mi
    Liveness:   exec [grpc_health_probe -addr=localhost:50051] delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:  exec [grpc_health_probe -addr=localhost:50051] delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-5hwc7 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  default-token-5hwc7:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-5hwc7
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     
Events:
  Type     Reason     Age                    From                                Message
  ----     ------     ----                   ----                                -------
  Normal   Scheduled  4m1s                   default-scheduler                   Successfully assigned olm/operatorhubio-catalog-5jtrd to admin-kcp-primary-0
  Normal   Started    2m54s (x2 over 3m49s)  kubelet, admin-kcp-primary-0  Started container registry-server
  Warning  Unhealthy  2m20s (x6 over 3m20s)  kubelet, admin-kcp-primary-0  Readiness probe failed: timeout: failed to connect service "localhost:50051" within 1s
  Normal   Killing    2m17s (x2 over 2m57s)  kubelet, admin-kcp-primary-0  Container registry-server failed liveness probe, will be restarted
  Warning  Unhealthy  2m17s (x6 over 3m17s)  kubelet, admin-kcp-primary-0  Liveness probe failed: timeout: failed to connect service "localhost:50051" within 1s
  Normal   Pulling    2m16s (x3 over 3m55s)  kubelet, admin-kcp-primary-0  Pulling image "quay.io/operator-framework/upstream-community-operators:latest"
  Normal   Pulled     2m15s (x3 over 3m51s)  kubelet, admin-kcp-primary-0  Successfully pulled image "quay.io/operator-framework/upstream-community-operators:latest"
  Normal   Created    2m15s (x3 over 3m50s)  kubelet, admin-kcp-primary-0  Created container registry-server
@ecordell (Member)

Thanks @mjpitz - could you share the logs from the operatorhubio pod?

mjpitz (Author) commented Oct 28, 2019

@ecordell : here they are

$ kubectl logs -f -n olm operatorhubio-catalog-5jtrd 
time="2019-10-28T16:35:43Z" level=info msg="serving registry" database=/bundles.db port=50051

mjpitz (Author) commented Nov 5, 2019

Looking at the event log, it seems the liveness probe is causing the restarts. Since both probes run the same gRPC health check, I assume they reflect the overall health of the service. Is there a way to increase the verbosity of the logs?
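
A quick way to dig further (a sketch; it assumes grpc_health_probe is on the image's PATH, which the probe definition above suggests) is to run the same check the kubelet runs and see how long it actually takes:

$ time kubectl exec -n olm operatorhubio-catalog-5jtrd -- grpc_health_probe -addr=localhost:50051

If this regularly takes close to or over 1s, the probes' 1s timeout is the likely trigger for the restarts rather than the registry process itself.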

stale bot commented Feb 26, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Feb 26, 2020
openshift-ci-robot added the triage/unresolved label (Indicates an issue that can not or will not be resolved.) and removed wontfix Feb 27, 2020
@piyush-nimbalkar

I am facing the exact same issue; the logs don't say much. The image I am using is:

    Image:          quay.io/operator-framework/upstream-community-operators:latest
    Image ID:       docker-pullable://quay.io/operator-framework/upstream-community-operators@sha256:d60e598dc930027fcbe536240540c45dc8468dfca8aa2a761663acb697973a57

Tried to re-deploy the catalog multiple times, but no luck.

@mjpitz were you able to get around this issue?

mjpitz (Author) commented Apr 3, 2020

No. We wound up uninstalling OLM and working with the upstream Helm charts instead. Sorry.

zzztimbo commented Apr 11, 2020

I had the same issue, and it caused namespace deletion to hang.
Also, there is no convenient way to uninstall OLM.

This was a helpful PR for uninstall:
#789

Too bad it was never approved and merged.
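
For anyone stuck at the same point, a rough uninstall sketch, assuming OLM was installed from the upstream crds.yaml/olm.yaml release manifests. Hanging namespace deletion is commonly caused by the aggregated packageserver APIService becoming unavailable, so removing it first may unblock things:

$ kubectl delete apiservice v1.packages.operators.coreos.com
$ kubectl delete -f olm.yaml
$ kubectl delete -f crds.yaml

(olm.yaml and crds.yaml here stand for whatever manifests were used to install OLM.)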

@AndrienkoAleksandr

I see this error on minikube 1.7.1 and 1.5.2, and on OpenShift 4:

crc version: 1.6.0+8ef676f
OpenShift version: 4.3.0 (embedded in binary)

jmickey commented May 15, 2020

Experiencing this issue - it appears to be related to the liveness probe, as detailed in this comment:

#740 (comment)

Any tips for fixing this are appreciated!

CermakM commented May 22, 2020

Same issue w/ >=0.14.2 on Kubernetes v1.16

rpajay commented May 27, 2020

$ kubectl logs operatorhubio-catalog-5r5r6 -n olm
time="2020-05-27T14:37:45Z" level=info msg="serving registry" database=/bundles.db port=50051
$ kubectl describe po operatorhubio-catalog-5r5r6 -n olm
Name:         operatorhubio-catalog-5r5r6
Namespace:    olm
Priority:     0
Node:         minikube/192.168.39.185
Start Time:   Wed, 27 May 2020 19:59:00 +0530
Labels:       olm.catalogSource=operatorhubio-catalog
Annotations:  <none>
Status:       Running
IP:           172.17.0.6
IPs:
  IP:  172.17.0.6
Containers:
  registry-server:
    Container ID:   docker://42292d3cc252052f28e026d5b26806a2ee43adb0cfda3d1a7d97a0c3bd58b55c
    Image:          quay.io/operator-framework/upstream-community-operators:latest
    Image ID:       docker-pullable://quay.io/operator-framework/upstream-community-operators@sha256:861810ae99f9dcaf454a09b844af9a0d150965b2fdc3ceccf4c5120571c000ce
    Port:           50051/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 27 May 2020 20:07:44 +0530
      Finished:     Wed, 27 May 2020 20:08:31 +0530
    Ready:          False
    Restart Count:  6
    Limits:
      cpu:     100m
      memory:  100Mi
    Requests:
      cpu:        10m
      memory:     50Mi
    Liveness:     exec [grpc_health_probe -addr=localhost:50051] delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:    exec [grpc_health_probe -addr=localhost:50051] delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-r5r8t (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  default-token-r5r8t:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-r5r8t
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  11m                    default-scheduler  Successfully assigned olm/operatorhubio-catalog-5r5r6 to minikube
  Warning  Unhealthy  10m                    kubelet, minikube  Readiness probe failed: timeout: health rpc did not complete within 1s
  Warning  Unhealthy  10m                    kubelet, minikube  Liveness probe errored: rpc error: code = Unknown desc = container not running (86083553a35e5186dab4f6ecff8054e416965dfc2a9b68873c89ec54b4f80c1f)
  Warning  Unhealthy  10m                    kubelet, minikube  Readiness probe failed: OCI runtime exec failed: exec failed: container_linux.go:346: starting container process caused "read init-p: connection reset by peer": unknown
  Normal   Created    9m25s (x4 over 11m)    kubelet, minikube  Created container registry-server
  Normal   Started    9m24s (x4 over 11m)    kubelet, minikube  Started container registry-server
  Warning  Unhealthy  8m48s (x3 over 10m)    kubelet, minikube  Liveness probe failed:
  Warning  Unhealthy  8m48s (x2 over 9m48s)  kubelet, minikube  Readiness probe failed:
  Normal   Pulled     6m2s (x6 over 11m)     kubelet, minikube  Container image "quay.io/operator-framework/upstream-community-operators:latest" already present on machine
  Warning  BackOff    62s (x33 over 10m)     kubelet, minikube  Back-off restarting failed container
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.3", GitCommit:"2e7996e3e2712684bc73f0dec0200d64eec7fe40", GitTreeState:"clean", BuildDate:"2020-05-20T12:52:00Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.0", GitCommit:"70132b0f130acc0bed193d9ba59dd186f0e634cf", GitTreeState:"clean", BuildDate:"2019-12-07T21:12:17Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}

Kampe commented Jul 10, 2020

Same issue on k3s.

greglanthier commented Jul 16, 2020

I'm seeing the same problem with Minikube v1.12.0 using the olm addon.

I wonder if the operatorhubio-catalog pod is running out of memory?

$ minikube version
minikube version: v1.12.0
commit: c83e6c47124b71190e138dbc687d2556d31488d6
$ minikube start --vm-driver=hyperkit --addons=olm
😄  minikube v1.12.0 on Darwin 10.15.5
...
🌟  Enabled addons: default-storageclass, olm, storage-provisioner
🏄  Done! kubectl is now configured to use "minikube"
$ operator-sdk version   
operator-sdk version: "v0.19.0", commit: "8e28aca60994c5cb1aec0251b85f0116cc4c9427", kubernetes version: "v1.18.2", go version: "go1.14.4 darwin/amd64"
$ operator-sdk olm status
INFO[0000] Fetching CRDs for version "0.14.1"           
INFO[0001] Fetching resources for version "0.14.1"      
INFO[0001] Successfully got OLM status for version "0.14.1" 

NAME                                            NAMESPACE    KIND                        STATUS
olm                                                          Namespace                   Installed
operatorgroups.operators.coreos.com                          CustomResourceDefinition    Installed
catalogsources.operators.coreos.com                          CustomResourceDefinition    Installed
subscriptions.operators.coreos.com                           CustomResourceDefinition    Installed
installplans.operators.coreos.com                            CustomResourceDefinition    Installed
aggregate-olm-edit                                           ClusterRole                 Installed
catalog-operator                                olm          Deployment                  Installed
olm-operator                                    olm          Deployment                  Installed
operatorhubio-catalog                           olm          CatalogSource               Installed
olm-operators                                   olm          OperatorGroup               Installed
aggregate-olm-view                                           ClusterRole                 Installed
operators                                                    Namespace                   Installed
global-operators                                operators    OperatorGroup               Installed
olm-operator-serviceaccount                     olm          ServiceAccount              Installed
packageserver                                   olm          ClusterServiceVersion       Installed
system:controller:operator-lifecycle-manager                 ClusterRole                 Installed
clusterserviceversions.operators.coreos.com                  CustomResourceDefinition    Installed
olm-operator-binding-olm                                     ClusterRoleBinding          Installed

$ 

The status of the operatorhubio-catalog pod occasionally flips to OOMKilled after it's been running for a few minutes...

$ kubectl get pod -l olm.catalogSource=operatorhubio-catalog -n olm -w
NAME                          READY   STATUS    RESTARTS   AGE
operatorhubio-catalog-vxxhr   0/1     Pending   0          0s
operatorhubio-catalog-vxxhr   0/1     Pending   0          0s
operatorhubio-catalog-vxxhr   0/1     ContainerCreating   0          0s
operatorhubio-catalog-vxxhr   0/1     Terminating         0          1s
operatorhubio-catalog-vxxhr   0/1     Terminating         0          1s
operatorhubio-catalog-88d7p   0/1     Pending             0          0s
operatorhubio-catalog-88d7p   0/1     Pending             0          0s
operatorhubio-catalog-88d7p   0/1     ContainerCreating   0          0s
operatorhubio-catalog-88d7p   0/1     Running             0          9s
operatorhubio-catalog-88d7p   1/1     Running             0          20s
operatorhubio-catalog-88d7p   0/1     OOMKilled           0          33s  # <- ?
operatorhubio-catalog-88d7p   0/1     Running             1          34s
operatorhubio-catalog-88d7p   1/1     Running             1          40s
...

The resource specs for the operatorhubio-catalog pod don't look too small, but 🤷 .

$ kubectl get pod -lolm.catalogSource=operatorhubio-catalog -ojsonpath='{.items[*].spec.containers[*].resources}'       
map[limits:map[cpu:100m memory:100Mi] requests:map[cpu:10m memory:50Mi]]
$

It feels like the catalog from operatorhub.io might bump into limits as discussed in this comment.

It looks like this was fixed by #1389 in release 0.15.0.
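
To confirm the OOM theory on a given cluster, the last termination reason is recorded on the pod status (a generic check, not OLM-specific):

$ kubectl get pod -n olm -l olm.catalogSource=operatorhubio-catalog \
    -o jsonpath='{.items[*].status.containerStatuses[*].lastState.terminated.reason}'

This prints OOMKilled when the container was killed for exceeding its memory limit.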

stale bot commented Sep 15, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot commented Nov 14, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot commented Jan 13, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

ezeeyahoo commented Apr 7, 2021

Facing the same issue. I am trying it on IBM's PPC architecture.
kubectl describe output:

Name:         operatorhubio-catalog-95wvc
Namespace:    olm
Priority:     0
Node:         gardener-worker2/192.168.142.188
Start Time:   Wed, 07 Apr 2021 07:50:34 -0400
Labels:       olm.catalogSource=operatorhubio-catalog
Annotations:  cni.projectcalico.org/podIP: 172.20.251.199/32
              cni.projectcalico.org/podIPs: 172.20.251.199/32
Status:       Running
IP:           172.20.251.199
IPs:
  IP:  172.20.251.199
Containers:
  registry-server:
    Container ID:   docker://7953aaa31206dcc62b42a546a272b385e827099757aa343cc560268e52047a3c
    Image:          quay.io/operatorhubio/catalog:latest
    Image ID:       docker-pullable://quay.io/operatorhubio/catalog@sha256:9089276dd5522b70c0aeb00cce02bd5f688983261f2a181fc741b4c44fbddc69
    Port:           50051/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 07 Apr 2021 09:49:52 -0400
      Finished:     Wed, 07 Apr 2021 09:49:52 -0400
    Ready:          False
    Restart Count:  28
    Requests:
      cpu:        10m
      memory:     50Mi
    Liveness:     exec [grpc_health_probe -addr=:50051] delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:    exec [grpc_health_probe -addr=:50051] delay=5s timeout=5s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-rfrp9 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-rfrp9:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-rfrp9
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                  From     Message
  ----     ------   ----                 ----     -------
  Warning  BackOff  7s (x573 over 120m)  kubelet  Back-off restarting failed container

It seems the image is unsupported on this architecture. How can I build it locally?

kubectl logs pod/operatorhubio-catalog-95wvc -n olm
standard_init_linux.go:219: exec user process caused: exec format error
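
The exec format error generally means the image has no variant for the node's CPU architecture (ppc64le here). A quick way to see which architectures the manifest actually provides (generic tooling, not specific to this repo; docker manifest inspect may need DOCKER_CLI_EXPERIMENTAL=enabled on older Docker versions):

$ docker manifest inspect quay.io/operatorhubio/catalog:latest | grep architecture

If ppc64le isn't listed, the image would have to be rebuilt for that platform.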

@dinhxuanvu (Member)

We have a few PRs merged that should resolve this issue (e.g. operator-framework/operator-registry#227), so it may already be fixed. Could you retest on a newer version of OLM (0.19.0+) to see if the issue still exists?

dinhxuanvu added the triaged label (Issue has been considered by a member of the OLM community) and removed triage/unresolved (Indicates an issue that can not or will not be resolved.) Nov 11, 2021
tcoupin commented Mar 20, 2022

Hi,
same issue with 0.20.0.
I tried manually starting a container on a worker with the image quay.io/operatorhubio/catalog:latest; it takes more than a minute to start. The pod has no startupProbe defined.
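
For illustration, the kind of startupProbe that would cover a slow registry start looks roughly like this (a sketch only; these pods are generated by OLM from the CatalogSource, so a manual patch on the pod won't stick and this would need support in OLM itself):

    startupProbe:
      exec:
        command: ["grpc_health_probe", "-addr=:50051"]
      periodSeconds: 10
      failureThreshold: 30    # allows up to ~5 minutes before liveness/readiness take over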

vbondoo7 commented Aug 2, 2023

I am facing the same issue with minikube.

@mateuszkca

I have the same issue, but only on k8s nodes running Rocky 9/CentOS 9. On those nodes the catalog pod takes more than 75s to start.
On nodes running CentOS 8/Rocky 8 it starts in 6s.
There is no option in the CRD or in the operator config to change the readiness probe for the catalog pod.

@dinhxuanvu (Member)

@joelanford FYI
