Which chart:
https://github.com/bitnami/charts/tree/master/bitnami/redis
Version: 15.7.5
Introduced in: #8641
Describe the bug
Often, new installations of Redis with Sentinel fail due to an initialization failure during start up. #8641 introduced a retry which is doing its job, but we now run into the upper limit of the chart's default liveness probe, leading to an occasional failure of the Redis deployment to initialize. The StatefulSet gets hung up bringing the first Pod online as either a primary or a replica. Under the right set of conditions we don't see this; more often, however, it leads to a nearly infinite loop, as that first Pod cannot come online.
If we are lucky, the Endpoints are populated at the very last second, allowing the deployment to come online, though Kubernetes still intervenes and removes that first Pod for hitting the limits of the liveness probe.
❗ I suspect this may also impact CI testing of this helm chart.
Workaround!
Bump the liveness probe's initial delay to something higher than the default. For example:
replica:
  livenessProbe:
    initialDelaySeconds: 60
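The same override can also be passed on the helm command line, extending the install command used in the reproduction steps (a sketch; it assumes the chart exposes this value at replica.livenessProbe.initialDelaySeconds, matching the values snippet above):

```shell
# Sketch: same chart install as in the reproduction steps, with the liveness
# probe's initial delay raised from the chart default to 60 seconds.
helm upgrade a bitnami/redis --install \
  --set auth.password=hunter5 \
  --set sentinel.enabled=true \
  --set replica.livenessProbe.initialDelaySeconds=60
```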
To Reproduce
Steps to reproduce the behavior:
helm upgrade a bitnami/redis --install --set auth.password=hunter5 --set sentinel.enabled=true
View the Redis container logs of the first Pod and observe one of the following:
Connection refused repeated until the Pod is killed
Connection refused and then Redis starts, immediately followed by Kubernetes killing the Pod, resulting in this process starting over
Eventually we land in a CrashLoopBackOff.
Expected behavior
A Redis deployment should come online with zero Pod restarts when it is first created.
Additional context
The retry_while function is using the default parameters (Reference: https://github.com/bitnami/bitnami-docker-redis/blob/3fbfed26472877c478ef24436d3face461bf1d36/6.2/debian-10/prebuildfs/opt/bitnami/scripts/libos.sh). This means that for this one call to get_sentinel_master_info, we wait up to 60 seconds (5 seconds between each try, repeated 12 times) before the retry finally fails, allowing the start script to continue initializing and produce a healthy Pod. However, with the default configuration, Kubernetes kills the Pod after only 45 seconds. In other words, while we are waiting for Redis to become ready, so that it can be added to the Endpoints object and respond to itself, we kill the Pod unnecessarily.
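The arithmetic can be sketched as follows. The 60-second retry window comes straight from the retry defaults; the breakdown of the 45-second liveness deadline (initialDelaySeconds=20, periodSeconds=5, failureThreshold=5) is an assumed example whose total matches the figure in this report:

```shell
#!/usr/bin/env bash
# Worst-case time spent inside the get_sentinel_master_info retry,
# using retry_while's default parameters (12 tries, 5 seconds apart).
retries=12
sleep_between=5
retry_window=$(( retries * sleep_between ))                   # 60 seconds

# Time until Kubernetes declares the Pod dead under the default liveness
# probe (assumed breakdown; only the 45-second total is from this report).
initial_delay=20
period=5
failure_threshold=5
kill_after=$(( initial_delay + period * failure_threshold ))  # 45 seconds

# The retry outlives the probe, so the Pod is killed mid-initialization.
echo "retry window: ${retry_window}s; liveness kill after: ${kill_after}s"
```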
cc @ricosega and @carrodher as Author and Acceptor of what I suspect is root cause.
@jtslear, I found the problem.
At the moment I created PR #8641, this part of the code, added in PR #8563, was not yet merged into the start-node.sh script: Link to file blame
# check if there is a master
get_sentinel_master_info
redisRetVal=$?
if [[ $redisRetVal -ne 0 ]]; then
    # there is no master yet, master by default
So it will never end.
I am checking how to make both changes work together.
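The "never ends" interaction can be illustrated with a minimal retry_while-style loop (a simplified sketch, not the real libos.sh implementation; always_refused is a hypothetical stand-in for get_sentinel_master_info failing while the Pod is absent from the Endpoints object):

```shell
#!/usr/bin/env bash
# Simplified sketch of retry_while: run a command until it succeeds,
# sleeping between attempts, and fail once the retries are exhausted.
retry_while() {
    local cmd="$1"
    local retries="${2:-12}"
    local sleep_time="${3:-5}"
    local i
    for (( i = 1; i <= retries; i++ )); do
        if eval "$cmd"; then
            return 0
        fi
        sleep "$sleep_time"
    done
    return 1
}

# Stand-in for get_sentinel_master_info before the first Pod is resolvable:
# every attempt ends in "connection refused", i.e. a non-zero exit code.
always_refused() { return 1; }

# Zero sleep here for demonstration; with the defaults (12 tries, 5s apart)
# this is the 60-second window during which the liveness probe kills the Pod.
if ! retry_while always_refused 3 0; then
    echo "there is no master yet, defaulting to master"
fi
```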