Which chart:
https://github.com/bitnami/charts/tree/master/bitnami/redis
Version: 15.7.5
Introduced in: #8641
Describe the bug
Often, new installations of Redis with Sentinel fail due to an initialization failure during start up. #8641 introduced a retry which is doing its job, but we now run into the upper limit of the chart's default liveness probe, leading to an occasional failure of the Redis deployment to initialize. The StatefulSet gets hung up bringing the first Pod online as either a primary or a replica. Under the right set of conditions we don't see this; more often, however, it leads to a nearly infinite loop, as that first Pod cannot come online.
If we are lucky, the Endpoints are populated at the very last second, allowing the deployment to come online, though Kubernetes still intervenes and removes that first Pod for hitting the limits of the liveness probe.
❗ I suspect this may also impact CI testing of this helm chart.
Workaround!
Bump the liveness probe's initial delay to something higher than the default. For example:
replica:
  livenessProbe:
    initialDelaySeconds: 60
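The same override can also be passed on the helm command line, extending the install command used in the reproduction steps (a sketch; it assumes the chart exposes this value at replica.livenessProbe.initialDelaySeconds, matching the values snippet above):

```shell
# Sketch: same chart install as in the reproduction steps, with the liveness
# probe's initial delay raised from the chart default to 60 seconds.
helm upgrade a bitnami/redis --install \
  --set auth.password=hunter5 \
  --set sentinel.enabled=true \
  --set replica.livenessProbe.initialDelaySeconds=60
```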
To Reproduce
Steps to reproduce the behavior:
helm upgrade a bitnami/redis --install --set auth.password=hunter5 --set sentinel.enabled=true
View the Redis container logs of the first Pod and observe one of the following:
Connection refused repeated until the Pod is killed
Connection refused and then Redis starts, immediately followed by Kubernetes killing the Pod, resulting in this process starting over
Eventually we land in a CrashLoopBackOff.
Expected behavior
A Redis deployment should come online with zero Pod restarts when it is first created.
Additional context
The retry_while function is using the default parameters (Reference: https://github.com/bitnami/bitnami-docker-redis/blob/3fbfed26472877c478ef24436d3face461bf1d36/6.2/debian-10/prebuildfs/opt/bitnami/scripts/libos.sh). This means that for this one call to get_sentinel_master_info, we wait up to 60 seconds (5 seconds between each try, repeated 12 times) before the retry finally fails, allowing the start script to continue initializing and produce a healthy Pod. However, with the default configuration, Kubernetes kills the Pod after only 45 seconds. In other words, while we are waiting for Redis to become ready, so that it can be added to the Endpoints object and respond to itself, we kill the Pod unnecessarily.
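The arithmetic can be sketched as follows. The 60-second retry window comes straight from the retry defaults; the breakdown of the 45-second liveness deadline (initialDelaySeconds=20, periodSeconds=5, failureThreshold=5) is an assumed example whose total matches the figure in this report:

```shell
#!/usr/bin/env bash
# Worst-case time spent inside the get_sentinel_master_info retry,
# using retry_while's default parameters (12 tries, 5 seconds apart).
retries=12
sleep_between=5
retry_window=$(( retries * sleep_between ))                   # 60 seconds

# Time until Kubernetes declares the Pod dead under the default liveness
# probe (assumed breakdown; only the 45-second total is from this report).
initial_delay=20
period=5
failure_threshold=5
kill_after=$(( initial_delay + period * failure_threshold ))  # 45 seconds

# The retry outlives the probe, so the Pod is killed mid-initialization.
echo "retry window: ${retry_window}s; liveness kill after: ${kill_after}s"
```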
cc @ricosega and @carrodher as Author and Acceptor of what I suspect is root cause.
@jtslear, I found the problem.
At the moment I created PR #8641, this part of the code, added in PR #8563, was not yet merged into the start-node.sh script: Link to file blame
# check if there is a master
get_sentinel_master_info
redisRetVal=$?
if [[ $redisRetVal -ne 0 ]]; then
    # there is no master yet, master by default
So it will never end.
I am checking how to make both changes work together.
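The "never ends" interaction can be illustrated with a minimal retry_while-style loop (a simplified sketch, not the real libos.sh implementation; always_refused is a hypothetical stand-in for get_sentinel_master_info failing while the Pod is absent from the Endpoints object):

```shell
#!/usr/bin/env bash
# Simplified sketch of retry_while: run a command until it succeeds,
# sleeping between attempts, and fail once the retries are exhausted.
retry_while() {
    local cmd="$1"
    local retries="${2:-12}"
    local sleep_time="${3:-5}"
    local i
    for (( i = 1; i <= retries; i++ )); do
        if eval "$cmd"; then
            return 0
        fi
        sleep "$sleep_time"
    done
    return 1
}

# Stand-in for get_sentinel_master_info before the first Pod is resolvable:
# every attempt ends in "connection refused", i.e. a non-zero exit code.
always_refused() { return 1; }

# Zero sleep here for demonstration; with the defaults (12 tries, 5s apart)
# this is the 60-second window during which the liveness probe kills the Pod.
if ! retry_while always_refused 3 0; then
    echo "there is no master yet, defaulting to master"
fi
```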