Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(election): retry participating if the lease is improperly lost #60

Merged
merged 3 commits into from
Oct 3, 2023

Conversation

jlevesy
Copy link
Owner

@jlevesy jlevesy commented Oct 2, 2023

What Does This PR do?

This PR fixes an issue where a prometheus-elector instance could stop participating into the election indefinitely if it loses (ie: without releasing it explicitly) the election lease.

In that case, the LeaderElector.Run method (see here) exits, this is actually documented 🤦 .

But the previous implementation didn't account for that and it was leading to the actual member not taking part into the election anymore.

This PR fixes this issue by checking if the given runCtx is done or not after Run exits.

  • If the context is not done, it means that we're still supposed to participate to the election, so we're recalling elector.Run to join back the election again.
  • If the context is done, then it means that we're in a proper termination. So it signals its termination and exits.

How to Test This PR?

# Run the agent example
make run_agent_example

# In another terminal monitor the logs
stern .

# In a third terminal, set the `spec.holderIdentity` field to  ` prometheus-elector-dev-1`
k edit lease prometheus-elector-lease

# You should see elector-1 take the ownership, and elector-0 drop it
# after a while and then start grabbing again the lease.
# Then do the opposite operation, and you'll see the roles reversing. On the current main, prometheus-elector-0 would have been dead in the water.

Good PR Checklist

  • Addresses one issue
  • Adds/Updates unit tests
  • Adds/Updates the documentation
  • Opened against the right branch
  • Correctly Labeled

Additional Notes

@jlevesy jlevesy added the bug Something isn't working label Oct 2, 2023
@jlevesy jlevesy self-assigned this Oct 2, 2023
@jlevesy jlevesy force-pushed the jl/fix-lease-loss branch from f7112b6 to 0fb39c0 Compare October 2, 2023 15:44
@jlevesy jlevesy force-pushed the jl/fix-lease-loss branch 3 times, most recently from 1ce7b7c to 1512f81 Compare October 2, 2023 17:14
@jlevesy jlevesy force-pushed the jl/fix-lease-loss branch from 1512f81 to a502c4d Compare October 2, 2023 17:15
@jlevesy jlevesy merged commit c6c08b3 into main Oct 3, 2023
@jlevesy jlevesy deleted the jl/fix-lease-loss branch October 3, 2023 07:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants