-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dev: bootstrap: Add delay to start corosync when node list larger than 5 #985
Dev: bootstrap: Add delay to start corosync when node list larger than 5 #985
Conversation
6777cb9
to
01f438d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I may suggest a change here.
For example if it's possible to do something like this:
failed_nodes = []
for node in node_list:
time.sleep(0.25)
try:
utils.start_service("corosync.service", remote_addr=node)
except (ValueError, IOError):
failed_nodes.append(node)
node_list = [
node for node in node_list if node not in failed_nodes
]
From what I see we might get ValueError from _do_action() or IOError from parallax. This could fail the startup of the cluster cause of one failing node.
We can even be more aggressive and catch pretty much anything as exception here.
From what I see pcs 0.9 has something similar https://github.com/ClusterLabs/pcs/blob/pcs-0.9/pcs/cluster.py#L1197 implemented. |
Thanks for the review and nice suggestion! |
2dc9a7e
to
a2873d0
Compare
Add second commit Dev: utils: Refactor class ServiceManager, to show all nodes' status when running in parallel
When some nodes failed to start cluster service:
|
a2873d0
to
44bc0cb
Compare
Should we check the return value of |
No need, it will success or raise an exception |
So whether |
And I think for all the case, when command failed, the exception will be raised |
Good pointer, though I have less worry about the concern in the above bug, since Parallax does wait cmd to return. That says, the design here does wait corosync finish starting on a node before moving on with the next node. |
…when running in parallel - Return success node list for start/stop/enable/disable service - Check corosync-qdevice.service is available before using qdevice
44bc0cb
to
eb933c5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, great work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks.
To avoid possible JOIN flood in corosync