
Dev: bootstrap: Add delay to start corosync when node list larger than 5 #985

Merged · 3 commits · Sep 13, 2022

Conversation

@liangxin1300 (Collaborator)

To avoid possible JOIN flood in corosync
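For context, a minimal sketch of the idea behind this change (the function name, the threshold constant, and the 0.25 s delay value are illustrative assumptions, not the actual crmsh code; only utils.start_service and the >5-node threshold come from this PR):

import time
from crmsh import utils  # utils.start_service, as referenced in this PR

JOIN_FLOOD_THRESHOLD = 5  # from the PR title; delay only for larger clusters
START_DELAY = 0.25        # the delay value here is an assumption

def start_corosync_staggered(node_list):
    """Start corosync.service on each node, sleeping between nodes when
    the cluster is large enough that simultaneous JOIN messages could
    flood corosync."""
    need_delay = len(node_list) > JOIN_FLOOD_THRESHOLD
    for node in node_list:
        if need_delay:
            time.sleep(START_DELAY)
        utils.start_service("corosync.service", remote_addr=node)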

@liangxin1300 liangxin1300 force-pushed the 20220617_start_all_delay branch from 6777cb9 to 01f438d Compare June 17, 2022 07:52
@epenchev left a comment

If I may suggest a change here. For example, if it's possible, something like this:

import time
from crmsh import utils  # utils.start_service, as used elsewhere in bootstrap

failed_nodes = []
for node in node_list:
    time.sleep(0.25)  # brief delay between nodes to avoid a JOIN flood
    try:
        utils.start_service("corosync.service", remote_addr=node)
    except (ValueError, IOError):
        failed_nodes.append(node)

# Continue with only the nodes that started successfully
node_list = [
    node for node in node_list if node not in failed_nodes
]

From what I see, we might get a ValueError from _do_action() or an IOError from parallax. This could fail the startup of the whole cluster because of one failing node.
We could even be more aggressive and catch pretty much any exception here.

@epenchev

From what I see, pcs 0.9 has something similar implemented: https://github.com/ClusterLabs/pcs/blob/pcs-0.9/pcs/cluster.py#L1197.
From the discussion back then in https://bugzilla.redhat.com/show_bug.cgi?id=1572886, which @liangxin1300 pointed me to, it looks like the complete remedy is to go with corosync 3 and knet (https://bugzilla.redhat.com/show_bug.cgi?id=1572886#c26), as @liangxin1300 has already spotted.
So I'm guessing that's why the latest pcs 0.10 and above no longer implement this.

@liangxin1300 (Collaborator, Author)

If I may suggest a change here. For example, if it's possible, something like this:

import time
from crmsh import utils  # utils.start_service, as used elsewhere in bootstrap

failed_nodes = []
for node in node_list:
    time.sleep(0.25)  # brief delay between nodes to avoid a JOIN flood
    try:
        utils.start_service("corosync.service", remote_addr=node)
    except (ValueError, IOError):
        failed_nodes.append(node)

# Continue with only the nodes that started successfully
node_list = [
    node for node in node_list if node not in failed_nodes
]

From what I see, we might get a ValueError from _do_action() or an IOError from parallax. This could fail the startup of the whole cluster because of one failing node. We could even be more aggressive and catch pretty much any exception here.

Thanks for the review and the nice suggestion!
I will consider submitting a new commit or a new PR to implement this suggestion :)

@liangxin1300 liangxin1300 force-pushed the 20220617_start_all_delay branch 6 times, most recently from 2dc9a7e to a2873d0 Compare August 26, 2022 06:40
@liangxin1300 (Collaborator, Author)

Added a second commit:

Dev: utils: Refactor class ServiceManager, to show all nodes' status when running in parallel

- Return success node list for start/stop/enable/disable service
- Check corosync-qdevice.service is available before using qdevice

When some nodes fail to start the cluster service, the output looks like:

# crm cluster start --all
INFO: BEGIN Starting pacemaker ...
ERROR: Failed to start pacemaker.service on 15sp4-2: Exited with error code 1, Error output: A dependency job for pacemaker.service failed. See 'journalctl -xe' for details.

INFO: END Starting pacemaker.
INFO: Cluster services started on 15sp4-1
INFO: Cluster services started on 15sp4-3
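As a rough sketch of the "return success node list" behavior shown above (the function name and messages are hypothetical; the actual ServiceManager refactor may differ):

from crmsh import utils  # utils.start_service, as referenced in this PR

def start_service_on_nodes(service, node_list):
    # Try every node, report failures, and return the nodes that succeeded,
    # so the caller can print "Cluster services started on <node>" for each.
    success_nodes = []
    for node in node_list:
        try:
            utils.start_service(service, remote_addr=node)
            success_nodes.append(node)
        except (ValueError, IOError) as err:
            print("ERROR: Failed to start {} on {}: {}".format(service, node, err))
    return success_nodes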

@liangxin1300 liangxin1300 force-pushed the 20220617_start_all_delay branch from a2873d0 to 44bc0cb Compare August 26, 2022 08:12
@nicholasyang2022 (Collaborator)

Should we check the return value of start_pacemaker in init_cluster_local?

@liangxin1300 (Collaborator, Author)

Should we check the return value of start_pacemaker in init_cluster_local?

No need; it will either succeed or raise an exception.
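A minimal sketch of that "succeed or raise" contract (hypothetical wrapper; not the actual crmsh code):

from crmsh import utils

def start_pacemaker(remote_addr=None):
    # Either this returns normally, or utils.start_service raises
    # (e.g. ValueError/IOError), so init_cluster_local has no
    # boolean return value it needs to check.
    utils.start_service("pacemaker.service", remote_addr=remote_addr)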

@nicholasyang2022 (Collaborator)

So whether ServiceManager._do_action raises an exception on failure depends on the number of nodes it is called with. That behavior is inconsistent, and I think it is not a good design.

@liangxin1300 (Collaborator, Author)

So whether ServiceManager._do_action raises an exception on failure depends on the number of nodes it is called with. That behavior is inconsistent, and I think it is not a good design.

  • For the multi-node case, _do_action calls parallax to execute the command in parallel, and any parallax.Error exception is stored in the return object self.parallax_res. The purpose of this PR is to show every node's status, whether the command succeeded or failed, so the method _handle_action_result handles that exception. If we raised the error immediately, we would lose the other nodes' status.
  • For the remote and local cases, an exception is raised when an error happens.

And I think in all cases, when the command fails, an exception will be raised.
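An illustrative sketch of that collect-then-report pattern (using concurrent.futures as a stand-in for parallax; all names below are hypothetical, not the real _do_action/_handle_action_result):

from concurrent.futures import ThreadPoolExecutor

def do_action_on_all(action_fn, node_list):
    # Run action_fn(node) on every node in parallel and keep either the
    # result or the exception per node, instead of raising immediately.
    with ThreadPoolExecutor() as pool:
        futures = {node: pool.submit(action_fn, node) for node in node_list}
    return {node: (f.exception() or f.result()) for node, f in futures.items()}

def handle_action_result(results):
    # Report every node's status first, then raise if anything failed,
    # so one failing node does not hide the other nodes' status.
    failed = {n: r for n, r in results.items() if isinstance(r, Exception)}
    for node in results:
        if node in failed:
            print("ERROR: failed on {}: {}".format(node, failed[node]))
        else:
            print("INFO: succeeded on {}".format(node))
    if failed:
        raise ValueError("action failed on: " + ", ".join(failed))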

@zzhou1 (Contributor) commented Aug 29, 2022

From the discussion back then in https://bugzilla.redhat.com/show_bug.cgi?id=1572886

Good pointer, though I worry less about the concern in the above bug, since Parallax does wait for the command to return. That is, the design here does wait for corosync to finish starting on one node before moving on to the next.

…when running in parallel

- Return success node list for start/stop/enable/disable service
- Check corosync-qdevice.service is available before using qdevice
@liangxin1300 liangxin1300 force-pushed the 20220617_start_all_delay branch from 44bc0cb to eb933c5 Compare September 1, 2022 14:43
@epenchev left a comment

Looks good to me, great work.

@gao-yan (Member) left a comment

LGTM, thanks.

@zzhou1 zzhou1 merged commit 57fa9d9 into ClusterLabs:master Sep 13, 2022