
Dev: bootstrap: Add delay to start corosync when node list larger than 5 #985

Merged · 3 commits · Sep 13, 2022

Conversation

@liangxin1300 (Collaborator)

To avoid possible JOIN flood in corosync
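For context, a minimal sketch of the idea behind this change (the function name, the threshold constant, and the 0.25 s delay value are illustrative assumptions, not the actual crmsh code; only utils.start_service and the >5-node threshold come from this PR):

import time
from crmsh import utils  # utils.start_service, as referenced in this PR

JOIN_FLOOD_THRESHOLD = 5  # from the PR title; delay only for larger clusters
START_DELAY = 0.25        # the delay value here is an assumption

def start_corosync_staggered(node_list):
    """Start corosync.service on each node, sleeping between nodes when
    the cluster is large enough that simultaneous JOIN messages could
    flood corosync."""
    need_delay = len(node_list) > JOIN_FLOOD_THRESHOLD
    for node in node_list:
        if need_delay:
            time.sleep(START_DELAY)
        utils.start_service("corosync.service", remote_addr=node)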

@liangxin1300 liangxin1300 force-pushed the 20220617_start_all_delay branch from 6777cb9 to 01f438d Compare June 17, 2022 07:52
@epenchev left a comment

If I may suggest a change here. For example, if it's possible, something like this:

import time
from crmsh import utils  # utils.start_service, as used elsewhere in bootstrap

failed_nodes = []
for node in node_list:
    time.sleep(0.25)  # brief delay between nodes to avoid a JOIN flood
    try:
        utils.start_service("corosync.service", remote_addr=node)
    except (ValueError, IOError):
        failed_nodes.append(node)

# Continue with only the nodes that started successfully
node_list = [
    node for node in node_list if node not in failed_nodes
]

From what I see, we might get a ValueError from _do_action() or an IOError from parallax. This could fail the startup of the whole cluster because of one failing node.
We could even be more aggressive and catch pretty much any exception here.

@epenchev

From what I see, pcs 0.9 has something similar implemented: https://github.com/ClusterLabs/pcs/blob/pcs-0.9/pcs/cluster.py#L1197.
From the discussion back then in https://bugzilla.redhat.com/show_bug.cgi?id=1572886, which @liangxin1300 pointed me to, it looks like the complete remedy is to go with corosync 3 and knet (https://bugzilla.redhat.com/show_bug.cgi?id=1572886#c26), as @liangxin1300 has already spotted.
So I'm guessing that's why the latest pcs 0.10 and above no longer implement this.

@liangxin1300 (Collaborator, Author)

If I may suggest a change here. For example, if it's possible, something like this:

import time
from crmsh import utils  # utils.start_service, as used elsewhere in bootstrap

failed_nodes = []
for node in node_list:
    time.sleep(0.25)  # brief delay between nodes to avoid a JOIN flood
    try:
        utils.start_service("corosync.service", remote_addr=node)
    except (ValueError, IOError):
        failed_nodes.append(node)

# Continue with only the nodes that started successfully
node_list = [
    node for node in node_list if node not in failed_nodes
]

From what I see, we might get a ValueError from _do_action() or an IOError from parallax. This could fail the startup of the whole cluster because of one failing node. We could even be more aggressive and catch pretty much any exception here.

Thanks for the review and the nice suggestion!
I will consider submitting a new commit or a new PR to implement this suggestion :)

@liangxin1300 liangxin1300 force-pushed the 20220617_start_all_delay branch 6 times, most recently from 2dc9a7e to a2873d0 Compare August 26, 2022 06:40
@liangxin1300 (Collaborator, Author)

Added a second commit:

Dev: utils: Refactor class ServiceManager, to show all nodes' status when running in parallel

- Return success node list for start/stop/enable/disable service
- Check corosync-qdevice.service is available before using qdevice

When some nodes fail to start the cluster service, the output looks like:

# crm cluster start --all
INFO: BEGIN Starting pacemaker ...
ERROR: Failed to start pacemaker.service on 15sp4-2: Exited with error code 1, Error output: A dependency job for pacemaker.service failed. See 'journalctl -xe' for details.

INFO: END Starting pacemaker.
INFO: Cluster services started on 15sp4-1
INFO: Cluster services started on 15sp4-3
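As a rough sketch of the "return success node list" behavior shown above (the function name and messages are hypothetical; the actual ServiceManager refactor may differ):

from crmsh import utils  # utils.start_service, as referenced in this PR

def start_service_on_nodes(service, node_list):
    # Try every node, report failures, and return the nodes that succeeded,
    # so the caller can print "Cluster services started on <node>" for each.
    success_nodes = []
    for node in node_list:
        try:
            utils.start_service(service, remote_addr=node)
            success_nodes.append(node)
        except (ValueError, IOError) as err:
            print("ERROR: Failed to start {} on {}: {}".format(service, node, err))
    return success_nodes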

@liangxin1300 liangxin1300 force-pushed the 20220617_start_all_delay branch from a2873d0 to 44bc0cb Compare August 26, 2022 08:12
@nicholasyang2022 (Collaborator)

Should we check the return value of start_pacemaker in init_cluster_local?

@liangxin1300 (Collaborator, Author)

Should we check the return value of start_pacemaker in init_cluster_local?

No need; it will either succeed or raise an exception.
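A minimal sketch of that "succeed or raise" contract (hypothetical wrapper; not the actual crmsh code):

from crmsh import utils

def start_pacemaker(remote_addr=None):
    # Either this returns normally, or utils.start_service raises
    # (e.g. ValueError/IOError), so init_cluster_local has no
    # boolean return value it needs to check.
    utils.start_service("pacemaker.service", remote_addr=remote_addr)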

@nicholasyang2022 (Collaborator)

So whether ServiceManager._do_action raises an exception on failure depends on the number of nodes it is called with. That behavior is inconsistent, and I think it is not a good design.

@liangxin1300 (Collaborator, Author)

So whether ServiceManager._do_action raises an exception on failure depends on the number of nodes it is called with. That behavior is inconsistent, and I think it is not a good design.

  • For the multi-node case, _do_action calls parallax to execute the command in parallel, and any parallax.Error exception is stored in the return object self.parallax_res. The purpose of this PR is to show every node's status, whether the command succeeded or failed, so the method _handle_action_result handles that exception. If we raised the error immediately, we would lose the other nodes' status.
  • For the remote and local cases, an exception is raised when an error happens.

And I think in all cases, when the command fails, an exception will be raised.
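An illustrative sketch of that collect-then-report pattern (using concurrent.futures as a stand-in for parallax; all names below are hypothetical, not the real _do_action/_handle_action_result):

from concurrent.futures import ThreadPoolExecutor

def do_action_on_all(action_fn, node_list):
    # Run action_fn(node) on every node in parallel and keep either the
    # result or the exception per node, instead of raising immediately.
    with ThreadPoolExecutor() as pool:
        futures = {node: pool.submit(action_fn, node) for node in node_list}
    return {node: (f.exception() or f.result()) for node, f in futures.items()}

def handle_action_result(results):
    # Report every node's status first, then raise if anything failed,
    # so one failing node does not hide the other nodes' status.
    failed = {n: r for n, r in results.items() if isinstance(r, Exception)}
    for node in results:
        if node in failed:
            print("ERROR: failed on {}: {}".format(node, failed[node]))
        else:
            print("INFO: succeeded on {}".format(node))
    if failed:
        raise ValueError("action failed on: " + ", ".join(failed))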

@zzhou1 (Contributor) commented Aug 29, 2022

From the discussion back then in https://bugzilla.redhat.com/show_bug.cgi?id=1572886

Good pointer, though I worry less about the concern in the above bug, since Parallax does wait for the command to return. That is, the design here does wait for corosync to finish starting on one node before moving on to the next.

…when running in parallel

- Return success node list for start/stop/enable/disable service
- Check corosync-qdevice.service is available before using qdevice
@liangxin1300 liangxin1300 force-pushed the 20220617_start_all_delay branch from 44bc0cb to eb933c5 Compare September 1, 2022 14:43
@epenchev left a comment

Looks good to me, great work.

@gao-yan (Member) left a comment

LGTM, thanks.

@zzhou1 zzhou1 merged commit 57fa9d9 into ClusterLabs:master Sep 13, 2022