Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scheduler bounces LiveActions between delayed and policy-delayed #4655

Closed
nmaludy opened this issue May 1, 2019 · 6 comments · Fixed by #4667
Closed

Scheduler bounces LiveActions between delayed and policy-delayed #4655

nmaludy opened this issue May 1, 2019 · 6 comments · Fixed by #4667
Assignees
Milestone

Comments

@nmaludy
Copy link
Member

nmaludy commented May 1, 2019

SUMMARY

I noticed this when debugging another problem and wanted to say something in case this behavior wasn't intended.

When there are executions that are delayed by a policy, the st2scheduler is constantly flipping them between policy-delayed and delayed state. Each state change is accumulated in the LiveAction instance and causes fairly large objects, especially if executions are delayed for a long period of time (large backlog).

ISSUE TYPE
  • Bug Report
  • Feature Idea
STACKSTORM VERSION
$ st2 --version
st2 3.0, on Python 2.7.5
OS / ENVIRONMENT / INSTALL METHOD
CentOS 7.6
Puppet-st2
STEPS TO REPRODUCE

Create the following policy `/opt/stackstorm/packs/default/policies/core_local_concurrency.yaml

pack: default
name: core.local.concurrency
description: Limits the concurrent executions for core.local
enabled: true
resource_ref: core.local
policy_type: action.concurrency
parameters:
  action: delay
  threshold: 10

Run the following script to kick off a bunch of core.local excutions

#!/bin/bash
for NUM in `seq -w 0 250`; do
  st2 action execute core.local cmd='sleep 120' timeout=180
done

Tail the st2scheduler logs

tail -f /var/log/st2/st2scheduler.log
ACTUAL RESULTS
2019-05-01 15:02:59,673 140648388594096 INFO handler [-] Scheduling liveaction: 5cc9ec569387ef586c0c82c3 (queue_item_id=5cc9ec569387ef58ba380e73) (queue_item_id='5cc9ec569387ef58ba380e73',liveaction_id='5cc9ec569387ef586c0c82c3')
2019-05-01 15:02:59,711 140648388594096 INFO policies [-] Applying policy "default.core.local.concurrency" (action.concurrency) for liveaction "5cc9ec569387ef586c0c82c3".
2019-05-01 15:02:59,848 140648388594096 AUDIT action [-] The status of action execution is changed from delayed to policy-delayed. <LiveAction.id=5cc9ec569387ef586c0c82c3, ActionExecution.id=5cc9ec569387ef586c0c82c4> (liveaction_db={'status': 'policy-delayed', 'runner_info': {}, 'workflow_execution': None, 'parameters': {u'cmd': u'sleep 120', u'timeout': 180}, 'action_is_workflow': False, 'start_timestamp': '2019-05-01 18:58:30.534541+00:00', 'delay': None, 'callback': {}, 'task_execution': None, 'notify': None, 'result': {}, 'context': {u'user': u'st2admin', u'pack': u'core'}, 'action': u'core.local', 'id': '5cc9ec569387ef586c0c82c3', 'end_timestamp': None},action_execution_db={'status': u'policy-delayed', 'start_timestamp': '2019-05-01 18:58:30.534541+00:00', 'web_url': u'https://stackstorm.domain.tld#/history/5cc9ec569387ef586c0c82c4/general', 'log': [{u'status': u'requested', u'timestamp': datetime.datetime(2019, 5, 1, 18, 58, 30, 550000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 58, 40, 199000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 58, 40, 370000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 58, 55, 539000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 58, 55, 781000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 59, 14, 435000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 59, 14, 762000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 59, 38, 182000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 59, 38, 475000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 0, 5, 460000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 0, 5, 955000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 0, 34, 537000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 0, 34, 650000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 1, 3, 703000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 1, 3, 919000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 1, 33, 355000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 1, 33, 639000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 2, 3, 76000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 2, 3, 286000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 2, 31, 627000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 2, 31, 774000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 2, 59, 814000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}], 'parameters': {u'cmd': u'sleep 120', u'timeout': 180}, 'trigger_type': {}, 'runner': {u'runner_module': u'local_shell_command_runner', u'uid': u'runner_type:local-shell-cmd', u'runner_package': u'local_runner', u'enabled': True, u'name': u'local-shell-cmd', u'output_schema': {}, u'runner_parameters': {u'sudo': {u'default': False, u'type': u'boolean', u'description': u'The command will be executed with sudo.'}, u'sudo_password': {u'default': None, u'secret': '********', u'required': False, u'type': u'string', u'description': u'Sudo password. To be used when paswordless sudo is not allowed.'}, u'env': {u'type': u'object', u'description': u'Environment variables which will be available to the command(e.g. key1=val1,key2=val2)'}, u'cmd': {u'type': u'string', u'description': u'Arbitrary Linux command to be executed on the host.'}, u'kwarg_op': {u'default': u'--', u'type': u'string', u'description': u'Operator to use in front of keyword args i.e. "--" or "-".'}, u'timeout': {u'default': 60, u'type': u'integer', u'description': u"Action timeout in seconds. Action will get killed if it doesn't finish in timeout seconds."}, u'cwd': {u'type': u'string', u'description': u'Working directory where the command will be executed in'}}, u'id': u'59b7d5e6a814c02c97e8a96a', u'description': u'A runner to execute local actions as a fixed user.'}, 'trigger_instance': {}, 'rule': {}, 'children': [], 'delay': None, 'task_execution': None, 'trigger': {}, 'parent': None, 'result': {}, 'context': {u'user': u'st2admin', u'pack': u'core'}, 'workflow_execution': None, 'action': {u'notify': {}, u'name': u'local', u'runner_type': u'local-shell-cmd', u'tags': [], u'description': u'Action that executes an arbitrary Linux command on the localhost.', u'enabled': True, u'entry_point': u'', u'metadata_file': u'actions/local.yaml', u'output_schema': {}, u'uid': u'action:core:local', u'parameters': {u'cmd': {u'required': True, u'type': u'string', u'description': u'Arbitrary Linux command to be executed on the local host.'}, u'sudo': {u'immutable': True}}, u'ref': u'core.local', u'id': u'59b7d5e6a814c02c97e8a97f', u'pack': u'core'}, 'liveaction': {u'runner_info': {}, u'parameters': {u'cmd': u'sleep 120', u'timeout': 180}, u'action_is_workflow': False, u'callback': {}, u'action': u'core.local', u'id': u'5cc9ec569387ef586c0c82c3'}, 'id': '5cc9ec569387ef586c0c82c4', 'end_timestamp': None})
2019-05-01 15:02:59,923 140648388594096 INFO action [-] The status of action execution is changed from delayed to policy-delayed. <LiveAction.id=5cc9ec569387ef586c0c82c3, ActionExecution.id=5cc9ec569387ef586c0c82c4>
2019-05-01 15:02:59,924 140648388594096 INFO handler [-] Liveaction (5cc9ec569387ef586c0c82c3) Status Pre-Run: policy-delayed (5cc9ec569387ef58ba380e73) (liveaction_status='policy-delayed',queue_item_id='5cc9ec569387ef58ba380e73',liveaction_id='5cc9ec569387ef586c0c82c3')
2019-05-01 15:02:59,983 140648388594096 AUDIT action [-] The status of action execution is changed from policy-delayed to delayed. <LiveAction.id=5cc9ec569387ef586c0c82c3, ActionExecution.id=5cc9ec569387ef586c0c82c4> (liveaction_db={'status': 'delayed', 'runner_info': {}, 'workflow_execution': None, 'parameters': {u'cmd': u'sleep 120', u'timeout': 180}, 'action_is_workflow': False, 'start_timestamp': '2019-05-01 18:58:30.534541+00:00', 'delay': None, 'callback': {}, 'task_execution': None, 'notify': None, 'result': {}, 'context': {u'user': u'st2admin', u'pack': u'core'}, 'action': u'core.local', 'id': '5cc9ec569387ef586c0c82c3', 'end_timestamp': None},action_execution_db={'status': u'delayed', 'start_timestamp': '2019-05-01 18:58:30.534541+00:00', 'web_url': u'https://stackstorm.domain.tld/#/history/5cc9ec569387ef586c0c82c4/general', 'log': [{u'status': u'requested', u'timestamp': datetime.datetime(2019, 5, 1, 18, 58, 30, 550000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 58, 40, 199000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 58, 40, 370000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 58, 55, 539000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 58, 55, 781000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 59, 14, 435000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 59, 14, 762000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 59, 38, 182000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 59, 38, 475000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 0, 5, 460000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 0, 5, 955000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 0, 34, 537000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 0, 34, 650000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 1, 3, 703000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 1, 3, 919000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 1, 33, 355000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 1, 33, 639000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 2, 3, 76000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 2, 3, 286000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 2, 31, 627000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 2, 31, 774000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 2, 59, 814000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 2, 59, 965000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}], 'parameters': {u'cmd': u'sleep 120', u'timeout': 180}, 'trigger_type': {}, 'runner': {u'runner_module': u'local_shell_command_runner', u'uid': u'runner_type:local-shell-cmd', u'runner_package': u'local_runner', u'enabled': True, u'name': u'local-shell-cmd', u'output_schema': {}, u'runner_parameters': {u'sudo': {u'default': False, u'type': u'boolean', u'description': u'The command will be executed with sudo.'}, u'sudo_password': {u'default': None, u'secret': '********', u'required': False, u'type': u'string', u'description': u'Sudo password. To be used when paswordless sudo is not allowed.'}, u'env': {u'type': u'object', u'description': u'Environment variables which will be available to the command(e.g. key1=val1,key2=val2)'}, u'cmd': {u'type': u'string', u'description': u'Arbitrary Linux command to be executed on the host.'}, u'kwarg_op': {u'default': u'--', u'type': u'string', u'description': u'Operator to use in front of keyword args i.e. "--" or "-".'}, u'timeout': {u'default': 60, u'type': u'integer', u'description': u"Action timeout in seconds. Action will get killed if it doesn't finish in timeout seconds."}, u'cwd': {u'type': u'string', u'description': u'Working directory where the command will be executed in'}}, u'id': u'59b7d5e6a814c02c97e8a96a', u'description': u'A runner to execute local actions as a fixed user.'}, 'trigger_instance': {}, 'rule': {}, 'children': [], 'delay': None, 'task_execution': None, 'trigger': {}, 'parent': None, 'result': {}, 'context': {u'user': u'st2admin', u'pack': u'core'}, 'workflow_execution': None, 'action': {u'notify': {}, u'name': u'local', u'runner_type': u'local-shell-cmd', u'tags': [], u'description': u'Action that executes an arbitrary Linux command on the localhost.', u'enabled': True, u'entry_point': u'', u'metadata_file': u'actions/local.yaml', u'output_schema': {}, u'uid': u'action:core:local', u'parameters': {u'cmd': {u'required': True, u'type': u'string', u'description': u'Arbitrary Linux command to be executed on the local host.'}, u'sudo': {u'immutable': True}}, u'ref': u'core.local', u'id': u'59b7d5e6a814c02c97e8a97f', u'pack': u'core'}, 'liveaction': {u'runner_info': {}, u'parameters': {u'cmd': u'sleep 120', u'timeout': 180}, u'action_is_workflow': False, u'callback': {}, u'action': u'core.local', u'id': u'5cc9ec569387ef586c0c82c3'}, 'id': '5cc9ec569387ef586c0c82c4', 'end_timestamp': None})
@Kami
Copy link
Member

Kami commented May 2, 2019

Thanks for reporting this.

This indeed looks like a bug / race.

@m4dcoder will you have time to have a look this / next week (since you are the last person who worked on that and probably have the most context , otherwise I can have a look as well).

@nmaludy
Copy link
Member Author

nmaludy commented May 2, 2019

@m4dcoder ^^ (@Kami mistyped your name above)

@warrenvw warrenvw added the bug label May 2, 2019
@Kami Kami added this to the 3.0.1 milestone May 5, 2019
@m4dcoder
Copy link
Contributor

m4dcoder commented May 6, 2019

This is not a race. It's more of an annoying behavior. We introduced policy-delayed when we refactored the scheduler. An action execution goes to policy-delayed when the concurrency policy enforce delay for the execution. When the scheduler sees the policy enforcement, it delays the execution. In this case, the execution are being delayed and rescheduled and delayed again by policy. We can refactor the scheduler/policies to reduce and silence the effect.

@m4dcoder m4dcoder removed the bug label May 6, 2019
@m4dcoder
Copy link
Contributor

m4dcoder commented May 6, 2019

This is not a bug either to be precise. Is it performing the way it's suppose to? Did the concurrency policy delayed the action execution? The behavior of having many status change recorded in the object can be categorized as annoying.

@m4dcoder m4dcoder self-assigned this May 6, 2019
@nmaludy
Copy link
Member Author

nmaludy commented May 6, 2019

@m4dcoder i found this behavior when debugging an issue where a bunch of our actions were stuck in a delayed state. I haven't able to create steps to reproduce that issue, but noticed that constantly reading/writing these actions from the database in this "busy loop" was probably not helping anything.

@Kami
Copy link
Member

Kami commented May 7, 2019

I was able to reproduce this issue, at least the flipping back and forth between policy-delayed and delayed status part.

Even if the end result and policies work correctly, this flipping back and forth is not desired, especially because on each of the state change we need to update execution object (log attribute on it) and that's very wasteful (expensive and unnecessary db write). It also causes unnecessary growing of the Execution object.

Example output:

st2 execution get  5cd14ac40761292469c559aa --attr log
log: - status: requested
  timestamp: '2019-05-07T09:07:16.000000Z'
- status: policy-delayed
  timestamp: '2019-05-07T09:07:16.000000Z'
- status: delayed
  timestamp: '2019-05-07T09:07:16.000000Z'
- status: policy-delayed
  timestamp: '2019-05-07T09:07:18.000000Z'
- status: delayed
  timestamp: '2019-05-07T09:07:18.000000Z'
- status: policy-delayed
  timestamp: '2019-05-07T09:07:20.000000Z'
- status: delayed
  timestamp: '2019-05-07T09:07:20.000000Z'
- status: policy-delayed
  timestamp: '2019-05-07T09:07:22.000000Z'
- status: delayed
  timestamp: '2019-05-07T09:07:22.000000Z'
- status: policy-delayed
  timestamp: '2019-05-07T09:07:24.000000Z'
- status: delayed
  timestamp: '2019-05-07T09:07:24.000000Z'
- status: policy-delayed
  timestamp: '2019-05-07T09:07:26.000000Z'
- status: delayed
  timestamp: '2019-05-07T09:07:26.000000Z'
- status: policy-delayed
  timestamp: '2019-05-07T09:07:28.000000Z'
- status: delayed
  timestamp: '2019-05-07T09:07:28.000000Z'
...

It could cause very massive execution objects for executions which are delayed for longer time periods (it flips status back and forth around every two seconds which is the POLICY_DELAYED_EXECUTION_RESCHEDULE_TIME_MS interval).

And it also generates tons of (unnecessary) log traffic in st2scheduler.

EDIT: Actual policies itself seem to work correctly, but this flipping back and forth still needs to be addressed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants