Concurrency policy fails when set to 1 #4481

Closed

djha-skin opened this issue Dec 19, 2018 · 5 comments

djha-skin commented Dec 19, 2018

SUMMARY

I have created a very simple orquesta workflow, solely so that I can set a concurrency policy over a particular set of actions.

I have set a concurrency policy on this workflow with the threshold set to 1, hoping to mutex the workflow so that only one instance runs at a time.

Unfortunately, this policy does not work, due to what looks like a race condition.

ISSUE TYPE
  • Bug Report
STACKSTORM VERSION

st2 --version:

st2 2.10.0, on Python 2.7.5
OS / ENVIRONMENT / INSTALL METHOD

I have a two-node, active-active, highly-available stackstorm cluster. The cluster is set up with everything running on every node: the garbage collector, the scheduler, the runners, the web page, everything. The web page and API layers sit behind a load balancer.

I do NOT have mistral or st2chatops installed.

Both nodes run on RHEL 7. The mongo database runs in its own cluster, as does rabbitmq; each is on its own cluster of VMs. In addition, I have set up zookeeper as a coordination server.

I have overcome most of the challenges with this setup. It works well for me, but in the interest of transparency I have presented it here.

STEPS TO REPRODUCE

Before beginning, I just wanted to say that this is a rough sketch of how to set up something that looks like my HA setup; hopefully it is enough that the error can be reproduced, but it's not perfect.

  1. Create a workflow that looks like this:
PS C:\Users\me\Development\src\st2-pack> cat .\actions\workflows\my-task.yaml
version: 1.0

description: Workflow to run and set concurrency of command to '1'

input:
  - cmd
  - timeout

tasks:
  task1:
    action: core.local cmd=<% ctx(cmd) %> timeout=<% ctx(timeout) %>
    next:
      - when: <% succeeded() %>
        publish:
          - stdout: <% result().stdout %>
          - stderr: <% result().stderr %>

output:
  - stdout: <% ctx(stdout) %>
  2. Set a concurrency policy on it:
PS C:\Users\me\Development\src\st2-pack> cat .\policies\concurrency_my_task.yaml
---
name: my_pack.concurrency
description: Limits the concurrent executions of the my_pack.my-task action.
enabled: true
resource_ref: my_pack.my-task
policy_type: action.concurrency
parameters:
    threshold: 1
    action: delay
  3. Create a zookeeper backend and point stackstorm at it for its coordination backend. This can be done, for example, by using the following docker-compose file. Place it at the root of a single stackstorm VM, run docker-compose up, and you'll have a zookeeper instance at localhost:2181 that you can use:
PS C:\Users\me\Development\src\st2-pack> cat .\docker-compose.yaml
version: '2'

services:
  zookeeper:
    image: 'bitnami/zookeeper:latest'
    environment:
      - ALLOW_ANONYMOUS_LOGIN=yes
    ports:
      - '2181:2181'

Change /etc/st2/st2.conf to use the zookeeper backend:

[coordination]
url = kazoo://localhost:2181
  4. Set debug = True in st2.conf:
[system]
debug = True
  5. Restart stackstorm to pick up the configuration changes.

  6. Run five of the workflows all at once:

$ st2 run my_pack.my-task cmd='sleep 90' & st2 run my_pack.my-task cmd='sleep 90' & st2 run my_pack.my-task cmd='sleep 90' & st2 run my_pack.my-task cmd='sleep 90' & st2 run my_pack.my-task cmd='sleep 90'

One notable difference between the above step and my setup is that, in my setup, the five tasks are created all at once because a sensor creates five triggers within the same few milliseconds and they are all processed very rapidly; the outcome should hopefully be the same either way. (A sketch of a closer approximation is shown below.)
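For what it's worth, a closer approximation of the sensor behavior is to create the five executions from a single process through the st2 API, so the requests land within a few milliseconds of each other. Below is a minimal Python sketch of that idea; it assumes the default st2api port 9101, the action and parameters shown above, and a valid auth token exported in ST2_AUTH_TOKEN, so adjust for your environment.

# race_five_executions.py -- illustrative sketch; host, port, and token handling are assumptions.
# Fires five execution requests at the st2 API nearly simultaneously to mimic a sensor
# emitting five triggers within the same few milliseconds.
import os
import threading

import requests

API_URL = "http://localhost:9101/v1/executions"      # default st2api port (assumed)
HEADERS = {"X-Auth-Token": os.environ["ST2_AUTH_TOKEN"]}
BODY = {"action": "my_pack.my-task", "parameters": {"cmd": "sleep 90"}}

def fire():
    # POST one execution request; the scheduler decides whether it runs or is delayed.
    resp = requests.post(API_URL, json=BODY, headers=HEADERS)
    print(resp.status_code, resp.json().get("status"))

threads = [threading.Thread(target=fire) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

A few seconds later, st2 execution list shows how many of the five ended up scheduled or running at once.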

EXPECTED RESULTS

When I print out the executions using st2 execution list, I should see one running task and four delayed tasks.

ACTUAL RESULTS

There were several cases when more than one task was scheduled at the same time.

I have captured log output from my real servers, and I am posting it here:
https://gist.github.com/djhaskin987/de002db96a27991aff8f46c0fd339fd4

The log file is a composite of the /var/log/st2/st2scheduler.log files from both HA servers. The files were cat-ed together, then sorted by line content, which nicely made the lines appear in chronological order.

I had to scrub much of the content of those logs, but essentially they look like what you would find by running the steps to reproduce above. I changed the name of the pack in the logs to my_pack, the name of the workflow to my-task, etc. I also changed the cmd argument of the workflow to sleep 90, since the contents of the command line contain company-specific information; it works well enough as a stand-in because the command in question takes roughly 90 seconds to complete. Finally, I replaced the liveaction IDs with first_task_run, second_task_run, etc. to match the runs as if they were done in the steps to reproduce above, and so that they are easy to identify.

Here are the juicy bits that you'll find in that log file.

The second_task_run is the first to ask to be scheduled, and stackstorm correctly identifies that it ought to be:

2018-12-18 23:18:56,708 140672270229168 DEBUG concurrency [-] There are 0 instances of my_pack.my-task in scheduled or running status. Threshold of my_pack.my_pack.concurrency is not reached. Action execution will be scheduled.

Next, fourth_task_run asks to be scheduled. Here's the error, though: fourth_task_run is erroneously scheduled because stackstorm is unable to read back what it just wrote to the mongo database. The following line should say "There are 1 instances of my_pack.my-task...Threshold...is reached", but it doesn't:

2018-12-18 23:18:56,891 140615979540144 DEBUG concurrency [-] There are 0 instances of my_pack.my-task in scheduled or running status. Threshold of my_pack.my_pack.concurrency is not reached. Action execution will be scheduled.

Note that I don't know which server created which log entry (that's my bad, sorry :( ). It could be that one st2scheduler process on one of the nodes created the first and another created the second. I suspect this, but haven't tried proving it yet.

Note also that the two conflicting log entries posted above occur within 183 milliseconds of each other, less than a fifth of a second apart.

Finally, first_task_run asks to be scheduled. This time, the database read reflects the values correctly and stackstorm sees that two tasks were already scheduled, and so delays the execution of first_task_run.

2018-12-18 23:18:58,112 140615979540304 DEBUG concurrency [-] There are 2 instances of my_pack.my-task in scheduled or running status. Threshold of my_pack.my_pack.concurrency is reached. Action execution will be delayed.
Likely causes

The problem appears to originate in the _apply_before function of the concurrency policy:

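        # Get the count of scheduled instances of the action.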
        scheduled = action_access.LiveAction.count(
            action=target.action, status=action_constants.LIVEACTION_STATUS_SCHEDULED)

        # Get the count of running instances of the action.
        running = action_access.LiveAction.count(
            action=target.action, status=action_constants.LIVEACTION_STATUS_RUNNING)

The above lines query the database for scheduled and running actions, and those calls return erroneous data.
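To make the suspected race concrete, here is a minimal, self-contained sketch of the check-then-act pattern, with two threads standing in for the two st2scheduler processes and a plain list standing in for the LiveAction collection. This is an illustration only, not StackStorm code:

# Illustration only: two "schedulers" both read the count before either write has
# landed, so both conclude that the threshold of 1 has not been reached.
import threading
import time

scheduled = []      # stands in for LiveActions in scheduled/running status
THRESHOLD = 1
results = {}

def scheduler(run_id):
    count = len(scheduled)          # the LiveAction.count(...) step
    time.sleep(0.05)                # window between the read and the write
    if count < THRESHOLD:
        scheduled.append(run_id)    # mark the execution as scheduled
        results[run_id] = "scheduled"
    else:
        results[run_id] = "delayed"

threads = [threading.Thread(target=scheduler, args=(run_id,))
           for run_id in ("second_task_run", "fourth_task_run")]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)   # typically both come back "scheduled", exceeding the threshold of 1

Whether the second count query was served stale data or simply ran before the first write completed, the effect is the same: both executions pass the threshold check. Since the deployment already points [coordination] at zookeeper, one way a check like this could in principle be serialized across schedulers is with a distributed lock, for example via tooz, which understands the same kazoo:// URL format used in the st2.conf snippet above. The sketch below is purely illustrative under that assumption and is not a description of the eventual fix:

# Illustrative sketch only. Assumes a reachable zookeeper at localhost:2181.
from tooz import coordination

coordinator = coordination.get_coordinator("kazoo://localhost:2181", b"scheduler-node-1")
coordinator.start()

# One lock per action ref, so concurrent schedulers serialize the
# count-then-schedule sequence for my_pack.my-task.
lock = coordinator.get_lock(b"action-concurrency-my_pack.my-task")
with lock:
    # Count scheduled + running executions, compare against the threshold, and
    # mark the execution scheduled or delayed, all while holding the lock.
    pass

coordinator.stop()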

m4dcoder self-assigned this Jan 8, 2019
m4dcoder changed the title from "concurrency policy fails when set to 1" to "Concurrency policy fails when set to 1" Jan 8, 2019
@djha-skin (Author)
I am grateful to you folks for working on this.

I just wanted to confirm that I am still seeing this problem. Initially I thought I was seeing it because of environmental problems, such as my two nodes not being on the same version, but I have squared all that away now and the problem is still there.

@m4dcoder (Contributor)
@djhaskin987 This is on our priority list. We have identified a potential solution but we need to duplicate this in our environment first. We aim to have this fix in the next release.

@djha-skin (Author)
Sweet! thanks 👍

m4dcoder (Contributor) commented Feb 6, 2019

Fix at #4537

@djha-skin (Author)
YAY

m4dcoder closed this as completed Feb 7, 2019