Concurrency policy fails when set to 1 #4481
Comments
I am grateful to you people for working on this. I just wanted to confirm that I am still seeing this problem. Initially I thought I saw it because of environmental problems, like both of my nodes not being on the same version, but I have squared all that away now and the problem is still there.
@djhaskin987 This is on our priority list. We have identified a potential solution, but we need to duplicate this in our environment first. We aim to have this fix in the next release.
Sweet! Thanks 👍
Fix at #4537
YAY
SUMMARY
I have created a very simple orquesta workflow, solely so that I can set a concurrency policy over a particular set of actions.
I have set a concurrency policy on this workflow and set the threshold to `1`, hoping to mutex the workflow so that only one instance runs at a time. Unfortunately, this policy does not work, due to what looks like a race condition.
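For reference, a policy of the kind described above might look roughly like the following. This is only a sketch; the pack and action names (`my_pack`, `my-task`) are the scrubbed placeholders used later in the logs, and the file name is hypothetical:

```yaml
# policies/my_task_concurrency.yaml -- hypothetical file name
# Delay any new execution of my_pack.my-task once one is already running.
name: my_pack.my-task.concurrency
description: Only allow one execution of my_pack.my-task at a time.
enabled: true
resource_ref: my_pack.my-task
policy_type: action.concurrency
parameters:
  action: delay
  threshold: 1
```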
ISSUE TYPE
STACKSTORM VERSION
Output of `st2 --version`:
OS / ENVIRONMENT / INSTALL METHOD
I have a two-node, active-active, highly-available stackstorm cluster. The cluster is set up with everything running on every node: the garbage collector, the scheduler, the runners, the web page, everything. The web page and API layers sit behind a load balancer.
I do NOT have mistral or st2chatops installed.
Both nodes run on RHEL 7. The mongo database runs on its own cluster of VMs, as does rabbitmq. In addition, I have set up zookeeper as a coordination server.
I have overcome most of the challenges with this setup. It works well for me, but in the interest of transparency I have presented it here.
STEPS TO REPRODUCE
Before beginning, I just wanted to say that this is a rough sketch of how to set up something that looks like my HA setup, hopefully enough so that the error can be reproduced, but it's not perfect.
Run `docker-compose up`, and you'll have a zookeeper instance at `localhost:2181` that you can use.
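The compose file itself isn't reproduced here; a minimal sketch that would provide such an instance might look like this (the image tag and file layout are assumptions):

```yaml
# docker-compose.yml -- minimal single-node zookeeper for testing (not production)
version: "2"
services:
  zookeeper:
    image: zookeeper:3.4
    ports:
      - "2181:2181"   # expose the client port on localhost:2181
```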
Change `/etc/st2/st2.conf` to use the zookeeper backend and set `debug = True` in st2.conf, then restart stackstorm to pick up the configuration changes.
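The exact st2.conf fragments aren't shown above; assuming the stock section layout (`[coordination]` and `[system]`), the changes would look roughly like:

```ini
; /etc/st2/st2.conf (excerpt) -- section names assume the stock config layout
[coordination]
; URL scheme per the tooz zookeeper driver; adjust if your tooz version expects zookeeper://
url = kazoo://localhost:2181

[system]
debug = True
```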
Run five of the workflows all at once.
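One way to launch five executions at nearly the same time from the CLI (assuming the scrubbed action name `my_pack.my-task` and the `cmd` parameter described with the logs below) is:

```bash
# Fire off five executions of the workflow nearly simultaneously.
for i in 1 2 3 4 5; do
    st2 run my_pack.my-task cmd="sleep 90" &
done
wait
```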
A notable difference between the above steps and my setup is that in my setup a sensor creates five triggers within the same few milliseconds and they're all processed very rapidly, but the outcome should hopefully be the same.
EXPECTED RESULTS
When I print out the executions using `st2 execution list`, I should see one running task and five delayed tasks.
ACTUAL RESULTS
There were several cases when more than one task was scheduled at the same time.
I have captured log output from my real servers, and I am posting it here:
https://gist.github.com/djhaskin987/de002db96a27991aff8f46c0fd339fd4
The log file is a composite of the `/var/log/st2/st2scheduler.log` files from both HA servers. The files were `cat`-ed together, then sorted by line content, which nicely made the lines appear in chronological order.
I had to scrub much of the content of those logs, but essentially they look like what you might find by running the steps to reproduce above. I changed the name of the pack in the logs to `my_pack`, the name of the workflow to `my-task`, etc. I also changed the `cmd` argument of the workflow to `sleep 90`, since the real command line contains company-specific information; this works well enough because the command in question takes roughly 90 seconds to complete. Finally, I replaced the liveaction IDs with `first_task_run`, `second_task_run`, etc. to match the runs as if they were done in the steps to reproduce above and so that they're easy to identify.
Here are the juicy bits that you'll find in that log file.
The `second_task_run` is the first to ask to be scheduled, and stackstorm correctly identifies that it ought to be:
Next, `fourth_task_run` asks to be scheduled. Here's the error, though: `fourth_task_run` is erroneously scheduled because stackstorm is unable to read back what it just wrote to the mongo database. This line should say `There are 1 instances of my_pack.my-task...Threshold...is reached`, but it doesn't.
Note that I don't know which server created which log entry (that's my bad, sorry :( ). It could be that one `st2scheduler` process on one of the nodes created the first entry and another created the second. I suspect this, but haven't tried proving it yet.
Note also that the two conflicting log entries posted above occur within 83 milliseconds of each other, or less than a tenth of a second.
Finally, `first_task_run` asks to be scheduled. This time, the database read reflects the values correctly and stackstorm sees that two tasks were already scheduled, and so delays the execution of `first_task_run`.
Likely causes
The problem appears to originate in the `_apply_before` function, found here. Those lines query the database for scheduled and requested actions, and those calls return erroneous data.
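To make the suspected race concrete, here is a small, self-contained illustration (not StackStorm's actual code; all names are hypothetical) of why a non-atomic count-then-schedule check performed by two independent scheduler processes can admit more executions than the threshold allows:

```python
# Hypothetical illustration of a check-then-act race between two schedulers.
# None of these names correspond to StackStorm's real implementation; the point
# is only that counting scheduled executions and then scheduling, without any
# atomicity across processes, lets both schedulers see a count below the
# threshold and both proceed.
import threading
import time

THRESHOLD = 1
scheduled = []  # stands in for the "scheduled executions" collection in the DB

def count_scheduled():
    count = len(scheduled)
    time.sleep(0.05)  # simulate the lag between reading and acting
    return count

def apply_before(execution_id):
    # Each "scheduler" checks the count, then acts on what it saw.
    if count_scheduled() >= THRESHOLD:
        print(f"{execution_id}: delayed (threshold reached)")
    else:
        scheduled.append(execution_id)
        print(f"{execution_id}: scheduled")

# Two executions request scheduling at nearly the same time, as in the logs.
threads = [
    threading.Thread(target=apply_before, args=("second_task_run",)),
    threading.Thread(target=apply_before, args=("fourth_task_run",)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Typical output: both executions print "scheduled", even though THRESHOLD is 1.
```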