Enhancement/transpiler rename grad vars to add trainer id, so RPC call can be retried. #8049
… no_counter_on_pserver
LGTM
optimize_ops, params_grads, pservers=pserver_endpoints, trainers=trainers)
optimize_ops,
params_grads,
0,
Need to get trainer_id from an environment variable or a command-line argument instead of hard-coding it here. I will fix it in a new PR.
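One way the follow-up could look is sketched below. This is only an illustration: the flag name `--trainer_id`, the environment variable `TRAINER_ID`, and the helper `resolve_trainer_id` are hypothetical names chosen for this sketch and are not taken from the PR.

```python
import argparse
import os


def resolve_trainer_id(argv=None, environ=None):
    """Return the trainer id from the command line, else the environment, else 0.

    Both the flag name (--trainer_id) and the variable name (TRAINER_ID)
    are assumptions made for this sketch.
    """
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    parser.add_argument("--trainer_id", type=int, default=None)
    # parse_known_args ignores unrelated flags the training script may define.
    args, _unknown = parser.parse_known_args(argv)
    if args.trainer_id is not None:
        return args.trainer_id
    return int(environ.get("TRAINER_ID", "0"))
```

The resolved value would then be passed where the literal 0 currently appears in the call above. The command-line flag takes precedence so that a single machine can launch several trainers with different ids from the same environment.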
# step3
optimize_block = pserver_program.create_block(0)
# step 4
# Create a union-find data struct from optimize ops,
Thanks for the detailed documentation!
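For readers unfamiliar with the structure named in the step 4 comment, here is a minimal union-find sketch. The quoted comment is truncated, so the grouping rule used here (two ops are joined when one's output is another's input) is an assumption, and the op and variable names are made up for illustration; the transpiler's actual code is not shown in this thread.

```python
class UnionFind:
    """Disjoint-set structure over a fixed collection of hashable items."""

    def __init__(self, items):
        self.parent = {item: item for item in items}

    def find(self, item):
        # Path halving: point each visited node at its grandparent.
        while self.parent[item] != item:
            self.parent[item] = self.parent[self.parent[item]]
            item = self.parent[item]
        return item

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def connected(self, a, b):
        return self.find(a) == self.find(b)


def group_ops(ops):
    """Join ops whose outputs feed another op's inputs.

    `ops` is a list of (name, inputs, outputs) tuples, a simplified
    stand-in for real operator objects.
    """
    uf = UnionFind([name for name, _, _ in ops])
    for name_a, _, outs_a in ops:
        for name_b, ins_b, _ in ops:
            if name_a != name_b and set(outs_a) & set(ins_b):
                uf.union(name_a, name_b)
    return uf


# Hypothetical optimize ops: scaling feeds the w1 update, w2 is independent.
ops = [
    ("scale_w1", ["w1@GRAD"], ["w1@GRAD.scaled"]),
    ("sgd_w1", ["w1", "w1@GRAD.scaled"], ["w1"]),
    ("sgd_w2", ["w2", "w2@GRAD"], ["w2"]),
]
uf = group_ops(ops)
```

With this input, `scale_w1` and `sgd_w1` end up in the same set while `sgd_w2` stays separate, which is the property a pserver would need to decide which ops belong with the parameters it owns.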
LGTM!
As discussed in #7947 (comment), we can delete the `fan_in` attribute in the `listen_and_serv` op, so RPC calls can be retried when the network is unreliable.
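The toy model below illustrates why per-trainer gradient names make retries safe while a message counter does not. It is a sketch under stated assumptions, not the pserver implementation: the two server classes, the helper `trainer_grad_name`, and the `.trainer_<id>` suffix are all hypothetical.

```python
def trainer_grad_name(grad_name, trainer_id):
    # Hypothetical naming scheme: append the trainer id to the gradient name.
    return "%s.trainer_%d" % (grad_name, trainer_id)


class CountingServer:
    """Counter-based model: every received message counts toward fan_in."""

    def __init__(self, fan_in):
        self.fan_in = fan_in
        self.received = []

    def send(self, grad_name, value):
        self.received.append(value)

    def ready(self):
        return len(self.received) >= self.fan_in

    def merged(self):
        return sum(self.received)


class NamedSlotServer:
    """Slot-based model: one slot per trainer, so a retry overwrites its slot."""

    def __init__(self, grad_name, trainers):
        self.slots = {
            trainer_grad_name(grad_name, i): None for i in range(trainers)
        }

    def send(self, name, value):
        self.slots[name] = value

    def ready(self):
        return all(v is not None for v in self.slots.values())

    def merged(self):
        return sum(self.slots.values())


# Trainer 0 sends its gradient, the RPC appears to fail, and it retries.
counting = CountingServer(fan_in=2)
counting.send("w@GRAD", 1.0)
counting.send("w@GRAD", 1.0)  # retry is counted as a second trainer
counting_ready_too_early = counting.ready()

named = NamedSlotServer("w@GRAD", trainers=2)
named.send(trainer_grad_name("w@GRAD", 0), 1.0)
named.send(trainer_grad_name("w@GRAD", 0), 1.0)  # retry overwrites the slot
named_ready_before_trainer_1 = named.ready()
named.send(trainer_grad_name("w@GRAD", 1), 2.0)
```

In the counter model the retry makes the server believe both trainers have reported, although trainer 1 has sent nothing. In the slot model the retry is idempotent, and the server becomes ready only once trainer 1's slot is filled.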