(torchx/components) pass --tee=3 to dist.ddp to prefix local_rank on the worker's stdout and stderr streams #412

Closed
kiukchung wants to merge 1 commit

kiukchung
Summary:
Addresses the quality-of-life (QOL) issue with SLURM logs raised in #405.

TL;DR: torchx launches nodes (not tasks) in SLURM, so the stdout and stderr of all 8 workers on a node end up combined in one log, whereas launching each worker as a task would give each worker its own log. With this change, `dist.ddp` passes the `--tee=3` flag to torchelastic, which prefixes each line of a worker's stdout and stderr with that worker's local_rank, so the user can easily grep the logs for a particular worker.
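To illustrate the "grep out the logs for a particular worker" step, here is a minimal Python sketch that filters a combined node log by local_rank. The exact prefix format is not stated in this PR; the sketch assumes lines shaped like `[<role_name><local_rank>]:<message>` with a role name that contains no digits. The role name `trainer`, the sample lines, and the helper `lines_for_local_rank` are all hypothetical, so check the pattern against real logs before relying on it.

```python
import re
from typing import Iterable, Iterator

# ASSUMPTION: each worker line looks like "[<role_name><local_rank>]:<message>".
# The PR does not spell out the prefix format; adjust this pattern to your logs.
PREFIX = re.compile(r"^\[(?P<role>[^\]\d]*)(?P<local_rank>\d+)\]:\s?(?P<msg>.*)$")


def lines_for_local_rank(lines: Iterable[str], local_rank: int) -> Iterator[str]:
    """Yield the messages written by one worker, with the prefix stripped."""
    for line in lines:
        match = PREFIX.match(line.rstrip("\n"))
        if match and int(match.group("local_rank")) == local_rank:
            yield match.group("msg")


# Hypothetical combined log from one node (not real torchx output).
sample = [
    "[trainer0]:epoch 1 loss=0.9",
    "[trainer1]:epoch 1 loss=1.1",
    "launcher line without a prefix",
    "[trainer0]:epoch 2 loss=0.7",
]

rank0 = list(lines_for_local_rank(sample, 0))
rank1 = list(lines_for_local_rank(sample, 1))
print(rank0)
print(rank1)
```

Lines without a matching prefix (for example, output from the launcher itself) are skipped rather than attributed to any worker.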

Differential Revision: D34726681

fbshipit-source-id: 24500a68981db7671f2f57961cc7dfa96c8a45e8
@facebook-github-bot added the CLA Signed and fb-exported labels Mar 8, 2022
@facebook-github-bot
This pull request was exported from Phabricator. Differential Revision: D34726681

codecov bot commented Mar 8, 2022

Codecov Report

Merging #412 (6d45662) into main (9a4faca) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main     #412   +/-   ##
=======================================
  Coverage   94.21%   94.21%           
=======================================
  Files          66       66           
  Lines        3685     3685           
=======================================
  Hits         3472     3472           
  Misses        213      213           

| Impacted Files | Coverage Δ |
|---|---|
| torchx/components/dist.py | 79.54% <ø> (ø) |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
