Add job description field in toml #101
Conversation
Summary: Adding a description field, useful for integration tests to describe the test.

Test Plan:
Reviewers:
Subscribers:
Tasks:
Tags:
```diff
@@ -93,7 +93,7 @@ def main(job_config: JobConfig):
         world_size=world_size,
     )
     world_mesh = parallel_dims.build_mesh(device_type="cuda")

+    rank0_log(f"Starting job: {job_config.job.description}")
```
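For context, the value being logged comes from the job config TOML. A minimal sketch of what the new entry might look like; the `[job]` table name is inferred from `job_config.job.description`, and only the description string itself is confirmed by the "Starting job: debug training" line in the test log below:

```toml
# Hypothetical excerpt from ./train_configs/debug_model.toml.
# The [job] table name is an assumption based on job_config.job.description;
# the description value matches the rank-0 startup log below.
[job]
description = "debug training"  # printed by rank 0 at startup
```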
Perhaps, since this is an optional parameter for the run, we should confirm the entry is there before displaying it, so that leaving out this optional param doesn't break the run?
(i.e. `if 'description' in job_config.job: rank0_log(f"...")`)
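Spelled out, the suggested guard might look like the sketch below. It uses `getattr` as a defensive alternative to the `in` membership test, since whether `job_config.job` supports `in` depends on the config object's type, which isn't shown in this thread:

```python
# Sketch of the suggested guard (hypothetical; the reply below notes it is
# unnecessary because the config manager always populates a default).
description = getattr(job_config.job, "description", None)
if description:
    rank0_log(f"Starting job: {description}")
```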
The entry will always be there, since the config manager populates the default registered with argparse.
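A minimal standalone sketch of why the entry is always present, assuming the config manager registers each TOML field as an argparse argument with a default (the flag name and default string here are hypothetical, not taken from the repo):

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical registration: argparse fills in the default whenever the
# command line (or the TOML file feeding it) omits the field.
parser.add_argument(
    "--job.description",
    type=str,
    default="default job",
    help="description of the training job",
)

args = parser.parse_args([])  # nothing passed, so the default applies
# argparse keeps the dotted flag name as the attribute key:
print(getattr(args, "job.description"))  # -> "default job"
```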
Nice, useful to have this info up front at the start of a run.
Added a minor nit: perhaps we should check for the presence of this entry, since it's an optional param, to avoid breaking the run if it isn't included in a config.
Summary: Adding a description field, useful for integration tests to describe the test.

Test Plan:
```
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0,1
+ CONFIG_FILE=./train_configs/debug_model.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717]
W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] *****************************************
W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] *****************************************
[rank1]:2024-02-29 17:05:04,269 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank0]:2024-02-29 17:05:04,510 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank0]:2024-02-29 17:05:05,327 - root - INFO - Starting job: debug training
[rank0]:2024-02-29 17:05:05,327 - root - INFO - Building llama
[rank0]:2024-02-29 17:05:05,335 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-02-29 17:05:05,335 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank1]:2024-02-29 17:05:06,782 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank1]:2024-02-29 17:05:06,782 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank0]:2024-02-29 17:05:07,347 - root - INFO - Model fully initialized via reset_params
[rank0]:2024-02-29 17:05:07,349 - root - INFO - Model built with: ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-02-29 17:05:07,349 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank0]:2024-02-29 17:05:07,349 - root - INFO - GPU memory usage: NVIDIA H100 (0): 95.0396 GiB capacity, 0.0 GiB in-use, 0.0% in-use
[rank0]:2024-02-29 17:05:08,375 - root - INFO - Applied FSDP to the model...
[rank0]:2024-02-29 17:05:08,376 - root - INFO - Gradient scaling not enabled.
[rank0]:2024-02-29 17:05:08,376 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240229-1705.
[rank0]:2024-02-29 17:05:08,610 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
[rank0]:2024-02-29 17:05:10,570 - root - INFO - step:  1  loss: 10.9183  iter:  1.9258  data: 0.0303  lr: 0.00026667
[rank0]:2024-02-29 17:05:10,653 - root - INFO - step:  2  loss: 10.8347  iter:  0.0487  data: 0.0336  lr: 0.00053333
[rank0]:2024-02-29 17:05:10,733 - root - INFO - step:  3  loss: 10.6861  iter:  0.045  data: 0.0334  lr: 0.0008
[rank0]:2024-02-29 17:05:10,812 - root - INFO - step:  4  loss: 10.4672  iter:  0.0453  data: 0.0336  lr: 0.0007
[rank0]:2024-02-29 17:05:10,893 - root - INFO - step:  5  loss: 10.2154  iter:  0.0466  data: 0.033  lr: 0.0006
[rank0]:2024-02-29 17:05:10,975 - root - INFO - step:  6  loss: 9.9573  iter:  0.0496  data: 0.0314  lr: 0.0005
[rank0]:2024-02-29 17:05:11,056 - root - INFO - step:  7  loss: 9.7627  iter:  0.0486  data: 0.0321  lr: 0.0004
[rank0]:2024-02-29 17:05:11,201 - root - INFO - step:  8  loss: 9.437  iter:  0.0457  data: 0.0333  lr: 0.0003
[rank1]:STAGE:2024-02-29 17:05:11 3368103:3368103 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank1]:[rank1]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:STAGE:2024-02-29 17:05:11 3368102:3368102 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank0]:2024-02-29 17:05:11,317 - root - INFO - step:  9  loss: 9.2446  iter:  0.0794  data: 0.0324  lr: 0.0002
[rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank1]:STAGE:2024-02-29 17:05:11 3368103:3368103 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank1]:STAGE:2024-02-29 17:05:11 3368103:3368103 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:STAGE:2024-02-29 17:05:11 3368102:3368102 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank0]:STAGE:2024-02-29 17:05:11 3368102:3368102 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:2024-02-29 17:05:11,748 - root - INFO - exporting profile traces to ./outputs/profiling/traces/iteration_10
[rank0]:2024-02-29 17:05:11,762 - root - INFO - step: 10  loss: 9.1772  iter:  0.0893  data: 0.0324  lr: 0.0001
[rank0]:2024-02-29 17:05:11,763 - root - INFO - Average iter time: 0.0578 seconds
[rank0]:2024-02-29 17:05:11,763 - root - INFO - Average data load time: 0.0326 seconds
[rank0]:2024-02-29 17:05:11,763 - root - INFO - Current Memory: NVIDIA H100 (0): Reserved: 9.6465%, Alloc 2.1969%, Active: 2.2%
[rank0]:Peak Memory: Reserved 9.65%, Alloc 8.43%, Active: 8.44%
[rank0]:num retries: 0, num ooms: 0
[rank0]:NCCL version 2.19.3+cuda12.0
```
Reviewers:
Subscribers:
Tasks:
Tags:

Co-authored-by: gnadathur <[email protected]>