Add job description field in toml #101
Conversation
Summary: Adding a description field, useful for integration tests to describe the test.

Test Plan:
Reviewers:
Subscribers:
Tasks:
Tags:
```diff
@@ -93,7 +93,7 @@ def main(job_config: JobConfig):
         world_size=world_size,
     )
     world_mesh = parallel_dims.build_mesh(device_type="cuda")

+    rank0_log(f"Starting job: {job_config.job.description}")
```
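For context, the value being logged comes from the job config TOML. A minimal sketch of what the new entry might look like; the `[job]` table name is inferred from `job_config.job.description`, and only the description string itself is confirmed by the "Starting job: debug training" line in the test log below:

```toml
# Hypothetical excerpt from ./train_configs/debug_model.toml.
# The [job] table name is an assumption based on job_config.job.description;
# the description value matches the rank-0 startup log below.
[job]
description = "debug training"  # printed by rank 0 at startup
```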
Perhaps, since this is an optional parameter for the run, we should confirm the entry is there before displaying it, so that leaving out this optional param doesn't break the run?
(i.e. `if 'description' in job_config.job: rank0_log(f"...")`)
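Spelled out, the suggested guard might look like the sketch below. It uses `getattr` as a defensive alternative to the `in` membership test, since whether `job_config.job` supports `in` depends on the config object's type, which isn't shown in this thread:

```python
# Sketch of the suggested guard (hypothetical; the reply below notes it is
# unnecessary because the config manager always populates a default).
description = getattr(job_config.job, "description", None)
if description:
    rank0_log(f"Starting job: {description}")
```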
The entry will always be there, since the config manager populates the default registered with argparse.
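A minimal standalone sketch of why the entry is always present, assuming the config manager registers each TOML field as an argparse argument with a default (the flag name and default string here are hypothetical, not taken from the repo):

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical registration: argparse fills in the default whenever the
# command line (or the TOML file feeding it) omits the field.
parser.add_argument(
    "--job.description",
    type=str,
    default="default job",
    help="description of the training job",
)

args = parser.parse_args([])  # nothing passed, so the default applies
# argparse keeps the dotted flag name as the attribute key:
print(getattr(args, "job.description"))  # -> "default job"
```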
Nice, useful to have this info up front at the start of a run.
Added a minor nit: perhaps we should check for the presence of this entry, since it's an optional param, to avoid breaking the run if it isn't included in a config.
Summary: Adding a description field, useful for integration tests to describe the test.

Test Plan:
```
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0,1
+ CONFIG_FILE=./train_configs/debug_model.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717]
W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] *****************************************
W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] *****************************************
[rank1]:2024-02-29 17:05:04,269 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank0]:2024-02-29 17:05:04,510 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank0]:2024-02-29 17:05:05,327 - root - INFO - Starting job: debug training
[rank0]:2024-02-29 17:05:05,327 - root - INFO - Building llama
[rank0]:2024-02-29 17:05:05,335 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-02-29 17:05:05,335 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank1]:2024-02-29 17:05:06,782 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank1]:2024-02-29 17:05:06,782 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank0]:2024-02-29 17:05:07,347 - root - INFO - Model fully initialized via reset_params
[rank0]:2024-02-29 17:05:07,349 - root - INFO - Model built with: ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-02-29 17:05:07,349 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank0]:2024-02-29 17:05:07,349 - root - INFO - GPU memory usage: NVIDIA H100 (0): 95.0396 GiB capacity, 0.0 GiB in-use, 0.0% in-use
[rank0]:2024-02-29 17:05:08,375 - root - INFO - Applied FSDP to the model...
[rank0]:2024-02-29 17:05:08,376 - root - INFO - Gradient scaling not enabled.
[rank0]:2024-02-29 17:05:08,376 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240229-1705.
[rank0]:2024-02-29 17:05:08,610 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
[rank0]:2024-02-29 17:05:10,570 - root - INFO - step:  1  loss: 10.9183  iter:  1.9258  data: 0.0303  lr: 0.00026667
[rank0]:2024-02-29 17:05:10,653 - root - INFO - step:  2  loss: 10.8347  iter:  0.0487  data: 0.0336  lr: 0.00053333
[rank0]:2024-02-29 17:05:10,733 - root - INFO - step:  3  loss: 10.6861  iter:  0.045  data: 0.0334  lr: 0.0008
[rank0]:2024-02-29 17:05:10,812 - root - INFO - step:  4  loss: 10.4672  iter:  0.0453  data: 0.0336  lr: 0.0007
[rank0]:2024-02-29 17:05:10,893 - root - INFO - step:  5  loss: 10.2154  iter:  0.0466  data: 0.033  lr: 0.0006
[rank0]:2024-02-29 17:05:10,975 - root - INFO - step:  6  loss: 9.9573  iter:  0.0496  data: 0.0314  lr: 0.0005
[rank0]:2024-02-29 17:05:11,056 - root - INFO - step:  7  loss: 9.7627  iter:  0.0486  data: 0.0321  lr: 0.0004
[rank0]:2024-02-29 17:05:11,201 - root - INFO - step:  8  loss: 9.437  iter:  0.0457  data: 0.0333  lr: 0.0003
[rank1]:STAGE:2024-02-29 17:05:11 3368103:3368103 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank1]:[rank1]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:STAGE:2024-02-29 17:05:11 3368102:3368102 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank0]:2024-02-29 17:05:11,317 - root - INFO - step:  9  loss: 9.2446  iter:  0.0794  data: 0.0324  lr: 0.0002
[rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank1]:STAGE:2024-02-29 17:05:11 3368103:3368103 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank1]:STAGE:2024-02-29 17:05:11 3368103:3368103 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:STAGE:2024-02-29 17:05:11 3368102:3368102 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank0]:STAGE:2024-02-29 17:05:11 3368102:3368102 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:2024-02-29 17:05:11,748 - root - INFO - exporting profile traces to ./outputs/profiling/traces/iteration_10
[rank0]:2024-02-29 17:05:11,762 - root - INFO - step: 10  loss: 9.1772  iter:  0.0893  data: 0.0324  lr: 0.0001
[rank0]:2024-02-29 17:05:11,763 - root - INFO - Average iter time: 0.0578 seconds
[rank0]:2024-02-29 17:05:11,763 - root - INFO - Average data load time: 0.0326 seconds
[rank0]:2024-02-29 17:05:11,763 - root - INFO - Current Memory: NVIDIA H100 (0): Reserved: 9.6465%, Alloc 2.1969%, Active: 2.2%
[rank0]:Peak Memory: Reserved 9.65%, Alloc 8.43%, Active: 8.44%
[rank0]:num retries: 0, num ooms: 0
[rank0]:NCCL version 2.19.3+cuda12.0
```
Reviewers:
Subscribers:
Tasks:
Tags:

Co-authored-by: gnadathur <[email protected]>