
Commit 452baee

gnadathur and gnadathur authored
Add job description field in toml (pytorch#101)
Summary: Add a `description` field to the job config TOML; it is useful for integration tests to describe what a test run does.

Test Plan:
```
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0,1
+ CONFIG_FILE=./train_configs/debug_model.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717]
W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] *****************************************
W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] *****************************************
[rank1]:2024-02-29 17:05:04,269 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank0]:2024-02-29 17:05:04,510 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank0]:2024-02-29 17:05:05,327 - root - INFO - Starting job: debug training
[rank0]:2024-02-29 17:05:05,327 - root - INFO - Building llama
[rank0]:2024-02-29 17:05:05,335 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-02-29 17:05:05,335 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank1]:2024-02-29 17:05:06,782 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank1]:2024-02-29 17:05:06,782 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank0]:2024-02-29 17:05:07,347 - root - INFO - Model fully initialized via reset_params
[rank0]:2024-02-29 17:05:07,349 - root - INFO - Model built with: ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-02-29 17:05:07,349 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank0]:2024-02-29 17:05:07,349 - root - INFO - GPU memory usage: NVIDIA H100 (0): 95.0396 GiB capacity, 0.0 GiB in-use, 0.0% in-use
[rank0]:2024-02-29 17:05:08,375 - root - INFO - Applied FSDP to the model...
[rank0]:2024-02-29 17:05:08,376 - root - INFO - Gradient scaling not enabled.
[rank0]:2024-02-29 17:05:08,376 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240229-1705.
[rank0]:2024-02-29 17:05:08,610 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
[rank0]:2024-02-29 17:05:10,570 - root - INFO - step:  1  loss: 10.9183  iter:  1.9258  data: 0.0303  lr: 0.00026667
[rank0]:2024-02-29 17:05:10,653 - root - INFO - step:  2  loss: 10.8347  iter:  0.0487  data: 0.0336  lr: 0.00053333
[rank0]:2024-02-29 17:05:10,733 - root - INFO - step:  3  loss: 10.6861  iter:  0.045   data: 0.0334  lr: 0.0008
[rank0]:2024-02-29 17:05:10,812 - root - INFO - step:  4  loss: 10.4672  iter:  0.0453  data: 0.0336  lr: 0.0007
[rank0]:2024-02-29 17:05:10,893 - root - INFO - step:  5  loss: 10.2154  iter:  0.0466  data: 0.033   lr: 0.0006
[rank0]:2024-02-29 17:05:10,975 - root - INFO - step:  6  loss: 9.9573   iter:  0.0496  data: 0.0314  lr: 0.0005
[rank0]:2024-02-29 17:05:11,056 - root - INFO - step:  7  loss: 9.7627   iter:  0.0486  data: 0.0321  lr: 0.0004
[rank0]:2024-02-29 17:05:11,201 - root - INFO - step:  8  loss: 9.437    iter:  0.0457  data: 0.0333  lr: 0.0003
[rank1]:STAGE:2024-02-29 17:05:11 3368103:3368103 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank1]:[rank1]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:STAGE:2024-02-29 17:05:11 3368102:3368102 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank0]:2024-02-29 17:05:11,317 - root - INFO - step:  9  loss: 9.2446  iter:  0.0794  data: 0.0324  lr: 0.0002
[rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank1]:STAGE:2024-02-29 17:05:11 3368103:3368103 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank1]:STAGE:2024-02-29 17:05:11 3368103:3368103 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:STAGE:2024-02-29 17:05:11 3368102:3368102 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank0]:STAGE:2024-02-29 17:05:11 3368102:3368102 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:2024-02-29 17:05:11,748 - root - INFO - exporting profile traces to ./outputs/profiling/traces/iteration_10
[rank0]:2024-02-29 17:05:11,762 - root - INFO - step: 10  loss: 9.1772  iter:  0.0893  data: 0.0324  lr: 0.0001
[rank0]:2024-02-29 17:05:11,763 - root - INFO - Average iter time: 0.0578 seconds
[rank0]:2024-02-29 17:05:11,763 - root - INFO - Average data load time: 0.0326 seconds
[rank0]:2024-02-29 17:05:11,763 - root - INFO - Current Memory: NVIDIA H100 (0): Reserved: 9.6465%, Alloc 2.1969%, Active: 2.2%
[rank0]:Peak Memory: Reserved 9.65%, Alloc 8.43%, Active: 8.44%
[rank0]:num retries: 0, num ooms: 0
[rank0]:NCCL version 2.19.3+cuda12.0
```

Reviewers:

Subscribers:

Tasks:

Tags:

Co-authored-by: gnadathur <[email protected]>
1 parent 2c8cec2 commit 452baee
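
To illustrate the plumbing this commit adds, here is a minimal, self-contained sketch (not the actual torchtrain config_manager) of how a dotted `--job.*` CLI flag and a `[job]` table in a TOML file can be merged, with the TOML value taking effect over the CLI default; this matches the `Starting job: debug training` line in the test plan, where the TOML description wins over `"default job"`:

```python
# Hypothetical standalone sketch of the CLI/TOML merge; torchtrain's real
# JobConfig/config_manager implementation may differ in detail.
import argparse
import sys
import tomllib  # Python 3.11+; older interpreters can use the tomli package


def load_job_config(argv: list[str]) -> dict:
    parser = argparse.ArgumentParser()
    # Same flags the commit touches: a config file path and the new description.
    parser.add_argument("--job.config_file", type=str, default=None)
    parser.add_argument("--job.description", type=str, default="default job")
    args = vars(parser.parse_args(argv))  # dotted dests need dict-style access

    # Seed from the CLI value, then let the TOML [job] table override it.
    job = {"description": args["job.description"]}
    config_file = args["job.config_file"]
    if config_file is not None:
        with open(config_file, "rb") as f:  # tomllib requires binary mode
            job.update(tomllib.load(f).get("job", {}))
    return job


if __name__ == "__main__":
    # e.g. python sketch.py --job.config_file ./train_configs/debug_model.toml
    print(load_job_config(sys.argv[1:]))  # -> {'description': 'debug training'}
```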

File tree

4 files changed: +9 −2 lines changed


torchtrain/config_manager.py

+6 −1

@@ -74,7 +74,12 @@ def init_args_from_command_line(
             default="./torchtrain/outputs",
             help="folder to dump job outputs",
         )
-
+        parser.add_argument(
+            "--job.description",
+            type=str,
+            default="default job",
+            help="description of the job",
+        )
         # profiling configs
         parser.add_argument(
             "--profiling.run_profiler",

train.py

+1 −1

@@ -93,7 +93,7 @@ def main(job_config: JobConfig):
         world_size=world_size,
     )
     world_mesh = parallel_dims.build_mesh(device_type="cuda")
-
+    rank0_log(f"Starting job: {job_config.job.description}")
     model_name = job_config.model.name
     rank0_log(f"Building {model_name}")
     # build tokenizer
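
The new log line goes through `rank0_log`, so only one process prints it in a multi-GPU run; note that `Starting job: debug training` appears exactly once in the test plan, on rank 0. A plausible sketch of such a helper, assuming `torch.distributed` (the real torchtrain helper may be implemented differently):

```python
# Hypothetical rank-0-only logging helper in the spirit of rank0_log;
# torchtrain's actual implementation may differ.
import logging

import torch.distributed as dist

logger = logging.getLogger()


def rank0_log(msg: str) -> None:
    # Log from the global rank 0 process only, so an N-GPU run does not
    # emit N duplicate lines; fall back to plain logging when the process
    # group has not been initialized (e.g. single-process debugging).
    if not dist.is_initialized() or dist.get_rank() == 0:
        logger.info(msg)
```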

train_configs/debug_model.toml

+1

@@ -1,6 +1,7 @@
 # TorchTrain Config.toml
 [job]
 dump_folder = "./outputs"
+description = "debug training"
 
 [profiling]
 run_profiler = true

train_configs/llama_7b.toml

+1

@@ -1,6 +1,7 @@
 # TorchTrain Config.toml
 [job]
 dump_folder = "./outputs"
+description = "llama 7b training"
 
 [profiling]
 run_profiler = true
