Fbhuiyan2 -- adding Sophia to default configs #386

fbhuiyan2 · 2024-12-14T09:43:19Z

Added an app_run and a compute node for Sophia. Added Sophia as a default config with an appropriate settings.yml and job_sample.sh script. Tested the configuration on Sophia: with these changes, Balsam shows Sophia as an option when creating a new site. Tested running jobs on the 'by-gpu' and 'by-node' queues, and both worked as expected. Further testing may be needed to confirm that node packing works as expected. Hyperthreading is not enabled here, but it can be added later.

cms21 · 2025-01-22T22:28:24Z

Hi @fbhuiyan2, sorry it's taken so long to address this PR. One question: when you say you've tested in the by-gpu queue, can you clarify what you tested?

fbhuiyan2 · 2025-01-27T14:54:59Z

Yes, sure. I have tested running the Python job from your Balsam workshop. In addition, I have been running LAMMPS and VASP calculations on Sophia using Balsam. I have not run any LAMMPS calculations on a 'by-gpu' node, but I have run my VASP app on a 'by-gpu' node, and the VASP jobs ran just fine.

I found that node packing also works. Initially, I assumed that each GPU in a 'by-gpu' node would count as a 'node' for node packing purposes. That assumption turned out to be wrong: nodes are still actual nodes. To keep things simple, node_packing_count = 1 should be used for 'by-gpu'.

If a higher node packing count is used, for example node_packing_count = 4 with n_gpus = 2, then the following error can occur unless you request, and actually get, 8 GPUs on the same node:

=================================
[sophia-gpu-20:1487546] *** Process received signal ***
[sophia-gpu-20:1487546] Signal: Segmentation fault (11)
[sophia-gpu-20:1487546] Signal code:  (-6)
[sophia-gpu-20:1487546] Failing at address: 0x9b8c0016b2ba
[sophia-gpu-20:1487546] [ 0] /lib64/libc.so.6(+0x3e6f0)[0x14ffd203e6f0]
[sophia-gpu-20:1487546] [ 1] /soft/applications/vasp/vasp.6.4.3/bin/vasp_std[0x1f76ff0]
[sophia-gpu-20:1487546] [ 2] /soft/applications/vasp/vasp.6.4.3/bin/vasp_std[0x1f50900]
[sophia-gpu-20:1487546] [ 3] /soft/applications/vasp/vasp.6.4.3/bin/vasp_std[0x4bef88]
[sophia-gpu-20:1487546] [ 4] /soft/applications/vasp/vasp.6.4.3/bin/vasp_std[0x836487]
[sophia-gpu-20:1487546] [ 5] /soft/applications/vasp/vasp.6.4.3/bin/vasp_std[0xf01cd0]
[sophia-gpu-20:1487546] [ 6] /soft/applications/vasp/vasp.6.4.3/bin/vasp_std[0xf931d1]
[sophia-gpu-20:1487546] [ 7] /soft/applications/vasp/vasp.6.4.3/bin/vasp_std[0x19410fd]
[sophia-gpu-20:1487546] [ 8] /soft/applications/vasp/vasp.6.4.3/bin/vasp_std[0x1918214]
[sophia-gpu-20:1487546] [ 9] /soft/applications/vasp/vasp.6.4.3/bin/vasp_std[0x415ab1]
[sophia-gpu-20:1487546] [10] /lib64/libc.so.6(+0x29590)[0x14ffd2029590]
[sophia-gpu-20:1487546] [11] /lib64/libc.so.6(__libc_start_main+0x80)[0x14ffd2029640]
[sophia-gpu-20:1487546] [12] /soft/applications/vasp/vasp.6.4.3/bin/vasp_std[0x411f65]
[sophia-gpu-20:1487546] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node sophia-gpu-20 exited on signal 11 (Segmentation fault).

Here, I asked for 4 GPUs in the Balsam queue with node_packing_count = 4 and n_gpus = 2. Balsam tried to pack 4 calculations onto the 4 GPUs, but only 2 jobs could fit; the other 2 failed with this error.
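
To make the arithmetic behind this concrete, here is a minimal sketch in plain Python. It is not Balsam code and does not reflect how Balsam schedules jobs internally; the function names are hypothetical and the model simply assumes that each packed job needs its own n_gpus GPUs on the node.

```python
def gpus_needed_per_node(node_packing_count: int, n_gpus: int) -> int:
    """GPUs a node must provide so that every packed job gets its own GPUs.

    Assumption (illustrative only): packed jobs do not share GPUs.
    """
    return node_packing_count * n_gpus


def jobs_that_fit(gpus_allocated: int, n_gpus: int) -> int:
    """Number of jobs that can run at once on the GPUs actually allocated."""
    return gpus_allocated // n_gpus


# The failing case described above: 4 GPUs allocated,
# node_packing_count = 4, n_gpus = 2.
needed = gpus_needed_per_node(node_packing_count=4, n_gpus=2)
fit = jobs_that_fit(gpus_allocated=4, n_gpus=2)
print(f"GPUs needed for full packing: {needed}")
print(f"Jobs that fit on 4 GPUs: {fit}")
```

Under this simplified model, full packing needs 8 GPUs on one node, while a 4-GPU allocation has room for only 2 of the 4 packed jobs, which matches the behavior reported above. With node_packing_count = 1 the mismatch cannot arise as long as the allocation has at least n_gpus GPUs.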

fbhuiyan2 added 7 commits December 13, 2024 21:17

Create sophia.py

9703084

Create alcf_sophia_node.py

50e6f91

Update __init__.py in compute_node

1dea652

Update __init__.py in app_run

1cb0949

Create settings.yml

d5f30d0

Create job-template.sh

a20717e

Create how to add Sophia to Balsam default configs.md

628f8c6

fbhuiyan2 added 6 commits February 21, 2025 12:41

Create crux.py

5f78626

app_run for Crux

Create alcf_crux_node.py

387885e

Create settings.yml

8d4c5b6

Create job-template.sh

e8508e3

Update __init__.py

74e1c4e

Update __init__.py

18d5c11
