Fbhuiyan2 -- adding Sophia to default configs #386

Open
wants to merge 13 commits into
base: main

Conversation

fbhuiyan2

Added apprun and compute node for Sophia, and added Sophia as a default config with an appropriate settings.yml and job_sample.sh script. I tested the configuration on Sophia: with these changes, Balsam shows Sophia as an option when creating a new site. Jobs on the 'by-gpu' and 'by-node' queues ran as expected. Further testing may be needed to confirm that node packing works as expected. Hyperthreading is not enabled here, but it can be added later.

@cms21
Contributor

cms21 commented Jan 22, 2025

Hi @fbhuiyan2, sorry it's taken so long to address this PR. One question: when you say you've tested in the by-gpu queue, can you clarify what you tested?

@fbhuiyan2
Author

fbhuiyan2 commented Jan 27, 2025

Yes, sure. I tested running the Python job from your Balsam workshop. I have also been running LAMMPS and VASP calculations on Sophia using Balsam. I have not run any LAMMPS calculations on a 'by-gpu' node, but I have run my VASP app on one, and the VASP jobs ran fine.

I also found that node packing works. Initially, I assumed that each GPU in a 'by-gpu' node would count as a 'node' for node packing purposes. That turned out to be wrong: nodes are still actual nodes. To keep things simple, node_packing_count = 1 should be used for 'by-gpu'.

If a higher node packing count is used, for example node_packing_count = 4 with n_gpus = 2, the following error can occur unless you request and receive 8 GPUs on the same node:

=================================
[sophia-gpu-20:1487546] *** Process received signal ***
[sophia-gpu-20:1487546] Signal: Segmentation fault (11)
[sophia-gpu-20:1487546] Signal code:  (-6)
[sophia-gpu-20:1487546] Failing at address: 0x9b8c0016b2ba
[sophia-gpu-20:1487546] [ 0] /lib64/libc.so.6(+0x3e6f0)[0x14ffd203e6f0]
[sophia-gpu-20:1487546] [ 1] /soft/applications/vasp/vasp.6.4.3/bin/vasp_std[0x1f76ff0]
[sophia-gpu-20:1487546] [ 2] /soft/applications/vasp/vasp.6.4.3/bin/vasp_std[0x1f50900]
[sophia-gpu-20:1487546] [ 3] /soft/applications/vasp/vasp.6.4.3/bin/vasp_std[0x4bef88]
[sophia-gpu-20:1487546] [ 4] /soft/applications/vasp/vasp.6.4.3/bin/vasp_std[0x836487]
[sophia-gpu-20:1487546] [ 5] /soft/applications/vasp/vasp.6.4.3/bin/vasp_std[0xf01cd0]
[sophia-gpu-20:1487546] [ 6] /soft/applications/vasp/vasp.6.4.3/bin/vasp_std[0xf931d1]
[sophia-gpu-20:1487546] [ 7] /soft/applications/vasp/vasp.6.4.3/bin/vasp_std[0x19410fd]
[sophia-gpu-20:1487546] [ 8] /soft/applications/vasp/vasp.6.4.3/bin/vasp_std[0x1918214]
[sophia-gpu-20:1487546] [ 9] /soft/applications/vasp/vasp.6.4.3/bin/vasp_std[0x415ab1]
[sophia-gpu-20:1487546] [10] /lib64/libc.so.6(+0x29590)[0x14ffd2029590]
[sophia-gpu-20:1487546] [11] /lib64/libc.so.6(__libc_start_main+0x80)[0x14ffd2029640]
[sophia-gpu-20:1487546] [12] /soft/applications/vasp/vasp.6.4.3/bin/vasp_std[0x411f65]
[sophia-gpu-20:1487546] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node sophia-gpu-20 exited on signal 11 (Segmentation fault).

Here, I requested 4 GPUs in the Balsam queue with node_packing_count = 4 and n_gpus = 2. Balsam tried to pack 4 calculations onto the 4 GPUs, but only 2 jobs could fit; the other 2 failed with this error.
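
The counting behind this failure can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not Balsam's actual scheduling logic. The function names `gpus_required` and `jobs_that_fit` are hypothetical, and the sketch assumes each job needs its own distinct GPUs.

```python
def gpus_required(node_packing_count: int, gpus_per_job: int) -> int:
    """GPUs a single node must provide so every packed job gets its own GPUs."""
    return node_packing_count * gpus_per_job


def jobs_that_fit(gpus_granted: int, gpus_per_job: int) -> int:
    """How many jobs can each hold gpus_per_job distinct GPUs out of gpus_granted."""
    return gpus_granted // gpus_per_job


# Values from the failing run described above (hypothetical variable names).
packing, per_job, granted = 4, 2, 4

required = gpus_required(packing, per_job)            # 4 jobs * 2 GPUs each
fit = min(packing, jobs_that_fit(granted, per_job))   # jobs that actually get GPUs

print(f"required={required} granted={granted} fit={fit} without_gpus={packing - fit}")
```

With node_packing_count = 1 the requirement reduces to the GPUs of a single job, which is why that setting is the simple choice for 'by-gpu'.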
