These are a bunch of small python test files for framework testing

argonne-lcf/test_frameworks

Testing Frameworks

This repo is for testing the framework software stacks on ALCF systems. It includes a set of simple PyTorch examples that check whether the frameworks work correctly.

Issues are reported and tracked here: https://github.com/argonne-lcf/test_frameworks/issues.

  • Torch Dist test: Torch distributed communication tests, including all the collective communication tests.

    mpiexec -np 24 --ppn 12 --cpu-binding $CPU_BIND python3 ./test_torch_dist.py
  • DTensor: Tests distributed matrix multiplication using DTensor. --dim is the total dimension of the global matrix, and --tp-size sets the layout of the processor mesh as (tp_size, world_size/tp_size).

    mpiexec -np 24 --ppn 12 --cpu-binding $CPU_BIND python3 ./test_dtensor.py --tp-size 8 --dim 96
  • ResNet50: ResNet50 with FSDP or DDP

    mpiexec -np 24 --ppn 12 --cpu-binding $CPU_BIND python3 ./test_resnet50.py
  • MNIST: MNIST with DDP

    mpiexec -np 24 --ppn 12 --cpu-binding $CPU_BIND python3 ./test_mnist.py
  • mpi4py: Testing mpi4py

    mpiexec -np 24 --ppn 12 --cpu-binding $CPU_BIND python3 ./test_mpi4py.py
  • Checkpoint: Tests the file system and storage system

    mpiexec -np 24 --ppn 12 --cpu-binding $CPU_BIND python3 ./test_torch_checkpoint.py --output-folder /tmp/
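
As an illustration of the DTensor mesh layout described above, here is a minimal sketch in plain Python of how a (tp_size, world_size/tp_size) mesh shape follows from the launch parameters. The helper names `mesh_shape` and `mesh_coords` are hypothetical and are not taken from test_dtensor.py; the row-major mapping of ranks to mesh coordinates and the requirement that world_size be divisible by tp_size are assumptions made for this example.

```python
def mesh_shape(world_size: int, tp_size: int) -> tuple[int, int]:
    """Return the 2D mesh shape (tp_size, world_size // tp_size).

    Assumption: world_size must be an exact multiple of tp_size.
    """
    if tp_size <= 0 or world_size % tp_size != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp_size={tp_size}"
        )
    return (tp_size, world_size // tp_size)


def mesh_coords(rank: int, shape: tuple[int, int]) -> tuple[int, int]:
    """Map a flat rank to (row, column) mesh coordinates.

    Assumption: ranks are laid out in row-major order.
    """
    rows, cols = shape
    if not 0 <= rank < rows * cols:
        raise ValueError(f"rank {rank} is outside a mesh of shape {shape}")
    return divmod(rank, cols)


if __name__ == "__main__":
    # Matches the example command: -np 24 with --tp-size 8.
    shape = mesh_shape(world_size=24, tp_size=8)
    print("mesh shape:", shape)
    print("rank 23 coords:", mesh_coords(23, shape))
```

With the example command (-np 24, --tp-size 8), this gives an 8 x 3 mesh. A shape like this is what one would typically pass to `torch.distributed.device_mesh.init_device_mesh`, though whether test_dtensor.py builds its mesh that way is not confirmed here.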

How to contribute

If you have tests that you think are important to include, please send a PR. In the PR, please include 1) the code to run and 2) a description of which aspects of the software or hardware the test evaluates.
