Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New automated Checkpoint-Restart Tests #324

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

mabruzzo
Copy link
Contributor

@mabruzzo mabruzzo commented Jul 4, 2023

This PR introduces new automated Checkpoint-Restart tests that will be executed by the continuous integration as part of the pytest framework.

Overview

Prior to the migration from Scons to Cmake, we previously had some Checkpoint-Restart tests. Unfortunately, those tests got lost in the transition. At the same time, James has been hard at work developing/refining the new Scalable Checkpoint functionality.

It was initially my intention to just write tests for the new-style checkpoints. However, issue #316 has made it clear that the new-style checkpoints are not ready to completely replace the old-style checkpoints quite yet. For that reason, I have introduced tests of both types of Checkpoints.

I largely rewrote most of the code for the Checkpoint-Restart functionality (hopefully it's easier to digest than the older version of the code that is now removed).

  • Part of the reason the old code was so hard to understand was that the original version wrote a log file that tried to look a lot like the error-reports written by our unit-tests. The new testing code is written with the intention of being called with pytest (even though they aren't actually implementing answer tests).
  • The core functionality of can be found test/answer_tests/test_utils/ckpt_restart_testing.py
  • The actual automated tests are defined in test/answer_tests/test_ckpt_restart.py.
    • However, the pytest framework can be pretty cumbersome when you are trying to diagnose issues. For that reason, I have written a command-line tool, called tools/ckpt_restart_test.py, that can be used to help diagnose problems (it makes use of the exact same core-logic as the unit-tests).
    • For context, this replaced an old file by the same-name that was previously used to run the automated restart tests that had been integrated with the Scons build system
  • The implementation of the testing logic is slightly more complicated than I would have liked since ability to test the legacy charm-based restarts were introduced an after-thought

Essentially these tests execute the following logic:

  1. They setup some temporary directories where the simulations will be run
  2. The checkpoint-run is executed
  3. The restart-run is executed, starting from one of the checkpoints
  4. Data dumps from both simulations are compared (the tests require them to be bitwise identical).
    • For tests of new-style checkpoints, the test code explicitly compares the checkpoint-dumps themselves. It can optionally be configured to compare dumps generated by the legacy-Output machinery (in the future, we should probably add support for comparing the scalable data-dumps)
    • For tests of old-style checkpoints, the code just compares the dumps generated by the legacy-Output machinery.

The website-documentation has been updated for these checkpoint tests, and provides additional detail. If you still have questions, let me know and I can update it.

Notes for Reviewers:

  • I have confirmed that the tests of the new-style tests all pass when the changes introduced in Texascale bug fixes #313 are included. However, the old-style tests all pass regardless of which version is used. Thus I have commented out the tests for the new-style restarts from test/answer_tests/test_ckpt_restart.py and I will submit another PR after Texascale bug fixes #313 is merged (or update this one if it hasn't merged yet).
  • I have stumbled onto 2 bugs with these new tests. There seem to be some very subtle issues when the new-style checkpoint-restart tests are used with the VL+CT solver. When old-style data dumps are compared (following a new-style restart), there are also differences in the dataset saved to the root block. I plan to investigate both of these topics slightly more before opening issues. But, they have no impact on whether this PR should be accepted or not.

EDIT: Now that #318 is merged, this PR no longer has any dependencies. The "Files Changed" tab now just includes changes from this PR (rather than from both PRs)

@mabruzzo mabruzzo added testing io:checkpoint-restart Issue/PR associated with checkpoint/restart labels Jul 4, 2023
@mabruzzo mabruzzo force-pushed the new-ckpt-restart-tests branch from 7b982aa to be93341 Compare July 5, 2023 17:33
@mabruzzo mabruzzo force-pushed the new-ckpt-restart-tests branch from be93341 to 136c40a Compare July 5, 2023 19:45
@mabruzzo mabruzzo force-pushed the new-ckpt-restart-tests branch from 136c40a to d3d6a2e Compare July 5, 2023 20:33
@mabruzzo mabruzzo force-pushed the new-ckpt-restart-tests branch from 466a928 to 8cab0f5 Compare August 1, 2023 13:06
@jwise77 jwise77 requested a review from jobordner February 13, 2024 14:40
@jobordner
Copy link
Contributor

Recommendations:

  1. Change occurrences of "legacy" to e.g. "charm" or "charm-based" checkpoint/restart, and refer to the new approach as e.g. "cello-based"
  • The Charm++ group has put a lot of effort into developing some very sophisticated algorithms for checkpoint/restart (as well as load-balancing), so for diplomatic reasons I think we should not relegate using their approaches to "legacy" status.
  • The original Charm++-based CR approach provides some useful functionality not in the new Cello-based approach, for example providing fault-tolerance by checkpointing to remote memory, which can allow simulations to continue running even after a node goes down.
  • We will want to be in the position to take advantage of any future improvements and capabilities to CR that the Charm++ group develops.
  1. Rename or remove "ppm-implostion2D-NoTestingSection.incl" (and spellcheck!)
  2. Ensure tests run correctly after merging in latest updates from the main branch.

@mabruzzo
Copy link
Contributor Author

@jobordner: I've started to address your recommendations.

I had quick question about this recommendation:

  1. Rename or remove "ppm-implostion2D-NoTestingSection.incl" (and spellcheck!)

I have now renamed this file so that it's called ppm-implosion2D-NoTestingSection.incl.

This file exists because:

  • it's difficult to run checkpoint-restart tests for a parameter-file that defines a Testing parameter-group or (to a lesser-extent) a Stopping parameter-group.
  • I wanted to use the test-setup defined in input/PPM/ppm.incl for the checkpoint-restart tests, but that file unfortunately includes Testing and Stopping sections of parameters. (I also thought it would be better to simply make a copy of the contents since that increases the maintenance burden later)
  • Since input/PPM/ppm.incl is already used in a bunch of test-problems, I decided to avoid directly modifying that problem (I was primarily concerned that it would make this PR very difficult to review).
  • Consequently, I moved the bulk of ppm.incl's contents into ppm-implosion2D-NoTestingSection.incl other than the Testing and Stopping sections. Then ppm.incl includes those contents and introduces the Testing and Stopping sections

Would you prefer that I do something like:

  • rename ppm.incl to something like ppm-implosion2D-fulltestcase.incl AND
  • rename ppm-implosion2D-NoTestingSection.incl to ppm-implosion2D.incl?

@jobordner
Copy link
Contributor

  • it's difficult to run checkpoint-restart tests for a parameter-file that defines a Testing parameter-group or (to a lesser-extent) a Stopping parameter-group.

(I assume you're referring to parameter include files (foo.incl) and not parameter test files (test-foo.in) here? Defining Stopping parameters is required for a simulation to stop).

  • I wanted to use the test-setup defined in input/PPM/ppm.incl for the checkpoint-restart tests, but that file unfortunately includes Testing and Stopping sections of parameters. (I also thought it would be better to simply make a copy of the contents since that increases the maintenance burden later)
  • Since input/PPM/ppm.incl is already used in a bunch of test-problems, I decided to avoid directly modifying that problem (I was primarily concerned that it would make this PR very difficult to review).
  • Consequently, I moved the bulk of ppm.incl's contents into ppm-implosion2D-NoTestingSection.incl other than the Testing and Stopping sections. Then ppm.incl includes those contents and introduces the Testing and Stopping sections

Would you prefer that I do something like:

  • rename ppm.incl to something like ppm-implosion2D-fulltestcase.incl AND
  • rename ppm-implosion2D-NoTestingSection.incl to ppm-implosion2D.incl?

IMHO duplicating code (or parameter files) should be avoided, so relocating the Testing and Stopping parameter groups from the parameter include file to the parameter test files would be preferred, despite complicating reviewers' work. Adding them via e.g. ppm-testing.incl and ppm-stopping.incl files would simplify both relocating and reviewing.

@jobordner jobordner closed this May 17, 2024
@mabruzzo mabruzzo reopened this May 17, 2024
mabruzzo added 2 commits May 20, 2024 11:13
…ld produce an error if the mask_list_ attribute was not empty. I didn't realize that it usually held a single nullptr)
1. The new approach that writes checkpoints with the ``"check"`` method, that promises a lot more flexibility.
2. The legacy approach that uses Charm++'s machinery

The intention is to entirely transition to the new approach. But, at this time the new approach has limitations (see `Issue #316 <https://github.com/enzo-project/enzo-e/issues/316>`_). For that reason, both approaches are automatically tested (by the pytest machinery).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention is not to transition to the new approach. Each has its advantages and disadvantages, so both should be kept. Charm++ CR can do lots of things Cello CR cannot.


.. code-block:: bash

python3 tools/ckpt_restart_test.py \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the following when I run this:

Test failed: NOT cleaning up
Traceback (most recent call last):
  File "/home/bordner/Cello/enzo-e.mabruzzo.new-ckpt-restart-tests/tools/ckpt_restart_test.py", line 165, in <module>
    main(parser.parse_args())
  File "/home/bordner/Cello/enzo-e.mabruzzo.new-ckpt-restart-tests/tools/ckpt_restart_test.py", line 71, in main
    run_ckpt_restart_test(nominal_input = args.input,
  File "/home/bordner/Cello/enzo-e.mabruzzo.new-ckpt-restart-tests/test/answer_tests/test_utils/ckpt_restart_testing.py", line 464, in run_ckpt_restart_test
    restart_h5obj_map = ckpt_block_file_map(
                        ^^^^^^^^^^^^^^^^^^^^
  File "/home/bordner/Cello/enzo-e.mabruzzo.new-ckpt-restart-tests/test/answer_tests/test_utils/ckpt_restart_testing.py", line 22, in ckpt_block_file_map
    with open(filelist_path, 'r') as f_filelist:
         ^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'my_test/restart_run/Check-02/check.file_list'

@jobordner
Copy link
Contributor

This branch is 11 commits ahead of, 168 commits behind

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
io:checkpoint-restart Issue/PR associated with checkpoint/restart testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants