New automated Checkpoint-Restart Tests #324

mabruzzo · 2023-07-04T21:03:33Z

This PR introduces new automated Checkpoint-Restart tests that will be executed by the continuous integration as part of the pytest framework.

Overview

Prior to the migration from Scons to Cmake, we previously had some Checkpoint-Restart tests. Unfortunately, those tests got lost in the transition. At the same time, James has been hard at work developing/refining the new Scalable Checkpoint functionality.

It was initially my intention to just write tests for the new-style checkpoints. However, issue #316 has made it clear that the new-style checkpoints are not ready to completely replace the old-style checkpoints quite yet. For that reason, I have introduced tests of both types of Checkpoints.

I largely rewrote most of the code for the Checkpoint-Restart functionality (hopefully it's easier to digest than the older version of the code that is now removed).

Part of the reason the old code was so hard to understand was that the original version wrote a log file that tried to look a lot like the error-reports written by our unit-tests. The new testing code is written with the intention of being called with pytest (even though they aren't actually implementing answer tests).
The core functionality of can be found test/answer_tests/test_utils/ckpt_restart_testing.py
The actual automated tests are defined in test/answer_tests/test_ckpt_restart.py.
- However, the pytest framework can be pretty cumbersome when you are trying to diagnose issues. For that reason, I have written a command-line tool, called tools/ckpt_restart_test.py, that can be used to help diagnose problems (it makes use of the exact same core-logic as the unit-tests).
- For context, this replaced an old file by the same-name that was previously used to run the automated restart tests that had been integrated with the Scons build system
The implementation of the testing logic is slightly more complicated than I would have liked since ability to test the legacy charm-based restarts were introduced an after-thought

Essentially these tests execute the following logic:

They setup some temporary directories where the simulations will be run
The checkpoint-run is executed
The restart-run is executed, starting from one of the checkpoints
Data dumps from both simulations are compared (the tests require them to be bitwise identical).
- For tests of new-style checkpoints, the test code explicitly compares the checkpoint-dumps themselves. It can optionally be configured to compare dumps generated by the legacy-Output machinery (in the future, we should probably add support for comparing the scalable data-dumps)
- For tests of old-style checkpoints, the code just compares the dumps generated by the legacy-Output machinery.

The website-documentation has been updated for these checkpoint tests, and provides additional detail. If you still have questions, let me know and I can update it.

Notes for Reviewers:

I have confirmed that the tests of the new-style tests all pass when the changes introduced in Texascale bug fixes #313 are included. However, the old-style tests all pass regardless of which version is used. Thus I have commented out the tests for the new-style restarts from test/answer_tests/test_ckpt_restart.py and I will submit another PR after Texascale bug fixes #313 is merged (or update this one if it hasn't merged yet).
I have stumbled onto 2 bugs with these new tests. There seem to be some very subtle issues when the new-style checkpoint-restart tests are used with the VL+CT solver. When old-style data dumps are compared (following a new-style restart), there are also differences in the dataset saved to the root block. I plan to investigate both of these topics slightly more before opening issues. But, they have no impact on whether this PR should be accepted or not.

EDIT: Now that #318 is merged, this PR no longer has any dependencies. The "Files Changed" tab now just includes changes from this PR (rather than from both PRs)

…-restart functionallity.

…new ckpt-restart machinery.

…estart functionality (until PR enzo-project#313 is merged)

doc/source/tests/checkpoint.rst

doc/source/tests/pytest_framework.rst

Co-authored-by: Greg Bryan <[email protected]>

jobordner · 2024-03-14T01:13:34Z

Recommendations:

Change occurrences of "legacy" to e.g. "charm" or "charm-based" checkpoint/restart, and refer to the new approach as e.g. "cello-based"

The Charm++ group has put a lot of effort into developing some very sophisticated algorithms for checkpoint/restart (as well as load-balancing), so for diplomatic reasons I think we should not relegate using their approaches to "legacy" status.

The original Charm++-based CR approach provides some useful functionality not in the new Cello-based approach, for example providing fault-tolerance by checkpointing to remote memory, which can allow simulations to continue running even after a node goes down.

We will want to be in the position to take advantage of any future improvements and capabilities to CR that the Charm++ group develops.

Rename or remove "ppm-implostion2D-NoTestingSection.incl" (and spellcheck!)
Ensure tests run correctly after merging in latest updates from the main branch.

…estingSection.incl

mabruzzo · 2024-05-13T17:24:08Z

@jobordner: I've started to address your recommendations.

I had quick question about this recommendation:

Rename or remove "ppm-implostion2D-NoTestingSection.incl" (and spellcheck!)

I have now renamed this file so that it's called ppm-implosion2D-NoTestingSection.incl.

This file exists because:

it's difficult to run checkpoint-restart tests for a parameter-file that defines a Testing parameter-group or (to a lesser-extent) a Stopping parameter-group.
I wanted to use the test-setup defined in input/PPM/ppm.incl for the checkpoint-restart tests, but that file unfortunately includes Testing and Stopping sections of parameters. (I also thought it would be better to simply make a copy of the contents since that increases the maintenance burden later)
Since input/PPM/ppm.incl is already used in a bunch of test-problems, I decided to avoid directly modifying that problem (I was primarily concerned that it would make this PR very difficult to review).
Consequently, I moved the bulk of ppm.incl's contents into ppm-implosion2D-NoTestingSection.incl other than the Testing and Stopping sections. Then ppm.incl includes those contents and introduces the Testing and Stopping sections

Would you prefer that I do something like:

rename ppm.incl to something like ppm-implosion2D-fulltestcase.incl AND
rename ppm-implosion2D-NoTestingSection.incl to ppm-implosion2D.incl?

jobordner · 2024-05-17T02:27:46Z

it's difficult to run checkpoint-restart tests for a parameter-file that defines a Testing parameter-group or (to a lesser-extent) a Stopping parameter-group.

(I assume you're referring to parameter include files (foo.incl) and not parameter test files (test-foo.in) here? Defining Stopping parameters is required for a simulation to stop).

I wanted to use the test-setup defined in input/PPM/ppm.incl for the checkpoint-restart tests, but that file unfortunately includes Testing and Stopping sections of parameters. (I also thought it would be better to simply make a copy of the contents since that increases the maintenance burden later)

Since input/PPM/ppm.incl is already used in a bunch of test-problems, I decided to avoid directly modifying that problem (I was primarily concerned that it would make this PR very difficult to review).

Consequently, I moved the bulk of ppm.incl's contents into ppm-implosion2D-NoTestingSection.incl other than the Testing and Stopping sections. Then ppm.incl includes those contents and introduces the Testing and Stopping sections

Would you prefer that I do something like:

rename ppm.incl to something like ppm-implosion2D-fulltestcase.incl AND

rename ppm-implosion2D-NoTestingSection.incl to ppm-implosion2D.incl?

IMHO duplicating code (or parameter files) should be avoided, so relocating the Testing and Stopping parameter groups from the parameter include file to the parameter test files would be preferred, despite complicating reviewers' work. Adding them via e.g. ppm-testing.incl and ppm-stopping.incl files would simplify both relocating and reviewing.

…ld produce an error if the mask_list_ attribute was not empty. I didn't realize that it usually held a single nullptr)

jobordner · 2024-09-26T01:46:26Z

doc/source/tests/checkpoint.rst

+1. The new approach that writes checkpoints with the ``"check"`` method, that promises a lot more flexibility.
+2. The legacy approach that uses Charm++'s machinery
+
+The intention is to entirely transition to the new approach. But, at this time the new approach has limitations (see `Issue #316 <https://github.com/enzo-project/enzo-e/issues/316>`_). For that reason, both approaches are automatically tested (by the pytest machinery).


The intention is not to transition to the new approach. Each has its advantages and disadvantages, so both should be kept. Charm++ CR can do lots of things Cello CR cannot.

jobordner · 2024-10-16T21:49:11Z

doc/source/tests/checkpoint.rst

+
+  .. code-block:: bash
+
+      python3 tools/ckpt_restart_test.py \


I get the following when I run this:

Test failed: NOT cleaning up Traceback (most recent call last): File "/home/bordner/Cello/enzo-e.mabruzzo.new-ckpt-restart-tests/tools/ckpt_restart_test.py", line 165, in <module> main(parser.parse_args()) File "/home/bordner/Cello/enzo-e.mabruzzo.new-ckpt-restart-tests/tools/ckpt_restart_test.py", line 71, in main run_ckpt_restart_test(nominal_input = args.input, File "/home/bordner/Cello/enzo-e.mabruzzo.new-ckpt-restart-tests/test/answer_tests/test_utils/ckpt_restart_testing.py", line 464, in run_ckpt_restart_test restart_h5obj_map = ckpt_block_file_map( ^^^^^^^^^^^^^^^^^^^^ File "/home/bordner/Cello/enzo-e.mabruzzo.new-ckpt-restart-tests/test/answer_tests/test_utils/ckpt_restart_testing.py", line 22, in ckpt_block_file_map with open(filelist_path, 'r') as f_filelist: ^^^^^^^^^^^^^^^^^^^^^^^^ FileNotFoundError: [Errno 2] No such file or directory: 'my_test/restart_run/Check-02/check.file_list'

jobordner · 2024-10-16T21:51:15Z

This branch is 11 commits ahead of, 168 commits behind

Laying the foundation for automated tests of the new-style checkpoint…

2dfe9c5

…-restart functionallity.

mabruzzo mentioned this pull request Jul 4, 2023

Improve ergonomics of pytest machinery #318

Merged

mabruzzo added testing io:checkpoint-restart Issue/PR associated with checkpoint/restart labels Jul 4, 2023

mabruzzo force-pushed the new-ckpt-restart-tests branch from 7b982aa to be93341 Compare July 5, 2023 17:33

mabruzzo added 2 commits July 5, 2023 15:45

introduced new ckpt-restart tests to test suite.

597b115

introduce support for testing the legacy charm-based restarts in the …

f922517

…new ckpt-restart machinery.

mabruzzo force-pushed the new-ckpt-restart-tests branch from be93341 to 136c40a Compare July 5, 2023 19:45

mabruzzo added 2 commits July 5, 2023 16:31

Documented the checkpoint restart tests.

346a7b7

comment out the automated answer-tests for the new-style Checkpoint-R…

d3d6a2e

…estart functionality (until PR enzo-project#313 is merged)

mabruzzo force-pushed the new-ckpt-restart-tests branch from 136c40a to d3d6a2e Compare July 5, 2023 20:33

gregbryan approved these changes Jul 21, 2023

View reviewed changes

doc/source/tests/checkpoint.rst Outdated Show resolved Hide resolved

doc/source/tests/pytest_framework.rst Outdated Show resolved Hide resolved

doc/source/tests/pytest_framework.rst Outdated Show resolved Hide resolved

mabruzzo and others added 2 commits July 21, 2023 14:26

Apply suggestions from code review

e09d392

Co-authored-by: Greg Bryan <[email protected]>

Merge branch 'main' into new-ckpt-restart-tests

8cab0f5

mabruzzo force-pushed the new-ckpt-restart-tests branch from 466a928 to 8cab0f5 Compare August 1, 2023 13:06

jwise77 requested a review from jobordner February 13, 2024 14:40

mabruzzo added 2 commits May 13, 2024 12:48

Merge branch 'main' into new-ckpt-restart-tests

470d6b4

renaming ppm-implosion2D-NoTestingSection.incl -> ppm-implosion2D-NoT…

927def2

…estingSection.incl

jobordner closed this May 17, 2024

mabruzzo reopened this May 17, 2024

mabruzzo added 2 commits May 20, 2024 11:13

Merge branch 'main' into new-ckpt-restart-tests

641246c

Fix minor serialization bug I previously introduced (basically, I wou…

a090ec5

…ld produce an error if the mask_list_ attribute was not empty. I didn't realize that it usually held a single nullptr)

jobordner reviewed Sep 26, 2024

View reviewed changes

jobordner reviewed Oct 16, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New automated Checkpoint-Restart Tests #324

New automated Checkpoint-Restart Tests #324

mabruzzo commented Jul 4, 2023 •

edited

Loading

jobordner commented Mar 14, 2024

mabruzzo commented May 13, 2024

jobordner commented May 17, 2024

jobordner Sep 26, 2024

jobordner Oct 16, 2024

jobordner commented Oct 16, 2024

New automated Checkpoint-Restart Tests #324

Are you sure you want to change the base?

New automated Checkpoint-Restart Tests #324

Conversation

mabruzzo commented Jul 4, 2023 • edited Loading

Overview

Notes for Reviewers:

jobordner commented Mar 14, 2024

mabruzzo commented May 13, 2024

jobordner commented May 17, 2024

jobordner Sep 26, 2024

Choose a reason for hiding this comment

jobordner Oct 16, 2024

Choose a reason for hiding this comment

jobordner commented Oct 16, 2024

mabruzzo commented Jul 4, 2023 •

edited

Loading