Multilevel checkpointing lets HPC applications take frequent, inexpensive checkpoints alongside less frequent, more resilient ones, which improves efficiency and reduces load on the parallel file system. LLNL researchers developed the Scalable Checkpoint/Restart (SCR) library to bring this approach to large-scale production systems.
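
To make the idea concrete, here is a minimal sketch of a two-level checkpoint schedule. It is not SCR's API or its actual policy: the function names, the level labels, and the intervals (every 10 steps to node-local storage, every 100 steps to the parallel file system) are all hypothetical, chosen only for illustration.

```python
def checkpoint_level(step, local_interval=10, pfs_interval=100):
    """Return the checkpoint level to write at this step, or None.

    Hypothetical policy: when both levels are due at the same step,
    the more resilient (and more expensive) level takes precedence.
    """
    if step > 0 and step % pfs_interval == 0:
        return "parallel_file_system"  # infrequent, resilient, costly
    if step > 0 and step % local_interval == 0:
        return "node_local"  # frequent, cheap, less resilient
    return None


def count_checkpoints(total_steps, local_interval=10, pfs_interval=100):
    """Count how many checkpoints each level receives over a run."""
    counts = {"node_local": 0, "parallel_file_system": 0}
    for step in range(1, total_steps + 1):
        level = checkpoint_level(step, local_interval, pfs_interval)
        if level is not None:
            counts[level] += 1
    return counts


if __name__ == "__main__":
    print(count_checkpoints(1000))
```

With these assumed intervals, a 1000-step run writes 90 checkpoints to node-local storage and only 10 to the parallel file system, which is the load reduction the paragraph above describes.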
Learn more on the LLNL Computing website. Read the SCR user guide and fork the code on GitHub.