Multilevel checkpointing lets HPC applications take frequent, inexpensive checkpoints alongside less frequent, more resilient ones, which improves efficiency and reduces load on the parallel file system. To bring this approach to large-scale production systems, LLNL researchers developed the Scalable Checkpoint/Restart (SCR) library.
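
The two-tier idea can be illustrated with a toy schedule. The sketch below is not SCR code and does not use the SCR API. The function names, the level labels "local" and "pfs", and the intervals of 10 and 100 steps are all assumptions chosen for illustration. It only shows how most checkpoints can go to a cheap level while a small fraction go to the parallel file system.

```python
from typing import Dict, Optional


def checkpoint_level(step: int, local_every: int = 10, pfs_every: int = 100) -> Optional[str]:
    """Pick the checkpoint level for a time step (hypothetical policy).

    "pfs"   = infrequent, resilient checkpoint on the parallel file system
    "local" = frequent, inexpensive checkpoint (for example, node-local storage)
    None    = no checkpoint at this step
    """
    if step <= 0:
        return None
    # The resilient level takes priority when both intervals coincide.
    if step % pfs_every == 0:
        return "pfs"
    if step % local_every == 0:
        return "local"
    return None


def count_checkpoints(total_steps: int, local_every: int = 10, pfs_every: int = 100) -> Dict[str, int]:
    """Count how many checkpoints of each level a run of total_steps would write."""
    counts = {"local": 0, "pfs": 0}
    for step in range(1, total_steps + 1):
        level = checkpoint_level(step, local_every, pfs_every)
        if level is not None:
            counts[level] += 1
    return counts


if __name__ == "__main__":
    # With the assumed intervals, a 1000-step run writes 90 local
    # checkpoints and only 10 to the parallel file system.
    print(count_checkpoints(1000))
```

Under these assumed intervals, checkpointing every 10 steps to the parallel file system alone would mean 100 writes there. The two-level schedule keeps the same checkpoint frequency while sending only 10 writes to the shared file system.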

Learn more on the LLNL Computing website. Read the SCR user guide and fork the code on GitHub.