Saving SCF results on disk and SCF checkpoints

For longer DFT calculations it is pretty standard to run them on a cluster in advance and to perform postprocessing (band structure calculation, plotting of density, etc.) at a later point and potentially on a different machine.

To support such workflows DFTK offers the two functions save_scfres and load_scfres, which save the data structure returned by self_consistent_field to disk and load it back into memory, respectively. For this purpose DFTK uses the JLD2.jl file format and Julia package.

Availability of `load_scfres`, `save_scfres` and checkpointing

As JLD2 is an optional dependency of DFTK, these features are only available once both DFTK and JLD2 have been loaded (using DFTK and using JLD2).

DFTK data formats are not yet fully matured

The data format in which DFTK saves data as well as the general interface of the load_scfres and save_scfres pair of functions are not yet fully matured. If you use these functions or the files they produce, expect to adapt your routines in the future, even across patch version bumps.

To illustrate the use of the functions in practice we will compute the total energy of the O₂ molecule at PBE level. To get the triplet ground state we use a collinear spin polarisation (see Collinear spin and magnetic systems for details) and a bit of temperature to ease convergence:

using DFTK
using LinearAlgebra
using JLD2

d = 2.079  # oxygen-oxygen bondlength
a = 9.0    # size of the simulation box
lattice = a * I(3)
O = ElementPsp(:O; psp=load_psp("hgh/pbe/O-q6.hgh"))
atoms     = [O, O]
positions = d / 2a * [[0, 0, 1], [0, 0, -1]]
magnetic_moments = [1., 1.]

Ecut  = 10  # Far too small to be converged
model = model_PBE(lattice, atoms, positions; temperature=0.02, smearing=Smearing.Gaussian(),
                  magnetic_moments)
basis = PlaneWaveBasis(model; Ecut, kgrid=[1, 1, 1])

scfres = self_consistent_field(basis, tol=1e-2, ρ=guess_density(basis, magnetic_moments))
save_scfres("scfres.jld2", scfres);
n     Energy            log10(ΔE)   log10(Δρ)   Magnet   Diag   Δtime
---   ---------------   ---------   ---------   ------   ----   ------
  1   -27.64472103757                   -0.13    0.001    6.5   89.4ms
  2   -28.92294808026        0.11       -0.82    0.673    2.0   76.5ms
  3   -28.93095187052       -2.10       -1.14    1.172    2.0   77.0ms
  4   -28.93763185705       -2.18       -1.18    1.765    2.0   68.9ms
  5   -28.93954139888       -2.72       -1.50    1.997    2.0   92.5ms
  6   -28.93959823871       -4.25       -1.99    1.978    1.0   65.6ms
  7   -28.93961183692       -4.87       -2.84    1.986    1.0   66.6ms
scfres.energies
Energy breakdown (in Ha):
    Kinetic             16.7715770
    AtomicLocal         -58.4947504
    AtomicNonlocal      4.7096482 
    Ewald               -4.8994689
    PspCorrection       0.0044178 
    Hartree             19.3610300
    Xc                  -6.3912242
    Entropy             -0.0008414

    total               -28.939611836918

The scfres.jld2 file could now be transferred to a different computer, where one could fire up a REPL to inspect the results of the above calculation:

using DFTK
using JLD2
loaded = load_scfres("scfres.jld2")
propertynames(loaded)
(:α, :history_Δρ, :converged, :occupation, :occupation_threshold, :algorithm, :basis, :runtime_ns, :n_iter, :history_Etot, :εF, :energies, :ρ, :n_bands_converge, :eigenvalues, :ψ, :ham)
loaded.energies
Energy breakdown (in Ha):
    Kinetic             16.7715770
    AtomicLocal         -58.4947504
    AtomicNonlocal      4.7096482 
    Ewald               -4.8994689
    PspCorrection       0.0044178 
    Hartree             19.3610300
    Xc                  -6.3912242
    Entropy             -0.0008414

    total               -28.939611836918

Since the loaded data contains exactly the same data as the scfres returned by the SCF calculation one could use it to plot a band structure, e.g. plot_bandstructure(load_scfres("scfres.jld2")) directly from the stored data.

Notice that both load_scfres and save_scfres work by transferring all data to/from the master process, which performs the IO operations without parallelisation. Since this can become slow, both functions support optional arguments to speed up the processing. An overview:

  • save_scfres("scfres.jld2", scfres; save_ψ=false) avoids saving the Bloch waves, which is usually faster and saves storage space.
  • load_scfres("scfres.jld2", basis) avoids reconstructing the basis from the file and uses the passed basis instead. This saves the time of constructing the basis twice and allows one to specify parallelisation options (via the passed basis). Usually this is useful for continuing a calculation on a supercomputer or cluster.

See also the discussion on Input and output formats on JLD2 files.
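As a brief illustration of the two options listed above, the following sketch saves the result without the Bloch waves and loads it back using the existing basis. It assumes that scfres and basis from the O₂ calculation above are still in scope; the file name scfres_light.jld2 is a hypothetical choice made for this example only.

```julia
using DFTK
using JLD2

# Assumes `scfres` and `basis` from the O₂ calculation above are still defined.
lightfile = "scfres_light.jld2"                 # hypothetical file name

# Skip the Bloch waves when writing, which keeps the file small
save_scfres(lightfile, scfres; save_ψ=false)

# Reuse the existing basis instead of reconstructing it from the file
loaded_light = load_scfres(lightfile, basis)

# The energies are stored independently of the Bloch waves
loaded_light.energies.total ≈ scfres.energies.total

rm(lightfile)  # clean up the extra file created by this example
```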

Checkpointing of SCF calculations

A related feature, which is especially useful for longer calculations with DFTK, is automatic checkpointing, where the state of the SCF is periodically written to disk. The advantage is that if the calculation errors or gets aborted because it overruns the walltime limit, one does not need to start from scratch, but can continue the calculation from the last checkpoint.

The easiest way to enable checkpointing is to use the kwargs_scf_checkpoints function, which does two things. (1) It sets up checkpointing using the ScfSaveCheckpoints callback and (2) if a checkpoint file is detected, the stored density is used to continue the calculation instead of the usual atomic-orbital based guess. In practice this is done by modifying the keyword arguments passed to self_consistent_field appropriately, e.g. by using the density or orbitals from the checkpoint file. For example:

checkpointargs = kwargs_scf_checkpoints(basis; ρ=guess_density(basis, magnetic_moments))
scfres = self_consistent_field(basis; tol=1e-2, checkpointargs...);
n     Energy            log10(ΔE)   log10(Δρ)   Magnet   α      Diag   Δtime
---   ---------------   ---------   ---------   ------   ----   ----   ------
  1   -27.64055254634                   -0.13    0.001   0.80    6.5   88.8ms
  2   -28.92280959099        0.11       -0.83    0.680   0.80    2.0   96.5ms
  3   -28.93111905191       -2.08       -1.14    1.187   0.80    2.5   88.0ms
  4   -28.93773166436       -2.18       -1.19    1.773   0.80    2.0   76.1ms
  5   -28.93949892388       -2.75       -1.38    1.999   0.80    1.5   82.7ms
  6   -28.93959862561       -4.00       -2.01    1.978   0.80    1.0   80.1ms

Notice that the ρ argument is now passed to kwargs_scf_checkpoints instead. If we run the SCF again in the same folder (here using a tighter tolerance), the calculation just continues.

checkpointargs = kwargs_scf_checkpoints(basis; ρ=guess_density(basis, magnetic_moments))
scfres = self_consistent_field(basis; tol=1e-3, checkpointargs...);
n     Energy            log10(ΔE)   log10(Δρ)   Magnet   α      Diag   Δtime
---   ---------------   ---------   ---------   ------   ----   ----   ------
  1   -28.93960745122                   -2.87    1.985   0.80    8.5    112ms
  2   -28.93961218609       -5.32       -3.37    1.985   0.80    1.0   70.2ms

Since only the density is stored in a checkpoint (and not the Bloch waves), the first step needs a slightly elevated number of diagonalizations. Note that reconstructing the checkpointargs in this second call is important: they now contain different data, such that the SCF continues from the checkpoint. By default the checkpoint is saved in the file dftk_scf_checkpoint.jld2, which can be changed using the filename keyword argument of kwargs_scf_checkpoints. The file is not deleted by DFTK, so it is your responsibility to clean it up. Further note that warnings or errors will arise if you try to use a checkpoint that is incompatible with your calculation.

We can also inspect the checkpoint file manually using the load_scfres function and use its contents to continue the calculation by hand:

oldstate = load_scfres("dftk_scf_checkpoint.jld2")
scfres   = self_consistent_field(oldstate.basis, ρ=oldstate.ρ, ψ=oldstate.ψ, tol=1e-4);
n     Energy            log10(ΔE)   log10(Δρ)   Magnet   Diag   Δtime
---   ---------------   ---------   ---------   ------   ----   ------
  1   -28.93855940986                   -2.28    1.985    6.5   87.7ms
  2   -28.93945851930       -3.05       -2.69    1.985    1.0   64.9ms
  3   -28.93961067905       -3.82       -2.54    1.985    4.0   92.5ms
  4   -28.93961149147       -6.09       -2.64    1.985    1.0   74.7ms
  5   -28.93961253921       -5.98       -2.86    1.985    1.0   63.5ms
  6   -28.93961256864       -7.53       -2.84    1.985    1.0   62.4ms
  7   -28.93961262684       -7.24       -2.84    1.985    1.0   63.2ms
  8   -28.93961278786       -6.79       -2.89    1.985    1.0   83.5ms
  9   -28.93961307967       -6.53       -3.14    1.985    1.0   78.1ms
 10   -28.93961311117       -7.50       -3.25    1.985    1.0   63.5ms
 11   -28.93961312314       -7.92       -3.34    1.985    1.0   64.6ms
 12   -28.93961314317       -7.70       -3.47    1.985    1.0   66.2ms
 13   -28.93961316552       -7.65       -3.97    1.985    1.5   71.2ms
 14   -28.93961317206       -8.18       -4.81    1.985    2.5   80.0ms

Some details on what happens under the hood in this mechanism: When using the kwargs_scf_checkpoints function, the ScfSaveCheckpoints callback is employed during the SCF, which causes the density to be stored to the JLD2 file in every iteration. When a checkpoint file is read, kwargs_scf_checkpoints transparently patches away the ψ and ρ keyword arguments and replaces them by the data obtained from the file. For more details on using callbacks with DFTK's self_consistent_field function see Monitoring self-consistent field calculations.
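The pattern described here (a callback that writes the density in every iteration, plus a helper that swaps the starting density when a checkpoint file exists) can be sketched in a few lines of plain Julia. This is a toy illustration and not DFTK's implementation: toy_save_callback, toy_checkpoint_kwargs and toy_scf are hypothetical names, the "SCF" is a simple fixed-point iteration, and the Serialization standard library stands in for JLD2.

```julia
using Serialization  # stdlib, used here only as a stand-in for JLD2

# Hypothetical stand-in for the ScfSaveCheckpoints callback:
# a closure that writes the current density to `filename` on every call.
toy_save_callback(filename) = info -> serialize(filename, info.ρ)

# Hypothetical stand-in for kwargs_scf_checkpoints: always install the callback
# and, if a checkpoint file exists, replace the passed ρ by the stored density.
function toy_checkpoint_kwargs(; filename, ρ, kwargs...)
    ρ_start = isfile(filename) ? deserialize(filename) : ρ
    (; kwargs..., ρ=ρ_start, callback=toy_save_callback(filename))
end

# Toy fixed-point iteration ρ ← cos.(ρ) playing the role of the SCF loop.
function toy_scf(; ρ, tol, callback, maxiter=200)
    for n_iter in 1:maxiter
        ρ_next = cos.(ρ)
        Δρ = maximum(abs, ρ_next - ρ)
        ρ = ρ_next
        callback((; ρ, n_iter))   # checkpoint written in every iteration
        Δρ < tol && return (; ρ, n_iter, converged=true)
    end
    (; ρ, n_iter=maxiter, converged=false)
end

filename = joinpath(mktempdir(), "toy_checkpoint.bin")
ρ_guess  = [0.0, 1.0]

# First run starts from the guess, since no checkpoint exists yet.
run1 = toy_scf(; toy_checkpoint_kwargs(; filename, ρ=ρ_guess, tol=1e-2)...)
# Second run: the kwargs are rebuilt, so the stored density replaces the guess.
run2 = toy_scf(; toy_checkpoint_kwargs(; filename, ρ=ρ_guess, tol=1e-6)...)
```

Rebuilding the keyword arguments before the second call is what lets the toy iteration pick up the stored density, so run2 needs fewer iterations than a run to the same tolerance started from ρ_guess.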

(Cleanup files generated by this notebook)

rm("dftk_scf_checkpoint.jld2")
rm("scfres.jld2")