How to set up PyTorch Lightning to work on a SLURM cluster. In a typical SLURM workflow, you submit a job to the cluster queue. If you request a very long time limit, your job tends to sit in the queue for a long time before it is scheduled. The workaround is to submit a short job (e.g., 1-6 hours, or enough time for one training and one validation loop) and checkpoint the training state so the next job can resume from it. That is the focus of this post: how to set up PyTorch Lightning to work with a SLURM cluster and checkpoint training across jobs.
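As a concrete illustration of the short-job-plus-checkpoint pattern, here is a sketch of an sbatch script, assuming a hypothetical `train.py` entry point. The `--signal` directive asks SLURM to warn the process shortly before the wall-time limit, which PyTorch Lightning's SLURM integration can catch to save a checkpoint and requeue the job.

```shell
#!/bin/bash
#SBATCH --job-name=pl-train
#SBATCH --time=04:00:00        # short wall time so the job gets scheduled quickly
#SBATCH --gres=gpu:1
#SBATCH --signal=SIGUSR1@90    # send SIGUSR1 90 seconds before the time limit
#SBATCH --requeue              # allow SLURM to requeue this job after it ends

# train.py is a hypothetical script; with Lightning's SLURMEnvironment
# (auto_requeue=True) the trainer catches SIGUSR1, checkpoints, and requeues.
srun python train.py
```

On the Python side, enabling this behavior amounts to constructing the `Trainer` with the `SLURMEnvironment(auto_requeue=True)` plugin and a `ModelCheckpoint` callback, so that each short job picks up where the previous one left off.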