In my last blog post [Link], we talked about how to run PyTorch Lightning inside a SLURM cluster with checkpointing (i.e., saving and reloading the training state on the next job). In this post, I will focus primarily on how to set up PyTorch Lightning for multiple GPUs, in particular DistributedDataParallel. Here is a simple working configuration that gets your PyTorch Lightning script to run on multiple GPUs.
hydra:
  launcher:
    _target_: hydra_plugins.hydra_submitit_launcher.submitit_launcher.SlurmLauncher
    submitit_folder: ${hydra.sweep.dir}/.submitit/%j
    account: realitylab
    partition: gpu-a40
    timeout_min: 300
    cpus_per_task: 5
    gpus_per_node: 4
    tasks_per_node: 4
    mem_gb: 256
    nodes: 1
    name: ${hydra.job.name}
    comment: null
    constraint: "a40"
    exclude: null
    cpus_per_gpu: null
    gpus_per_task: 1
    mem_per_gpu: null
    mem_per_cpu: null
    signal_delay_s: 600
    max_num_timeout: 100
    additional_parameters: {}
    array_parallelism: 256
    setup: []
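To make the config concrete, here is a minimal sketch of the kind of training entrypoint this launcher would drive. The train.py name, the conf/config.yaml layout, and the toy model are placeholders of mine, not the post's actual script, and the Trainer flags follow the Lightning 1.x API (newer versions use strategy="ddp" instead of accelerator="ddp"). You launch it with Hydra's --multirun flag so that the submitit SLURM launcher above is used.

# train.py -- launch with:  python train.py --multirun
import hydra
import pytorch_lightning as pl
import torch
from omegaconf import DictConfig
from torch.utils.data import DataLoader, TensorDataset


class ToyModel(pl.LightningModule):
    """A tiny stand-in LightningModule so the sketch is runnable."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-2)


@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    # Random data; replace with your own DataModule / DataLoader.
    loader = DataLoader(
        TensorDataset(torch.randn(256, 32), torch.randn(256, 1)), batch_size=32
    )
    trainer = pl.Trainer(
        gpus=-1,            # as discussed below: let Lightning use all visible GPUs
        accelerator="ddp",  # Lightning 1.x name; newer versions: strategy="ddp"
        max_epochs=1,
    )
    trainer.fit(ToyModel(), loader)


if __name__ == "__main__":
    main()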
[Scenario 1]: 1 node, 4 GPUs on the same node
Conceptually, you would like the SLURM scheduler to spawn 1 node with 4 tasks for your job. Then, each task gets 1 GPU, and ideally the SLURM scheduler automatically runs your script 4 times, each run with its own GPU rank information. Finally, the GPUs from the different tasks communicate with one another and each performs its share of the training.
To achieve this, we modify the following parameters for the SLURM launcher:
nodes: 1
cpus_per_task: 5
gpus_per_task: 1
gpus_per_node: 4
tasks_per_node: 4
Here, we set up a job with 1 node and 4 tasks, where each task has 5 CPUs and 1 GPU. Finally, you only need to set gpus=-1 in PyTorch Lightning's Trainer.
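If you want to confirm that SLURM really did run the script 4 times with distinct ranks, a quick sanity check (my own addition, not something the launcher requires) is to print the SLURM environment variables at the top of your entrypoint; these are the same variables Lightning's SLURM integration reads to assign each process its global and local rank.

import os

# Printed once per task: with nodes=1 and tasks_per_node=4 you should see
# SLURM_PROCID 0..3, SLURM_LOCALID 0..3, and SLURM_NTASKS=4.
print(
    "node:", os.environ.get("SLURM_NODEID"),
    "| global rank:", os.environ.get("SLURM_PROCID"),
    "| local rank:", os.environ.get("SLURM_LOCALID"),
    "| world size:", os.environ.get("SLURM_NTASKS"),
)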
[Scenario 2]: Multiple nodes, GPUs split equally on each node
Potentially, you should be able to get this to work by changing the number of nodes from 1 to the number of nodes you want. Note, though, that I haven't tried this yet. Note that PyTorch Lightning is currently working on the