In my last blog post [Link], we talked about how to run PyTorch Lightning inside a SLURM cluster with checkpointing (i.e., saving and reloading the training state on the next job). In this post, I will focus primarily on how to set up PyTorch Lightning for multiple GPUs, in particular DistributedDataParallel. Here is a simple working configuration that gets your PyTorch Lightning script to run on multiple GPUs.
hydra:
  launcher:
    _target_: hydra_plugins.hydra_submitit_launcher.submitit_launcher.SlurmLauncher
    submitit_folder: ${hydra.sweep.dir}/.submitit/%j
    account: realitylab
    partition: gpu-a40
    timeout_min: 300
    cpus_per_task: 5
    gpus_per_node: 4
    tasks_per_node: 4
    mem_gb: 256
    nodes: 1
    name: ${hydra.job.name}
    comment: null
    constraint: "a40"
    exclude: null
    cpus_per_gpu: null
    gpus_per_task: 1
    mem_per_gpu: null
    mem_per_cpu: null
    signal_delay_s: 600
    max_num_timeout: 100
    additional_parameters: {}
    array_parallelism: 256
    setup: []
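To make the config concrete, here is a minimal sketch of the kind of training entrypoint this launcher would drive. The train.py name, the conf/config.yaml layout, and the toy model are placeholders of mine, not the post's actual script, and the Trainer flags follow the Lightning 1.x API (newer versions use strategy="ddp" instead of accelerator="ddp"). You launch it with Hydra's --multirun flag so that the submitit SLURM launcher above is used.

# train.py -- launch with:  python train.py --multirun
import hydra
import pytorch_lightning as pl
import torch
from omegaconf import DictConfig
from torch.utils.data import DataLoader, TensorDataset


class ToyModel(pl.LightningModule):
    """A tiny stand-in LightningModule so the sketch is runnable."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-2)


@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    # Random data; replace with your own DataModule / DataLoader.
    loader = DataLoader(
        TensorDataset(torch.randn(256, 32), torch.randn(256, 1)), batch_size=32
    )
    trainer = pl.Trainer(
        gpus=-1,            # as discussed below: let Lightning use all visible GPUs
        accelerator="ddp",  # Lightning 1.x name; newer versions: strategy="ddp"
        max_epochs=1,
    )
    trainer.fit(ToyModel(), loader)


if __name__ == "__main__":
    main()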
[Scenario 1]: 1 node, 4 GPUs on the same node
Conceptually, you would like the SLURM scheduler to spawn 1 node with 4 tasks for your job. Then, each task gets 1 GPU, and ideally the SLURM scheduler automatically runs your script 4 times, each run with its own GPU rank information. Finally, the GPUs from the different tasks communicate with one another and each performs its share of the training.
To achieve this, we modify the following parameters for the SLURM launcher:
nodes: 1
cpus_per_task: 5
gpus_per_task: 1
gpus_per_node: 4
tasks_per_node: 4
Here, we set up a job with 1 node and 4 tasks, where each task has 5 CPUs and 1 GPU. Finally, you only need to set gpus=-1 in PyTorch Lightning's Trainer.
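If you want to confirm that SLURM really did run the script 4 times with distinct ranks, a quick sanity check (my own addition, not something the launcher requires) is to print the SLURM environment variables at the top of your entrypoint; these are the same variables Lightning's SLURM integration reads to assign each process its global and local rank.

import os

# Printed once per task: with nodes=1 and tasks_per_node=4 you should see
# SLURM_PROCID 0..3, SLURM_LOCALID 0..3, and SLURM_NTASKS=4.
print(
    "node:", os.environ.get("SLURM_NODEID"),
    "| global rank:", os.environ.get("SLURM_PROCID"),
    "| local rank:", os.environ.get("SLURM_LOCALID"),
    "| world size:", os.environ.get("SLURM_NTASKS"),
)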
[Scenario 2]: Multiple nodes, GPUs split equally on each node
Potentially, you should be able to get this to work by changing the number of nodes from 1 to the number of nodes you want. Note, though, that I haven't tried this yet. Note that PyTorch Lightning is currently working on the