Using SLURM

Normally, you provide the hostnames argument to torchrunx.Launcher to specify the nodes on which to launch your function.

If your script is running within a SLURM allocation and you set hostnames to "auto" (the default) or "slurm", we will automatically detect the allocated nodes and distribute your function across all of them. A RuntimeError will be raised if hostnames="slurm" but no SLURM allocation is detected.
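For illustration, a minimal sketch of both options (the node names here are hypothetical):

import torchrunx

def distributed_training():
    ...

# Explicit hostnames: launch the function onto these specific machines.
torchrunx.Launcher(hostnames=["node1", "node2"]).run(distributed_training)

# Within a SLURM allocation: automatically detect and use the allocated nodes.
torchrunx.Launcher(hostnames="slurm").run(distributed_training)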

With sbatch

You could have a script (train.py) that includes:

import torchrunx

def distributed_training():
    ...

if __name__ == "__main__":
    torchrunx.Launcher(
        hostnames="slurm",
        workers_per_host="gpu"
    ).run(distributed_training)

And a run.batch file (e.g. allocating 2 nodes with 2 GPUs each):

#!/bin/bash
#SBATCH --job-name=torchrunx
#SBATCH --time=1:00:00
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=2
#SBATCH --gpus-per-node=2

# TODO: load your virtual environment
python train.py

sbatch run.batch should then run python train.py (the launcher process) on the first node of your SLURM allocation. The launcher will automatically distribute the training function onto both allocated nodes and parallelize it across each node's GPUs.

With submitit

If we use the submitit Python library, we can do all of this from a single Python script, submitting the SLURM job programmatically instead of via sbatch.

import submitit
import torchrunx

def distributed_training():
    ...

def launch_training():
    torchrunx.Launcher(
        hostnames="slurm",
        workers_per_host="gpu"
    ).run(distributed_training)

if __name__ == "__main__":
    executor = submitit.SlurmExecutor(folder="slurm_outputs")
    executor.update_parameters(
        # matches the sbatch directives above: 2 nodes with 2 GPUs each
        use_srun=False, time=60, ntasks_per_node=1,
        nodes=2, gpus_per_node=2
    )
    executor.submit(launch_training)
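executor.submit returns a submitit job handle. If you also want the script to wait for the SLURM job and surface any errors, one possible extension using submitit's Job API is:

job = executor.submit(launch_training)
print(job.job_id)  # SLURM job ID assigned by the scheduler
job.result()       # block until the job finishes; re-raises any exception from launch_training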