From the CLI

With argparse

We provide some utilities to extend an argparse.ArgumentParser with arguments for building a torchrunx.Launcher.
from argparse import ArgumentParser

from torchrunx.integrations.parsing import add_torchrunx_argument_group, launcher_from_args

if __name__ == '__main__':
    parser = ArgumentParser()
    add_torchrunx_argument_group(parser)
    args = parser.parse_args()

    launcher = launcher_from_args(args)
    launcher.run(...)
python ... --help
then results in:
usage: -c [-h] [--hostnames HOSTNAMES [HOSTNAMES ...]]
[--workers-per-host WORKERS_PER_HOST [WORKERS_PER_HOST ...]]
[--ssh-config-file SSH_CONFIG_FILE]
[--backend {nccl,gloo,mpi,ucc,None}] [--timeout TIMEOUT]
[--copy-env-vars COPY_ENV_VARS [COPY_ENV_VARS ...]]
[--extra-env-vars [EXTRA_ENV_VARS ...]] [--env-file ENV_FILE]
optional arguments:
-h, --help show this help message and exit
torchrunx:
--hostnames HOSTNAMES [HOSTNAMES ...]
Nodes to launch the function on. Default: 'auto'. Use
'slurm' to infer from SLURM.
--workers-per-host WORKERS_PER_HOST [WORKERS_PER_HOST ...]
Processes to run per node. Can be 'cpu', 'gpu', or
list[int]. Default: 'gpu'.
--ssh-config-file SSH_CONFIG_FILE
Path to SSH config file. Default: '~/.ssh/config' or
'/etc/ssh/ssh_config'.
--backend {nccl,gloo,mpi,ucc,None}
For worker process group. Default: 'nccl'. Use 'gloo'
for CPU. 'None' to disable.
--timeout TIMEOUT Worker process group timeout in seconds. Default: 600.
--copy-env-vars COPY_ENV_VARS [COPY_ENV_VARS ...]
Environment variables to copy to workers. Supports
Unix pattern matching.
--extra-env-vars [EXTRA_ENV_VARS ...]
Additional environment variables as key=value pairs.
--env-file ENV_FILE Path to a .env file with environment variables.
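
As a sketch, a script wired up this way could then be invoked with the flags listed above. The script name and hostnames here are placeholders, and the values are purely illustrative:

python train.py --hostnames node1 node2 --workers-per-host 8 --backend nccl --timeout 1200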
With automatic CLI tools
We can also automatically populate torchrunx.Launcher arguments using most CLI tools, e.g. tyro or any other tool that generates an interface from dataclasses.
import torchrunx
import tyro

if __name__ == "__main__":
    launcher = tyro.cli(torchrunx.Launcher)
    results = launcher.run(...)
python ... --help
then results in:
For configuring the function launch environment.
╭─ options ──────────────────────────────────────────────────────────────────╮
│ -h, --help │
│ show this help message and exit │
│ --hostnames {[STR [STR ...]]}|{auto,slurm} │
│ Nodes to launch the function on. By default, infer from SLURM, else │
│ ``["localhost"]``. (default: auto) │
│ --workers-per-host INT|{[INT [INT ...]]}|{cpu,gpu} │
│ Number of processes to run per node. By default, number of GPUs per │
│ host. (default: gpu) │
│ --ssh-config-file {None}|STR|PATHLIKE │
│ For connecting to nodes. By default, ``"~/.ssh/config"`` or │
│ ``"/etc/ssh/ssh_config"``. (default: None) │
│ --backend {None,nccl,gloo,mpi,ucc} │
│ `Backend │
│ <https://pytorch.org/docs/stable/distributed.html#torch.distributed.B… │
│ for worker process group. By default, NCCL (GPU backend). │
│ Use GLOO for CPU backend. ``None`` for no process group. │
│ (default: nccl) │
│ --timeout INT │
│ Worker process group timeout (seconds). (default: 600) │
│ --copy-env-vars [STR [STR ...]] │
│ Environment variables to copy from the launcher process to workers. │
│ Supports Unix pattern matching syntax. (default: PATH LD_LIBRARY │
│ LIBRARY_PATH 'PYTHON*' 'CUDA*' 'TORCH*' 'PYTORCH*' 'NCCL*') │
│ --extra-env-vars {None}|{[STR STR [STR STR ...]]} │
│ Additional environment variables to load onto workers. (default: None) │
│ --env-file {None}|STR|PATHLIKE │
│ Path to a ``.env`` file, containing environment variables to load onto │
│ workers. (default: None) │
╰────────────────────────────────────────────────────────────────────────────╯
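
For example, a hypothetical invocation of such a script (name assumed to be train.py) could use the literal choices shown above, such as inferring hostnames from SLURM and loading worker environment variables from a .env file:

python train.py --hostnames slurm --workers-per-host gpu --env-file .env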