From the CLI

With argparse

We provide some utilities to extend an argparse.ArgumentParser with arguments for building a torchrunx.Launcher.
from argparse import ArgumentParser

from torchrunx.integrations.parsing import add_torchrunx_argument_group, launcher_from_args

if __name__ == '__main__':
    parser = ArgumentParser()
    add_torchrunx_argument_group(parser)
    args = parser.parse_args()

    launcher = launcher_from_args(args)
    launcher.run(...)
python ... --help
then results in:
usage: -c [-h] [--hostnames HOSTNAMES [HOSTNAMES ...]]
[--workers-per-host WORKERS_PER_HOST [WORKERS_PER_HOST ...]]
[--ssh-config-file SSH_CONFIG_FILE]
[--backend {nccl,gloo,mpi,ucc,None}] [--timeout TIMEOUT]
[--copy-env-vars COPY_ENV_VARS [COPY_ENV_VARS ...]]
[--extra-env-vars [EXTRA_ENV_VARS ...]] [--env-file ENV_FILE]
optional arguments:
-h, --help show this help message and exit
torchrunx:
--hostnames HOSTNAMES [HOSTNAMES ...]
Nodes to launch the function on. Default: 'auto'. Use
'slurm' to infer from SLURM.
--workers-per-host WORKERS_PER_HOST [WORKERS_PER_HOST ...]
Processes to run per node. Can be 'cpu', 'gpu', or
list[int]. Default: 'gpu'.
--ssh-config-file SSH_CONFIG_FILE
Path to SSH config file. Default: '~/.ssh/config' or
'/etc/ssh/ssh_config'.
--backend {nccl,gloo,mpi,ucc,None}
For worker process group. Default: 'nccl'. Use 'gloo'
for CPU. 'None' to disable.
--timeout TIMEOUT Worker process group timeout in seconds. Default: 600.
--copy-env-vars COPY_ENV_VARS [COPY_ENV_VARS ...]
Environment variables to copy to workers. Supports
Unix pattern matching.
--extra-env-vars [EXTRA_ENV_VARS ...]
Additional environment variables as key=value pairs.
--env-file ENV_FILE Path to a .env file with environment variables.
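
As a sketch, a script wired up this way could then be invoked with the flags listed above. The script name and hostnames here are placeholders, and the values are purely illustrative:

python train.py --hostnames node1 node2 --workers-per-host 8 --backend nccl --timeout 1200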
With automatic CLI tools
We can also automatically populate torchrunx.Launcher arguments using most CLI tools, e.g. tyro or any other tool that generates an interface from dataclasses.
import torchrunx
import tyro

if __name__ == "__main__":
    launcher = tyro.cli(torchrunx.Launcher)
    results = launcher.run(...)
python ... --help
then results in:
For configuring the function launch environment.
╭─ options ──────────────────────────────────────────────────────────────────╮
│ -h, --help │
│ show this help message and exit │
│ --hostnames {[STR [STR ...]]}|{auto,slurm} │
│ Nodes to launch the function on. By default, infer from SLURM, else │
│ ``["localhost"]``. (default: auto) │
│ --workers-per-host INT|{[INT [INT ...]]}|{cpu,gpu} │
│ Number of processes to run per node. By default, number of GPUs per │
│ host. (default: gpu) │
│ --ssh-config-file {None}|STR|PATHLIKE │
│ For connecting to nodes. By default, ``"~/.ssh/config"`` or │
│ ``"/etc/ssh/ssh_config"``. (default: None) │
│ --backend {None,nccl,gloo,mpi,ucc} │
│ `Backend │
│ <https://pytorch.org/docs/stable/distributed.html#torch.distributed.B… │
│ for worker process group. By default, NCCL (GPU backend). │
│ Use GLOO for CPU backend. ``None`` for no process group. │
│ (default: nccl) │
│ --timeout INT │
│ Worker process group timeout (seconds). (default: 600) │
│ --copy-env-vars [STR [STR ...]] │
│ Environment variables to copy from the launcher process to workers. │
│ Supports Unix pattern matching syntax. (default: PATH LD_LIBRARY │
│ LIBRARY_PATH 'PYTHON*' 'CUDA*' 'TORCH*' 'PYTORCH*' 'NCCL*') │
│ --extra-env-vars {None}|{[STR STR [STR STR ...]]} │
│ Additional environment variables to load onto workers. (default: None) │
│ --env-file {None}|STR|PATHLIKE │
│ Path to a ``.env`` file, containing environment variables to load onto │
│ workers. (default: None) │
╰────────────────────────────────────────────────────────────────────────────╯
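
For example, a hypothetical invocation of such a script (name assumed to be train.py) could use the literal choices shown above, such as inferring hostnames from SLURM and loading worker environment variables from a .env file:

python train.py --hostnames slurm --workers-per-host gpu --env-file .env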