# Advanced Usage

## Multiple functions in one script

We can also launch multiple functions in one script (e.g. train on many GPUs, then test on one GPU):

```python
import torchrunx as trx

# `train` and `test` are user-defined functions (not shown here).

# Train on two nodes with 8 workers each; keep the value returned by rank 0.
trained_model = trx.launch(
    func=train,
    hostnames=["node1", "node2"],
    workers_per_host=8
).rank(0)

# Evaluate the trained model with a single local worker.
accuracy = trx.launch(
    func=test,
    func_args=(trained_model,),
    hostnames=["localhost"],
    workers_per_host=1
).rank(0)

print(f'Accuracy: {accuracy}')
```

{mod}`torchrunx.launch` is self-cleaning: all processes are terminated (and the used memory is completely released) before the subsequent invocation.

## CLI integration

We can use {mod}`torchrunx.Launcher` to populate arguments from the CLI (e.g. with [tyro](https://brentyi.github.io/tyro/)):

```python
import torchrunx as trx
import tyro

def distributed_function():
    pass

if __name__ == "__main__":
    launcher = tyro.cli(trx.Launcher)
    launcher.run(distributed_function)
```

`python ... --help` then results in:

```bash
╭─ options ─────────────────────────────────────────────╮
│ -h, --help            show this help message and exit │
│ --hostnames {[STR [STR ...]]}|{auto,slurm}            │
│     (default: auto)                                   │
│ --workers-per-host INT|{[INT [INT ...]]}|{auto,slurm} │
│     (default: auto)                                   │
│ --ssh-config-file {None}|STR|PATH                     │
│     (default: None)                                   │
│ --backend {None,nccl,gloo,mpi,ucc,auto}               │
│     (default: auto)                                   │
│ --timeout INT         (default: 600)                  │
│ --default-env-vars [STR [STR ...]]                    │
│     (default: PATH LD_LIBRARY ...)                    │
│ --extra-env-vars [STR [STR ...]]                      │
│     (default: )                                       │
│ --env-file {None}|STR|PATH                            │
│     (default: None)                                   │
╰───────────────────────────────────────────────────────╯
```

## SLURM integration

By default, the `hostnames` and `workers_per_host` arguments are populated from the current SLURM allocation. If no allocation is detected, we assume 1 machine (localhost) with N workers (the number of GPUs or CPUs). A `RuntimeError` is raised if `hostnames="slurm"` or `workers_per_host="slurm"` but no allocation is detected.

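
The resolution rules for `hostnames` described above can be sketched in plain Python. This is only an illustration of the documented behavior, not torchrunx's implementation: `resolve_hostnames` and its `slurm_hosts` parameter are hypothetical names, and how an allocation is actually detected is deliberately left out.

```python
from __future__ import annotations


def resolve_hostnames(
    hostnames: list[str] | str,
    slurm_hosts: list[str] | None,
) -> list[str]:
    """Mimic the documented rules for the `hostnames` argument.

    `slurm_hosts` stands in for the detected SLURM allocation
    (None when no allocation is detected). Hypothetical helper.
    """
    if hostnames == "slurm":
        # Explicitly requesting SLURM without an allocation is an error.
        if slurm_hosts is None:
            raise RuntimeError("hostnames='slurm' but no SLURM allocation was detected")
        return list(slurm_hosts)
    if hostnames == "auto":
        # Prefer the allocation; otherwise fall back to one local machine.
        return list(slurm_hosts) if slurm_hosts is not None else ["localhost"]
    if isinstance(hostnames, str):
        raise ValueError(f"unsupported hostnames value: {hostnames!r}")
    # An explicit list of hostnames is used unchanged.
    return list(hostnames)
```

With this sketch, `resolve_hostnames("auto", None)` falls back to `["localhost"]`, while `resolve_hostnames("slurm", None)` raises a `RuntimeError`. The same pattern would apply to `workers_per_host`.
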
## Propagating exceptions

Exceptions that are raised in workers will be raised by the launcher process.

A {mod}`torchrunx.AgentFailedError` or {mod}`torchrunx.WorkerFailedError` will be raised if any agent or worker dies unexpectedly (e.g. if sent a signal from the OS, due to segmentation faults or OOM).

## Environment variables

Environment variables in the launcher process that match the `default_env_vars` argument are automatically copied to agents and workers. We set useful defaults for Python and PyTorch. Environment variables are pattern-matched with this list using `fnmatch`, so a pattern such as `CUDA*` would match `CUDA_VISIBLE_DEVICES`.

`default_env_vars` can be overridden if desired. This list can be augmented using `extra_env_vars`. Additional environment variables (and more custom bash logic) can be included via the `env_file` argument. Our agents `source` this file.

We also set the following environment variables in each worker: `LOCAL_RANK`, `RANK`, `LOCAL_WORLD_SIZE`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT`.

## Custom logging

We forward all logs (i.e. from {mod}`logging` and {mod}`sys.stdout`/{mod}`sys.stderr`) from workers and agents to the launcher. By default, the logs from the first agent and its first worker are printed to the launcher's `stdout` stream. Logs from all agents and workers are written to files in `$TORCHRUNX_LOG_DIR` (default: `./torchrunx_logs`) and are named by timestamp, hostname, and local_rank.

{mod}`logging.Handler` objects can be provided via the `handler_factory` argument for further customization (mapping specific agents/workers to custom output streams). You must pass a function that returns a list of {mod}`logging.Handler`s to `handler_factory`.

We provide some utilities to help:

```{eval-rst}
.. autofunction:: torchrunx.file_handler
```

```{eval-rst}
.. autofunction:: torchrunx.stream_handler
```

```{eval-rst}
.. autofunction:: torchrunx.add_filter_to_handler
```
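
As a rough sketch of the `handler_factory` contract described above (a function that returns a list of `logging.Handler` objects), the example below builds two handlers using only the standard library. It assumes the factory is called with no arguments, which the text above does not confirm. The function name `build_handlers` and the file name `launcher_side.log` are hypothetical.

```python
import logging
import sys


def build_handlers() -> list[logging.Handler]:
    """Return the handlers that should receive forwarded log records."""
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")

    # Show only warnings and above on the launcher's stderr.
    stream_handler = logging.StreamHandler(sys.stderr)
    stream_handler.setLevel(logging.WARNING)
    stream_handler.setFormatter(formatter)

    # Keep everything, including DEBUG records, in a file.
    # delay=True postpones opening the file until the first record is emitted.
    file_handler = logging.FileHandler("launcher_side.log", delay=True)
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(formatter)

    return [stream_handler, file_handler]
```

A function like this could then be supplied as the `handler_factory` argument. This sketch does not restrict handlers to particular agents or workers; that kind of mapping is what the utilities listed above are for.
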