# General

## Multiple functions in one script

Consider multiple stages of training: pre-training, supervised fine-tuning, RLHF, and so on. Normally, this kind of work is split across multiple scripts. Why? Each stage is complicated (and prone to memory leaks), we don't want the stages to interfere with each other, and they may even require different degrees of parallelism. `torchrunx` solves these problems, even within a single script, by modularizing workloads into isolated, self-cleaning processes.

```python
# 2 nodes x 8 GPUs
train_launcher = torchrunx.Launcher(hostnames=["node1", "node2"], workers_per_host=8)
# 1 GPU
eval_launcher = torchrunx.Launcher(hostnames=["node1"], workers_per_host=1)

# Training & evaluation
pretrained_model = train_launcher.run(train).rank(0)
pretrained_acc = eval_launcher.run(evaluation, model=pretrained_model).rank(0)
print(f"Pre-trained model accuracy: {pretrained_acc}")

finetuned_model = train_launcher.run(finetuning, model=pretrained_model).rank(0)
finetuned_acc = eval_launcher.run(evaluation, model=finetuned_model).rank(0)
print(f"Fine-tuned model accuracy: {finetuned_acc}")
```

## Exceptions

Exceptions that are raised in workers will be re-raised by the launcher process.

A {mod}`torchrunx.AgentFailedError` or {mod}`torchrunx.WorkerFailedError` will be raised if any agent or worker dies unexpectedly (e.g. if sent a signal by the OS due to a segmentation fault or OOM).

You can catch these errors and handle them as you wish!

```python
for config in configs:  # e.g. a hyperparameter sweep
    try:
        torchrunx.Launcher().run(train, config)
    except torch.cuda.OutOfMemoryError:
        print(f"{config} results in OOM... continuing...")
```

If you are expecting intermittent failures, you can catch errors and retry:

```python
for retry in range(3):
    try:
        torchrunx.Launcher().run(train, resume_from_checkpoint=True)
    except torchrunx.WorkerFailedError as e:
        print(f"Error occurred: {e}")
        print(f"Retrying ({retry}) ...")
    else:  # run() succeeded
        break
```

## Environment variables

Environment variables in the launcher process that match the patterns in the [``copy_env_vars``](../api.md#torchrunx.Launcher.copy_env_vars) argument are automatically copied to agents and workers. We set useful defaults for Python and PyTorch. You can replace these defaults, or extend them like:

```python
torchrunx.Launcher(copy_env_vars=(
    torchrunx.DEFAULT_ENV_VARS_FOR_COPY
    + ("HF_HOME", "WANDB_*",)
))
```

You can also pass (1) specific environment variables and values via [``extra_env_vars``](../api.md#torchrunx.Launcher.extra_env_vars) or (2) a ``.env``-style file via [``env_file``](../api.md#torchrunx.Launcher.env_file). Our agents will `source {env_file}` (see the sketch at the end of this section).

Finally, we set the following environment variables in each worker: `LOCAL_RANK`, `RANK`, `LOCAL_WORLD_SIZE`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT`.
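For illustration, options (1) and (2) might be combined in a single launcher. This is a minimal sketch: the exact types accepted by ``extra_env_vars`` and ``env_file`` (a name-to-value mapping and a file path are assumed here), as well as the file name ``secrets.env``, are assumptions; check the API reference for details.

```python
import torchrunx

launcher = torchrunx.Launcher(
    # (1) specific variables and values (assumed here to be a name -> value mapping)
    extra_env_vars={"WANDB_PROJECT": "my-project", "TOKENIZERS_PARALLELISM": "false"},
    # (2) a .env-style file, sourced by each agent ("secrets.env" is a hypothetical path)
    env_file="secrets.env",
)
```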
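Inside a worker function, these variables can be read like any other environment variables, e.g. to pin each worker to its GPU. A minimal sketch (the `train` function and its body are purely illustrative):

```python
import os

import torch

def train():
    local_rank = int(os.environ["LOCAL_RANK"])   # this worker's index on its host
    rank = int(os.environ["RANK"])               # this worker's global index
    world_size = int(os.environ["WORLD_SIZE"])   # total number of workers

    torch.cuda.set_device(local_rank)  # pin this worker to its GPU
    print(f"Worker {rank}/{world_size} using GPU {local_rank}")
```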