How It Works

If you want to, for example, train your model on several machines with N GPUs each, you should run your training function in N parallel processes on each machine. During training, these processes run the same training code (i.e. your function) and communicate with each other (e.g. to synchronize gradients) using a distributed process group.
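For illustration, such a training function might look like the sketch below. The model, data, and hyperparameters are placeholders; for GPU training you would also move the model and batches to the GPU assigned to each worker.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train() -> torch.nn.Module:
    rank = dist.get_rank()                       # this worker's index in the process group
    model = torch.nn.Linear(10, 1)               # placeholder model (kept on CPU for simplicity)
    ddp_model = DDP(model)                       # wraps the model so gradients are synchronized
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for _ in range(10):                          # placeholder training loop
        inputs, targets = torch.randn(8, 10), torch.randn(8, 1)
        loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()                          # gradients are all-reduced across all workers here
        optimizer.step()

    if rank == 0:
        print("finished training")
    return model
```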

Your script can call our library (via torchrunx.launch) and specify a function to distribute. The main process running your script is henceforth known as the launcher process.
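In code, that might look roughly like the sketch below. The exact signature of torchrunx.launch (in particular, argument names such as `hostnames` and `workers_per_host`) is an assumption here; see the API reference for the precise interface.

```python
import torchrunx

def distributed_training() -> str:
    # Runs once per worker, on every machine.
    return "done"

if __name__ == "__main__":
    # The process running this script is the launcher process.
    results = torchrunx.launch(          # assumed call pattern; check the API reference
        distributed_training,
        hostnames=["node1", "node2"],    # assumed parameter name: machines to SSH into
        workers_per_host=8,              # assumed parameter name: N workers (GPUs) per machine
    )
```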

Our launcher process spawns an agent process (via SSH) on each machine. Each agent then spawns N processes (known as workers) on its machine. All workers form a process group (with the backend specified via torchrunx.launch) and run your function in parallel.
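Because the process group is already formed by the time your function runs, standard torch.distributed calls work directly inside it. A minimal sketch:

```python
import torch
import torch.distributed as dist

def report_ranks() -> int:
    rank = dist.get_rank()              # global rank across all machines
    world_size = dist.get_world_size()  # total number of workers

    # A simple collective: sum every worker's rank across the group.
    # (Assumes a backend that supports CPU tensors, e.g. GLOO; with NCCL,
    # move the tensor to this worker's GPU first.)
    total = torch.tensor([rank])
    dist.all_reduce(total)              # default reduce op is SUM
    print(f"worker {rank}/{world_size}: sum of ranks = {total.item()}")
    return rank
```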

Agent–Worker Communication. Our agents poll their workers every second and time out if a worker is unresponsive for more than 5 seconds. Upon polling, an agent receives None (if its workers are still running) or a RunProcsResult, indicating that the workers have either completed (providing the object returned by, or the exception raised from, your function) or failed (e.g. due to a segmentation fault or an OS signal).
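The sketch below (not torchrunx's actual source) illustrates this polling pattern, assuming the workers were started with torch.distributed.elastic.multiprocessing, whose PContext.wait returns None or a RunProcsResult:

```python
from typing import Optional
from torch.distributed.elastic.multiprocessing.api import PContext, RunProcsResult

def poll_workers(ctx: PContext, poll_interval: float = 1.0) -> RunProcsResult:
    """Block until the workers complete or fail, checking roughly once per second."""
    while True:
        result: Optional[RunProcsResult] = ctx.wait(timeout=poll_interval)
        if result is not None:
            # result.return_values maps local rank -> the object returned by your function;
            # result.failures maps local rank -> ProcessFailure (segfault, OS signal, exception)
            return result
        # result is None: workers are still running, so poll again
```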

Launcher–Agent Communication. The launcher and agents form a distributed group (with the CPU-based GLOO backend) for our library's internal communication. Our agents synchronize their own “statuses” with each other and the launcher. An agent's status includes whether it is running, has failed, or has completed, as well as any results of your function. If the launcher or any agent fails to synchronize, all of them raise a torchrunx.AgentFailedError and terminate. If any worker fails or raises an exception, the launcher raises a torchrunx.WorkerFailedError (or that exception) and terminates along with all the agents. If all agents succeed, the launcher returns the objects returned by each worker.
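This status synchronization can be pictured as an all-gather of picklable status objects over that GLOO group. The AgentStatus class and its fields below are illustrative assumptions, not torchrunx's actual types:

```python
from dataclasses import dataclass, field
from typing import Any, Optional

import torch.distributed as dist

@dataclass
class AgentStatus:                     # hypothetical status object
    state: str                         # "running", "failed", or "completed"
    return_values: dict[int, Any] = field(default_factory=dict)
    exception: Optional[BaseException] = None

def synchronize_statuses(my_status: AgentStatus, group: dist.ProcessGroup) -> list[Optional[AgentStatus]]:
    """Gather every participant's status (e.g. launcher at rank 0, agents at ranks 1..N)."""
    statuses: list[Optional[AgentStatus]] = [None] * dist.get_world_size(group=group)
    dist.all_gather_object(statuses, my_status, group=group)  # GLOO supports CPU-only object collectives
    return statuses
```

If any gathered status reports a failure, the launcher can then raise the corresponding error (or re-raise the worker's exception), matching the behavior described above.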