GPU Clustering | Multi-Node, Multi-GPU
In the world of AI, models are getting heavier than ever, and the growth only accelerates as we move toward generative AI. Large LLMs no longer fit on a single GPU for inference: even a $25,000 H100 with 80 GB of VRAM is not enough, and Llama 3.1 405B in 4-bit precision still needs roughly three of them working together.
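A quick back-of-the-envelope check of that number, counting weights only (KV cache and activations add more on top):

```python
import math

params = 405e9          # Llama 3.1 405B parameter count
bytes_per_param = 0.5   # 4-bit precision = 0.5 bytes per parameter
weights_gb = params * bytes_per_param / 1e9   # ~202.5 GB just for weights
h100_vram_gb = 80

print(math.ceil(weights_gb / h100_vram_gb))   # -> 3 H100s, weights alone
```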
Here comes the savior: GPU clustering. By clustering I mean using the combined compute of multiple GPUs that communicate with each other, either intra-node (within one machine) or inter-node (across machines).
Requirements:
Let’s break it down from scratch. When two GPUs work together, there must be some medium through which they share their progress with each other, right? That is our very first requirement: high-bandwidth communication between GPUs, both within a node and across nodes.
What about the software we write to run the cluster? We use NCCL (NVIDIA Collective Communications Library) as the communication backend.
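A minimal sketch of how this looks in code, assuming PyTorch as the framework; the script name and head-node address in the comment are placeholders:

```python
import os

import torch
import torch.distributed as dist

# Each process (one per GPU) joins a process group that uses NCCL for
# GPU-to-GPU communication. RANK, WORLD_SIZE, MASTER_ADDR, etc. are set
# by the launcher, e.g. something like (hypothetical addresses):
#   torchrun --nnodes=2 --nproc_per_node=1 --node_rank=<0|1> \
#            --rdzv_backend=c10d --rdzv_endpoint=<head-node-ip>:29500 script.py
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Sanity check: an all-reduce that touches every GPU in the cluster.
x = torch.ones(1, device="cuda")
dist.all_reduce(x)  # x now equals world_size on every rank
print(f"rank {dist.get_rank()} / {dist.get_world_size()}: {x.item()}")

dist.destroy_process_group()
```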
And hardware, obviously.
For our demo, we have two nodes, each with an NVIDIA GPU, plus a head node.
Cluster Implementation
There are various options available for cluster implementation, including Data Parallelism, Model Parallelism, Tensor Parallelism, DDP (Distributed Data Parallel), and FSDP (Fully Sharded Data Parallel). The state of the art here is FSDP, which shards parameters, gradients, and optimizer states across GPUs, so no single GPU has to hold the full model while every GPU stays busy throughout training.
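A minimal FSDP sketch under the same assumptions as above (PyTorch, the NCCL process group from earlier, and a toy nn.Sequential standing in for a real LLM; an actual model would usually also pass an auto-wrap policy per transformer block):

```python
import os

import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Toy model as a stand-in for a real LLM; FSDP shards its parameters,
# gradients, and optimizer states across every GPU in the cluster.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda()
model = FSDP(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 1024, device="cuda")
loss = model(x).pow(2).mean()  # dummy loss, just to drive one training step
loss.backward()
optimizer.step()

dist.destroy_process_group()
```

Each rank materializes only its shard of the weights and gathers the rest on the fly during forward and backward passes, which is exactly what lets a model larger than any single GPU's VRAM be trained or fine-tuned across the cluster.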