GPU clusters
You can group multiple VMs with GPUs into a GPU cluster. This lets you accelerate high-performance computing (HPC) tasks that need more computing capacity than a single VM can provide, such as scientific simulations and deep neural network training and inference.
The GPU clusters are built on NVIDIA InfiniBand secure high-speed networking. Each GPU in a node is connected through a network interface card (NIC) that provides 400 Gbit/s. Since a compute node has 8 GPUs, the total bandwidth for a node is 3.2 Tbit/s.
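The per-node figure is the per-NIC rate multiplied by the number of GPUs in a node. A minimal sketch of that arithmetic follows; the function and constant names are illustrative, and the result is expressed in whatever unit the per-NIC rate is quoted in.

```python
GPUS_PER_NODE = 8   # from the text: a compute node has 8 GPUs
PER_NIC_RATE = 400  # per-GPU NIC rate, in the unit quoted in the text


def node_bandwidth(per_nic_rate: int = PER_NIC_RATE,
                   gpus: int = GPUS_PER_NODE) -> int:
    """Aggregate InfiniBand bandwidth of one node, in the per-NIC unit."""
    if per_nic_rate <= 0 or gpus <= 0:
        raise ValueError("rate and GPU count must be positive")
    return per_nic_rate * gpus


total = node_bandwidth()
print(total)         # 3200 (same unit as the per-NIC rate)
print(total / 1000)  # 3.2 (next unit prefix up)
```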
Nebius AI uses GPUDirect RDMA, an NVIDIA technology for remote direct memory access (RDMA) that lets data flow directly between each GPU and its NIC. Bypassing the CPU in this way speeds up data exchange.
Requirements
VMs in a GPU cluster must have 8 GPUs each and can use either of these platforms:
- NVIDIA® H100 NVLink with Intel Sapphire Rapids (Type A) (gpu-h100)
- NVIDIA® H100 NVLink with Intel Sapphire Rapids (Type B) (gpu-h100-b)
- NVIDIA® H100 NVLink with Intel Sapphire Rapids (Type C) (gpu-h100-c)
The NVIDIA H100 platforms provide identical performance and capabilities.
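The requirements above can be checked before a VM is added to a cluster. The sketch below is illustrative only: the VmSpec class and the cluster_eligibility_errors function are hypothetical helpers, not part of any Nebius AI SDK or CLI. Only the platform IDs and the 8-GPU rule come from this page.

```python
from dataclasses import dataclass

# Platform IDs listed in the requirements above.
CLUSTER_PLATFORMS = {"gpu-h100", "gpu-h100-b", "gpu-h100-c"}
REQUIRED_GPUS = 8  # every VM in a GPU cluster must have 8 GPUs


@dataclass(frozen=True)
class VmSpec:
    """Hypothetical description of a VM; not a real API object."""
    name: str
    platform_id: str
    gpus: int


def cluster_eligibility_errors(vm: VmSpec) -> list[str]:
    """Return the reasons a VM cannot join a GPU cluster (empty if none)."""
    errors = []
    if vm.platform_id not in CLUSTER_PLATFORMS:
        errors.append(f"{vm.name}: platform {vm.platform_id!r} is not supported")
    if vm.gpus != REQUIRED_GPUS:
        errors.append(f"{vm.name}: has {vm.gpus} GPUs, needs {REQUIRED_GPUS}")
    return errors


print(cluster_eligibility_errors(VmSpec("node-1", "gpu-h100", 8)))    # []
print(cluster_eligibility_errors(VmSpec("node-2", "gpu-h100-b", 4)))
```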
Recommendations
We recommend creating a cluster from VMs that are in the same network.
For the cluster VMs to interact properly, we recommend using a security group that allows unrestricted traffic within itself. For example, the default security group meets this requirement.
InfiniBand fabrics
Each GPU cluster is created in one of the physical InfiniBand fabrics. A fabric is where GPUs interconnected over InfiniBand are located. Each fabric has limited GPU capacity.
When creating a GPU cluster, select an InfiniBand fabric for it.
Currently, the following InfiniBand fabrics are available:
- fabric-1: Use for creating GPU clusters with NVIDIA® H100 NVLink with Intel Sapphire Rapids (Type A) and NVIDIA® H100 NVLink with Intel Sapphire Rapids (Type B) VMs. Both types can be mixed in the same cluster.
- fabric-4: Use for creating GPU clusters with NVIDIA® H100 NVLink with Intel Sapphire Rapids (Type C) VMs.
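The fabric-to-platform mapping above can be written as a small lookup that tells you which fabric fits a planned set of VMs. This is a sketch under assumptions: fabrics_for is a hypothetical helper, and the mapping reflects only the fabrics listed on this page at the time of writing.

```python
# Fabric name -> platform IDs it accepts (Type A, B -> fabric-1; Type C -> fabric-4).
FABRIC_PLATFORMS = {
    "fabric-1": {"gpu-h100", "gpu-h100-b"},
    "fabric-4": {"gpu-h100-c"},
}


def fabrics_for(platform_ids: list[str]) -> list[str]:
    """Return fabrics that accept every platform in the planned cluster."""
    wanted = set(platform_ids)
    if not wanted:
        raise ValueError("at least one platform ID is required")
    return sorted(
        fabric
        for fabric, supported in FABRIC_PLATFORMS.items()
        if wanted <= supported  # all requested platforms must be supported
    )


print(fabrics_for(["gpu-h100", "gpu-h100-b"]))  # ['fabric-1']
print(fabrics_for(["gpu-h100-c"]))              # ['fabric-4']
print(fabrics_for(["gpu-h100", "gpu-h100-c"]))  # []
```

An empty result means the planned VMs cannot share one cluster, because Type C VMs cannot be mixed with Type A or Type B VMs in a single fabric.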
How to use GPU clusters
- Request an increase of one of the “Number of NVIDIA® H100 NVLink (Type A, B or C) GPUs” quotas, depending on the platform you plan to use. You need at least 16 H100 GPUs for a cluster of two VMs (8 GPUs per VM).
- Create a GPU cluster.
- Create VMs using the dedicated Marketplace product.
- Add the created VMs to the GPU cluster.
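To size the quota request in the first step, multiply the planned number of VMs by 8 GPUs per VM. The sketch below shows that calculation; required_gpu_quota and quota_shortfall are hypothetical helper names, not part of any Nebius AI tooling.

```python
GPUS_PER_VM = 8  # every VM in a GPU cluster must have 8 GPUs


def required_gpu_quota(vm_count: int) -> int:
    """GPU quota needed for a cluster of vm_count VMs."""
    if vm_count < 1:
        raise ValueError("vm_count must be at least 1")
    return vm_count * GPUS_PER_VM


def quota_shortfall(current_quota: int, vm_count: int) -> int:
    """How many more GPUs to request; 0 if the quota already suffices."""
    return max(0, required_gpu_quota(vm_count) - current_quota)


print(required_gpu_quota(2))  # 16
print(quota_shortfall(8, 2))  # 8
```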