Testing a GPU cluster's physical state
The tests below will help you check that InfiniBand connections are established between GPUs in a GPU cluster.
The tests only work for VMs that are in the same network, created from the dedicated Marketplace product, and added to a GPU cluster according to the guide.
Testing the port state
- In the VM's shell, run the ibstatus command, which displays operational information about InfiniBand network devices:

  ibstatus

  Result:

  Infiniband device 'mlx5_0' port 1 status:
          default gid:     fe80:0000:0000:0000:0200:0030:c123:012a
          base lid:        0xb8
          sm lid:          0x1
          state:           4: ACTIVE
          phys state:      5: LinkUp
          rate:            400 Gb/sec (4X NDR)
          link_layer:      InfiniBand

  Infiniband device 'mlx5_0' port 2 status:
          default gid:     fe80:0000:0000:0000:0200:0030:d600:123f
          base lid:        0xb6
          sm lid:          0x1
          state:           4: ACTIVE
          phys state:      5: LinkUp
          rate:            400 Gb/sec (4X NDR)
          link_layer:      InfiniBand
  ...
- For each port in the result, check the physical state (phys state): it should be LinkUp. A script for checking all ports at once is sketched below.
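If the VM has many InfiniBand ports, scanning the ibstatus output by hand is error-prone. The following is a minimal bash sketch, not part of any product tooling, that flags any port whose physical state is not LinkUp; it assumes only that ibstatus is available, as in the step above.

  #!/usr/bin/env bash
  # Minimal sketch: report InfiniBand ports whose physical state is not LinkUp.
  # Relies only on the "phys state" lines of the ibstatus output shown above.
  if ibstatus | grep "phys state" | grep -qv "LinkUp"; then
      echo "WARNING: at least one port is not in the LinkUp state:"
      ibstatus | grep "phys state"
      exit 1
  fi
  echo "All InfiniBand ports report LinkUp."

Run it in the VM's shell; a non-zero exit code means at least one port needs attention.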
Testing network performance
You can also emulate network activity by sending data from GPUs on one VM to GPUs on another:
- Install the perftest package on each of the test VMs:

  sudo apt install perftest
- On the first VM, run:

  ib_send_bw --report_gbits

  This starts the server side of the test, which waits for a connection from the second VM.
- Copy the first VM's internal IP address.
- On the second VM, run:

  ib_send_bw <first_VM_IP_address> --report_gbits
In the output of the commands, you should see non-zero values for the bytes sent, the average bandwidth, and the average message rate. The peak bandwidth might not reach the theoretical maximum of 400 Gb/sec. To repeat the test for every InfiniBand device on the VM, see the sketch after the example below.
Example:
+--------------------------------------------------------------------------------+
| #bytes #iterations #BW peak[Gb/sec] #BW average[Gb/sec] #MsgRate[Mpps] |
+--------------------------------------------------------------------------------+
| 65536 1000 360.39 359.91 0.686466 |
+--------------------------------------------------------------------------------+
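If each VM has several InfiniBand devices, you can repeat the test for each of them by passing the device name to ib_send_bw with the -d option. Below is a minimal bash sketch for the second (client) VM; it assumes the ibstat utility is available to list device names and that a matching server command, ib_send_bw -d <device> --report_gbits, is started on the first VM before each iteration. The <first_VM_IP_address> placeholder is the internal IP address you copied earlier.

  #!/usr/bin/env bash
  # Minimal sketch: run the bandwidth test against every local InfiniBand device.
  # For each device, start "ib_send_bw -d <device> --report_gbits" on the first VM first.
  SERVER_IP="<first_VM_IP_address>"   # substitute the first VM's internal IP address
  for dev in $(ibstat -l); do         # ibstat -l lists the local InfiniBand device names
      echo "=== Testing device ${dev} ==="
      ib_send_bw -d "${dev}" "${SERVER_IP}" --report_gbits
  done

For each device, check the printed table the same way as in the example above: non-zero byte counts and an average bandwidth close to the link rate.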