This post focuses on one cluster with 4,088 H100 GPUs spread across 511 computers, eight GPUs to a computer. There were only 511 GPU hosts because some connections had to be reserved for the Unified Fabric Manager (UFM) nodes that managed the InfiniBand network. On each of those hosts, every GPU was directly connected to its own ConnectX-7 card, which could simultaneously transmit and receive at 400 Gbps to any other GPU's ConnectX-7 card on the InfiniBand network.
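To make those numbers concrete, here is a minimal back-of-the-envelope sketch (not from the original post) that tallies the GPU count and the per-host InfiniBand bandwidth implied by the figures above; the constant names are illustrative, and the per-host totals assume one full-duplex 400 Gbps ConnectX-7 card per GPU as described.

```python
# Back-of-the-envelope cluster figures, assuming 8 GPUs per host and
# one 400 Gbps full-duplex ConnectX-7 card per GPU (per the post).

HOSTS_WITH_GPUS = 511   # some fabric connections were reserved for UFM nodes
GPUS_PER_HOST = 8       # H100s per computer
LINK_GBPS = 400         # per-direction ConnectX-7 bandwidth per GPU

total_gpus = HOSTS_WITH_GPUS * GPUS_PER_HOST        # 4,088 GPUs
per_host_tx_gbps = GPUS_PER_HOST * LINK_GBPS        # one direction per host
per_host_bidir_gbps = 2 * per_host_tx_gbps          # cards are full duplex

print(f"Total GPUs:                 {total_gpus}")                # 4088
print(f"Per-host InfiniBand TX:     {per_host_tx_gbps} Gbps")     # 3200
print(f"Per-host bidirectional BW:  {per_host_bidir_gbps} Gbps")  # 6400
```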

Source: From bare metal to a 70B model: infrastructure set-up and scripts – imbue