CloudLab3@Wisconsin follows the design strategy of CloudLab1/2 and makes incremental infrastructure upgrades. It provides much higher inter-host networking bandwidth (100GbE and above), adds more than 20 commodity GPU servers, and brings in disaggregated storage. Jeremy Sarauer (from DoIT) is the lead system architect.
1. Hardware Specification
The cluster consists of the following compute/storage nodes and switches.
- 24x Normal GPU Nodes: Each node is a 2U Dell R7525 box, enclosing two AMD EPYC 7302 processors (3.0GHz), 256GB DDR4 memory, a Broadcom 25GbE dual-port 57414 NIC, a Mellanox 100GbE dual-port CX6 NIC, a 1.6TB NVMe SSD, two 480GB SATA SSDs, and an NVIDIA A30 GPU. Ten of these nodes are also equipped with an NVIDIA/Mellanox BlueField-2 100GbE dual-port SmartNIC.
- 4x Dense GPU Nodes: Each node is a 2U Dell XE8545 box, enclosing two AMD EPYC 7413 processors (2.64GHz), 512GB DDR4 memory, a Broadcom 25GbE dual-port 57414 NIC, a Mellanox 100GbE dual-port CX6 NIC, a 1.6TB NVMe SSD, two 480GB SATA SSDs, and four NVIDIA A100 GPUs.
- 20x Thin Server JBOFs: Each node is a 1U Supermicro server (110P-WTR), consisting of two Intel Xeon Silver 4310 processors (2.1GHz), 128GB DDR4 memory, an Intel 10GbE dual-port X550 NIC, an Intel 960GB D3-S4610 SATA SSD, an NVIDIA/Mellanox 100GbE dual-port CX6 NIC, and four Samsung PM9A3 960GB NVMe SSDs.
- 10x Thick Server JBOFs: Each node is a 2U Supermicro server (220U-TNR), consisting of two Intel Xeon Silver 4310 processors (2.1GHz), 256GB DDR4 memory, an Intel 10GbE dual-port X550 NIC, an Intel 960GB D3-S4610 SATA SSD, an NVIDIA/Mellanox 100GbE dual-port CX6 NIC, and eight Samsung PM9A3 960GB NVMe SSDs.
- 1x Fungible FS1600 EBOF: This 2U storage array is based on the Fungible DPU. It has 12x 100GbE ports and 24 Samsung NVMe SSDs.
- 1x Dell PowerSwitch Z9432F-ON: It has 32x 400GbE ports.
- 1x Dell PowerSwitch Z9264F-ON: It has 64x 100GbE ports.
- 1x Netberg Aurora 710: It is a programmable switch based on the Intel Tofino 1 ASIC, with 32x 100GbE ports.
- Cables: The cluster uses SFP28, QSFP28, and QSFP56 DAC and AOC cables.
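As a quick sanity check when an experiment is allocated on these nodes, the minimal Python sketch below prints a node's GPUs and disks so they can be compared against the spec above. It is illustrative only and not part of the CloudLab tooling; it assumes a Linux node image with `nvidia-smi` and `lsblk` available.

```python
#!/usr/bin/env python3
"""Hypothetical per-node inventory check (not part of the CloudLab3 tooling)."""
import json
import subprocess

def list_gpus():
    # nvidia-smi prints one GPU name per line with this query.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return [line.strip() for line in out.stdout.splitlines() if line.strip()]

def list_disks():
    # lsblk -d lists whole disks only; TRAN distinguishes nvme vs. sata.
    out = subprocess.run(
        ["lsblk", "-d", "-J", "-o", "NAME,TRAN,SIZE"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)["blockdevices"]

if __name__ == "__main__":
    print("GPUs:", list_gpus())  # e.g. one A30 on a normal GPU node, four A100s on a dense node
    for disk in list_disks():
        print(f"{disk['name']}: {disk['tran']} {disk['size']}")
```

On a normal GPU node, for example, this should report a single NVIDIA A30 along with one NVMe and two SATA drives.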
2. System Architecture
The above figure (from Jeremy) depicts the topology of CloudLab3 and how it connects to CloudLab1, CloudLab2, the UW-Madison research backbone network, and FABRIC. The cluster is co-located with CloudLab1 and CloudLab2, spanning three racks. We enable LACP (link aggregation) for the different traffic groups.
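As an illustration of the link-aggregation setup, the short Python sketch below reads the Linux bonding driver's status file to confirm that a bond is in 802.3ad (LACP) mode and that its member links are up. The bond name `bond0` is a placeholder; actual bond and interface names depend on the node image and experiment profile.

```python
#!/usr/bin/env python3
"""Sketch: check that a Linux bond runs LACP (802.3ad) and its links are up."""
from pathlib import Path

def bond_status(bond: str = "bond0") -> dict:
    # The bonding driver exposes a human-readable status file per bond.
    text = Path(f"/proc/net/bonding/{bond}").read_text()
    mode = next(line.split(":", 1)[1].strip()
                for line in text.splitlines() if line.startswith("Bonding Mode"))
    slaves = [line.split(":", 1)[1].strip()
              for line in text.splitlines() if line.startswith("Slave Interface")]
    # The first "MII Status" line is the bond itself; the rest are per-slave.
    links = [line.split(":", 1)[1].strip()
             for line in text.splitlines() if line.startswith("MII Status")]
    return {"mode": mode, "slaves": slaves, "link_status": links}

if __name__ == "__main__":
    # Expect mode "IEEE 802.3ad Dynamic link aggregation" and all links "up".
    print(bond_status())
```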
3. Example Use Cases
We believe the CloudLab3 extension can enable research on distributed GPU computing, storage disaggregation, and high-throughput system design at a small-to-modest scale.
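As a concrete example of the first use case, the sketch below shows a minimal multi-node data-parallel training loop with PyTorch DDP over NCCL, which would exercise the A30/A100 GPUs and the 100GbE fabric. It assumes PyTorch is installed on the allocated nodes and the job is launched with `torchrun`; the model and training loop are placeholders, not a recommended workload.

```python
#!/usr/bin/env python3
"""Minimal multi-node data-parallel sketch (PyTorch DDP over NCCL)."""
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, and the rendezvous endpoint.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)   # placeholder model
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):                                     # placeholder training loop
        x = torch.randn(64, 1024, device=local_rank)
        loss = model(x).square().mean()
        opt.zero_grad()
        loss.backward()   # gradients are all-reduced across nodes over the 100GbE fabric
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Such a script could be launched with something like `torchrun --nnodes=4 --nproc_per_node=1 --rdzv_endpoint=<head-node>:29500 train.py` across four normal GPU nodes (one A30 each), or with `--nproc_per_node=4` on a dense node with four A100s.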