Ennan Zhai, an engineer and researcher at Alibaba Cloud, shared a research paper on GitHub revealing how the cloud provider designed its data centers for LLM training. Titled “Alibaba HPN: A Data Center Network for Large Language Model Training,” the PDF document outlines how Alibaba uses Ethernet to let 15,000 GPUs communicate with one another.
Typical cloud computing generates consistent but small data flows at speeds below 10 Gbps. LLM training, on the other hand, produces periodic bursts concentrated in a handful of huge “elephant” flows that can reach 400 Gbps each, leaving a hash-based load balancer very few flows to spread across links. According to the paper, “This characteristic of LLM training makes Equal-Cost Multi-Path (ECMP), a load balancing scheme commonly used in traditional data centers, prone to hash bias, resulting in uneven distribution of traffic and other issues.”
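To see why a handful of giant flows defeats hashing, here is a minimal Python sketch (my illustration, not code from the paper) that randomly assigns eight 400 Gbps flows to eight equal-cost paths, the way an ECMP hash effectively does; the path and flow counts are assumptions chosen to match the article's figures.

```python
import random
from collections import Counter

# Toy illustration of ECMP hash bias (not Alibaba's code). A few enormous
# "elephant" flows get hashed onto equal-cost paths; with so few flows, the
# hash often piles several of them onto the same link while others sit idle.
NUM_PATHS = 8        # equal-cost uplinks (hypothetical)
NUM_FLOWS = 8        # one ~400 Gbps burst per GPU rail, per the article
FLOW_GBPS = 400

random.seed(7)
link_load = Counter()
for flow in range(NUM_FLOWS):
    # ECMP hashes each flow's 5-tuple to pick a path; model that as a random pick.
    link_load[random.randrange(NUM_PATHS)] += FLOW_GBPS

for path in range(NUM_PATHS):
    print(f"path {path}: {link_load[path]:4d} Gbps")
# Typical output: some paths carry 800-1200 Gbps while others carry 0, even
# though an even split would put exactly 400 Gbps on every path.
```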
To get around this, Zhai and his team developed the High Performance Network (HPN), which uses a “two-tier, dual-plane architecture” that reduces the number of ECMP occurrences and lets the system “precisely select network paths that can preserve elephant flows.” HPN also pairs top-of-rack (ToR) switches so that each backs up the other: a ToR switch is the most common single point of failure in LLM training, and because every GPU must synchronize to complete each iteration, one failed switch can stall an entire job.
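As a contrast with the ECMP sketch above, the snippet below illustrates the general idea of deliberate path selection for elephant flows by always placing the next flow on the currently least-loaded uplink. This is a hypothetical sketch of the concept, not Alibaba's actual scheduling logic.

```python
# Hypothetical contrast to the ECMP sketch: place each elephant flow on the
# least-loaded path instead of hashing it. Illustrates "precisely selecting
# network paths", not Alibaba's real algorithm.
NUM_PATHS = 8
FLOW_GBPS = 400

link_load = [0] * NUM_PATHS
for flow in range(8):
    path = min(range(NUM_PATHS), key=lambda p: link_load[p])  # least-loaded uplink
    link_load[path] += FLOW_GBPS

print(link_load)  # [400, 400, 400, 400, 400, 400, 400, 400] -- evenly spread
```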
8 GPUs per host, 1,875 hosts per data center
Alibaba Cloud builds its data centers out of hosts containing eight GPUs apiece. Each GPU gets its own dual-port network interface card (NIC), and each GPU-NIC pair is called a “rail.” Each host also has a ninth NIC that connects to a separate front-end network. Every rail is connected to two different ToR switches, so the failure of one switch does not take down the entire host.
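As a rough mental model of that wiring, the sketch below builds one host's rails in Python; the Rail class, the ToR-A/ToR-B labels, and the build_host helper are invented for illustration and do not come from the paper.

```python
from dataclasses import dataclass

# Rough model of one host as described above (names are invented): eight GPU
# "rails", each with a dual-port NIC wired to two different ToR switches,
# plus one extra NIC for the separate front-end network.
@dataclass
class Rail:
    gpu_id: int
    nic_port_a: str   # uplink to the first ToR switch
    nic_port_b: str   # uplink to the second ToR switch

def build_host(host_id: int) -> dict:
    rails = [
        Rail(gpu_id=g,
             nic_port_a=f"host{host_id}-rail{g} -> ToR-A",
             nic_port_b=f"host{host_id}-rail{g} -> ToR-B")
        for g in range(8)
    ]
    return {"rails": rails, "frontend_nic": f"host{host_id} -> front-end network"}

host = build_host(0)
assert len(host["rails"]) == 8          # eight GPUs (rails) per host
print(host["rails"][0], host["frontend_nic"], sep="\n")
```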
Even though Alibaba Cloud has dropped NVLink for host-to-host communication, it still uses Nvidia's proprietary interconnect inside the host, where GPU-to-GPU communication demands far more bandwidth. Rail-to-rail (host-to-host) communication is much slower by comparison, so the “dedicated 400Gbps RDMA network throughput, for a total of 3.2Tbps of bandwidth” per host is enough to make full use of the PCIe Gen5 x16 connection of each graphics card.
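Those figures check out with some quick arithmetic; note that the PCIe Gen5 x16 numbers below are the generic spec (32 GT/s per lane, 16 lanes, 128b/130b encoding), not values quoted in the paper.

```python
# Back-of-the-envelope check of the bandwidth figures quoted above.
GPUS_PER_HOST = 8
RDMA_GBPS_PER_GPU = 400                       # dedicated RDMA throughput per GPU

host_total_gbps = GPUS_PER_HOST * RDMA_GBPS_PER_GPU
print(f"Per-host back-end bandwidth: {host_total_gbps / 1000} Tbps")   # 3.2 Tbps

# Generic PCIe Gen5 x16 spec: 32 GT/s per lane, 16 lanes, 128b/130b encoding.
pcie_gen5_x16_gbps = 32 * 16 * (128 / 130)
print(f"PCIe Gen5 x16 usable: ~{pcie_gen5_x16_gbps:.0f} Gbps")         # ~504 Gbps
print(f"400 Gbps uses ~{RDMA_GBPS_PER_GPU / pcie_gen5_x16_gbps:.0%} of the link")
```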
Alibaba Cloud also uses single-chip 51.2 Tb/s Ethernet ToR switches, because multi-chip designs are less stable and fail roughly four times as often as single-chip switches. However, these switches run so hot that off-the-shelf heat sinks cannot keep them from shutting down due to overheating, so the company devised a novel solution: a vapor chamber heat sink with an extra column in the middle to carry thermal energy away more efficiently.
Ennan Zhai and his team will present their findings at the SIGCOMM (Special Interest Group on Data Communication) conference in Sydney, Australia, this August. Many companies, including AMD, Intel, Google, and Microsoft, will likely take a close interest in the work, mainly because they are banding together to create Ultra Accelerator Link, an open interconnect standard intended to rival NVLink. Alibaba Cloud has been running HPN in production for more than eight months, so the technology is already tried and tested, which should make it all the more interesting to that group.
However, HPN still has some drawbacks, the biggest of which is its complex cabling. With nine NICs on each host and each NIC wired to two different ToR switches, it is easy to connect a port to the wrong switch. That said, the technology is likely cheaper than NVLink, so any institution setting up a data center can save a lot on setup costs (and avoid Nvidia technology altogether, which matters especially for companies sanctioned by the US in the ongoing chip war with China).
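For a sense of the cabling scale, a quick count using only the figures in this article (nine NICs per host, two uplinks per NIC, 1,875 hosts) lands in the tens of thousands of host-side connections per data center; this is my arithmetic, not a number from the paper.

```python
# Rough count of host-side connections, using only figures from the article
# (nine NICs per host, two uplinks each, 1,875 hosts). My arithmetic, not a
# figure quoted in the paper.
HOSTS = 1875
NICS_PER_HOST = 9
PORTS_PER_NIC = 2

connections = HOSTS * NICS_PER_HOST * PORTS_PER_NIC
print(f"{connections:,} host-side cable connections to get right")  # 33,750
```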