Why Most HPC Systems Use InfiniBand Interconnection
Ethernet is the best-known network architecture, but it is far from the only one. For server-side interconnects, InfiniBand (IB) is valued for its inherent characteristics, and it holds a nearly dominant position in High-Performance Computing (HPC), large data center storage and similar scenarios. So what is the difference between IB and Ethernet, and why do most HPC systems use IB interconnection?
What InfiniBand Is and Where It Is Used
IB is a “cable switching” technology that supports multiple concurrent connections, and it is the I/O standard for the new generation of server hardware platforms. As CPU performance improved rapidly, the I/O system became the bottleneck limiting server performance, and the PCI bus architecture used in the past could no longer keep up with application demands. To overcome the inherent drawbacks of PCI, Intel, Cisco, Compaq, EMC, Fujitsu and other companies jointly launched the IB architecture, the core idea of which is to separate the I/O system from the server host. At present, only a few companies, such as Mellanox, Intel and QLogic, provide IB products, with Mellanox in the leading position. Mellanox recently deployed the first HDR 200G InfiniBand supercomputer at the University of Michigan.
The figure above shows IB’s basic protocols. The IB protocols adopt a layered structure consisting of the upper-layer protocols, transport layer, network layer, link layer and physical layer. The layers are independent of one another, and each lower layer provides services to the layer above it, much as in the TCP/IP protocol stack. Whereas Ethernet is used for high-level network communication, InfiniBand is mainly used for low-level input/output communication. As mentioned at the beginning of this article, the IB architecture was designed to improve server-side input/output performance, so even if Ethernet matches or exceeds the speed of an IB network, IB remains difficult to replace for low-level communication. In addition, IB’s transmission modes and media are quite flexible. Inside a piece of equipment, signals can be carried over copper traces on a printed circuit board; between pieces of equipment, they can be carried over DAC or AOC cables.
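To make the layering concrete, here is a minimal sketch of how each layer might wrap the payload handed down from the layer above. The header names and byte sizes (LRH, GRH, BTH and the two CRC trailers) are assumptions based on commonly published descriptions of the InfiniBand packet format, not something stated in this article, and the function names are hypothetical.

```python
# Assumed per-packet overhead in bytes (illustrative, not taken from this article).
LRH_BYTES = 8    # Local Route Header, added at the link layer
GRH_BYTES = 40   # Global Route Header, added at the network layer (only across subnets)
BTH_BYTES = 12   # Base Transport Header, added at the transport layer
ICRC_BYTES = 4   # invariant CRC trailer
VCRC_BYTES = 2   # variant CRC trailer


def packet_size(payload_bytes, crosses_subnets=False):
    """Return the on-wire size of one packet carrying payload_bytes."""
    # Transport layer wraps the upper-layer payload.
    transport = BTH_BYTES + payload_bytes
    # Network layer adds a global header only when the packet leaves its subnet.
    network = (GRH_BYTES if crosses_subnets else 0) + transport
    # Link layer adds the local route header and the CRC trailers.
    return LRH_BYTES + network + ICRC_BYTES + VCRC_BYTES


def efficiency(payload_bytes, crosses_subnets=False):
    """Fraction of the packet that is useful payload."""
    return payload_bytes / packet_size(payload_bytes, crosses_subnets)


if __name__ == "__main__":
    for payload in (256, 4096):
        print(payload, packet_size(payload), round(efficiency(payload), 4))
```

Under these assumed sizes, a 4096-byte payload that stays within one subnet occupies 4122 bytes on the wire, so the protocol overhead is well under one percent.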
As Bill Lee, co-chair of the InfiniBand Trade Association (IBTA) Marketing Working Group, said, “The goal of InfiniBand is to improve communication between applications.” IB technology includes not only chips and hardware but also software. To play its intended role, the hardware and software must be fully integrated at the operating system, management and application layers.
Why HPC Data Centers Choose InfiniBand
Addison Snell, CEO of Intersect360 Research, pointed out that “InfiniBand has grown and is now the preferred solution for high performance storage interconnection in HPC systems. At present, the applications of high data throughput such as data analysis and machine learning are expanding rapidly, and the demand for high bandwidth and low delay interconnection is also expanding to a broader market.”
HPC workloads, now and for the foreseeable future, center on scientific computing and data analysis. These workloads require very high bandwidth between the compute nodes, storage and analysis systems in a data center so that together they behave as a single system environment. Latency (both memory and disk access latency) is the other key performance measure in HPC. HPC data centers therefore choose IB networks because they meet both requirements: high bandwidth and low latency.
IB is currently the preferred interconnect for HPC and AI infrastructures, and its speed keeps increasing, from SDR, DDR and QDR through FDR and EDR to HDR. Mellanox InfiniBand solutions connect most of the Top 500 supercomputers, and Mellanox is also planning NDR 400G InfiniBand technology to support future exascale supercomputing and machine learning platforms. In terms of latency, RDMA (Remote Direct Memory Access) allows data to be accessed directly and remotely across the network, which addresses the delay that server-side data processing adds to network transmission. RDMA transfers data over the network directly into the memory of the remote system, without intermediate copies (Zero Copy). This frees the host CPU from handling the transfer and reduces the data-processing delay in the host from hundreds of microseconds to the sub-microsecond range.
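As a rough illustration of how bandwidth and latency combine, the sketch below estimates the time to move a block of data as a fixed latency plus size divided by link rate. The nominal 4x link rates in the table are commonly quoted figures and are an assumption here (the article itself only names HDR 200G and NDR 400G); they ignore encoding and protocol overhead. The function name and the default 1 microsecond latency are hypothetical.

```python
# Commonly quoted nominal 4x link rates in Gb/s (assumed; overheads ignored).
NOMINAL_4X_GBPS = {
    "SDR": 10,
    "DDR": 20,
    "QDR": 40,
    "FDR": 56,
    "EDR": 100,
    "HDR": 200,
    "NDR": 400,
}


def transfer_time_seconds(size_bytes, link_gbps, latency_us=1.0):
    """Idealized transfer time: one-way latency plus serialization time."""
    serialization = (size_bytes * 8) / (link_gbps * 1e9)
    return latency_us * 1e-6 + serialization


if __name__ == "__main__":
    size = 25e9  # 25 GB of data, an arbitrary example size
    for generation, gbps in NOMINAL_4X_GBPS.items():
        seconds = transfer_time_seconds(size, gbps)
        print(f"{generation}: {seconds:.3f} s")
```

With these assumptions, 25 GB takes about one second on an HDR link and about half that on NDR. For small messages the fixed latency term dominates instead, which is why low latency matters as much as raw bandwidth.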
In addition, IB has a simple protocol stack, high processing efficiency and simple management. Unlike the hierarchical topology typical of Ethernet networks, InfiniBand is a flat switched fabric: within a subnet, every node can reach every other node through the fabric's switches without passing through multiple routing tiers. Whereas TCP/IP recovers from packet loss by retransmitting, IB uses a credit-based flow-control mechanism to preserve the integrity of the connection, so data packets are rarely lost. The receiver signals how much buffer space it has available, and the sender transmits only when there is room for the data. This largely eliminates the delay of retransmitting lost packets and improves efficiency and overall performance. Finally, to counter signal distortion in ultra-high-speed transmission, IB transmits data as differential signals and adds a filter at the receiving end to remove signal noise, which helps guarantee signal integrity across the fabric.
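The flow-control idea described above can be sketched as a toy simulation. This is a simplified model under stated assumptions, not the actual InfiniBand link protocol: the class names, the round-based loop and the buffer sizes are all hypothetical. The point it shows is that a sender that never exceeds the advertised credits cannot overflow the receiver, so nothing is dropped and nothing needs retransmitting.

```python
from collections import deque


class Receiver:
    """Holds a fixed number of buffer slots and advertises the free ones."""

    def __init__(self, buffer_slots):
        self.free_slots = buffer_slots
        self.queue = deque()

    def advertise_credits(self):
        return self.free_slots

    def accept(self, packet):
        if self.free_slots == 0:
            raise RuntimeError("buffer overflow: sender ignored its credits")
        self.free_slots -= 1
        self.queue.append(packet)

    def drain(self, max_packets):
        """Consume up to max_packets, freeing one slot per packet."""
        drained = 0
        while self.queue and drained < max_packets:
            self.queue.popleft()
            self.free_slots += 1
            drained += 1
        return drained


class Sender:
    """Sends only while it holds credits granted by the receiver."""

    def __init__(self, packets):
        self.pending = deque(packets)
        self.credits = 0

    def send(self, receiver):
        while self.pending and self.credits > 0:
            receiver.accept(self.pending.popleft())
            self.credits -= 1


def simulate(num_packets=10, buffer_slots=4, drain_per_round=2):
    """Return (packets delivered, rounds taken) for one toy transfer."""
    receiver = Receiver(buffer_slots)
    sender = Sender(range(num_packets))
    delivered = rounds = 0
    while sender.pending or receiver.queue:
        sender.credits = receiver.advertise_credits()
        sender.send(receiver)
        delivered += receiver.drain(drain_per_round)
        rounds += 1
    return delivered, rounds


if __name__ == "__main__":
    print(simulate())
```

In this model the receiver's slow drain rate throttles the sender instead of causing loss: all ten packets arrive, and the transfer simply takes more rounds.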
InfiniBand is a unified interconnect that can handle storage I/O, network I/O and interprocess communication (IPC). It can interconnect disk arrays, SANs, LANs, servers and server clusters, provide high-bandwidth, low-latency transmission over relatively short distances, and support redundant I/O channels across one or more interconnected networks, so that data centers can keep operating when local failures occur. As the internal traffic of HPC data centers grows dramatically in the future, InfiniBand will have even more room to develop as a server-to-server network technology.