The road to achieving teraflop-scale computing in AdvancedTCA
The proposal: Develop a high-performance machine vision design for use in an industrialized environment, requiring overall high availability and reliability. The challenge: To develop a high-density imaging platform that could scale up to multiple 42U racks.
This project called for tight integration of the compute architecture, the interconnect buses, hardware management, and thermal management, for which AdvancedTCA (ATCA) emerged as the best approach. Optimizing for density meant that multiple processing elements and interconnect switching components had to be combined. One difficulty remained: ATCA was not yet ready to provide power beyond 400 watts per blade.
700 watts in AdvancedTCA
A key advantage that kick-started this project was the availability of several infrastructure components, such as digital signal processor (DSP) advanced mezzanine cards (AMCs) and RapidIO carrier blades. Within three months, a set of these products was delivered to accelerate the user’s application development and integration, in a setup that closely resembled the final, optimized system. The project to optimize this image processing system continued in parallel.
Deployment requirements, most notably reducing real estate on the factory floor and increasing reliability by removing cables, led to the decision to integrate the equivalent of six AMCs and the carrier switching functions onto a single blade. Net result: a performance/density improvement of over 50 percent compared to the AMC system, and an additional cost advantage from the improved function/overhead ratio.
In a 42U rack (Figure 1), this integration amounts to 4,000 CPU cores and 3.5 terabytes of memory, interconnected via 40 Gbps Ethernet and 20 Gbps RapidIO, all liquid-cooled and capable of dissipating 20 kilowatts. This particular architecture put a requirement of 700 watts per ATCA blade on the table in two ways: first, the local power supply has to deliver this power to the components on the blade; second, the resulting heat has to be transported out of the system. To meet the second requirement, a closed-loop water-cooling circuit was developed and manufactured in-house.
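As a rough plausibility check on the cooling requirement, the water flow needed to carry a given heat load follows from Q = m_dot * c_p * delta_T. The sketch below is illustrative only: the 20 kW and 700 W figures come from the text, while the 10 K temperature rise between supply and return water is an assumption chosen for the example and not a figure from the design.

```python
# Back-of-the-envelope coolant flow for a given heat load: Q = m_dot * c_p * delta_T.
WATER_SPECIFIC_HEAT_J_PER_KG_K = 4186.0  # liquid water near room temperature
WATER_DENSITY_KG_PER_L = 1.0             # approximation


def coolant_flow_l_per_min(heat_load_w: float, delta_t_k: float) -> float:
    """Return the water flow (litres/minute) that carries heat_load_w at a delta_t_k rise."""
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    mass_flow_kg_per_s = heat_load_w / (WATER_SPECIFIC_HEAT_J_PER_KG_K * delta_t_k)
    return mass_flow_kg_per_s / WATER_DENSITY_KG_PER_L * 60.0


ASSUMED_DELTA_T_K = 10.0  # assumption, not a figure from the article

rack_flow = coolant_flow_l_per_min(20_000, ASSUMED_DELTA_T_K)  # whole rack, 20 kW
blade_flow = coolant_flow_l_per_min(700, ASSUMED_DELTA_T_K)    # heat of one 700 W blade
print(f"rack: {rack_flow:.1f} L/min, share per blade: {blade_flow:.2f} L/min")
```

Under this assumed temperature rise, the rack needs just under 29 litres of water per minute, of which roughly one litre per minute accounts for the heat of a single blade. A smaller allowed temperature rise scales the flow up proportionally.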
External image sensors generate raw data and feed it into the system at a rate of hundreds of gigabytes per second for real-time processing. The performance of the semiconductor manufacturing line depends directly on the image processing throughput, which puts hard real-time deadlines on pixel data processing. Data is acquired and distributed by dual-slot rear transition modules (RTMs, PICMG 3.8), specially designed to handle the sustained data stream. They distribute the data via multiple Serial RapidIO 2.1 quad links to the processing blades (Figure 2). Over a dual 10 Gigabit Ethernet link between the aggregation switch and the FPGA on the acquisition module, the data flow can be reversed, enabling out-of-line processing in addition to the in-line method.
The reduced, processed data stream leaves the system over 40 Gigabit Ethernet links (following PICMG 3.2 standards), formed by bonding four 10 Gigabit Ethernet links to the central 40 Gigabit Ethernet switch blade. The central switch blade provides the uplink to postprocessing systems using multiple front 40 Gigabit Ethernet links. DSCP-based Quality of Service is used to separate real-time data streams from non-real-time control streams.
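To illustrate how DSCP marking separates traffic classes, the following minimal sketch sets the DSCP field on a UDP socket using the standard IP_TOS socket option. The choice of Expedited Forwarding (46) for real-time data and the default class (0) for control traffic is an assumption for the example; the article does not state which code points the system uses, and the function names are hypothetical.

```python
import socket

DSCP_EF = 46   # Expedited Forwarding; assumed here for real-time data streams
DSCP_CS0 = 0   # default class; assumed here for non-real-time control streams


def dscp_to_tos(dscp: int) -> int:
    """Place a 6-bit DSCP value in the upper six bits of the IPv4 TOS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0..63")
    return dscp << 2


def open_marked_udp_socket(dscp: int) -> socket.socket:
    """Open a UDP socket whose outgoing packets carry the given DSCP value."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
    return sock
```

Switches along the path can then map each code point to its own queue, so that control traffic does not delay the real-time stream. Whether the marking is honored depends on the switch configuration.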
Architecture deep dive
First, the 6PU blade: It was introduced as a generic high-performance processing blade, optimized for performance per watt, performance per unit volume, and performance per dollar. This focus resulted in a highly efficient processing blade. The name “6PU” is derived from its six processing units, each of which contains one TI 66AK2H14 DSP combined with two TI TMS320C6678 DSPs, interconnected via TI’s HyperLink protocol. This results in the following key facts:
- 6x Texas Instruments 66AK2H14 KeyStone-2 DSP, running at 1.2 GHz; per SoC: 4x ARM MPCore and 8x C66x DSP CorePac
- 12x Texas Instruments TMS320C6678 KeyStone DSP, running at 1.2 GHz; per SoC: 8x C66x DSP CorePac
- 144 Gigabyte of DDR3 memory including ECC
- All elements are interconnected by RapidIO 2.1, 10 Gigabit Ethernet, PCI Express 3.0 and TI’s HyperLink
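The per-blade figures in the list above can be tied back to the rack-level numbers quoted earlier with a few lines of arithmetic. The sketch below assumes 24 processing blades per rack; that count is not stated in the article and is inferred only because it reproduces the quoted totals of roughly 4,000 cores and 3.5 terabytes.

```python
# Per-blade figures taken from the list above.
K2H_PER_BLADE = 6        # 66AK2H14 SoCs per blade
K2H_CORES = 4 + 8        # ARM cores + DSP CorePacs per 66AK2H14
C6678_PER_BLADE = 12     # TMS320C6678 SoCs per blade
C6678_CORES = 8          # DSP CorePacs per TMS320C6678
BLADE_MEMORY_GB = 144
BLADE_POWER_W = 700

# Assumption: blade count per rack is not given in the article.
ASSUMED_BLADES_PER_RACK = 24

cores_per_blade = K2H_PER_BLADE * K2H_CORES + C6678_PER_BLADE * C6678_CORES
rack_cores = cores_per_blade * ASSUMED_BLADES_PER_RACK
rack_memory_gb = BLADE_MEMORY_GB * ASSUMED_BLADES_PER_RACK
rack_blade_power_w = BLADE_POWER_W * ASSUMED_BLADES_PER_RACK

print(f"{cores_per_blade} cores per blade")
print(f"{rack_cores} cores, {rack_memory_gb} GB, {rack_blade_power_w} W per rack")
```

With this assumed blade count, the processing blades account for 16.8 kW, which would leave part of the 20 kW rack budget for switch blades, RTMs, and other infrastructure.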
Figure 3 shows the topology of a single processing unit on the 6PU blade, while Figure 4 shows the topology of the processing units on the 6PU blade.
Extensive simulations and reviews were performed during the PCB design to ensure the signal integrity of the folded memory topology required for such a high density of embedded DDR3 memory chips. The high density of the board (18,000 surface-mount device [SMD] components and 70,000 nets), in combination with the number of high-speed serial differential lanes, posed quite a challenge.
Data acquisition: RTM
The PICMG 3.8 rear transition module (RTM) is a high-density, high-speed module that interfaces between the outside world and the processing blades. With RapidIO, it has remote direct memory access to the processing blades, achieving a data-driven architecture and avoiding spending scarce CPU cycles on simply transferring data.
Each RTM in the design was connected by multiple 20 Gbps Serial RapidIO 2.1 links, used for real-time data transfer towards the connected 6PU processing blades. In addition, six PCI Express 3.0 links provided out-of-band, real-time control, separate from the data plane. IPMI-compliant management functionality was included to provide out-of-band diagnostics.
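To give a feel for what "multiple" links means in practice, the sketch below estimates the payload capacity of one quad link and the number of links a given stream would need. It assumes the 20 Gbps figure is the raw line rate of four lanes at 5 Gbaud and accounts only for the 8b/10b line coding used by Serial RapidIO 2.x; packet header overhead is ignored, and the 100 Gbyte/s stream is an illustrative value, not a figure from the design.

```python
import math

LANES_PER_LINK = 4
LANE_RATE_MBAUD = 5000  # assumption: 4 x 5 Gbaud gives the quoted 20 Gbps


def link_payload_mbit_per_s() -> int:
    """Payload capacity of one quad link after 8b/10b coding, in Mbit/s."""
    raw_mbit_per_s = LANES_PER_LINK * LANE_RATE_MBAUD
    return raw_mbit_per_s * 8 // 10


def links_needed(stream_gbyte_per_s: float) -> int:
    """Minimum number of quad links for a stream, ignoring packet overhead."""
    stream_mbit_per_s = stream_gbyte_per_s * 8 * 1000
    return math.ceil(stream_mbit_per_s / link_payload_mbit_per_s())


print(link_payload_mbit_per_s() / 8000, "Gbyte/s per quad link")
print(links_needed(100), "quad links for an illustrative 100 Gbyte/s stream")
```

Under these assumptions, one quad link carries at most 2 Gbyte/s of payload, so a stream of 100 Gbyte/s would need at least 50 such links spread across the RTMs. Real link counts would be higher once protocol overhead and headroom are included.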
Xilinx’s seventh-generation Virtex and Zynq FPGAs provided the remote direct memory access interface towards the 6PU blades. Key facts include:
- Xilinx Virtex-7 FPGA: 4 x 4 Gigabyte DDR3 SDRAM at 1600 MT/s with ECC, multiple RapidIO 2.1 interfaces at 20 Gbps, multiple PCI Express 3.0 interfaces, multiple 10GBASE-KR Ethernet interfaces
- Xilinx Zynq-7000 FPGA: Dual ARM Cortex-A9 MPCore, 1.5 Gigabyte DDR3 SDRAM with ECC
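The peak bandwidth of the Virtex-7 buffer memory can be estimated from the 1600 MT/s transfer rate given above. The sketch assumes a 64-bit data bus per memory bank (ECC bits excluded) and that the four banks are independent interfaces; neither detail is stated in the article, so the result is only an upper bound under those assumptions.

```python
TRANSFER_RATE_MT_PER_S = 1600       # from the key facts above
ASSUMED_DATA_BUS_BITS = 64          # assumption: data width per bank, without ECC
ASSUMED_INDEPENDENT_INTERFACES = 4  # assumption: the "4 x" banks work in parallel


def peak_bandwidth_mbyte_per_s(transfers_mt_per_s: int, bus_bits: int) -> int:
    """Theoretical peak bandwidth of one DDR3 interface in Mbyte/s."""
    return transfers_mt_per_s * bus_bits // 8


per_interface = peak_bandwidth_mbyte_per_s(TRANSFER_RATE_MT_PER_S, ASSUMED_DATA_BUS_BITS)
total = per_interface * ASSUMED_INDEPENDENT_INTERFACES
print(f"{per_interface} Mbyte/s per interface, {total} Mbyte/s in total")
```

Sustained throughput is lower than this theoretical peak because of refresh cycles, bank conflicts, and read/write turnaround.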
Figure 5 shows the topology of the data-acquisition RTM.
System integration: cabinet design, cooling, IPMI, and support software
Due to the high density of the processing components and their power consumption, much of the design effort was spent on the cooling architecture. Simulations were performed to gain insight into the thermal performance of the system and its individual components. This research resulted in the design of highly efficient heatsinks in combination with push-pull cooling techniques. The end result: a complete closed-loop cooling cabinet that reduced noise and turbulence to a minimum.
The Intelligent Platform Management Interface (IPMI) was used extensively to control and monitor the entire infrastructure of the rack, ATCA subracks, blades, and RTMs. In total, more than 3,000 sensors actively monitor the status of the hardware. Due to the out-of-band nature of IPMI, comprehensive health status can be provided without interfering with the real-time data paths of the user application. The IPMI controllers also monitor the link status of the RapidIO and Ethernet links and keep track of bit-error-rate (BER) counters. This feature allows the user to continuously monitor the overall health status of the system using a single IPMI-over-LAN cable.
Board support packages (BSPs) for all the different processors facilitate application software development. In addition, a C-based library was developed that abstracts the rack as a whole, including low-level control of the rack, its PDU, the ATCA subracks, and boards, down to the component level. Board firmware upgrades can be achieved using PICMG HPM.1 IPMI extensions.
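As an illustration of out-of-band health monitoring, the sketch below parses sensor listings in the pipe-separated "name | reading | status" format printed by the widely used `ipmitool sdr list` command and reports every sensor that is not healthy. The sensor names and values in the sample are hypothetical, and the article does not say which tooling was actually used.

```python
from typing import NamedTuple


class SensorReading(NamedTuple):
    name: str
    reading: str
    status: str


def parse_sdr_list(text: str) -> list[SensorReading]:
    """Parse lines of the form 'name | reading | status'; skip anything else."""
    readings = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) == 3 and fields[0]:
            readings.append(SensorReading(*fields))
    return readings


def unhealthy(readings: list[SensorReading]) -> list[SensorReading]:
    """Return sensors whose status is neither 'ok' nor 'ns' (no reading)."""
    return [r for r in readings if r.status not in ("ok", "ns")]


# Hypothetical sample; names and values are invented for illustration.
SAMPLE = """\
Blade3 DSP0 Temp | 61 degrees C      | ok
Blade3 12V       | 11.90 Volts       | ok
Blade3 Inlet     | 78 degrees C      | cr
RTM1 Link Status | no reading        | ns
"""

for sensor in unhealthy(parse_sdr_list(SAMPLE)):
    print(f"ALERT {sensor.name}: {sensor.reading} ({sensor.status})")
```

Because the listing is fetched over the management LAN, polling of this kind does not touch the RapidIO or Ethernet data paths.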
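The idea of abstracting the rack as a whole can be pictured as a tree that runs from the rack down to individual components. The sketch below is a hypothetical Python model of such a hierarchy; it does not reproduce the C-based library's API, and all class, method, and node names are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Node:
    """One element of the hierarchy: rack, PDU, subrack, board, or component."""
    kind: str
    name: str
    children: list["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        """Attach a child node and return it, so calls can be chained downward."""
        self.children.append(child)
        return child

    def walk(self, path: str = "") -> Iterator[tuple[str, "Node"]]:
        """Yield (path, node) pairs depth-first, starting with this node."""
        here = f"{path}/{self.name}"
        yield here, self
        for child in self.children:
            yield from child.walk(here)


# Hypothetical rack layout.
rack = Node("rack", "rack0")
rack.add(Node("pdu", "pdu0"))
subrack = rack.add(Node("subrack", "atca0"))
blade = subrack.add(Node("board", "blade1"))
blade.add(Node("component", "dsp0"))

paths = [path for path, _ in rack.walk()]
print("\n".join(paths))
```

A single addressable tree of this kind lets one call reach any level, from switching the PDU to querying a single DSP, without the application needing to know which management path is used underneath.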
Prodrive Technologies Prodrive-technologies.com