Rock solid Xeon performance takes Sandy Bridge(s) to the server room

Intel has done it again, but what is the best board design to take advantage of all that processing horsepower?

The recent launch of the next-generation Intel Xeon E5 processor family using the Sandy Bridge architecture has set a new milestone in server-class computing performance. However, there are important blade/board design factors to consider if its potential is to be reached.

Whenever Intel launches a new processor generation, the world expects more horsepower. But this time, Intel’s Xeon E5 processor family using the Sandy Bridge architecture appears to have taken it to a whole new level. The LINPACK benchmark run by German computer magazine c’t showed an increase in compute performance of over 100 percent compared with the preceding Intel Xeon 56xx family based on the Westmere architecture. The same testing yielded SPEC CPU2006 results of 602 SPECint_rate_base2006 and 484 SPECfp_rate_base2006 with Intel’s flagship Xeon E5-2690 running at a 2.9 GHz clock frequency. On the same test bench, the preceding Intel Xeon X5680 (Westmere architecture at 3.33 GHz) achieved 349 (int) and 246 (fp), respectively. Intel itself reports SPEC CPU2006 results of 671 SPECint_rate_base2006 and 495 SPECfp_rate_base2006 for selected server platforms [2], while various Sandy Bridge test platforms also demonstrated low idle power draw and improved energy-efficiency benchmark results. So performance looks to be significantly improved this generation, but what has made this possible?
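For a sense of scale, the generational gains implied by the c’t figures work out as simple ratios of the published scores (note that the clock frequencies of the two parts differ):

\[
\frac{602}{349} \approx 1.73 \quad \text{(integer rate)} \qquad
\frac{484}{246} \approx 1.97 \quad \text{(floating-point rate)}
\]

That is roughly 73 percent more integer-rate and 97 percent more floating-point-rate throughput per two-socket system, despite the E5-2690’s lower 2.9 GHz clock.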

‘Tick-Tock’ around the clock

With every second Intel Xeon generation, the company shrinks die sizes with a newly introduced silicon process technology while maintaining the CPU’s architecture (‘Tick’). And with every alternate Intel Xeon generation, the company reinvents the CPU architecture but uses the same production process as the previous generation (‘Tock’). This so-called ‘Tick-Tock’ process has been very successful for Intel, and helps minimize the risks associated with introducing new technology while staying the course defined by Moore’s Law. So, this time, it’s a ‘Tock’ – the process technology stays at 32 nm while the processor architecture gets greatly improved.

The new processor architecture comes with eight processor cores per socket, each of which presents two logical CPUs by means of Hyper-Threading. A dual-socket server design can thus expose 16 physical cores and 32 logical cores. To tame this amount of available parallel performance, Xeon designers have provided a ring-shaped, bi-directional data path that connects all cores with the rest of the on-chip resources, the so-called ‘uncore’. These resources include up to 20 MB of third-level cache, the memory data paths, and the newly introduced PCI Express Gen 3 based I/O.
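The operating system sees those hardware threads as ordinary processors. As a minimal illustration (a sketch assuming Linux and glibc, not code from the article), the 32 logical CPUs of a dual-socket, 8-core, Hyper-Threaded Xeon E5 system can be confirmed with a one-line query:

```c
/* Minimal sketch: report the logical CPU count the OS exposes.
 * On a dual-socket Xeon E5 system this would be
 * 2 sockets x 8 cores x 2 hardware threads = 32. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long logical = sysconf(_SC_NPROCESSORS_ONLN);  /* logical CPUs online */
    if (logical < 0) {
        perror("sysconf");
        return 1;
    }
    printf("Logical CPUs visible to the OS: %ld\n", logical);
    return 0;
}
```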

Like its predecessor, the Intel Xeon E5 processor family features a separate memory controller for each available memory channel. Intel has increased memory bandwidth by adding another memory channel to the EP version of the Intel Xeon E5, so a server with two CPU sockets can now have as many as eight memory channels working in parallel, four per socket. The DDR3-based memory interfaces come with an improved transfer rate of 1600 MTps when using appropriate memory modules, another notable step up from the former 1333 MTps. In total, this represents a 60 percent memory bandwidth improvement.
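The quoted 60 percent follows directly from channel count and transfer rate. With 8 bytes moved per transfer on each 64-bit channel, the theoretical per-socket peak works out to:

\[
\frac{4 \times 1600\,\text{MTps} \times 8\,\text{B}}{3 \times 1333\,\text{MTps} \times 8\,\text{B}}
= \frac{51.2\,\text{GBps}}{32.0\,\text{GBps}} = 1.6
\]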

The EP variants of the Intel Xeon E5 processor family feature two Intel QuickPath Interconnect (QPI) links. These links provide direct connectivity between two CPUs in a shared-memory (NUMA) and I/O system, so each CPU can access memory and I/O resources connected to its counterpart. Though QPI has been in use since the earlier Nehalem and Westmere generations, the EP variant’s inter-socket bandwidth has been improved by over 100 percent, through both the provisioning of the second QPI link and support for higher clock rates (8 GTps per QPI link).
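As a worked figure (assuming QPI’s 2 bytes of payload per transfer in each direction, a detail not stated in the article), the quoted rate translates to:

\[
8\,\text{GTps} \times 2\,\text{B} = 16\,\text{GBps per direction per link}
\]

or 32 GBps of aggregate unidirectional inter-socket bandwidth across the two links.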

I/O bottleneck, what I/O bottleneck?

Unlike the Westmere architecture, the new Intel Xeon E5 processor family comes with integrated PCI Express Gen 3 root complexes. PCIe Gen 3 provides 8 Gbps for each available PCIe lane, effectively doubling the I/O bandwidth of Gen 2. As the Intel Xeon E5 EP variants allow for up to 40 PCIe lanes per CPU socket, the achievable I/O bandwidth is impressive: in a two-socket server design, aggregated I/O bandwidth can reach unidirectional peaks of up to 640 Gbps. Lanes can be grouped in a variety of fashions so that the available I/O can be optimally connected to the processor complex. Beyond the speed enhancements, the processor’s PCIe interface has been optimized to reduce PCIe latency by up to 30 percent compared with former processor architectures.
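The 640 Gbps figure is simply the lane count multiplied out:

\[
2\ \text{sockets} \times 40\ \text{lanes} \times 8\,\text{Gbps} = 640\,\text{Gbps (unidirectional)}
\]

Because PCIe Gen 3 replaces Gen 2’s 8b/10b encoding with the far leaner 128b/130b scheme, nearly all of that raw bit rate is usable payload bandwidth, which is how an 8 Gbps lane delivers twice the throughput of a 5 GTps Gen 2 lane.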

Another performance-enhancing feature introduced with the Sandy Bridge architecture is Data Direct I/O (DDIO). Former architectures routed DMA data between I/O devices and external memory, and while this approach is adequate for moderate I/O bandwidth, it can become a bottleneck for I/O with more demanding throughput requirements. Particularly in server architectures equipped with multiple 10G network interfaces or PCIe Gen 3 connected endpoints with very high throughput requirements, I/O traffic can rise dramatically. Here, the Direct Memory Access (DMA) channel between memory and I/O device turns out to be the speed-limiting factor.

With DDIO, the DMA takes place directly between the I/O device and the processor’s internal cache, avoiding traffic to the slower external memory whenever possible. Accessing the CPU’s internal cache can be much faster than transferring data in and out of memory, and as the cache size of the Intel Xeon EP is sufficiently large, data can remain there without needing to be transferred to main memory. This new feature not only increases the maximum achievable I/O bandwidth, but also reduces latency effects. In addition, power consumption can be decreased as the external memory is accessed less often. Not only PCIe Gen3-enabled endpoints, but also widely used 10G network controllers, benefit from this approach. Thus, network throughput can be significantly improved with DDIO turned on.

Secure and efficient

Other architecture enhancements include Intel Advanced Vector Extensions (Intel AVX) and Intel Advanced Encryption Standard New Instructions (Intel AES-NI). The vector unit grew from 128-bit to 256-bit operands, which can accelerate vector and floating-point operations. Like its predecessors, the Sandy Bridge architecture comes with the AES-NI instruction set and embedded engines that help improve the throughput of encryption/decryption algorithms. Further security features include Intel Trusted Execution Technology (Intel TXT), a hardware security capability that allows operating systems and hypervisors to boot into a measured, trusted environment.
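To make the 256-bit claim concrete, here is a minimal AVX sketch (our illustration, not vendor code): a single instruction operates on eight single-precision floats at once, twice the width of 128-bit SSE. It assumes an AVX-capable CPU such as the Xeon E5 and compilation with -mavx.

```c
/* Minimal AVX sketch: add two vectors of 8 floats in one instruction. */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float r[8];

    __m256 va = _mm256_loadu_ps(a);     /* load 8 floats (unaligned) */
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vr = _mm256_add_ps(va, vb);  /* 8 additions at once */
    _mm256_storeu_ps(r, vr);

    for (int i = 0; i < 8; i++)
        printf("%.0f ", r[i]);          /* prints: 9 9 9 9 9 9 9 9 */
    printf("\n");
    return 0;
}
```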

There’s an app for that: Networking

For networking applications, Intel provides the Intel Data Plane Development Kit (Intel DPDK). The Intel DPDK is a software suite that helps developers exploit the performance of the new Intel processor architecture and network silicon in network processing applications. It reduces the overhead associated with packet handling and lets designers scale packet processing tasks across the available processors and cores. It also allows applications built on top of the Intel DPDK to transition smoothly to future CPU generations and network devices, so networking performance scales with the available hardware resources.
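The heart of a DPDK application is a poll-mode loop that pulls packet bursts straight from the NIC with no interrupts and no per-packet system calls, which is where the packet-handling overhead savings come from. The sketch below shows the shape of such a loop using real DPDK calls; the port number, queue number, and burst size are illustrative assumptions, and all device, queue, and memory-pool setup is omitted for brevity.

```c
/* Sketch of a DPDK poll-mode receive loop. Port 0, queue 0, and a
 * 32-packet burst are illustrative; rte_eth_dev_configure(), queue
 * setup, and mempool creation are omitted. */
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static void rx_loop(uint16_t port_id)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* Poll the NIC: no interrupts, no per-packet syscalls */
        uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);

        for (uint16_t i = 0; i < nb_rx; i++) {
            /* ... packet processing would go here ... */
            rte_pktmbuf_free(bufs[i]);  /* return mbuf to its pool */
        }
    }
}

int main(int argc, char **argv)
{
    if (rte_eal_init(argc, argv) < 0)   /* cores, hugepages, NIC binding */
        return -1;
    rx_loop(0);
    return 0;
}
```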

As important as CPU performance is, designers have to ensure that power consumption and cooling costs do not run away, so design for energy efficiency is an important consideration. The Sandy Bridge architecture is not the first server processor architecture from Intel to come with power-saving technology, but it provides further improved performance per Watt and idle power modes. Whenever the software load goes down, the processor can reduce the clock speed of individual cores or turn off resources that are not needed. And for applications that cannot make use of all the cores but require full performance, Intel Turbo Boost Technology 2.0 raises clock rates as far as the power and thermal limits of the specific processor derivative allow.
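This clock scaling is easy to observe from software. The following sketch (assuming Linux’s cpufreq sysfs interface; not part of the article) samples the current frequency of core 0, which drops at idle and rises under load or Turbo Boost:

```c
/* Minimal sketch: read core 0's current clock via Linux cpufreq sysfs. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen(
        "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq", "r");
    unsigned long khz;

    if (f == NULL) {
        fprintf(stderr, "cpufreq interface not available\n");
        return 1;
    }
    if (fscanf(f, "%lu", &khz) != 1) {  /* value is reported in kHz */
        fclose(f);
        return 1;
    }
    fclose(f);
    printf("core 0 currently at %.2f GHz\n", khz / 1e6);
    return 0;
}
```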

Applying Intel’s improvements to the server room

However, in a real-world application of server-class processors, further power consumption, cooling, and implementation concerns must be accounted for when introducing a new, powerful processing architecture into a networking environment. There is no single way to do this, and each approach carries its own trade-offs between performance and cost. Two such concepts within the xTCA ecosystem are:

1.  Blades that combine an AdvancedMC (AMC) site with CPU sockets

2.  Blades with a dedicated AdvancedTCA (ATCA) Fabric interface site

 

Blades that provide dual CPU sockets combined with an AMC site can offer full processing horsepower with high configuration flexibility. But the constraints of the cooling solution may prevent them from meeting these performance targets at higher ambient temperatures: they end up with less memory capacity, less memory bandwidth, or some other limitation. Designers may also face PCB routing issues driven by real estate constraints that force them to reduce the number of connected memory channels, which directly hurts system performance. Or the thermal budget simply does not allow operating as many as eight memory modules in parallel, which likewise reduces the number of usable memory channels.

Other blades on the market feature a dedicated module site that implements the ATCA Fabric interface and adapts it to different backplane bandwidths. While this sounds like a great idea, the approach has pitfalls. Designers are forced into compromises to serve the differing market requirements of 10G- versus 40G-enabled boards; in fact, this requires support for different architectures that go beyond what a simple mezzanine module can provide. In addition, a mezzanine approach carries the cost of the mezzanine site, the necessary signal retiming circuitry, and the mezzanine PCB and assembly. It can also block valuable real estate and airflow that are crucial for adequate cooling. Further, field installations are unlikely to be widely upgraded with different fabric modules, so the mezzanine concept has little value for end users, who may not be willing to pay for the extra effort.

Therefore, what is needed to house the next generation of server-class processors is a blade that forgoes space-consuming AdvancedMC mezzanine modules and is instead optimized for CPU performance, memory capacity, and memory and I/O bandwidth, while still allowing rich I/O and storage extensions via a range of Rear Transition Modules (RTMs).

Averting the AMC

Emerson Network Power designed the ATCA-7370 server blade, optimized for connection to 10G-enabled backplanes, to avoid elaborate mezzanine concepts. The blade deliberately omits an AMC mezzanine site: the compute gains of the current processor generation limit the usefulness of AMC-based offload, and a processor-board-plus-AMC combination simply cannot match a well-architected dual-socket design in delivering the performance per watt required for next-generation systems. Instead, Emerson provides user-installed offload functionality via a space- and power-optimized proprietary mezzanine carrying a single Intel Communications Chipset 8920 accelerator, and uses the freed space for larger heatsinks to support higher-performance (and higher-power) CPUs.

The processor complex is connected to the ATCA backplane via redundant 10G Ethernet (Fabric Interface) featuring the state-of-the-art Intel 82599 network controller (Figure 1). This is a very cost-effective approach, and it leaves real estate for an adequate cooling solution as well as additional I/O functionality and storage extensions via a range of Rear Transition Modules (RTMs).

Figure 1: A block diagram reveals connection of the Intel Xeon E5 processor complex to the ATCA-7370 backplane via the redundant 10G Ethernet Fabric Interface, avoiding space consuming AMC sites and elaborate mezzanine concepts.

The blade hosts two Intel Xeon E5-2648L processors clocked at 1.8 GHz with 70 W power dissipation each. This processor is particularly suitable for environments with higher ambient temperatures, supporting ambients of up to 55 ºC for 96 hours as defined by NEBS standards. The EP-compliant processors provide the full capabilities of the Sandy Bridge architecture, including 20 MB of last-level cache and eight cores per socket, though the ATCA-7370 can also be populated with higher-clocked processors dissipating 95 W.

The board design makes full use of the eight available memory channels, connecting one memory module per channel. The processors' PCI Express resources are also extensively used, with local I/O support for 10G and 1G network interfaces as well as RTM connections provided through the PCIe interface. Dual 1G Ethernet interfaces (1000Base-T) are accessible via RJ45 connectors on the faceplate (Figure 2).

Figure 2: The Emerson Network Power ATCA-7370 provides the real estate to allow for a wide range of I/O while adequately dissipating the power of high-performance network processors.

 

The Emerson ATCA-7370 meets NEBS level 3 and ETSI standards for integration into telecommunications equipment, and is fully backward compatible with existing RTM products available for Emerson’s current Westmere-based ATCA-736x family, providing a great selection of storage and I/O options. Equipment based on existing Emerson Intel architecture blades can easily be field upgraded while RTM and cabling installations remain untouched.

Server blades and the new network processor architecture

Server blades are ideal platforms for delivering the full performance of the new Intel Xeon E5 processor family in communications and high-performance computing applications, but only if the blade design is optimized accordingly. Architectures that emphasize heat reduction and efficient use of real estate, such as the ATCA-7370, provide sufficient horsepower to satisfy the data demand of multiple 10G network connections. The enhanced capabilities make the processor architecture more attractive for packet processing and other networking functions that have traditionally been served by different architectures.

Chris Engels is Senior Technical Marketing Manager of Embedded Computing at Emerson Network Power

Brian Carr is Strategic Marketing Manager of Embedded Computing at Emerson Network Power

Emerson Network Power

www.EmersonNetworkPower.com

EmbeddedComputingSales@Emerson.com