Software tuned for multi-core promises big gains in AdvancedTCA system performance

Moore’s Law came into being back in the heyday of processor design. It states that the number of transistors in a given integrated circuit roughly doubles every 18 months. As transistor geometries shrank and transistor counts multiplied, computing performance increased at about the same rate. Unfortunately, gone are the days when systems manufacturers could simply ride the continuous performance improvements of the next-generation processor. Shrinking transistor geometries have reached a point of diminishing returns with respect to clock rate, stalling the performance gains that once tracked Moore’s Law. Couple this with rising heat dissipation and memory latencies, and the electronics industry has reached a point where it must explore new directions to kick performance scaling back into gear.

It has been interesting to observe the innovations in processor technologies over the past eight to 10 years. Back in 1999, hope sprang eternal when network processors promised to be the disruptive technology, enabling performance gains never before seen in the industry. Since that time, a large number of chip manufacturers have dropped their network processor offerings for a variety of reasons. But for those companies, while the formal term “network processor” might no longer be in their product portfolios, the concept lives on in their mainstream offerings. The most notable of these companies is Intel, which has largely divested its network processor offerings but whose product lines now incorporate multiple cores inside a single chip. Multicore technology provides a theoretical N-times compute power increase, where N is the number of cores inside a specific chip.

For systems companies attempting to take advantage of this technology, it’s not just a matter of lobbing the same old software onto a new multiprocessor platform. That would be like going out and buying the latest professional racing bike and expecting your 13-year-old son to win the 2009 Tour de France! It’s imperative that the software be designed and implemented to take full advantage of the additional compute power of these new multicore processors.

In this month’s issue of Software Corner, we’ll explore the implications of multicore technology on networking and high availability (also known as HA) software solutions.

Networking – the need for speed

Nowhere is the need for increased performance more evident than in the networking segment of the electronics industry. The Internet has revolutionized how the world lives, learns, works, and interacts. Projections for increased use of multimedia services over the Internet point to staggering bandwidth and performance requirements.

Networking systems performance starts with the speed of the baseline operating system, protocol stacks, and high availability middleware. Key performance metrics include the number of connections that can be handled over a given period of time (capacity) and how fast a given connection can push data (speed). I recently got the opportunity to speak with Todd Mersch, Senior Product Line Manager of the Trillium software product line within Continuous Computing. Todd mentions that the Trillium team recognized early on that core software components (like protocols and HA middleware) must be able to fully and efficiently utilize these new multicore processors and platforms in order to “get Moore’s Law going again.”

Trillium has had a long and storied history with networking software and protocol stacks, having been actively involved in protocol stack and networking software solutions for more than 20 years. In 2000 Intel acquired Trillium, and this was followed by Continuous Computing’s purchase of Trillium in 2003.

The Trillium team’s two recent initiatives have been:

• Being a protocol software provider with an extensive library of local and wide area networking options along with integrated high availability middleware and reference platforms.

• Taking part in a larger, more comprehensive platform solution initiative called FlexTCA, where all the software is integrated with application programming interfaces on a specific production hardware platform.

Figure 1 shows a block diagram of the Continuous Computing FlexTCA system solution architecture.

Figure 1

The “always available” nature of communications systems requires a significant integration effort at many layers – from platform to application.

At the lowest layer, the platform management software gives the system its monitoring and hot-plug capabilities. A system management API lets product manufacturers add monitoring and diagnostics applications to the development mix.

The carrier-grade operating system is the first level of software that must take advantage of multicore processor capabilities. For example, Linux symmetric multiprocessing (also known as Linux-SMP) is implemented to “load balance” processes across the cores within a multicore processor. At its simplest, this approach allows the fundamental execution elements of an operating system (processes and threads) to be spread across the compute cores.
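To make this concrete, here is a minimal sketch (not FlexTCA code) of the default Linux-SMP behavior described above: threads are created with no placement hints, and the scheduler spreads them across the available cores.

```c
/* Minimal sketch: observing Linux-SMP spreading threads across cores.
 * Illustrative only; build with: gcc -pthread demo.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg)
{
    long id = (long)arg;
    volatile unsigned long x = 0;

    /* Burn a little CPU so the scheduler actually places the thread. */
    for (unsigned long i = 0; i < 100000000UL; i++)
        x += i;
    printf("thread %ld ran on core %d\n", id, sched_getcpu());
    return NULL;
}

int main(void)
{
    enum { NTHREADS = 8 };
    pthread_t t[NTHREADS];

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```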

The High Availability middleware and protocol stack software must be designed and implemented in such a way as to allow the SMP operating system to spread the processing across cores. Alternatively, platforms can be architected so that the middleware and network stacks utilize a specific set of cores while the application(s) run on a separate set. A hybrid of these two methods can also be used.
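The core-partitioning alternative can be expressed on Linux with CPU affinity. The sketch below pins a protocol stack thread to a reserved core set; the core numbers are illustrative assumptions, not the actual FlexTCA assignment.

```c
/* Sketch: dedicating cores 0-1 to protocol/middleware threads and leaving
 * the remaining cores for application threads, via Linux CPU affinity.
 * Core numbers are illustrative, not the FlexTCA assignment. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to the given set of cores. */
static int pin_to_cores(const int *cores, int ncores)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    for (int i = 0; i < ncores; i++)
        CPU_SET(cores[i], &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *stack_thread(void *arg)
{
    static const int stack_cores[] = { 0, 1 };  /* reserved for the stack */

    (void)arg;
    pin_to_cores(stack_cores, 2);
    /* ... protocol processing would run here ... */
    printf("stack thread running on core %d\n", sched_getcpu());
    return NULL;
}

int main(void)
{
    pthread_t t;

    pthread_create(&t, NULL, stack_thread, NULL);
    pthread_join(t, NULL);
    return 0;
}
```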

Todd mentions that from the software design and implementation perspective, several things must be done to use multicore processors effectively. First, the individual protocols themselves must be multithreaded so they can be distributed among multiple cores. Second, the portability layer must be made aware of, and optimized for, multicore processors. For example, if threads running on different cores need to pass data or control to one another, the protocol stack software must provide the shared-data and inter-processor communication mechanisms for those threads whenever the operating system does not supply them.
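As a rough illustration of the kind of primitive such a portability layer might supply, the sketch below implements a bounded message queue that threads running on different cores can use to pass data. A production stack would more likely use lock-free rings or hardware messaging, but the contract is the same.

```c
/* Sketch of an inter-thread message queue of the kind a portability layer
 * might provide when the operating system lacks one: a bounded ring
 * protected by a mutex and condition variables. Illustrative only. */
#include <pthread.h>

#define QDEPTH 64

struct msgq {
    void           *slot[QDEPTH];
    int             head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t  not_empty, not_full;
};

void msgq_init(struct msgq *q)
{
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

/* Block until there is room, then enqueue one message. */
void msgq_send(struct msgq *q, void *msg)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QDEPTH)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->slot[q->tail] = msg;
    q->tail = (q->tail + 1) % QDEPTH;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Block until a message arrives, then dequeue it. */
void *msgq_recv(struct msgq *q)
{
    void *msg;

    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    msg = q->slot[q->head];
    q->head = (q->head + 1) % QDEPTH;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return msg;
}
```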

The Trillium Advanced Portability Architecture (TAPA) is the code block responsible for these functions. Setting up the protocol stacks requires provisioning the number of threads used per protocol, the shared data constructs, execution control, and thread-to-core assignments. These settings must also be exposed through the application programming interface so the manufacturer can configure them in detail.
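TAPA’s actual interface is proprietary and not shown here, but a provisioning API of this kind might look something like the sketch below; every name and field is a hypothetical illustration, not the real TAPA API.

```c
/* Hypothetical sketch of the provisioning data a portability layer such as
 * TAPA might expose through its API. All names and fields are illustrative
 * assumptions, not the actual TAPA interface. */
#include <stddef.h>

struct proto_provision {
    const char *proto_name;        /* e.g., "SIP" */
    int         num_threads;       /* threads dedicated to this protocol */
    const int  *core_map;          /* thread-to-core assignments */
    size_t      shared_mem_size;   /* size of the shared data constructs */
    int         run_to_completion; /* execution-control model flag */
};

/* A manufacturer might provision SIP across four cores like this: */
static const int sip_cores[] = { 2, 3, 4, 5 };

static const struct proto_provision sip_cfg = {
    .proto_name        = "SIP",
    .num_threads       = 4,
    .core_map          = sip_cores,
    .shared_mem_size   = 16 * 1024 * 1024,
    .run_to_completion = 1,
};
```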

One performance improvement example Todd gave relates to the Session Initiation Protocol (SIP). In benchmarks using 64 multiprocessing threads, SIP call control performance (the number of calls that can be handled per second) scaled up to 6,000 calls per second – nearly an order of magnitude improvement over a single-processor implementation, implying a single-processor baseline on the order of 600 to 700 calls per second.

Todd says the implementation is “not just screwdriver integration.” Continuous does not merely optimize one piece without regard to other software components in the system. Instead, the FlexTCA integration for multicore extends through the operating system, protocols, and software infrastructure.

The FlexTCA System software web interface allows graphical configuration of each layer within FlexTCA – from system initialization through the HA middleware and protocol stacks. With the multicore updates, this web interface has also been extended to allow prioritization of functions within the architecture, which in the implementation translates to what gets loaded onto which cores.

Picking up right where the swapped board left off

The initialization configuration is kept in the system services HA management component. If a blade needs to be swapped out, its configuration is retained, so the replacement blade can be brought up through the software configuration automatically. Replacement happens on a slot-by-slot basis; what’s more, the configuration can be defined uniquely for each slot in the system. This capability extends “hot swap” from simply replacing a nonworking board with another one to plugging in a new board that is automatically initialized and configured so that the software picks up where the old board left off.
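A slot-keyed configuration store is one way to picture this. The sketch below is purely illustrative (the names and structures are assumptions): configuration is recorded per physical slot, so a replacement blade inserted into the same slot inherits the stored record.

```c
/* Illustrative sketch of slot-keyed configuration persistence: the HA
 * management component stores configuration by physical slot, so a
 * replacement blade in the same slot inherits it. Names are assumptions. */
#include <string.h>

#define MAX_SLOTS 14   /* a common ATCA shelf size; illustrative */

struct slot_config {
    int  in_use;
    char image[64];   /* software load assigned to this slot */
    char role[32];    /* e.g., "control" or "data-plane" */
};

static struct slot_config shelf[MAX_SLOTS];

/* Record a slot's configuration. */
void slot_config_set(int slot, const char *image, const char *role)
{
    shelf[slot].in_use = 1;
    strncpy(shelf[slot].image, image, sizeof(shelf[slot].image) - 1);
    strncpy(shelf[slot].role, role, sizeof(shelf[slot].role) - 1);
}

/* On hot-swap insertion, configure the new blade from the stored record
 * so the software picks up where the old board left off. */
const struct slot_config *slot_config_get(int slot)
{
    return shelf[slot].in_use ? &shelf[slot] : NULL;
}
```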

Other standards

It’s also worth noting that the FlexTCA architecture includes a number of high availability industry standards.

The HA middleware is a Service Availability Forum (SA Forum) compliant framework that lets the protocols work with SA Forum middleware. Continuous Computing has long had a similar proprietary component in its software portfolio (upSuite) and leveraged that knowledge and experience in working with partner GoAhead Software to provide a superset of the HA middleware capabilities required by the SA Forum. The result is a standards-based integration from platform to application.

The platform management software is also based on the Hardware Platform Interface (HPI) and Intelligent Platform Management Interface (IPMI) standards. Again, Continuous Computing partners with GoAhead, whose HA middleware, SAFfire, Continuous Computing has integrated and extended to fit within the FlexTCA environment. The Continuous Computing Unified Management Interface (UMI) and remote Element Management System (EMS) provide a single view of system alarms, statistics, and events for the system administrator, tying all the FlexTCA components together into a single unified system with a consistent look and feel.
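For readers unfamiliar with HPI, the snippet below shows what minimal resource discovery looks like through the standard SaHpi API. It assumes an HPI implementation such as OpenHPI is installed; it is a generic HPI example, not Continuous Computing or GoAhead code.

```c
/* Minimal HPI resource discovery using the SA Forum SaHpi API.
 * Link against an HPI implementation (for example, OpenHPI). */
#include <stdio.h>
#include <SaHpi.h>

int main(void)
{
    SaHpiSessionIdT sid;
    SaHpiEntryIdT   entry = SAHPI_FIRST_ENTRY;
    SaHpiRptEntryT  rpt;

    if (saHpiSessionOpen(SAHPI_UNSPECIFIED_DOMAIN_ID, &sid, NULL) != SA_OK)
        return 1;
    saHpiDiscover(sid);  /* populate the resource presence table */

    /* Walk the resource presence table and print each resource tag. */
    while (entry != SAHPI_LAST_ENTRY &&
           saHpiRptEntryGet(sid, entry, &entry, &rpt) == SA_OK) {
        printf("resource %u: %.*s\n", rpt.ResourceId,
               rpt.ResourceTag.DataLength,
               (const char *)rpt.ResourceTag.Data);
    }
    saHpiSessionClose(sid);
    return 0;
}
```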

HA 10 GbE AdvancedTCA system for telecom applications

The Continuous Computing FlexTCA System (Figure 2) brings together all the hardware and software elements to provide a High Availability 10 Gigabit Ethernet AdvancedTCA system for telecom applications.

Figure 2

There are currently two classes of payload blades for the FlexTCA system, the latest versions of which are the FlexPacket ATCA-PP50 and the FlexCompute ATCA-XE50.

The FlexPacket ATCA-PP50 is the traditional data plane processing blade within the FlexTCA system. The PP50 comes with one or two RMI XLR732 packet processors; each packet processor contains eight MIPS64 cores, each with four-way multithreading, for 32 hardware threads per device. The blade also carries a Fulcrum Microsystems 10 Gigabit Ethernet switch that supports bidirectional 10 Gbps flows and provides intelligent switching and load balancing among the control plane blades within FlexTCA. Typical applications for this blade include traffic shaping and traffic management.

The FlexCompute ATCA-XE50 blade handles control and management plane functions within the FlexTCA system. The XE50 comes with one or two Intel quad-core Low Voltage Xeon processors and is typically where the control and management functions of a telecom solution run. The XE50 also provides an AdvancedMC slot; for some applications, running the data plane functions on an AdvancedMC daughtercard and passing traffic to the XE50 for control and management processing can reduce the number of blades needed in a FlexTCA system.

Todd mentions that although the PP50 and XE50 blades were designed for optimal data plane and control plane processing respectively, the lines between the two planes are blurring. A high-performance Deep Packet Inspection (DPI) application that requires little or no control plane processing may run its small control plane functions on the PP50 alongside the DPI workload and still get the desired performance. Likewise, designers who want to do a small amount of filtering on the 10 Gigabit line, with the majority of the processing being control functions, may use just the XE50 and front-end the simple filtering either on the XE50 blade itself or on the AdvancedMC daughtercard. It’s also not unusual for the control blade to perform some higher-layer deep packet inspection once the data plane processor has completed the rough filtering on the line.

Two-tier real-time load balancing

Another notable capability of the FlexTCA system is two-tier load balancing. A typical AdvancedTCA system has a switch blade and multiple compute blades. As packets enter the system, the FlexCore ATCA-FM40 – the switch card used within the FlexTCA system – can perform statistical load balancing among the payload blades (both PP50s and XE50s), which in turn can perform more complex load balancing based on identifying flows at layers 3, 4, or 7. For more information on FlexTCA and its two-tier load balancing capabilities, I recommend reading “Load Balancing Between Server Blades Within ATCA Platforms” (http://www.ccpu.com/papers/loadbalancing/) by James Radley, Principal Architect, Continuous Computing, which covers the load balancing and flow identification standards used to implement two-tier load balancing. Another excellent white paper, “Statistical Load Balancing on the FlexCore ATCA-FM40” (http://www.ccpu.com/papers/loadbalancingFM40/), also by James Radley, covers the FM40 switch blade. Together, these papers cover the standards and algorithms used within the FlexTCA two-tier switching and load balancing environment.
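The flow-identification step on the payload blades can be pictured as hashing a packet’s 5-tuple so every packet of a given flow lands on the same blade. The sketch below uses an FNV-1a hash as an illustrative stand-in; it is not the FM40’s actual algorithm.

```c
/* Illustrative flow-aware load balancing: hash the 5-tuple so all packets
 * of a flow map to the same payload blade. FNV-1a is a stand-in here,
 * not the algorithm the FM40 actually uses. */
#include <stddef.h>
#include <stdint.h>

struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  protocol;
};

/* FNV-1a over a byte range, threading the running hash through. */
static uint32_t fnv1a(const void *data, size_t len, uint32_t h)
{
    const uint8_t *p = data;

    while (len--) {
        h ^= *p++;
        h *= 16777619u;
    }
    return h;
}

/* Hash each field separately to avoid folding in struct padding. */
static uint32_t flow_hash(const struct flow_key *k)
{
    uint32_t h = 2166136261u;

    h = fnv1a(&k->src_ip,   sizeof(k->src_ip),   h);
    h = fnv1a(&k->dst_ip,   sizeof(k->dst_ip),   h);
    h = fnv1a(&k->src_port, sizeof(k->src_port), h);
    h = fnv1a(&k->dst_port, sizeof(k->dst_port), h);
    h = fnv1a(&k->protocol, sizeof(k->protocol), h);
    return h;
}

/* Pick a payload blade for this flow. */
int select_blade(const struct flow_key *k, int num_blades)
{
    return (int)(flow_hash(k) % (uint32_t)num_blades);
}
```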

Summary

Multicore processing promises to be the next-generation wave that allows networked systems to achieve performance and capacity levels never seen before. However, it’s important that software and systems companies understand the ramifications and intricacies of these platforms in order to develop software solutions capable of riding Moore’s Law into the future. Continuous Computing’s FlexTCA architecture provides a software solution, from platform to application programming interface, that is designed, implemented, and tested to effectively utilize these new multicore processor platforms.

Information on Trillium multicore software is available at http://www.ccpu.com/products/trillium/multi-core.html.

For more information, contact Curt at [email protected]