How multicore packet processing software enables cost-effective equipment for 4G networks

No need for a tug of war between low power and efficient handling of LTE traffic

Charlie discusses how specialized software, designed for high-performance processing of network packets and optimized for multicore processors, enables system designers to meet the conflicting goals of high traffic rates, low system power, and minimum system cost.

As advanced services proliferate and video consumes an ever-increasing share of wireless network capacity, the requirements for high-performance processing of network traffic will continue to grow dramatically (Figure 1). Each piece of equipment in the network must achieve higher levels of packet processing performance. At the same time, the equipment must be designed to meet challenging power, cost, and schedule requirements.

Figure 1: Total traffic in the core network is growing.


The performance challenge for 4G networks

Designers of 4G telecom infrastructure products, whether LTE or WiMAX, face challenging performance requirements that cannot be addressed with the same techniques that worked for 2G and 3G equipment.

Driven by high-bandwidth Internet applications, the total traffic in the core network is growing at over 100 percent per year, so service providers expect individual network elements such as packet gateways to increase bandwidth by at least a corresponding amount.

At the same time, telecom equipment is increasingly deployed in commercial and outdoor environments without forced-air cooling, placing severe restrictions on the number of high-performance processor subsystems that can be used.

Finally, equipment suppliers operate under ever more challenging cost constraints. Such restrictions apply to both CAPEX and OPEX. Low product cost is essential to support worldwide deployments of 4G networks. At the same time the operating expense of electrical power, both to run the equipment and for cooling, is a major contributor to the calculation of overall Total Cost of Ownership (TCO).

To be successful, developers of 4G networking equipment must deliver solutions that achieve maximum throughput for tomorrow’s network traffic patterns (dominated by video and data), while minimizing system-level power consumption and cost.

Packet processing is the key

For 4G networks, 3GPP has specified a flat IP-based network architecture, System Architecture Evolution (SAE), aimed at efficiently supporting massive IP service use (Figure 2). As a consequence, the network architecture is much simpler than existing architectures such as 3G. However, as data, voice, and video all use IP packets, processing these packets efficiently becomes critical to ensure LTE system performance.

Figure 2: Processing packets efficiently is key to LTE system performance.


On top of the IPv4 and IPv6 protocols that the SAE architecture supports, a large number of additional protocols have to be implemented:

·    Low-level protocols such as Internet Protocol Security (IPsec), Robust Header Compression (ROHC) and Virtual LAN (VLAN).

·    Within an overall 4G network, a number of protocols support communication between individual subsystems. For example, GPRS Tunneling Protocol (GTP) carries user data via IP tunnels between a Serving Gateway (SGW) and a base station (eNodeB). Similarly, Stream Control Transmission Protocol (SCTP) carries signaling between the Mobility Management Entity (MME), the SGW, and the eNodeB. Likewise, Generic Routing Encapsulation (GRE) provides VPN connections from the SGW to the Packet Gateway (PGW). And many more protocols are used throughout the network.

·    Differentiating services is also critical. IP QoS is required to prioritize real-time traffic over pure data traffic, while deeper packet inspection provides the mechanisms to identify user traffic so that users and applications can be better served.
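As a minimal illustration of the QoS point above (hypothetical code, not part of any 4G product), real-time traffic can be prioritized simply by inspecting the DSCP field of each IP header:

```python
# Hypothetical DSCP-based classifier: real-time traffic (e.g. voice marked
# Expedited Forwarding) is served from a higher-priority queue than data.
DSCP_EF = 46  # Expedited Forwarding code point, commonly used for voice

def classify(dscp: int) -> str:
    """Map a packet's DSCP value to a scheduling class."""
    return "realtime" if dscp == DSCP_EF else "best-effort"
```

Real equipment classifies on many more fields (5-tuples, application signatures), but the principle is the same: the sooner a packet is identified, the sooner it can be given the right treatment.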


All these protocols are carried in encapsulated packets. Starting from the layer 2 headers, packet processing software has to analyze the successive encapsulated headers as fast as possible.
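A minimal sketch of that successive decapsulation, assuming a GTP-U packet carried over IPv4 with optional 802.1Q tagging (illustrative Python, not an actual fast path implementation):

```python
import struct

# Hypothetical sketch: walk the encapsulation chain of a GTP-U packet
# (Ethernet -> optional VLAN -> IPv4 -> UDP -> GTP-U -> inner user IP).
ETH_VLAN, ETH_IPV4 = 0x8100, 0x0800
GTPU_PORT = 2152  # UDP port registered for GTP-U

def decapsulate(frame: bytes):
    """Return the list of protocol layers found in the frame."""
    layers = ["ETH"]
    ethertype, = struct.unpack_from("!H", frame, 12)
    offset = 14
    if ethertype == ETH_VLAN:              # 802.1Q tag precedes the ethertype
        ethertype, = struct.unpack_from("!H", frame, 16)
        offset = 18
        layers.append("VLAN")
    if ethertype != ETH_IPV4:
        return layers
    ihl = (frame[offset] & 0x0F) * 4       # IPv4 header length in bytes
    proto = frame[offset + 9]
    layers.append("IPV4")
    if proto != 17:                        # not UDP
        return layers
    udp_off = offset + ihl
    dport, = struct.unpack_from("!H", frame, udp_off + 2)
    layers.append("UDP")
    if dport == GTPU_PORT:                 # GTP-U tunnel carrying user IP
        layers.append("GTP-U")
    return layers
```

Each layer adds a lookup and a branch; performing this chain millions of times per second is exactly the workload that must be optimized.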

The critical performance challenge for 4G networking equipment is to process these IP packets at the highest possible throughput. In general, the designer’s objective is to perform this processing fast enough so that the throughput of the equipment is limited, not by the packet processing performance, but by the speed of the physical network connection, typically 10 Gbps, 40 Gbps or, soon, 100 Gbps. If the processing throughput matches the speed of the network, the system is said to be performing at “wire-speed,” maximizing the efficiency of the equipment.
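To make the wire-speed target concrete, a back-of-the-envelope calculation gives the per-packet time budget at 10 Gbps for minimum-size 64-byte Ethernet frames, which also occupy an 8-byte preamble and a 12-byte inter-frame gap on the wire:

```python
# Back-of-the-envelope packet rate at wire speed. Each Ethernet frame on
# the wire also carries 8 bytes of preamble and a 12-byte inter-frame gap.
def wire_speed_pps(link_bps: float, frame_bytes: int) -> float:
    overhead = 8 + 12                 # preamble + inter-frame gap, in bytes
    return link_bps / ((frame_bytes + overhead) * 8)

# Worst case: 64-byte minimum-size frames on a 10 Gbps link.
pps = wire_speed_pps(10e9, 64)        # ~14.88 million packets per second
budget_ns = 1e9 / pps                 # ~67 ns of processing time per packet
```

Roughly 67 nanoseconds per packet leaves time for only a few hundred instructions and a handful of memory accesses, which is why per-packet OS overheads are so damaging.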

Multicore processors deliver the raw performance

Over the past few years, developers of high-end processors migrated to multicore architectures in order to meet never-ending needs for increased performance. Dynamic power scales with the product of clock frequency and the square of the supply voltage, and higher frequencies demand higher voltages, so the traditional processor design approach of continually increasing clock frequencies to boost performance led to prohibitive processor power consumption. The industry therefore adopted multicore architectures in which the cores run at a clock frequency that keeps the power consumption of the processor as a whole manageable.

Today, all processors used in high-performance networking products are based on multicore architectures. These platforms provide the ideal environment for implementing the high-performance packet processing that 4G equipment requires.

The Operating System bottleneck

For developers of networking equipment, selecting a multicore processor for their system is only one step in designing a high-performance system solution. Generally, the more complex question is how to architect the software which, as explained above, typically needs to process packets from multiple streams of network traffic at wire-speed.

A standard networking stack uses Operating System (OS) services and falls prey to significant overheads associated with functions such as preemptions, threads, timers, and locking. Each packet passing through the system faces these processing overheads, resulting in a major performance penalty for overall throughput. Furthermore, although some improvements can be made to an OS stack to support multicore architectures, performance fails to scale linearly over multiple cores and a processor with, for example, eight cores may not process packets significantly faster than one with two cores. All in all, a standard operating system stack does a poor job of exploiting the potential packet processing performance of a multicore processor.

The fast path solution

Specialized packet processing software optimized for multicore architectures can do a better job of taking advantage of multicore packet processing performance (Figure 3). In a well-designed implementation, the networking stack is split into two layers. The lower layer, typically called the fast path, processes the majority of incoming packets outside the OS environment and without incurring any of the OS overheads that degrade overall performance. Only those rare packets that require complex processing are forwarded to the OS networking stack, which performs the necessary management, signaling, and control functions.
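A toy sketch of this split (hypothetical data structures, not the 6WINDGate design): packets that match an established entry in a flow table are forwarded directly in the fast path, while everything else is punted to the OS stack as an exception:

```python
# Hypothetical fast-path/slow-path split. Packets on established flows are
# forwarded immediately; unknown flows and control traffic are "exceptions"
# handed to the OS networking stack, which also populates the flow table.
flow_table = {}        # (src, dst) -> next hop, filled in by the slow path
exception_queue = []   # packets punted to the OS networking stack

def fast_path(pkt: dict):
    """Forward a packet if possible; otherwise punt it to the OS stack."""
    nexthop = flow_table.get((pkt["src"], pkt["dst"]))
    if nexthop is None or pkt.get("control"):
        exception_queue.append(pkt)   # rare case: slow path handles it
        return None
    return nexthop                    # common case: no OS involvement
```

The performance win comes from the ratio: if the vast majority of packets take the short branch, average per-packet cost approaches the cost of the fast path alone.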

Figure 3: Devoting cores to running the fast path maximizes overall system throughput.


A multicore processor is well-suited to implementing this kind of software architecture. Most of the cores can be dedicated to running the fast path, in order to maximize the overall throughput of the system, while only one core is required to run the operating system, the OS networking stack, and the application’s control plane.

In practice, the designer will analyze the specific performance requirements for the various software elements in the system (applications, control plane, networking stack, and fast path), deciding on the most appropriate allocation of cores to balance the overall system workload. The only restriction when configuring the platform concerns the cores running the fast path: because they execute outside the OS, they must be dedicated exclusively to the fast path and not shared with other software. The system can also be reconfigured dynamically as traffic patterns change.
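The allocation decision can be sketched as follows (an illustrative helper, not an actual configuration API):

```python
# Illustrative core-allocation helper: reserve a small number of cores for
# the OS, its networking stack, and the control plane, and dedicate all
# remaining cores exclusively to the fast path.
def allocate_cores(total_cores: int, control_cores: int = 1) -> dict:
    assert total_cores > control_cores, "need at least one fast-path core"
    return {
        "control": list(range(control_cores)),
        "fast_path": list(range(control_cores, total_cores)),
    }
```

On an eight-core part, for example, one core might run the OS and control plane while seven run the fast path; shifting that boundary is how the designer rebalances the system as traffic patterns change.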

Splitting the networking stack in this way has no impact on the functionality of application software, which interfaces to the same OS networking stack as previously. Existing applications do not need to be rewritten or recertified, but they run significantly faster because the underlying packet processing is accelerated through the fast path environment.

7x – 10x performance improvement

In a typical 4G application such as a PGW or SGW, when the standard OS networking stacks are replaced by optimized packet processing software based on the fast path concept, the networking performance of the processor subsystems will typically increase by seven to ten times. As Figure 4 illustrates, this massive increase in performance means the system will be able to manage seven to ten times more users with the same hardware.

Figure 4: Designers using fast path implementation can achieve system throughput not possible with a single multicore processor when using a standard OS stack.


Fast path implementation can allow the designer to meet system throughput goals that may have been unachievable on a single multicore processor when using a standard OS stack.

For example, 6WIND has recently demonstrated 10 Gbps Ethernet IP forwarding performance using the 6WINDGate packet processing software running on a 2.40 GHz Intel Xeon processor E5645. In this case, only two of the processor’s six cores were required to achieve wire-speed performance, leaving the remaining cores available either for processing more complex fast path protocols or for use by control plane protocols running under the OS.

Energy efficiency and cost reduction

For the network equipment provider, these compelling breakthroughs in system performance translate directly into improvements in energy efficiency and cost.

In a typical telecom infrastructure product, 60 percent of the power is consumed by processor subsystems (including memory), while the remaining 40 percent comes from I/O, system management, and power supply subsystems. If we assume, conservatively, that moving to the fast path software architecture described above yields a 7x performance improvement, the designer can now achieve the same level of system performance using one-seventh the number of processor subsystems. This eliminates 51 percent of the system’s power consumption (six-sevenths of the 60 percent drawn by the processor subsystems). At the same time, the power supply requirements are also reduced. If we assume that this saves another 4 percent of the original system power (again, keeping the numbers simple), the total system power consumption has been reduced by 55 percent while retaining the original level of performance.
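The arithmetic behind these figures can be checked in a few lines:

```python
# Reproducing the article's power arithmetic: processor subsystems draw 60%
# of system power, and a 7x fast-path speedup means only one-seventh as many
# processor subsystems are needed for the same throughput.
processor_share = 0.60
speedup = 7
processor_saving = processor_share * (1 - 1 / speedup)  # ~51% of total power
psu_saving = 0.04        # assumed simplification for reduced supply losses
total_saving = processor_saving + psu_saving             # ~55% of total power
```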

Looking at this another way, the performance-per-watt or energy efficiency of the system has more than doubled, purely as a result of improvements in the system software.

A similar analysis applies to the system cost. With 50 percent of the overall cost of a 4G gateway typically coming from the processor subsystems, a (conservative) 7x reduction in the number of those subsystems, together with reduced requirements for the system power supply and management features, can enable designers to achieve around a 50 percent reduction in overall system cost through moving to the optimized software solution.


A fast path implementation of the critical protocols avoids the performance bottlenecks imposed by standard operating system stacks. Such an implementation can improve system performance as well as increasing energy efficiency and lowering product cost. For network equipment providers, these benefits represent major sustainable business advantages as they address the challenges of 4G deployments.

Charlie Ashton is VP Marketing at 6WIND. He is responsible for 6WIND’s global marketing initiatives. He also manages 6WIND’s partnerships worldwide with semiconductor companies, subsystem providers and embedded software companies. Charlie has extensive experience in the embedded systems industry, with his career including leadership roles in both engineering and marketing at software, semiconductor, and systems companies. He led the introduction of new products and the development of new business at Green Hills Software, Timesys, Motorola (now Freescale), AMCC, AMD, and Dell.

Charlie graduated from the University of Reading in England with a BS degree in Electrical Engineering. He can be reached at