It's all in the packet processing

Thomas walks us through network processor evolution milestones – context for understanding the role of synchronous dataflow architecture as network processors adapt to meet the needs of content traffic growth.

Bandwidth requirements for carriers and enterprise networks are rising to 10 Gbps, 40 Gbps, 100 Gbps, and beyond. To serve both the consumer and business markets, service providers require reliable, consistent performance from their networks and gear. Service providers are also looking for a level of flexibility that will enable vendor interoperability while allowing them to introduce and adapt to standards.

Network processors must shift to meet the needs of content traffic growth in the emerging infrastructure. Beginning with general-purpose processors and moving to today’s multiprocessors, network processors have evolved over the past 10 years. This evolution has stemmed from the changing needs of the service providers as well as the shift in networks. Today’s networks are “intelligent,” and as consumers and businesses alike continue to demand more applications requiring higher bandwidth, these networks will require significantly more packet processing.

To enable service providers and network equipment vendors to meet the need for increased bandwidth and higher performance, developers must take a fresh look at the overall architecture used in the network as well as revisiting the individual processing elements and topologies.

Network architecture evolution

Before we jump into today’s architecture, let’s take a brief look back at the evolution of network architecture and processors. As the demand for bandwidth steadily grew, so did the need for additional packet processing. At data rates above a few gigabits per second, general-purpose architectures could not keep up with high-speed forwarding or other packet processing. Router architecture evolved to include distributed packet processing over line cards interconnected by a high-speed switch fabric. Manufacturers developed fixed-function Application-Specific Integrated Circuits (ASICs) to provide framing, classification, and forwarding. And while fixed-function ASICs were – and still are – an alternative, they have the drawback of being less flexible than other options. Control flow processors, such as Reduced Instruction Set Computers (RISC), featuring high instruction rates, also serve as an alternative, but they do not avoid the memory access bottleneck. Another type of architecture – dataflow architecture – uses the availability of data to trigger instruction fetches rather than the availability of instructions to trigger data fetches. When applied to general-purpose computing problems, pure dataflow machines proved tougher to program than more conventional architectures, so they never found their way into mainstream computing systems. Dataflow architectures are more common, however, in data stream processing such as graphics and signal processing.

One additional architecture is that of the Packet Instruction Set Computer (PISC). With RISC-based architectures, performance analysis often requires profiling the application on real or simulated hardware in order to estimate performance metrics, such as data rate and packet rate, using statistical methods. In contrast, PISC architecture renders profiling unnecessary, as it makes deterministic guarantees for data and packet rates.

The chief differentiators among the four categories we have discussed – fixed-function ASICs, control flow processors such as RISC, dataflow architecture, and PISC – are their processing elements and topologies.

The processing elements and topologies

One of the key design decisions that ultimately affects a network processor’s programming complexity is the topology of the processing elements.

To enable network processors to reach the required number of instructions per packet unit, designers often distribute processing over multiple concurrently executing processing elements. After considering the application’s structure and the constraints imposed by tools or processor architecture, designers organize processing elements into a suitable topology.

A network processor can have processing elements organized in two common topologies: parallel processing elements (or a pool) and pipelines of processing elements. Hybrids such as parallel pipelines or pipelines of parallel processing elements are also possible. In parallel topologies, the processing element assigned to a packet might execute the entire program for this packet, and, at a given moment, different processing elements typically operate on different packets. In pipelined topologies, designers partition the application into shorter blocks of instructions and multiple processing elements process a single packet in a predefined order. While each topology has its strengths and weaknesses, both tend to limit the extent to which they can scale – forcing the application of hybrid topologies for high-performance network processors.
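The difference between the two basic topologies can be illustrated with a small Python sketch. It is a sequential simulation under stated assumptions, not a model of any real network processor: the three-step program (parse, classify, forward), the packet fields, and all function names are hypothetical and chosen only to show which processing element touches which packet.

```python
from typing import Callable, Dict, List

Packet = Dict[str, object]
Step = Callable[[Packet], Packet]

# Hypothetical three-step data plane program.
def parse(pkt: Packet) -> Packet:
    pkt["dst"] = pkt["raw"][0]          # first byte stands in for a destination
    return pkt

def classify(pkt: Packet) -> Packet:
    pkt["cls"] = "local" if pkt["dst"] < 128 else "remote"
    return pkt

def forward(pkt: Packet) -> Packet:
    pkt["port"] = pkt["dst"] % 4        # toy forwarding decision
    return pkt

PROGRAM: List[Step] = [parse, classify, forward]

def run_pool(packets: List[Packet], num_elements: int) -> List[Packet]:
    """Parallel (pool) topology: one element runs the entire program for a packet."""
    results = []
    for i, pkt in enumerate(packets):
        element = i % num_elements      # simple round-robin assignment (assumed)
        out = dict(pkt)
        for step in PROGRAM:
            out = step(out)
        out["handled_by"] = [element]
        results.append(out)
    return results

def run_pipeline(packets: List[Packet]) -> List[Packet]:
    """Pipeline topology: element k runs only block k; packets visit elements in order."""
    results = []
    for pkt in packets:
        out = dict(pkt)
        handled = []
        for element, step in enumerate(PROGRAM):
            out = step(out)
            handled.append(element)
        out["handled_by"] = handled
        results.append(out)
    return results
```

Both functions compute the same forwarding result. What differs is the "handled_by" record: a single element per packet in the pool, and every element in fixed order in the pipeline.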

Traditionally, architects have turned to derivatives of general-purpose control flow architectures for their processing elements and organized them into arrays that behave in a dataflow manner, meaning they use the arrival of packets to cause instructions to be executed. While this can work for low bandwidths, hardware inefficiency and complexity become an issue when scaling to 10 Gbps and above. At these bandwidths, high power dissipation and complex programming models can make cost-effective and timely system-level design difficult.

However, a synchronous dataflow architecture for the individual processing elements greatly simplifies the process of organizing them into arrays that behave in a dataflow manner. The simplicity of an approach like this translates into greater hardware efficiency, which translates into smaller die size and lower power dissipation. This simplicity lends itself to a programming model that virtualizes the array of processing elements in a manner that makes them look like a single, synchronous dataflow processor with no data dependencies. This reduces the programming complexities typically associated with NPUs that have large numbers of processors, while maintaining the efficiency and control required for real-time, data-driven applications.

Tuned to the workload

Synchronous dataflow architecture is the first architecture highly tuned to match the workload presented by real-time data path applications such as wirespeed packet forwarding. More the offspring of the data-centric architectures employed in forwarding ASICs than the control-centric architectures employed in general-purpose processors, this architecture uses the arrival of data to cause instructions to be fetched rather than using the arrival of instructions to cause data to be fetched.

The key to high performance in applications that are more data intensive than instruction intensive is to optimize for the flow of data through the entire chip rather than to optimize for the flow of instructions through a single processing element.

To operate on as much data as possible in parallel while running these application programs, the size of the processing cores and the complexity of the interconnect need to be minimized. Certain network processors are based on a simple dataflow processor with a single set of registers that can execute up to four instructions in parallel. This matches the moderate degree of instruction-level parallelism inherent in data plane application programs. Application-specific hardware is implemented as co-processing engines rather than instruction set extensions so as not to increase processor size. A pure linear pipeline organization of these processors eliminates the need for a complex power-hungry bus or crossbar interconnect between processors. The combination of these attributes makes it possible to fit a much larger number of processors (512+) on a single chip, taking advantage of the high degree of data-level parallelism inherent in data plane application programs.

Executing a single VLIW-like instruction, which specifies up to four operations per processor, leverages the low instruction count of data plane application programs to allow high-instruction bandwidth and eliminate instruction stream stalls without duplicating any instruction storage. Also eliminated is the need for branch delay slots and complex branch prediction logic. The entire branch instruction is executed in a single stage of the pipeline (on a dedicated processor), and the resulting instruction pointer is passed to the next processor along with the data.
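The idea of passing the instruction pointer down the pipeline together with the data can be sketched in Python. This is a minimal illustration under assumptions, not the actual instruction set or hardware behavior of any product: the Instruction type, the register names (ttl, dst, port, drop), and the two-stage program are hypothetical. Each stage holds only its own instructions, executes exactly one of them per packet, and hands the registers and the next instruction pointer to the following stage.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Registers = Dict[str, int]
# One operation: (destination register, function of the incoming registers).
Op = Tuple[str, Callable[[Registers], int]]

@dataclass(frozen=True)
class Instruction:
    ops: Tuple[Op, ...]                      # up to four operations, VLIW-style
    next_ip: Callable[[Registers], int]      # branch decision for the next stage

def execute(instr: Instruction, regs: Registers) -> Tuple[Registers, int]:
    """Run one instruction: all operations read the incoming registers."""
    if len(instr.ops) > 4:
        raise ValueError("at most four operations per instruction")
    updates = {dest: fn(regs) for dest, fn in instr.ops}
    # The branch target is also computed from the incoming registers (assumed).
    return {**regs, **updates}, instr.next_ip(regs)

def run_dataflow(stages: List[Dict[int, Instruction]],
                 regs: Registers, ip: int = 0) -> Registers:
    """Registers and instruction pointer travel together from stage to stage."""
    for imem in stages:                      # imem: this stage's instruction memory
        regs, ip = execute(imem[ip], regs)
    return regs

# Hypothetical program: decrement TTL, then either pick a port or mark a drop.
STAGES: List[Dict[int, Instruction]] = [
    {0: Instruction(ops=(("ttl", lambda r: r["ttl"] - 1),),
                    next_ip=lambda r: 1 if r["ttl"] <= 1 else 0)},
    {0: Instruction(ops=(("port", lambda r: r["dst"] % 4),),
                    next_ip=lambda r: 0),
     1: Instruction(ops=(("drop", lambda r: 1),),
                    next_ip=lambda r: 0)},
]
```

Calling run_dataflow(STAGES, {"ttl": 5, "dst": 10}) takes the forwarding path in the second stage, while a packet arriving with ttl equal to 1 is steered to the drop instruction. In both cases every packet consumes exactly one instruction per stage, which is what makes the per-packet cost fixed in this model.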

Passing the data through the processor removes the need for a large centralized data store and associated internal interconnect and external pin count, and at the same time eliminates the need for load/store instructions. No large separate register set is required because the data flows directly through the registers. This significantly shrinks die area, power dissipation, and pin count. It also dramatically reduces the instruction count and does away with the load delay slots and data stream stalls associated with data-intensive applications.

Using dedicated I/O processors (Engine Access Points) to offload I/O processing from the processing elements eliminates the need for an operating system and its associated high instruction count. It also isolates the processing elements from very high latency I/O operations, so multiple copies of the register set to support multithreading become unnecessary.

Having no shared resources within the processing element results in deterministic, single-cycle instruction execution. There is no need to leave headroom in the instruction budget to guarantee wirespeed operation due to varying levels of resource contention.
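A short calculation shows what a fixed instruction budget looks like. The sketch below is illustrative arithmetic only. It assumes Ethernet framing (64-byte minimum frame plus 20 bytes of preamble and inter-frame gap on the wire) and a linear pipeline in which each packet receives exactly one instruction of up to four operations per stage. The function names are hypothetical, and the 512-stage figure is taken from the processor count mentioned earlier.

```python
def worst_case_packet_rate(line_rate_bps: float,
                           min_frame_bytes: int = 64,
                           overhead_bytes: int = 20) -> float:
    """Packets per second when the link carries only minimum-size frames."""
    bits_on_wire = (min_frame_bytes + overhead_bytes) * 8
    return line_rate_bps / bits_on_wire

def per_packet_budget(num_stages: int, ops_per_instruction: int = 4):
    """Fixed budget per packet in a linear pipeline: (instructions, max operations)."""
    return num_stages, num_stages * ops_per_instruction

# 10 Gbps of minimum-size Ethernet frames: about 14.88 million packets per second.
rate_10g = worst_case_packet_rate(10e9)

# 512 stages, up to four operations each: 512 instructions, at most 2048 operations.
budget = per_packet_budget(512)
```

Under these assumptions the budget per packet depends only on the pipeline length, not on traffic mix or contention, so the worst-case packet rate can be checked against the design without profiling.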

Matching the architecture to the application can also simplify the process of writing efficient data plane programs. Much of the programmer-visible complexity associated with the various methods of scaling RISC processor performance – multiprocessing, multithreading, instruction pipeline scheduling, multiple instruction sets, and nondeterministic instruction execution times – can be eliminated. This makes it possible to write efficient programs using a combination of assembly language and GUI-based application wizards, and thereby forgo the inefficiencies of code generated from high-level languages designed for developing large instruction count, control-centric application programs.


By matching architecture to application, synchronous dataflow architecture is the first programmable architecture efficient enough to achieve port densities comparable to ASICs and ASSPs. This should open up a new alternative for designers who are working on high-performance, high-density data plane applications and who want a more flexible solution than what is offered by today's ASICs and ASSPs.

Using a synchronous dataflow architecture such as this enables additional packet processing – thus meeting service providers’ and network equipment vendors’ need for increased bandwidth and higher performance.

Thomas Eklund is Vice President of Business Development and Marketing, Xelerated.