Integrating flow processors to scale cybersecurity to 40 Gbps and beyond, part 1

Functions such as optical transmission, Layer 2 switching, and Layer 3 routing can operate comfortably at 40 Gbps with current technology and at reasonable cost points. Other applications, though, struggle to operate at 10 Gbps today, let alone scale to 40 Gbps and beyond. These are typically applications that require complex processing, and they share a common characteristic: they convey more Network Intelligence through more exhaustive examination of packet content, metadata extraction, and stateful management of millions of simultaneous flows. High-touch processing in cybersecurity and network analytics is difficult to scale to extreme throughputs and demands the processing innovations that Daniel addresses in part one of this two-part series. Part two will explore implementing these innovations in an AdvancedTCA architecture for cybersecurity and other applications.

To protect their critical resources, networks already deploy an array of security applications such as next-generation and traditional firewalls, Intrusion Detection/Prevention Systems (IDSs/IPSs), Distributed Denial of Service (DDoS) mitigation, Data Loss Prevention (DLP), and network analytics/forensics solutions. These security solutions work almost entirely by using Deep Packet Inspection (DPI) and flow analysis to look for known patterns in network flows and then block or record the matching traffic.

However, with the need for application awareness, security processing, and DPI, the amount of processing power required for these compute-intensive applications grows exponentially with increasing line rates. In addition, the dynamic nature of cyber threats necessitates that these applications continually adapt to keep up with the evolving threat landscape. For designers of critical security platforms, this implies the need to architect products based on general-purpose computing platforms that can quickly and easily leverage application software updates to defend against new attacks.

Conversely, general-purpose processing architectures struggle to maintain pace with increasing bandwidth and discrete traffic flows. General-purpose multicore CPUs are ideal for application, control plane, and signaling workloads, but become a networking bottleneck in high-performance designs requiring a very high packet touch rate and large number of instructions per packet over an ever-increasing number of flows. This creates an interesting dichotomy in which the need for Network Intelligence dictates a flexible architecture that cannot readily keep up with the performance needs of the network. As bandwidths continue to rise, all of the varied aspects of communications networks, including optical transmission, networking, security, and application processing, need to scale in parallel.

Enter a new processing paradigm – flow processing

No one would dispute the flexibility and ubiquity offered by General-Purpose CPUs (GPCPUs), but meeting the very high instruction rate and throughput demand across millions of flows required for sophisticated packet, flow, and security processing calls for a fundamental change in processing architecture. Rather than relying on a general-purpose processing engine combined with fixed network I/O that has limited intelligence or programmability, an architecture that employs multiple processor types, each custom designed for specific workloads, can scale to meet these stringent requirements. Adding a software-controlled processor optimized for networking, flow processing, and security processing provides the ability to offload compute-intensive tasks from the GPCPU.

Flow processors differ greatly from cache-based GPCPUs and network processors. Flow processors are massively parallel, completely programmable devices that scale to 200 Gbps of throughput. These devices contain fully programmable, workload-specific cores for Layer 2 through Layer 7 flow processing and Layer 2 and Layer 3 packet processing, delivering hundreds of billions of instructions per second that can be applied to incoming traffic. Flow processors also contain programmable hardware accelerators for computationally intense networking tasks including DPI, regular expression matching, optimized I/O packet transfer, traffic management, security processing, and bulk cryptography. They are used in numerous applications, but are particularly relevant in cybersecurity and analytics applications, Software-Defined Networking (SDN) such as OpenFlow, stateful or application-aware load balancing, SSL Inspection, and Lawful Intercept (LI).

This heterogeneous architecture uses multiple proven techniques to reduce CPU utilization on the host processor and increase overall system performance. These methods focus on placing each workload on the right processor, and have been shown to increase throughput by up to 10 times compared with a “one-processor-fits-all” methodology.

Dynamic load balancing

A flow is defined as a unidirectional sequence of packets that share a set of common packet header values. Another important way to make good use of the valuable resources of the GPCPU, therefore, is to ensure that data is appropriately structured and placed into the right CPU core to maintain flow affinity. Programming a flow processor to assign particular data flows to designated cores improves performance, because flows that are always pinned to the same core are likely to see higher hit rates in the GPCPU on-chip cache memories (Figure 1). This flow-based load balancing is achieved by analyzing a data stream and assigning flows to cores based on a defined set of packet header values. Common load balancing techniques include two-tuple, three-tuple, and five-tuple flow classification, though other implementations may require more than 30 header fields.
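The classification step described above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not code for any particular flow processor: the names `flow_key`, `core_for_flow`, `TWO_TUPLE`, and `FIVE_TUPLE`, the dictionary representation of packet headers, and the choice of CRC-32 as the hash are all hypothetical.

```python
import zlib

# Hypothetical field sets for two-tuple and five-tuple classification.
TWO_TUPLE = ("src_ip", "dst_ip")
FIVE_TUPLE = ("src_ip", "dst_ip", "src_port", "dst_port", "protocol")


def flow_key(headers: dict, fields: tuple) -> tuple:
    """Build a flow key from the selected packet header values."""
    return tuple(headers[name] for name in fields)


def core_for_flow(key: tuple, num_cores: int) -> int:
    """Map a flow key to a core index.

    The mapping is deterministic, so every packet of a flow lands on
    the same core (flow affinity). CRC-32 is an assumption made for
    this sketch; any stable hash would serve.
    """
    data = "|".join(str(value) for value in key).encode("utf-8")
    return zlib.crc32(data) % num_cores


# Two packets of the same unidirectional flow (illustrative values).
packet_a = {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2",
            "src_port": 40000, "dst_port": 443, "protocol": 6, "length": 64}
packet_b = dict(packet_a, length=1500)  # same headers, different payload size

core_a = core_for_flow(flow_key(packet_a, FIVE_TUPLE), num_cores=8)
core_b = core_for_flow(flow_key(packet_b, FIVE_TUPLE), num_cores=8)
```

Because the key is built only from the chosen header fields, `core_a` and `core_b` are equal, and widening the classification to more header fields only means passing a longer field tuple.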

Figure 1: Classification and load balancing between flow processors and GPCPUs.

This solution can also balance flows across the CPU cores in an intelligent fashion, rather than with the round-robin method used in traditional Network Interface Cards (NICs). Given a system threshold value, flow processors front-ending GPCPUs can monitor host CPU utilization and place new flows only onto the cores with the lowest utilization, thereby improving overall system efficiency.
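A minimal sketch of this utilization-aware placement follows. The `FlowBalancer` class, its method names, the 0.0 to 1.0 utilization scale, and the fallback used when every core exceeds the threshold are assumptions made for illustration; the article does not specify how a real flow processor implements these details.

```python
class FlowBalancer:
    """Toy model of threshold-based, utilization-aware flow placement."""

    def __init__(self, num_cores: int, threshold: float = 0.8):
        self.utilization = [0.0] * num_cores  # last reported load per core
        self.threshold = threshold            # assumed system threshold
        self.flow_table = {}                  # flow key -> pinned core

    def update_utilization(self, core: int, value: float) -> None:
        """Record the host-reported utilization of one core."""
        self.utilization[core] = value

    def assign(self, key: tuple) -> int:
        """Return the core for a flow, pinning new flows on first sight."""
        if key in self.flow_table:
            # Existing flows keep their core to preserve flow affinity.
            return self.flow_table[key]
        # Consider only cores below the threshold; if none qualify,
        # fall back to all cores (an assumption of this sketch).
        candidates = [core for core, load in enumerate(self.utilization)
                      if load < self.threshold]
        if not candidates:
            candidates = list(range(len(self.utilization)))
        core = min(candidates, key=lambda c: self.utilization[c])
        self.flow_table[key] = core
        return core


balancer = FlowBalancer(num_cores=3, threshold=0.8)
for core, load in enumerate([0.9, 0.2, 0.5]):
    balancer.update_utilization(core, load)

first = balancer.assign(("10.0.0.1", "10.0.0.2"))   # least-loaded core
balancer.update_utilization(1, 0.95)                # that core becomes busy
again = balancer.assign(("10.0.0.1", "10.0.0.2"))   # flow stays pinned
second = balancer.assign(("10.0.0.3", "10.0.0.4"))  # new flow avoids it
```

Keeping an explicit flow table is what lets the sketch combine both goals: new flows go to lightly loaded cores, while established flows never migrate and so retain their cache locality.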

Per-flow stateful management and action processing

In many instances, though, not all packets need to be processed by both a flow processor and the GPCPU. In LI, for example, only packets to or from specific individuals are of interest; the rest of the traffic is noise and can be ignored. In these cases, a technique known as flow-based cut-through can be implemented, in which designated packets are forwarded by the flow processor so that they bypass the GPCPU altogether (Figure 2). Tightly coupling the processing elements through API-based message passing allows applications to change the forwarding behavior of traffic on a per-flow basis. Flows can be dynamically load balanced to the host, sent across egress interfaces, or actively or passively dropped, all based on application policy. In addition, pre-classification of traffic based on common packet header fields allows filtering of traffic, further reducing the exchange of data between processors.
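The per-flow action model can be illustrated with a small table that the host application updates and the flow processor consults for each packet. The `Action` values, the `FlowActionTable` class, and its methods are hypothetical names for this sketch and do not represent the API of any specific product.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical per-flow forwarding actions."""
    TO_HOST = "to_host"          # deliver to the GPCPU for full inspection
    CUT_THROUGH = "cut_through"  # forward directly, bypassing the GPCPU
    DROP = "drop"                # discard the packet


class FlowActionTable:
    """Toy per-flow policy table written by the host application."""

    def __init__(self, default: Action = Action.TO_HOST):
        self.default = default
        self.actions = {}  # flow key -> Action

    def set_action(self, key: tuple, action: Action) -> None:
        """Stand-in for the API message the application would send."""
        self.actions[key] = action

    def dispatch(self, key: tuple) -> Action:
        """Look up the action for a packet's flow key."""
        return self.actions.get(key, self.default)


# Lawful Intercept style policy: cut through everything by default and
# send only flows involving a target address to the host.
table = FlowActionTable(default=Action.CUT_THROUGH)
table.set_action(("192.0.2.10", "198.51.100.7"), Action.TO_HOST)

target = table.dispatch(("192.0.2.10", "198.51.100.7"))
other = table.dispatch(("203.0.113.5", "198.51.100.7"))
```

Here `target` resolves to `Action.TO_HOST` and `other` to `Action.CUT_THROUGH`, so only the flow of interest consumes host CPU cycles.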

Figure 2: Transmission to the host processor with optional packet and flow-based cut-through.

A scaled network

The nature of our networks and the important data traversing them creates an opposing set of forces. Networks need to continue to scale to meet exponentially growing bandwidth demands, and enterprises, carriers, and cloud operators need the ability to effectively monitor these networks at all layers with stateful network intelligence. To meet these needs, a new, distributed, multi-chip, heterogeneous multicore architecture is required, providing distributed workload processing to effectively scale applications to 40 Gbps and beyond. This architecture allows applications access to very high packet touch rates and throughputs, enabling cybersecurity and network analytics applications to maintain pace with the other rapidly scaling aspects of network infrastructures. Part two will examine how this architecture can be leveraged in an AdvancedTCA ecosystem to scale Network Intelligence applications for next-generation throughputs.

Daniel Proch is Director of Product Management at Netronome, responsible for their line of flow processing reference hardware and flow management software. He has 15 years of experience in the networking, security, and telecommunications fields.
