Enabling multihost system architectures with PCI Express switches: Innovation in design through multiroot partitionable switch architecture

Matt explains design innovations that have taken the latest generation of PCI Express (PCIe) beyond its traditional role as a chip-to-chip interconnect.

In the high-tech world, the overuse of the term "emergence" has devalued its impact and rendered it somewhat trite. Technologies and applications are deemed "emerging" with a relatively low bar in hopes that users will line up to be early adopters to garner the perceived advantage of being on the cutting edge.

Those who subscribe to a classical, less diluted definition of emergence might bring up "complex systems that arise from the multiplicity of more basic interactions with the resulting system or form", and explain that what's being discussed, though built upon or from previously simpler stages, also has to be "sufficiently disparate from its constituent elements to be worthy of a unique or new classification".

With a nod to the classical definition, a new approach to switching architecture has emerged to solve key system problems that previously prevented the adoption of PCI Express as the primary system interconnect in demanding AdvancedTCA and proprietary bladed communications systems, as well as in embedded and server/storage applications. Although PCIe has become the de facto standard for local and chip-to-chip interconnects in these applications, its reach as the primary system interconnect for systems with distributed intelligence has been minimal. Until now, it has run into limits when tasked with enabling multiroot system architectures and efficiently managing system resources.

Complex though the barriers are, the solution - a multiroot partitionable PCIe switch architecture - breaks the PCIe protocol and switching tasks down to their most basic elements and leverages the simplest transactions across multiple instances to arrive at a new PCIe switching solution.

PCIe system topology and switching overview

PCIe, a standard serial interconnect that has been widely adopted for its efficiency, scalability, power, and system cost advantages over competing standards, is built on the foundation of legacy PCI constructs to ensure compatibility with existing system software and firmware code bases. Although the bus-based topology of PCI and PCI-X has been replaced by point-to-point connectivity that uses packet switches for distribution, the resulting topology remains, at its base, a simple tree structure with a single root complex (in most cases, a CPU or processor complex), as shown in Figure 1. The root complex is responsible for system configuration and enumeration of PCIe resources, and manages interrupts and errors for the PCIe tree. To further support simplicity and legacy constructs, a root complex and its endpoints share a single address space and communicate through memory reads and writes and interrupts.

Figure 1

Internally, PCIe devices implement virtual PCI-to-PCI bridges and buses, logical structures that mimic the physical bridging and bus functions of legacy PCI-based systems and allow for support of the legacy protocols' interrupts and messages. The exploded view in Figure 1 shows the logical details of the virtual bridge and bus hierarchies in a PCIe implementation.
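The single-root tree and its virtual bridge hierarchy can be illustrated with a minimal sketch. It assumes a toy model in which the root port and each switch port are PCI-to-PCI bridges carrying the usual primary, secondary, and subordinate bus numbers, and it assigns those numbers depth-first, roughly as enumeration software would. The `Node` class, the `assign_buses` function, and the device names are hypothetical and are not taken from any particular firmware or operating system.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A bridge (root port or virtual PCI-to-PCI bridge) or an endpoint."""
    name: str
    is_bridge: bool = False
    children: List["Node"] = field(default_factory=list)
    primary: Optional[int] = None      # bus the device itself sits on
    secondary: Optional[int] = None    # bus directly below a bridge
    subordinate: Optional[int] = None  # highest bus number below a bridge


def assign_buses(node: Node, bus: int, last_used: int) -> int:
    """Number buses depth-first; return the highest bus number used so far."""
    node.primary = bus
    if not node.is_bridge:
        return last_used               # endpoints do not create new buses
    node.secondary = last_used + 1     # each bridge opens one new bus
    last_used = node.secondary
    for child in node.children:
        last_used = assign_buses(child, node.secondary, last_used)
    node.subordinate = last_used       # range of buses reachable below
    return last_used


# A root port feeding a two-port switch: one upstream virtual bridge,
# one virtual bus, and two downstream virtual bridges (hypothetical devices).
nic = Node("nic")
ssd = Node("ssd")
down0 = Node("down0", is_bridge=True, children=[nic])
down1 = Node("down1", is_bridge=True, children=[ssd])
upstream = Node("upstream", is_bridge=True, children=[down0, down1])
root_port = Node("root_port", is_bridge=True, children=[upstream])

assign_buses(root_port, bus=0, last_used=0)
for n in (root_port, upstream, down0, down1, nic, ssd):
    print(n.name, n.primary, n.secondary, n.subordinate)
```

In this sketch the switch's internal virtual bus is simply the secondary bus of the upstream bridge, which is where both downstream bridges sit.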

Replicating logical PCI-based structures within PCIe devices eased the hardware and software migration between standards but left the new standard with limited extensibility, particularly to multiroot applications. The protocol remained focused on efficient interconnect for single root constructs - with a one-to-one relationship between a root and an endpoint.

Challenges for PCIe as a multiroot system interconnect

To enable PCIe as the primary system interconnect for multiroot systems, the challenge is to work within or extend the standard specification so that system architects can leverage the efficiency, scalability, and low-power advantages of PCIe and capitalize on its rich, rapidly growing, and commoditizing ecosystem of processors and peripherals. This challenge is well studied, and the solutions proposed to date include working within the standard via Non-Transparent Bridging (NTB) constructs and extending the standard via Multi-Root I/O Virtualization (MR-IOV). Both approaches have merit, but they fundamentally fall short because they do not utilize the current ecosystem and/or require proprietary software development.

The solution to providing multiroot support and enabling efficient system resource utilization and sharing within the PCIe specification may be much "simpler" than previously offered solutions. Revisiting the classical definition of emergence, a new approach suggests that the problems can be addressed within the current standard and ecosystem by focusing on the most basic PCIe elements, keeping the transactions simple, and rendering small switching solutions that, through multiplicity, combine to solve larger system switching problems.

Partitionable multiroot switch architecture

Leveraging the constituent logical elements of a PCIe switch - the virtual PCI-to-PCI bridges and virtual PCI bus - the multiroot partitionable switch architecture creates multiple logical switches or switch partitions within a single switching device by using physical controls (Figure 2). Each of the resulting switch partitions is logically discrete and adheres to the PCI Express Base 2.0 specification. Each independent partition represents a PCIe hierarchy whose configuration, switching operation, and reset logic are isolated from other partitions.

Figure 2

Replication of the control and management structures associated with the virtual PCIe bus makes it possible to support multiple separate virtual PCI buses. This approach allows multiple distinct root complexes to coexist on a PCIe-compliant switch. The ability to freely associate virtual PCI-to-PCI bridges to any of the established virtual buses adds more architectural flexibility. Association can take place either statically at the time of a fundamental reset to the switch or dynamically while the switch is operating.

The multiroot partitionable switch architecture thus allows system designers to partition an n-port physical switch into as many as n partitions (for example, 16 partitions for a 16-port switch), to assign any switch port to any of the partitions, and to change the system configuration and port assignments dynamically during switch operation. Further, within a given partition, any port can be assigned as the upstream or root port, and that root port can also be moved during switch operation.
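The port-to-partition bookkeeping described above can be sketched as a small data model. This is only an illustration under stated assumptions: the `PartitionedSwitch` class and its methods are hypothetical names, partitions are identified by integers, and the sketch tracks assignments only, without modeling any real device's registers or configuration interface.

```python
from typing import Dict, List


class PartitionedSwitch:
    """Toy model of an n-port switch split into up to n logical partitions."""

    def __init__(self, num_ports: int) -> None:
        self.num_ports = num_ports
        self.partition_of: Dict[int, int] = {}  # port -> partition id
        self.upstream_of: Dict[int, int] = {}   # partition id -> upstream port

    def assign(self, port: int, partition: int) -> None:
        """Attach a port to a partition, detaching it from any previous one."""
        if not 0 <= port < self.num_ports:
            raise ValueError(f"no such port: {port}")
        if not 0 <= partition < self.num_ports:
            raise ValueError(f"no such partition: {partition}")
        old = self.partition_of.get(port)
        if old == partition:
            return  # already there, nothing to change
        if old is not None and self.upstream_of.get(old) == port:
            del self.upstream_of[old]  # the old partition loses its upstream
        self.partition_of[port] = partition

    def set_upstream(self, partition: int, port: int) -> None:
        """Make a port the upstream (root-facing) port of its own partition."""
        if self.partition_of.get(port) != partition:
            raise ValueError(f"port {port} is not in partition {partition}")
        self.upstream_of[partition] = port

    def members(self, partition: int) -> List[int]:
        """Return the ports currently assigned to a partition, sorted."""
        return sorted(p for p, part in self.partition_of.items()
                      if part == partition)


# Static setup: two partitions on a 16-port switch.
sw = PartitionedSwitch(16)
for port in range(0, 4):
    sw.assign(port, 0)
for port in range(4, 8):
    sw.assign(port, 1)
sw.set_upstream(0, 0)
sw.set_upstream(1, 4)

# Dynamic change: move port 3 into partition 1, then move the upstream port.
sw.assign(3, 1)
sw.set_upstream(1, 3)
print(sw.members(0), sw.members(1), sw.upstream_of)
```

Keeping the upstream port as a per-partition attribute, rather than a property of the port, mirrors the idea that any port in a partition can take the upstream role.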

Despite the independence of the partitions, the shared control logic for the partitions within the switch remains a global resource that can be controlled via the physical switch's System Management Bus (SMBus) or in band from any of the roots attached to any of the logical partitions. This extends the flexibility of the architecture by allowing the dynamic reallocation of resources to be initiated by roots within or outside a given partition.

The switch architecture supports multiroot system architectures, enables advanced applications of the functionality that increase system configurability, and optimizes system resource utilization, availability, and security.

Increased system configurability and hardware reuse

The most direct application of the architecture to aid system configurability is to replace multiple discrete physical PCIe switches with a single partitioned switch. Such a replacement shrinks the total cost of ownership by reducing power consumption, decreasing board space requirements, and lowering system interconnect costs. Additionally, the unified switch complex reduces system development costs through hardware reuse paradigms that enable a single hardware platform to serve many end markets and price/performance points.

In one paradigm, a multiroot system with a fixed set of computing resources leverages the partitionable switch architecture to enable a wide array of systems through the ability to efficiently map the compute resources to a varying number and an assortment of peripheral devices or cards. An example of this flexible slot and I/O provisioning is offered in Figure 3 with a contrasting discrete switch implementation.

Figure 3

A second paradigm uses the partitionable switch to add compute resources to a fixed set of I/O and peripherals, boosting overall system performance by reducing the ratio of I/O peripherals to computing resources. For example, as Figure 4 shows, a multicard application can be provisioned in the field with an additional compute card to improve performance to the eight feature or line cards.

Figure 4

Optimized resource utilization and QoS management - dynamic resource allocation

In addition to the flexibility in static configurations noted earlier, the multiroot partitionable architecture allows dynamic reconfiguration during switch operation. This capability allows compute and I/O or peripheral elements to be nimbly managed for optimal utilization and sharing of system resources.

In addition to driving down system costs by maximizing resource utilization in the form of optimal compute and I/O resource matching, the switch directly enables management of system-level Quality of Service (QoS) and guarantees of Service-Level Agreements (SLAs) for key traffic or users. Figure 5 provides a basic example of a system reconfiguration event where global system resources have been redistributed and the leftmost root has offloaded supported peripherals to ensure maximum bandwidth for high-priority traffic or users on a single root.

Figure 5
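A reconfiguration event of this kind can be sketched as a simple rebalancing step. The sketch assumes one possible policy, in which every endpoint offloaded from the priority root goes to whichever other root currently owns the fewest endpoints; the article does not prescribe a policy, and the function name, root names, and endpoint names are hypothetical.

```python
from typing import Dict, List, Set


def isolate_priority_root(assignments: Dict[str, List[str]],
                          priority_root: str,
                          keep: Set[str]) -> Dict[str, List[str]]:
    """Return a new root -> endpoints map in which the priority root keeps
    only the endpoints in `keep`; the rest move to the other roots."""
    result = {root: list(eps) for root, eps in assignments.items()}
    others = [root for root in result if root != priority_root]
    if not others:
        raise ValueError("no other root available to take the offload")
    for endpoint in assignments[priority_root]:
        if endpoint in keep:
            continue
        # Assumed policy: fewest endpoints wins, ties broken by root name.
        target = min(others, key=lambda r: (len(result[r]), r))
        result[priority_root].remove(endpoint)
        result[target].append(endpoint)
    return result


before = {
    "root0": ["ep0", "ep1", "ep2"],
    "root1": ["ep3"],
    "root2": ["ep4"],
}
after = isolate_priority_root(before, "root0", keep={"ep0"})
print(after)
```

The function returns a new mapping rather than mutating its input, so a management agent could compare the two states before committing the change to the switch.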

The dynamic allocation of switch resources that the architecture supports also adds depth to the two static configuration paradigms discussed previously: system peripherals or cards added to or removed from the system through supported hot plug functionality can be redistributed dynamically without taking unaffected system elements offline.

Increased resource availability - advanced failover support

An important extension of the dynamic resource allocation is its application to increase system resource availability and reliability through advanced failover support.

With the multiroot partitionable architecture, resources associated with a failing root can be dynamically reassigned to operational roots, as shown in Figure 6. The architecture allows any number of the remaining functional roots to take control of the isolated system resources to reestablish service with minimal interruption and data loss. The architectural flexibility lets system designers choose among failover strategies that range from the simplest reassignment to a predetermined root to more elegant implementations that assign resources based on the current state of the switch and the active roots' current loading.

Figure 6
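The two ends of that range of failover strategies can be contrasted in a short sketch. It assumes ownership is tracked as a port-to-root map, that load is a single integer per root, and that each reassigned port adds a fixed, illustrative cost to the root that takes it over. The function names, the load figures, and the `port_cost` parameter are hypothetical.

```python
from typing import Dict


def failover_predetermined(owners: Dict[str, str], failed: str,
                           backup: str) -> Dict[str, str]:
    """Simplest strategy: hand every port of the failed root to one backup."""
    return {port: (backup if root == failed else root)
            for port, root in owners.items()}


def failover_by_load(owners: Dict[str, str], failed: str,
                     load: Dict[str, int],
                     port_cost: int = 30) -> Dict[str, str]:
    """Load-aware strategy: give each orphaned port to the surviving root
    with the lowest current load, updating that load as ports move."""
    survivors = {root: cost for root, cost in load.items() if root != failed}
    if not survivors:
        raise ValueError("no surviving root to take over")
    new_owners = dict(owners)
    for port in sorted(owners):
        if owners[port] != failed:
            continue
        target = min(survivors, key=lambda r: (survivors[r], r))
        new_owners[port] = target
        survivors[target] += port_cost  # assumed fixed cost per port
    return new_owners


owners = {"p0": "A", "p1": "A", "p2": "B", "p3": "C"}
load = {"A": 40, "B": 50, "C": 30}  # illustrative load figures per root

print(failover_predetermined(owners, failed="A", backup="B"))
print(failover_by_load(owners, failed="A", load=load))
```

With these illustrative numbers, the predetermined strategy places both orphaned ports on root B, while the load-aware strategy splits them between C and B.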

The advanced support for failover in the multiroot partitioning architecture offers significantly greater resource availability and system reliability than previously offered in the PCIe switching ecosystem. Previous approaches to this problem offered relief for dual-host failover and most often did so via proprietary NTB-based algorithms. Although effective for two-root systems, this paradigm becomes overly cumbersome in multiroot systems and increases the burden on system design time and cost because of the software development required to enable the proprietary NTB functionality.

In the case of failover and the underlying dynamic resource allocation capabilities, the multiroot partitionable architecture utilizes standard, off-the-shelf PCIe-based processing and peripheral devices, and leverages existing system firmware and software.


By breaking down the PCIe switch and system switching tasks to the most basic of elements and the simplest of transactions with multiple instances, IDT PCIe switches featuring the multiroot partitionable architecture solve complex system design problems that previously prohibited the adoption of PCIe as the primary interconnect standard in multiroot systems. The new architecture provides multiroot support and enables key system constructs to optimize system resources for maximized utilization, load balancing, and QoS management. Perhaps most importantly, the architecture leverages the rich and growing PCIe ecosystem with no special hardware modifications or software upgrades, keeping system costs low and development cycles short to realize true multiroot PCIe-based systems.

By whatever definition we choose - the current or the classical - the new switch architecture earns the tag of emergence. No shortage of system vendors has adopted the technology, and all will seek to gain early mover advantages by applying the dynamic architecture - an architecture that leverages basic PCIe constructs in multiplicity to form a new class of solutions for system application needs.

Matt Jones is a Product Marketing Manager in the Enterprise Computing Division at IDT. During his 13 years in the semiconductor industry, he has held marketing positions in the IDT Internetworking Products Division, microprocessor products group, and corporate marketing organization. Matt holds a Bachelor of Science in Electrical Engineering and a Bachelor of Arts in Economics from Stanford University.