A few days ago Kurt Bales and Cooper Lees gave me access to a test QFabric environment. I always wanted to know what was really going on behind the QFabric curtain and the moment Kurt mentioned he was able to see some of those details, I was totally hooked.
Short summary: QFabric works exactly as I’d predicted three months before the user-facing documentation became publicly available (the behind-the-scenes view described in this blog post is probably still hard to find).
This post is by no means a critique of QFabric. If anything, I’m delighted there’s still a networking vendor that can create innovative solutions without unicorn tears, relying instead on field-tested technologies ... which might, among other things, make the solution more stable.
It looks like a giant switch
When you log into the QFabric management IP address (VIP), it looks exactly like a giant switch – single configuration, single set of interfaces, show commands etc. All the familiar Junos configuration components are there: system group, interfaces, VLANs and protocols. The only really new component is the fabric object with node-group definitions (more on QFabric node groups).
However, every giant switch needs troubleshooting, which usually requires access to individual components; in QFabric's case, it's the request component login command that unveils the really interesting world behind the curtain.
ip@test> request component login ?
Possible completions:
<node-name>    Inventory name for the remote node
DRE-0          Diagnostic routing engine
IC-Left/RE0    Interconnect device control board
IC-Left/RE1    Interconnect device control board
IC-Right/RE0   Interconnect device control board
IC-Right/RE1   Interconnect device control board
FC-0           Fabric control
FC-1           Fabric control
FM-0           Fabric manager
NW-NG-0        Node group
R2-19-Node0    Node device
R2-19-Node1    Node device
R2-7-Node4     Node device
R2-7-Node5     Node device
R3-12-Node6    Node device
R3-12-Node7    Node device
R3-19-Node2    Node device
R3-19-Node3    Node device
RSNG01         Node group
RSNG02         Node group
The names of physical entities (QF/Nodes, QF/Interconnects) could be either their serial numbers (default) or user-configurable names (recommended).
As you can see, you can log in to individual physical devices, node groups, and virtual components like fabric controls and fabric manager. These virtual components run on QF/Directors – CentOS boxes running KVM (you can log into the QF/Director Linux shell and see the virtual machines with ps -elf).
Each QF/Director is running a number of common services, including database (MySQL), DHCP, FTP, NTP, SSH, GFS, DLM (distributed lock manager), NFS and Syslog servers:
ip@QFabric> show fabric administration inventory director-group status
Director Group Status Sat Aug 25 09:52:08 PDT 2012
Member Status Role     Mgmt Address    CPU Free Memory VMs Up Time
------ ------ -------- --------------- --- ----------- --- -------------
dg0    online master   xxxxxxxxxxxx    10% 17642780k   4   3 days, 16:23 hrs
dg1    online backup   xxxxxxxxxxxx    6%  20509268k   3   3 days, 16:13 hrs
Member Device Id/Alias  Status  Role
------ ---------------- ------- ---------
dg0    xxxxxxxxxxxxxxxx online  master
Master Services
---------------
Database Server online
Load Balancer Director online
QFabric Partition Address online
Director Group Managed Services
-------------------------------
Shared File System online
Network File System online
Virtual Machine Server online
Load Balancer/DHCP online
Hard Drive Status
----------------
Volume ID:4 optimal
Physical ID:1 online
Physical ID:0 online
SCSI ID:1 100%
SCSI ID:0 100%
Size Used Avail Used% Mounted on
---- ---- ----- ----- ----------
423G 6.3G 395G  2%    /
99M  20M  75M   21%   /boot
93G  2.0G 91G   3%    /pbdata
Director Group Processes
------------------------
Director Group Manager online
Partition Manager online
Software Mirroring online
Shared File System master online
Secure Shell Process online
Network File System online
DHCP Server master online master
FTP Server online
Syslog online
Distributed Management online
SNMP Trap Forwarder online
SNMP Process online
Platform Management online
[... rest deleted ...]
Lo and behold – it’s actually running BGP internally
After logging into one of the fabric control virtual machines, you can execute the show bgp summary fabric command, which clearly indicates the control-plane protocol behind the scenes is multi-protocol BGP running numerous address families. Each fabric control VM runs BGP with all server or network node groups (not individual QF/Nodes) and with all QF/Interconnects.
qfabric-admin@FC-0> show bgp summary fabric | no-more
Groups: 2 Peers: 6 Down peers: 0
Unconfigured peers: 5
Table Tot Paths Act Paths Suppressed History Damp State Pending
bgp.l3vpn.0
42 18 0 0 0 0
Peer AS InPkt OutPkt OutQ Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped...
128.0.128.4 100 10517 10602 0 0 3d 6:43:58 Establ
bgp.l3vpn.0: 17/17/17/0
bgp.rtarget.0: 28/31/31/0
bgp.fabricvpn.0: 28/28/28/0
bgp.bridgevpn.0: 8/8/8/0
default.inet.0: 17/17/17/0
default.fabric.0: 19/19/19/0
128.0.128.8 100 10594 10593 0 0 3d 6:44:06 Establ
bgp.l3vpn.0: 0/18/18/0
bgp.rtarget.0: 1/32/32/0
bgp.fabricvpn.0: 0/103/103/0
bgp.bridgevpn.0: 0/9/9/0
default.inet.0: 0/18/18/0
default.fabric.0: 0/91/91/0
128.0.130.4 100 10466 10552 0 0 3d 6:35:42 Establ
bgp.rtarget.0: 0/4/4/0
bgp.fabricvpn.0: 34/34/34/0
bgp.bridgevpn.0: 0/0/0/0
default.fabric.0: 34/34/34/0
128.0.130.10 100 9751 9636 0 0 3d 1:04:34 Establ
bgp.rtarget.0: 0/4/4/0
bgp.fabricvpn.0: 34/34/34/0
bgp.bridgevpn.0: 0/0/0/0
default.fabric.0: 34/34/34/0
128.0.130.24 100 10432 10547 0 0 3d 6:18:09 Establ
bgp.l3vpn.0: 1/7/7/0
bgp.rtarget.0: 0/7/7/0
bgp.fabricvpn.0: 7/7/7/0
bgp.bridgevpn.0: 1/1/1/0
default.inet.0: 1/7/7/0
default.fabric.0: 4/4/4/0
128.0.130.26 100 10410 10545 0 0 3d 6:19:11 Establ
bgp.l3vpn.0: 0/0/0/0
bgp.rtarget.0: 0/4/4/0
bgp.fabricvpn.0: 0/0/0/0
bgp.bridgevpn.0: 0/0/0/0
Any other node (for example, a QF/Interconnect) has two BGP sessions, one with each fabric control VM:
qfabric-admin@IC-Left> show bgp summary fabric
Groups: 1 Peers: 2 Down peers: 0
Peer AS InPkt OutPkt OutQ Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped...
128.0.128.6 100 9663 9775 0 0 3d 1:16:27 Establ
bgp.rtarget.0: 28/32/32/0
bgp.fabricvpn.0: 61/61/61/0
bgp.bridgevpn.0: 0/0/0/0
default.fabric.0: 61/61/61/0
128.0.128.8 100 9667 9773 0 0 3d 1:16:23 Establ
bgp.rtarget.0: 0/32/32/0
bgp.fabricvpn.0: 0/61/61/0
bgp.bridgevpn.0: 0/0/0/0
default.fabric.0: 0/61/61/0
Edge nodes use six MP-BGP address families (including default.inet.0 and default.fabric.0); QF/Interconnects have just four.
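If you want to check the address-family count per peer without squinting at the printout, a few lines of Python will do. This is only a sketch that works on the text of the show bgp summary fabric output as pasted in this post; the function name and the regular expressions are my own assumptions, not anything QFabric or Junos provides.

```python
import re
from collections import OrderedDict

# Assumption: a peer line starts with an IPv4 address followed by the AS number.
PEER_RE = re.compile(r"^(\d+\.\d+\.\d+\.\d+)\s+\d+\s")
# Assumption: a table line looks like "name: active/received/accepted/damped".
TABLE_RE = re.compile(r"^\s*([\w.]+):\s+(\d+)/(\d+)/(\d+)/(\d+)\s*$")


def families_per_peer(text):
    """Map each BGP peer to the routing tables (address families) listed under it."""
    peers = OrderedDict()
    current = None
    for line in text.splitlines():
        match = PEER_RE.match(line.strip())
        if match:
            current = match.group(1)
            peers[current] = {}
            continue
        match = TABLE_RE.match(line)
        if match and current is not None:
            counters = tuple(int(x) for x in match.groups()[1:])
            peers[current][match.group(1)] = counters
    return peers


# Two peers copied from the FC-0 printout above: an edge node and an interconnect.
SAMPLE = """\
128.0.128.4 100 10517 10602 0 0 3d 6:43:58 Establ
bgp.l3vpn.0: 17/17/17/0
bgp.rtarget.0: 28/31/31/0
bgp.fabricvpn.0: 28/28/28/0
bgp.bridgevpn.0: 8/8/8/0
default.inet.0: 17/17/17/0
default.fabric.0: 19/19/19/0
128.0.130.4 100 10466 10552 0 0 3d 6:35:42 Establ
bgp.rtarget.0: 0/4/4/0
bgp.fabricvpn.0: 34/34/34/0
bgp.bridgevpn.0: 0/0/0/0
default.fabric.0: 34/34/34/0
"""

peers = families_per_peer(SAMPLE)
for peer, tables in peers.items():
    print(peer, len(tables), sorted(tables))
```

The counters are kept as tuples so you could also compare active versus received routes per table if you wanted to see which fabric control VM is winning the best-path selection.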
The fabric control VMs act as BGP route reflectors (exactly as I predicted). You can easily verify that by inspecting any individual BGP entry on one of the node groups – you’ll see the Originator and Cluster List BGP attributes:
65534:1:192.168.13.37/32 (2 entries, 1 announced)
*BGP Preference: 170/-101
Route Distinguisher: 65534:1
Next hop type: Indirect
Address: 0x964f49c
Next-hop reference count: 6
Source: 128.0.128.6
Next hop type: Router, Next hop index: 131070
Next hop: 128.0.130.24 via dcfabric.0, selected
Label operation: PFE Id 7 Port Id 55
Label TTL action: PFE Id 7 Port Id 55
Session Id: 0x0
Next hop: 128.0.130.24 via dcfabric.0
Label operation: PFE Id 8 Port Id 55
Label TTL action: PFE Id 8 Port Id 55
Session Id: 0x0
Protocol next hop: 128.0.130.24:49160(NE_PORT)
Layer 3 Fabric Label 5
Composite next hop: 964f440 1738 INH Session ID: 0x0
Indirect next hop: 92c8d00 131072 INH Session ID: 0x0
State: <Active Int Ext>
Local AS: 100 Peer AS: 100
Age: 3d 6:54:40 Metric2: 0
Validation State: unverified
Task: BGP_100.128.0.128.6+33035
Announcement bits (1): 0-Resolve tree 1
AS path: I (Originator) Cluster list: 0.0.0.1
AS path: Originator ID: 128.0.130.24
Communities: target:65534:117440513(L3:1)
Import Accepted
Timestamp: 0x116
Route flags: arp
Route type: Host
Route protocol : arp
L2domain : 5
SNPA count: 1, SNPA length: 8
SNPA Type: Network Element Port SNPA
NE Port ID: 49160
Localpref: 100
Router ID: 128.0.128.6
Secondary Tables: default.inet.0
Composite next hops: 1
Protocol next hop: 128.0.130.24:49160(NE_PORT)
Layer 3 Fabric Label 5
Composite next hop: 964f440 1738 INH Session ID: 0x0
Indirect next hop: 92c8d00 131072 INH Session ID: 0x0
Indirect path forwarding next hops: 2
Next hop type: Router
Next hop: 128.0.130.24 via dcfabric.0
Session Id: 0x0
Next hop: 128.0.130.24 via dcfabric.0
Session Id: 0x0
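The two attributes that give the route reflector away (ORIGINATOR_ID and CLUSTER_LIST, both defined in the BGP route reflection RFC) are easy to pull out of such a printout. The following Python sketch assumes the text layout shown above; the helper name is hypothetical and the regular expressions are guesses at the format, not a Junos API.

```python
import re


def reflection_attributes(route_text):
    """Extract the route-reflection attributes from a detailed route printout."""
    originator = re.search(r"Originator ID:\s*(\S+)", route_text)
    cluster = re.search(r"Cluster list:\s*(.+)", route_text)
    source = re.search(r"Source:\s*(\S+)", route_text)
    return {
        "source": source.group(1) if source else None,
        "originator_id": originator.group(1) if originator else None,
        "cluster_list": cluster.group(1).split() if cluster else [],
    }


# Lines copied from the route entry above.
ROUTE = """\
Source: 128.0.128.6
AS path: I (Originator) Cluster list: 0.0.0.1
AS path: Originator ID: 128.0.130.24
"""

attrs = reflection_attributes(ROUTE)
# A reflected route carries an originator that differs from the BGP peer it came from.
reflected = (
    attrs["originator_id"] is not None
    and bool(attrs["cluster_list"])
    and attrs["originator_id"] != attrs["source"]
)
print(attrs, reflected)
```

Here the route was learned from 128.0.128.6 (the fabric control VM) but originated by 128.0.130.24, which is exactly what you'd expect from a reflected route.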
Addressing
QFabric control plane uses locally-administered MAC addresses and IP address block 128.0.0.0/16. You can see all the MAC and IP addresses with the show arp command executed on any of the internal components. The bme interfaces are the control-plane interfaces, the vlan interface is a user-facing SVI interface.
qfabric-admin@NW-NG-0> show arp
MAC Address       Address    Name       Interface Flags
00:13:dc:ff:72:01 10.73.2.9  10.73.2.9  vlan.501  none
02:00:00:00:40:01 128.0.0.1  128.0.0.1  bme0.2    permanent
02:00:00:00:40:02 128.0.0.2  128.0.0.2  bme0.2    permanent
02:00:00:00:40:05 128.0.0.4  128.0.0.4  bme0.0    permanent
02:00:00:00:40:05 128.0.0.5  128.0.0.5  bme0.1    permanent
02:00:00:00:40:05 128.0.0.5  128.0.0.5  bme0.2    permanent
02:00:00:00:40:05 128.0.0.6  128.0.0.6  bme0.0    permanent
02:00:00:00:40:07 128.0.0.7  128.0.0.7  bme0.1    permanent
02:00:00:00:40:07 128.0.0.7  128.0.0.7  bme0.2    permanent
02:00:00:00:40:08 128.0.0.8  128.0.0.8  bme0.1    permanent
02:00:00:00:40:08 128.0.0.8  128.0.0.8  bme0.2    permanent
02:00:00:00:40:09 128.0.0.9  128.0.0.9  bme0.1    permanent
02:00:00:00:40:09 128.0.0.9  128.0.0.9  bme0.2    permanent
[... rest deleted ...]
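Both properties are easy to verify. A MAC address is locally administered when the 0x02 bit of its first octet is set, and the standard Python ipaddress module can test membership in 128.0.0.0/16. The sketch below uses two entries from the ARP table above; the function names are mine.

```python
import ipaddress

# The control-plane address block mentioned in the text.
CONTROL_BLOCK = ipaddress.ip_network("128.0.0.0/16")


def is_locally_administered(mac):
    """True when the locally-administered (U/L) bit of the first octet is set."""
    first_octet = int(mac.split(":")[0], 16)
    return bool(first_octet & 0x02)


def is_control_plane(ip):
    """True when the address falls into the internal 128.0.0.0/16 block."""
    return ipaddress.ip_address(ip) in CONTROL_BLOCK


# (MAC, IP, interface) tuples copied from the show arp printout.
ENTRIES = [
    ("00:13:dc:ff:72:01", "10.73.2.9", "vlan.501"),
    ("02:00:00:00:40:01", "128.0.0.1", "bme0.2"),
]

results = [
    (iface, is_locally_administered(mac), is_control_plane(ip))
    for mac, ip, iface in ENTRIES
]
for row in results:
    print(row)
```

The BGP peer addresses seen earlier (128.0.128.x and 128.0.130.x) fall into the same /16.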
Look Ma! There are the labels!
In my blog post I predicted QFabric uses MPLS internally. Without a 40Gbps sniffer it's impossible to tell whether an MPLS label stack is the exact encapsulation format QFabric is using, but it sure looks like MPLS from the outside.
The dcfabric interface uses mpls as one of the protocols:
qfabric-admin@RSNG01> show interfaces dcfabric.0
Logical interface dcfabric.0 (Index 64) (SNMP ifIndex 1214251262)
Flags: SNMP-Traps Encapsulation: ENET2
Input packets : 0
Output packets: 0
Protocol inet, MTU: 1558
Flags: Is-Primary
Protocol mpls, MTU: 1546, Maximum labels: 3
Flags: Is-Primary
Protocol eth-switch, MTU: 0
Flags: Is-Primary
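A quick sanity check on the MTU values in this printout: a standard MPLS label stack entry is 4 bytes, and the difference between the inet and mpls MTUs is exactly what three such entries would need. That's consistent with an MPLS-style shim header, but it's arithmetic, not proof of the on-wire format.

```python
# One MPLS label stack entry is 32 bits, i.e. 4 bytes.
MPLS_LABEL_BYTES = 4

# Values copied from the show interfaces dcfabric.0 printout.
inet_mtu = 1558
mpls_mtu = 1546
max_labels = 3

# Room reserved for the label stack on the dcfabric interface.
overhead = inet_mtu - mpls_mtu
print(overhead, max_labels * MPLS_LABEL_BYTES)
```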
You can also see MPLS-like labels in numerous BGP entries, for example in the bridgevpn address family ...
65534:1:5.c8:e2:c3:01:78:8f/144
*[BGP/170] 1w3d 15:28:00, localpref 100
AS path: I, validation-state: unverified
to 128.0.128.4 via dcfabric.0, Push 1730, Push 1, Push 55(top)
> to 128.0.128.4 via dcfabric.0, Push 1730, Push 2, Push 55(top)
[BGP/170] 1w3d 15:28:00, localpref 100, from 128.0.128.8
AS path: I, validation-state: unverified
to 128.0.128.4 via dcfabric.0, Push 1730, Push 1, Push 55(top)
> to 128.0.128.4 via dcfabric.0, Push 1730, Push 2, Push 55(top)
The same set of three labels appears in a host route pointing to a host connected to another QF/Node:
65534:1:10.73.2.9/32
*[BGP/170] 3d 12:32:09, localpref 100
AS path: I, validation-state: unverified
> to 128.0.128.4 via dcfabric.0, Push 5, Push 1, Push 23(top)
[BGP/170] 3d 12:32:09, localpref 100, from 128.0.128.8
AS path: I, validation-state: unverified
> to 128.0.128.4 via dcfabric.0, Push 5, Push 1, Push 23(top)
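To compare the label stacks across routes, you can extract them from the next-hop lines. The sketch below assumes the usual Junos convention that the label marked (top) is the outermost one; the function name is hypothetical, and I'm not attaching any meaning to the individual labels.

```python
import re

# Matches "Push <label>" with an optional "(top)" marker.
PUSH_RE = re.compile(r"Push (\d+)(?:\(top\))?")


def label_stack(next_hop_line):
    """Return the pushed labels, outermost (top) label first."""
    labels = [int(m.group(1)) for m in PUSH_RE.finditer(next_hop_line)]
    # The printout lists the top label last, so reverse the order.
    return list(reversed(labels))


# Next-hop lines copied from the bridgevpn and host-route entries above.
mac_route_a = label_stack("to 128.0.128.4 via dcfabric.0, Push 1730, Push 1, Push 55(top)")
mac_route_b = label_stack("> to 128.0.128.4 via dcfabric.0, Push 1730, Push 2, Push 55(top)")
host_route = label_stack("> to 128.0.128.4 via dcfabric.0, Push 5, Push 1, Push 23(top)")

print(mac_route_a, mac_route_b, host_route)
```

Note that the two next hops of the MAC route differ only in the middle label.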
IP prefixes directly connected to the QFabric have just one label – probably a pointer to an IP forwarding table entry.
65534:1:10.73.2.0/29
*[BGP/170] 3d 12:31:59, localpref 101, from 128.0.128.4
AS path: I, validation-state: unverified
> to 128.0.128.4:129(NE_PORT), Layer 3 Fabric Label 5
[BGP/170] 3d 12:31:59, localpref 101, from 128.0.128.8
AS path: I, validation-state: unverified
> to 128.0.128.4:129(NE_PORT), Layer 3 Fabric Label 5
On the other hand, the MPLS routing and forwarding tables are empty, indicating that this is very probably not the MPLS we’re used to.
Summary
Behind the scenes, QFabric runs like any well-designed service provider network: a cluster of central servers provides common services (including DHCP, NFS, FTP, NTP and Syslog); BGP is used in the control plane to distribute customer prefixes (IP addresses, host/ARP routes, MAC addresses); and an MPLS-like encapsulation that can attach a label stack to an L2 frame or L3 datagram is used in the forwarding plane.
The true magic of QFabric is the CLI VM, which presents the internal IP+MPLS-like network as a single switch without any OpenFlow or SDN magic. Wouldn’t it be nice to have something similar in the service provider networks?
2012-12-17: Comments are temporarily disabled, as a moron selling acne-reducing snake oil found this blog post interesting. Contact me using the 'Contact' link at the top of the page.

Hi Ivan,
Maybe this paper from Juniper will be of interest.
http://www.juniper.net/us/en/local/pdf/whitepapers/2000443-en.pdf
Thank you! Excellent one ;)
As usual, excellent work Ivan!
Very nice! Excellent post and information as always.
Beautiful...
That is very, very, nice. I look forward to more reports regarding your experiences with it.
Ivan,
Are you perhaps being a little kind to yourself here?
As an example....
Ref this central comment in your original post.
'They would likely keep the individual components in the QFabric pretty autonomous and use distributed processing while using QF/Director as the central management/configuration device (similar to UCS manager in Cisco UCS).'
Ref the role of the Director from Juniper.
'To draw parallels with a traditional chassis-based switch, the QFabric Director is equivalent to the supervisor module and routing engine.'
I am sure you would agree the credit here should be going to the smart guys and girls at Juniper who created the architecture, design and code to make this happen. Like most embedded coders, they don't get to make a big noise on the internet about their rather clever work.
Very timely post. I had in my "homework" list to figure out just what was going on under the hood after you pointed out to me that QFabric is a distributed (not centralized) control-plane. You did most of my homework for me, although I need to re-read this post a few times to digest it completely. As usual. ;-) Thank you.
From Juniper
The QFabric architecture subscribes to the “centralize what you can, distribute what you must” philosophy by implementing a distributed control plane in order to build scale-out networks.
Network node group Routing Engine: NNG routing engine performs the routing engine functionality on the NNG QFabric Nodes as described earlier. It runs as an active/backup pair of VMs across the physically disjointed compute nodes in the QFabric Director cluster.
QFabric Director compute clusters are composed of two compute nodes, each running identical software stacks.
(Although the current implementation supports a maximum of two compute nodes per cluster, the architecture can theoretically support multiple servers.) Two of the compute nodes have a disk subsystem directly attached to them; the remaining compute nodes are diskless and therefore essentially stateless. Each disk subsystem consists of two 2 TB disks arranged in a RAID1 mirror configuration, and the contents across these subsystems are synchronously block replicated for redundancy.
Can't you set up mirroring to check forwarding plane headers? As QFabric hardware is a COTS 'switch' ASIC, instead of a fully programmable NPU like Trio, it seems likely that the forwarding plane wouldn't even support custom headers.
Unbelievably awesome Ivan! The comment about having this in SP struck home.... I would seriously consider selling my soul to Scratch for the capability QFabric has to manage a distributed network. In my 2 x 10 ^ 4 node work network
Every time I look at how you must do things in J vs how you must do them in C, or every time TCAM issues kill a switch, or so-called router in the case of 7600 and n7k, it makes me want to get a hedge trimmer to cut the cables and a screwdriver, and launch a world tour of our edge peering points and data centers, to rip out every C device from peering router, to L2 agg switch to leaf switch to spine switch to access router, and toss them out in the street. I'd really like to see coverage of MPLS VPN on J MX series, and on EX and QFX series devices. Also... Fw filter policy tricks that just are not possible in Cisco ACLs.
http://www.ietf.org/id/draft-ietf-l2vpn-evpn-01.txt
What about http://www.heise.de/netze/rfc/rfcs/rfc5735.shtml#page-10
->128.0/16 to be allocated
The 128.0.128.0 address space is used internally by the fabric control protocol, which is based on BGP; it is not part of any external reachability information.
You can see 128.0 addresses referenced when using the "show route fabric" command.
But you will not see any 128.0 when using "show route" unless/until it is used on the Internet.
Nice post Ivan. I had heard about the use of BGP & MPLS in QFabric. This post confirms it with more details.
What's your view on SPB finding its way into DC fabrics by some of the vendors? It also has roots in the service provider world, with goals of achieving scale, ease of provisioning and O&M.
Good stuff Ivan !
Hi Ivan,
I was wondering what would happen if you were to remove the MAC info from the equation. Instead of mapping L2 to L3 via a distributed ARP table... why not just remove all L2 from the equation and perform pure L3 based forwarding? Terminate ARP at the leafs and you have optimal L2/L3 any node to any node...
That would be ideal, but we both know that we have to support all sorts of crazy non-IP protocols (ex: FCoE :D ) and IP-based abominations that refuse to die (ex: Microsoft NLB).
As much as I'd like L3 forwarding everywhere, when reality hits you, you have to implement a mix of L2 and L3.