Brocade VCS fabric has almost-perfect load balancing

Short summary for the differently attentive: the proprietary load balancing Brocade uses over ISL trunks in its VCS fabric is almost perfect (and way better for high-throughput sessions than what you get with other link aggregation methods).

During the Data Center Fabrics Packet Pushers Podcast we’ve been discussing load balancing across aggregated inter-switch links and Brocade’s claims that its “chip-based balancing” performs better than standard link aggregation group (LAG) load balancing. Ever skeptical, I said all LAG load balancing is chip-based (every vendor does high-speed switching in hardware). I also added that I would be mightily impressed if they’d actually solved intra-flow packet scheduling.

A few days ago Greg (@etherealmind) Ferro, the Packet Pushers host, received a nice e-mail from Michael Schipp containing two slides from a Brocade presentation and an “Ivan owes a WOW” PS. I do owe a huge WOW ... but it takes a bit more than just a few slides to impress me (after all, Brook Reams published most of the information contained on those slides a while ago). However, Brook got in touch with me a few days after the podcast was published and provided enough in-depth information to make me a believer (thank you, Brook!).

The first thing Brocade did right (and it should have been standardized and implemented in all switches a long time ago) is automatic trunk discovery: whenever two VDX switches are connected with parallel physical links, those links are automatically trunked. To make use of the advanced load-balancing methods, they also have to be in the same port group (connected to the same chipset), which does reduce the resilience, but if that’s a concern, you can always have two (or more) parallel trunks; TRILL will provide per-MAC-address load balancing across the trunks.

Within each port group, Brocade’s hardware is able to perform per-packet round-robin frame scheduling with guaranteed in-order delivery. It does seem like magic, and it’s not documented anywhere (another painful difference between Brocade and Cisco – Cisco would be more than happy to flaunt its technology wonders), but Brook told me the magic sauce is hidden somewhere within Brocade’s patents and was also kind enough to point me to the most relevant patent.

Based on what’s in that patent (after stripping away all the “we might also be patenting the spread of high-pressure water flows over coffee beans in an espresso machine” stuff), it seems that Brocade’s hardware measures link delay and inter-link skew and combines the two to schedule frame transmissions in a way that guarantees the frames will always be received in order by the remote switch. They don’t do receiver-side reordering (which is hard), but transmit-side delaying. A very clever solution deserving a huge WOOOOW.
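
To make the idea more concrete, here’s a rough Python sketch of what transmit-side delaying could look like: round-robin link selection plus just enough added transmit delay to make sure a frame can never arrive before its predecessor. This is only my reading of the principle described in the patent, not Brocade’s actual algorithm; the class and the microsecond figures are made up.

```python
class SkewCompensatedTrunk:
    """Toy model of skew-compensated round-robin frame scheduling."""

    def __init__(self, link_delays_us):
        self.link_delays = link_delays_us   # measured one-way delay per member link (us)
        self.next_link = 0                  # round-robin pointer
        self.last_arrival = 0.0             # arrival time of the last scheduled frame (us)

    def schedule(self, now_us, serialization_us):
        """Pick the next link round-robin and delay transmission just enough so
        the frame cannot arrive before the previously scheduled one."""
        link = self.next_link
        self.next_link = (self.next_link + 1) % len(self.link_delays)

        earliest_arrival = now_us + self.link_delays[link]
        tx_time = now_us + max(0.0, self.last_arrival - earliest_arrival)
        self.last_arrival = tx_time + self.link_delays[link] + serialization_us
        return link, tx_time


# Two member links with 0.5 microseconds of skew between them (made-up numbers)
trunk = SkewCompensatedTrunk([2.0, 2.5])
for seq in range(4):
    link, tx = trunk.schedule(now_us=seq * 0.1, serialization_us=0.12)
    print(f"frame {seq}: link {link}, transmit at t={tx:.2f} us")
```

Running the sketch shows frames on the faster link being held back slightly so they never overtake frames already in flight on the slower link.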

You might wonder how Brocade, a company with a historical focus on Fibre Channel, managed to solve one of the tough LAN networking problems almost a decade ago. As you probably know, the networking industry has been in just-good-enough-to-sell mode for decades. The link aggregation load balancing problem was always way below the pain threshold, as a high-speed LAG (port channel) trunk usually carries many flows; doing per-flow (or even per-IP-address) load balancing across a LAG is most often good enough. Storage networking is different: a server servicing hundreds or thousands of users (with at least as many LAN sessions) has only a few storage sessions. Perfect load balancing was thus always a critical component of a SAN network ... and it just happens to be a great solution in LAN environments using iSCSI or NFS storage connectivity.
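
A quick toy model (my own sketch, not any vendor’s hash function) shows why: hash-based per-flow distribution across a 4-link LAG evens out nicely with thousands of small LAN flows, but a handful of fat storage sessions can leave one link congested and another almost idle.

```python
import random
from collections import Counter
from zlib import crc32

def member_link(flow, num_links):
    # Deterministic hash of the flow identifier picks one LAG member link
    return crc32(repr(flow).encode()) % num_links

def per_link_load(flows, num_links):
    load = Counter()
    for flow, gbps in flows:
        load[member_link(flow, num_links)] += gbps
    return [round(load[i], 1) for i in range(num_links)]

random.seed(42)

# Thousands of small LAN sessions: per-flow hashing spreads them out nicely
lan_flows = [((f"10.0.{i % 250}.{i % 200}", random.randint(1024, 65535), 80), 0.01)
             for i in range(5000)]
print("many small flows:", per_link_load(lan_flows, 4))

# A handful of fat iSCSI sessions: some links congest while others stay idle
san_flows = [((f"initiator-{i}", 3260), 8.0) for i in range(4)]
print("few fat flows:   ", per_link_load(san_flows, 4))
```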

More information

To learn more about storage networking protocols, including Fibre Channel, iSCSI and NFS, and emerging Data Center technologies including TRILL, Multi-chassis Link Aggregation, FabricPath, FCoE and others, watch my Data Center 3.0 for Networking Engineers webinar (buy a recording or yearly subscription).

22 comments:

  1. Just tell me that this doesn't share any pedigree with the MRP link balancing code that has been causing no end of problems for the London Internet Exchange over the past 6 months.

    The other vendors have all been focused on constantly cutting latencies. Good to see Brocade recognising that in some cases artificially increasing latency to match effective circuit lengths can improve overall performance.

    It would be interesting to know whether there is a maximum on the amount of link latency difference that this can cope with. I have seen a production network where a link with 4ms latency was paired with one with 15ms. I am guessing this platform might have trouble balancing in this situation...
  2. If I got the fundamentals sorted out correctly, MRP comes from Foundry, whereas this bag of tricks should have come from Brocade's SAN.

    I would only use it on short (intra-DC) links and it probably works only over physical links with microsecond-level skew.

    BTW, are you telling me the 4ms/15ms links were both P2P physical links (or lambdas) using different fiber runs?
  3. Why do you ask if they were physical layer? Does this tech use physical framing that would not work across a longhaul network carrying standard Ethernet frames transparently?

    This specific ("special") example was several years ago and I am working from memory, but I think those latency figures are in the right ballpark. The circuits were fully transparent Ethernet, but not raw L1; I believe they were both EoSDH. When it was ordered we thought it was primary/backup and the distance difference wouldn't be an issue - it wasn't until after it went in that we realised the customer was using LACP across both links.
  4. I __guess__ it might work over anything transparent enough (EoSDH for example), but I think I found some delay-related limitations somewhere. Not sure ... maybe Brook will add something.
  5. Ivan,

    I have some information, in good confidence, that what you describe is what Brocade does on their FC switches. On the VDX, I was told that they do the load balancing a bit differently to achieve perfect load balancing. Perhaps they have learnt a few tricks from their FC SAN experience to improve things for the LAN folks. You may want to circle back with Brocade on this.
  6. It would be great to see a comparison of in-order per-packet load-sharing to "classic" per-flow load-sharing to begin with. In a decent-sized FC network, with tens of servers and storage devices, there are good statistical odds that all port-channel member links will be utilized close to the optimum using per-flow load-sharing. Disparity, say on a scale of 10%, is not a serious issue as long as every link's utilization does not exceed 50% (this is when queueing delays become severely noticeable). It all boils down to the size of the network and the flow matrix, but apparently the "degenerate" case with only a few servers and storage devices is not commonly seen in modern networks.

    This is why inverse-multiplexing solutions have been effective all along. As soon as the number of endpoints is above some threshold, there is no significant advantage to be gained from clever per-packet load-sharing. Per-packet solutions increase complexity and add marketing buzz, but seem to have little real use in decent-scale networks.

    Of course, you may get say 25%, 25%, 25%, 25% on a 4-link port channel as opposed to 20%, 30%, 15%, 35%, but does it really matter if you are only using a fraction of the bundle capacity? One may say: well, a 35%-utilized link adds more latency, but so does the Brocade solution - and the delay is not predictable either. The solution that Brocade uses might be seen as "inverse reassembly", where the sending side needs to buffer packets to equalize arrival timing. As opposed to receive-side reassembly buffers we now have "shaping" buffers that ensure in-order packet delivery. Complexity did not vanish; it just got pushed around.
  7. I think the real difference here is that flow-based LACP can only go as fast as a single link (if you have four 1 Gbps links, then a flow cannot exceed 1 Gbps).

    On a per-packet basis you can get the full 4 Gbps - of course, as long as both sides can push and receive that amount - as in the case of the VDX (using 10 Gbps links) with a maximum of 8 ports per ISL (an 80 Gbps TRILL path).

    Now, this is used for switch-to-switch traffic; a server connecting to two or more VDX switches still uses LACP, so we are back to flow-based load balancing from the server.

    Of course, Brocade could put the ISL feature in their CNAs; however, that would mean your server could only connect to one switch. Not a good idea for HA.

    You could also ask: what about putting two dual-port CNAs in a server? Then you would have two 20 Gbps TRILL paths to two switches in the fabric - however, four ports per server is like going back to 2 FC ports and 2 NICs.

    Just my thoughts,
    Michael.
  8. As Michael pointed out in his reply, you'll notice the difference primarily when you have a single high-bandwidth flow that would benefit from being able to use multiple links (file transfer, backup ...).

    As soon as you have enough flows, the packet distribution method doesn't matter as long as it's random enough.
  9. Michael Schipp
    MRP (versions 1 and 2) is indeed from Foundry.
    ASIC/s on the VDX is 6th gen ASIC from the FC side. So maybe the new 16 Gbps FC products will get the update for ISLs too (I would guess so).

    Now, please remember this is for ISLs (Inter-Switch Links) in a single data center (at least at this point); therefore I would suggest these are P2P physical-layer links only.

    Also, the current maximum number of VDX switches that can form a supported fabric is 12. However, if you need a larger fabric, the solution has to be validated by Brocade (read: there is no hard limit, just a supported limit - 12 has been tested and approved). This is up from 10 units in the first release.

    Hope this adds value.

    Michael.
  10. Quote "You might wonder how Brocade, a company with historical focus on Fiber Channel, managed to solve one of the tough LAN networking problems almost a decade ago."

    Brocade didn't, Foundry Networks did; Foundry was only recently acquired by Brocade. Foundry has been quietly providing enterprise-quality Ethernet networking equipment for years. We have used their equipment since 2002 and can attest to their technical achievements.

    FD
  11. Michael Schipp wrote "ASIC/s on the VDX is 6th gen ASIC from the FC side" (see above). Brook Reams pointed me to Brocade patents when I was asking him about the packet distribution algorithms.

    Maybe it's time you guys get your stories straight.
  12. Ivan, Michael,

    For the "fat single flow" example. Normally, endpoints connect at physical line rate that is the same or below that of the "uplink" port. Therefore, a typical *single* flow cannot completely overwhelm ISL link. Packet-level balancing, therefore, would be most efficient if implemented on ever inter-connection (host-switch, switch-switch, etc) to effectively increase single endpoint's transmission rate.

    Link aggregation (inverse multiplexing) has always been used in over-subscription scenarios where N downstream ports send traffic to M upstream ports and N>M (compare this to a circuit-switched network, where over-subscription is not possible). This is how imuxing works in packet networks anyway. It's just different levels of granularity (packets, flows, etc.) that you can use in packet networks, with deeper granularity required to optimize for sparse source/receiver topologies.

    One interesting inherent problem with packet networks is that they are always designed contrary to one of their original ideas, which was "maximizing link utilization". PSNs are bound to be "flow oriented" due to upper level requirements and have to be over-provisioned to support QoS needs. One might think upper levels should have been designed to perform packet reordering in the network endpoints, but that never happened due to the fact that most ULPs have been "adapted to" and not "designed for" PSNs.
  13. Sorry FoundryDude but you are mistaken. Brocade has had frame-level trunking in our Fibre Channel products since the launch of our first 2 Gbps FC switch in 2000! We are porting this technology now to the Ethernet space, just like we are porting many other fabric-related technologies into our Ethernet Fabric technology, VCS.

    No other vendor in the entire industry (in either Ethernet or Fibre Channel networks) has or has ever had this type of technology.
  14. Michael, as I pointed out before, Brocade has had frame-based ISL trunking since our 2 Gbps FC products in 2000, so it's a given in our next generation 16 Gbps products.

    This is mainly meant for intra-datacenter ISLs between adjacent switches. Obviously, spraying frames across multiple links means you need to be very careful about in-order delivery, so there are some "limitations". The ASIC controls the timing of the frames within port groups, so all ports belonging to the same frame-based trunk have to reside in the same port group. Initially we had 4-port groups and could trunk 4 x 2 Gbps into a single 8 Gbps link. Today we support 8 x 8 Gbps links in a single 64 Gbps trunk with frame-level load balancing in Fibre Channel, and, as you know, 8 x 10 Gbps in our VDX 6720 switches. BTW, we've had frame-based trunking in Ethernet since we launched our Brocade 8000 top-of-rack FCoE switch, but there it's limited to 40 Gbps (4 x 10 Gbps).

    Another "limitation" is that the difference in cable lengths can't be too big, and that is the main reason this is *mostly* for intra-datacenter connections. But we do support frame-based trunking over long distance (at least in FC) up to hundreds of kms as long as you can guarantee the minimum cable length difference between the links (if all go over the same lambda and physical path, for example). We've also had this for years and you can see it clearly documented in all of our product manuals.

    The benefits are very clear. If you trunk 8 x 10 Gbps links, you are *guaranteed* to be able to use those 80 Gbps of bandwidth, and you won't run into scenarios where one link is congested while you have spare bandwidth on another one, as can happen with LAG (see http://packetattack.org/2010/11/27/the-scaling-limitations-of-etherchannel-or-why-11-does-not-equal-2/) and even in FC with other approaches (like exchange-based load balancing).
  15. Plapukhov, what about when you have three 6 Gbps flows sharing two 10 Gbps ISLs? One flow will get one ISL, leaving 4 Gbps of spare bandwidth there. The other two flows will share the other 10 Gbps ISL and will be limited to 5 Gbps each.

    With frame-based trunking, you are guaranteed to have enough bandwidth for those flows as long as the aggregate bandwidth of the flows is lower than the aggregate bandwidth of the ISLs, and in this case 3 x 6 Gbps = 18 Gbps < 20 Gbps, so you wouldn't congest any of your flows.
  16. I guess what Petr is trying to say is: if a single flow is limited to a single port's speed anyway, the maximum benefit you can get over per-flow balancing is ~17% (and your 2-ISL/3-flow example is more or less the best case, with a 16.6% benefit), and it drops fast with more or fewer flows.

    So the question is whether the speed increase for that particular number of flows is actually worth the extra complexity.
  17. @Guest - That's the beauty of it. While the technology may seem (and be) complex in its hardware implementation details, for the end user it couldn't be simpler. Just connect the ports, the trunks form automatically, and you have [almost] perfect load balancing and fault tolerance. My example was just a small one to make the point. The bottom line is that just because a single flow can't exceed the capacity of a single ISL doesn't mean there isn't a benefit. Who doesn't want the extra bandwidth that would otherwise go to waste?

    I think the Ethernet world has been OK with wasted bandwidth for far too long, considering how we've been living with STP for this long...
  18. Have to chime in on the STP part. We've been willing to live with STP because we knew that there's a place for bridging and a place for routing (where you get to use all the bandwidth) ... and the bridging domains were usually small.

    Now that the hardware vendors have focused their persuasive powers on server admins who don't understand that long-distance bridging is bad, we have to deal with the fact that STP has been broken for the last few decades.
  19. Fair point. STP has clearly served its purpose and it's been a very valuable technology. But as virtualization demands larger L2 domains it's time for fabric-based technologies to take over... :)
  20. Ivan and All the Folks Who Commented,

    Thanks to everyone for providing interesting comments, observations and follow-up questions to this post. I decided to put together more content on the subject of Brocade ISL Trunks and just added it to the Brocade community site on VCS Technology. You will find it here:

    http://community.brocade.com/community/brocadeblogs/vcs/blog/2011/04/06/brocade-isl-trunking-has-almost-perfect-load-balancing

    I think it provides more color on how we extended the original "Brocade Trunking" for Fibre Channel (sometimes referred to as "frame trunking" for obvious reasons) to create "Brocade ISL Trunking", which is included in a VCS Ethernet Fabric. I also provided some additional information at the end of my blog in response to some of the questions, comments and speculations several of you posted here.

    Ivan, as always, you provide sound informative content for the community.
  21. This load-balancing technique was developed by Foundry engineers before Brocade acquired them. One of the major customers that drove the development was AMS-IX, which was also a major reason for the existence of the 32-slot MLX/XMR chassis.
  22. I'm curious how this plays out today. We're an R&E network with a need for better load-balancing algorithms, and Cisco themselves are telling us the higher-throughput links simply use polynomials. Our testing of those polynomials has shown upwards of 15% loss at line rate across 4 x 10 Gbps, and higher at 3 x 100 Gbps.

    Replies
    1. The Brocade hardware is long gone. Most vendors use hash-based load balancing these days, and most of them have a nerd knob to turn on dynamic reshuffling. Obviously that works only on directly-connected egress links. Beyond that, Cisco ACI might be doing something, but most everyone else cannot as they don't have visibility into congestion beyond the egress interface.

      The right way to solve this challenge is to implement uncongested path finding at the source host. Something as simple as FlowBender or MP-TCP could do the trick.
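
      For example, here is a minimal sketch of the host-side approach on a recent Linux kernel (5.6 or later, with MPTCP enabled); the address, port and payload below are made up:

```python
import socket

IPPROTO_MPTCP = 262   # Linux protocol number for Multipath TCP

# Ask the kernel for an MPTCP socket instead of plain TCP; if MPTCP is
# available, the connection can spread its subflows across multiple paths.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM, IPPROTO_MPTCP)
sock.connect(("192.0.2.10", 5001))   # made-up documentation address/port
sock.sendall(b"bulk transfer that may now use more than one path")
sock.close()
```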
