Multi-Chassis Link Aggregation (MLAG) and Hot Potato Switching

There are two reasons to bundle parallel Ethernet links into a port channel (the official term is Link Aggregation Group):

  • Transforming parallel links into a single logical link bypasses Spanning Tree Protocol loop avoidance logic; all links belonging to the port channel can be active at the same time (see also: Multi-Chassis Link Aggregation basics).
  • Load sharing across parallel links in a port channel increases the total bandwidth available between adjacent L2 switches or between routers/hosts and switches.
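
To illustrate the load-sharing point, here is a minimal Python sketch of the per-flow hashing a switch might use to pick an egress member of a port channel. The port names and the CRC-based hash are made up for illustration; real switches hash on various combinations of MAC, IP and L4 headers in hardware, but the principle is the same.

    import zlib

    def pick_member(src_mac, dst_mac, members):
        """Pin a flow to one member link based on a header hash (illustrative only)."""
        key = f"{src_mac}-{dst_mac}".encode()
        return members[zlib.crc32(key) % len(members)]

    lag = ["eth1", "eth2"]   # two parallel links bundled into one port channel
    print(pick_member("00:aa:00:00:00:01", "00:bb:00:00:00:02", lag))
    print(pick_member("00:aa:00:00:00:03", "00:bb:00:00:00:02", lag))

Because the selection is deterministic per flow, a single flow never gets more than one link's worth of bandwidth; the aggregate only helps when there are enough distinct flows.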

Ethan Banks wrote an excellent explanation of traditional port channel caveats (proving that 1+1 sometimes does not equal 2). Things get way worse when you start using Multi-Chassis Link Aggregation, because all MLAG implementations I’m familiar with use hot potato switching: the switch tries to forward packets toward the destination MAC address as soon as possible.

What Is Hot Potato Switching?

Hot potato switching is a very simple principle (try to get rid of the packet as soon as possible) best explained with a few examples. Let’s start with a very simple architecture: a router connected to two switches (MLAG cluster) through a link aggregation group, and a bunch of servers with a single uplink per server (ESXi hosts usually look that way even when they have redundant uplinks).

A simple data center fabric

When the router needs to communicate with server A, it sends an ARP request. Let’s assume the ARP request goes over the R–S2 link and is then flooded across S2–S1 and S1–A links (as well as a few other links). At this moment all switches know where R is.

When A replies to R, the switches forward the packet based on their MAC address tables. Without modifying the MAC learning process, the return packet would go from A through S1 and S2 to reach R (and all subsequent traffic could flow over the same path) – clearly a suboptimal traffic flow that places unnecessary load on the inter-switch link.

Every decent MLAG implementation thus has to modify the MAC learning process: as soon as a new source MAC address is seen in a packet arriving through a port belonging to a port channel, the MAC reachability information is propagated to all other members of the “stack” and they use their local ports in the same port channel to forward the traffic toward the newly-discovered MAC address.
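
Here is a minimal Python sketch of that idea (a made-up model, not any vendor’s implementation): when a switch learns a MAC address on a local member of a shared port channel, it syncs the entry to its MLAG peer, and the peer installs the entry pointing at its own local member of the same port channel instead of at the inter-switch link.

    class MlagSwitch:
        """Toy model of one member of a two-switch MLAG cluster."""
        def __init__(self, name):
            self.name = name
            self.mac_table = {}       # MAC address -> local egress port
            self.peer = None          # the other member of the MLAG cluster
            self.port_channels = {}   # port-channel ID -> local member port

        def learn(self, mac, ingress_port, po_id=None):
            """Classic source-MAC learning, plus MLAG sync for port-channel ports."""
            self.mac_table[mac] = ingress_port
            if po_id is not None and self.peer:
                # Advertise the MAC keyed by port channel, not by physical port.
                self.peer.sync_from_peer(mac, po_id)

        def sync_from_peer(self, mac, po_id):
            if po_id in self.port_channels:
                # The peer has a local member of the same port channel:
                # point the entry at it instead of at the inter-switch link.
                self.mac_table[mac] = self.port_channels[po_id]
            else:
                self.mac_table[mac] = "peer-link"

    # R is attached through port channel Po1 with one member link on each switch.
    s1, s2 = MlagSwitch("S1"), MlagSwitch("S2")
    s1.peer, s2.peer = s2, s1
    s1.port_channels["Po1"] = "Eth1"   # S1's local link toward R
    s2.port_channels["Po1"] = "Eth1"   # S2's local link toward R

    s2.learn("MAC-R", ingress_port="Eth1", po_id="Po1")   # ARP request arrives on S2
    print(s1.mac_table["MAC-R"])       # 'Eth1' -- S1's own link to R, not the peer link

With plain flood-and-learn, S1 would have installed R’s MAC address behind the inter-switch link, producing the A–S1–S2–R detour described above.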

Architectures supporting more than two nodes in a MLAG group might have to run a routing protocol internally and forward the packets toward the closest port in the port channel. Juniper is (according to the presentation they gave us at Net Field Day) running IS-IS between members of their “distributed stack”. Alternative architectures that lobotomize all but one control plane are simpler: the single active control plane modifies MAC address tables in all switches.

With the modified MAC learning rules, the entry for R’s MAC address in S1 points to S1’s local port in the port channel toward R, not to the inter-switch link, and the traffic from A to R flows over the shorter path A–S1–R. As you can see, the switches successfully got rid of the hot potato (= the packet) without warming the inter-switch link.

Load Balancing Implications

To make the long story short (more in a follow-up post): if servers A and B decide to send data to R concurrently, there’s no way to split that load across the links in the port channel between R and S1+S2.
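
Continuing the toy model above (again, not vendor code, and assuming both A and B hang off S1, which is what makes the load impossible to split): S1’s hot-potato rule restricts the egress choice to its own members of the R-facing port channel, so the hash can only ever pick S1’s link and S2’s link toward R stays idle no matter how many flows there are.

    import zlib

    # One local member link per switch in the port channel toward R (toy example).
    PO1_MEMBERS = {"S1": ["S1-Eth1"], "S2": ["S2-Eth1"]}

    def egress_link(ingress_switch, src, dst):
        local = PO1_MEMBERS[ingress_switch]        # hot potato: local members only
        return local[zlib.crc32(f"{src}-{dst}".encode()) % len(local)]

    for server in ("A", "B"):
        print(server, "->", egress_link("S1", server, "R"))
    # Both flows exit on S1-Eth1; S2-Eth1 toward R carries nothing.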

More Information

  • Multi-chassis Link Aggregation (and numerous other LAN, SAN and virtualization technologies) is described in the Data Center 3.0 for Networking Engineers webinar.
  • The term hot potato routing is commonly used on the Internet to describe a routing strategy where an ISP hands packets off to adjacent autonomous systems as soon as possible.

13 comments:

  1. What about interconnecting 2 Switchstacks with Multichassis Etherchannel?

    Thanks
    MB
  2. I guess I might know what you have in mind, but could you be more specific? Describe your setup in a bit more detail.
  3. Hi Ivan,

    Very interesting insight. The MC-LAG implementation I've been privy to (ALU TiMOS) does not use load balancing - you always have one or more active and one or more standby group members, with active members always being on the same physical chassis and standby being on the other. While this solution potentially has a disadvantage of not using all the available bandwidth (and yes, potentially using a sub-optimal path for some flows), it on the other hand has an advantage of simplicity (and thus being inherently more stable due to fewer things that could go wrong) and lack of the problem you're describing.
  4. > does not use load balancing

    Cross-chassis load balancing, that is.
  5. Why would you use it then? To reduce the number of STP events?
  6. Hi Ivan,

    I'm thinking of a 3750 stack at each data center, connected through 2 SFP GE links bundled into an EtherChannel, with one uplink on each chassis, so one Catalyst or link can fail without causing big trouble.
  7. redundancy, very fast convergence, no STP needed, works for both PW and VPLS (as opposed to BGP MH)
  8. MB, when you spread etherchannel links across multiple members of a 3750 stack, that's what Cisco calls "cross-stack etherchannel", which in my understanding is a very different animal from multi-chassis etherchannel. Given the way Cisco handles frame forwarding, what you've really got in a 3750 stack is more like a single chassis with blades, where the stacking cable forms the unified backplane, as opposed to multiple chassis doing special MLAG magic.

    Speaking to Ivan's example specifically, in a 3750 cross-stack etherchannel, I believe you would see a more balanced use of the cross-stack links. In other words, if servers A & B are uplinked to the same 3750 switch in a stack, the stack would still use all members of the cross-stack etherchannel in accordance with the port-channel load-balancing algorithm you had selected.

    I'll admit this is based purely on my experience managing a couple of hundred 3750 stacks, where we relied on cross-stack etherchannel both for performance and resiliency. We saw good, load-balanced results across the links hashing on port numbers inbound from the agg/core layer, and on source/destination IP outbound from the stack (which doesn't support hashing on L4 port numbers). The load-balancing method appropriate for your stacks might vary depending on your layout.

    Someone who's dug deeper into 3750 control-plane logic will correct me if I'm misunderstanding what's really going on there.
  9. Agree with Ofer below - mainly for redundancy without STP.
  10. I use basically the setup you describe with 3750-E stacks; it works really well and I get a good balance of traffic between ports in the port channel. Ethan is correct in that it's a way different beast than MLAG at least from what I have read about MLAG. The 3750 stacks are essentially a single switch with a robust failure mode. They can't even do rolling upgrades (man I would kill for that feature on 3750). In either case, with only 2 gig-e links in the LAG your biggest setback will be fretting over theoretical problems that just don't really matter much in reality.
  11. Thanks! I'll have a look at MLAG and cross-stack etherchannel :)
  12. If there are two core routing switches and two access switches which are MLAGged together in both directions, and hosts that are dual-active LAGged to the pair of access switches, then the traffic would stay on whichever side the host places it. Also the topology would remain loop-free as you continue to attach paired access switches in this manner to the core routing switch pair and so, although you would want to retain spanning tree "just in case", spanning tree topology events could be effectively eliminated.
  13. Thank you very much for this useful info, but just an urgent question:
    could we do that with only two edge L2 switches (Juniper EX series), and how will the router act if it connects to an external network without needing to run STP?