High Availability Fallacies

I’ve already written about the stupidity of risking the stability of two data centers to enable live migration of “mission critical” VMs between them. Now let’s take the discussion a step further – after hearing how critical the VM that the server or application team wants to migrate is, you might be tempted to ask, “And how do you ensure its high availability the rest of the time?” The response will likely be along the lines of “We’re using VMware High Availability” or, even more proudly, “We’re using VMware Fault Tolerance to ensure even a hardware failure can’t bring it down.”

I have some bad news for the true believers in virtualization-supported high availability – quite a few of them probably don’t understand how it works. Let’s see what HA products can do … and keep in mind that hardware causes just a few percent of all failures; most outages are caused by software bugs or operator errors.

VMware High Availability (or any equivalent product) is a great solution, but the best it can do is restart a VM after it crashes or after the hypervisor host fails. Even assuming you can reliably detect a failure of the VM’s OS or of an application service (for example, the database software), the VM still needs to be restarted. VM-level high availability is thus dangerous, as it gives application developers and server administrators false hope – they start to believe a magical product can bring high availability to any hodgepodge of enterprise spaghetti code. In reality, the VM has to go through a full power-up process, and all the services it runs have to perform whatever recovery procedures they need before the VM (and its services) are fully operational.
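To make the restart-based recovery model concrete, here’s a minimal Python sketch of what any restart-style HA mechanism boils down to. The GuestVM class and the timings are made-up placeholders, not any real API – the only point is that the application stays unavailable for the entire boot-plus-recovery window, no matter how quickly the HA product reacts.

```python
import time

class GuestVM:
    """Hypothetical stand-in for a protected VM. Real HA products work at
    the hypervisor level and know nothing about the services inside."""

    def __init__(self, boot_seconds, service_recovery_seconds):
        # Illustrative numbers only; real values depend on the guest OS
        # and on how much crash recovery the application has to perform.
        self.boot_seconds = boot_seconds
        self.service_recovery_seconds = service_recovery_seconds

    def restart_and_recover(self):
        time.sleep(self.boot_seconds)              # full guest power-up on a surviving host
        time.sleep(self.service_recovery_seconds)  # e.g. database crash recovery and rollback

start = time.monotonic()
GuestVM(boot_seconds=2, service_recovery_seconds=3).restart_and_recover()
print(f"application unavailable for {time.monotonic() - start:.0f} seconds")
```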

VMware Fault Tolerance is an even more interesting case. It runs two parallel copies of the same VM (and ensures they're continuously synchronized) – a perfect solution if you’re running a very lengthy procedure and don’t want a hardware failure to interrupt it. Unfortunately, software failures happen more often than hardware ones ... and if the VM crashes, both copies (running in sync) will crash simultaneously. Likewise, if the application service running in the VM crashes (or hangs), it will do so in both copies of the VM.
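Here’s a toy Python illustration of that point (obviously not how FT is implemented – FT replays the primary’s inputs on the secondary at the hypervisor level): both copies execute identical code on identical inputs, so a software bug takes them down at exactly the same spot, while a hardware failure of either copy would be harmless.

```python
def buggy_service(request):
    # Latent software bug: blows up whenever the divisor is zero
    return request["amount"] / request["divisor"]

def lockstep_run(request):
    """Feed the same input to both replicas, the way deterministic replay
    feeds the secondary everything the primary sees."""
    outcomes = {}
    for replica in ("primary", "secondary"):
        try:
            outcomes[replica] = buggy_service(request)
        except ZeroDivisionError as crash:
            outcomes[replica] = f"crashed: {crash}"
    return outcomes

print(lockstep_run({"amount": 100, "divisor": 4}))
# {'primary': 25.0, 'secondary': 25.0}
print(lockstep_run({"amount": 100, "divisor": 0}))
# {'primary': 'crashed: division by zero', 'secondary': 'crashed: division by zero'}
```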

As expected, an interesting Twitter discussion followed this blog post. Among other interesting remarks, Duncan Epping (of Yellow Bricks fame) rightly pointed out that the VMware HA/FT products work exactly as described. That’s absolutely true – VMware’s documentation is extremely precise in describing how HA and FT work. It’s just that VMware marketing tends to oversell stuff.

High-availability clusters like Windows Server Failover Clustering restart a failed service (for example, SQL Server) on the same or on another server. The restart can take anywhere from a few seconds to a few minutes (or even longer if the database has to perform extensive recovery). There goes another nine of availability.
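A quick back-of-the-envelope calculation (using Python as a calculator) shows why:

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600 minutes

def yearly_downtime_budget(nines):
    """Minutes of downtime per year allowed by an availability of N nines."""
    unavailability = 10 ** -nines   # e.g. five nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability

for nines in (3, 4, 5):
    print(f"{nines} nines: {yearly_downtime_budget(nines):8.2f} minutes of downtime per year")

# 3 nines:   525.60 minutes of downtime per year
# 4 nines:    52.56 minutes of downtime per year
# 5 nines:     5.26 minutes of downtime per year
```

A single restart that includes a few minutes of database recovery burns through the entire five-nines budget for the year; a handful of such events pushes you from four nines toward three.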

Bridging between data centers (the typical design recommended by VMware-focused consultants) might cause long-distance forwarding loops, or you might see the flood of traffic caused by a local forwarding loop spill over the WAN link into the other data center, killing all other inter-DC traffic (including storage replication and, if you’re brave enough to use long-distance clusters, cluster heartbeats).

Want a data point? We experienced a forwarding loop caused by an intra-site STP failure. Recovery time: close to 30 minutes, with the NMS noticing the problem immediately and an operator available on site. Admittedly, some of that time was spent collecting evidence for post-mortem analysis.

Are you really willing to risk your whole IT infrastructure to support an application that was never designed to run on more than one instance? After all, one would hope your server admins do patch the servers … and patches do require an occasional restart, don’t they?

Moral of the story: the “magic” products give you a false sense of security; good application architecture and the use of truly highly-available products (like MySQL Cluster) combined with load balancing technologies are the only robust solution to the high availability challenge.
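As a minimal sketch of what that looks like from the client (or load-balancer) side: the replica URLs below are hypothetical placeholders, and a real deployment would use a proper load balancer with continuous health checks instead of ad-hoc code like this.

```python
import urllib.error
import urllib.request

# Hypothetical redundant application instances (placeholders, not real hosts)
REPLICAS = [
    "https://app1.example.com/health",
    "https://app2.example.com/health",
]

def first_healthy_replica(replicas=REPLICAS, timeout=2):
    """Return the first replica that answers its health check.

    The point: recovery means 'send the next request to another instance',
    not 'reboot a VM and wait for database crash recovery'.
    """
    for url in replicas:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                if response.status == 200:
                    return url
        except (urllib.error.URLError, OSError):
            continue  # instance down or unreachable -- try the next one
    raise RuntimeError("no healthy replica available")

if __name__ == "__main__":
    print("sending traffic to", first_healthy_replica())
```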

Even More Information

You’ll find in-depth discussions of high-availability architectures in the Designing Active-Active and Disaster Recovery Data Centers webinar.

Want to dive deep into the underlying infrastructure technologies? Watch the Data Center 3.0 for Networking Engineers, Data Center Interconnects, and vSphere 6 Networking Deep Dive webinars.

Revision History

2023-03-03
Rewrote a few paragraphs to make them easier to understand.

10 comments:

  1. Fully agreed. A good typical example is Marathon Everrun VM. At the end of the day, it's the end users who feel cheated when their services in a VM hang and the VM HA or FT is not application-aware.
  2. don't forget with FT we can sync with a little different time ! so we can adjust a crash when we change something on the first VM.
  3. I'm not sure what you're trying to tell me. FT syncs every single I/O operation (including KVM events). This blog post has a good introductory explanation:

    http://lonesysadmin.net/2011/04/19/vmware-fault-tolerance-determinism-and-smp/
  4. Ivan, the problem is that creating a high availability solution for the front end is a no-brainer. Put more than two instances and an LB in front. Done.

    The problem is to provide an HA solution for anything that has to do with persistent local data. This may include the database in a (relatively) modern 3-tier app, but it also includes more traditional enterprise applications (Exchange being an example).

    It is not even worth discussing how to provide resiliency for the front end. It's done. Focus your energies on the back end.

    Massimo.
  5. We totally agree - the back end is a tough nut to crack. However, until you solve the DB (more precisely, ACID data store) problem, you won't have a truly HA application. VMware HA or Windows failover cluster(s) buy you nothing but an automatic restart after a hardware failure. The DB service still has to restart (and roll back all pending transactions) after every failure, which takes a significant amount of time.

    However, both SQL Server and MySQL offer a redundant server configuration, where the second server can take over immediately when the first one fails. High-end MySQL offers an even better distributed solution. So the problems can be solved ... but it's easier to offload them to someone else and believe in unicorn tears.
  6. None of the things you are referring to, Ivan, provides a consistent failover scenario, to the best of my knowledge. The reason it starts sooner on the other side is that it has lost all the transactions the application thinks have been committed. It's good if you are hosting an application that shares pictures... not good if you deal with money.

    Having said this, there is clearly a trend toward making the backend more "scale-out" friendly... but there is a long way to go.

    My 2 cents.
  7. MySQL Cluster provides true failover. If a data node dies, at least one other node already has all its data. If I remember correctly, it's supported in a single IP subnet configuration (with database replication recommended for long-distance needs).

    SQL Server provides database mirroring (which can be synchronous if you want to retain total consistency).

    And we (yet again) agree that the backend has a long way to go ;)
  8. I am reading this article again... The funny thing is that I understand what you are trying to get at, but this is only true in an ideal world where applications are specifically written to support a setup that includes load balancers and a shared database. Although everyone wants this to be true, the reality is that we are nowhere near this ideal world.

    In most enterprise organizations I have been in, at least 80% of the applications that are essential to day-to-day line-of-business operations don't support this kind of setup. This is one of the reasons HA is so widely adopted today. On top of that, there is a substantial cost associated with load balancers and a shared database configuration (which, yes, needs to be clustered/distributed as well), which might be more than the SLA requires. In those cases vSphere HA / FT / VM and App Monitoring are the way to go: 5 clicks and it is configured, no special skills needed to enable it... just point and click.

    Once again, I agree that a vFabric load-balanced setup (shameless plug :)) would be ideal, but there are far too many legacy apps out there. Even in the largest enterprise orgs the IT department cannot control this; even the line of business cannot control it... the main reason being that the suppliers are not taking the time to invest.

    Go vSphere HA

    Duncan
    yellow-bricks.com
  9. You are making a lot of assumptions here. You are assuming that all critical applications have a huge database. Many applications that are used on a day-to-day basis have a small database. Many apps used at financial institutions, for instance, are simple apps just to calculate what a mortgage would cost. Now, although this might be a 20 MB app, it is essential to the line of business; you might not think it is critical, but they feel it is.

    Unfortunately critical doesn't equal current or mature application architecture.
  10. We are planning to use VMware's FT to run a redundant Citrix NetScaler VPX for our internet-facing applications (10-30k req/sec).
    We could go for NetScaler's traditional cluster setup, but that would require us to buy 2x licenses. With our existing FT license we get just as much reliability at no extra cost.
    If the software inside that VM were to die, we would be in exactly the same situation as running it on a dedicated box.