2 Reasons Why Failover Could Be Slow

Intro

Microsoft SQL Server has had Mirroring since version 2005, and now that SQL Server 2012 is released, there is an additional feature set called AlwaysOn (marketing info | technical info). AlwaysOn is currently an Enterprise-only feature, so Standard Edition users will still need to use Mirroring. I would describe AlwaysOn this way: it is a beautiful blend of Clustering and Mirroring smeared with Special Sauce. That may not be the best description, but being a fast food junkie, it really hits home with me. Don’t be afraid of the word Clustering, though. Shared Storage (e.g., a SAN) is NOT required. AlwaysOn leverages some of the Clustering features to manage Quorum and the actual failover process.

When a failover event occurs, the failover process is usually quite snappy, somewhere between 15 seconds and a minute. That said, there are a couple of reasons why failover may be on the slower side:

  1. Virtual Log File Count (VLFs)
  2. Cloud Disk Latency

1. High VLF Count Slows Failover Time

In simple terms, a Transaction Log file (.LDF) is made up of a number of smaller segments called Virtual Log Files (VLFs). Each VLF is either active or inactive, and the sizes vary based on a formula tied to the size of each growth event. If SQL Server is left to autogrow the tlog in small increments (1MB or 10% at a time), the file can end up with thousands of VLFs.
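A quick way to see whether you are in autogrow-by-tiny-increments territory is to check the growth settings directly. Here is one way to do it (run in the context of the database in question) using the documented sys.database_files catalog view; the CASE expression is just for readable output:

-- Show autogrowth settings for the current database's files.
-- growth is a percentage when is_percent_growth = 1,
-- otherwise it is a count of 8-KB pages.
SELECT name,
       type_desc,
       CASE WHEN is_percent_growth = 1
            THEN CAST(growth AS varchar(10)) + '%'
            ELSE CAST(growth * 8 / 1024 AS varchar(10)) + ' MB'
       END AS growth_increment
FROM sys.database_files;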

When SQL Server performs a failover, lots of transaction log entries need to be reconciled (the redo and undo phases) in a process called Recovery (or Crash Recovery) so SQL Server can bring the database online in a healthy state (see the ACID properties of a transactional database). Each VLF requires a little bit of overhead to process, so if there are thousands of VLFs, recovery can be noticeably slower.
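If you want to watch recovery happen during a failover, one option (a sketch, not the only way) is to poll sys.dm_exec_requests. Databases going through crash recovery show up with a command of DB STARTUP, and percent_complete gives a rough progress estimate:

-- Watch crash recovery progress on databases coming online.
SELECT session_id,
       DB_NAME(database_id) AS database_name,
       command,
       percent_complete,
       estimated_completion_time / 1000 AS est_seconds_remaining
FROM sys.dm_exec_requests
WHERE command = 'DB STARTUP';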

The solution is to properly shrink the transaction log (which deallocates the inactive VLFs), and then properly grow it back out (which creates fresh VLF ‘segments’). The key term is properly. If done correctly, an 8GB transaction log file (.LDF) should contain between 19 and 35 VLFs. See this blog post from Kimberly Tripp (b|t) for more details.
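As a rough sketch of that shrink-and-regrow process (the database and logical file names here are hypothetical, and you should adjust sizes for your environment, growing a very large log in chunks rather than one giant jump):

USE MyDatabase;  -- hypothetical database name
GO
-- Shrink the log as small as possible; this deallocates
-- the inactive VLFs at the end of the file.
DBCC SHRINKFILE (MyDatabase_log, 1);
GO
-- Grow it back out in a controlled step so SQL Server creates
-- a small number of reasonably sized VLFs.
ALTER DATABASE MyDatabase
MODIFY FILE (NAME = MyDatabase_log, SIZE = 8000MB);
GO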

You can check how many VLFs your database has by running this undocumented DBCC command and counting the number of rows returned:

DBCC LogInfo
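If you are on a newer build (SQL Server 2016 SP2 or later), there is also a documented alternative, the sys.dm_db_log_info dynamic management function, which makes the count a one-liner:

-- Documented VLF count (SQL Server 2016 SP2 and later).
SELECT COUNT(*) AS vlf_count
FROM sys.dm_db_log_info(DB_ID());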

2. Cloud Disk Latency Slows Failover

Another reason the failover can be slow is if the IO subsystem is pokey. This can be the case with a Direct Attached Storage (DAS) RAID 5 configuration that also uses slow drives, or a local SAN using an iSCSI interface (with its large TCP/IP overhead) combined with a budget 1Gbps pipe rather than 10Gbps. Of course, this is all relative to your server’s IO demand (which can be greatly lightened without hardware upgrades by query/index tuning, etc.).

But for today, we are going to focus specifically on cloud storage. Amazon AWS EC2 (which we have a few clients using) is a prime example. True, Amazon has awesome data centers and lots of computing power. However, they also host Netflix and a bunch of other IO-intensive companies. If the zillionth person decides to watch a new episode of Arrested Development (did you know it’s coming back?) and the data happens to be sitting on the same disk spindles as your database, you could have some latency issues. So if you are using EC2 or similar to house your replica, the failover will trigger a crash recovery on the EC2 replica to bring it online. If this happens while a lot of IO unrelated to your account is hammering the same disks, you may have a delay getting your server back up and operating.

This is known as the noisy-neighbor problem, and it can happen with any shared hosting, and even on your own corporate SAN (and even on local DAS drives housing multiple databases or other files). You don’t think your SAN has only your database on the underlying physical disks, do you? Sure, your SAN Admin told you that this LUN is just for you, but they lie with the truth (and many times they just plain lie) by carving your LUN out of the same disk pool as a million other LUNs and [mis]leading you into believing that what you see on F: is the only data on those disks.

So if you are having slower-than-expected failovers, you can test the IO latency with a tool like IOMeter or similar.
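You can also get a quick read on latency from inside SQL Server itself. This query against the documented sys.dm_io_virtual_file_stats DMV is one rough sketch; the stall counters accumulate from the last instance restart, so treat the numbers as long-run averages rather than a point-in-time benchmark:

-- Average IO stall (ms) per read and per write, per database file,
-- accumulated since the SQL Server instance last started.
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.io_stall_read_ms / NULLIF(vfs.num_of_reads, 0) AS avg_read_ms,
       vfs.num_of_writes,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id = vfs.file_id
ORDER BY avg_write_ms DESC;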