Failure of FSW Causes Cluster Group to Failover

The following information was written for Exchange 2007 CCR mailbox clusters, but it pertains to any clustering solution that uses the Windows Server 2008 Node and File Share Majority cluster quorum configuration.

How Does Node and File Share Majority Clustering Work?

Exchange 2007 CCR uses two clustered Exchange mailbox nodes that together host a Clustered Mailbox Server (CMS). For Windows to know which node is active, it uses a File Share Witness (FSW) to maintain quorum. The FSW is a network share on a third computer (typically a Hub Transport server in the normally active node's physical site). The active node writes information to files in that share and locks them for writing, preventing the passive node from writing to the FSW and taking quorum. Each node and the FSW count as one vote, and it always takes two out of three votes to maintain quorum.

If the active node becomes unavailable, the passive node can write to the FSW and the cluster group fails over. In the case of a total site failure where both the active node and the FSW are offline, both the cluster group and the CMS will fail since there is no quorum (there's only one vote).
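
To make the vote counting concrete, here is a minimal Python sketch of the majority rule described above. It is only a model for illustration: the voter names and the has_quorum function are hypothetical, not part of Windows or Exchange, and the sketch ignores the file locking that decides which node owns the witness.

```python
# Illustrative model only; these names are hypothetical, not a Windows API.
VOTERS = ("active_node", "passive_node", "fsw")


def has_quorum(online):
    """Return True when a strict majority of the three voters is reachable."""
    votes = sum(1 for voter in VOTERS if voter in online)
    return votes > len(VOTERS) // 2


scenarios = {
    "all healthy": {"active_node", "passive_node", "fsw"},
    "active node down": {"passive_node", "fsw"},
    "FSW down": {"active_node", "passive_node"},
    "site failure (active node and FSW down)": {"passive_node"},
}

for name, online in scenarios.items():
    print(f"{name}: quorum={'yes' if has_quorum(online) else 'no'}")
```

Only the last scenario, with a single vote left, loses quorum in this model.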

What Happens When the FSW Becomes Unavailable?

When the FSW fails, the active CMS node (Exchange) does not fail over because there are still two votes (the active and passive nodes). However, the Windows cluster group will fail over to the other node if the FSW does not come back online within 60 seconds. This is because the File Share Witness resource in Windows Server 2008 is configured to fail over the cluster group when the FSW fails.

Worse, the cluster will not try to bring the FSW resource back online for another 60 minutes. During this time, a failure of either one of the nodes will cause the cluster to fail, even if the FSW server itself is back online, because the witness resource is still in a failed state.

These default settings are provided so that the cluster event logs don't fill up with constant "Trying to start the resource" and "The resource failed to start" events during a prolonged outage.

This is what happens when the FSW server is rebooted (during patch management, for example):

The server holding the FSW resource is rebooted.
The cluster tries to connect to the FSW one minute after failure is detected.
If the FSW is still unavailable (which usually happens - most servers take longer than 60 seconds to restart), the cluster group fails over to another node.
The cluster waits one hour and tries to connect to the FSW again. The FSW is finally brought online.

Note: This behavior only pertains to Windows Server 2008. Windows Server 2008 R2 does not have this issue.

It's important to know that even though the cluster group fails over, there really is no effect on Exchange, even with a geographically dispersed CCR cluster (geo-cluster). However, if you're like me, you like symmetry and order. The cluster group should be with the active CMS node.

Here's how to minimize the time that the cluster group is on the (normally) passive node:

Open the Failover Cluster Management console
Add the cluster name, if necessary, and select it
Double-click Cluster Core Resources in the middle pane to expand it
Right-click File Share Witness (\\servername\sharename) and select Properties
Click the Policies tab
For optimal restart performance, change "If all the restart attempts fail, begin restarting again after the specified period (hh:mm)" to 15 minutes (00:15).

This configuration will cause the cluster service to attempt to bring the FSW resource online once every 15 minutes, instead of once an hour.
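
A rough way to see what the shorter period buys you is to model the retry timeline in Python. This is a simplified sketch under stated assumptions, not a description of how the cluster service is implemented: the function name and parameters are hypothetical, time is measured in minutes from the moment the failure is detected, and the cluster is assumed to make one attempt after the first minute and then one attempt per configured period.

```python
def minutes_until_fsw_online(reboot_minutes, retry_period_minutes,
                             first_retry_minutes=1):
    """Return the time of the first restart attempt that finds the FSW server up.

    Hypothetical model: one attempt at first_retry_minutes, then one attempt
    every retry_period_minutes until the server has finished rebooting.
    """
    attempt = first_retry_minutes
    while attempt < reboot_minutes:
        attempt += retry_period_minutes
    return attempt


reboot = 5  # assumption: the FSW server takes 5 minutes to restart
for period in (60, 15):
    online_at = minutes_until_fsw_online(reboot, period)
    print(f"retry period {period} min: FSW resource online after {online_at} min")
```

With a 5-minute reboot, this model puts the FSW resource back online after 61 minutes with the default period and after 16 minutes with the 15-minute period.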

Next, log on to the server holding the FSW resource (typically a Hub Transport server in the active site) and install the Failover Clustering Tools feature. You'll find it in Remote Server Administration Tools > Feature Administration Tools.

Now create a batch file called FSW_Online.bat. Enter the following two lines:

cluster EXCLUSTER1 res "File Share Witness (\\server\mns_fsw_excluster1)" /online
cluster EXCLUSTER1 group "Cluster Group" /move:node.yourdomain.com

Note: Replace EXCLUSTER1 with your cluster name. Replace \\server\mns_fsw_excluster1 with the name of your FSW resource (enter "cluster res" at a command prompt to find it). Replace node.yourdomain.com with the FQDN of the CMS node you want to keep the cluster group on.

Lastly, configure FSW_Online.bat to run at startup on the FSW resource server:

Open Local Group Policy Editor
Navigate to Computer Configuration > Windows Settings > Scripts (Startup/Shutdown) > Startup
Click Add and browse to the FSW_Online.bat file you created
Click OK twice and close Local Group Policy Editor

This is my current best practice for configuring the File Share Witness resource failure policy.

Special thanks go to Tim McMichael, Senior Support Escalation Engineer on the Exchange product support team, for assisting me with this article.

4 comments:

Anonymous, June 15, 2009 at 7:19 AM
Fantastic post. I've been looking for this solution for a while, as every time we reboot our hub transports the active CCR node fails over. Coming from Exchange 2003 shared-storage-style clustering, CCR has seemed less robust when we have failovers like this. Great solution, thanks.
Jeff, August 18, 2009 at 6:14 PM
In a previous version of this article, I recommended unchecking "If restart is unsuccessful, fail over all resources in this service or application" to prevent the FSW from going offline. This is NOT recommended. Doing so will eliminate the cluster's ability to fail the cluster group to another node when there is actually an issue with the witness not related to reboot.
Anonymous, February 17, 2010 at 4:14 AM
I installed the Failover clustering tools, and I can't launch cluster.exe on the server.
Anonymous, February 17, 2010 at 4:34 AM
After installing the Failover Clustering Tools, I try to run the "cluster res" command and I get the following error message:

System error 1060 has occurred (0x00000424). The specified service does not exist as an installed service.
