The customer's Exchange 2013 environment consists of a three-node DAG: two nodes in the primary datacenter and one in the DR datacenter. The DAG is configured for DAC mode. The customer wisely wanted to test the DR failover procedures so they would know what to expect if the primary datacenter ever went offline.
The failover process went smoothly. The SMTP gateways and Exchange 2013 servers in the primary datacenter were turned off and the DAG was forced online in the DR datacenter. Internal and external DNS was then updated to point to the DR site. CAS connectivity and mail flow were tested successfully from all endpoints - life was good. The customer wanted to leave the environment failed over to the DR site for a few hours to confirm there were no issues.
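For context, a forced activation like this in a DAC-mode DAG is normally driven with the Exchange datacenter switchover cmdlets. The following is only a sketch; the DAG name DAG1 and the site names Primary and DR are assumptions, so substitute your own values.

# Mark the failed primary-site members as stopped (run from the DR site)
Stop-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite Primary -ConfigurationOnly

# Stop the Cluster service on the surviving DR node, then force the DAG online in the DR site
Stop-Service ClusSvc
Restore-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite DR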
Now it was time to fail back. The documentation says to confirm that the primary datacenter is back online and that there is full network connectivity between the Exchange servers in both sites. Then log in to each DAG member in the primary site and run "cluster node /forcecleanup" to ensure those servers are ready to be rejoined to the DAG.
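To illustrate that step, the cleanup is run on the primary-site members only. A hedged sketch, assuming the primary-site DAG members are named SERVER1 and SERVER2:

# On each primary-site DAG member, clear the stale cluster state (legacy cluster.exe syntax)
cluster node /forcecleanup

# Or target each node explicitly with the Failover Clustering module
Clear-ClusterNode -Name SERVER1 -Force
Clear-ClusterNode -Name SERVER2 -Force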
But the customer scrolled past the part explaining where to run the command and ran it on the only node in the DR site. This wiped the cluster configuration from the only node that held it. Instantly, the cluster failed and all the databases went offline. Since no other cluster nodes were online, there was nothing for the databases to fail over to.
We fixed it by turning on the two DAG members in the primary site and starting the DAG in that site. That brought the databases online, but they were not up to date. We used the Failover Cluster Manager console to evict the DR node and then add it back in. After Active Directory replication completed, we saw that replication between all three nodes was working and the databases came up to date from Safety Net. We didn't even need to reseed any of the database copies. Disaster averted.
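For completeness, the recovery roughly followed this pattern. Again, this is a sketch; DAG1, the Primary site name, and the server name MBX1 are placeholders:

# Reactivate the DAG in the primary site once its members are powered back on
Start-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite Primary

# Verify that all database copies are healthy and the copy/replay queues are draining
Get-MailboxDatabaseCopyStatus -Server MBX1 | Format-Table Name,Status,CopyQueueLength,ReplayQueueLength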
So how did this happen and what can be done to prevent it?
Human nature is to skip large blocks of text and scan for the steps that need to be done. This is especially true when you're fairly comfortable with the steps or you're under pressure. For this reason, I keep my procedures concise, with no more than a sentence or two explaining why each step is being done.
In this case, the customer scrolled past the text explaining where to run the command and just ran it from the wrong server.
Here are my suggestions for creating disaster recovery documentation.
- Know your audience. You need to make an assumption about who will be reading the DR documentation. Will it be the same people who manage the infrastructure in the primary site? Maybe not, if this is a true disaster. Make sure you write the documentation for the right audience. Avoid acronyms that unfamiliar readers may not know, or at least spell each one out and add the acronym the first time you use it. For example, Client Access Server (CAS).
- Keep your DR procedures concise. People skip walls of text. Murphy's Law says that disasters happen at the worst times, and people don't want to read a bunch of background information that isn't pertinent to the task at hand. In a real disaster there will probably be a lot of other things going on, with management asking for status updates. Write your procedures like a cookie recipe: you don't need to be a chef to follow a recipe, but you do need to know how to recover if something in the recipe goes wrong. Provide links in the documentation that reference TechNet concepts, as needed.
- Highlight important steps. Use highlighting to call out important steps in the procedures, but don't overdo it. Too much highlighting makes the document difficult to read. You can highlight using color or simple callout blocks, such as:
Important: The following procedures should be run from SERVER1.
- Make sure the steps read from top to bottom. Don't bounce around in the document or refer back to earlier steps unless it's something like, "Repeat for all other Client Access servers." Avoid procedures like, "Cut the blue wire after cutting the red wire." Avoid page breaks between important steps, if possible.
- Use targeted commands whenever possible. If a command is targeted at a specific object, it won't run if that object is unavailable. For example, "cluster node SERVER1 /forcecleanup" will only run if SERVER1 is up and always acts on SERVER1, rather than assuming the user is running it from the correct server. This one suggestion would have prevented the unexpected outage in my example; see the comparison below.
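To make that last point concrete, compare the two forms of the same command (SERVER1 is a placeholder for the intended primary-site node):

# Untargeted - acts on whichever node you happen to be logged into
cluster node /forcecleanup

# Targeted - acts on SERVER1 regardless of where it is run, and simply fails if SERVER1 is unreachable
cluster node SERVER1 /forcecleanup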