Down Protection Site: VMware SRM TestFailover with the SRDF SRA

One of the great features of VMware SRM is the ability to run a test failover without impacting either the protection or recovery site. When using the SRDF SRA this means the replication between the R1 and the R2, whether asynchronous or synchronous, will not be broken. The SRA uses TimeFinder technology to accomplish this, taking a snapshot of the R2 devices and mounting the linked devices at the recovery site. All of this works perfectly well when both sites are up. When the protection site or a component of it is unavailable, however, the SRA cannot perform the test. It will fail. The SRA can handle a couple of different SRDF states when using a global option (TestFailoverForce), but not something amiss with the protection site. This inability to run a test failover when the protection site is down is not unique to the SRDF SRA, but for a number of SRM versions now VMware has permitted it if the SRA allows it, so occasionally it causes confusion. Fortunately we will be adding this feature in a future version, but what to do right now?

I have worked with a number of customers who either by choice or requirement want to run a “real” DR failover. Now I should note that we do have an option in the SRA which gets you almost all the way there, and that is the ability to fail over to the R2s directly when running a test failover (TestFailoverWithoutLocalSnapshots). We don’t recommend this option since you are at risk during the test, but it is available. For some, however, it still is not enough – the protection site must be brought down to mimic a disaster. The rest of this post is for those customers.

As I make very clear in the TechBook (just updated!), the best way to ensure a clean DR test (when not using the test failover functionality) is to run a Planned migration. Planned migrations, however, require a viable protection site, so we are back to square one. When you make the protection site unavailable, you can only run a DR test. VMware will know and will gray out the Planned migration option.

Running a DR test when the protection site is down guarantees errors. There’s no way around it. VMware designed the DR run to ignore most errors because the primary goal is to get the environment on the recovery site operational. There is no concern for having to rebuild the SRM objects (e.g. protection group, recovery plan) since the goal is for the business to continue. When customers want to test the DR run, however, there is great concern for these SRM objects, as rebuilding large environments is very time consuming, and during that time you are at risk. The key, then, is how to conduct a test that ends with a clean DR run, which permits a clean reprotect and then the planned migration and reprotect that return the protection site to its original role. This very issue came across my desk recently so I tested it in my environment and came up with a set of repeatable steps which I will share.

A few things about the test. First, the test runs through a complete failover from R1 to R2 and then, through reprotect/planned migration/reprotect, returns the environment to its beginning state. It is not designed to keep the R1 as the source of truth. That would be a good deal more complicated. Second, the production VMs come down. That’s the nature of the test. Third, the key here is to ensure that once the failover occurs and the VMs are running on the DR site, they are not impacted. This particular customer I worked with found that no matter what they did to clean up the environment, they invariably put the SRDF pairs into a state that made both the R1 and R2 write disabled. When that happens the VMs are dead with read-only operating systems. That makes for an unhappy customer. This procedure will avoid that.

Rather than use endless screenshots, I’m going to list out the steps at a high level and then provide about a 10 minute demo. I don’t do many callouts in the demo so I am relying on the list of steps to fill in the gaps. The demo uses a single synchronous SRDF pair to make it easier to follow along, but obviously there is no reason the procedure cannot scale to large environments.

  1. Bring down the Protection site – in my environment my Solutions Enabler and SRM are on the same box as the vCenter so I simply disable the network adapter on the VM; however there are a number of ways this can be accomplished and customers are free to do it as they wish.
  2. Disable RDF ports (or drop links) on the Protection array which will cause the RDF pairs to go into a partitioned state.
  3. Run Disaster Recovery (DR) from Recovery site vCenter (make sure you see that the Planned Migration option is grayed out) – it will succeed, VMs will come up, but there will be plenty of errors since the Protection site is down.
  4. Bring up Protection site including the array ports/links – all components. Once restored the RDF pairs will go into a split state.
  5. Log out/log in to the Recovery vCenter and rescan SRAs so they exit the error state and recognize the new state of the RDF pairs.
  6. Run DR again. It will fail, but it will complete some important tasks on the Protection site that it was unable to do during the first run when the site was down.
  7. Set all Protection site devices (R1) to write disabled through Unisphere for VMAX or CLI (in demo). This will cause all RDF pairs to enter a Failed over state.
  8. Run DR for a third time and it will complete without errors.
  9. Now run Reprotect. It will fail and the RDF pair will enter a suspended state; however, the Recovery site devices will remain read/write and thus the VMs stay up and available. Note that the Recovery site devices will now be R1, not R2, since Reprotect runs the swap command.
  10. Run a manual, incremental establish of the RDF pairs through Unisphere for VMAX or CLI (in demo). This will bring the SRDF pairs to their proper state.
  11. Once RDF pairs are synchronized, run a second Reprotect (not forced) and it will succeed. The procedure is complete.
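
For those who like to script their checklists, the steps above can be captured as a small runbook that records which RDF pair state to expect after each step. This is only a sketch in Python: the names `Step`, `RUNBOOK` and `check_step` are my own invention for illustration and are not part of the SRA, SRM or Solutions Enabler. It does not talk to the array; you would feed it whatever pair state you read from Unisphere or the CLI.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    number: int
    action: str
    # Pair state named in the steps above; None where no state is called out.
    expected_pair_state: Optional[str]


# Hypothetical checklist mirroring the 11 steps. The state for step 10
# assumes a synchronous pair, as in the demo.
RUNBOOK = [
    Step(1, "Bring down the Protection site", None),
    Step(2, "Disable RDF ports or drop links on the Protection array", "Partitioned"),
    Step(3, "Run DR from the Recovery vCenter (succeeds, with errors)", None),
    Step(4, "Bring up the Protection site, including ports/links", "Split"),
    Step(5, "Log out/log in to the Recovery vCenter and rescan SRAs", None),
    Step(6, "Run DR again (expected to fail)", None),
    Step(7, "Write disable the Protection site devices (R1)", "Failed Over"),
    Step(8, "Run DR a third time (completes without errors)", None),
    Step(9, "Run Reprotect (expected to fail)", "Suspended"),
    Step(10, "Manual, incremental establish of the RDF pairs", "Synchronized"),
    Step(11, "Run a second Reprotect, not forced (succeeds)", None),
]


def check_step(number: int, observed_state: str) -> bool:
    """Return True if the observed pair state matches what the step expects.

    Steps with no expected state always pass, since the list above
    does not name a state for them.
    """
    step = RUNBOOK[number - 1]
    if step.expected_pair_state is None:
        return True
    return observed_state.strip().lower() == step.expected_pair_state.lower()


if __name__ == "__main__":
    for step in RUNBOOK:
        state = step.expected_pair_state or "(no state named)"
        print(f"{step.number:>2}. {step.action} -> {state}")
```

For example, `check_step(7, "Split")` returns `False`, which would be your cue that the R1 devices have not yet been write disabled and that running DR a third time is premature.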

After step 11, the environment is ready for a Planned migration followed by another Reprotect to return the environment to its original configuration (this is included in the demo). That, however, is not required since it will bring down the VMs again. Some customers with co-located arrays might simply keep the new setup with reversed roles.

 
