SRM reprotect with SRDF/Metro and async leg

Our SRDF SRA supports many configurations – two-site, three-site, STAR, SRDF/Metro – well, you get the idea. For the majority of these environments, all SRM functions are supported: test, planned migration, failover, failback, and reprotect. There are, however, a couple of specific three-site setups where we cannot support reprotect. This comes down to an SRDF restriction: we cannot simply swap personalities between two of the sites because the third site prevents it. There are solutions to this, but as they require intervention on the management side (e.g., Solutions Enabler or Unisphere), it was decided that it would be dangerous to make those decisions for a customer. These management tasks also do not lend themselves to easy answers, so it is not something we could expose as a global option in the XML file.

Having recently helped a customer through one of these situations, I thought it would be beneficial to share it as an example. The SRDF SRA TechBook, which contains everything you need to know about SRM and the SRDF SRA, purposely avoids documenting any scenarios around the configurations that are not supported with reprotect, because doing so might imply there is only one way to do it, which is not the case. So before we begin, understand this is just one example of getting around the reprotect issue. My intent in this example is to use the SRM reprotect functionality as designed, by preparing the storage environment beforehand. This process is not supported by Dell EMC or VMware; unfortunately, these types of work-arounds are the only option available. While the individual commands (e.g., symrdf) are of course supported, what I outline below is not something Dell EMC will recognize as part of the SRA functionality.

As noted in the first paragraph, the configuration I am going to demonstrate does not support reprotect. This is a three-site SRDF/Metro configuration where one side of the Metro relationship, typically the R1, has an asynchronous leg. This is a common configuration because SRDF/Metro is an HA solution, not DR, so it is essential to have an extra copy somewhere at distance. Normally, in SRM environments that use SRDF/Metro, we see two vCenters in enhanced linked mode (ELM), which permits the user to fail over from the R1 to the R2, despite the fact that both devices and the datastores are active/active. Most customers in these setups use SRM to migrate rather than fail over. For instance, if I needed to do some maintenance on my R1 site (let’s assume a campus cluster with disparate data centers), I would use SRM to run a planned migration of my VMs on the R1 over to the R2. VMware, recognizing the datastore is not changing, would only execute a vMotion, and my VMs would move over very quickly. I could then reverse the process to restore them to the R1 site after maintenance is complete.

Our SRA, though, also offers the option to use a third, asynchronous site, rather than the R2 of the SRDF/Metro setup, as the recovery site. In the asynchronous setup, you could have a single vCenter for the SRDF/Metro portion – a true VMware Metro Storage Cluster – with the paired SRM site being the vCenter housing the asynchronous device. This is the configuration we’ll use.

I’m using a very simple setup of three devices – the R1/R2 of SRDF/Metro and the R2 for the asynchronous site – because the number of pairs does not change the process. The pairs are 56 R1 <-> 2A R2 (Metro) and 56 R1 -> 36 R2 (Async), i.e.:

    56 (R1) <----- SRDF/Metro -----> 2A (R2)
       |
       +--------- SRDF/A ----------> 36 (R2)

For the SRDF/Metro environment, we still recommend a non-uniform configuration, meaning the hosts associated with the R1 array are zoned only to it, and likewise the other hosts only to the R2 array. If you choose to configure a uniform setup, in which all hosts see both the R1 and R2 (i.e., cross-connect), it will not impact SRM device pair discovery, however.

Once SRM (6.7 in this case) is configured at each site, I paired them together.

In order for the SRDF SRA to fail over to this asynchronous site, we need to complete a couple of housekeeping steps. First, create a device group on each array manager (Solutions Enabler): one for the R1 device, 56, and one for the R2 device, 36. Note I said a device group and not a consistency group, as would be typical for an asynchronous pair; this is because SRDF/Metro does not permit the use of consistency groups. The second task is to set the FailoverToAsyncSite parameter to Yes in the EmcSrdfSraGlobalOptions.xml file.
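A minimal sketch of those two steps follows; the group names and array SIDs are illustrative assumptions, not values from my lab:

    # On the array manager local to the Metro R1 (SID is hypothetical):
    symdg create srm_metro_r1_dg -type RDF1
    symld -g srm_metro_r1_dg -sid 000197900056 add dev 056

    # On the array manager local to the async R2 (SID is hypothetical):
    symdg create srm_async_r2_dg -type RDF2
    symld -g srm_async_r2_dg -sid 000197900036 add dev 036

    # In EmcSrdfSraGlobalOptions.xml on both SRM servers, set:
    #   <FailoverToAsyncSite>Yes</FailoverToAsyncSite>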

This permits the discovery of device 36 at the remote site as shown here. You’ll see it has picked up my device group and datastore, and in this case I have already configured my protection group as well.

Now I’ll run through the failover and how we can reprotect and get back to the original state. The assumption I am working under is that the primary site remains operational. If the primary site failed, it would be a more involved process than what I am describing.

Start with a typical failover. Given that the primary site is available, it is your choice whether to execute a planned migration or a disaster recovery. In general, the difference is that a planned migration will fail when almost any error is encountered, while a disaster recovery will simply ignore the errors. My recommendation is to use a planned migration: since we wish to use reprotect through SRM, there is a higher likelihood of success if a migration is run.

I run my planned migration, no errors.

After the failover/migration, here is what the pairs look like. Our SRDF/Metro pair is Suspended while the asynchronous pair is appropriately Failed Over.
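If you want to confirm those states from Solutions Enabler rather than the SRM screens, a file-based query along these lines should work; the SID, RDF group numbers, and pair files are illustrative assumptions:

    # Query each leg of the concurrent relationship (values hypothetical);
    # each pair file lists one "R1dev R2dev" pair per line, e.g. "056 02A":
    symrdf -sid 000197900056 -rdfg 10 -f metro_pairs.txt query   # expect: Suspended
    symrdf -sid 000197900056 -rdfg 20 -f async_pairs.txt query   # expect: Failed Over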

Now, if I hadn’t read the TechBook where it says reprotect is not supported, I might immediately try to run a reprotect. I then might be surprised to see it fail with this:

SRM complains that it cannot reverse replication – or, from our perspective, that SRDF cannot swap personalities. This is because SRDF/Metro is part of our configuration, so the swap is blocked.


I’m taking a short intermission here because the remaining steps involve removing the SRDF/Metro pair, which means at the end of this process you will have to re-create it and incur a full sync. Unfortunately, if we want to use SRM, this is what is needed. As I mentioned in the second paragraph, however, this is not the only way. You can avoid the full sync of SRDF/Metro by running one or two Solutions Enabler commands instead. Leaving aside the VMware steps for the moment: if we assume the failover went as planned, our Metro pair is Suspended and our Async pair is Failed Over. At that point we could either immediately run a symrdf failback, or, if we have been using the Async site for some time, first run symrdf update, which pushes the accumulated changes in the background while the R2 remains in use, and then run the failback once the changes are sent so the device is immediately up-to-date. Then we re-establish the Metro pair and are back where we started (a sketch of those commands follows below).

It is a good way to avoid the full resync if your pairs are large; however, you will have to do all the VMware steps manually, which would include shutting down all VMs, unregistering them, unmounting datastores, masking, etc. Then you still have to reconfigure SRM. I understand this may not sound better than what I am detailing here, but it is another option. OK, back to the show.
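Something like the following, where the SID, RDF group numbers, and pair files are again illustrative assumptions (and a witness or bias setting may factor into the Metro establish, depending on your environment):

    # Push changes accumulated at the async site back toward the R1
    # while the R2 stays online (optional if little has changed):
    symrdf -sid 000197900056 -rdfg 20 -f async_pairs.txt update

    # Once the invalid tracks drain, swap back to normal R1 -> R2 flow:
    symrdf -sid 000197900056 -rdfg 20 -f async_pairs.txt failback

    # Finally, re-establish the suspended Metro pair (no full sync):
    symrdf -sid 000197900056 -rdfg 10 -f metro_pairs.txt establish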


We need to take steps to prepare the SRDF environment so that the reprotect will be successful – or rather, so that the reverse replication (the swap) works. The way to do this is to remove SRDF/Metro from the three-site relationship, making it a two-site solution. Since the SRDF/Metro pair is already in a Suspended state, we simply need to delete the pair.
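For reference, the deletepair would look roughly like this, again with illustrative values:

    # Remove the suspended Metro pairing; 056 and 2A remain as
    # independent devices and the async leg is untouched:
    symrdf -sid 000197900056 -rdfg 10 -f metro_pairs.txt deletepair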

You might wonder why we couldn’t simply convert the Metro pair to a Synchronous pair, avoiding the delete and thus the full sync when the pair is re-created. It’s a fair question. Unfortunately, that conversion is blocked in the code, so deleting the Metro pair is our only option here.

With the pair removed, we can now run the reprotect successfully.

At this point, my asynchronous site is now my protection site. I follow the reprotect, therefore, with another planned migration and a final reprotect. This returns me to the original protection site, which is the R1 (56) of the SRDF/Metro pair we recently deleted. For clarity, here are the steps:

1. Planned migration from the SRDF/Metro site to the asynchronous site.
2. Delete the suspended SRDF/Metro pair.
3. Reprotect (the asynchronous site becomes the protection site).
4. Planned migration back to the original site.
5. Final reprotect (the original site is the protection site again).

We have one final step, and that is to rebuild the SRDF/Metro pair. This has two parts. First, we need to set the former R2 device (2A) ready to the host. When you run a deletepair for SRDF/Metro, the R2 is set not ready to the host; in order to use it again we have to ready it, and doing so requires symforce. Symforce requires a change to the Solutions Enabler options file on whatever host is issuing the command: the parameter SYMAPI_ALLOW_RDF_SYMFORCE should be set to TRUE. It is unnecessary to restart any daemons. (BTW, this is a powerful switch that should be used only with great caution in other circumstances.) Once the device is ready, we can recreate the pair and wait for the full sync. The output below shows, first, that only the asynchronous pair is configured; then the create initially failing because the device is not ready; then the device being set to ready; and finally the SRDF/Metro pair being re-created.
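A hedged sketch of those commands, with the SIDs, RDF group, and file names as illustrative assumptions:

    # In the Solutions Enabler options file on the host issuing the
    # command (e.g. /var/symapi/config/options), add or set:
    #   SYMAPI_ALLOW_RDF_SYMFORCE = TRUE
    # No daemon restart is required.

    # Make the former Metro R2 ready to the host again:
    symdev -sid 000197900002 ready 02A -symforce

    # Recreate the Metro pair and let the full synchronization run:
    symrdf createpair -sid 000197900056 -rdfg 10 -f metro_pairs.txt -type R1 -metro -establish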

And here is our final state, with everything back where we started.

As I mentioned at the beginning, these types of management actions are ones which we do not enable in the SRDF SRA, nor do we plan to. They should be undertaken with great thought and planning. The process I laid out is certainly not the only way to go about reprotecting the environment, so be sure to do what makes sense for your business and requirements.

2 thoughts on “SRM reprotect with SRDF/Metro and async leg”

  1. Another great post Drew.

    One doubt: in a scenario of two sites protected by SRDF/S, suppose I simulate a DR (stopping SAN/LAN communication between the sites) and perform a failover in SRM at the second site. After performing all tests with SRM and the VMs, I’d like to disable/delete SRDF on those LUNs, remove all VMs at the second site, re-establish communication between the sites, and then use the VMs at Site 1 in their last state. Is that possible? I know it isn’t a best practice, but today my DR is designed like that because I have some problems replicating my physical DBs at the second site. Today I do it using RecoverPoint: when I finish the tests, I remove all VMs and protection groups in RecoverPoint before re-establishing communication between the sites, and then I recreate the protection group in RecoverPoint. With SRDF and VMAX I’m planning to replicate the LUNs of my physical DBs to avoid this “workaround”, but I’d like to know whether this scenario is possible and what risks I have.

    1. So best to read the TechBook I link in the post, as it includes the various ways testing can be performed. We don’t recommend running a real DR test; rather, you should use the test functionality in SRM. If your business requires it, you can test directly against the R2 devices themselves. The preferred method is the default one, which takes snapshots of the R2 devices and makes those available for testing so the R2 is not impacted. That type of testing is still available even if you simulate a DR by dropping the links. If you must run a “real” DR as you propose above, you can dispose of the R2 data, but it requires a full resync from the R1, and during that time you are at risk if there is a failure of the R1.
