MetroDR with SRDF SRA

Support for MetroDR in the SRDF SRA has been a long time coming for us. As I’ve noted in other blog posts, we needed some changes in the code, in particular Solutions Enabler, to add the SRA support. Those changes are now in Solutions Enabler 10.1.0, and thus the new SRDF SRA 10.1.0 supports MetroDR. Let’s start the dig…

MetroDR

Just to set the stage, MetroDR is a 3-site solution composed of SRDF/Metro between two sites, and SRDF/A from both the R1 and R2 of the Metro to a third DR site. SRDF/A is only ever active between one side of the Metro pair and the DR site, though both sides are capable. In the event of a failure of one of the Metro arrays, the other will continue the SRDF/A relationship. Here is what the VMware setup looks like with MetroDR:

Requirements

Because the changes are in Solutions Enabler, it means you will not require the latest PowerMaxOS (array code) to use MetroDR with the SRA. In fact, we can support a ways back. Here are the requirements:

  • Solutions Enabler 10.1 (in order to take full advantage of the configuration, an external SE is needed, not the SE on the array)
  • HYPERMAXOS 5978 Q1 2020 SR or greater or any PowerMaxOS for each of the arrays
  • FailoverToAsyncSite=YES in EmcSrdfSraGlobalOptions.xml at BOTH sites (this is the current requirement for any 3-site config failing over to async site)
  • Each MetroDR environment must have a unique name (this is already a requirement for MetroDR setups)
  • Device or composite groups are NOT required – this is specific to MetroDR only. The SRA uses the environment name instead.
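
If you want to sanity-check the FailoverToAsyncSite setting at both sites, something along these lines could help. This is only a minimal sketch: it assumes the option appears as an element named FailoverToAsyncSite somewhere in EmcSrdfSraGlobalOptions.xml, and the sample XML below is an illustrative stand-in rather than the shipped file layout. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for EmcSrdfSraGlobalOptions.xml. The wrapper
# elements here are assumptions, not the real file's layout.
SAMPLE_OPTIONS = """<EmcSrdfSraGlobalOptions>
  <SraConfig>
    <FailoverToAsyncSite>Yes</FailoverToAsyncSite>
  </SraConfig>
</EmcSrdfSraGlobalOptions>
"""


def failover_to_async_enabled(xml_text: str) -> bool:
    """Return True if a FailoverToAsyncSite element exists and is set to yes."""
    root = ET.fromstring(xml_text)
    # Search all descendants so the check does not depend on nesting depth.
    node = root.find(".//FailoverToAsyncSite")
    if node is None or node.text is None:
        return False
    return node.text.strip().lower() == "yes"


print(failover_to_async_enabled(SAMPLE_OPTIONS))
```

You would run the same check against the file contents from the protection site and the recovery site, since the option has to be set at both.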

Restrictions

And as always, with requirements come restrictions.

  • No support for REPROTECT operation (same limitation for current 3-site Metro)
  • No support for using only the Metro pair – MetroDR can only fail over to the ASYNC site

A few words about these restrictions. The reprotect restriction has been there since we first supported 3-site Metro; however, MetroDR is another animal altogether. I was able to provide a couple of workarounds for concurrent 3-site SRDF/Metro, but I’m afraid to say there is no such thing for MetroDR. Because of the nature of the DR legs going to a single device, there is no easy way to decouple the setup. In fact, if you start using MetroDR in a test environment, you’ll see that once you run a failover you can’t simply tear the environment apart because the commands are blocked. If you are into the wonky, you can read about what commands you can run based on the SRDF states here. Ultimately, then, once you fail over in SRM and start using the DR site, reconfiguring MetroDR will take some doing, and may involve support’s help. You will also need to remove the recovery plan and protection group in SRM, since you’ll be rebuilding them after you recover the MetroDR setup. I don’t expect this reprotect issue will be resolved any time soon, though the ability to tear down MetroDR is something that will be addressed in Solutions Enabler/PowerMaxOS in the future.

The second restriction is for a configuration that I think I’ve seen only once at a customer. When you use regular 2-site Metro with SRM, you use two vCenters in Enhanced Linked Mode (ELM). We still allow you to do that with 3-site Metro, but it is not supported with MetroDR.

Setup

When you set up MetroDR with the SRA, there is a preferred configuration, that being a uniform VMware Metro Storage Cluster (vMSC). In a uniform vMSC with SRM, you use a single vCenter for your protection site and present devices from both the R1 and R2 to the ESXi host. This means that if you lose an array, you only lose the paths to that array and VMware continues to operate on the remaining paths to the remaining array. Generally, unless your arrays are very close together, we recommend a non-uniform configuration with vMSC where each ESXi host only sees its local array. In the event of failure, VMware HA restarts the VMs on another host with access to the other array.

The reason we prefer uniform with MetroDR and the SRA is that it enables you to reconfigure SRM to use the R2 if the R1 fails. Remember, in a typical 3-site Metro, only the R1 is transmitting data to the DR site. If you lose the R1, you have to fail over since the R2 is not replicating to the DR. You, of course, can continue to run on the R2 instead, but then the DR site is useless since it is immediately behind production. Since in MetroDR both the R1 and R2 can send data to the DR site, if you lose the R1 you can tell SRM to now use the R2, and then you continue to run on the R2. If the R2 then fails, you can fail over to the DR site knowing it is in sync (async) with the R2.
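
To make the difference concrete, here is a toy model of which Metro array can still feed the DR site after a failure. It is just a sketch of the reasoning in this section; the topology labels and function name are made up for illustration and do not correspond to anything in the SRA or Solutions Enabler.

```python
from typing import Optional, Set


def dr_source_after_failures(topology: str, failed: Set[str]) -> Optional[str]:
    """Return the Metro array that can replicate to DR, or None if none can.

    topology is a hypothetical label:
      "3-site-metro": only the R1 has a leg to the DR site
      "metrodr":      both the R1 and R2 are capable of SRDF/A to DR
    """
    candidates = ["R1"] if topology == "3-site-metro" else ["R1", "R2"]
    for array in candidates:
        if array not in failed:
            return array
    return None


# Losing the R1 in 3-site Metro leaves nothing replicating to DR.
print(dr_source_after_failures("3-site-metro", {"R1"}))
# Losing the R1 in MetroDR leaves the R2 to carry the DR leg.
print(dr_source_after_failures("metrodr", {"R1"}))
```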

Solutions Enabler configuration

So how exactly would this SRM reconfiguration work? Well, the key is Solutions Enabler. Normally when you set up SE for the protection and recovery sites, each SE sees its own array as local and the other array as remote. This is still true for 3-site Metro since the R1 is the only array replicating. But for MetroDR, since both the R1 and R2 can replicate, you present both of them to the protection site SE. So it sees both Metro arrays as local, and the recovery site array as remote. Conversely, the recovery SE sees the DR array as local and the two Metro arrays as remote. Here are both SEs showing that:
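
The expected local/remote layout can also be written down as a simple check. This is a minimal sketch that assumes you have already collected each SE host's view of the arrays by hand (for instance from the attachment column of a `symcfg list`); the array labels, the dictionary format, and the function name are all hypothetical.

```python
# Hypothetical array labels; a real environment would use array serial numbers.
METRO_ARRAYS = ("metro_r1", "metro_r2")
DR_ARRAY = "dr"


def check_metrodr_views(protection_view: dict, recovery_view: dict) -> list:
    """Return a list of problems; an empty list means both SE views
    match the MetroDR layout described in this section."""
    problems = []
    for array in METRO_ARRAYS:
        if protection_view.get(array) != "local":
            problems.append(f"protection SE should see {array} as local")
        if recovery_view.get(array) != "remote":
            problems.append(f"recovery SE should see {array} as remote")
    if protection_view.get(DR_ARRAY) != "remote":
        problems.append(f"protection SE should see {DR_ARRAY} as remote")
    if recovery_view.get(DR_ARRAY) != "local":
        problems.append(f"recovery SE should see {DR_ARRAY} as local")
    return problems


protection = {"metro_r1": "local", "metro_r2": "local", "dr": "remote"}
recovery = {"metro_r1": "remote", "metro_r2": "remote", "dr": "local"}
print(check_metrodr_views(protection, recovery))
```

A protection SE that only sees the R1 as local (the usual 2-site habit) would show up here as a problem for the R2.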

Array pairs

When you then go add the array pairs in SRM, you will see both the R1 and R2 as available to enable.

Since you can only configure a protection group with one array pair, you only enable the R1 pair (technically you could do both but it gets confusing as I point out in the demo). If the R1 fails at a later time, you can then reconfigure SRM as follows:

  1. Remove recovery plan
  2. Remove protection group
  3. Disable the R1 array pair
  4. Enable the R2 array pair
  5. Create protection group
  6. Create recovery plan

These steps for a typical environment take a matter of minutes to complete. This is how you can remain protected with MetroDR even when the R1 fails.
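
The ordering matters, so here is a toy walk-through of the six steps. To be clear, this is an in-memory model for illustration only and not an SRM API; the class, its methods, and the ordering checks it enforces are my assumptions about why the steps are listed in that sequence.

```python
from dataclasses import dataclass, field


@dataclass
class SrmModel:
    """Toy in-memory stand-in for SRM state; this is not an SRM API."""
    enabled_pair: str = "R1"
    has_protection_group: bool = True
    has_recovery_plan: bool = True
    log: list = field(default_factory=list)

    def remove_recovery_plan(self) -> None:
        self.has_recovery_plan = False
        self.log.append("remove recovery plan")

    def remove_protection_group(self) -> None:
        # Modelled assumption: the plan goes before the group it uses.
        if self.has_recovery_plan:
            raise RuntimeError("remove the recovery plan first")
        self.has_protection_group = False
        self.log.append("remove protection group")

    def switch_array_pair(self, new_pair: str) -> None:
        # Modelled assumption: a pair still backing a group cannot be swapped.
        if self.has_protection_group:
            raise RuntimeError("remove the protection group first")
        self.log.append(f"disable {self.enabled_pair} array pair")
        self.enabled_pair = new_pair
        self.log.append(f"enable {new_pair} array pair")

    def create_protection_group(self) -> None:
        self.has_protection_group = True
        self.log.append("create protection group")

    def create_recovery_plan(self) -> None:
        if not self.has_protection_group:
            raise RuntimeError("create the protection group first")
        self.has_recovery_plan = True
        self.log.append("create recovery plan")


def reconfigure_for_r2(srm: SrmModel) -> SrmModel:
    """Run the six steps in the order listed above."""
    srm.remove_recovery_plan()
    srm.remove_protection_group()
    srm.switch_array_pair("R2")
    srm.create_protection_group()
    srm.create_recovery_plan()
    return srm


for step in reconfigure_for_r2(SrmModel()).log:
    print(step)
```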

Demo

The setup and reconfiguration I want to show is not really possible with written words and images without being its own whitepaper. I recorded a demo instead. It’s long for a demo – both the steps and the fact I talk a lot – but I wanted to walk through exactly how to set up the environment and then reconfigure it if the R1 failed. I don’t run a testfailover or failover as there is nothing new to show in that space. And as reprotect is not supported, I could not do that.

In the demo I am using the preferred configuration of a uniform SRDF/Metro, but if you want to use non-uniform (as is generally recommended for vMSC), the SRM reconfigure process is essentially the same if the R1 fails. The only difference would be that since each ESXi host only sees either the R1 or R2, if the R1 fails you should wait for VMware HA to restart the VMs on the R2 before creating the new protection group(s) and recovery plan(s).

One other thing is that if you normally use the embedded Solutions Enabler on the array (which is not recommended), it does not work with MetroDR because there is no way for that implementation to see both Metro arrays as local. You will need to install a separate Solutions Enabler.

If anything is confusing please leave a comment.
