Partial SRDF/Metro 3-site with SRM (SRA)

I recently ran across a customer configuration I had not seen before with 3-site SRDF/Metro (ASYNC leg with failovertoasync site set in Global XML), and it gave me some concern. Let me describe the setup and then cover the concerning aspects. I will say from the outset that I didn’t think this setup was even possible, so customers are unlikely to configure it, even by accident. But the implications are serious enough that I want to cover it.

Background

I suppose it goes without saying that the SRDF SRA and the underlying replication software, SRDF, are not developed in concert. The SRDF SRA is just a consumer of the technology, much like the REST API. Therefore, the SRDF software will have functionality that the SRDF SRA either does not have (yet), e.g., MetroDR, or simply cannot have because of the limitations of VMware SRM, e.g., 2-leg SRDF/Metro. In the situation here, we end up somewhere in the middle: SRDF permits the configuration, and the SRDF SRA does not have the code to block it.

SRDF Modes

In SRM, we only deal with three of the SRDF modes: SYNC, ASYNC, and SRDF/Metro (Active). Each SRDF group can only be in one of these modes (putting aside the other modes for now). When manipulating SRDF device pairs in SRDF groups, the mode dictates whether you can issue commands on individual pairs or only on the group as a whole. In SYNC SRDF groups, you are allowed to treat each pair as its own entity. This means that if my SYNC group has 10 pairs, I could choose to fail over a single device. In ASYNC or Metro groups, by contrast, the pairs must be treated as a whole. If I have 10 pairs, I must fail them all over, or none. I cannot manipulate only part of the group. This is to ensure consistency. With that explanation in hand, let’s look at the configuration in question.
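To make the rule concrete, here is a minimal Python sketch of it. This is only a model of the behavior described above, not SRDF or SRA code; the names `Mode`, `GROUP_LEVEL_MODES`, and `can_operate` are made up for illustration.

```python
from enum import Enum


class Mode(Enum):
    SYNC = "SYNC"
    ASYNC = "ASYNC"
    METRO = "SRDF/Metro (Active)"


# Modes whose pairs must be manipulated as a whole group (per the text above).
GROUP_LEVEL_MODES = {Mode.ASYNC, Mode.METRO}


def can_operate(mode, group_devices, requested_devices):
    """Return True if an operation on requested_devices is allowed.

    Hypothetical model: SYNC allows any non-empty subset of the group,
    while ASYNC and Metro require every pair in the group.
    """
    group = set(group_devices)
    requested = set(requested_devices)
    if not requested or not requested <= group:
        return False  # nothing requested, or devices outside the group
    if mode in GROUP_LEVEL_MODES:
        return requested == group  # all or nothing
    return True  # SYNC: each pair is its own entity
```

With a 10-pair group, `can_operate(Mode.SYNC, range(10), [3])` is allowed, while the same single-device request against an ASYNC or Metro group is not.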

Configuration

Three arrays:

  • 302 – Metro R1
  • 305 – Metro R2
  • 341 – Concurrent ASYNC R2 from 302

In this configuration, I began by replicating a storage group, partial_metro_async, with SRDF/Metro between arrays 302 and 305. This group contains four 100 GB devices in SRDF group 11. You can see below I am using the default ActiveBias, though it would not make a difference if I had a witness.

In the second step, I can’t use Unisphere because I want to replicate only a couple of the devices from SRDF group 11 to my third array, 341. Using the CLI, I create two pairs from array 302 to 341. You can see below these pairs are in SRDF group 12.

Testing

As required, I now create a composite group metro_async on each array manager (Solutions Enabler) and discover the devices in SRM.

Now I run a test. No issues there: the snapshots are created on array 341 and mounted.

Since all the failover test does is create a snapshot on the ASYNC devices on array 341, it never has to run SRDF commands on either the Metro or ASYNC pairs. This is important going forward.

Planned Migration/Failover

Now let’s say I want to run a planned migration or failover. Recall they are essentially the same thing, the difference being planned migration has more checks to ensure the protection site is still operational, while failover barrels through; but both issue SRDF failover commands. After executing my planned migration I am greeted with the following:

Well, things didn’t work. If I look at the device pairs now I see:

Strange, my ASYNC devices are Suspended, but the Metro ones remain ActiveBias. Well, as I need to get things over to the DR site, I’m going to do a failover and hopefully SRM forces through the errors. Here goes…

Hmm. Those errors look pretty bad. Let’s look at the SRA log where it appears the SRDF failover command was unable to execute.

But why didn’t it work? Well, back at the beginning of the post I noted that both Metro and ASYNC SRDF groups must act as a whole. If you fail over one device, you must fail over all of them. So when the SRA tries to fail over just two of the devices in SRDF group 11, Solutions Enabler responds that it can’t. You must fail over all of them. The SRA is not coded to fail over all of them, which is good because the business would not expect it. Helpfully, though, upon this failure the SRA returns the ASYNC devices to their original state of Consistent, so we are back where we started before the planned migration. What to do?
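If you want to check whether an environment has this layout, the logic is simple enough to sketch in Python. This is a hypothetical helper, not something the SRA or Solutions Enabler provides: `find_partial_three_site` and the device names are invented, and the input is assumed to be a mapping you build yourself from your own group listings.

```python
def find_partial_three_site(groups):
    """Return (metro_group, async_group, uncovered_devices) tuples.

    groups maps an SRDF group number to a (mode, devices) pair, where
    devices are the R1-side devices in that group. A finding means some,
    but not all, of a Metro group's devices also have an ASYNC leg.
    """
    findings = []
    for metro_no, (metro_mode, metro_devs) in groups.items():
        if metro_mode != "METRO":
            continue
        for async_no, (async_mode, async_devs) in groups.items():
            if async_mode != "ASYNC":
                continue
            shared = set(metro_devs) & set(async_devs)
            uncovered = set(metro_devs) - set(async_devs)
            if shared and uncovered:
                findings.append((metro_no, async_no, sorted(uncovered)))
    return findings


# Layout from this post: four Metro devices in group 11, two of them
# also replicated asynchronously in group 12. Device names are made up.
layout = {
    11: ("METRO", ["dev1", "dev2", "dev3", "dev4"]),
    12: ("ASYNC", ["dev1", "dev2"]),
}
```

For the layout in this post, the helper reports group 11 against group 12 with `dev3` and `dev4` as the Metro devices that have no ASYNC leg, which is exactly the partial case that trips up the failover.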

Workaround?

Is there a way to beat the system here? Well, a couple of things would make a difference. First, if you needed to run a true failover because the protection site was gone, and therefore the Metro pairs were no longer in an Active state, it is likely the SRM recovery plan would succeed. Often, however, customers run a failover when the devices are all operational (and planned migration requires that they be), so we need another way. In that case, you need to manually suspend the Metro SRDF group, i.e., all pairs. You can do this in Unisphere or CLI:

Once the pairs are suspended, you are permitted to act upon the ASYNC group. When I run a failover now, no issues.
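As a pre-check before running the recovery plan, the condition above can be expressed as a small Python sketch. It is illustrative only: the function names are invented, the device names are made up, and the pair states are assumed to have been collected by you from Unisphere or the CLI into a dictionary.

```python
# Metro pair states that count as "active" for this check.
ACTIVE_STATES = {"ActiveActive", "ActiveBias"}


def metro_pairs_blocking_failover(pair_states):
    """Return the devices whose Metro pair is still in an active state.

    pair_states maps a device name to its Metro pair state string.
    """
    return sorted(
        dev for dev, state in pair_states.items() if state in ACTIVE_STATES
    )


def ready_for_async_failover(pair_states):
    """True only when no Metro pair in the group is still active."""
    return not metro_pairs_blocking_failover(pair_states)


# Hypothetical states for the four Metro devices in SRDF group 11.
before = {"dev1": "ActiveBias", "dev2": "ActiveBias",
          "dev3": "ActiveBias", "dev4": "ActiveBias"}
after = {dev: "Suspended" for dev in before}
```

Here `ready_for_async_failover(before)` is false and `ready_for_async_failover(after)` is true. Note that the check covers every pair in the Metro group, not only the two devices with an ASYNC leg, because the whole group has to be suspended.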

Note that reprotect is not supported in 3-site Metro configurations.

Supportability

Having read through this, you might understandably wonder whether the SRDF SRA should even support this type of configuration. Personally, I don’t think it should. In fact, I don’t think SRDF should allow this, period. As I said, I’ve only seen it once. Customers who want 3-site configurations for some of their Metro devices would normally separate them into their own SRDF group. That is the best practice regardless of VMware. (As an aside, there are ways to switch the above configuration online to one the SRA can work with all the time.) As of today (10/29/2022), this is still technically supported, and as you read, testing works perfectly fine. But failover is the operation customers want to work when they need it, and in this configuration you will need to ensure the entire Metro SRDF group is not in an active state before running the recovery plan. Just remember that if the idea of this type of environment is to do a partial failover of only some devices, you can’t achieve that. All Metro devices in the group will need to be suspended. If that is undesirable, it is best to move the few Metro devices in 3-site configurations to their own SRDF group.

***Update 11/15/2022***

We’ve decided to end support for this configuration. The SRA 10 Release Notes will be updated with this information.

**************************
