KB on 3-site SRDF/Metro Cascaded with SRM

I’ve been helping development with some future configuration scenarios for SRM and the SRDF SRA, and came across a behavior that could cause problems for customers running a 3-site SRDF/Metro cascaded setup. Funny thing is, the array code behavior that brings about the issue has been around for at least 2 PowerMaxOS releases, but I have not had any customer hit it. This confirms a couple of data points I have taken for granted with the SRA. The first is that most of our customers are not running 3-site configurations with SRDF/Metro. Since it requires 3 arrays, it is quite an investment, so that makes sense. The second is that the majority of customers who do run them use a concurrent, rather than cascaded, model, meaning the asynchronous device is paired to the R1. Again, this makes sense: the bias is on the R1, and if your Metro pair were forced to use bias over witness (because all your witnesses failed or you used ActiveBias), you would want your asynchronous leg to be on that surviving site. But if you do want to use cascaded, where the asynchronous leg is off the R2, please continue reading.


To explain the configuration issue, I don’t plan on going full “TechBook” on you, or my fingers will cramp and we’ll be here forever. I am just going to explain it plainly and then show you a couple screenshots for reinforcement. As the English say, let’s crack on.

In any SRDF/Metro configuration we recommend the use of a witness. We offer both physical (array) and virtual (vWitness) witnesses and suggest having several of them, in a site or sites separate from either array. We can cycle through 30+ of them if failures require it. Even if they all fail, we will fall back to the original functionality of bias (mentioned above). Now when the witness was first introduced, and I created an SRDF/Metro pair and let it sync, my original R1 would remain R1. In the grand scheme of things, that really doesn’t matter because in the event of a failure the witness will choose the proper winner, R1 or R2; but a pair does require an R1 and an R2. With each subsequent release of PowerMaxOS we’ve made the witness a little more intelligent. It uses a whole slew of factors during establish (sync) to pick the R1. The idea is that if all witnesses did fail, we want the bias side to be the device most likely to survive, or the one it makes most sense to have survive (I’m coming to this part). All things being equal, however, in a single device pair your R1 is still likely to remain R1.

One of the factors that our code takes into account when choosing an R1 is whether there is another replication relationship. It will favor that device over the other and thus if I had an asynchronous leg off one side, that device is going to be the R1. It does make sense since we would prefer to keep the device that has another copy of our data somewhere else. All good so far? OK moving to SRM.
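To make that one factor concrete, here is a toy Python sketch. It is not the PowerMaxOS logic, which weighs many factors I am not showing; the `MetroDevice` class, the `choose_r1` function, and the array labels "A" and "B" are all made up for illustration. It only models the rule described above: a device with another replication relationship is favored as R1, and otherwise the current R1 keeps its role.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetroDevice:
    """Hypothetical stand-in for one side of an SRDF/Metro pair."""
    array: str                   # made-up array label
    has_other_leg: bool = False  # True if the device has another replication relationship


def choose_r1(current_r1: MetroDevice, current_r2: MetroDevice) -> MetroDevice:
    """Toy rule: favor the side with an extra replication leg, else keep the current R1."""
    if current_r2.has_other_leg and not current_r1.has_other_leg:
        return current_r2
    return current_r1


# No extra leg on either side: the original R1 keeps its role.
print(choose_r1(MetroDevice("A"), MetroDevice("B")).array)

# An asynchronous leg hangs off the current R2: the roles swap at establish.
print(choose_r1(MetroDevice("A"), MetroDevice("B", has_other_leg=True)).array)
```

The second call is the case that matters for the rest of this post: the side that started out as R2 comes back from an establish as the R1.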

Knowing the behavior, time to see what the SRA does with this. Here is my environment in the following modified Unisphere picture. I started with an SRDF/Metro pair between arrays 355 and 358 with a witness. The pairs completed synchronization and my device 4B on 355 became the R1. I then added an asynchronous leg to the R2, DD on 358, to array 450.

My SRM environment is 2 vCenters, one a vMSC environment seeing both Metro arrays (355 and 358) and the other is at the asynchronous site, 450. I created a single datastore on device 4B/DD. I then went through the SRM setup, configured the SRA global options file to look at the asynchronous site, and discovered my pairs. The SRA finds the cascaded pair of 4B and 7D. It’s all good.

I can set up my SRM objects, do testing, etc. and all works great. Now at some point, for maintenance on the R2 array let’s say, I have to suspend my SRDF/Metro pair. I complete my work and re-establish my Metro pair (btw, if you started with Bias and then moved to a Witness during re-establish, it would lead to the same problem). At this point our intelligent code takes over and decides that the R1 should actually be the device with the asynchronous pair, as I explained above. Watch the progression in this screenshot of how the original R1 becomes the R2:

So the code certainly did what it was designed to do, but because SRM is set up to look at array 355 as our R1, look what happens:


We’ve lost our datastore and thus our protection group is invalid:


Trouble indeed. Is there a workaround? Nothing great, I’m afraid. If, for some reason, your environment chooses the original R1 (based on those other factors I noted exist, though this would probably mean an issue with the R2) then you are in luck. For the rest, in order to return R1 status to the original device, we’d have to suspend and then re-establish with the bias flag, a method we would not recommend as it removes the witness.

The best option is going to be to use a concurrent setup from the start, and begin your device pair creation from whichever array you want to have the asynchronous leg on. In my example above, therefore, I would have started my pair creation from array 358, and thus even after a suspend and re-establish it would remain the R1 as it has the asynchronous pair to 450.

If you already have a cascaded setup and have not had this problem yet (you’ve never suspended the RDF pair), you can prepare for it by adding Gatekeepers from your current R2 array to the array manager of the R1 and then enabling the new array pair in SRM (alternatively, you could disable the array pair and then add a different R1 array manager to SRM). In my example here I have already re-established the pair and now have a concurrent setup.

Unfortunately you will have to re-create the protection group and recovery plan since the array pairs have changed.
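If it helps to see the before and after side by side, here is a small Python sketch of my example. It is purely illustrative and not how the SRA works internally; the `classify_topology` and `srm_sees_r1` functions are hypothetical, and `configured_array` simply stands for the array SRM was set up to treat as the R1 side. The array numbers are the ones from my environment above.

```python
def classify_topology(r1_array: str, r2_array: str, async_source: str) -> str:
    """Return 'concurrent' if the async leg is off the R1, 'cascaded' if it is off the R2."""
    if async_source == r1_array:
        return "concurrent"
    if async_source == r2_array:
        return "cascaded"
    raise ValueError("the asynchronous leg must hang off one of the two Metro arrays")


def srm_sees_r1(configured_array: str, r1_array: str) -> bool:
    """Toy check: does the array SRM was set up against still hold the R1?"""
    return configured_array == r1_array


# Before the suspend: R1 on 355, R2 on 358, asynchronous leg from 358 to 450.
print(classify_topology("355", "358", async_source="358"), srm_sees_r1("355", "355"))

# After the re-establish: the roles have swapped, the asynchronous leg has not moved.
print(classify_topology("358", "355", async_source="358"), srm_sees_r1("355", "358"))
```

The asynchronous leg never moves. Only the R1/R2 roles swap, which turns the cascaded layout into a concurrent one while SRM is still pointed at the old R1 array.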

I have summarized this in KB 000543358.

In our next PowerMax release we do have some changes to SRDF/Metro 3-site behavior which will make this post somewhat moot, but in the meantime please keep it in mind if you have or were considering using cascaded with Metro and SRM.

 

*** Update 5-22-20 ***

Discussing this issue internally, it appears that we are going to remove support for all SRDF/Metro 3-site cascaded configurations and only support concurrent setups. As I’ve explained above, short of using Bias instead of a Witness, the configuration is not viable, and we want customers to use the Witness as it is far superior to Bias. If you already have this configuration and, for business reasons, are using Bias and not Witness, we would still support your environment since you can control where the R1 is located.


4 thoughts on “KB on 3-site SRDF/Metro Cascaded with SRM”


  1. Hi, do you happen to have documentation, or a guide, on configuring the connection between 2 PowerMax 2000 arrays for SRDF/Metro?

      1. I’ve only created SRDF Groups and pairs on an SRDF/Metro config before, but I’ve been asked to configure the physical connectivity and also create the logical connectivity, zoning and configuration between the 2 PowerMax arrays so they can communicate and I can create SRDF Groups. I can zone, but the logic behind the SRDF zoning is what I have not been able to find.
