Occasionally I’ll get a question from a customer or even employee about doing failover testing in SRM with our SRDF SRA. While I have this documented extensively in the TechBook (TB), and most who contact me have been trying to use it, perhaps it is the very detailed nature of it that makes it challenging and gives rise to the problems that customers and employees alike have. As these problems all tend to fall into the same category, I want to see if I can provide some clarity. In all honesty, I’m doing this more for my sake so I can point people to a reference, than any altruistic motive, but if in the end we both get what we need I’m good with that.
Disclaimer: Not sure that is the correct term here, but anyway I just want to emphasize that I'm not providing any new revelations here. All of what I am about to say is covered in the TechBook. I'm simply going to condense the milk (or evaporate it depending on your dessert).
Testfailover seems to be the most troubling part of configuring SRM with SRDF for customers. They are usually able to get disaster recovery working fine, which is good since that is the most important activity you will run with SRM; but testing is a challenge for many. Fortunately, when I am asked for assistance, I see the same issues again and again. So I’m going to try to provide a high-level explanation of testfailover with the SRDF SRA which I hope will give customers a better foundation for configuring their environment. Armed with this information, the details contained in the TechBook will perhaps be clearer.
How does SRM work with arrays?
I want to start with some information about how SRM works with array replication:
- SRM is a two-site solution, regardless of the underlying replication setup. What I mean by that is whether you run a 2-site SRDF configuration or a 3-site, you can only use one of the pairs with SRM. With 2-site SRDF the configuration works as is. With 3-site SRDF, if you want to use the third site (ASYNC), you need to change the XML parameter FailoverToAsyncSite to Yes at each SRM site.
- VMware orchestrates everything that happens in the vCenter. When running a test failover, planned migration, or disaster recovery, VMware performs the necessary tasks on the protection site, then passes control to the array (SRA). When the array tasks are complete, it passes control back to VMware and VMware performs the necessary tasks on the recovery site. Depending on availability of the protection site, the VMware tasks there might be skipped.
- All array vendors must follow a set of specifications that VMware has written for SRM, but how each vendor implements their replication solution can be unique.
- A disaster recovery in SRM essentially works the same way across all array vendors. That is why, as I mentioned, it is usually the easiest to succeed in running.
- Failover testing implementations can vary greatly across vendors, but the end result is the same.
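To make the hand-offs in the bullets above a little more concrete, here is a minimal Python sketch of the order in which control passes during a plan run. It is only an illustration: the function name and the phase strings are mine, not anything SRM or the SRA exposes.

```python
from typing import List


def run_recovery_plan(protection_site_available: bool) -> List[str]:
    """Return the ordered phases of a plan run (hypothetical labels)."""
    phases: List[str] = []
    # VMware starts on the protection site, unless that site is unreachable.
    if protection_site_available:
        phases.append("vmware: protection-site tasks")
    else:
        phases.append("vmware: protection-site tasks skipped")
    # Control is handed to the array through the SRA.
    phases.append("sra: array tasks")
    # Control returns to VMware, which finishes on the recovery site.
    phases.append("vmware: recovery-site tasks")
    return phases


if __name__ == "__main__":
    for phase in run_recovery_plan(protection_site_available=False):
        print(phase)
```

The point of the sketch is simply that the array work always sits in the middle, with VMware on either side of it.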
Snapshots or no snapshots(?)
It’s that last bullet point that I think confuses customers most. Generally, there are two types of testfailover array implementations: with device copies or without. Now you may be thinking, doesn’t he mean snapshots or no snapshots? Well, kind of, but not really. I’m sure that clears things up. OK, so what do I mean? For the purposes of my explanation, even though I’ll mention other vendors/solutions, let’s use the SRDF model of R1 and R2, since I assume you are familiar with it if you are reading this: R1 is the protection site device and R2 is the recovery site device. Most vendors use snapshots for testfailover, but not all vendors use a secondary device for that copy (think SnapVX linked copy of R2). For instance, solutions like RecoverPoint offer a point-in-time (PiT) solution, where through the normal functioning of the software they take point-in-time copies (let’s say snapshots). When you run a testfailover in SRM, you can select one of these PiTs and RecoverPoint will present it; but it does not use a device copy, it uses the actual remote device (R2) involved in replication. It can rewind and fast forward that device through any available PiT, so it doesn’t need a secondary device. I believe Pure works similarly.
Other arrays, let’s say Unity, create a copy of the remote device (R2) and present that copied device to the recovery vCenter. Replication between the production devices (R1-R2) continues unabated, while the test is conducted on completely separate devices. This has an advantage over the other solutions: you are not working against the production volume, and you are not impacted by a failure of the protection site (which would mean having to back out the PiT changes on the production volume).
Now the other method is no snapshots at all. In that case the vendor breaks replication for the duration of the test and the production recovery device (R2) is mounted and used. After the test is complete, any changes on the recovery device are thrown away and any changes accumulated at the protection site are synced to the recovery device. This is the riskiest implementation because, unlike with RecoverPoint, you cannot back out the changes if the protection site is lost.
SRDF SRA implementation
Finally, we can talk about the SRDF implementation. Our SRDF SRA can actually support all of the above options, in some way. Perhaps that is what also makes it challenging. But I’m not covering all of them. I’m going to concentrate on the 3 ways to run a testfailover. Note that the first 2 methods do not require the R2 to be masked to the recovery site, but the third does.
- Allow the SRA to handle everything without user intervention.
- Manually configure devices and XML files.
- Use no snapshots.
First though, there is a prerequisite that is often forgotten, and I’ve frankly lost count of the number of times I’ve had to fix an internal or external setup because this step was skipped. You need to have a device/composite/consistency group at each site for the R1s and the R2s. Just create a group and add the R1s to it on the protection site, and then do the same at the recovery site for the R2s. The groups don’t have to have the same name (except in complex configs; see the TB). Asynchronous groups should be consistent, as should the SRDF group. After you do this, run a rescan in SRM for your device pairs. The new group name(s) should show up in the pairs screen under consistency group. If they do not, check your work. No device other than the R1s or R2s should be in a group. Now the SRA does have a failsafe mechanism where it will try to create temporary groups for you if you don’t, but it does not always work properly and you’ll spin your wheels endlessly trying to debug. So do yourself a favor, create the groups. OK, on to the methods.
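If you want to sanity-check your groups before you rescan, the rule above (only R1s in the protection site groups, only R2s in the recovery site groups) is easy to express in code. The sketch below is a minimal illustration under my own assumptions: the `Device` class, the role strings, and the idea that you have already exported group membership into a dictionary are all hypothetical, and none of it calls the array or the SRA.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Device:
    device_id: str  # hypothetical identifier, e.g. "001A2"
    role: str       # "R1", "R2", or anything else (e.g. "TGT")


def find_group_problems(groups: Dict[str, List[Device]],
                        expected_role: str) -> List[str]:
    """List every device that does not belong in a site's groups."""
    problems: List[str] = []
    for name, devices in groups.items():
        if not devices:
            problems.append(f"{name}: group is empty")
        for dev in devices:
            # Anything that is not the expected R1/R2 role is a problem.
            if dev.role != expected_role:
                problems.append(
                    f"{name}: {dev.device_id} is {dev.role}, "
                    f"expected {expected_role}"
                )
    return problems


if __name__ == "__main__":
    recovery_groups = {
        "srm_r2_grp": [Device("001A2", "R2"), Device("001B7", "TGT")],
    }
    for line in find_group_problems(recovery_groups, expected_role="R2"):
        print(line)
```

Run it once with the protection site groups and `expected_role="R1"`, and once with the recovery site groups and `expected_role="R2"`. An empty result means the membership looks right.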
Letting the SRA handle everything is the preferred method for running testfailover, but it is only available in the most current release of the SRA, 9.0. That should not be an issue since we are backward compatible with many SRM and array releases. In order to use this method, simply change this global option in the EmcSrdfSraGlobalOptions.xml file:
This tells the SRA I want you to do the following:
- Create target devices for my R2s
- Create a snapshot for each R2
- Link that snapshot(s) to the new target device(s)
- Place the target device in the R2 storage group (if you don’t present the R2s then there are ways around that and yes, read the TB)
Click the testfailover button and you’re good to go. That easy. There is one option that is available to change when using automation and that is whether or not to keep the devices around. On cleanup, you can tell the SRA to delete the target devices, or to keep them around for more testing. By default, it will keep them with these parameters:
If you prefer to delete the devices, just change them to their opposite values. These two parameters must always be opposite, otherwise your test will fail because you’d be telling the SRA to reuse the devices and not to reuse them at the same time. Its head will explode.
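The "must always be opposite" rule can be written down as a tiny check. This is a sketch only: `reuse_targets` and `remove_targets` are hypothetical stand-ins for the two parameters and are not the real option names from the global options file, and plain booleans are my simplification of whatever values the file actually uses.

```python
def cleanup_behaviour(reuse_targets: bool, remove_targets: bool) -> str:
    """Return what cleanup would do, or fail if the pair is contradictory.

    Both arguments are hypothetical placeholders for the two SRA options.
    """
    # The two settings must never agree with each other.
    if reuse_targets == remove_targets:
        raise ValueError(
            "the two cleanup parameters must have opposite values"
        )
    return "keep target devices" if reuse_targets else "delete target devices"


if __name__ == "__main__":
    print(cleanup_behaviour(reuse_targets=True, remove_targets=False))
```

The check rejects both "on/on" and "off/off", because the post only says the two values must differ.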
Manually configuring the devices and XML files is the traditional method of testing, and also the most complicated. You must do the following:
- Create target devices for each R2 (do not place them in any device/composite group)
- On the recovery site, modify the EmcSrdfSraTestFailoverConfig.xml file which is where you show the source R2 device and then the new target device you created. I have examples in the TB but it would look like this:
- Mask the target devices to the recovery site. If you want to mask them during the test, you must use a masking file and update the global options file. Another TB item to review.
- DO NOT create any snapshots or links. The SRA does that.
This method causes the most confusion with customers because the config files are just not fun to update (and get right); however some customers are cache-poor and cannot afford to make new target devices and must use existing ones; or they have designated devices for testing (because they are very large for instance) and want to use those. In those cases this method is best.
Using no snapshots is an option we generally do not recommend. Instead of using a snapshot device, it uses the R2s directly. I’ve already explained why this is a bad idea. So why is it an option, you ask? Well, customers asked for it. Sometimes it is a cache issue on the box, as with the manual method, or it is purely a test environment where they don’t care about the data. If you want to use this method you must modify a global option on the recovery site:
I’ve already described how this works. The SRA will split the SRDF pair and SRM will use the R2 directly for testing. When you run a cleanup, the SRA runs a re-establish and pushes all the data from the R1 to the R2, discarding anything you did on the R2. Of course you are at risk during this entire process. My recommendation, if you are going to use this method, is to take a targetless snapshot of the R2s before running the testfailover. This will provide a copy to restore from if the protection site is lost during the test. It takes up very little cache, so even cache-poor boxes should be able to do it.
Well I hope this 10,000 foot view was useful. All the instructions, subtleties, and advanced options are in the TB so please don’t use this post as your documentation! This is about explaining concepts so that when you open the TB everything makes a little more sense.
A final warning about the methods – be sure you are only using one at a time. If you want to let the SRA create everything, then don’t create an XML file for pairs. If you want to use the XML file, don’t set the option for no snapshots. Keep your global options file clean or your test will surely fail.
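If it helps, that "one at a time" warning boils down to a simple check like the Python sketch below. The three flags are hypothetical summaries of your configuration (whether the automation option is on, whether the pairs XML file has entries, whether the no-snapshots option is on). They are not real SRA option names, and the sketch does not read any SRA files.

```python
from typing import List


def active_methods(auto_create_targets: bool,
                   pairs_file_has_entries: bool,
                   no_snapshots: bool) -> List[str]:
    """Name every testfailover method the (hypothetical) flags switch on."""
    methods: List[str] = []
    if auto_create_targets:
        methods.append("SRA creates everything")
    if pairs_file_has_entries:
        methods.append("manual pairs XML file")
    if no_snapshots:
        methods.append("no snapshots (R2 used directly)")
    return methods


def check_single_method(auto_create_targets: bool,
                        pairs_file_has_entries: bool,
                        no_snapshots: bool) -> str:
    """Fail loudly when more than one method is configured at once."""
    methods = active_methods(auto_create_targets,
                             pairs_file_has_entries,
                             no_snapshots)
    if len(methods) > 1:
        raise ValueError(
            "more than one testfailover method configured: "
            + ", ".join(methods)
        )
    return methods[0] if methods else "none configured"


if __name__ == "__main__":
    print(check_single_method(True, False, False))
```

What the SRA does when none of the three is configured is not something this post covers, so the sketch just reports that case rather than treating it as an error.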