Last May we introduced our latest SRDF SRA, version 9. This release offers a new feature: the ability to run a test failover when the production site is unavailable. I wrote about this in a post at that time, which is available here. Some customers choose to run the test with just the host network down (e.g. SRM, Solutions Enabler), others with both the host and storage networks down. The former should not require any changes to run the test; the latter, however, will sometimes necessitate setting the parameter Test_Failover_Force to True in the EmcSrdfSraGlobalOptions.xml file. Setting the parameter tells the SRA to ignore the consistency errors that are generated when the SRDF link enters a state that is unsupported for testing. I have this documented in the TechBook, which you can find in the documentation library, and it is noted in the SRA Release Notes.
So this is all well and good when you are testing with snapshots created on the R2, the preferred method. But what if you want to test directly against the R2, using the parameter TestFailoverWithoutLocalSnapshots, when the sites are disconnected? The good news is that it is possible, but it does require some manual intervention. Unfortunately the SRA Release Notes do not cover this configuration, so when a customer asked me about it, I did some testing to validate how to go about it. But first, time for the disclaimer.
Disclaimer: We do not recommend using the R2 for testing because, while the test is ongoing, you are at risk if the production site is lost. It is an option because there are customers who need it, whether due to business requirements or because they do not have the available cache on the array to support the snapshot devices. Whatever the reason, before running a test on the R2 it is advisable to take a targetless snapshot of the R2, in case the production site is lost during testing and you want to return the R2 to its original state prior to testing.
Differences between using R2 and snapshots with down link
When the RDF link is down between the sites, existing pairs will enter an unresolved state. If the pairs were synced prior to the break, the state will be Transmit Idle or TransIdle (depending on GUI/CLI interface).
In this state the SRA cannot take a consistent split, nor a consistent copy (SnapVX). Either operation would result in an error, each seen below.
symrdf -f /tmp/test.txt split -sid 357 -rdfg 12 -nop
An RDF 'Split' operation execution is in progress for device file '/tmp/test.txt'. Please wait...
The operation is not allowed because the SRDF links are in the transmit idle state

dsib2017:/tmp # symsnapvx -f test.txt -name test establish -sid 355 -nop
Establish operation execution is in progress for the device file test.txt. Please wait...
The device is not in a valid RDF state for this operation
In a normal test failover, therefore, the test will fail whether I use the R2s or snapshots. If, however, I set the parameter Test_Failover_Force to True when using snapshots, the SRA will re-issue the SnapVX command with the -force parameter and the snapshot will succeed. Note, however, that we cannot guarantee consistency.
dsib2017:/tmp # symsnapvx -f test.txt -name test establish -sid 355 -force -nop
Establish operation execution is in progress for the device file test.txt. Please wait...
Polling for Establish.............................................Started.
Polling for Establish.............................................Done.
Polling for Activate..............................................Not Needed.
Establish operation successfully executed for the device file test.txt
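To make the retry behavior concrete, here is a minimal Python sketch of "try the establish, and re-issue it with -force only when the force option is enabled". It is not the SRA's actual code. The function names, the `Runner` type, and the `fake_transmit_idle_array` stand-in are all hypothetical; only the command-line arguments are taken from the transcripts above. The stand-in replaces the real CLI so the sketch runs without an array.

```python
from typing import Callable, List, Tuple

# Hypothetical runner type: takes an argument list, returns (exit_code, output).
Runner = Callable[[List[str]], Tuple[int, str]]


def build_establish_cmd(device_file: str, name: str, sid: str, force: bool) -> List[str]:
    """Build the symsnapvx establish command in the form shown in this post."""
    cmd = ["symsnapvx", "-f", device_file, "-name", name, "establish", "-sid", sid]
    if force:
        cmd.append("-force")
    cmd.append("-nop")
    return cmd


def establish_snapshot(run: Runner, device_file: str, name: str, sid: str,
                       test_failover_force: bool) -> Tuple[bool, bool]:
    """Return (succeeded, forced). Retry with -force only if the option is on."""
    code, _ = run(build_establish_cmd(device_file, name, sid, force=False))
    if code == 0:
        return True, False
    if not test_failover_force:
        return False, False
    code, _ = run(build_establish_cmd(device_file, name, sid, force=True))
    return code == 0, True


def fake_transmit_idle_array(cmd: List[str]) -> Tuple[int, str]:
    """Stand-in for the array while the pairs are Transmit Idle (no real CLI call)."""
    if "-force" in cmd:
        return 0, "Establish operation successfully executed for the device file test.txt"
    return 1, "The device is not in a valid RDF state for this operation"
```

Calling `establish_snapshot(fake_transmit_idle_array, "test.txt", "test", "355", True)` goes through both attempts and reports that the snapshot was forced, which is the case where consistency cannot be guaranteed.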
So can I do the same thing with the R2 instead of snapshots? Unfortunately, no. We will not allow you to issue a split on a pair in a Transmit Idle state, because a split needs communication with both arrays. So does this mean I can’t use the R2 when the sites are disconnected? Without some manual intervention, you cannot.
If you are willing to manually issue the split before disconnecting the sites, you can run the test. When you issue the split, the pairs will initially enter the split state, and the invalids begin to accrue. The invalids are the tracks that will have to be written from the R1 to the R2 when the pairs are re-established after testing is complete. Once you disconnect the link between the sites, the pairs will move to the Partitioned state. In this state it will be possible for the SRA to force through testing on the R2. After the testing is complete, you need to re-establish the SRDF link before running the cleanup so that it completes successfully. I’ll run through the steps below with some screenshots.
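The combinations discussed so far can be collected into one small lookup. This is only a sketch of my reading of the behavior described in this post, not SRA logic; the function name and state strings are hypothetical labels, and any combination the post does not describe raises an error instead of guessing.

```python
from typing import Tuple


def failover_test_outcome(pair_state: str, use_r2: bool, force: bool) -> Tuple[bool, str]:
    """Return (can_run, reason) for the cases described in this post only."""
    if pair_state == "Transmit Idle":
        if use_r2:
            # A split needs communication with both arrays; force does not help.
            return False, "split is not allowed while the links are Transmit Idle"
        if force:
            return True, "snapshot taken with -force; consistency is not guaranteed"
        return False, "SnapVX establish is rejected in this RDF state"
    if pair_state == "Partitioned" and use_r2 and force:
        # Reached by splitting manually first, then dropping the link.
        return True, "pairs were split before the link dropped; test runs on the R2"
    raise ValueError("combination not covered in this post")
```

For example, `failover_test_outcome("Transmit Idle", use_r2=True, force=True)` still reports failure, which is why the manual split has to happen before the link goes down.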
R2 Disconnected Site Test
I have a simple configuration set up: two devices in an asynchronous relationship, each with a datastore. There is a single protection group and recovery plan:
- Arrays 357 replicating to 355 asynchronously
- R1 7F, 80; R2 6B, 6C in SRDF Group 12
- Datastores: ASYNC1 and ASYNC2, each with a VM (ASYNC13 and ASYNC_12)
- Protection Group: R2-Disconnected-Site_Test
- Recovery Plan: test
Now I adjust my EmcSrdfSraGlobalOptions.xml file to enable both failover to R2 as well as forcing the test failover.
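If you would rather script that change than edit the file by hand, the following Python sketch shows one way to flip both options. Everything about the XML layout here is an assumption: the root element name, the flat one-element-per-option structure, and the False defaults in `SAMPLE` are hypothetical, and the option names and the True value are simply spelled as they appear in this post. Check them against your own EmcSrdfSraGlobalOptions.xml before relying on it.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the options file; the real layout may differ.
SAMPLE = """<EmcSrdfSraGlobalOptions>
  <Test_Failover_Force>False</Test_Failover_Force>
  <TestFailoverWithoutLocalSnapshots>False</TestFailoverWithoutLocalSnapshots>
</EmcSrdfSraGlobalOptions>"""


def enable_options(xml_text: str, names: list, value: str = "True") -> str:
    """Set each named top-level option to `value` and return the new XML text."""
    root = ET.fromstring(xml_text)
    for name in names:
        node = root.find(name)  # direct children of the root only
        if node is None:
            raise KeyError(f"option {name!r} not found in the options file")
        node.text = value
    return ET.tostring(root, encoding="unicode")


updated = enable_options(
    SAMPLE, ["Test_Failover_Force", "TestFailoverWithoutLocalSnapshots"]
)
```

The function fails loudly on a missing option rather than adding a new element, so a misspelled name cannot silently leave the SRA running with its defaults.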
Now for my pairs. Here is their state before I begin.
So the first thing I want to do is issue a split on the pairs. If I fail to do this, and then drop the SRDF link and run the test failover, it will fail. Note that the only reason I am running a split on the pairs is that I am dropping the SRDF link. If I choose to disconnect the production site but leave the SRDF link up, I do not need to run the split prior to the test.
After the split is complete, I can disable my SRDF link. In this case I am just disabling the port.
With the link and network down, my pair state becomes partitioned.
I can now run the test. You’ll notice there are no errors generated here; only the SRA log file will contain that information.
After the recovery plan completes, I can conduct any testing on the environment. When I have finished, I need to bring the RDF port back online, which will return my SRDF pairs to a split state. Once the pairs are split, I run the cleanup operation. During the cleanup, SRDF will re-establish the pairs and resync the R1 to the R2, overwriting any changes made to the R2 during the test.
After syncing, the SRA returns control to SRM and the test completes.
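The walkthrough above can be summarized as a small state table, which makes the required ordering of steps easy to check. This is a sketch only. The event names are hypothetical, and "Replicating" is a placeholder label I am using for the pair state before the test and after cleanup; Split, Partitioned, and Transmit Idle are the states named in this post. Transitions the post does not describe are left out on purpose.

```python
from typing import Dict, List, Tuple

# (current state, event) -> next state, limited to what this post describes.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("Replicating", "link_down"): "Transmit Idle",  # link dropped without a split
    ("Replicating", "manual_split"): "Split",
    ("Split", "link_down"): "Partitioned",
    ("Partitioned", "run_test"): "Partitioned",     # test runs against the R2
    ("Partitioned", "link_up"): "Split",
    ("Split", "cleanup"): "Replicating",            # pairs re-established, R2 overwritten
}


def walk(start: str, events: List[str]) -> List[str]:
    """Apply events in order and return every state visited, including the start."""
    state = start
    history = [state]
    for event in events:
        key = (state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"{event!r} is not described for state {state!r}")
        state = TRANSITIONS[key]
        history.append(state)
    return history


steps = ["manual_split", "link_down", "run_test", "link_up", "cleanup"]
history = walk("Replicating", steps)
```

Dropping the link first, as in `walk("Replicating", ["link_down", "manual_split"])`, raises an error at the split, mirroring the Transmit Idle failure shown earlier.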
My final comment about this post is that many might say it seems unnecessary to drop the SRDF link if I have to manually split the pairs beforehand. In other words, I’ve already removed SRDF from the equation by splitting the pairs, so why drop the link at that point? A fair question. The post is based on a customer requirement, nothing more, nothing less. My personal recommendation would be either 1) disconnect the sites and run the test failover with snapshots (as recommended), or 2) disconnect the sites except for the SRDF link and then run the test failover with the R2.