SRDF SRA & SRM TestFailover with existing snapshots

Can the SRDF SRA use an older snapshot, clone, etc. rather than taking a fresh copy each time? I occasionally get this question from customers, particularly those who have seen the RecoverPoint functionality. For those who are not familiar with RecoverPoint (RP), it is another of our replication technologies, one that permits both local and remote replication on the VMAX3/AFA, albeit only asynchronously. RP has a neat capability that is most often compared to a DVR: you can look through a list of snapshots of the devices in a consistency group, pick any point in time, and mount that image. This can be done through the RP GUI, but in the context of today’s discussion, the SRA for RecoverPoint can do it too. I set the image I want, and the SRA uses it.

This is the functionality that some VMAX customers would like for the SRDF SRA. In general, the answer is that the SRDF SRA does not have the inherent ability to do this the way RP does; however, there is something we can do that might be good enough for some customers. I’m going to show you a way this can be done manually, but it definitely uses a feature in an unintended way, off-label you might say.

There is a parameter in the global options file (EmcSrdfSraGlobalOptions.xml) named “IgnoreActivatedSnapshots”. It was designed for mixed physical and virtual environments where consistency between the two is critical. The parameter assumes that just before running the testfailover in SRM, the user manually takes a snapshot of all devices (those presented to physical hosts and those presented to virtual) to ensure consistency across them. Then, when the parameter is set and the testfailover is run, the SRA ignores the existing snapshot and skips the creation step, moving on to the next step as if it had taken the snapshot itself. The reason is that if the SRA re-created the snapshot on only the devices presented to the VMware environment, consistency between the physical and virtual devices would be lost. That is the use case that precipitated the parameter. Now I’ll show you how we can use it in a purely virtual environment.
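To make the behavior above concrete, here is a small Python model of the decision the parameter controls. To be clear, this is not the SRA’s code: the function name, the step descriptions, and the choice to fail when no snapshot is already linked are all my own assumptions for illustration.

```python
from typing import List


def plan_test_failover(ignore_activated_snapshots: bool,
                       snapshot_already_linked: bool) -> List[str]:
    """Toy model of the test failover steps (hypothetical, not SRA code)."""
    steps = ["discover R2 devices and their targets"]
    if ignore_activated_snapshots:
        # Assumption: with the parameter set, a user-made copy must already exist.
        if not snapshot_already_linked:
            raise RuntimeError("parameter is set but no existing snapshot is linked")
        steps.append("reuse existing snapshot (creation skipped)")
    else:
        # Normal behavior: the copy is generated at test time.
        steps.append("create new snapshot and link it to the targets")
    steps.append("present targets to the recovery hosts")
    return steps
```

The point of the model is simply that the parameter swaps one step for another; everything before and after the copy step proceeds as usual.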

Before I proceed, here’s the part where you get the warning, blinking lights, annoying alarm. As I’ve just noted, I am going to use the parameter in a manner for which it was not designed. The functionality this parameter permits can only be guaranteed to work properly when the snapshot is taken in close proximity to running the testfailover. This virtually ensures the VMware environment has not changed before the test is run. In stable environments it is possible to use an older snapshot with this parameter, and in my experience that is typically what customers want. BUT, understand that if you choose to use this option with older snapshots, there is no guarantee it will function properly. The best chance of success is in environments that do not change, so that the only thing different is the actual data on the vmdks. OK, you’ve been warned. Onward.

My testing environment is VMAX AFA so I’m using SnapVX as the replication technology, though other TimeFinder technologies can be used. SnapVX is particularly good for this because we don’t need multiple targets. I can take as many snaps as I need and when I choose one, I can use the same set of targets I use in a normal SRM testfailover. I have 4 datastores which are backed by 4 devices in a synchronous SRDF relationship. Though it is not pertinent to the test, I have 4 VMs, one in each datastore, a single protection group and recovery plan.

My example is simple, adhering to the recommendation of using a snapshot that is close in time to the running of the testfailover. I start by creating a device group that contains both my R2s and the targets, which makes things easier when I run symsnapvx. You can of course use a file instead, or even supply device IDs on the command line. [Note that I still have a consistency group for my R2 devices on the array manager for normal testfailover operation, as is required.]

My first step is to take an initial snapshot before I have made any changes to the environment.

Next, I browse one of my SRM datastores and add a directory and a file to it before then taking a second snap.

So my first snap doesn’t have the folder, the second does. This is important for later.

Let’s start with a regular testfailover to demonstrate that the folder I created will be present, since the snapshot is generated in real time. Before running the test I need to make one change in each of two XML files. First, I’ll make a single change to the EmcSrdfSraGlobalOptions.xml file so that the SRA will terminate the snapshot after the testfailover (so I can use the target devices outside of SRM).
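If you would rather script this kind of change than hand-edit the file, here is a minimal Python sketch. It assumes the options are simple child elements directly under the root element and that the value that enables termination is “Yes”; both are assumptions on my part, so check your own file before relying on it. The function name `set_option` is mine.

```python
import xml.etree.ElementTree as ET


def set_option(xml_text: str, name: str, value: str) -> str:
    """Return xml_text with the top-level option <name> set to value.

    Assumes a flat layout: <Root><OptionName>Value</OptionName>...</Root>.
    """
    root = ET.fromstring(xml_text)
    node = root.find(name)
    if node is None:
        # Refuse to invent options that the file does not already contain.
        raise KeyError(f"option {name!r} not found")
    node.text = value
    return ET.tostring(root, encoding="unicode")
```

In use, you would read the file into a string, call `set_option(text, "TerminateCopySessions", "Yes")`, and write the result back after keeping a copy of the original. Note that ElementTree drops XML comments when parsing, so if the comments in your file matter to you, edit by hand instead.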

Second, I need to modify my EmcSrdfSraTestFailoverConfig.xml file to include the device targets (the same ones in my device group above) for my 4 SYNC datastores.
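For anyone generating the device pairings rather than typing them, here is a rough Python sketch that builds a list of source/target pairs as XML. The element names (`DeviceList`, `DevicePair`, `Source`, `Target`) are my assumption of the layout and the device IDs are made up, so compare the output against the file shipped with your SRA before pasting anything in.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple


def device_pairs_xml(pairs: Iterable[Tuple[str, str]]) -> str:
    """Build a <DeviceList> fragment from (source, target) device IDs.

    Element names are assumed, not taken from the SRA documentation.
    """
    device_list = ET.Element("DeviceList")
    seen_targets = set()
    for source, target in pairs:
        # A target can back only one source, and never itself.
        if source == target or target in seen_targets:
            raise ValueError(f"invalid pairing {source} -> {target}")
        seen_targets.add(target)
        pair = ET.SubElement(device_list, "DevicePair")
        ET.SubElement(pair, "Source").text = source
        ET.SubElement(pair, "Target").text = target
    return ET.tostring(device_list, encoding="unicode")


# Hypothetical IDs: four R2 devices and their four targets.
fragment = device_pairs_xml(
    [("00A1", "00B1"), ("00A2", "00B2"), ("00A3", "00B3"), ("00A4", "00B4")]
)
```

The duplicate-target check is there because the same set of targets is reused from test to test, and a copy-and-paste slip in this file is easy to make.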

Now I run the test successfully.

A quick check of my existing snapshots shows I have 3 on my R2s now, the two I took previously and the auto-generated SRA one.

With my newly signatured datastores, I should expect to see the folder I created previously (after the first manual snapshot) because this test was in real time. And indeed, the folder is there.

Time for a quick SRM cleanup, and the snapshot is removed.

Now I’m going to try to use the older “first_snap” snapshot (the one without the folder, of course). First I link the snapshot to my target devices, which are no longer linked since I told the SRA to terminate its snapshot in the last test.

Here you can see the linked snapshot.

With the snap ready, I just need to adjust the EmcSrdfSraGlobalOptions.xml file to set the IgnoreActivatedSnapshots parameter. While I’m here, I also want to change TerminateCopySessions to “No”; otherwise the SRA will remove my snapshot after the test, which I probably don’t want.
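Since forgetting either of these two settings changes what the test does, a quick pre-flight check can help. Here is a minimal Python sketch; it assumes the options are child elements directly under the file’s root and that “Yes” is the value that enables IgnoreActivatedSnapshots, neither of which I am taking from the documentation. The names `EXPECTED` and `check_reuse_settings` are mine.

```python
import xml.etree.ElementTree as ET
from typing import List

# Settings wanted when reusing an existing, already linked snapshot.
# "Yes" for IgnoreActivatedSnapshots is an assumed value.
EXPECTED = {"IgnoreActivatedSnapshots": "Yes", "TerminateCopySessions": "No"}


def check_reuse_settings(xml_text: str) -> List[str]:
    """Return a list of problems; an empty list means both options look right."""
    root = ET.fromstring(xml_text)
    problems = []
    for name, wanted in EXPECTED.items():
        node = root.find(name)
        actual = None if node is None or node.text is None else node.text.strip()
        if actual is None:
            problems.append(f"{name} is missing or empty")
        elif actual.lower() != wanted.lower():
            problems.append(f"{name} is {actual!r}, expected {wanted!r}")
    return problems
```

Run it against the contents of the options file before starting the test; anything it returns is worth a second look before you click the button.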

I do not need to adjust my EmcSrdfSraTestFailoverConfig.xml file since I am using the same target devices. If I were not, however, I would have to change them here. This XML file is required when using this parameter.

Time to execute the testfailover. From a user perspective, the only difference I see when using the parameter is that the test takes just a little bit longer than normal, since the SRA has to do a bunch of checking to make sure I did everything correctly. In SRM 6.5, longer means less than an additional minute.

Now to confirm it used the first_snap snapshot, we can check if the folder is there, which it should not be.

The test is complete and successful.

So what are the caveats and what could go wrong? Well a boatload of both I suspect but this summary statement is probably enough: If your older, snapped, and linked device copies are not sufficiently similar (i.e. almost exact) to the existing devices – most importantly the VMs – you are going to have issues. If you do have issues, you are on your own to resolve unless you are using the feature as designed – taking a snapshot and linking it right before running a testfailover. I think most customers will not use the parameter in this way, but there are a handful who really want to test older data and know their environment does not change. The parameter is covered in the TechBook so if you have other questions, take a look there and good luck testing.
