PowerMax File Replication Failover

Sorry for the delay since the previous PowerMax File Replication post; I had lots of other tasks related to the release to complete, so I only just had a chance to get back to File. I want to finish our discussion by showing both planned and unplanned failover. We’ll pick up where the last post ended, with the synchronous relationship I set up for my NFS datastore VMware_NFS_302, shown in my vCenter here:

I left the environment almost as it was in the last post, but since I’ve rebuilt it a number of times due to code changes, our SRDF group is now 9 on the production site and 1 on the disaster site. That has no bearing on what I am showing you; I just didn’t want you to be confused when you see it in the screenshots. While I have you, I should also explain that when you create this replication relationship, an additional 6 GB device is created for the NAS server root and config volume. This is why you will see two devices in the CLI screenshots: the config device and our 1 TB NFS device.

DR with File

First I want to frame our discussion, since I am going to treat failover of NFS a bit differently than FC or iSCSI. For the quickest of recaps, leaving VMware SRM completely aside since we don’t support File with the SRA, if I had a DR event in a production environment using SRDF-replicated datastores, I would have to (see the sketch after this list):

  1. Run a manual failover of the SRDF devices
  2. Present the R2 devices to my disaster environment in a masking view
  3. Resignature the datastores (yes, not the only option but preferable)
  4. Register all my VMs in the datastore.
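To make those steps concrete, here is a minimal CLI sketch of the block-device flow. The device group srdf_dg, the masking group names, and the datastore and VM paths are all hypothetical; your SIDs, names, and paths will differ:

  # 1. Fail over the SRDF devices (run from the recovery side)
  symrdf -g srdf_dg failover -noprompt

  # 2. Present the R2 devices to the DR hosts in a masking view
  symaccess -sid 598 create view -name dr_mv -sg dr_sg -pg dr_pg -ig dr_ig

  # 3. Resignature the replicated VMFS volume on an ESXi host
  esxcli storage vmfs snapshot resignature -l MY_DATASTORE

  # 4. Register each VM from the resignatured datastore
  vim-cmd solo/registervm /vmfs/volumes/snap-1234-MY_DATASTORE/myvm/myvm.vmx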

Note that even if it was just my array that failed (the most unlikely of scenarios), I’d still have to move to the disaster site, since my R2s are not available to my production site (putting aside unique circumstances). But if just my array failed and not the rest of my infrastructure, like ESXi and the vCenter (just use your imagination), PowerMax File handles failover in a unique way, which is what I will show.

Remote NAS

Replication with File works a bit differently because we have a remote NAS server running on our DR site – shown here as NAS-598-Remote (created in the other blog post) – which can take over the function of the production NAS server. Note the IP address of this remote NAS server is the same as production, 192.168.2.126 (remember both arrays are on the same NAS network).
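If you want to see that from the ESXi side, the standard NFS listing will show the datastore mounted from that shared NAS IP; a quick check looks like this:

  # List NFS mounts on an ESXi host; the Host column for our datastore
  # shows the NAS server IP (192.168.2.126 in my lab)
  esxcli storage nfs list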


Practically, what this means is that when there is a failure, planned or not, the NFS datastore in my vCenter is transitioned to this remote NAS server transparently (well, almost – hang tight). Again, I am presenting this as a likely reality even though we know most DR events will render the production site, including the vCenter, completely dead. But this post is about showing you what can happen, even if pigs have to fly.

Planned failover

Planned failover means we want this to happen, so at least that part is normal. But even in a planned failover, customers are moving to the disaster site; we, on the other hand, are going to steadfastly remain at the production site for this NFS failover. Before beginning, to avoid potential issues, it is best to shut down any VMs currently running on the NFS datastore. Wait, it is best? Well, technically (here’s that transparency thing) you can run the failover and leave the VMs running. We don’t pause IO or cache anything like that, but the failover may complete before the application even hits a timeout. For example, I ran IOMETER during my test, and it kept chugging along like a champ with a minor slowdown. Do I recommend doing this with production data? Um, heck no. But in a lab, well, have fun. Let’s do the failover then.

Start by accessing the Data Protection -> File Protection -> REPLICATION -> REPLICATION SESSIONS screen. Select the Planned Failover button. You can see below everything is good – no Alerts, synchronized, etc.

Just for extras, here is the CLI output of our pairs showing they are synchronized. I have highlighted the “-all” switch because you must use it, as File pairs are normally hidden.
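If you want to run that check yourself, a sketch of the listing command with my lab’s SID and RDF group (I’m assuming the same listing syntax shown in the screenshot; adjust for your environment):

  # List the RDF pairs for the replication group; without -all the
  # File pairs are hidden from the output
  symrdf list -sid 302 -rdfg 9 -all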

When you run a planned failover with File, you’ll have two options, shown here in this pop-up. The key is that checkbox. By default we will not restart SRDF replication – in other words, we will run a “failover”. If you check the box, as I have, we will run a “swap”, which will change the R2 (array 598) to an R1, i.e., reverse replication.
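For the block-minded, those two behaviors map roughly onto the classic SRDF control actions. A sketch with a hypothetical device group srdf_dg – the File workflow drives all of this for you, so this is only for illustration:

  # Checkbox unchecked: plain failover - the R2 becomes read/write
  # and replication stops
  symrdf -g srdf_dg failover -noprompt

  # Checkbox checked: failover plus swap - the R2 becomes the new R1
  # and replication resumes in the reverse direction
  symrdf -g srdf_dg failover -establish -noprompt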


Once the job completes, note how our source and destination systems are reversed, array 598 becoming the R1, array 302 now the R2.


And if I look at my VMware environment, see how the before and after of a planned failover are indistinguishable (yes, I suppose I should have stuck a date command in there, but you can trust me).

If you shut down your VMs, start them back up and proceed as usual.
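If there are a lot of VMs, the restart is easy to script from the ESXi shell; a minimal sketch (the VM ID 42 is hypothetical – take the real IDs from the first command):

  # List registered VMs and their IDs
  vim-cmd vmsvc/getallvms

  # Power on a VM by ID
  vim-cmd vmsvc/power.on 42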

Unplanned failover

Unplanned failover is not wholly different from what we did above, although there is no option to swap replication, and you can only execute it from the destination site. Furthermore, the production site must be down. So first, I reset my environment so we are back where we started. Shut down your VMs in this scenario, since it isn’t a clean failover. Navigation in Unisphere is the same as above for the planned failover, though we are now on the destination array 598. Use the three-dot button to find Unplanned Failover (DR) and execute it.


As I wrote, the dialog here offers no options – you simply run it. It will, however, show you your current replication status. Mine is “Suspended” because my sites cannot communicate. By the way, I had to force that state by mucking with my cluster IPs. We will not allow you to run an unplanned failover unless the source site is unavailable.
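If you want to confirm the pair state from the CLI on the destination array before running the failover, something like this (my lab’s SID and RDF group; the exact state string may differ from what Unisphere shows):

  # Check the RDF pair state from the DR array; with the sites unable
  # to communicate, the pairs will not report Synchronized
  symrdf list -sid 598 -rdfg 1 -all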


After it completes, see below where my SRDF State is “Failed Over”. I also included the CLI, which already lists invalid tracks, meaning I am running IO (IOMETER).

Cleanup

If you need to clean up, there are options to stop replication and delete the session, after which you can re-create it. It all depends on the current state, so I’ll leave that one to you.

Normal File DR

Coming back full circle to the first topic: I know I work for Dell, but the truth is, of all the components that can fail, the array is the last on the list. So a DR event is unlikely to happen the way I’ve illustrated above, where the vCenter and hosts survive but not the array. Chances are the entire production site is lost, but that doesn’t mean an NFS failover can’t be easier than FC/iSCSI. Returning to our list of items for a normal failover, can we skip anything?

  1. Run a manual failover of the SRDF devices
  2. Present the R2 devices to my disaster environment in a masking view
  3. Resignature the datastores (yes, not the only option but preferable)
  4. Register all my VMs in the datastore.

Since it is an NFS mount, there is no resignaturing involved, so no step three; but we can also avoid the second step by presenting our NFS datastore to our DR site from the start. As long as you name it the same and use the same protocol (3 or 4.1), there is no reason it can’t be there ahead of time. Then you only have to register the VMs after failover (a sketch of both steps follows below). Here my two vCenters see the same failed-over NFS:


The NFS mount on the DR site will just remain idle until a failover.
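Pre-staging that mount, and registering VMs afterward, can both be done from the ESXi shell; a minimal sketch, assuming my lab’s NAS IP and a hypothetical export path and VM path:

  # Mount the NFS export on a DR host ahead of time, using the same
  # datastore name as production
  esxcli storage nfs add -H 192.168.2.126 -s /VMware_NFS_302 -v VMware_NFS_302

  # After a failover, register each VM from the datastore
  vim-cmd solo/registervm /vmfs/volumes/VMware_NFS_302/myvm/myvm.vmx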

The Wrap

I don’t suspect File will be heavily used in our customers’ production PowerMax VMware environments, but it is far more robust than our previous file implementation and can be a viable solution for customers who prefer network storage.
