vVols, Oracle RAC, and VMware SRM

Recently I’ve been working on a project covering Oracle RAC running on vVols. We have more customers taking a good look at vVols, particularly now that we offer both HA (embedded VASA containers) and replication, and Oracle comes up frequently as an application they might want to run. We also see SQL Server frequently, but unfortunately, since many of those environments use Windows Failover Cluster, we can’t support it on vVols due to the lack of SCSI-3 persistent reservations (SCSI-3 PR). Fortunately, Oracle requires no such reservations, so it’s quite portable – VMFS, NFS, RDMs, and even vVols. vVols does have its own challenges and considerations in these large application/database environments, which is how I came to do a project on it that will result in a whitepaper.

In the meantime, however, I thought it might be of some interest to show you how to use VMware SRM to run a test failover of an Oracle RAC environment, utilizing the same production network but without stepping all over the production RAC. I did quite a bit of testing before I came up with a solution that worked well. I’m not ashamed to admit I hunted Google for an answer first, but couldn’t find one. Customers likely have options I don’t have with my limited network and may not even need this solution, but in a lab it works well.

Be that as it may, I’ll go over the important bits here, though not in excruciating detail (I’ll leave that for the paper 😉 ). I will therefore assume some knowledge on your part – most importantly how SRM works with vVols. If you need a refresher, there is the whitepaper in the library, or I have a post here. You don’t necessarily need to know Oracle, other than understanding that RAC is a clustered solution, so it uses shared vmdks and many IPs for both external and internal communication.

Note that the VMware storage, in this case vVols, doesn't impact the solution, so you could use it with VMFS or even RDMs, as you'll see.

*********** Update 7-13-2021 *********************
I wanted to make you aware of a bug in SRM 8.4/vSphere 7.x, one I only came across when I needed multiple storage containers/datastores. When you create a storage container and then rescan the VASA Provider, VMware will automatically mount it as a vVol datastore, which is not standard behavior. For most customers this is not an issue, but if you have multiple vCenters and do not want every storage container attached to the same vCenter, you may have to unmount the datastore manually. VMware will fix this in the next release.
**************************************************

Environment

My production and disaster recovery setups are the same for VMware:

  • 4 ESXi hosts and a vCenter, running vSphere 7.0 U2+

On the production side for Oracle:

  • 4 node Oracle 19c RAC on OEL 8.1
    • Each node runs on an ESXi host, courtesy of affinity rules
    • ASM clustered file system
    • 5 TB DB (running SLOB)
  • vVol datastore with replication (ORADB is the VASA replication group)
    • Shared vmdks

Here are ASM and the DB running (including me starting it up) on production. I’ve also included the hosts file, since it is at the heart of how we keep the test failover environment on the same network as production.

Process

So the SRM setup proceeds as normal. You create a protection group for the ORADB replication group – recall that the vVol replication here is SRDF/A, so all the VMs in the group fail over together – and then a recovery plan. By default, SRM creates a test network on which to run the test VMs. The problem is that, as I mentioned, Oracle RAC is very particular about IPs, so if I use a test network I’ll only ever get ASM and the DB up on a single node; there will be no inter-node communication. In an actual failover there would be no issue, since my production environment would be down and I could re-use the same IPs on the DR network. I imagine many customers can stand up their own test network (e.g. VLANs) and thus treat a test failover just like production, but I can’t do that, so I need to change IPs.

IPs

Later versions of Oracle (11gR2+) give us the flexibility to modify the RAC environment without undertaking major changes like re-running the cluster configuration scripts. Such tasks would complicate SRM to the point of rendering it no better than plain scripting, so making the smallest set of changes, preferably from within SRM, is the way to go.

The one thing Oracle will not countenance is changing the hostname. Doing so leads us down the aforementioned path of cluster scripts, so we must keep the hostnames. But we can change the IPs. You’ve seen the hosts file above; I’ve included it because I can’t use DNS in this environment. Customers, having far better infrastructure than I do, could probably handle this with DNS on the DR side, but my solution relies strictly on the hosts file. There are three types of IPs we need to alter. I’ll use node dsib2019 as an example:

  • Public IP of the host – 10.228.246.19
  • VIP of the host – 10.228.246.21
  • SCAN IPs – 10.228.246.23-25

The one IP I don’t need to change is the private IP (192.168.1.150). Since the network it uses is internal to my ESXi hosts in DR, it can’t conflict with the private network on production.
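If you want to see what Grid Infrastructure currently has configured before changing anything, srvctl will show you the network, VIP, and SCAN definitions. A minimal sketch, run with the grid environment sourced (the node name is from my lab; substitute your own):

srvctl config network                 # public network(s) and subnet – network number 1 in my case
srvctl config vip -n dsib2019         # the node VIP
srvctl config scan                    # SCAN name and IPs
srvctl config scan_listener           # SCAN listener configuration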

SRM

So how do we make these changes during the test failover? The public IP is straightforward, because SRM provides IP customization for each VM in the recovery plan. But first I need to change the default network mappings for the recovery plan, otherwise it will use the isolated test network bubble. So I re-assign the test networks to point to the production ones.

The second step is to modify the VMs in the recovery plan so SRM changes the public IP. I have two NICs, public and private, but I only need to customize NIC1, as that carries the public IP.

The other IPs require a post-script. SRM lets you run both pre- and post-power-on scripts; in this example I just need one post script. The script changes both the VIP and the SCAN IPs with srvctl, but I also need to update the hosts file to reflect the new IPs. So I create two files on my production nodes – a new hosts file (hosts.srm) and the script (srm.sh). The hosts file has the new IPs (except for the private ones). Notice none of the host names have changed, since that would break RAC:

10.228.246.137 dsib2019.lss.emc.com dsib2019
10.228.246.138 dsib2020.lss.emc.com dsib2020
10.228.246.139 dsib2026.lss.emc.com dsib2026
10.228.246.140 dsib2036.lss.emc.com dsib2036
10.228.245.214 dsib2021.lss.emc.com dsib2021
10.228.245.215 dsib2022.lss.emc.com dsib2022
10.228.245.216 dsib2028.lss.emc.com dsib2028
10.228.245.217 dsib2037.lss.emc.com dsib2037
192.168.1.150 dsib2019-priv
192.168.1.151 dsib2020-priv
192.168.1.152 dsib2026-priv
192.168.1.153 dsib2036-priv
10.228.246.161 dsib-scan dsib-scan.lss.emc.com
10.228.246.162 dsib-scan dsib-scan.lss.emc.com
10.228.246.144 dsib-scan dsib-scan.lss.emc.com

And the script:

sleep 600                                           # give the cluster (CRS/ASM) ~10 minutes to come up before running srvctl
. /home/oracle/.grid_profile                        # source the Grid Infrastructure environment (the script runs as root)
rm -f /etc/hosts
cp -f /etc/hosts.srm /etc/hosts                     # swap in the hosts file with the test failover IPs
srvctl modify network -k 1 -S 10.228.245.214/255.255.252.0/ens192    # point public network 1 at the new subnet
srvctl stop vip -n dsib2019 -f
srvctl start vip -n dsib2019                        # restart the node VIP so it picks up the new address
srvctl stop scan_listener
srvctl stop scan
srvctl modify scan -n dsib-scan                     # re-resolve the SCAN name against the updated hosts file
srvctl start scan
srvctl start scan_listener

The steps in the script should be clear – first set the environment variables, since the script runs as root; then swap in the new hosts file; then change the VIP, followed by the SCAN IPs. The most important step is the first one, where I sleep for 10 minutes. This allows the cluster to come up; if it isn’t up, the srvctl commands fail. In SRM it is set up as below, and I also pipe the output to a log. Be sure you change the step timeout, otherwise the 5-minute default will cause the script to fail (since it is still sleeping).
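For reference, the post power-on step simply runs a command on the recovered VM through VMware Tools, so it needs the full path to the shell and the script. Mine looks roughly like the line below – the script location and log path are placeholders from my lab, not a requirement – with the step timeout raised well past the 10-minute sleep:

/bin/sh /root/srm.sh >> /tmp/srm.log 2>&1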

And really that’s it. If you are doing this on vVols, just be aware that you should allow some delay between putting the scripts on the production nodes and running the test failover. The copy to DR is asynchronous, and SRM uses an existing snapshot for the test – it does not generate a new one at test time. I typically waited 15 minutes between tests whenever I altered the script.
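Once a test failover completes and the script has run, a quick sanity check on one of the recovered nodes confirms the cluster picked up the new addresses. A rough sketch, assuming the grid environment is sourced and using my node name:

srvctl status vip -n dsib2019         # the VIP should be running on the new address
srvctl config scan                    # the SCAN should now show the IPs from hosts.srm
crsctl stat res -t                    # overall status of the cluster resources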

Walkthrough

I created a video that shows the high-level environment, the changes to the recovery plan, and then the result of an SRM test. It isn’t meant to replace the explanation above, so there aren’t a lot of callouts, but it should give you a good overview. When the paper is done I’ll put a link here.
